Reinforcement Learning Practice


Category : project


🚀 "Implementation of Basic RL Algorithms" 🌟

Algorithm Test

  • Basic RL algorithms were implemented and tested in the OpenAI Gym environment.
  • The implemented algorithms are as follows:
    • DQN, DDQN, Dueling DQN, DDQN + Dueling, PPO, A2C, A3C
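All of the value-based methods above build on the same temporal-difference target. As a minimal sketch of that shared core, the tabular Q-learning loop below runs on a toy 5-state chain (a stand-in for a Gym environment; the environment and all constants here are illustrative, not from the project). DQN replaces the table with a neural network, adds a replay buffer and a target network, and DDQN/Dueling modify the target and the network head.

```python
import numpy as np

# Tabular sketch of the Q-learning update that DQN and its variants
# extend. The 5-state chain here is a toy stand-in for a Gym task:
# action 1 moves right, action 0 moves left, reward 1 at the far end.
rng = np.random.default_rng(0)
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.9, 0.1  # illustrative hyperparameters

def step(s, a):
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s2 == n_states - 1 else 0.0
    return s2, r, s2 == n_states - 1

for _ in range(500):
    s, done = 0, False
    while not done:
        # epsilon-greedy action selection
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[s]))
        s2, r, done = step(s, a)
        # TD target: r + gamma * max_a' Q(s', a'), zero at terminal states
        target = r + (0.0 if done else gamma * Q[s2].max())
        Q[s, a] += alpha * (target - Q[s, a])
        s = s2
```

DDQN changes only the target line (the online network picks the argmax action, the target network evaluates it), which is why the algorithms above could share most of their code.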

UAV with DDPG

$\textbf{Motivation}$

  • A paper presented an experiment in which a UAV uses the received signal strength as a reward and applies reinforcement learning (the DDPG algorithm) to approach the user.
  • This project was carried out to reimplement that paper directly.

$\textbf{Environment}$

  • The UAV must reach the user.
  • The starting position of the UAV is fixed at ($20, 20, 200$).
  • K users (default: $5$) are generated at random positions [rd.randint(150, 180), rd.randint(150, 180), 0], i.e., with x and y drawn uniformly from $[150, 180]$ and z fixed at ground level.
  • The DDPG algorithm is used, and the reward is based on the received signal strength between the user and the UAV.
  • The network is implemented using a simple DNN.
  • The experiment ends after $60$ steps.
Image 1
Caption 1: Before training
Image 2
Caption 2: After training
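Since the paper's exact channel model is not reproduced above, the sketch below shows one plausible way the signal-strength reward could be computed, assuming a free-space path-loss model. The carrier frequency, transmit power, and function names are all assumptions for illustration, not values from the paper or the project code.

```python
import math

# Hedged sketch of a received-signal-strength reward for the UAV task,
# assuming free-space path loss. All constants are illustrative.
FREQ_HZ = 2.4e9       # assumed carrier frequency
C = 3e8               # speed of light (m/s)
TX_POWER_DBM = 20.0   # assumed user transmit power

def received_power_dbm(uav_pos, user_pos):
    d = math.dist(uav_pos, user_pos)  # 3-D distance in meters
    # Free-space path loss: 20*log10(d) + 20*log10(f) + 20*log10(4*pi/c)
    fspl_db = (20 * math.log10(d) + 20 * math.log10(FREQ_HZ)
               + 20 * math.log10(4 * math.pi / C))
    return TX_POWER_DBM - fspl_db

def reward(uav_pos, users):
    # Average received power over the K users: flying closer to the
    # users raises the reward that the DDPG agent maximizes.
    return sum(received_power_dbm(uav_pos, u) for u in users) / len(users)
```

Under this model the reward increases monotonically as the UAV approaches the users, which is the property the DDPG agent needs to learn the approach behavior.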

Blackjack with TD vs MC

  • TD and MC prediction/control were implemented.
  • The results vary depending on whether a usable ace is present.
    • Areas marked with a star represent “hit,” while the unmarked areas represent “stand.”
Image 1
Caption 1: Ace counted as 1
Image 2
Caption 2: Ace counted as 11
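The core difference between the two prediction methods compared here can be shown on a single recorded episode. The sketch below is a simplified illustration, not the project's Blackjack code: MC waits for the episode to end and updates toward the actual return, while TD(0) updates after every step by bootstrapping from the current value estimate. The episode data is hypothetical.

```python
# Contrast of MC and TD(0) prediction updates on one recorded episode.
gamma = 1.0  # undiscounted, as is standard for Blackjack
alpha = 0.1

def mc_update(V, episode):
    # Monte Carlo: compute the actual return G backwards from the end
    # of the episode, then move each visited state's value toward it.
    G = 0.0
    for state, reward in reversed(episode):
        G = reward + gamma * G
        V[state] = V.get(state, 0.0) + alpha * (G - V.get(state, 0.0))
    return V

def td0_update(V, episode):
    # TD(0): update toward r + gamma * V(next state), bootstrapping
    # from the current estimate instead of waiting for the return.
    for i, (state, reward) in enumerate(episode):
        v_next = V.get(episode[i + 1][0], 0.0) if i + 1 < len(episode) else 0.0
        V[state] = V.get(state, 0.0) + alpha * (reward + gamma * v_next - V.get(state, 0.0))
    return V

# Hypothetical episode: (player-sum state, reward), winning at the end.
episode = [(13, 0.0), (18, 0.0), (20, 1.0)]
```

On a fresh value table, MC immediately credits every state in a winning episode, whereas TD(0) credits only the last state on the first pass; earlier states pick up value over repeated episodes as estimates propagate backwards.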

Unity with PPO

  • Unity provides ML-Agents, a toolkit for applying various RL algorithms to environments built with Unity.
  • A car-racing environment was created, and the PPO algorithm was applied to test whether the agent could complete the race.
  • The configuration is as follows:
behaviors:
  ArcadeDriver:
    trainer_type: ppo
    hyperparameters:
      batch_size: 1024
      buffer_size: 204800
      learning_rate: 0.0003
      beta: 0.005
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 7
      learning_rate_schedule: linear
    network_settings:
      normalize: true
      hidden_units: 256
      num_layers: 2
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        gamma: 0.995
        strength: 1.0
    keep_checkpoints: 200
    max_steps: 100000000
    time_horizon: 32
    summary_freq: 10000
    checkpoint_interval: 50000

Dodge the poop game

$\textbf{Environment}$

  • Action: Move left or right

  • Termination: The episode ends when the agent is hit by a ball

  • Reward: $-0.01$ over time, $+1$ for small ball, $+10$ for large ball

  • Base network: MLP

  • Algorithms tested: DQN, A2C, DDPG, A3C, PPO, PPO + RNN, PPO + LSTM, DDQN

  • Main result: The task was easy, so the simple MLP-based DQN and DDQN achieved good performance.
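The reward structure above can be sketched as a small step function. This is an illustration of the rules as listed, not the project's actual environment code; the write-up does not say whether the ball bonus is for dodging or catching, so the sketch just takes the two bonus events as boolean flags with illustrative names.

```python
# Hedged sketch of the per-step reward logic described above:
# -0.01 per time step, +1 for a small-ball event, +10 for a large-ball
# event, and termination when the agent is hit. Names are illustrative.
def step_reward(hit, small_ball_bonus, large_ball_bonus):
    reward = -0.01  # small time penalty every step
    if hit:
        return reward, True  # episode terminates on a hit
    if small_ball_bonus:
        reward += 1.0
    if large_ball_bonus:
        reward += 10.0
    return reward, False
```

The constant time penalty pushes the agent to end episodes profitably rather than idle, while the 10:1 bonus ratio makes large balls the dominant source of score, which is consistent with the large best-score gaps in the table below.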

| Method | # of layers | Accelerated epi | Best score | Episode of best score | Avg score ($\times 10^2$) |
| --- | --- | --- | --- | --- | --- |
| DQN | 4 | 400 | 1578 | 178 | 150 |
| DQN with CNN | 4 | x | x | x | x |
| DDQN | 4 | 400 | 1729 | 299 | 160 |
| PPO | 5 | 600 | 173 | 200 | 11 |
| PPO with RNN | 5 | x | 45 | 100 | 11 |
| PPO with LSTM | 5 | 5000 | 140 | 205 | 24 |