Reinforcement Learning Practice
Implementations of basic RL algorithms
Posted by: Junseo Park
Category: project

🚀 "Implementation of Basic RL Algorithms" 🌟
Algorithm Test
- Basic RL algorithms were implemented and tested in the OpenAI Gym environment; a minimal sketch of the training loop is shown after this list.
- The implemented algorithms are as follows:
DQN, DDQN, Dueling DQN, DDQN + Dueling, PPO, A2C, A3C
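
As a concrete reference point, here is a minimal DQN training loop of the kind used for these tests. It is a sketch, not the project's exact code: the CartPole-v1 task, the `QNet` class, the network width, and every hyperparameter below are assumptions, and the classic (pre-0.26) Gym API is assumed.

```python
import random
from collections import deque

import gym
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

class QNet(nn.Module):
    """Small MLP Q-network (the size is an assumption)."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, x):
        return self.net(x)

# Classic (pre-0.26) Gym API assumed: reset() -> obs, step() -> 4-tuple.
env = gym.make("CartPole-v1")
q = QNet(env.observation_space.shape[0], env.action_space.n)
target = QNet(env.observation_space.shape[0], env.action_space.n)
target.load_state_dict(q.state_dict())
opt = torch.optim.Adam(q.parameters(), lr=1e-3)
buf = deque(maxlen=50_000)
gamma, eps = 0.99, 0.1

for episode in range(200):
    s, done = env.reset(), False
    while not done:
        # Epsilon-greedy action selection.
        if random.random() < eps:
            a = env.action_space.sample()
        else:
            with torch.no_grad():
                a = q(torch.as_tensor(s, dtype=torch.float32)).argmax().item()
        s2, r, done, _ = env.step(a)
        buf.append((s, a, r, s2, float(done)))
        s = s2
        if len(buf) >= 1_000:
            # Sample a minibatch of transitions from the replay buffer.
            S, A, R, S2, D = [np.asarray(x) for x in zip(*random.sample(buf, 64))]
            S = torch.as_tensor(S, dtype=torch.float32)
            q_sa = q(S).gather(1, torch.as_tensor(A).long().unsqueeze(1)).squeeze(1)
            with torch.no_grad():
                S2 = torch.as_tensor(S2, dtype=torch.float32)
                y = (torch.as_tensor(R, dtype=torch.float32)
                     + gamma * (1 - torch.as_tensor(D, dtype=torch.float32))
                     * target(S2).max(1).values)
            loss = F.mse_loss(q_sa, y)          # TD error on the sampled batch
            opt.zero_grad(); loss.backward(); opt.step()
    if episode % 10 == 0:                       # periodic target-network sync
        target.load_state_dict(q.state_dict())
```

DDQN differs from this loop only in the target computation (the online network selects the next action, the target network evaluates it), and Dueling DQN only in the network head, so the same skeleton covers those variants.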
UAV with DDPG
$\textbf{Motivation}$
- A paper presented an experiment in which a UAV uses the received signal strength as its reward and applies reinforcement learning (the DDPG algorithm) to approach a user.
- This project was carried out to reimplement that experiment directly.
$\textbf{Environment}$
- The UAV must reach the user.
- The starting position of the UAV is fixed at $(20, 20, 200)$.
- K users (default: $5$) are generated at random positions given by `[rd.randint(150, 180), rd.randint(150, 180), 0]`.
- The DDPG algorithm is used, and the reward is based on the received signal strength between the user and the UAV (one possible reward computation is sketched after this list).
- The network is implemented using a simple DNN.
- The experiment ends after $60$ steps.
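
The paper's exact channel model is not reproduced here, but a free-space path-loss model gives a reasonable stand-in for the signal-strength reward. Everything in this sketch beyond the positions quoted above (`tx_power_dbm`, the carrier frequency, and averaging over users) is an assumption.

```python
import numpy as np
import random as rd

def received_power_dbm(uav_pos, user_pos, tx_power_dbm=20.0, freq_hz=2.4e9):
    """Received power under free-space path loss (assumed channel model)."""
    d = np.linalg.norm(np.asarray(uav_pos) - np.asarray(user_pos))
    c = 3e8  # speed of light, m/s
    fspl_db = (20 * np.log10(d) + 20 * np.log10(freq_hz)
               + 20 * np.log10(4 * np.pi / c))
    return tx_power_dbm - fspl_db

# Setup mirroring the description above: fixed UAV start, K random users.
uav = np.array([20.0, 20.0, 200.0])
users = [np.array([rd.randint(150, 180), rd.randint(150, 180), 0.0])
         for _ in range(5)]

# Per-step reward: mean received power over all users (an assumption;
# the paper may use, e.g., the minimum or a single target user instead).
reward = np.mean([received_power_dbm(uav, u) for u in users])
```

Because received power increases monotonically as the UAV closes the distance, this reward steers the agent toward the users without ever giving it their coordinates directly.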


Blackjack with TD vs. MC
- TD and MC methods were implemented for both prediction and control (a minimal MC control sketch follows this list).
- The results vary depending on whether the player holds a usable ace.
- Areas marked with a star represent "hit," while the unmarked areas represent "stand."
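
For reference, a minimal every-visit MC control loop on Gym's Blackjack-v1 looks roughly like this. The epsilon value, the episode count, and the every-visit (rather than first-visit) choice are illustrative assumptions, and the classic (pre-0.26) Gym step API is assumed.

```python
import random
from collections import defaultdict

import gym

env = gym.make("Blackjack-v1")
Q = defaultdict(lambda: [0.0, 0.0])   # state -> [value(stand), value(hit)]
N = defaultdict(lambda: [0, 0])       # visit counts for incremental averaging
eps = 0.1

for _ in range(500_000):
    trajectory, s, done = [], env.reset(), False
    while not done:
        # Epsilon-greedy over the current Q estimates (0: stand, 1: hit).
        a = random.randrange(2) if random.random() < eps else int(Q[s][1] > Q[s][0])
        s2, r, done, _ = env.step(a)
        trajectory.append((s, a, r))
        s = s2
    G = 0.0
    for s, a, r in reversed(trajectory):     # gamma = 1: undiscounted returns
        G += r
        N[s][a] += 1
        Q[s][a] += (G - Q[s][a]) / N[s][a]   # incremental MC update
```

The TD variant replaces the end-of-episode return `G` with a bootstrapped one-step target updated inside the episode loop; the state already encodes the usable-ace flag, which is what splits the resulting policy plots in two.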


Unity with PPO
- Unity's ML-Agents toolkit allows various RL algorithms to be applied to environments created with Unity.
- A car racing environment was created, and the PPO algorithm was applied to test whether it could complete the race.
- The configuration is as follows:
```yaml
behaviors:
  ArcadeDriver:
    trainer_type: ppo
    hyperparameters:
      batch_size: 1024
      buffer_size: 204800
      learning_rate: 0.0003
      beta: 0.005
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 7
      learning_rate_schedule: linear
    network_settings:
      normalize: true
      hidden_units: 256
      num_layers: 2
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        gamma: 0.995
        strength: 1.0
    keep_checkpoints: 200
    max_steps: 100000000
    time_horizon: 32
    summary_freq: 10000
    checkpoint_interval: 50000
```
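With this file in place, training is launched by pointing the ML-Agents trainer at it, e.g. `mlagents-learn config.yaml --run-id=ArcadeDriver`; the file name and run ID here are placeholders.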
Dodge the poop game
$\textbf{Environment}$
- Action: Move left or right
- Termination: The episode ends when the agent is hit by a ball
- Reward: $-0.01$ per time step, $+1$ for a small ball, $+10$ for a large ball
- Base network: MLP
- Algorithms tested: DQN, A2C, DDPG, A3C, PPO, PPO + RNN, PPO + LSTM, DDQN
- Main result: The task was easy, so the MLP-based DQN and DDQN performed well (a rough environment skeleton is sketched after the results table).
| Method | # of layers | Accelerated episode | Best score | Episode of best score | Avg. score ($\times 10^2$) |
|---|---|---|---|---|---|
| DQN | 4 | 400 | 1578 | 178 | 150 |
| DQN with CNN | 4 | x | x | x | x |
| DDQN | 4 | 400 | 1729 | 299 | 160 |
| PPO | 5 | 600 | 173 | 200 | 11 |
| PPO with RNN | 5 | x | 45 | 100 | 11 |
| PPO with LSTM | 5 | 5000 | 140 | 205 | 24 |
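
To make the setup above concrete, here is a rough `gym.Env` skeleton consistent with the action, termination, and reward description. The field size, ball dynamics, observation layout, and the reading that the ball rewards are granted for successful dodges are all assumptions.

```python
import numpy as np
import gym
from gym import spaces

class DodgeEnv(gym.Env):
    """Hypothetical skeleton of the dodge game; dynamics are assumptions."""
    def __init__(self, width=10.0):
        super().__init__()
        self.width = width
        self.action_space = spaces.Discrete(2)  # 0: move left, 1: move right
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(4,))
        self.reset()

    def reset(self):
        self.player_x = self.width / 2
        self._spawn_ball()
        return self._obs()

    def _spawn_ball(self):
        self.ball = np.array([np.random.uniform(0, self.width), 10.0])
        self.big = np.random.rand() < 0.2       # large balls assumed rarer

    def _obs(self):
        # Observation: player x, ball x, ball height, large-ball flag.
        return np.array([self.player_x, *self.ball, float(self.big)],
                        dtype=np.float32)

    def step(self, action):
        self.player_x = np.clip(self.player_x + (1 if action else -1) * 0.5,
                                0, self.width)
        self.ball[1] -= 1.0                      # ball falls one unit per step
        done, reward = False, -0.01              # small per-step time penalty
        if self.ball[1] <= 0:
            if abs(self.ball[0] - self.player_x) < 0.5:
                done = True                      # hit by the ball: terminate
            else:
                reward += 10.0 if self.big else 1.0  # dodged: size-based bonus
                self._spawn_ball()
        return self._obs(), reward, done, {}
```

With the full state in a 4-dimensional observation like this, the task is nearly Markovian, which is consistent with the table above: the plain MLP value-based methods dominate, and the recurrent PPO variants gain little.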