TD3

Twin Delayed DDPG (TD3) is a successor to the DDPG algorithm that adds three tricks: Clipped Double-Q Learning, Delayed Policy Updates, and Target Policy Smoothing. Together, these mitigate the overestimation of Q-values and smooth the Q-function with respect to changes in the action, yielding improved performance over baseline DDPG. This implementation provides TD3 and supports the following extensions:

  • Experience replay: ✔️

  • Target network: ✔️

  • Gradient clipping: ✔️

  • Reward clipping: ❌

  • Prioritized Experience Replay (PER): ✔️
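
As a rough sketch of the three tricks above (not the exact ElegantRL code; actor_target, critic_target, and the noise constants are illustrative placeholders), the TD3 critic target can be computed as follows, with the actor and target networks refreshed only every few critic updates (Delayed Policy Updates):

import torch

def td3_q_target(critic_target, actor_target, reward, undone, next_state,
                 gamma=0.99, policy_noise=0.2, noise_clip=0.5):
    with torch.no_grad():
        # Target Policy Smoothing: perturb the target action with clipped noise.
        next_action = actor_target(next_state)
        noise = (torch.randn_like(next_action) * policy_noise).clamp(-noise_clip, noise_clip)
        next_action = (next_action + noise).clamp(-1.0, 1.0)

        # Clipped Double-Q Learning: bootstrap from the smaller of the two target Q-values.
        q1, q2 = critic_target(next_state, next_action)
        return reward + undone * gamma * torch.min(q1, q2)

# Delayed Policy Updates: the actor and the target networks are updated only
# every `update_freq` critic updates, e.g. `if step % update_freq == 0: ...`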

Note

For clipped Double-Q learning, we use two Q-networks with shared parameters inside a single class, CriticTwin. This implementation reduces computational cost and training time.
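
A minimal sketch of this shared-parameter design (the real CriticTwin in elegantrl.agents.net may differ in its details): a shared state-action encoder feeds two Q-value heads, so a single forward pass returns both estimates.

import torch
import torch.nn as nn

class TwinQSketch(nn.Module):
    """Two Q heads on a shared encoder (illustrative, not the library class)."""
    def __init__(self, state_dim, action_dim, mid_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(state_dim + action_dim, mid_dim), nn.ReLU(),
            nn.Linear(mid_dim, mid_dim), nn.ReLU(),
        )
        self.q1_head = nn.Linear(mid_dim, 1)
        self.q2_head = nn.Linear(mid_dim, 1)

    def forward(self, state, action):
        tmp = self.encoder(torch.cat((state, action), dim=1))
        return self.q1_head(tmp), self.q2_head(tmp)  # both Q estimates in one pass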

Warning

The TD3 implementation contains a number of highly sensitive hyper-parameters, which must be tuned carefully to obtain satisfactory results.
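
For example, the reward scale, network width, and learning rate can be adjusted on the Arguments object (args in the code snippet below) before calling train_and_evaluate. The attribute names below appear elsewhere on this page (reward_scale and net_dim in the snippet, learning_rate in the parameter list); good values are task-dependent, and other fields such as the discount factor and batch size should be checked against your ElegantRL version.

args.reward_scale = 2 ** -2    # rescale rewards so Q-values stay in a reasonable range
args.net_dim = 2 ** 8          # width of the actor/critic networks
args.learning_rate = 2 ** -14  # optimizer learning rate (TD3 is sensitive to this)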

Code Snippet

import torch
from elegantrl.run import train_and_evaluate
from elegantrl.config import Arguments
from elegantrl.train.config import build_env
from elegantrl.agents.AgentTD3 import AgentTD3

# train and save
args = Arguments(env=build_env('Pendulum-v0'), agent=AgentTD3())
args.cwd = 'demo_Pendulum_TD3'
args.env.target_return = -200
args.reward_scale = 2 ** -2
train_and_evaluate(args)

# test
agent = AgentTD3()
agent.init(args.net_dim, args.state_dim, args.action_dim)
agent.save_or_load_agent(cwd=args.cwd, if_save=False)  # if_save=False loads the trained agent from cwd

env = build_env('Pendulum-v0')
state = env.reset()
episode_reward = 0
for i in range(2 ** 10):
    action = agent.select_action(state)
    next_state, reward, done, _ = env.step(action)

    episode_reward += reward
    if done:
        print(f'Step {i:>6}, Episode return {episode_reward:8.3f}')
        break
    else:
        state = next_state
    env.render()

Parameters

class elegantrl.agents.AgentTD3.AgentTD3(net_dim: int, state_dim: int, action_dim: int, gpu_id: int = 0, args: Optional[Arguments] = None)[source]

Bases: AgentBase

Twin Delayed DDPG algorithm. “Addressing Function Approximation Error in Actor-Critic Methods”. Scott Fujimoto et al. 2018.

Parameters
  • net_dim[int] – the dimension of the networks (the width of the neural networks)

  • state_dim[int] – the dimension of the state (the length of the state vector)

  • action_dim[int] – the dimension of the action (the length of the continuous action vector)

  • learning_rate[float] – learning rate of the optimizer

  • if_per_or_gae[bool] – use PER (Prioritized Experience Replay) for off-policy training or GAE (Generalized Advantage Estimation) for on-policy training, which helps with sparse rewards

  • env_num[int] – the number of environments in a VectorEnv; env_num == 1 means VectorEnv is not used

  • agent_id[int] – if visible_gpu is ‘1,9,3,4’, agent_id=1 means (1,9,3,4)[agent_id] == 9
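
As an illustration of the constructor signature above (the dimensions are placeholders for a Pendulum-like task; note that some ElegantRL versions instead construct the agent with no arguments and call agent.init(), as in the snippet above):

agent = AgentTD3(net_dim=2 ** 8, state_dim=3, action_dim=1, gpu_id=0)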

get_obj_critic_per(buffer: ReplayBuffer, batch_size: int)[source]

Calculate the loss of the network with Prioritized Experience Replay (PER).

Parameters
  • buffer – the ReplayBuffer instance that stores the trajectories.

  • batch_size – the size of batch data for Stochastic Gradient Descent (SGD).

Returns

the loss of the network and states.
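
A sketch of what a PER critic update of this kind typically computes (illustrative, not the actual method body; tensor names are placeholders): the sampled batch carries importance-sampling weights, the squared TD errors of both Q heads are weighted by them, and the absolute TD errors are returned so the buffer can refresh its priorities.

def per_critic_loss_sketch(critic, q_target, state, action, is_weights):
    q1, q2 = critic(state, action)
    td_error = (q_target - q1).abs() + (q_target - q2).abs()  # used as new priorities
    obj_critic = (is_weights * ((q1 - q_target) ** 2 + (q2 - q_target) ** 2)).mean()
    return obj_critic, td_error.detach()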

get_obj_critic_raw(buffer: ReplayBuffer, batch_size: int)[source]

Calculate the loss of networks with uniform sampling.

Parameters
  • buffer – the ReplayBuffer instance that stores the trajectories.

  • batch_size – the size of batch data for Stochastic Gradient Descent (SGD).

Returns

the loss of the network and states.
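
The uniform-sampling counterpart drops the importance weights and reduces to a plain twin-critic regression loss (again an illustrative sketch, not the actual method body):

import torch.nn.functional as F

def raw_critic_loss_sketch(critic, q_target, state, action):
    q1, q2 = critic(state, action)
    return F.mse_loss(q1, q_target) + F.mse_loss(q2, q_target)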

Networks

class elegantrl.agents.net.Actor(*args: Any, **kwargs: Any)[source]
class elegantrl.agents.net.CriticTwin(*args: Any, **kwargs: Any)[source]
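
For orientation, a generic shape of these two networks (the actual definitions in elegantrl.agents.net may differ): the Actor is a deterministic policy with a tanh-bounded output, and the CriticTwin follows the shared-encoder, two-head structure sketched in the Note above.

import torch.nn as nn

class ActorSketch(nn.Module):
    """Deterministic TD3 policy with tanh-bounded actions (illustrative)."""
    def __init__(self, state_dim, action_dim, mid_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, mid_dim), nn.ReLU(),
            nn.Linear(mid_dim, mid_dim), nn.ReLU(),
            nn.Linear(mid_dim, action_dim), nn.Tanh(),  # action in [-1, 1]
        )

    def forward(self, state):
        return self.net(state)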