DQN

Deep Q-Network (DQN) is an off-policy, value-based algorithm for discrete action spaces. It uses a deep neural network to approximate a Q-function defined on state-action pairs. This implementation starts from vanilla Deep Q-Learning and supports the following extensions:

  • Experience replay: ✔️

  • Target network (soft update): ✔️

  • Gradient clipping: ✔️

  • Reward clipping: ❌

  • Prioritized Experience Replay (PER): ✔️

  • Dueling network architecture: ✔️

Note

This implementation does not support reward clipping because we introduce the hyper-parameter reward_scale for reward scaling as an alternative. We believe that clipping discards information, since the clipped reward cannot be mapped back to the original reward, whereas reward scaling is invertible and can map the reward back and forth.
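For illustration, a minimal sketch of the difference (the reward value and reward_scale below are just example numbers; where exactly ElegantRL applies the scaling may differ):

import numpy as np

reward = 37.5           # raw reward from the environment
reward_scale = 2 ** -4  # example value for the reward_scale hyper-parameter

# Reward clipping discards information: clip(37.5) == clip(1.0) == 1.0,
# so the original reward cannot be recovered afterwards.
clipped = np.clip(reward, -1.0, 1.0)

# Reward scaling is an invertible linear map: the original reward is
# recovered by dividing by reward_scale.
scaled = reward * reward_scale
assert scaled / reward_scale == reward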

Warning

PER leads to faster learning and is also critical for environments with sparse rewards. However, a small replay buffer may hurt the performance of PER.
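To see why the buffer size matters, here is a minimal sketch of proportional prioritized sampling (the names are illustrative and do not reflect ElegantRL's ReplayBuffer API): with a small buffer, a few high-priority transitions dominate the sampling distribution and get replayed over and over.

import numpy as np

def sample_indices(priorities: np.ndarray, batch_size: int, alpha: float = 0.6) -> np.ndarray:
    """Sample transition indices with probability proportional to priority ** alpha."""
    probs = priorities ** alpha
    probs /= probs.sum()
    return np.random.choice(len(priorities), size=batch_size, p=probs)

# With only four stored transitions, the single high-priority one
# dominates almost every sampled batch.
small_buffer_priorities = np.array([5.0, 0.1, 0.1, 0.1])
print(sample_indices(small_buffer_priorities, batch_size=8))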

Code Snippet

import torch
from elegantrl.run import train_and_evaluate
from elegantrl.config import Arguments
from elegantrl.train.config import build_env
from elegantrl.agents.AgentDQN import AgentDQN

# train and save
args = Arguments(env=build_env('CartPole-v0'), agent=AgentDQN())
args.cwd = 'demo_CartPole_DQN'    # working directory for saving the model
args.target_return = 195          # stop training once this average return is reached
args.agent.if_use_dueling = True  # enable the dueling network architecture
train_and_evaluate(args)

# test
agent = AgentDQN()
agent.init(args.net_dim, args.state_dim, args.action_dim)
agent.save_or_load_agent(cwd=args.cwd, if_save=False)  # if_save=False loads the trained model

env = build_env('CartPole-v0')
state = env.reset()
episode_reward = 0
for i in range(2 ** 10):  # roll out one episode for at most 2 ** 10 steps
    action = agent.select_action(state)
    next_state, reward, done, _ = env.step(action)

    episode_reward += reward
    if done:
        print(f'Step {i:>6}, Episode return {episode_reward:8.3f}')
        break
    else:
        state = next_state
    env.render()

Parameters

class elegantrl.agents.AgentDQN.AgentDQN(net_dims: [int], state_dim: int, action_dim: int, gpu_id: int = 0, args: Config = Config())[source]

Deep Q-Network algorithm. “Human-Level Control Through Deep Reinforcement Learning”. Mnih et al., 2015.

Parameters:
  • net_dims – the middle layer dimensions of the MLP (MultiLayer Perceptron).

  • state_dim – the dimension of the state (the length of the state vector).

  • action_dim – the dimension of the action (the number of discrete actions).

  • gpu_id – the gpu_id of the training device. The CPU is used when CUDA is not available.

  • args – the arguments for agent training. args = Config()
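For example, constructing the agent directly from the signature above could look like the following sketch (the hidden-layer sizes and the CartPole dimensions are illustrative, and it assumes Config() can be instantiated with its defaults):

from elegantrl.train.config import Config
from elegantrl.agents.AgentDQN import AgentDQN

args = Config()                      # default training arguments
agent = AgentDQN(net_dims=[64, 64],  # two hidden layers of width 64
                 state_dim=4,        # e.g. CartPole observation size
                 action_dim=2,       # e.g. CartPole discrete actions
                 gpu_id=0,           # falls back to CPU if CUDA is unavailable
                 args=args)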

explore_one_env(env, horizon_len: int, if_random: bool = False) → Tuple[torch.Tensor, ...][source]

Collect trajectories through the actor-environment interaction for a single environment instance.

Parameters:
  • env – the RL training environment, providing env.reset() and env.step(). It should be a vectorized env (here with num_envs == 1).

  • horizon_len – the number of steps to collect while exploring before updating the networks.

  • if_random – uses random actions for warm-up exploration.

Returns:

(states, actions, rewards, undones) for off-policy training.

num_envs == 1
states.shape == (horizon_len, num_envs, state_dim)
actions.shape == (horizon_len, num_envs, action_dim)
rewards.shape == (horizon_len, num_envs)
undones.shape == (horizon_len, num_envs)
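A usage sketch based on the shapes documented above (agent, env, and state_dim are placeholders for an initialized AgentDQN, a vectorized environment with num_envs == 1, and its observation size):

horizon_len = 128
states, actions, rewards, undones = agent.explore_one_env(env, horizon_len)

assert states.shape == (horizon_len, 1, state_dim)
assert rewards.shape == (horizon_len, 1)
# undones marks steps where the episode has not terminated (1 - done),
# so the collected trajectories can be pushed into the replay buffer.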

explore_vec_env(env, horizon_len: int, if_random: bool = False) → Tuple[torch.Tensor, ...][source]

Collect trajectories through the actor-environment interaction for a vectorized environment instance.

Parameters:
  • env – the RL training environment, providing env.reset() and env.step(). It should be a vectorized env.

  • horizon_len – the number of steps to collect while exploring before updating the networks.

  • if_random – uses random actions for warm-up exploration.

Returns:

(states, actions, rewards, undones) for off-policy training.

states.shape == (horizon_len, num_envs, state_dim)
actions.shape == (horizon_len, num_envs, action_dim)
rewards.shape == (horizon_len, num_envs)
undones.shape == (horizon_len, num_envs)

get_obj_critic_per(buffer: ReplayBuffer, batch_size: int) → Tuple[torch.Tensor, torch.Tensor][source]

Calculate the loss of the network and predict Q values with Prioritized Experience Replay (PER).

Parameters:
  • buffer – the ReplayBuffer instance that stores the trajectories.

  • batch_size – the size of batch data for Stochastic Gradient Descent (SGD).

Returns:

the loss of the network and Q values.
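Conceptually, the PER variant weights each TD error by an importance-sampling weight and returns the new absolute TD errors so the buffer can refresh its priorities. A minimal sketch with placeholder tensors (not the actual ReplayBuffer API):

import torch

def per_critic_loss(q_value, q_label, is_weights):
    """q_value: Q(s, a) from the network, q_label: the TD target,
    is_weights: importance-sampling weights from the PER buffer."""
    td_error = q_label - q_value
    obj_critic = (is_weights * td_error.pow(2)).mean()  # weighted MSE loss
    new_priorities = td_error.detach().abs()            # used to update the buffer's priorities
    return obj_critic, new_priorities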

get_obj_critic_raw(buffer: ReplayBuffer, batch_size: int) → Tuple[torch.Tensor, torch.Tensor][source]

Calculate the loss of the network and predict Q values with uniform sampling.

Parameters:
  • buffer – the ReplayBuffer instance that stores the trajectories.

  • batch_size – the size of batch data for Stochastic Gradient Descent (SGD).

Returns:

the loss of the network and Q values.
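With uniform sampling, the objective is the standard DQN temporal-difference loss. A minimal sketch with placeholder tensors (cri and cri_target stand for the online and target Q-networks; the batch layout is an assumption):

import torch

def dqn_critic_loss(cri, cri_target, batch, gamma=0.99):
    states, actions, rewards, undones, next_states = batch
    with torch.no_grad():
        next_q = cri_target(next_states).max(dim=1, keepdim=True)[0]
        q_label = rewards + undones * gamma * next_q   # TD target; undones masks terminal steps
    q_value = cri(states).gather(1, actions.long())    # Q(s, a) for the taken actions
    obj_critic = torch.nn.functional.mse_loss(q_value, q_label)
    return obj_critic, q_value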

Networks

class elegantrl.agents.net.QNet(*args: Any, **kwargs: Any)[source]
class elegantrl.agents.net.QNetDuel(*args: Any, **kwargs: Any)[source]
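QNetDuel implements the dueling architecture, which splits the Q-value into a state-value head and an advantage head: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a). A minimal sketch of such a head (the layer layout is illustrative, not QNetDuel's actual structure):

import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Q(s, a) = V(s) + A(s, a) - mean over actions of A(s, a)."""
    def __init__(self, mid_dim: int, action_dim: int):
        super().__init__()
        self.value_head = nn.Linear(mid_dim, 1)                # state value V(s)
        self.advantage_head = nn.Linear(mid_dim, action_dim)   # advantages A(s, a)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        value = self.value_head(hidden)
        advantage = self.advantage_head(hidden)
        return value + advantage - advantage.mean(dim=1, keepdim=True)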