REDQ

Randomized Ensembled Double Q-Learning: Learning Fast Without a Model (REDQ) has three carefully integrated ingredients to achieve its high performance:

  • an update-to-data (UTD) ratio >> 1, i.e., many gradient updates per environment interaction.

  • an ensemble of Q functions.

  • in-target minimization across a random subset of Q functions.

This implementation is based on SAC.
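
The core of REDQ is the third ingredient. Below is a minimal sketch of the in-target minimization, assuming hypothetical names (redq_target, target_critics) that are not part of ElegantRL's API: the Bellman target takes the minimum over a random subset of M target critics drawn from the ensemble of N, with the usual SAC entropy bonus.

import torch

def redq_target(reward, mask, next_sa, next_logprob, target_critics, alpha, m=2):
    """Entropy-regularized Bellman target with in-target minimization.

    next_sa: concatenated next state and next action sampled from the policy.
    mask:    gamma * (1 - done), so terminal transitions contribute only the reward.
    """
    idx = torch.randperm(len(target_critics))[:m].tolist()   # random subset of size M
    q_next = torch.stack([target_critics[i](next_sa) for i in idx])
    q_min = q_next.min(dim=0).values                          # in-target minimization
    return reward + mask * (q_min - alpha * next_logprob)     # SAC-style entropy bonus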

Code Snippet

import torch
from elegantrl.run import train_and_evaluate
from elegantrl.config import Arguments
from elegantrl.train.config import build_env
from elegantrl.agents.AgentREDQ import AgentREDQ

# train and save
args = Arguments(env=build_env('Hopper-v2'), agent=AgentREDQ())
args.cwd = 'demo_Hopper_REDQ'
train_and_evaluate(args)

# test
agent = AgentREDQ()
agent.init(args.net_dim, args.state_dim, args.action_dim)
agent.save_or_load_agent(cwd=args.cwd, if_save=False)

env = build_env('Hopper-v2')  # evaluate on the same environment used for training
state = env.reset()
episode_reward = 0
for i in range(125000):
    action = agent.select_action(state)
    next_state, reward, done, _ = env.step(action)
    env.render()

    episode_reward += reward
    if done:
        print(f'Step {i:>6}, Episode return {episode_reward:8.3f}')
        break
    state = next_state

Parameters

class elegantrl.agents.AgentREDQ.AgentREDQ(net_dim, state_dim, action_dim, gpu_id=0, args=None)[source]

Bases: AgentBase

Randomized Ensembled Double Q-Learning algorithm. “Randomized Ensembled Double Q-Learning: Learning Fast Without a Model”. Xinyue Chen et al., 2021.

Parameters
  • net_dim[int] – the dimension of the networks (the width of the hidden layers)

  • state_dim[int] – the dimension of the state (the length of the state vector)

  • action_dim[int] – the dimension of the action (the length of the continuous action vector)

  • reward_scale – scale factor applied to the reward to keep the Q values in a reasonable range

  • gamma – the discount factor of Reinforcement Learning

  • learning_rate – the learning rate of the optimizer

  • if_per_or_gae – use PER (Prioritized Experience Replay, off-policy) or GAE (Generalized Advantage Estimation, on-policy) for sparse rewards

  • env_num – the number of environments in a VectorEnv; env_num == 1 means VectorEnv is not used

  • gpu_id – the ID of the GPU used for training; the CPU is used when CUDA is not available

  • G – the update-to-data (UTD) ratio, i.e., the number of critic updates per environment interaction (see the toy sketch after this list)

  • M – the size of the random subset of critics used for in-target minimization

  • N – the number of critics in the ensemble
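
The following toy sketch is purely illustrative (it is not ElegantRL's internal code and omits target networks and next-state bootstrapping): it only shows how G, M, and N interact, with an ensemble of N critics, a random subset of M critics used for the target, and G gradient steps per environment step.

import torch
import torch.nn as nn

N, M, G = 10, 2, 20                  # ensemble size, target subset size, UTD ratio
state_dim, action_dim, batch_size = 11, 3, 256

critics = nn.ModuleList([nn.Linear(state_dim + action_dim, 1) for _ in range(N)])
optimizer = torch.optim.Adam(critics.parameters(), lr=3e-4)

# Stand-in minibatch; during training these would come from the replay buffer.
sa = torch.cat((torch.randn(batch_size, state_dim),
                torch.randn(batch_size, action_dim)), dim=1)
reward = torch.randn(batch_size, 1)

for _ in range(G):                                   # G gradient updates per env step
    with torch.no_grad():
        idx = torch.randperm(N)[:M].tolist()         # random subset of M critics
        q_min = torch.stack([critics[i](sa) for i in idx]).min(dim=0).values
        target = reward + 0.99 * q_min               # stand-in for the Bellman target
    q_all = torch.stack([critic(sa) for critic in critics])
    loss = ((q_all - target) ** 2).mean()            # all N critics regress to the same target
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()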

get_obj_critic_per(buffer, batch_size)[source]

Calculate the loss of the network with Prioritized Experience Replay (PER).

Parameters
  • buffer – the ReplayBuffer instance that stores the trajectories.

  • batch_size – the size of batch data for Stochastic Gradient Descent (SGD).

Returns

the loss of the network and states.
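
For orientation, a generic sketch of a PER-weighted critic objective (the names are hypothetical and the exact ElegantRL implementation may differ): squared TD errors are weighted by the importance-sampling weights returned by the prioritized buffer, and the absolute TD errors are handed back to the buffer to refresh the priorities.

import torch

def per_weighted_loss(q_value, q_target, is_weight):
    td_error = q_target - q_value
    loss = (is_weight * td_error.pow(2)).mean()      # importance-weighted MSE
    new_priority = td_error.detach().abs()           # used to update the buffer's priorities
    return loss, new_priority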

get_obj_critic_raw(buffer, batch_size)[source]

Calculate the loss of networks with uniform sampling.

Parameters
  • buffer – the ReplayBuffer instance that stores the trajectories.

  • batch_size – the size of batch data for Stochastic Gradient Descent (SGD).

Returns

the loss of the network and states.

get_obj_critic_raw_(buffer, batch_size, alpha)[source]

Calculate the loss of networks with uniform sampling.

Parameters
  • buffer – the ReplayBuffer instance that stores the trajectories.

  • batch_size – the size of batch data for Stochastic Gradient Descent (SGD).

  • alpha – the trade-off coefficient of entropy regularization.

Returns

the loss of the network and states.
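
For reference, the entropy-regularized target from the REDQ paper that this objective regresses the critics toward (ElegantRL's exact implementation may differ slightly):

y = r + \gamma \, (1 - d) \left( \min_{i \in \mathcal{M}} Q_{\bar{\phi}_i}(s', \tilde{a}') - \alpha \log \pi_\theta(\tilde{a}' \mid s') \right),
\qquad \tilde{a}' \sim \pi_\theta(\cdot \mid s')

where \mathcal{M} is a random subset of M indices drawn from the N target critics, d is the done flag, and \alpha is the entropy coefficient passed to this method.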

select_actions_(state, size, env)[source]

Select continuous actions for exploration.

Parameters

state – states.shape == (batch_size, state_dim)

Returns

actions.shape == (batch_size, action_dim), with -1 < action < +1
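
An illustrative sketch (not ElegantRL's internal code) of how a SAC-style actor turns a batch of states into bounded exploration actions: sample from a Gaussian over pre-activations and squash with tanh so that -1 < action < +1.

import torch

def sample_squashed_actions(mean, log_std):
    std = log_std.clamp(-20.0, 2.0).exp()
    pre_tanh = mean + std * torch.randn_like(mean)
    return pre_tanh.tanh()                        # shape (batch_size, action_dim)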

update_net_(buffer, batch_size, soft_update_tau)[source]

Update the neural networks by sampling batch data from ReplayBuffer.

Parameters
  • buffer – the ReplayBuffer instance that stores the trajectories.

  • batch_size – the size of batch data for Stochastic Gradient Descent (SGD).

  • soft_update_tau – the soft update parameter.

Returns

a tuple of the log information.
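
A hedged usage sketch based only on the signature documented above; buffer is assumed to be an ElegantRL ReplayBuffer already filled with transitions, and the numeric values are placeholders.

logging_tuple = agent.update_net_(buffer, batch_size=256, soft_update_tau=5e-3)  # placeholder values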

Networks

class elegantrl.agents.net.ActorSAC(*args: Any, **kwargs: Any)[source]
class elegantrl.agents.net.Critic(*args: Any, **kwargs: Any)[source]
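
A hedged construction sketch: REDQ pairs a single stochastic ActorSAC with an ensemble of N Critic networks. The constructor arguments (net_dim, state_dim, action_dim) are an assumption based on the agent parameters documented above and may differ between ElegantRL versions.

from elegantrl.agents.net import ActorSAC, Critic

net_dim, state_dim, action_dim, N = 256, 11, 3, 10    # placeholder dimensions
actor = ActorSAC(net_dim, state_dim, action_dim)      # assumed argument order
critics = [Critic(net_dim, state_dim, action_dim) for _ in range(N)]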