PPO

Proximal Policy Optimization (PPO) is an on-policy Actor-Critic algorithm for both discrete and continuous action spaces. It has two primary variants, PPO-Penalty and PPO-Clip, both of which use a surrogate objective to keep the new policy from straying too far from the old policy. This implementation provides PPO-Clip and supports the following extensions:

  • Target network: ✔️

  • Gradient clipping: ✔️

  • Reward clipping: ❌

  • Generalized Advantage Estimation (GAE): ✔️

  • Discrete version: ✔️

Note

The surrogate objective is the key feature of PPO since it both regularizes the policy update and enables the reuse of training data.
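To make the clipping concrete, the following is a minimal sketch of the PPO-Clip surrogate loss in PyTorch. It is illustrative only and not taken from ElegantRL's source; the function name and the clip_ratio default are assumptions.

import torch

def clipped_surrogate_loss(new_logprob, old_logprob, advantage, clip_ratio=0.2):
    # probability ratio r_t = pi_new(a|s) / pi_old(a|s), computed in log space for stability
    ratio = (new_logprob - old_logprob).exp()
    surrogate1 = ratio * advantage
    surrogate2 = ratio.clamp(1.0 - clip_ratio, 1.0 + clip_ratio) * advantage
    # take the pessimistic (clipped) term and negate it, since optimizers minimize
    return -torch.min(surrogate1, surrogate2).mean()

Clipping the ratio removes the incentive to push the policy beyond the trust region, which is what allows the same batch of on-policy data to be reused for several update epochs.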

A clear explanation of the PPO algorithm and its implementation in ElegantRL is available here.

Code Snippet

import torch
from elegantrl.run import train_and_evaluate
from elegantrl.config import Arguments
from elegantrl.train.config import build_env
from elegantrl.agents.AgentPPO import AgentPPO

# train and save
args = Arguments(env=build_env('BipedalWalker-v3'), agent=AgentPPO())
args.cwd = 'demo_BipedalWalker_PPO'  # working directory for checkpoints and logs
args.env.target_return = 300         # stop training once this episode return is reached
args.reward_scale = 2 ** -2          # scale raw rewards before training
train_and_evaluate(args)

# test: reload the trained agent and run one episode
agent = AgentPPO()
agent.init(args.net_dim, args.state_dim, args.action_dim)
agent.save_or_load_agent(cwd=args.cwd, if_save=False)  # if_save=False loads the saved weights

env = build_env('BipedalWalker-v3')
state = env.reset()
episode_reward = 0
for i in range(2 ** 10):
    action = agent.select_action(state)
    next_state, reward, done, _ = env.step(action)

    episode_reward += reward
    if done:
        print(f'Step {i:>6}, Episode return {episode_reward:8.3f}')
        break
    state = next_state
    env.render()

Parameters

class elegantrl.agents.AgentPPO.AgentPPO(net_dim: int, state_dim: int, action_dim: int, gpu_id=0, args=None)[source]

Bases: AgentBase

PPO algorithm: “Proximal Policy Optimization Algorithms”, John Schulman et al., 2017.

Parameters
  • net_dim[int] – the dimension of the networks (the width of the neural networks)

  • state_dim[int] – the dimension of the state (the length of the state vector)

  • action_dim[int] – the dimension of the action (or the number of discrete actions)

  • learning_rate[float] – the learning rate of the optimizer

  • if_per_or_gae[bool] – use PER (off-policy) or GAE (on-policy) to handle sparse rewards

  • env_num[int] – the number of sub-environments in VectorEnv; env_num == 1 means VectorEnv is not used

  • agent_id[int] – if visible_gpu is ‘1,9,3,4’, then agent_id=1 means (1,9,3,4)[agent_id] == 9
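For illustration only, a hypothetical construction call that mirrors the signature above (the state and action dimensions shown are those of BipedalWalker-v3, and args is an Arguments instance as in the code snippet):

agent = AgentPPO(net_dim=2 ** 8, state_dim=24, action_dim=4, gpu_id=0, args=args)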

explore_one_env(env, horizon_len) → list[source]

Collect trajectories through the actor-environment interaction.

Parameters
  • env – the DRL environment instance.

  • horizon_len – the total number of steps to collect during the interaction.

Returns

a list of trajectories [traj, …] where traj = [(state, other), …].
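The rollout logic can be pictured roughly as below. This is a simplified sketch under assumed shapes, not the actual ElegantRL implementation; in particular, the contents of other and the exact tensor handling differ across versions. explore_vec_env (next) follows the same pattern with batched states from the vectorized environment.

def explore_one_env_sketch(agent, env, horizon_len):
    # collect `horizon_len` transitions from a single environment
    traj = []
    state = env.reset()
    for _ in range(horizon_len):
        action = agent.select_action(state)            # actor chooses an action
        next_state, reward, done, _ = env.step(action)
        other = (reward, done, action)                 # auxiliary terms stored alongside the state
        traj.append((state, other))
        state = env.reset() if done else next_state    # restart the episode when it ends
    return [traj]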

explore_vec_env(env, target_step, random_exploration=None) → list[source]

Collect trajectories through the actor-environment interaction for a vectorized environment instance.

Parameters
  • env – the DRL environment instance.

  • target_step – the total number of steps to collect during the interaction.

Returns

a list of trajectories [traj, …] where each trajectory is a list of transitions [(state, other), …].

get_reward_sum_gae(buf_len, ten_reward, ten_mask, ten_value) → Tuple[torch.Tensor, torch.Tensor][source]

Calculate the reward-to-go and advantage estimation using GAE.

Parameters
  • buf_len – the length of the ReplayBuffer.

  • ten_reward – a tensor of rewards for the state-action pairs.

  • ten_mask – a tensor of masks, each the product of the not-done flag and the discount factor.

  • ten_value – a tensor of state values estimated by the Critic network.

Returns

the reward-to-go and advantage estimation.
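The GAE recursion behind this method can be sketched as follows. The tensor names mirror the parameters above (1-D tensors assumed), ten_mask is assumed to already fold the discount factor into the not-done flag, and lambda_gae is an illustrative default, not ElegantRL's setting.

import torch

def reward_sum_gae_sketch(buf_len, ten_reward, ten_mask, ten_value, lambda_gae=0.95):
    ten_r_sum = torch.empty(buf_len, dtype=torch.float32)  # reward-to-go
    ten_adv = torch.empty(buf_len, dtype=torch.float32)    # advantage estimate
    next_r_sum, next_adv, next_value = 0.0, 0.0, 0.0
    for t in range(buf_len - 1, -1, -1):                   # iterate backwards over the buffer
        ten_r_sum[t] = ten_reward[t] + ten_mask[t] * next_r_sum
        delta = ten_reward[t] + ten_mask[t] * next_value - ten_value[t]
        ten_adv[t] = delta + ten_mask[t] * lambda_gae * next_adv
        next_r_sum, next_adv, next_value = ten_r_sum[t], ten_adv[t], ten_value[t]
    return ten_r_sum, ten_adv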

get_reward_sum_raw(buf_len, buf_reward, buf_mask, buf_value) → Tuple[torch.Tensor, torch.Tensor][source]

Calculate the reward-to-go and advantage estimation.

Parameters
  • buf_len – the length of the ReplayBuffer.

  • buf_reward – a tensor of rewards for the state-action pairs.

  • buf_mask – a tensor of masks, each the product of the not-done flag and the discount factor.

  • buf_value – a tensor of state values estimated by the Critic network.

Returns

the reward-to-go and advantage estimation.
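For comparison with the GAE version, here is a sketch of the plain reward-to-go computation, where the advantage is approximated simply as the discounted return minus the critic's value estimate. Again this is illustrative (1-D tensors assumed), not the exact ElegantRL implementation.

import torch

def reward_sum_raw_sketch(buf_len, buf_reward, buf_mask, buf_value):
    buf_r_sum = torch.empty(buf_len, dtype=torch.float32)  # discounted reward-to-go
    next_r_sum = 0.0
    for t in range(buf_len - 1, -1, -1):                    # iterate backwards over the buffer
        buf_r_sum[t] = buf_reward[t] + buf_mask[t] * next_r_sum
        next_r_sum = buf_r_sum[t]
    buf_advantage = buf_r_sum - buf_value                   # advantage = return - value estimate
    return buf_r_sum, buf_advantage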

update_net(buffer)[source]

Update the neural networks by sampling batch data from ReplayBuffer.

Note

This update uses advantage normalization and an entropy loss term.

Parameters
  • buffer – the ReplayBuffer instance that stores the trajectories.

  • batch_size – the size of batch data for Stochastic Gradient Descent (SGD).

  • repeat_times – the number of times each trajectory is reused for updates.

  • soft_update_tau – the coefficient for the soft update of the target network.

Returns

a tuple of the log information.
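The note above (advantage normalization plus an entropy term, with each trajectory reused repeat_times times) can be illustrated with a rough PyTorch update loop. This is a generic PPO-Clip update sketch, not ElegantRL's actual update_net; the helper actor.get_logprob_entropy, the hyperparameter defaults, and the buffer layout are all assumptions.

import torch

def ppo_update_sketch(actor, critic, actor_optim, critic_optim, buffer_data,
                      batch_size=256, repeat_times=8, clip_ratio=0.2, lambda_entropy=0.01):
    state, action, old_logprob, r_sum, advantage = buffer_data  # assumed buffer layout
    # advantage normalization, as mentioned in the Note
    advantage = (advantage - advantage.mean()) / (advantage.std() + 1e-5)

    buf_len = state.shape[0]
    for _ in range(int(repeat_times * buf_len / batch_size)):
        idx = torch.randint(buf_len, size=(batch_size,))

        # actor update: clipped surrogate objective plus an entropy bonus
        new_logprob, entropy = actor.get_logprob_entropy(state[idx], action[idx])  # assumed helper
        ratio = (new_logprob - old_logprob[idx]).exp()
        surrogate = torch.min(ratio * advantage[idx],
                              ratio.clamp(1 - clip_ratio, 1 + clip_ratio) * advantage[idx])
        actor_loss = -(surrogate.mean() + lambda_entropy * entropy.mean())
        actor_optim.zero_grad()
        actor_loss.backward()
        actor_optim.step()

        # critic update: regress the value estimate toward the reward-to-go
        value = critic(state[idx]).squeeze(-1)
        critic_loss = torch.nn.functional.mse_loss(value, r_sum[idx])
        critic_optim.zero_grad()
        critic_loss.backward()
        critic_optim.step()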

class elegantrl.agents.AgentPPO.AgentDiscretePPO(net_dim: int, state_dim: int, action_dim: int, gpu_id=0, args=None)[source]

Bases: AgentPPO

Parameters
  • net_dim[int] – the dimension of the networks (the width of the neural networks)

  • state_dim[int] – the dimension of the state (the length of the state vector)

  • action_dim[int] – the dimension of the action (the number of discrete actions)

  • learning_rate[float] – the learning rate of the optimizer

  • if_per_or_gae[bool] – use PER (off-policy) or GAE (on-policy) to handle sparse rewards

  • env_num[int] – the number of sub-environments in VectorEnv; env_num == 1 means VectorEnv is not used

  • agent_id[int] – if visible_gpu is ‘1,9,3,4’, then agent_id=1 means (1,9,3,4)[agent_id] == 9
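A minimal training sketch for the discrete variant, mirroring the continuous-control snippet above; the environment name and hyperparameters are illustrative choices, not recommended settings.

from elegantrl.run import train_and_evaluate
from elegantrl.config import Arguments
from elegantrl.train.config import build_env
from elegantrl.agents.AgentPPO import AgentDiscretePPO

args = Arguments(env=build_env('CartPole-v0'), agent=AgentDiscretePPO())
args.cwd = 'demo_CartPole_DiscretePPO'  # working directory for checkpoints and logs
args.reward_scale = 2 ** 0              # CartPole rewards are already in a small range
train_and_evaluate(args)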

Networks

class elegantrl.agents.net.ActorPPO(*args: Any, **kwargs: Any)[source]
class elegantrl.agents.net.ActorDiscretePPO(*args: Any, **kwargs: Any)[source]
class elegantrl.agents.net.CriticPPO(*args: Any, **kwargs: Any)[source]