PPO¶
Proximal Policy Optimization (PPO) is an on-policy Actor-Critic algorithm for both discrete and continuous action spaces. It has two primary variants, PPO-Penalty and PPO-Clip, both of which use a surrogate objective to keep the new policy from moving too far from the old policy. This implementation provides PPO-Clip and supports the following extensions:
Target network: ✔️
Gradient clipping: ✔️
Reward clipping: ❌
Generalized Advantage Estimation (GAE): ✔️
Discrete version: ✔️
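The GAE extension listed above can be illustrated with a minimal, dependency-free sketch for a single trajectory. It is not ElegantRL's code: the function name `compute_gae`, the `values` and `next_value` arguments (critic estimates), and the default `gamma` and `lam` are assumptions made for illustration. It also assumes that `undones[t]` is 0.0 on the step that ends an episode and 1.0 otherwise.

```python
def compute_gae(rewards, undones, values, next_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory (hypothetical helper).

    rewards[t], undones[t], values[t] are per-step floats;
    next_value is the critic estimate for the state after the last step.
    """
    horizon_len = len(rewards)
    advantages = [0.0] * horizon_len
    advantage = 0.0
    for t in reversed(range(horizon_len)):
        mask = gamma * undones[t]  # becomes 0 when the episode ended at step t
        delta = rewards[t] + mask * next_value - values[t]  # TD residual
        advantage = delta + mask * lam * advantage  # discounted sum of residuals
        advantages[t] = advantage
        next_value = values[t]  # value of the "next state" for step t - 1
    return advantages
```

With `lam=1` the estimate reduces to the discounted return minus the value baseline; with `lam=0` it reduces to the one-step TD residual.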
Note
The surrogate objective is the key feature of PPO since it both regularizes the policy update and enables the reuse of training data.
A clear explanation of the PPO algorithm and its implementation in ElegantRL is available here.
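To make the clipped surrogate objective concrete, here is a minimal sketch in plain Python. It follows the PPO-Clip formula from the paper rather than ElegantRL's source; the function name `clipped_surrogate` and the default `ratio_clip=0.2` are assumptions, and a real implementation would operate on batched tensors and minimize the negated objective.

```python
import math


def clipped_surrogate(new_logprobs, old_logprobs, advantages, ratio_clip=0.2):
    """Mean PPO-Clip surrogate objective over a batch (to be maximized)."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        ratio = math.exp(new_lp - old_lp)  # pi_new(a|s) / pi_old(a|s)
        clipped = min(max(ratio, 1.0 - ratio_clip), 1.0 + ratio_clip)
        # Take the smaller of the two terms, so the objective gains nothing
        # once the ratio has moved past the clip range in the favorable direction.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

Because the objective stops rewarding changes once the ratio leaves `[1 - ratio_clip, 1 + ratio_clip]` in the favorable direction, the same batch can be reused for several gradient steps without the policy drifting far from the one that collected the data.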
Code Snippet¶
import torch
from elegantrl.run import train_and_evaluate
from elegantrl.config import Arguments
from elegantrl.train.config import build_env
from elegantrl.agents.AgentPPO import AgentPPO
# train and save
args = Arguments(env=build_env('BipedalWalker-v3'), agent=AgentPPO())
args.cwd = 'demo_BipedalWalker_PPO'
args.env.target_return = 300
args.reward_scale = 2 ** -2
train_and_evaluate(args)
# test
agent = AgentPPO()
agent.init(args.net_dim, args.state_dim, args.action_dim)
agent.save_or_load_agent(cwd=args.cwd, if_save=False)
env = build_env('BipedalWalker-v3')
state = env.reset()
episode_reward = 0
for i in range(2 ** 10):
    action = agent.select_action(state)
    next_state, reward, done, _ = env.step(action)
    episode_reward += reward
    if done:
        print(f'Step {i:>6}, Episode return {episode_reward:8.3f}')
        break
    else:
        state = next_state
    env.render()
Parameters¶
- class elegantrl.agents.AgentPPO.AgentPPO(net_dims: [int], state_dim: int, action_dim: int, gpu_id: int = 0, args: elegantrl.train.config.Config = Config())[source]¶
PPO algorithm. “Proximal Policy Optimization Algorithms”. John Schulman et al., 2017.
net_dims: the hidden-layer dimensions of the MLP (multilayer perceptron)
state_dim: the dimension of the state vector
action_dim: the dimension of the action (or the number of discrete actions)
gpu_id: the GPU id of the training device; the CPU is used when CUDA is not available
args: the arguments for agent training, e.g. args = Config()
- explore_one_env(env, horizon_len: int, if_random: bool = False) → Tuple[torch.Tensor, ...] [source]¶
Collect trajectories through the actor-environment interaction for a single environment instance.
env: the RL training environment, which provides env.reset() and env.step(). It should be a vector env.
horizon_len: the number of steps to collect while exploring before the networks are updated
return: (states, actions, logprobs, rewards, undones)
With env_num == 1:
states.shape == (horizon_len, env_num, state_dim)
actions.shape == (horizon_len, env_num, action_dim)
logprobs.shape == (horizon_len, env_num, action_dim)
rewards.shape == (horizon_len, env_num)
undones.shape == (horizon_len, env_num)
- explore_vec_env(env, horizon_len: int, if_random: bool = False) → Tuple[torch.Tensor, ...] [source]¶
Collect trajectories through the actor-environment interaction for a vectorized environment instance.
env: the RL training environment, which provides env.reset() and env.step(). It should be a vector env.
horizon_len: the number of steps to collect while exploring before the networks are updated
return: (states, actions, logprobs, rewards, undones)
states.shape == (horizon_len, env_num, state_dim)
actions.shape == (horizon_len, env_num, action_dim)
logprobs.shape == (horizon_len, env_num, action_dim)
rewards.shape == (horizon_len, env_num)
undones.shape == (horizon_len, env_num)
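The collection loop behind the explore methods above can be sketched for a single environment as follows. This is an illustration only, not the library's implementation: `ToyEnv`, `explore_one_env_sketch`, and the `policy` callable (which returns an action and its log-probability) are hypothetical, plain lists stand in for tensors, and resetting the environment inside the loop when an episode ends is an assumption.

```python
class ToyEnv:
    """Hypothetical 1-D environment whose episodes end after `max_step` steps."""

    def __init__(self, max_step=4):
        self.max_step = max_step
        self.step_count = 0

    def reset(self):
        self.step_count = 0
        return [0.0]

    def step(self, action):
        self.step_count += 1
        state = [float(self.step_count)]
        reward = -abs(action[0])
        done = self.step_count >= self.max_step
        return state, reward, done, {}


def explore_one_env_sketch(env, policy, horizon_len):
    """Collect exactly `horizon_len` transitions from one environment."""
    states, actions, logprobs, rewards, undones = [], [], [], [], []
    state = env.reset()
    for _ in range(horizon_len):
        action, logprob = policy(state)
        next_state, reward, done, _ = env.step(action)
        states.append(state)
        actions.append(action)
        logprobs.append(logprob)
        rewards.append(reward)
        undones.append(0.0 if done else 1.0)  # 0.0 marks the end of an episode
        state = env.reset() if done else next_state
    return states, actions, logprobs, rewards, undones
```

Storing the log-probability at collection time is what later allows the probability ratio between the updated policy and the data-collecting policy to be computed.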
- class elegantrl.agents.AgentPPO.AgentDiscretePPO(net_dims: [int], state_dim: int, action_dim: int, gpu_id: int = 0, args: elegantrl.train.config.Config = Config())[source]¶
- explore_one_env(env, horizon_len: int, if_random: bool = False) → Tuple[torch.Tensor, ...] [source]¶
Collect trajectories through the actor-environment interaction for a single environment instance.
env: the RL training environment, which provides env.reset() and env.step(). It should be a vector env.
horizon_len: the number of steps to collect while exploring before the networks are updated
return: (states, actions, logprobs, rewards, undones)
With env_num == 1:
states.shape == (horizon_len, env_num, state_dim)
actions.shape == (horizon_len, env_num, action_dim)
logprobs.shape == (horizon_len, env_num, action_dim)
rewards.shape == (horizon_len, env_num)
undones.shape == (horizon_len, env_num)
- explore_vec_env(env, horizon_len: int, if_random: bool = False) → Tuple[torch.Tensor, ...] [source]¶
Collect trajectories through the actor-environment interaction for a vectorized environment instance.
env: the RL training environment, which provides env.reset() and env.step(). It should be a vector env.
horizon_len: the number of steps to collect while exploring before the networks are updated
return: (states, actions, logprobs, rewards, undones)
states.shape == (horizon_len, env_num, state_dim)
actions.shape == (horizon_len, env_num, action_dim)
logprobs.shape == (horizon_len, env_num, action_dim)
rewards.shape == (horizon_len, env_num)
undones.shape == (horizon_len, env_num)
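For the discrete variant, the main difference from the continuous case is how an action and its log-probability are obtained from the actor's output. The sketch below assumes the actor produces one unnormalized score (logit) per action and samples from the resulting categorical distribution; the function name `sample_discrete_action` is hypothetical, and the library itself may do this differently.

```python
import math
import random


def sample_discrete_action(logits, rng=random):
    """Sample an action index from softmax(logits) and return (action, logprob)."""
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return action, math.log(probs[action])
```

For example, `sample_discrete_action([0.0, 0.0], random.Random(0))` returns one of the two actions with log-probability `math.log(0.5)`.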