Multi-Agent Deep Deterministic Policy Gradient (MADDPG) is a multi-agent reinforcement learning algorithm for continuous action spaces:

  • Implementation is based on DDPG ✔️

  • Initialize n DDPG agents in MADDPG ✔️

Code Snippet

def update_net(self, buffer, batch_size, repeat_times, soft_update_tau):
    self.batch_size = batch_size
    self.update_tau = soft_update_tau

    # sample one shared batch; every agent updates from the same transitions
    rewards, dones, actions, observations, next_obs = buffer.sample_batch(self.batch_size)
    for index in range(self.n_agents):
        self.update_agent(rewards, dones, actions, observations, next_obs, index)

    # soft-update each agent's target critic and target actor
    for agent in self.agents:
        self.soft_update(agent.cri_target, agent.cri, self.update_tau)
        self.soft_update(agent.act_target, agent.act, self.update_tau)
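The `soft_update` calls at the end blend each target network toward its online network by a factor tau (Polyak averaging). A minimal sketch of that rule on plain parameter arrays (the helper name mirrors the snippet; the parameter lists are placeholders, not ElegantRL's networks):

```python
import numpy as np

def soft_update(target_params, online_params, tau):
    """Polyak averaging: theta_target <- tau * theta_online + (1 - tau) * theta_target."""
    return [tau * w + (1.0 - tau) * t
            for w, t in zip(online_params, target_params)]

# one "layer" of weights for each network, purely illustrative
target = [np.zeros(3)]
online = [np.ones(3)]

# with tau = 0.01 the target moves 1% of the way toward the online weights
target = soft_update(target, online, tau=0.01)
```

A small tau (e.g. 0.01) keeps the target networks slowly trailing the online networks, which stabilizes the bootstrapped critic targets.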



class elegantrl.agents.AgentMADDPG.AgentMADDPG[source]

Bases: AgentBase

Multi-Agent DDPG algorithm. “Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments”. R. Lowe et al. 2017.

  • net_dim[int] – the dimension of networks (the width of neural networks)

  • state_dim[int] – the dimension of state (the length of the state vector)

  • action_dim[int] – the dimension of action (the length of the continuous action vector)

  • learning_rate[float] – learning rate of optimizer

  • gamma[float] – the discount factor of future rewards

  • n_agents[int] – number of agents

  • if_per_or_gae[bool] – PER (off-policy) or GAE (on-policy) for sparse reward

  • env_num[int] – the number of sub-environments in VectorEnv; env_num == 1 means VectorEnv is not used

  • agent_id[int] – if the visible_gpu is ‘1,9,3,4’, agent_id=1 means (1,9,3,4)[agent_id] == 9
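The `n_agents` parameter controls how many independent DDPG actor-critic pairs the MADDPG wrapper holds. A hypothetical sketch of that initialization pattern (the `TinyDDPGAgent`/`TinyMADDPG` classes are illustrative stand-ins, not ElegantRL's actual classes):

```python
class TinyDDPGAgent:
    """Placeholder for one DDPG agent: in the real library this would
    build an actor, a critic, and their target copies."""
    def __init__(self, net_dim, state_dim, action_dim):
        self.net_dim = net_dim
        self.state_dim = state_dim
        self.action_dim = action_dim

class TinyMADDPG:
    """Wrapper that initializes n DDPG agents, one per learner."""
    def __init__(self, net_dim, state_dim, action_dim, n_agents):
        self.n_agents = n_agents
        self.agents = [TinyDDPGAgent(net_dim, state_dim, action_dim)
                       for _ in range(n_agents)]

maddpg = TinyMADDPG(net_dim=64, state_dim=8, action_dim=2, n_agents=3)
```

Each agent keeps its own actor and critic; only the replay buffer and the update loop are shared across agents.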

explore_one_env(env, target_step) list[source]

Explore the environment for target_step steps.

  • env – the Environment instance to be explored.

  • target_step – the target number of steps to explore.

save_or_load_agent(cwd, if_save)[source]

Save or load training files for the Agent.

  • cwd – Current Working Directory. ElegantRL saves training files in the CWD.

  • if_save – True: save files. False: load files.

Select continuous actions for exploration.

  • state – states.shape == (n_agents, batch_size, state_dim)

Returns: actions.shape == (n_agents, batch_size, action_dim), -1 < action < +1
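The shape contract above can be checked with a tiny stand-in policy: each of the `n_agents` actors maps its batch of states to actions squashed into (-1, 1) by tanh. The random linear policy below is purely illustrative, not the library's actor network:

```python
import numpy as np

n_agents, batch_size, state_dim, action_dim = 3, 5, 8, 2
rng = np.random.default_rng(0)

# one batch of states per agent, and one (random) linear policy per agent
states = rng.normal(size=(n_agents, batch_size, state_dim))
weights = rng.normal(size=(n_agents, state_dim, action_dim))

# per-agent matmul over the batch, then tanh to keep -1 < action < +1
actions = np.tanh(np.einsum('nbs,nsa->nba', states, weights))
```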

update_agent(rewards, dones, actions, observations, next_obs, index)[source]

Update a single agent's neural networks; called by update_net.

  • rewards – reward list of the sampled buffer

  • dones – done list of the sampled buffer

  • actions – action list of the sampled buffer

  • observations – observation list of the sampled buffer

  • next_obs – next_observation list of the sampled buffer

  • index – ID of the agent
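In MADDPG each agent's critic is centralized: it scores the concatenation of all agents' observations and actions, while each actor only conditions on its own observation. A shape-level sketch of assembling that centralized critic input from the sampled lists above (illustrative only, not the library's code):

```python
import numpy as np

n_agents, batch_size, obs_dim, act_dim = 3, 4, 6, 2
rng = np.random.default_rng(1)

# per-agent batches, as sampled from the shared replay buffer
observations = rng.normal(size=(n_agents, batch_size, obs_dim))
actions = rng.normal(size=(n_agents, batch_size, act_dim))

def centralized_critic_input(observations, actions):
    """Concatenate every agent's observation and action for each sample."""
    b = observations.shape[1]
    obs_all = observations.transpose(1, 0, 2).reshape(b, -1)  # (batch, n_agents*obs_dim)
    act_all = actions.transpose(1, 0, 2).reshape(b, -1)       # (batch, n_agents*act_dim)
    return np.concatenate([obs_all, act_all], axis=1)

x = centralized_critic_input(observations, actions)
# shape: (batch_size, n_agents * (obs_dim + act_dim))
```

This is the "centralized training, decentralized execution" idea from the Lowe et al. paper: the extra information is only needed while training the critics.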

update_net(buffer, batch_size, repeat_times, soft_update_tau)[source]

Update the neural networks by sampling batch data from ReplayBuffer.

  • buffer – the ReplayBuffer instance that stores the trajectories.

  • batch_size – the size of batch data for Stochastic Gradient Descent (SGD).

  • repeat_times – the number of times each trajectory is reused.

  • soft_update_tau – the soft update parameter.
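Inside each agent's update, the sampled rewards and dones feed the standard DDPG bootstrapped target y = r + gamma * (1 - done) * Q'(s', a'), which the critic then regresses toward. A standalone arithmetic sketch of that target (not library code; gamma is the discount factor listed above):

```python
import numpy as np

gamma = 0.99
rewards = np.array([1.0, 0.5])
dones = np.array([0.0, 1.0])    # 1.0 marks a terminal transition
next_q = np.array([2.0, 3.0])   # target critic's value of (s', a')

# terminal transitions bootstrap nothing: (1 - done) masks out next_q
td_target = rewards + gamma * (1.0 - dones) * next_q
```

For the second (terminal) transition the target collapses to the raw reward 0.5, since there is no future value to bootstrap from.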

