Replay Buffer: replay_buffer.py

ElegantRL provides ReplayBuffer to store sampled transitions.

In ElegantRL, a Worker handles exploration (data sampling) and a Learner handles exploitation (model learning). We view this relationship as a “producer-consumer” model: the worker produces transitions, the learner consumes them to update the networks, and the learner in turn pushes the updated actor network back to the worker so that it can produce new transitions. The ReplayBuffer is the storage that connects the worker and the learner.

Each transition is stored in the format (state, (reward, done, action)).
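
The producer-consumer relationship can be pictured with the following minimal sketch. It is purely illustrative: the class SimpleBuffer, the toy actor/environment, and the push/sample method names are assumptions for this example, not the ElegantRL API.

    # Minimal, illustrative sketch of the producer-consumer pattern described above.
    # SimpleBuffer, toy_actor, toy_env_step, push, and sample are assumptions for
    # this example only; they are not the ElegantRL API.
    import random

    class SimpleBuffer:
        """Stand-in replay buffer connecting a worker (producer) and a learner (consumer)."""
        def __init__(self):
            self.items = []

        def push(self, transition):      # worker side: store one transition
            self.items.append(transition)

        def sample(self, batch_size):    # learner side: draw a random batch
            return random.sample(self.items, min(batch_size, len(self.items)))

    def toy_actor(state):                # placeholder policy (ignores the state)
        return random.choice((0, 1))

    def toy_env_step(state, action):     # placeholder environment dynamics
        next_state = state + action
        reward = float(action)
        done = next_state >= 10
        return next_state, reward, done

    buffer = SimpleBuffer()
    state = 0
    for _ in range(100):                 # worker: produce transitions
        action = toy_actor(state)
        next_state, reward, done = toy_env_step(state, action)
        buffer.push((state, (reward, done, action)))   # stored format from above
        state = 0 if done else next_state

    batch = buffer.sample(64)            # learner: consume a batch to update the nets
    print(len(batch), batch[0])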

Note

We allocate the ReplayBuffer in contiguous memory for high-performance training. Because the collected transitions are packed sequentially, addressing is much faster when a learner randomly samples a batch of transitions.
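
The note above can be made concrete with a minimal sketch of such preallocated, pointer-based storage, assuming PyTorch tensors on a chosen device. The class ContiguousBuffer and its methods are illustrative assumptions, not the ElegantRL implementation; the field names (states, actions, rewards, undones) mirror the diagram in the Implementations section below.

    # Illustrative sketch of contiguous, preallocated storage with a wrap-around
    # write pointer. This is an assumption for explanation, not ElegantRL's code.
    import torch

    class ContiguousBuffer:
        def __init__(self, max_size: int, state_dim: int, action_dim: int, device="cpu"):
            self.device = torch.device(device)
            self.max_size = max_size
            self.p = 0           # write pointer
            self.cur_size = 0    # number of valid transitions
            # one contiguous tensor per field, allocated once up front
            self.states  = torch.empty((max_size, state_dim),  dtype=torch.float32, device=self.device)
            self.actions = torch.empty((max_size, action_dim), dtype=torch.float32, device=self.device)
            self.rewards = torch.empty((max_size, 1),           dtype=torch.float32, device=self.device)
            self.undones = torch.empty((max_size, 1),           dtype=torch.float32, device=self.device)

        def add(self, state, action, reward, undone):
            self.states[self.p]  = state
            self.actions[self.p] = action
            self.rewards[self.p] = reward
            self.undones[self.p] = undone
            self.p = (self.p + 1) % self.max_size          # wrap around when full
            self.cur_size = min(self.cur_size + 1, self.max_size)

        def sample(self, batch_size: int):
            # random indices address the packed tensors directly (fast gather)
            ids = torch.randint(self.cur_size, size=(batch_size,), device=self.device)
            return self.states[ids], self.actions[ids], self.rewards[ids], self.undones[ids]

    # example usage (hypothetical dimensions)
    buf = ContiguousBuffer(max_size=1024, state_dim=4, action_dim=2)
    buf.add(torch.zeros(4), torch.zeros(2), reward=1.0, undone=1.0)
    states, actions, rewards, undones = buf.sample(batch_size=1)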

Implementations

class elegantrl.train.replay_buffer.ReplayBuffer(max_size: int, state_dim: int, action_dim: int, gpu_id: int = 0, num_seqs: int = 1, if_use_per: bool = False, args: elegantrl.train.config.Config = <elegantrl.train.config.Config object>)
device

The structure of ReplayBuffer (for example, num_seqs = num_workers * num_envs == 2 * 4 = 8):

ReplayBuffer:
worker0 for env0:  sequence of sub_env0.0  self.states  = Tensor[s, s, …, s, …, s]
                                           self.actions = Tensor[a, a, …, a, …, a]
                                           self.rewards = Tensor[r, r, …, r, …, r]
                                           self.undones = Tensor[d, d, …, d, …, d]
                                                                 <----max_size---->
                                                                 <-cur_size->
                                                                            ↑ pointer
                   sequence of sub_env0.1  s, s, …, s   a, a, …, a   r, r, …, r   d, d, …, d
                   sequence of sub_env0.2  s, s, …, s   a, a, …, a   r, r, …, r   d, d, …, d
                   sequence of sub_env0.3  s, s, …, s   a, a, …, a   r, r, …, r   d, d, …, d
worker1 for env1:  sequence of sub_env1.0  s, s, …, s   a, a, …, a   r, r, …, r   d, d, …, d
                   sequence of sub_env1.1  s, s, …, s   a, a, …, a   r, r, …, r   d, d, …, d
                   sequence of sub_env1.2  s, s, …, s   a, a, …, a   r, r, …, r   d, d, …, d
                   sequence of sub_env1.3  s, s, …, s   a, a, …, a   r, r, …, r   d, d, …, d

D: done=True    d: done=False
sequence of transitions: s-a-r-d, s-a-r-d, s-a-r-D   s-a-r-d, s-a-r-d, s-a-r-d, s-a-r-d, s-a-r-D   s-a-r-d, …
                         <------trajectory------->   <----------------trajectory--------------->   <---------
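
The per-sequence layout in the diagram can be sketched as one contiguous tensor per field with a sequence axis. The shape (num_seqs, max_size, dim) and the uniform sampling over (sequence, step) pairs below are assumptions for illustration, not the exact ElegantRL layout.

    # Illustrative sketch of the multi-sequence layout in the diagram above:
    # one slice per (worker, sub-env) sequence inside a single contiguous tensor.
    # The chosen shape and sampling scheme are assumptions, not ElegantRL's code.
    import torch

    num_seqs, max_size, state_dim, action_dim = 8, 1024, 4, 2   # 2 workers * 4 sub-envs

    states  = torch.empty((num_seqs, max_size, state_dim))
    actions = torch.empty((num_seqs, max_size, action_dim))
    rewards = torch.empty((num_seqs, max_size, 1))
    undones = torch.empty((num_seqs, max_size, 1))

    cur_size = 256   # how many valid steps each sequence currently holds

    def sample(batch_size: int):
        """Draw a batch uniformly over (sequence, step) pairs across all sequences."""
        seq_ids  = torch.randint(num_seqs, size=(batch_size,))
        step_ids = torch.randint(cur_size, size=(batch_size,))
        return (states[seq_ids, step_ids],
                actions[seq_ids, step_ids],
                rewards[seq_ids, step_ids],
                undones[seq_ids, step_ids])

    batch_s, batch_a, batch_r, batch_d = sample(batch_size=64)
    print(batch_s.shape)   # torch.Size([64, 4])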

per_beta

PER: Prioritized Experience Replay. Following Section 4 of the PER paper, alpha, beta = 0.7, 0.5 for the rank-based variant and alpha, beta = 0.6, 0.4 for the proportional variant.
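
For context, alpha shapes the PER sampling distribution and beta controls the importance-sampling correction. The snippet below is a generic sketch of those two formulas using a plain priority array (a practical implementation would typically use a sum tree); it is not ElegantRL's PER code.

    # Generic illustration of how alpha and beta enter PER (not ElegantRL's code):
    # alpha shapes the sampling distribution, beta corrects the resulting bias
    # via importance-sampling weights.
    import torch

    alpha, per_beta = 0.6, 0.4                  # proportional-variant values quoted above

    priorities = torch.rand(1000) + 1e-6        # e.g. |TD error| + epsilon
    probs = priorities ** alpha
    probs = probs / probs.sum()                 # P(i) = p_i^alpha / sum_k p_k^alpha

    batch_ids = torch.multinomial(probs, num_samples=64, replacement=True)

    # importance-sampling weights: w_i = (N * P(i))^(-beta), normalized by the max weight
    n = probs.numel()
    weights = (n * probs[batch_ids]) ** (-per_beta)
    weights = weights / weights.max()
    print(weights.shape)   # torch.Size([64])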

Multiprocessing

Initialization

Utils