Replay Buffer:

ElegantRL provides ReplayBuffer to store sampled transitions.

In ElegantRL, a Worker handles exploration (data sampling) and a Learner handles exploitation (model learning). We view this relationship as a "producer-consumer" model: the worker produces transitions, the learner consumes them, and the learner periodically updates the worker's actor network so that it produces new transitions. The ReplayBuffer is the shared storage that connects the worker and the learner.
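The producer-consumer relationship above can be sketched as two threads sharing a bounded buffer. This is a minimal, hypothetical stand-in: the real ElegantRL worker samples from an environment and the shared storage is the ReplayBuffer, but a `queue.Queue` illustrates the same flow.

```python
import queue
import random
import threading

# A bounded queue stands in for the ReplayBuffer connecting worker and learner.
transitions = queue.Queue(maxsize=64)
consumed = []

def worker(n_steps):
    # Producer: sample transitions from a (fake) environment.
    state = 0.0
    for _ in range(n_steps):
        action = random.random()
        reward, done = action, False
        # Each transition follows the documented format: (state, (reward, done, action)).
        transitions.put((state, (reward, done, action)))
        state += 1.0

def learner(n_steps):
    # Consumer: pop transitions and (pretend to) update the model.
    for _ in range(n_steps):
        consumed.append(transitions.get())

w = threading.Thread(target=worker, args=(100,))
lrn = threading.Thread(target=learner, args=(100,))
w.start(); lrn.start()
w.join(); lrn.join()
```

Because the queue is bounded, a fast worker blocks until the learner catches up, which is the same back-pressure a fixed-capacity replay buffer imposes.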

Each transition is stored in the format (state, (reward, done, action)).


We allocate the ReplayBuffer in contiguous memory (RAM) for high-performance training. Because the collected transitions are packed in sequence, addressing speed improves dramatically when a learner randomly samples a batch of transitions.
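The idea can be sketched as a preallocated ring buffer: slots are filled in sequence and overwritten once capacity is reached, and batches are drawn uniformly at random. This is a hypothetical, list-backed illustration; ElegantRL's actual ReplayBuffer stores transitions in preallocated tensors.

```python
import random

class SimpleReplayBuffer:
    """Minimal sketch of a fixed-capacity replay buffer.

    Slots are preallocated and filled in sequence (contiguous layout),
    wrapping around once full, so old transitions are overwritten.
    """

    def __init__(self, max_capacity):
        self.max_capacity = max_capacity
        self.data = [None] * max_capacity  # preallocated storage
        self.next_idx = 0                  # next slot to write
        self.size = 0                      # number of filled slots

    def append(self, state, reward, done, action):
        # Pack the transition as (state, (reward, done, action)).
        self.data[self.next_idx] = (state, (reward, done, action))
        self.next_idx = (self.next_idx + 1) % self.max_capacity
        self.size = min(self.size + 1, self.max_capacity)

    def sample(self, batch_size):
        # Uniform random batch over the filled region.
        idxs = [random.randrange(self.size) for _ in range(batch_size)]
        return [self.data[i] for i in idxs]
```

Usage: `append` is called by the worker after each environment step, and `sample` by the learner before each gradient update.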


class elegantrl.train.replay_buffer.ReplayBuffer(max_capacity: int, state_dim: int, action_dim: int, gpu_id=0, if_use_per=False)




class elegantrl.train.replay_buffer.BinarySearchTree(memo_len)

Binary Search Tree for PER (Prioritized Experience Replay).

Contributors: GyChou (GitHub), mississippiu (GitHub)
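PER samples transitions with probability proportional to their priority, which is commonly implemented with a sum tree: leaves hold per-transition priorities and each internal node holds the sum of its children, so both priority updates and proportional sampling take O(log n). The sketch below is a generic illustration of that structure, not ElegantRL's actual BinarySearchTree implementation; `memo_len` mirrors the constructor argument above.

```python
class SumTree:
    """Minimal sum-tree sketch for prioritized sampling.

    The tree is stored flat in a list: indices [0, memo_len - 2] are
    internal sum nodes, indices [memo_len - 1, 2 * memo_len - 2] are leaves.
    """

    def __init__(self, memo_len):
        self.memo_len = memo_len                    # number of buffer slots
        self.tree = [0.0] * (2 * memo_len - 1)      # internal nodes + leaves

    def update(self, leaf_idx, priority):
        # Set a leaf's priority, then propagate the change up to the root.
        tree_idx = leaf_idx + self.memo_len - 1
        delta = priority - self.tree[tree_idx]
        self.tree[tree_idx] += delta
        while tree_idx > 0:
            tree_idx = (tree_idx - 1) // 2
            self.tree[tree_idx] += delta

    def sample(self, value):
        # Descend from the root with value in [0, total priority):
        # go left if the value fits in the left subtree, else subtract
        # the left sum and go right. Returns the chosen leaf index.
        idx = 0
        while idx < self.memo_len - 1:              # until a leaf is reached
            left = 2 * idx + 1
            if value <= self.tree[left]:
                idx = left
            else:
                value -= self.tree[left]
                idx = left + 1
        return idx - (self.memo_len - 1)
```

A leaf with priority 3.0 is then sampled three times as often as a leaf with priority 1.0, and `tree[0]` always holds the total priority.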