EfficientZero V2: Mastering Discrete and Continuous Control with Limited Data

Shengjie Wang, Shaohuai Liu, Weirui Ye, Jiacheng You, Yang Gao

Abstract

Sample efficiency remains a crucial challenge in applying Reinforcement Learning (RL) to real-world tasks. While recent algorithms have made significant strides in improving sample efficiency, none have achieved consistently superior performance across diverse domains. In this paper, we introduce EfficientZero V2, a general framework designed for sample-efficient RL algorithms. We have expanded the performance of EfficientZero to multiple domains, encompassing both continuous and discrete actions, as well as visual and low-dimensional inputs. With a series of improvements we propose, EfficientZero V2 outperforms the current state-of-the-art (SOTA) by a significant margin in diverse tasks under the limited-data setting. EfficientZero V2 exhibits a notable advancement over the prevailing general algorithm, DreamerV3, achieving superior outcomes in 50 of 66 evaluated tasks across diverse benchmarks, such as Atari 100k, Proprio Control, and Vision Control.
5.2. Comparison with Baselines

Proprio Control: The results in Table 2 show that our method achieves a mean score of 723.2 across 20 tasks with limited data. While the performance of the current state-of-the-art, TD-MPC2, is comparable to that of EZ-V2, our method achieves faster inference. TD-MPC2's planning with MPPI predicts 9216 latent states to attain similar performance levels. In contrast, EZ-V2's tree-search-based planning uses only 32 imagined latent states, resulting in much lighter computational demands.
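To make the planning-cost gap concrete, the snippet below simply counts how many latent states the learned world model must predict per action selection under the two planning styles. The MPPI population size and horizon are illustrative assumptions chosen only so that their product equals the 9216 states quoted above; they are not TD-MPC2's actual hyperparameters.

```python
# Rough planning-cost comparison: number of world-model predictions needed
# to select a single action.

def mppi_latent_states(num_samples: int, horizon: int) -> int:
    # Sampling-based planning (e.g. MPPI) unrolls every candidate action
    # sequence through the model: one latent state per (sample, step) pair.
    return num_samples * horizon

def tree_search_latent_states(num_simulations: int) -> int:
    # A search tree expands roughly one new latent state per simulation,
    # reusing previously expanded nodes along the way.
    return num_simulations

# Illustrative numbers only (assumed, not taken from TD-MPC2's config):
print(mppi_latent_states(num_samples=1024, horizon=9))    # 9216 predictions
print(tree_search_latent_states(num_simulations=32))      # 32 predictions
```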
Atari 100k: The performance of EZ-V2 on the Atari 100k benchmark is detailed in Table 1. When scores are normalized against those of human players, EZ-V2 attains a mean score of 2.428 and a median score of 1.286, surpassing the previous state-of-the-art methods, BBF (Schwarzer et al., 2023) and EfficientZero (Ye et al., 2021). In contrast to BBF, our method employs fewer network parameters and a lower replay ratio. These gains in performance and computational efficiency are attributed to the learning of the environment model and the implementation of Gumbel search.
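For context, the human-normalized mean and median above follow the standard Atari 100k aggregation: each game's raw score is rescaled so that a random policy maps to 0 and the human reference score maps to 1, and the mean and median are taken over the benchmark's 26 games. The helper below is a minimal sketch of that aggregation; the per-game random and human reference scores come from the benchmark's published tables and are not reproduced here.

```python
import numpy as np

def human_normalized(agent: float, random: float, human: float) -> float:
    # 0 corresponds to random play, 1 to the human reference score.
    return (agent - random) / (human - random)

def aggregate_atari100k(scores):
    # scores: dict mapping game name -> (agent_score, random_score, human_score)
    normed = np.array([human_normalized(a, r, h) for a, r, h in scores.values()])
    return float(normed.mean()), float(np.median(normed))
```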
Vision Control: As shown in Table 2, our method achieves a mean score of 726.1, surpassing the previous state-of-the-art, DreamerV3, by 45%. Notably, it sets new records in 16 out of 20 tasks. Furthermore, our method demonstrates significant improvements in tasks with sparse rewards, as shown in Fig. 7. For instance, in the Cartpole-Swingup-Sparse task, our method scores 763.6 compared to DreamerV3's 392.4. This substantial progress is attributed to two key algorithmic modifications: planning with tree search, which ensures policy improvement and offers strong exploratory capabilities, and the mixed value target, which enhances the accuracy of value learning, especially with stale data.
As a general and sample-efficient RL framework, EZ-V2 consistently demonstrates high sample efficiency in tasks featuring low- and high-dimensional observations, discrete and continuous action spaces, and both dense and sparse reward structures. Detailed training curves can be found in Appendix J.

[Figure: ablation comparing Sampled MCTS (n=50) with S-Gumbel Search (n=8, 16, 32) on Acrobot Swingup and Quadruped Walk, covering Proprio Control and Vision Control; axes: score vs. environment steps.]
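The S-Gumbel Search curves in the figure above differ only in the simulation budget n. Gumbel-style search can act well with such small budgets because it spreads the simulations over the sampled candidate actions with a Sequential Halving schedule, discarding the weaker half of the candidates after each phase. The function below is a simplified, illustrative version of that budget split, not the exact scheduler used in EZ-V2.

```python
import math

def sequential_halving_schedule(num_actions: int, num_simulations: int):
    # Returns one (surviving_candidates, simulations_per_candidate) pair per
    # phase. The budget is divided evenly over ~log2(m) phases and, within a
    # phase, evenly over the surviving candidates; the max(1, .) floor
    # guarantees every candidate is visited at least once.
    phases = max(1, math.ceil(math.log2(num_actions)))
    schedule, candidates = [], num_actions
    for _ in range(phases):
        sims_each = max(1, num_simulations // (phases * candidates))
        schedule.append((candidates, sims_each))
        candidates = max(2, math.ceil(candidates / 2))
    return schedule

# e.g. 16 sampled candidate actions with a budget of 32 simulations:
print(sequential_halving_schedule(num_actions=16, num_simulations=32))
# [(16, 1), (8, 1), (4, 2), (2, 4)]
```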