EfficientZero Remastered Release

Now you can have state-of-the-art reinforcement learning at your fingertips!

Gigglebit Studios is proud to announce the release of EfficientZero Remastered, a machine-learning model capable of playing Atari games at superhuman levels.

EfficientZero Remastered is a revamped version of the state-of-the-art EfficientZero model, the top performer on the Atari 100k benchmark. In the remastered version, we've patched several bugs and added quality-of-life features that make it the easiest, cheapest, and most powerful reinforcement learning agent currently available to the public.

Now you can try this model out for yourself! Anyone can train a model using the training scripts in our GitHub repo. Training a model from scratch only costs about $50 on most cloud providers, which is several orders of magnitude cheaper than similar state-of-the-art models. Or, if you'd rather skip training altogether, check out one of the pre-trained models from our Hugging Face 🤗 repository.

What is EfficientZero?

EfficientZero is a data-efficient reinforcement learning algorithm from Weirui Ye and Shaohuai Liu of Tsinghua University. It is based on DeepMind's highly capable MuZero model but requires substantially less data and compute time than other state-of-the-art algorithms, and it is currently the top-ranking model for data-efficient RL on Atari.

EfficientZero adds three novel features to MuZero:

  1. A self-supervised consistency loss pushes the dynamics model's predicted next latent state toward the learned encoding of the observation that actually followed, making future-state predictions more accurate.
  2. End-to-end value prefix prediction uses an LSTM to predict the cumulative reward over the next several steps (the "value prefix") rather than predicting each step's reward independently.
  3. Model-based off-policy correction recomputes value targets for older trajectories with the current model, shortening the bootstrap horizon for stale data so off-policy experience doesn't mislead training.

The first feature is by far the most important addition because it stabilizes the learned dynamics model; a minimal sketch of it appears below. The other two features appear more tailored to the Atari 100k benchmark and might be less useful in other benchmarks.
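To give a flavor of that first feature, here's a minimal sketch of a SimSiam-style consistency loss in PyTorch. It illustrates the general technique rather than reproducing the actual EfficientZero code, and the tensor names are ours:

```python
import torch.nn.functional as F

def consistency_loss(predicted_next_latent, encoded_next_obs):
    """SimSiam-style consistency loss (illustrative, not the repo's code).

    Pushes the dynamics model's predicted next latent state toward the
    encoder's embedding of the observation that actually followed. The
    target branch is detached so gradients flow only through the
    prediction branch, which helps prevent representation collapse.
    """
    pred = F.normalize(predicted_next_latent, dim=-1)
    target = F.normalize(encoded_next_obs.detach(), dim=-1)
    # Negative cosine similarity, averaged over the batch.
    return -(pred * target).sum(dim=-1).mean()
```

In the full algorithm, this loss is applied at every unrolled step of the dynamics model, alongside the usual reward, value, and policy losses.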

You can learn more about how EfficientZero works from this helpful blog post by 1a3orn.

What's new in EfficientZero Remastered?

We've kept the original algorithm more or less intact, but added several features to improve ease of use. EfficientZero Remastered adds preemption support, replay buffer saving and loading, better exception handling and logging, training scripts, and A100 support. We also provide pre-trained models for others to experiment with.
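Preemption support, for example, means both the network weights and the replay buffer get checkpointed, so a training run killed by the cloud provider can pick up where it left off. Conceptually, the buffer persistence works something like the sketch below; the function and path names here are illustrative, not the repo's actual API:

```python
import pickle
from pathlib import Path

BUFFER_PATH = Path("checkpoints/replay_buffer.pkl")  # hypothetical location

def save_buffer(buffer):
    """Serialize the replay buffer so a preempted run can resume later."""
    BUFFER_PATH.parent.mkdir(parents=True, exist_ok=True)
    with BUFFER_PATH.open("wb") as f:
        pickle.dump(buffer, f)

def load_buffer():
    """Restore a saved replay buffer, or return None to start fresh."""
    if BUFFER_PATH.exists():
        with BUFFER_PATH.open("rb") as f:
            return pickle.load(f)
    return None
```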

Training

To train a new model from scratch, clone the source code from our GitHub repo and run the train.sh script. You may have to adjust some of the parameters if you're running on something other than an A100 GPU. Training is best done on a single GPU node with about 20 CPU cores and takes one to two days.

Pre-trained Models

We also provide 7 pre-trained models for users to play with: Breakout, Space Invaders, Seaquest, Pong, Qbert, Boxing, and Alien. We recommend starting with Breakout, the most capable model, which attains a score 10x the human baseline.

You can view the test code for an example of how to call the model from within Python.
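As a rough illustration, evaluating a pre-trained agent looks something like the snippet below. The real entry points live in the repo's test code; `load_pretrained` doesn't exist and `agent.act` is a hypothetical stand-in for the repo's MCTS-backed action selection:

```python
import gym  # the Atari environments used by EfficientZero

def evaluate(agent, env_name="BreakoutNoFrameskip-v4", episodes=5):
    """Run the agent for a few episodes and report the mean score.

    `agent.act` is a hypothetical method standing in for the repo's
    actual action-selection call; see the test code for the real API.
    """
    env = gym.make(env_name)
    scores = []
    for _ in range(episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            action = agent.act(obs)
            obs, reward, done, _ = env.step(action)
            total += reward
        scores.append(total)
    env.close()
    return sum(scores) / len(scores)
```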

Findings

Our primary goal in this project was to test EfficientZero and see its capabilities for ourselves. We were amazed by the model overall, especially on Breakout, where it far outperformed the human baseline. The overall cost was only about $50 per fully trained model, compared to the hundreds of thousands of dollars needed to train MuZero.

Though the trained models achieved impressive scores in Atari, they didn't reach the stellar scores demonstrated in the paper. This could be because we used different hardware and dependencies or because ML research papers tend to cherry-pick models and environments to showcase good results.

Additionally, the models tended to hit a performance wall between 75k and 100k steps. While we don't have enough data to know why or how often this happens, it's not surprising: the model was tuned specifically for data efficiency, so it hasn't been tested at larger scales. A model like MuZero might be more appropriate if you have a large budget.

Training times were also longer than those reported in the EfficientZero paper. The paper states that a model can be trained to completion in 7 hours, while in practice we've found that it takes an A100 with 32 CPU cores one to two days. This is likely because the training process is more CPU-bound than most models and therefore performs poorly on the low-frequency, many-core CPUs found in GPU clusters.

Support

If you run into issues or have ideas for improvement while using this model, don't hesitate to file an issue on our GitHub repo. If you'd like to send us feedback, you can email us at feedback@gigglebit.net.

Acknowledgements

This work was made possible thanks to an open-source AI research grant from Stability AI. We couldn't have done this project without their support! 🙌

Marjie Volk, Derek Simi, and Dan Colish provided technical advice and reviewed various written materials. Thank you for your time and patience! 😊