Zephyr: Direct Distillation of LM Alignment L Tunstall, E Beeching, N Lambert, N Rajani, K Rasul, Y Belkada, ... arXiv preprint arXiv:2310.16944, 2023 | 454 | 2023 |
A Closer Look at Invalid Action Masking in Policy Gradient Algorithms S Huang, S Ontañón Proceedings of the 35th International FLAIRS Conference, 2022 | 400 | 2022 |
CleanRL: High-Quality Single-File Implementations of Deep Reinforcement Learning Algorithms S Huang, RFJ Dossa, C Ye, J Braga, D Chakraborty, K Mehta, ... Journal of Machine Learning Research 23 (274), 1-18, 2022 | 296 | 2022 |
TRL: Transformer Reinforcement Learning L von Werra, Y Belkada, L Tunstall, E Beeching, T Thrush, N Lambert, ... GitHub. Available online at: https://github.com/lvwerra/trl, 2020 | 194 | 2020 |
The 37 Implementation Details of Proximal Policy Optimization S Huang, RFJ Dossa, A Raffin, A Kanervisto, W Wang International Conference on Learning Representations Blog Track, 2022 | 118 | 2022 |
EnvPool: A Highly Parallel Reinforcement Learning Environment Execution Engine J Weng, M Lin, S Huang, B Liu, D Makoviichuk, V Makoviychuk, Z Liu, ... Advances in Neural Information Processing Systems 35, 22409-22421, 2022 | 54 | 2022 |
The Alignment Handbook L Tunstall, E Beeching, N Lambert, N Rajani, S Huang, K Rasul, AM Rush, ... | 48 | 2023 |
Gym-μRTS: Toward Affordable Full Game Real-Time Strategy Games Research with Deep Reinforcement Learning S Huang, S Ontañón, C Bamford, L Grela Proceedings of the 3rd IEEE Conference on Games, 2021 | 45 | 2021 |
A2C is a special case of PPO S Huang, A Kanervisto, A Raffin, W Wang, S Ontañón, RFJ Dossa arXiv preprint arXiv:2205.09123, 2022 | 26 | 2022 |
An Empirical Investigation of Early Stopping Optimizations in Proximal Policy Optimization RFJ Dossa, S Huang, S Ontañón, T Matsubara IEEE Access 9, 117981-117992, 2021 | 16 | 2021 |
The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization S Huang, M Noukhovitch, A Hosseini, K Rasul, W Wang, L Tunstall arXiv preprint arXiv:2403.17031, 2024 | 14 | 2024 |
Action Guidance: Getting the Best of Sparse Rewards and Shaped Rewards for Real-Time Strategy Games S Huang, S Ontañón AIIDE-20 Workshop on Artificial Intelligence for Strategy Games, 2020 | 13 | 2020 |
Open RL Benchmark: Comprehensive Tracked Experiments for Reinforcement Learning S Huang, Q Gallouédec, F Felten, A Raffin, RFJ Dossa, Y Zhao, ... arXiv preprint arXiv:2402.03046, 2024 | 7 | 2024 |
Comparing Observation and Action Representations for Deep Reinforcement Learning in μRTS S Huang, S Ontañón AIIDE-19 Workshop on Artificial Intelligence for Strategy Games, 2019 | 7* | 2019 |
NuminaMath: The Largest Public Dataset in AI4Maths with 860k Pairs of Competition Math Problems and Solutions J Li, E Beeching, L Tunstall, B Lipkin, R Soletskyi, S Huang, K Rasul, L Yu, ... Hugging Face repository, 2024 | 6 | 2024 |
MEDCOD: A Medically-Accurate, Emotive, Diverse, and Controllable Dialog System R Compton, I Valmianski, L Deng, C Huang, N Katariya, X Amatriain, ... Machine Learning for Health, 110-129, 2021 | 5 | 2021 |
Reward Scale Robustness for Proximal Policy Optimization via DreamerV3 Tricks R Sullivan, A Kumar, S Huang, J Dickerson, J Suarez Advances in Neural Information Processing Systems 36, 2024 | 3 | 2024 |
The N Implementation Details of RLHF with PPO S Huang, T Liu, L von Werra The Third Blogpost Track at ICLR 2024, 2024 | 3 | 2024 |
Tülu 3: Pushing Frontiers in Open Language Model Post-Training N Lambert, J Morrison, V Pyatkin, S Huang, H Ivison, F Brahman, ... arXiv preprint arXiv:2411.15124, 2024 | 1 | 2024 |