GARDO: Reinforcing Diffusion Models without Reward Hacking • Paper • 2512.24138 • Published 8 days ago • 25
Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies • Paper • 2512.19673 • Published 16 days ago • 60
On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models • Paper • 2512.07783 • Published 30 days ago • 36