CPGD: Toward Stable Rule-based Reinforcement Learning for Language Models
Paper: arXiv:2505.12504
We propose a novel RL algorithm, Clipped Policy Gradient Optimization with Policy Drift (CPGD), which combines a policy gradient loss with a clipping mechanism and a policy drift regularizer. In our experiments, CPGD trains more stably and achieves better performance than GRPO.
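To make the description above more concrete, here is a minimal PyTorch sketch of a clipped policy-gradient loss with a policy-drift penalty, based only on the summary in this section. The function name `cpgd_style_loss`, the choice of a detached clipped importance weight, the k3-style KL estimator for the drift term, and the default coefficients are illustrative assumptions, not the paper's exact objective; see arXiv:2505.12504 for the precise CPGD formulation.

```python
import torch

def cpgd_style_loss(logp, logp_old, advantages, clip_eps=0.2, drift_coef=0.1):
    """Sketch of a CPGD-style loss (hypothetical signature).

    logp, logp_old: per-token log-probs of the sampled tokens under the current
    policy and the rollout (behavior) policy; advantages: per-token advantages.
    """
    log_ratio = logp - logp_old
    ratio = torch.exp(log_ratio)

    # Clipped policy-gradient term: the importance weight is clipped and detached,
    # so gradients flow only through log pi_theta, as in a vanilla PG loss.
    weight = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps).detach()
    pg_loss = -(weight * advantages * logp).mean()

    # Policy drift: a proximal penalty keeping pi_theta close to the rollout policy,
    # estimated here with the non-negative k3 KL estimator on the sampled tokens.
    # The exact drift term used in the paper may differ.
    drift = (torch.exp(log_ratio) - 1.0 - log_ratio).mean()

    return pg_loss + drift_coef * drift

# Toy usage with per-token log-probs for 8 sampled tokens.
logp_old = -torch.rand(8) - 1.0
logp = (logp_old + 0.05 * torch.randn(8)).requires_grad_()
advantages = torch.randn(8)
loss = cpgd_style_loss(logp, logp_old, advantages)
loss.backward()
```

Note that the drift term is taken with respect to the rollout policy, which is distinct from the KL term against a reference model that is omitted in the training recipe described below.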
Following the key factors identified by https://github.com/ModalMinds/MM-EUREKA for achieving stable training, we enhanced the model, dataset, and algorithmic modules. Specifically, we retained the strategies of omitting the KL divergence term and applying data filtering (illustrated in the sketch after the results table), while making further critical modifications to the training algorithm. The table below compares the resulting models with other open and proprietary models on multimodal math reasoning benchmarks.
| Model | MathVista | MathVerse | MathVision | OlympiadBench | WeMath | MMK12 |
|---|---|---|---|---|---|---|
| Claude3.7-Sonnet | 66.8 | 52.0 | 41.3 | 48.9 | 72.6 | 55.3 |
| GPT-4o | 63.8 | 50.2 | 30.4 | 35.0 | 68.8 | 49.9 |
| o1 | 73.9 | 57.0 | 60.3 | 68.0 | 98.7 | 73.9 |
| Gemini2-flash | 70.4 | 59.3 | 41.3 | 51.0 | 71.4 | 65.2 |
| Qwen-2.5-VL-7B | 68.2 | 47.9 | 25.4 | 20.2 | 62.1 | 53.6 |
| Qwen-2.5-VL-32B | 74.7/71.7 | 49.9 | 40.1 | 30.0 | 69.1 | 66.8 |
| Qwen-2.5-VL-72B | 74.8 | 57.6 | 38.1 | 40.4 | 72.4 | 70.5 |
| InternVL2.5-VL-78B | 72.3 | 51.7 | 32.2 | 31.1 | 66.3 | 61.6 |
| QVQ-72B-Preview | 71.4 | 48.2 | 35.9 | 33.2 | 65.4 | 61.5 |
| Adora-7B | 73.5 | 50.1 | 23.0 | 20.1 | 64.2 | 58.1 |
| R1-Onevision-7B | 64.1 | 47.1 | 29.9/23.5 | 17.3 | 61.8 | 39.8 |
| MM-Eureka-Qwen-7B | 73.0 | 50.3 | 26.9 | 20.1 | 66.1 | 64.5 |
| MM-Eureka-Qwen-32B | 74.8 | 56.5 | 34.4 | 35.9 | 73.4 | 72.2 |
| MM-Eureka-CPGD-Qwen-7B | 74.0 | 50.6 | 28.3 | 21.4 | 68.3 | 65.3 |
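The data filtering mentioned above is not spelled out on this page. A common choice in rule-based RL pipelines, shown here purely as an illustrative assumption, is to drop prompts whose sampled rollouts all receive the same rule-based reward, since such groups yield zero advantage under group normalization and carry no learning signal. The helper `filter_prompts` and its signature are hypothetical.

```python
from typing import List

def filter_prompts(prompt_rewards: List[List[float]]) -> List[int]:
    """prompt_rewards[i] holds the rule-based rewards (e.g. 0/1 accuracy) of all
    rollouts sampled for prompt i; returns indices of prompts kept for training."""
    keep = []
    for i, rewards in enumerate(prompt_rewards):
        # Keep only prompts with mixed outcomes; all-correct or all-wrong groups
        # produce zero group-normalized advantage and are discarded.
        if len(set(rewards)) > 1:
            keep.append(i)
    return keep

# Example: prompt 0 is all-correct (dropped), prompt 1 has mixed rewards (kept).
print(filter_prompts([[1.0, 1.0, 1.0], [1.0, 0.0, 1.0]]))  # -> [1]
```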