References
- https://yugeten.github.io/posts/2025/01/ppogrpo/
- https://huggingface.co/blog/NormalUhr/grpo
- https://medium.com/@sulbha.jindal/proximal-policy-optimization-ppo-vs-group-relative-policy-optimization-grpo-988fa7af0241
- https://arxiv.org/abs/2402.03300