DAPO
The authors introduced:
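- Clip-Higher: a larger upper clip bound, decoupled from the lower bound, for more exploration
- Dynamic sampling: resample when every response in a group gets the same reward, since all advantages are then 0 and the group gives no gradient
- Token-level loss normalization: average the loss over all tokens in the batch to fix the bias based on response length
- Overlong filtering: mask truncated responses out of the loss
- Soft overlong punishment: a penalty that grows gradually as a response approaches the maximum length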
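
The list above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: all function names are hypothetical, the clip defaults (0.2 lower, 0.28 upper) are illustrative values, and the advantage is assumed to be the group-normalized reward. Overlong filtering is not shown; it would amount to dropping truncated responses before the loss is computed.

```python
import math


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps). Assumed form."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


def keep_group(rewards):
    """Dynamic sampling: drop groups whose rewards are all identical,
    because every advantage (and so the gradient) would be zero."""
    return len(set(rewards)) > 1


def clipped_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Clipped surrogate for one token, with a separate, larger upper bound."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


def token_level_loss(per_token_objectives):
    """Average over all tokens in the batch, so long responses are not
    down-weighted relative to short ones."""
    total = sum(sum(sample) for sample in per_token_objectives)
    count = sum(len(sample) for sample in per_token_objectives)
    return -total / count


def sample_level_loss(per_token_objectives):
    """For comparison: mean per response first, then mean over responses."""
    means = [sum(s) / len(s) for s in per_token_objectives]
    return -sum(means) / len(means)


def soft_overlong_penalty(length, max_len, buffer_len):
    """Reward penalty that ramps linearly from 0 to -1 inside the buffer
    just below max_len, and stays at -1 beyond max_len."""
    if length <= max_len - buffer_len:
        return 0.0
    if length <= max_len:
        return ((max_len - buffer_len) - length) / buffer_len
    return -1.0
```

For example, `keep_group([1, 1, 1, 1])` is `False`, so that prompt would be resampled, and `soft_overlong_penalty(175, 200, 50)` sits halfway through the buffer. Comparing `token_level_loss` with `sample_level_loss` on one short and one long response shows how the sample-level average gives each token of the long response less weight.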