DAPO

DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) builds on GRPO. The authors introduced:

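Clip-Higher: a larger upper clip bound, decoupled from the lower bound, to allow more exploration
Dynamic sampling: resample when a group has zero advantage (every response in the group gets the same reward)
Token-level loss normalization to fix the bias caused by response length
Overlong filtering: mask the loss of truncated (overlong) responses
Soft overlong punishment: a graded length penalty as a response approaches the maximum length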
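
The points above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the list-based data layout, and the default values (eps_low, eps_high, max_len, cache_len in the usage) are assumptions chosen for readability.

```python
from typing import List, Sequence


def group_advantages(rewards: Sequence[float]) -> List[float]:
    """Group-normalized advantage: (r - mean) / std. All zeros if rewards are identical."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0:
        return [0.0] * n
    return [(r - mean) / std for r in rewards]


def keep_group(rewards: Sequence[float]) -> bool:
    """Dynamic sampling filter: drop groups where every response got the same reward."""
    return len(set(rewards)) > 1


def clipped_token_objective(ratio: float, advantage: float,
                            eps_low: float = 0.2, eps_high: float = 0.28) -> float:
    """PPO-style clipped objective with separate lower and upper bounds (assumed defaults)."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


def token_level_loss(ratios: Sequence[Sequence[float]],
                     advantages: Sequence[float]) -> float:
    """Average over all tokens in the batch, not per response, so long
    responses are not down-weighted relative to short ones."""
    total, count = 0.0, 0
    for response_ratios, adv in zip(ratios, advantages):
        for r in response_ratios:
            total += clipped_token_objective(r, adv)
            count += 1
    return -total / count


def soft_overlong_penalty(length: int, max_len: int, cache_len: int) -> float:
    """0 below (max_len - cache_len), linear down to -1 at max_len, -1 beyond."""
    if length <= max_len - cache_len:
        return 0.0
    if length <= max_len:
        return ((max_len - cache_len) - length) / cache_len
    return -1.0
```

For example, keep_group([1, 1, 1]) is False, so that prompt would be resampled, and soft_overlong_penalty(175, 200, 50) gives -0.5, halfway through the penalty zone.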

References

Related Notes