Mixed Precision

Mixed precision is the best of both worlds, keeping some layers like (layer norm or softmax) to the full precision and some layers to half precision

Roughly on average it is calculated as 3 bytes per parameter, so for example, for a 7b model, it takes around 21GB for the model parameters.


References


Related Notes