DistilBERT
DistilBERT is a distilled version of BERT. It was created so that training and inference can run on low-compute consumer GPUs.
Changes:
- 66M parameters
- 60% faster at inference while retaining ~97% of BERT's language understanding capability
- Layers: 12 --> 6 (see the sketch after this list)
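A quick way to confirm the size reduction is to load the public checkpoint and count parameters; a minimal sketch, assuming the Hugging Face transformers library and the distilbert-base-uncased checkpoint are available:

```python
from transformers import AutoModel

# Assumes the "distilbert-base-uncased" checkpoint can be downloaded.
model = AutoModel.from_pretrained("distilbert-base-uncased")

n_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {n_params / 1e6:.0f}M")   # roughly 66M
print(f"layers: {model.config.n_layers}")     # 6, vs. 12 in BERT-base
```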
Training Changes:
- Removed Next Sentence Prediction
- Removed the pooler from the BERT model
- Uses 3 different losses, combined into one training objective (see the sketch after this list)
  - Distillation loss - KL divergence between the teacher's and student's temperature-softened output distributions
  - Hidden state matching - cosine embedding loss aligning the student's hidden states with the teacher's
  - MLM loss - standard Masked Language Modeling loss
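A minimal sketch of how these three losses might be combined, assuming PyTorch and that the teacher and student forward passes have already produced per-token logits and hidden states; the function name, input shapes, and equal weighting are illustrative assumptions, not the paper's exact recipe:

```python
import torch.nn.functional as F

def distilbert_style_loss(student_logits, teacher_logits,
                          student_hidden, teacher_hidden,
                          mlm_labels, temperature=2.0):
    """Sketch of the three-part DistilBERT-style training loss.

    Assumed shapes: logits are (num_tokens, vocab_size), hidden states are
    (num_tokens, hidden_dim), mlm_labels is (num_tokens,) with -100 at
    non-masked positions.
    """
    # 1) Distillation loss: KL divergence between temperature-softened
    #    teacher and student output distributions.
    loss_distill = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # 2) Hidden-state matching: cosine embedding loss pulling each student
    #    hidden state toward the corresponding teacher hidden state
    #    (target = 1 means "make them similar").
    target = student_hidden.new_ones(student_hidden.size(0))
    loss_cos = F.cosine_embedding_loss(student_hidden, teacher_hidden, target)

    # 3) MLM loss: standard cross-entropy, computed only on masked positions
    #    because non-masked labels are set to -100.
    loss_mlm = F.cross_entropy(student_logits, mlm_labels, ignore_index=-100)

    # Equal weighting here is illustrative; the real coefficients are
    # training hyperparameters.
    return loss_distill + loss_cos + loss_mlm
```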