BERT
BERT is an encoder-only Transformer pre-trained with two self-supervised objectives: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).
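
As a quick illustration of the MLM objective, the sketch below masks one token and asks a pre-trained BERT to fill it in from bidirectional context. It assumes the Hugging Face `transformers` and `torch` packages and the `bert-base-uncased` checkpoint, none of which are named in these notes.

```python
# Minimal MLM sketch using Hugging Face transformers
# (assumes `pip install transformers torch`).
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

# Mask one token; BERT predicts it from context on both sides.
inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Find the [MASK] position and take the highest-scoring vocabulary entry.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
predicted_id = logits[0, mask_pos].argmax(-1).item()
print(tokenizer.decode([predicted_id]))  # expected: "paris"
```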
BERT-base -- 110M parameters
- 12 layers
- hidden size 768
- 12 attention heads
- feed-forward (intermediate) size 3072
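
The 110M figure can be sanity-checked with back-of-the-envelope arithmetic from the numbers above. The vocabulary size (30522) and maximum position count (512) used below are the published `bert-base-uncased` values, not stated in these notes.

```python
# Rough parameter count for BERT-base (pure Python, no dependencies).
V, P, H, L, F = 30522, 512, 768, 12, 3072

embeddings = (V + P + 2) * H + 2 * H        # token + position + segment + LayerNorm
attention  = 4 * (H * H + H)                # Q, K, V, output projections (with biases)
ffn        = H * F + F + F * H + H          # two dense layers (with biases)
layer      = attention + ffn + 2 * (2 * H)  # plus two LayerNorms per layer
pooler     = H * H + H

total = embeddings + L * layer + pooler
print(f"{total / 1e6:.1f}M")  # ~109.5M, i.e. the quoted 110M
```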
BERT-large -- 340M parameters
- 24 layers
- hidden size 1024
- 16 attention heads
- feed-forward (intermediate) size 4096
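
Both configurations can be confirmed against the published checkpoints; a minimal sketch, assuming the Hugging Face `transformers` package and the `-uncased` variants:

```python
# Compare the two published BERT configs (assumes `pip install transformers`).
from transformers import BertConfig

for name in ("bert-base-uncased", "bert-large-uncased"):
    cfg = BertConfig.from_pretrained(name)
    print(name, cfg.num_hidden_layers, cfg.hidden_size,
          cfg.num_attention_heads, cfg.intermediate_size)
# bert-base-uncased  12  768 12 3072
# bert-large-uncased 24 1024 16 4096
```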