Encoder-Only Transformers

Encoder-only models use only the encoder part of the transformer. Pre-training typically consists of masking out some words in a sentence and training the model to predict the masked words (masked language modeling).
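The masking objective can be sketched in a few lines. This is a simplified illustration, not any specific tokenizer or model: the `mask_tokens` helper, the word-level tokens, and the 15% masking rate (a common choice, e.g. in BERT) are all assumptions for the demo.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=1):
    """Replace roughly mask_prob of the tokens with [MASK]; return the
    corrupted input plus the labels the model must predict."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            corrupted.append(MASK)
            labels.append(tok)    # loss is computed at masked positions
        else:
            corrupted.append(tok)
            labels.append(None)   # no loss at unmasked positions
    return corrupted, labels

tokens = "the cat sat on the mat".split()
corrupted, labels = mask_tokens(tokens)
```

Real models mask subword tokens rather than whole words, and BERT additionally replaces some selected tokens with random tokens or leaves them unchanged; the sketch keeps only the core idea.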

These models use self-attention layers in which every token can attend to all tokens in the input, which is why they are sometimes called bi-directional encoders.
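A minimal single-head self-attention sketch makes the bi-directionality concrete: there is no causal mask, so the attention matrix is dense and every position can attend to every other position. The toy dimensions and random weights below are assumptions for illustration only.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """One unmasked (bi-directional) self-attention head."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # (seq, seq); no causal mask
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)      # softmax over ALL tokens
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d = 4, 8
X = rng.normal(size=(seq_len, d))                  # toy token embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out, attn = self_attention(X, Wq, Wk, Wv)
```

In a decoder, `scores` would be masked so position `i` cannot see positions `> i`; the absence of that mask is exactly what distinguishes the encoder.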

Encoder-only models are good at understanding sentences and their semantics, so they suit tasks like sentence classification, sentence embedding, and so on.

The output of an encoder-only model is usually the final hidden state for each input token.
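Downstream heads are built on top of these per-token hidden states. The sketch below uses made-up random states and untrained weights purely to show the shapes; mean-pooling is one common choice for sentence embeddings (BERT-style models often use the first `[CLS]` token's vector instead).

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical final hidden states for a 5-token input, hidden size 8
# (real encoders like BERT produce (seq_len, 768) or larger).
hidden_states = rng.normal(size=(5, 8))            # one vector per token

# Sentence-level task: pool token states into a single sentence embedding.
sentence_embedding = hidden_states.mean(axis=0)    # shape (8,)

# Token-level task (e.g. tagging): apply a linear head to every token.
num_labels = 3
W = rng.normal(size=(8, num_labels))               # illustrative, not trained
b = np.zeros(num_labels)
token_logits = hidden_states @ W + b               # shape (5, 3)
```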

Some examples of encoder-only transformers:

  1. BERT
  2. ALBERT
  3. DistilBERT
  4. RoBERTa
  5. ELECTRA

References


Related Notes