Decoder-Only Transformer

A decoder-only transformer uses only the decoder half of the original transformer architecture, so each position can attend only to the words that come before it in the sequence.

The decoder-only transformer uses masked self-attention, which can attend only to tokens that have already been generated. This masking keeps the decoder from seeing future tokens before it predicts the next one.
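
Below is a minimal sketch of this causal masking in PyTorch; the tensor shapes and the `causal_self_attention` helper are illustrative, not taken from any particular implementation.

```python
import torch
import torch.nn.functional as F

def causal_self_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_model); a single-head sketch for clarity
    seq_len, d_model = q.size(1), q.size(2)
    scores = q @ k.transpose(-2, -1) / d_model ** 0.5  # (batch, seq, seq)
    # Upper-triangular mask: position i may only attend to positions <= i,
    # so future positions are set to -inf before the softmax.
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights @ v

x = torch.randn(1, 4, 8)           # batch of 1, 4 tokens, model dim 8
out = causal_self_attention(x, x, x)
print(out.shape)                   # torch.Size([1, 4, 8])
```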

The decoder is usually trained with a multi-class cross-entropy loss at each token position, i.e. next-token prediction.
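
A short sketch of that objective, assuming the model emits per-position logits over the vocabulary (the shapes and names here are illustrative):

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len, batch = 100, 8, 2
tokens = torch.randint(0, vocab_size, (batch, seq_len))
logits = torch.randn(batch, seq_len, vocab_size)  # stand-in for model output

# Predict token t+1 from positions <= t: shift logits and targets by one.
inputs, targets = logits[:, :-1], tokens[:, 1:]

# cross_entropy expects (N, C) logits and (N,) class indices, so flatten
# the batch and sequence dimensions; the loss averages over all positions.
loss = F.cross_entropy(inputs.reshape(-1, vocab_size), targets.reshape(-1))
print(loss.item())
```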

These models are best suited for text generation.
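
As a rough illustration, greedy autoregressive decoding repeatedly feeds the sequence back into the model and appends the highest-probability next token. The `model` below is a hypothetical stand-in for a trained decoder, used only to keep the sketch self-contained.

```python
import torch

vocab_size = 100

def model(tokens):  # hypothetical stand-in for a trained decoder-only model
    return torch.randn(tokens.size(0), tokens.size(1), vocab_size)

@torch.no_grad()
def generate(prompt, max_new_tokens=5):
    tokens = prompt
    for _ in range(max_new_tokens):
        logits = model(tokens)                 # (batch, seq, vocab)
        next_token = logits[:, -1].argmax(-1)  # greedy pick at last position
        tokens = torch.cat([tokens, next_token.unsqueeze(1)], dim=1)
    return tokens

prompt = torch.randint(0, vocab_size, (1, 3))
print(generate(prompt))  # prompt plus 5 generated token ids
```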

Some examples of decoder-only models:

  1. GPT-2
  2. GPT-3
  3. LLaMA
