Masked Self-Attention

The masked self-attention module is used in the Decoder-Only Transformer. Unlike standard self-attention, this module can't see the future: it generates the next token by attending only to the previously generated tokens.

Why?

This is done so that the decoder, which has to generate the t-th token, can't already see the t-th or future tokens. If it could, it wouldn't learn anything; it would just copy out the token it has just seen.

How?

This is done by setting the attention scores of future tokens to a very large negative value (like -inf) before the softmax, which makes the softmax output 0 for those positions, giving them an attention weight of 0. The current token itself is not masked: position t can attend to positions 1 through t.
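A minimal sketch of this masking step in numpy (function name and shapes are illustrative, not from any particular library): future positions in the score matrix are set to -inf, so after the softmax their attention weights are exactly 0.

```python
import numpy as np

def masked_self_attention(Q, K, V):
    """Scaled dot-product attention with a causal (look-ahead) mask.

    Q, K, V: arrays of shape (seq_len, d_k).
    Returns (output, attention_weights).
    """
    seq_len, d_k = Q.shape
    # Raw attention scores: (seq_len, seq_len)
    scores = Q @ K.T / np.sqrt(d_k)
    # Causal mask: True above the diagonal, i.e. for future positions.
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores[mask] = -np.inf  # future tokens get -inf before the softmax
    # Row-wise softmax; exp(-inf) = 0, so masked positions get weight 0.
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Usage: 4 tokens, 8-dimensional embeddings (Q = K = V = x here)
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
out, w = masked_self_attention(x, x, x)
# The upper triangle of w is all zeros: no token attends to the future.
```

Note that the first row of `w` is always `[1, 0, 0, ...]`: the first token can only attend to itself.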


References


Related Notes