Multi-Head Attention

Multi-head attention applies either Self-Attention or Masked Self-Attention, depending on whether it sits in the encoder or the decoder.
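
A minimal sketch of that difference, assuming PyTorch (the note does not specify a framework): the masked (decoder) variant only adds a causal mask to the raw attention scores before the softmax, so a position cannot attend to future tokens.

```python
import torch

seq_len = 5
# Lower-triangular mask: position i may attend to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

scores = torch.randn(seq_len, seq_len)                      # dummy QK^T scores
masked_scores = scores.masked_fill(~causal_mask, float("-inf"))
weights = torch.softmax(masked_scores, dim=-1)              # future positions get weight 0
```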

The main reason for having multiple heads rather than a single one is that each head can attend to a different relationship. For example, in "the dog sat on the bench", one head can capture the relationship between "dog" and "sat", while another captures the relation between "sat" and "bench".

This helps to increase the capacity and generalization ability of the model.

$$\text{Attention}_i = \text{softmax}\!\left(\frac{Q_i K_i^\top}{\sqrt{d_k}}\right) V_i, \qquad d_k = d_{\text{model}} / h$$

$$\text{MultiHeadAttention} = \text{concat}(\text{Attention}_1, \dots, \text{Attention}_h)\, W$$
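
A minimal sketch of these equations, again assuming PyTorch (the class and parameter names are illustrative, not from the note). The per-head projections and the final $W$ are implemented as linear layers, and all heads are computed in one batched tensor operation.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, h: int):
        super().__init__()
        assert d_model % h == 0
        self.h = h
        self.d_k = d_model // h                      # d_k = d_model / h
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)       # the final W

    def forward(self, x, mask=None):
        batch, seq_len, _ = x.shape
        # Project and split into h heads: (batch, h, seq_len, d_k)
        q = self.w_q(x).view(batch, seq_len, self.h, self.d_k).transpose(1, 2)
        k = self.w_k(x).view(batch, seq_len, self.h, self.d_k).transpose(1, 2)
        v = self.w_v(x).view(batch, seq_len, self.h, self.d_k).transpose(1, 2)

        # Attention_i = softmax(Q_i K_i^T / sqrt(d_k)) V_i, for all heads at once
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5
        if mask is not None:                         # masked self-attention (decoder)
            scores = scores.masked_fill(~mask, float("-inf"))
        heads = torch.softmax(scores, dim=-1) @ v

        # concat(Attention_1, ..., Attention_h) W
        concat = heads.transpose(1, 2).reshape(batch, seq_len, self.h * self.d_k)
        return self.w_o(concat)

# Usage: d_model = 512 split across h = 8 heads, so d_k = 64.
mha = MultiHeadAttention(d_model=512, h=8)
out = mha(torch.randn(2, 10, 512))                   # -> (2, 10, 512)
```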

References


Related Notes