Additive Attention

In additive attention, unlike dot-product attention, the compatibility between a query and a key is computed by a learned feed-forward layer rather than a simple dot product.

score(q, k) = vᵀ tanh(Wq q + Wk k)

Rather than taking a dot product, it learns two weight matrices, Wq and Wk, that project the query and key into a shared hidden space, plus a vector v that reduces the tanh-activated sum to a scalar score. The attention weights are then a softmax over these scores.

Additive attention is slower than dot-product attention: in dot-product attention everything can be vectorized as matrix multiplications, which run very fast on modern GPU architectures.
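The mechanism above can be sketched in NumPy for a single query. This is a minimal illustration, not a production implementation; the shapes and the random toy inputs are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def additive_attention(q, K, V, W_q, W_k, v):
    """Additive (Bahdanau-style) attention for a single query.

    q:   (d_q,)      query vector
    K:   (n, d_k)    key vectors
    V:   (n, d_v)    value vectors
    W_q: (d_h, d_q)  learned query projection
    W_k: (d_h, d_k)  learned key projection
    v:   (d_h,)      learned scoring vector
    """
    # Project query and keys into a shared hidden space, combine, apply tanh.
    hidden = np.tanh(W_q @ q + (W_k @ K.T).T)   # (n, d_h)
    # Reduce each hidden vector to a scalar compatibility score.
    scores = hidden @ v                          # (n,)
    # Softmax over scores gives the attention distribution over keys.
    weights = softmax(scores)
    # Output is the weighted sum of the values.
    return weights @ V                           # (d_v,)

# Toy example with assumed dimensions.
rng = np.random.default_rng(0)
d_q = d_k = d_v, d_h, n = 4, 8, 5
q = rng.normal(size=4)
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 4))
W_q = rng.normal(size=(8, 4))
W_k = rng.normal(size=(8, 4))
v = rng.normal(size=8)
out = additive_attention(q, K, V, W_q, W_k, v)
print(out.shape)  # (4,)
```

Note that the learned layer must be evaluated once per query-key pair, which is why this form cannot be collapsed into a single matrix multiplication the way dot-product attention can.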


References


Related Notes