Additive Attention

In additive attention, unlike dot-product attention, the compatibility between a query and a key is computed by a learned feed-forward layer rather than a simple dot product.

score(q, k) = vᵀ tanh(Wq q + Wk k)

Rather than taking a dot product, it learns two weight matrices, Wq and Wk, that project the query and key into a shared hidden space, plus a vector v that reduces the tanh-activated sum to a scalar score. The attention weights are then a softmax over these scores.

Additive attention is slower than dot-product attention: in dot-product attention everything can be vectorized as matrix multiplications, which run very fast on modern GPU architectures.
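The mechanism above can be sketched in NumPy for a single query. This is a minimal illustration, not a production implementation; the shapes and the random toy inputs are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def additive_attention(q, K, V, W_q, W_k, v):
    """Additive (Bahdanau-style) attention for a single query.

    q:   (d_q,)      query vector
    K:   (n, d_k)    key vectors
    V:   (n, d_v)    value vectors
    W_q: (d_h, d_q)  learned query projection
    W_k: (d_h, d_k)  learned key projection
    v:   (d_h,)      learned scoring vector
    """
    # Project query and keys into a shared hidden space, combine, apply tanh.
    hidden = np.tanh(W_q @ q + (W_k @ K.T).T)   # (n, d_h)
    # Reduce each hidden vector to a scalar compatibility score.
    scores = hidden @ v                          # (n,)
    # Softmax over scores gives the attention distribution over keys.
    weights = softmax(scores)
    # Output is the weighted sum of the values.
    return weights @ V                           # (d_v,)

# Toy example with assumed dimensions.
rng = np.random.default_rng(0)
d_q = d_k = d_v, d_h, n = 4, 8, 5
q = rng.normal(size=4)
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 4))
W_q = rng.normal(size=(8, 4))
W_k = rng.normal(size=(8, 4))
v = rng.normal(size=8)
out = additive_attention(q, K, V, W_q, W_k, v)
print(out.shape)  # (4,)
```

Note that the learned layer must be evaluated once per query-key pair, which is why this form cannot be collapsed into a single matrix multiplication the way dot-product attention can.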


References


Related Notes