Additive Attention
In additive attention, unlike dot-product attention, the compatibility between the query and key is computed by a small learned feed-forward layer. Rather than taking a dot product, it learns two weight matrices that project the query and key into a common space, sums the projections, and scores the result with a learned vector: score(q, k) = vᵀ tanh(W_q q + W_k k).
Additive attention is slower than dot-product attention in practice: dot-product attention can be fully vectorized and computed as a single matrix multiplication, which is much faster on modern GPU architectures.
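The scoring function above can be sketched in NumPy. This is a minimal illustration, not a training-ready implementation: the weight matrices `W_q`, `W_k` and the vector `v` would normally be learned, but here they are randomly initialized, and the dimensions `d_q`, `d_k`, `d_a` are arbitrary choices for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d_q, d_k, d_a, n_keys, d_v = 4, 4, 8, 5, 3  # example sizes (assumed)

# Learned parameters (random stand-ins here)
W_q = rng.normal(size=(d_a, d_q)) * 0.1  # projects the query into the attention space
W_k = rng.normal(size=(d_a, d_k)) * 0.1  # projects each key into the attention space
v   = rng.normal(size=(d_a,)) * 0.1      # scoring vector

query  = rng.normal(size=(d_q,))
keys   = rng.normal(size=(n_keys, d_k))
values = rng.normal(size=(n_keys, d_v))

# Additive (Bahdanau-style) score: v^T tanh(W_q q + W_k k), one score per key
scores  = np.tanh(query @ W_q.T + keys @ W_k.T) @ v  # shape (n_keys,)
weights = softmax(scores)                            # attention distribution over keys
context = weights @ values                           # weighted sum of the values

print(weights.shape, context.shape)  # (5,) (3,)
```

Note that the `tanh` of a sum of two projections is the extra nonlinearity dot-product attention avoids; it is also why this form cannot be collapsed into one matrix multiplication over all queries and keys at once.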