Why do we scale attention weights?

Without scaling, the dot products in dot-product attention have a variance that grows with the key dimension d_k. The resulting large logits push the softmax into saturated regions where its gradients become extremely small, which destabilizes training. Dividing the logits by √d_k keeps their variance close to 1 and the softmax in a well-behaved regime.

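Below is a minimal sketch (my own illustration, not from the note) that compares raw and scaled dot-product logits for random queries and keys; the dimension d_k = 512 and the random setup are assumptions chosen just to show the variance growth and softmax saturation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

d_k = 512                              # key/query dimension (illustrative choice)
rng = np.random.default_rng(0)
q = rng.standard_normal((1, d_k))      # one query vector
K = rng.standard_normal((8, d_k))      # eight key vectors

raw_logits = q @ K.T                        # variance ~ d_k, so std ~ sqrt(d_k)
scaled_logits = raw_logits / np.sqrt(d_k)   # rescaled to roughly unit variance

print("std of raw logits:   ", raw_logits.std())      # ~ sqrt(512) ≈ 22.6
print("std of scaled logits:", scaled_logits.std())    # ~ 1

# Large raw logits saturate the softmax: nearly all probability mass lands on
# one key, so gradients through the softmax become vanishingly small.
print("softmax(raw):   ", np.round(softmax(raw_logits), 3))
print("softmax(scaled):", np.round(softmax(scaled_logits), 3))
```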

References


Related Notes