Why do we scale attention weights?

Without scaling, the dot products in dot-product attention have a variance that grows with the key dimension d_k. The resulting large logits push the softmax into saturated regions where its gradients become extremely small, which destabilizes training. Dividing the logits by √d_k keeps their variance close to 1 and the softmax in a well-behaved regime.

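Below is a minimal sketch (my own illustration, not from the note) that compares raw and scaled dot-product logits for random queries and keys; the dimension d_k = 512 and the random setup are assumptions chosen just to show the variance growth and softmax saturation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

d_k = 512                              # key/query dimension (illustrative choice)
rng = np.random.default_rng(0)
q = rng.standard_normal((1, d_k))      # one query vector
K = rng.standard_normal((8, d_k))      # eight key vectors

raw_logits = q @ K.T                        # variance ~ d_k, so std ~ sqrt(d_k)
scaled_logits = raw_logits / np.sqrt(d_k)   # rescaled to roughly unit variance

print("std of raw logits:   ", raw_logits.std())      # ~ sqrt(512) ≈ 22.6
print("std of scaled logits:", scaled_logits.std())    # ~ 1

# Large raw logits saturate the softmax: nearly all probability mass lands on
# one key, so gradients through the softmax become vanishingly small.
print("softmax(raw):   ", np.round(softmax(raw_logits), 3))
print("softmax(scaled):", np.round(softmax(scaled_logits), 3))
```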

References


Related Notes