Why do we scale attention weights?
Without the 1/sqrt(d_k) scaling in dot-product attention, the dot products between queries and keys have a variance that grows with the key dimension d_k (for components with mean 0 and variance 1, the dot product has variance roughly d_k). These large-magnitude scores push the softmax into saturated regions where its gradients are extremely small, which destabilizes training. Dividing by sqrt(d_k) keeps the score variance close to 1 and the gradients well behaved.
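A minimal NumPy sketch (illustrative, not part of the original answer) of this effect: the unscaled dot products have variance close to d_k, while dividing by sqrt(d_k) brings the variance back to about 1, as used in scaled dot-product attention.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512
q = rng.standard_normal((1000, d_k))   # query vectors, components ~ N(0, 1)
k = rng.standard_normal((1000, d_k))   # key vectors, components ~ N(0, 1)

scores = (q * k).sum(axis=1)           # unscaled dot products
print(scores.var())                    # roughly d_k (about 512)
print((scores / np.sqrt(d_k)).var())   # roughly 1 after scaling

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V with a numerically stable softmax."""
    d_k = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```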