Vanishing Gradient in Transformers

Transformers mitigate the vanishing gradient problem through the following strategies:

  1. Skip Connections: The identity path of each residual connection lets gradients flow backward unchanged, bypassing the nonlinear sublayer
  2. ReLU-family Activations: ReLU and its variants (e.g., GELU in common Transformer implementations) do not saturate for positive inputs, so their gradients do not shrink the way sigmoid or tanh gradients do
  3. Layer Normalization: Normalizing the activations of every sublayer keeps their scale stable, which stabilizes training and the gradient magnitudes
  4. Self-Attention: Every position attends to every other position directly, so gradients travel a constant number of steps between any two tokens instead of through a long recurrent chain
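The effect of skip connections (strategy 1) can be sketched numerically. Below is a toy scalar model, not an actual Transformer: a plain stack `x -> tanh(w*x)` multiplies many sub-unity derivative factors and its gradient vanishes with depth, while a residual stack `x -> x + tanh(w*x)` adds 1 to each factor via the identity path, so the product stays well away from zero. The layer count, weight range, and manual backprop are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
depth = 50
# Illustrative positive weights for a toy scalar network (an assumption,
# not trained Transformer weights).
weights = rng.uniform(0.2, 0.6, size=depth)

def grad_plain(x):
    # Plain stack: x_{l+1} = tanh(w_l * x_l).
    # d(x_out)/d(x_in) is a product of factors w_l * (1 - tanh^2),
    # each well below 1 here, so the product shrinks toward zero.
    g = 1.0
    for w in weights:
        pre = w * x
        g *= w * (1.0 - np.tanh(pre) ** 2)
        x = np.tanh(pre)
    return g

def grad_residual(x):
    # Residual stack: x_{l+1} = x_l + tanh(w_l * x_l).
    # Each factor is 1 + w_l * (1 - tanh^2): the identity path
    # contributes the 1, so the product cannot collapse to zero.
    g = 1.0
    for w in weights:
        pre = w * x
        g *= 1.0 + w * (1.0 - np.tanh(pre) ** 2)
        x = x + np.tanh(pre)
    return g

print("plain:   ", grad_plain(0.3))     # vanishingly small at depth 50
print("residual:", grad_residual(0.3))  # stays well above zero
```

Running this shows the plain stack's gradient collapsing by many orders of magnitude while the residual stack's gradient remains of moderate size, which is the mechanism the skip connection exploits.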

References


Related Notes