Vanishing Gradient in Transformers
Transformers mitigate the vanishing gradient problem through the following design choices:
- Skip Connections: residual connections let gradients flow unchanged along the identity path, bypassing each sublayer's transformation
- ReLU-Style Activations: ReLU and its variants (e.g., GELU) do not saturate for positive inputs, so their gradients are far less prone to vanishing than those of sigmoid or tanh
- Layer Normalization: normalizing activations within each layer keeps activation and gradient magnitudes stable during training
- Self-Attention: attention connects every position to every other position in a single step, avoiding the long sequential dependency chains that shrink gradients in RNNs
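The skip-connection point can be made concrete with a small numerical sketch. The toy layers, dimensions, and weight scales below are illustrative assumptions, not any real Transformer's configuration: we backpropagate a gradient through a deep stack of sigmoid layers, once without and once with an identity skip path, and compare the resulting gradient norms.

```python
import numpy as np

rng = np.random.default_rng(0)
depth, dim = 50, 16  # hypothetical depth and width for illustration

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def grad_plain(depth, dim):
    # Plain chain: each backward step multiplies the gradient by
    # W^T * sigmoid'(z); sigmoid' <= 0.25, so the norm shrinks geometrically.
    g = np.ones(dim)
    for _ in range(depth):
        W = rng.normal(0, 0.5 / np.sqrt(dim), (dim, dim))
        z = rng.normal(size=dim)
        g = W.T @ (sigmoid(z) * (1 - sigmoid(z)) * g)
    return np.linalg.norm(g)

def grad_residual(depth, dim):
    # Residual chain: the layer Jacobian is I + (small term), so the
    # identity path carries the gradient through unchanged.
    g = np.ones(dim)
    for _ in range(depth):
        W = rng.normal(0, 0.5 / np.sqrt(dim), (dim, dim))
        z = rng.normal(size=dim)
        g = g + W.T @ (sigmoid(z) * (1 - sigmoid(z)) * g)
    return np.linalg.norm(g)

plain = grad_plain(depth, dim)
residual = grad_residual(depth, dim)
```

Running this shows the plain chain's gradient norm collapsing toward zero after 50 layers, while the residual chain's stays on the order of its starting value, which is the mechanism the skip-connection bullet describes.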