Vanishing Gradient in Transformers
Transformers mitigate the vanishing gradient problem through the following design choices:
- Skip Connections: residual connections let gradients flow unchanged along the identity path, bypassing each sublayer's transformation
- ReLU-Style Activations: ReLU and its variants (e.g., GELU) do not saturate for positive inputs, so their gradients are far less prone to vanishing than those of sigmoid or tanh
- Layer Normalization: normalizing activations within each layer keeps activation and gradient magnitudes stable during training
- Self-Attention: attention connects every position to every other position in a single step, avoiding the long sequential dependency chains that shrink gradients in RNNs
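The skip-connection point can be made concrete with a small numerical sketch. The toy layers, dimensions, and weight scales below are illustrative assumptions, not any real Transformer's configuration: we backpropagate a gradient through a deep stack of sigmoid layers, once without and once with an identity skip path, and compare the resulting gradient norms.

```python
import numpy as np

rng = np.random.default_rng(0)
depth, dim = 50, 16  # hypothetical depth and width for illustration

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def grad_plain(depth, dim):
    # Plain chain: each backward step multiplies the gradient by
    # W^T * sigmoid'(z); sigmoid' <= 0.25, so the norm shrinks geometrically.
    g = np.ones(dim)
    for _ in range(depth):
        W = rng.normal(0, 0.5 / np.sqrt(dim), (dim, dim))
        z = rng.normal(size=dim)
        g = W.T @ (sigmoid(z) * (1 - sigmoid(z)) * g)
    return np.linalg.norm(g)

def grad_residual(depth, dim):
    # Residual chain: the layer Jacobian is I + (small term), so the
    # identity path carries the gradient through unchanged.
    g = np.ones(dim)
    for _ in range(depth):
        W = rng.normal(0, 0.5 / np.sqrt(dim), (dim, dim))
        z = rng.normal(size=dim)
        g = g + W.T @ (sigmoid(z) * (1 - sigmoid(z)) * g)
    return np.linalg.norm(g)

plain = grad_plain(depth, dim)
residual = grad_residual(depth, dim)
```

Running this shows the plain chain's gradient norm collapsing toward zero after 50 layers, while the residual chain's stays on the order of its starting value, which is the mechanism the skip-connection bullet describes.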