Vanishing Gradient
- Vanishing gradient means the gradient becomes so small that, to the computer, it is effectively 0
- Why does it arise?
  - Because of finite floating-point precision: very small values eventually underflow to 0
  - Because the chain rule multiplies the local gradients of all layers, so in a very deep network a product of many small factors shrinks toward 0 (see the underflow sketch after this list)
- How to identify it?
  - Parameters of the top layers keep changing, while those of the bottom layers barely change (see the gradient-norm check after this list)
  - The model learns at a very slow pace
  - Training can stall at a very early stage, after only a few iterations
- What to do?
  - The right fix depends on the architecture and on the reason the gradient vanishes
  - A few common remedies are (see the skip-connection sketch after this list)
    - LSTM
    - ReLU - which can in turn introduce the exploding-gradient problem
    - Batch Normalization
    - Weight Initialization (e.g. Xavier/He)
    - Skip Connections
    - GRU
    - Reduce network depth
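
A quick way to see both reasons at once (small local derivatives multiplied across many layers, plus finite float precision): the sigmoid derivative is at most 0.25, so its product over the depth shrinks geometrically and eventually underflows. A minimal sketch, with the 0.25 bound and the depths picked purely for illustration:

```python
import numpy as np

# Illustrative numbers, not from the notes: the sigmoid derivative is at most
# 0.25, so the chain-rule product over many layers shrinks geometrically and
# eventually underflows to exactly 0.0 in float32.
max_sigmoid_grad = np.float32(0.25)

for depth in (10, 50, 100, 200):
    grad = np.float32(1.0)
    for _ in range(depth):
        grad *= max_sigmoid_grad   # one layer's local derivative (upper bound)
    print(f"depth={depth:4d}  gradient factor <= {grad}")
# By depth=100 the factor has already underflowed to 0.0 in float32.
```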
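
To check the "top layers change, bottom layers don't" symptom in practice, one option is to inspect per-layer gradient norms after a backward pass. A minimal PyTorch sketch, assuming a made-up deep sigmoid MLP and fake data (the layer sizes, depth, and loss are arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy deep MLP with sigmoid activations (hypothetical example network).
layers = []
for _ in range(30):
    layers += [nn.Linear(64, 64), nn.Sigmoid()]
layers += [nn.Linear(64, 1)]
model = nn.Sequential(*layers)

x = torch.randn(16, 64)           # fake batch
loss = model(x).pow(2).mean()     # dummy loss, just to produce gradients
loss.backward()

# Symptom check: gradient norm per Linear layer, from bottom (input side)
# to top (output side). Bottom layers should report norms near 0.
for i, m in enumerate(model):
    if isinstance(m, nn.Linear):
        print(f"layer {i:3d}  grad norm = {m.weight.grad.norm().item():.3e}")
```

Running this, the Linear layers near the input typically report norms orders of magnitude smaller than the final layer, which is the fingerprint described above.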
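
For the remedies, one illustration of skip connections (combined here with ReLU): the identity shortcut gives gradients a path that bypasses the long chain of multiplications. A minimal, assumed PyTorch sketch, not a prescription; the block layout, width, and depth are invented for the example:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class ResidualBlock(nn.Module):
    """Hypothetical residual block: y = x + F(x)."""
    def __init__(self, width: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(width, width),
            nn.ReLU(),
            nn.Linear(width, width),
        )

    def forward(self, x):
        # The identity path lets gradients flow straight through,
        # avoiding the product of many small per-layer factors.
        return x + self.body(x)

model = nn.Sequential(*[ResidualBlock(64) for _ in range(15)], nn.Linear(64, 1))

x = torch.randn(16, 64)
model(x).pow(2).mean().backward()

bottom = model[0].body[0].weight.grad.norm().item()
top = model[-1].weight.grad.norm().item()
print(f"bottom-layer grad norm = {bottom:.3e}, top-layer grad norm = {top:.3e}")
```

With the shortcut in place, the bottom block's gradient norm stays on a similar order of magnitude as the top layer's; the same per-layer check can be reused to compare the other remedies (batch normalization, Xavier/He initialization, etc.).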
TODO:
- Read how each of these remedies actually solves the problem
- Create flash cards