Stochastic Gradient Descent with Momentum
- SGD with momentum is a variation of Stochastic Gradient Descent (SGD)
- Closely related to (but distinct from) Nesterov momentum, which evaluates the gradient at a look-ahead point
- It adds a momentum term that accumulates past gradients and pushes the update in a consistent direction
- It smooths out the noisy stochastic gradient estimates
- The momentum coefficient is typically set to a value between 0 and 1 (0.9 is a common default); a minimal sketch of the update rule follows this list
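A minimal NumPy sketch of the classical (heavy-ball) momentum update. The names `grad_fn`, `theta0`, and `mu`, as well as the toy quadratic objective, are illustrative assumptions, not a reference implementation:

```python
import numpy as np

def sgd_momentum(grad_fn, theta0, lr=0.01, mu=0.9, n_steps=100):
    """Classical (heavy-ball) momentum: the velocity accumulates past
    gradients and the parameters move along the accumulated velocity."""
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)          # velocity starts at rest
    for _ in range(n_steps):
        g = grad_fn(theta)            # (stochastic) gradient at the current point
        v = mu * v - lr * g           # decay old velocity, add the new gradient step
        theta = theta + v             # move along the velocity
    return theta

# Usage on a toy quadratic f(x) = x^T x, whose gradient is 2x
theta_star = sgd_momentum(lambda x: 2 * x, theta0=np.array([5.0, -3.0]))
print(theta_star)  # should end up close to [0, 0]
```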
Intuition
Using momentum is like pushing a ball down a hill: the ball accumulates speed and moves faster over time. In the same way, the velocity term accumulates past gradients, so the parameter updates grow larger from iteration to iteration along directions where the gradients agree. A small numerical illustration of this accumulation follows.
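A small, hypothetical illustration of the accumulation: with momentum mu = 0.9 and a constant gradient, the velocity approaches 1 / (1 - mu) = 10 times a single gradient step:

```python
# With a constant gradient g and momentum mu = 0.9, the accumulated
# velocity grows toward g / (1 - mu) = 10 * g, i.e. the effective step
# becomes up to 10x a single SGD step.
mu, g = 0.9, 1.0
v = 0.0
for step in range(1, 51):
    v = mu * v + g            # accumulate the (constant) gradient
    if step in (1, 5, 10, 50):
        print(f"step {step:2d}: velocity = {v:.3f}")
# step  1: velocity = 1.000
# step  5: velocity = 4.095
# step 10: velocity = 6.513
# step 50: velocity = 9.948
```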
Pros:
- Gives smoother gradient estimates than plain Stochastic Gradient Descent (SGD)
- Often speeds up convergence
Cons:
- It can overshoot good solutions and settle for suboptimal ones if the momentum coefficient is too high
- Requires tuning of an additional momentum hyperparameter (see the framework example after this list)
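In practice the momentum coefficient is just one more optimizer argument. A minimal sketch using PyTorch's built-in SGD optimizer; the toy model, data, and hyperparameter values are illustrative assumptions:

```python
import torch

# Toy linear model; sizes and hyperparameters are illustrative.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# Passing nesterov=True would switch to the Nesterov variant mentioned above.

x, y = torch.randn(32, 10), torch.randn(32, 1)
for _ in range(100):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()  # applies the momentum-smoothed update
```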