Stochastic Gradient Descent with Momentum
- SGD with momentum is a variant of Stochastic Gradient Descent (SGD)
- Not the same as Nesterov momentum (Nesterov accelerated gradient), which is a related but distinct variant that evaluates the gradient at a look-ahead position
- It adds a momentum term that accumulates past gradients and pushes the update in a consistent direction (see the sketch after this list)
- This smooths the noisy stochastic gradients
- The momentum coefficient is typically set to a value between 0 and 1 (0.9 is a common default)
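
A minimal sketch of the update rule, framework-agnostic; the function name `sgd_momentum_step` and its default values are illustrative, not taken from any particular library:

```python
import numpy as np

def sgd_momentum_step(params, grads, velocity, lr=0.01, momentum=0.9):
    """One SGD-with-momentum update: past gradients accumulate in a velocity term."""
    # The velocity is a decaying sum of past gradients; the momentum
    # coefficient (between 0 and 1) controls how much history is kept.
    velocity = momentum * velocity - lr * grads
    # Move the parameters along the accumulated direction.
    params = params + velocity
    return params, velocity
```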
 
Intuition
Using momentum is like pushing a ball down a hill: the ball accumulates speed and moves faster over time. In the same way, the velocity term accumulates past gradients, so updates in a consistent direction grow larger with each iteration (a toy numeric illustration is shown below).
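
A toy illustration of this accumulation, assuming the `sgd_momentum_step` sketch above and a constant gradient of 1.0:

```python
import numpy as np

params = np.array([0.0])
velocity = np.array([0.0])
for step in range(5):
    params, velocity = sgd_momentum_step(
        params, grads=np.array([1.0]), velocity=velocity, lr=0.1, momentum=0.9
    )
    # |velocity| grows: 0.1, 0.19, 0.271, 0.344, 0.410, ...
    # approaching the limiting step size lr / (1 - momentum) = 1.0
    print(step, velocity)
```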
Pros:
- It gives smoother updates than plain Stochastic Gradient Descent (SGD)
- It usually improves convergence speed
 
Cons:
- If the momentum coefficient is set too high, it can overshoot good solutions and settle for suboptimal ones
- Requires tuning an extra hyperparameter (the momentum coefficient)
 
