Stochastic Gradient Descent with Momentum
- SGD with momentum is a variant of Stochastic Gradient Descent (SGD)
- Not the same as Nesterov momentum (Nesterov accelerated gradient), which is a related but distinct variant that evaluates the gradient at a look-ahead position
- It adds a momentum term that accumulates past gradients and pushes the update in a consistent direction (see the sketch after this list)
- This smooths the noisy stochastic gradients
- The momentum coefficient is typically set to a value between 0 and 1 (0.9 is a common default)
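
A minimal sketch of the update rule, framework-agnostic; the function name `sgd_momentum_step` and its default values are illustrative, not taken from any particular library:

```python
import numpy as np

def sgd_momentum_step(params, grads, velocity, lr=0.01, momentum=0.9):
    """One SGD-with-momentum update: past gradients accumulate in a velocity term."""
    # The velocity is a decaying sum of past gradients; the momentum
    # coefficient (between 0 and 1) controls how much history is kept.
    velocity = momentum * velocity - lr * grads
    # Move the parameters along the accumulated direction.
    params = params + velocity
    return params, velocity
```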
 
Intuition
Using momentum is like pushing a ball down a hill: the ball accumulates speed and moves faster over time. In the same way, the velocity term accumulates past gradients, so updates in a consistent direction grow larger with each iteration (a toy numeric illustration is shown below).
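
A toy illustration of this accumulation, assuming the `sgd_momentum_step` sketch above and a constant gradient of 1.0:

```python
import numpy as np

params = np.array([0.0])
velocity = np.array([0.0])
for step in range(5):
    params, velocity = sgd_momentum_step(
        params, grads=np.array([1.0]), velocity=velocity, lr=0.1, momentum=0.9
    )
    # |velocity| grows: 0.1, 0.19, 0.271, 0.344, 0.410, ...
    # approaching the limiting step size lr / (1 - momentum) = 1.0
    print(step, velocity)
```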
Pros:
- It gives smoother updates than plain Stochastic Gradient Descent (SGD)
- It usually improves convergence speed
 
Cons:
- If the momentum coefficient is set too high, it can overshoot good solutions and settle for suboptimal ones
- Requires tuning an extra hyperparameter (the momentum coefficient)
 
