Stochastic Gradient Descent with Momentum

$$v = \beta\, v + (1 - \beta)\, \nabla_\theta L(\theta)$$
$$\theta = \theta - \alpha\, v$$

where $v$ is the velocity, $\beta$ is the momentum coefficient, $\alpha$ is the learning rate, and $\nabla_\theta L(\theta)$ is the gradient of the loss.

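Below is a minimal NumPy sketch of this update rule. The quadratic loss, learning rate, and iteration count are illustrative assumptions, not part of the original note:

```python
import numpy as np

def momentum_sgd(grad_fn, theta0, alpha=0.1, beta=0.9, steps=200):
    """Sketch of SGD with momentum following the update rule above."""
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)           # velocity starts at zero
    for _ in range(steps):
        g = grad_fn(theta)             # (stochastic) gradient at the current theta
        v = beta * v + (1 - beta) * g  # accumulate an exponential average of gradients
        theta = theta - alpha * v      # step along the velocity
    return theta

# Illustrative use: minimize L(theta) = theta^2, whose gradient is 2 * theta.
print(momentum_sgd(lambda th: 2 * th, theta0=[5.0]))  # approaches 0
```
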
Intuition

With momentum, we are effectively pushing a ball down a hill: like a ball, it accumulates velocity and moves faster over time.

The gradient update behaves the same way: the velocity term accumulates past gradients, so the updates grow larger with each iteration in directions where successive gradients agree.

Pros:

  1. It produces smoother updates than plain Stochastic Gradient Descent (SGD), because the velocity averages out noisy mini-batch gradients (demonstrated below)
  2. It improves convergence speed
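
To illustrate the smoothing in pro 1, here is a small sketch showing that the velocity is an exponential moving average of the noisy stochastic gradients and therefore has much lower variance than the raw gradients. The true gradient and noise scale are made-up values for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)
beta, true_grad, v = 0.9, 1.0, 0.0
raw, smoothed = [], []
for _ in range(500):
    g = true_grad + rng.normal(scale=2.0)  # noisy mini-batch gradient
    v = beta * v + (1 - beta) * g          # momentum velocity smooths the noise
    raw.append(g)
    smoothed.append(v)

print("std of raw gradients:    ", np.std(raw))       # ~2.0
print("std of momentum velocity:", np.std(smoothed))  # ~2 * sqrt((1-beta)/(1+beta)) ~ 0.46
```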

Cons:

  1. If the momentum is too high, it can overshoot good solutions and settle on suboptimal ones
  2. It requires tuning the momentum hyperparameter β (a typical setup is sketched after this list)
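
A common starting point for the momentum coefficient is β ≈ 0.9, though it still needs tuning per problem. As a hedged sketch, the same rule can be applied through PyTorch's built-in optimizer; note that by default PyTorch accumulates v = βv + g, so `dampening` is set equal to β below to match the (1 − β)-weighted average used in this note. The model and data are placeholders:

```python
import torch

model = torch.nn.Linear(10, 1)                  # placeholder model
x, y = torch.randn(32, 10), torch.randn(32, 1)  # placeholder batch

# dampening=0.9 makes PyTorch's velocity match the (1 - beta) weighting above;
# momentum=0.9 is a common starting point but still needs tuning per problem.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, dampening=0.9)

optimizer.zero_grad()
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
optimizer.step()
```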


References


Related Notes