Nesterov Accelerated Gradient (NAG)
- It is an improvement on Stochastic Gradient Descent (SGD) with Momentum.
- In plain momentum, the ball picks up speed over time, but it follows the accumulated velocity blindly; it does not look at where it is about to go.
- NAG instead evaluates the gradient at the look-ahead position (the point the momentum step is about to reach) and moves in the direction where the loss will be lower.
- Compared to SGD with Momentum, the only change is the point at which the gradient is evaluated: the look-ahead point instead of the current weights (see the sketch after this list).
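A minimal NumPy sketch of the two update rules, to make the difference concrete. The names `loss_grad`, `momentum_step`, and `nesterov_step`, and the toy quadratic loss, are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def momentum_step(w, v, loss_grad, lr=0.01, momentum=0.9):
    """Classical momentum: gradient is evaluated at the current weights w."""
    v = momentum * v - lr * loss_grad(w)
    return w + v, v

def nesterov_step(w, v, loss_grad, lr=0.01, momentum=0.9):
    """NAG: gradient is evaluated at the look-ahead point w + momentum * v."""
    v = momentum * v - lr * loss_grad(w + momentum * v)
    return w + v, v

# Toy example (hypothetical): minimize f(w) = w^2, whose gradient is 2w.
grad = lambda w: 2 * w
w, v = np.array([5.0]), np.zeros(1)
for _ in range(100):
    w, v = nesterov_step(w, v, grad, lr=0.1, momentum=0.9)
print(w)  # close to the minimum at 0
```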
Pros:
- Less likely to overshoot a local minimum than plain momentum
- Slows down when a minimum is nearby, because the look-ahead gradient counteracts the momentum (the learning rate itself is not adaptive)
Cons:
- Hyperparameters (learning rate and momentum coefficient) still need to be tuned manually (see the example below)
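For reference, PyTorch's `torch.optim.SGD` exposes NAG via `nesterov=True`; the model here is a hypothetical placeholder, and the `lr` and `momentum` values are just example settings of the hyperparameters mentioned above.

```python
import torch

model = torch.nn.Linear(10, 1)  # placeholder model

# lr and momentum are the hyperparameters that need tuning;
# nesterov=True switches the momentum update to the NAG variant.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)
```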