Adam

$$
\begin{aligned}
g &= \nabla_\theta L(\theta) \\
m &\leftarrow \beta_1 m + (1 - \beta_1)\, g \\
v &\leftarrow \beta_2 v + (1 - \beta_2)\, g^2 \\
\hat{m} &= \frac{m}{1 - \beta_1^t} \\
\hat{v} &= \frac{v}{1 - \beta_2^t} \\
\theta &\leftarrow \theta - \frac{\alpha\, \hat{m}}{\sqrt{\hat{v}} + \epsilon}
\end{aligned}
$$
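
A minimal NumPy sketch of a single Adam step following the equations above; the default hyperparameter values (α = 1e-3, β₁ = 0.9, β₂ = 0.999, ε = 1e-8) are the ones suggested in the original paper, and the function signature itself is just an illustrative assumption.

```python
import numpy as np

def adam_step(theta, g, m, v, t,
              alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: theta = parameters, g = gradient at theta,
    m/v = running first/second moment estimates, t = step count (1-based)."""
    m = beta1 * m + (1 - beta1) * g        # biased first-moment estimate
    v = beta2 * v + (1 - beta2) * g**2     # biased second-moment estimate
    m_hat = m / (1 - beta1**t)             # bias correction for m
    v_hat = v / (1 - beta2**t)             # bias correction for v
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Initialise m and v to zeros with theta's shape, and increment t by one
# on every call to reproduce the bias-corrected updates above.
```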

Intuition:

The intuition is the same as for Stochastic Gradient Descent with Momentum, where a ball is pushed down a hill. This time, however, it is a heavy ball with friction, so it slows down as it approaches the minimum instead of overshooting it.

Pros:

  1. Often converges faster in practice than other common first-order optimizers such as plain SGD
  2. Works well with noisy and sparse gradients

Cons:

  1. Has more hyperparameters to tune ($\alpha$, $\beta_1$, $\beta_2$, $\epsilon$)
  2. Computationally more costly than plain SGD, since it stores and updates two moment estimates per parameter
