Adam

$$
\begin{aligned}
g &= \nabla_\theta L(\theta) \\
m &\leftarrow \beta_1 m + (1 - \beta_1)\, g \\
v &\leftarrow \beta_2 v + (1 - \beta_2)\, g^2 \\
\hat{m} &= \frac{m}{1 - \beta_1^t} \\
\hat{v} &= \frac{v}{1 - \beta_2^t} \\
\theta &\leftarrow \theta - \frac{\alpha\, \hat{m}}{\sqrt{\hat{v}} + \epsilon}
\end{aligned}
$$
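
A minimal NumPy sketch of a single Adam step following the equations above; the default hyperparameter values (α = 1e-3, β₁ = 0.9, β₂ = 0.999, ε = 1e-8) are the ones suggested in the original paper, and the function signature itself is just an illustrative assumption.

```python
import numpy as np

def adam_step(theta, g, m, v, t,
              alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: theta = parameters, g = gradient at theta,
    m/v = running first/second moment estimates, t = step count (1-based)."""
    m = beta1 * m + (1 - beta1) * g        # biased first-moment estimate
    v = beta2 * v + (1 - beta2) * g**2     # biased second-moment estimate
    m_hat = m / (1 - beta1**t)             # bias correction for m
    v_hat = v / (1 - beta2**t)             # bias correction for v
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Initialise m and v to zeros with theta's shape, and increment t by one
# on every call to reproduce the bias-corrected updates above.
```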

Intuition:

The intuition is the same as for Stochastic Gradient Descent with Momentum, where a ball is pushed down a hill. This time, however, it is a heavy ball with friction, so it slows down as it approaches the minimum instead of overshooting it.

Pros:

  1. Often converges faster in practice than other common first-order optimizers such as plain SGD
  2. Works well with noisy and sparse gradients

Cons:

  1. Has more hyperparameters to tune ($\alpha$, $\beta_1$, $\beta_2$, $\epsilon$)
  2. Computationally more costly than plain SGD, since it stores and updates two moment estimates per parameter
