Adam
- Adam = Adaptive Moment Estimation
- Adam combines both RMSProp and Stochastic Gradient Descent with Momentum
- Like RMSProp, it uses an exponentially decaying average of the squared gradients to adapt the learning rate for each parameter
- Like Stochastic Gradient Descent with Momentum, it uses an exponentially decaying average of the gradients (a momentum term) to help the optimizer move more efficiently, as sketched in the code below
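A minimal NumPy sketch of the Adam update rule described above. The function name `adam_update` and the toy example are illustrative assumptions rather than any library's API; the hyperparameter defaults follow the original Adam paper.

```python
import numpy as np

def adam_update(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step for a single parameter array.

    m and v are the exponentially decaying averages of the gradient and the
    squared gradient; t is the 1-based step count used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * grad        # first moment (momentum term)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (RMSProp term)
    m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)              # bias-corrected second moment
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Toy usage (hypothetical example): minimize f(w) = (w - 3)^2
w = np.array([0.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 1001):
    grad = 2 * (w - 3)
    w, m, v = adam_update(w, grad, m, v, t, lr=0.1)
print(w)  # converges to approximately [3.0]
```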
Intuition:
The intuition is the same as for Stochastic Gradient Descent with Momentum, where a ball is pushed down a hill. But this time it is a heavy ball with friction, so it tends to slow down and settle in flat minima rather than overshooting them.
Pros:
- Typically converges faster than most other optimization algorithms
- Works well with noisy data and sparse gradients
Cons:
- Requires tuning more hyperparameters (learning rate, beta1, beta2, epsilon)
- Computationally more costly than plain SGD, since it maintains two moving averages per parameter