AdaGrad

$$
\begin{aligned}
g &= \nabla_\theta L(\theta) \\
G &= G + g \odot g \\
\theta &= \theta - \frac{\alpha}{\sqrt{G + \epsilon}} \odot g
\end{aligned}
$$

Here, G is a diagonal matrix (in practice stored as a vector, one entry per parameter) that accumulates the element-wise squares of the gradients.
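
As a concrete illustration, here is a minimal sketch of the update in NumPy; the function name `adagrad_update`, the step size, and the toy quadratic loss are purely illustrative:

```python
import numpy as np

def adagrad_update(theta, grad, G, alpha=0.1, eps=1e-8):
    """One AdaGrad step: accumulate squared gradients, then scale each parameter's step."""
    G = G + grad * grad                              # G = G + g ⊙ g
    theta = theta - alpha * grad / np.sqrt(G + eps)  # θ = θ − α / √(G + ε) ⊙ g
    return theta, G

# Toy usage: minimize L(θ) = ½‖θ‖², whose gradient is simply θ.
theta = np.array([1.0, -2.0])
G = np.zeros_like(theta)
for _ in range(200):
    grad = theta
    theta, G = adagrad_update(theta, grad, G)
print(theta)  # both coordinates move toward 0
```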

Pros:

  1. Works well with sparse data
  2. No need to manually tune the learning rate
    1. The learning rate is adapted per parameter based on the accumulated squared gradients

Cons:

  1. Can converge too slowly
  2. Because the squared gradients keep accumulating, the effective learning rate shrinks monotonically, so the updates can become vanishingly small and learning may effectively stop after enough iterations (see the small illustration after this list)
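
A quick numeric sketch of the second point: with a constant gradient of 1 and α = 0.1 (illustrative values), the effective step size α/√(G+ε) only shrinks as G grows:

```python
import numpy as np

alpha, eps, G = 0.1, 1e-8, 0.0
for t in range(1, 6):
    G += 1.0 ** 2                       # a constant gradient of 1 keeps adding to G
    print(t, alpha / np.sqrt(G + eps))  # 0.100, 0.071, 0.058, 0.050, 0.045 — monotonically shrinking
```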
