AdaDelta

$$
\begin{aligned}
g &= \nabla_\theta L(\theta) \\
G &= \beta \cdot G + (1 - \beta)\, g \odot g \\
\Delta\theta &= -\frac{\sqrt{S + \epsilon}}{\sqrt{G + \epsilon}} \odot g \\
S &= \beta \cdot S + (1 - \beta)\, \Delta\theta \odot \Delta\theta \\
\theta &= \theta + \Delta\theta
\end{aligned}
$$

Where G is an exponential moving average of the squared gradients and S is an exponential moving average of the squared updates; the ratio of their square roots takes the place of a hand-tuned learning rate.
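
A minimal NumPy sketch of one update step, following the formulas above. The function name `adadelta_step`, the toy quadratic objective in the usage loop, and the hyperparameter defaults (β = 0.95, ε = 1e-6) are illustrative assumptions, not part of the original note.

```python
import numpy as np

def adadelta_step(theta, grad, G, S, beta=0.95, eps=1e-6):
    """One AdaDelta update.

    G: running average of squared gradients (g ⊙ g).
    S: running average of squared updates (Δθ ⊙ Δθ).
    beta/eps are illustrative defaults, not from the note.
    """
    G = beta * G + (1 - beta) * grad * grad               # G = β·G + (1−β)·g⊙g
    delta = -np.sqrt(S + eps) / np.sqrt(G + eps) * grad   # Δθ = −√(S+ε)/√(G+ε) ⊙ g
    S = beta * S + (1 - beta) * delta * delta             # S = β·S + (1−β)·Δθ⊙Δθ
    theta = theta + delta                                 # θ = θ + Δθ
    return theta, G, S

# Usage on a simple quadratic, minimized at (3, -1):
theta = np.zeros(2)
G = np.zeros_like(theta)
S = np.zeros_like(theta)
for _ in range(1000):
    grad = 2 * (theta - np.array([3.0, -1.0]))  # ∇θ of ||θ − target||²
    theta, G, S = adadelta_step(theta, grad, G, S)
```

Note that both G and S start at zero, so early updates are small; this is one reason convergence can be slow at first (see the Cons list below).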

Pros:

  1. Works well with sparse data
  2. Automatically adapts a per-parameter learning rate from the history of updates, so no global learning rate has to be hand-tuned

Cons:

  1. Can converge too slowly for some problems
  2. Can stop learning altogether if the effective learning rate becomes too small
