AdaDelta

$$
\begin{aligned}
g &= \nabla_\theta L(\theta) \\
G &= \beta G + (1 - \beta)\, g \odot g \\
\Delta\theta &= -\frac{\sqrt{S + \epsilon}}{\sqrt{G + \epsilon}} \odot g \\
S &= \beta S + (1 - \beta)\, \Delta\theta \odot \Delta\theta \\
\theta &= \theta + \Delta\theta
\end{aligned}
$$

Where $G$ is an exponential moving average of the squared gradients and $S$ is an exponential moving average of the squared updates.
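
A minimal NumPy sketch of a single AdaDelta step following the updates above; the function name `adadelta_update` is illustrative, and the defaults $\beta = 0.95$, $\epsilon = 10^{-6}$ are common choices rather than fixed parts of the algorithm:

```python
import numpy as np

def adadelta_update(theta, grad, G, S, beta=0.95, eps=1e-6):
    """One AdaDelta step. G: EMA of squared gradients, S: EMA of squared updates."""
    # Accumulate the squared gradients
    G = beta * G + (1 - beta) * grad * grad
    # Scale the gradient by the ratio of RMS(past updates) to RMS(gradients)
    delta = -np.sqrt(S + eps) / np.sqrt(G + eps) * grad
    # Accumulate the squared updates
    S = beta * S + (1 - beta) * delta * delta
    # Apply the update
    theta = theta + delta
    return theta, G, S

# Example: minimize f(theta) = theta^2, starting from theta = 5
theta = np.array([5.0])
G = np.zeros_like(theta)
S = np.zeros_like(theta)
for _ in range(1000):
    grad = 2 * theta  # gradient of theta^2
    theta, G, S = adadelta_update(theta, grad, G, S)
```

Note that no global learning rate appears anywhere: the ratio $\sqrt{S + \epsilon}/\sqrt{G + \epsilon}$ plays that role per parameter.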

Pros:

  1. Works well with sparse data
  2. Automatically adapts a per-parameter learning rate from the history of updates, so no global learning rate needs to be set

Cons:

  1. Can converge too slowly for some problems
  2. Can stop learning altogether if the effective learning rate shrinks toward zero

References


Related Notes