AdaDelta

$$
\begin{aligned}
g &= \nabla_\theta L(\theta) \\
G &= \beta \cdot G + (1 - \beta)\, g \odot g \\
\Delta\theta &= -\frac{\sqrt{S + \epsilon}}{\sqrt{G + \epsilon}} \odot g \\
S &= \beta \cdot S + (1 - \beta)\, \Delta\theta \odot \Delta\theta \\
\theta &= \theta + \Delta\theta
\end{aligned}
$$

Where G is an exponential moving average of the squared gradients and S is an exponential moving average of the squared updates; the ratio of their square roots takes the place of a hand-tuned learning rate.
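
A minimal NumPy sketch of one update step, following the formulas above. The function name `adadelta_step`, the toy quadratic objective in the usage loop, and the hyperparameter defaults (β = 0.95, ε = 1e-6) are illustrative assumptions, not part of the original note.

```python
import numpy as np

def adadelta_step(theta, grad, G, S, beta=0.95, eps=1e-6):
    """One AdaDelta update.

    G: running average of squared gradients (g ⊙ g).
    S: running average of squared updates (Δθ ⊙ Δθ).
    beta/eps are illustrative defaults, not from the note.
    """
    G = beta * G + (1 - beta) * grad * grad               # G = β·G + (1−β)·g⊙g
    delta = -np.sqrt(S + eps) / np.sqrt(G + eps) * grad   # Δθ = −√(S+ε)/√(G+ε) ⊙ g
    S = beta * S + (1 - beta) * delta * delta             # S = β·S + (1−β)·Δθ⊙Δθ
    theta = theta + delta                                 # θ = θ + Δθ
    return theta, G, S

# Usage on a simple quadratic, minimized at (3, -1):
theta = np.zeros(2)
G = np.zeros_like(theta)
S = np.zeros_like(theta)
for _ in range(1000):
    grad = 2 * (theta - np.array([3.0, -1.0]))  # ∇θ of ||θ − target||²
    theta, G, S = adadelta_step(theta, grad, G, S)
```

Note that both G and S start at zero, so early updates are small; this is one reason convergence can be slow at first (see the Cons list below).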

Pros:

  1. Works well with sparse data
  2. Automatically adapts a per-parameter learning rate from the history of updates, so no global learning rate has to be hand-tuned

Cons:

  1. Can converge too slowly for some problems
  2. Can stop learning altogether if the effective learning rate becomes too small
