AdaGrad

$$
\begin{aligned}
g &= \nabla_\theta L(\theta) \\
G &= G + g \odot g \\
\theta &= \theta - \frac{\alpha}{\sqrt{G + \epsilon}} \odot g
\end{aligned}
$$

Here, G is a diagonal matrix (in practice stored as a vector, one entry per parameter) that accumulates the element-wise squares of the gradients.
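
As a concrete illustration, here is a minimal sketch of the update in NumPy; the function name `adagrad_update`, the step size, and the toy quadratic loss are purely illustrative:

```python
import numpy as np

def adagrad_update(theta, grad, G, alpha=0.1, eps=1e-8):
    """One AdaGrad step: accumulate squared gradients, then scale each parameter's step."""
    G = G + grad * grad                              # G = G + g ⊙ g
    theta = theta - alpha * grad / np.sqrt(G + eps)  # θ = θ − α / √(G + ε) ⊙ g
    return theta, G

# Toy usage: minimize L(θ) = ½‖θ‖², whose gradient is simply θ.
theta = np.array([1.0, -2.0])
G = np.zeros_like(theta)
for _ in range(200):
    grad = theta
    theta, G = adagrad_update(theta, grad, G)
print(theta)  # both coordinates move toward 0
```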

Pros:

  1. Works well with sparse data
  2. No need to manually tune the learning rate
    1. The learning rate is adapted per parameter based on the accumulated squared gradients

Cons:

  1. Can converge too slowly
  2. Because the squared gradients keep accumulating, the effective learning rate shrinks monotonically, so the updates can become vanishingly small and learning may effectively stop after enough iterations (see the small illustration after this list)
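
A quick numeric sketch of the second point: with a constant gradient of 1 and α = 0.1 (illustrative values), the effective step size α/√(G+ε) only shrinks as G grows:

```python
import numpy as np

alpha, eps, G = 0.1, 1e-8, 0.0
for t in range(1, 6):
    G += 1.0 ** 2                       # a constant gradient of 1 keeps adding to G
    print(t, alpha / np.sqrt(G + eps))  # 0.100, 0.071, 0.058, 0.050, 0.045 — monotonically shrinking
```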
