AdaGrad
- AdaGrad = Adaptive Gradient (algorithm)
- It uses an adaptive learning rate: each parameter gets its own learning rate that is adjusted during training
- Parameters that have received more (or larger) updates get lower learning rates
- Parameters that have received fewer updates get higher learning rates
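Concretely, a standard statement of the per-parameter AdaGrad update (with global step size $\eta$ and a small constant $\epsilon$ for numerical stability) is:

$$
G_t = G_{t-1} + g_t^{2}, \qquad
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t} + \epsilon}\, g_t
$$

where $g_t$ is the current gradient and $G_t$ is the elementwise running sum of its squares, so parameters with large accumulated gradients take smaller steps.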
Pros:
- Works well with sparse data: infrequently updated parameters keep relatively large learning rates
- Reduces the need to hand-tune the learning rate
- Automatically adjusts each parameter's learning rate based on its update history
Cons:
- Can converge too slowly
- Because the squared gradients only accumulate, the effective learning rate is always decreasing and can become vanishingly small, so training may effectively stop after enough iterations (see the sketch below)
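As a concrete illustration of both the update rule and the decaying effective learning rate, here is a minimal NumPy sketch on a toy quadratic. The function name `adagrad_step` and the hyperparameter values are illustrative assumptions, not a reference implementation:

```python
import numpy as np

def adagrad_step(params, grads, accum, lr=0.1, eps=1e-8):
    """One AdaGrad update; `accum` is the running sum of squared gradients (G_t)."""
    accum += grads ** 2                            # G_t = G_{t-1} + g_t^2
    params -= lr * grads / (np.sqrt(accum) + eps)  # per-parameter scaled step
    return params, accum

# Toy problem: minimize f(x) = 0.5 * x^2, whose gradient is simply x.
x = np.array([5.0])
accum = np.zeros_like(x)
for step in range(200):
    grad = x.copy()             # gradient of 0.5 * x^2 at the current x
    x, accum = adagrad_step(x, grad, accum)

print(x)  # x moves toward 0, but steps shrink as accum grows (the decaying-LR con)
```

Running this shows the step size shrinking over iterations: early steps move `x` by about `lr`, while later steps become much smaller because `accum` never decreases.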