AdaDelta
- AdaDelta is similar to RMSProp
- The main difference is that it does not require a learning rate parameter
- Instead, it uses an exponentially decaying average of the squared gradients and of the squared parameter updates to set the scale of each step:

$$G_t = \rho G_{t-1} + (1 - \rho)\, g_t^2$$

$$\Delta\theta_t = -\frac{\sqrt{S_{t-1} + \epsilon}}{\sqrt{G_t + \epsilon}}\, g_t$$

$$S_t = \rho S_{t-1} + (1 - \rho)\, \Delta\theta_t^2$$

where G accumulates the squares of the gradients and S accumulates the squares of the updates
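To make the update concrete, here is a minimal NumPy sketch of one AdaDelta step under the equations above; the function name `adadelta_step`, the decay rate `rho = 0.95`, and the stabilizer `eps = 1e-6` are illustrative choices, not values fixed by these notes.

```python
import numpy as np

def adadelta_step(params, grads, G, S, rho=0.95, eps=1e-6):
    """One AdaDelta step; note that no learning rate parameter is needed.

    G -- running average of squared gradients
    S -- running average of squared parameter updates
    """
    # Accumulate the squares of the gradients
    G = rho * G + (1 - rho) * grads ** 2
    # Scale the step by the ratio of past update size to gradient size
    delta = -np.sqrt(S + eps) / np.sqrt(G + eps) * grads
    # Accumulate the squares of the updates
    S = rho * S + (1 - rho) * delta ** 2
    return params + delta, G, S

# Usage: minimize f(w) = ||w||^2, whose gradient is 2w
w = np.array([3.0, -2.0])
G = np.zeros_like(w)
S = np.zeros_like(w)
for _ in range(2000):
    w, G, S = adadelta_step(w, 2 * w, G, S)
print(w)  # driven toward the minimum at [0, 0]
```

Because S starts at zero, the earliest steps are tiny (on the order of sqrt(eps)) and grow as S accumulates, which illustrates the slow-convergence drawback listed below.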
Pros:
- Works well with sparse data
- Automatically adjusts learning rates based on parameter updates
Cons:
- Can converge too slowly for some problems
- Can stop learning altogether if the effective step size shrinks too much