Stochastic Gradient Descent (SGD)
- SGD is a variation of Gradient Descent
- Modern datasets can contain millions of examples, so the model cannot do what full-batch Gradient Descent requires: look at all the data at once and compute the exact gradient (even though that would give the most accurate update)
  - Memory issue
  - Compute issue
- So SGD looks at one example at a time and updates the parameters after each one (a minimal sketch follows below)
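Below is a minimal sketch of this one-example-at-a-time update, assuming a simple linear model y ≈ w·x + b on a made-up dataset; the variable names, learning rate, and epoch count are illustrative choices, not a prescribed implementation.

```python
import numpy as np

# Toy data (hypothetical): y = 3x + 2 plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=1000)
y = 3.0 * X + 2.0 + rng.normal(scale=0.1, size=X.shape)

w, b = 0.0, 0.0   # model parameters, started at zero
lr = 0.01         # learning rate (assumed value)

for epoch in range(5):
    # Shuffle, then update on one example at a time
    for i in rng.permutation(len(X)):
        x_i, y_i = X[i], y[i]
        err = (w * x_i + b) - y_i      # prediction error on this single example
        # Gradient of 0.5 * err^2 with respect to w and b
        w -= lr * err * x_i
        b -= lr * err

print(w, b)  # should approach 3 and 2
```

Each pass through the inner loop touches only one example, so memory use stays constant no matter how large the dataset is.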
Pros:
- Frequent updates of the model parameters
- Needs very little memory, since it only holds one example at a time
- Can handle large datasets
Cons:
- The frequent updates are based on noisy gradient estimates, so convergence can be slow and, in the worst case, the model can get trapped in a local minimum
- High variance in the parameter updates (see the check after this list)
- The frequent updates add computational overhead, since the work cannot be vectorized over a batch
- May overshoot even after reaching the global minimum, e.g. when an outlier example produces a large gradient
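To make the "noisy gradient / high variance" point concrete, here is a rough numerical check on the same kind of toy linear-regression data (again with made-up names and values): each per-example gradient is an unbiased estimate of the full-batch gradient, but individual estimates scatter widely around it, which is exactly the noise SGD sees at every step.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=1000)
y = 3.0 * X + 2.0 + rng.normal(scale=0.1, size=X.shape)

w, b = 0.0, 0.0
err = (w * X + b) - y

# Full-batch gradient of the mean squared error with respect to w
full_grad_w = np.mean(err * X)

# Per-example gradients: each one is what a single SGD step would use
per_example_grad_w = err * X

print("full-batch gradient:      ", full_grad_w)
print("mean of per-example grads:", per_example_grad_w.mean())  # matches the full-batch value
print("std of per-example grads: ", per_example_grad_w.std())   # the spread (noise) of single-step estimates
```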