Stochastic Gradient Descent (SGD)
- SGD is a variation of Gradient Descent
- Modern datasets can contain millions of examples, so the model cannot do what full-batch Gradient Descent requires: look at all the data at once and compute the exact gradient (even though that would give the most accurate update)
  - Memory issue
  - Compute issue
- So SGD looks at one example at a time and updates the parameters after each one (a minimal sketch follows below)
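Below is a minimal sketch of this one-example-at-a-time update, assuming a simple linear model y ≈ w·x + b on a made-up dataset; the variable names, learning rate, and epoch count are illustrative choices, not a prescribed implementation.

```python
import numpy as np

# Toy data (hypothetical): y = 3x + 2 plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=1000)
y = 3.0 * X + 2.0 + rng.normal(scale=0.1, size=X.shape)

w, b = 0.0, 0.0   # model parameters, started at zero
lr = 0.01         # learning rate (assumed value)

for epoch in range(5):
    # Shuffle, then update on one example at a time
    for i in rng.permutation(len(X)):
        x_i, y_i = X[i], y[i]
        err = (w * x_i + b) - y_i      # prediction error on this single example
        # Gradient of 0.5 * err^2 with respect to w and b
        w -= lr * err * x_i
        b -= lr * err

print(w, b)  # should approach 3 and 2
```

Each pass through the inner loop touches only one example, so memory use stays constant no matter how large the dataset is.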
Pros:
- Frequent updates of the model parameters
- Needs very little memory, since it only holds one example at a time
- Can handle large datasets
Cons:
- The frequent updates are based on noisy gradient estimates, so convergence can be slow and, in the worst case, the model can get trapped in a local minimum
- High variance in the parameter updates (see the check after this list)
- The frequent updates add computational overhead, since the work cannot be vectorized over a batch
- May overshoot even after reaching the global minimum, e.g. when an outlier example produces a large gradient
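To make the "noisy gradient / high variance" point concrete, here is a rough numerical check on the same kind of toy linear-regression data (again with made-up names and values): each per-example gradient is an unbiased estimate of the full-batch gradient, but individual estimates scatter widely around it, which is exactly the noise SGD sees at every step.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=1000)
y = 3.0 * X + 2.0 + rng.normal(scale=0.1, size=X.shape)

w, b = 0.0, 0.0
err = (w * X + b) - y

# Full-batch gradient of the mean squared error with respect to w
full_grad_w = np.mean(err * X)

# Per-example gradients: each one is what a single SGD step would use
per_example_grad_w = err * X

print("full-batch gradient:      ", full_grad_w)
print("mean of per-example grads:", per_example_grad_w.mean())  # matches the full-batch value
print("std of per-example grads: ", per_example_grad_w.std())   # the spread (noise) of single-step estimates
```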