Stochastic Gradient Descent (SGD) is an optimization algorithm that updates model parameters using the gradient computed from a single randomly selected training example at each iteration, rather than from the entire dataset as in (batch) Gradient Descent.
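In symbols, for a randomly drawn example \((x_i, y_i)\) and learning rate \(\eta\), the per-sample update takes the standard form (notation introduced here for illustration):

$$\theta \leftarrow \theta - \eta \, \nabla_\theta L(\theta;\, x_i, y_i)$$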
Why do we use SGD?
- It allows efficient optimization when working with large datasets, since computing the gradient over the entire dataset at every step is expensive.
- It introduces randomness, which can help the optimizer escape local minima.
Key Characteristics
- Update Rule: Parameters are updated for each sample using its gradient contribution (see the sketch after this list).
- Objective: Minimize the loss function efficiently without processing the full dataset at every step.
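A minimal sketch of this per-sample update loop, assuming a simple linear regression with squared-error loss (the function name, data, and hyperparameters are illustrative, not taken from the original):

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.01, epochs=10, seed=0):
    """Fit y ≈ X @ w + b with per-sample SGD updates (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        # Shuffle so each epoch visits the samples in a new random order.
        for i in rng.permutation(n_samples):
            pred = X[i] @ w + b
            error = pred - y[i]          # gradient of 0.5 * error**2 w.r.t. pred
            w -= lr * error * X[i]       # gradient step for the weights
            b -= lr * error              # gradient step for the bias
    return w, b

# Example usage on synthetic data.
X = np.random.randn(200, 3)
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 3.0
w, b = sgd_linear_regression(X, y, lr=0.05, epochs=20)
print(w, b)  # should be close to true_w and 3.0
```

Note that each update touches only one example, which is what keeps the per-step cost independent of the dataset size.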
Pros:
- Fast parameter updates.
- Handles large-scale and streaming data well.
Cons:
- Noisy updates cause the loss to fluctuate from step to step (high variance).
- Requires techniques such as learning rate scheduling or momentum for stable convergence (see the momentum sketch below).
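As one illustration of the momentum idea mentioned above, a velocity term can smooth the noisy per-sample gradients. This is a common formulation shown under assumed hyperparameters, not a prescription from this article:

```python
import numpy as np

def sgd_momentum_step(w, velocity, grad, lr=0.01, beta=0.9):
    """One momentum-SGD update: accumulate an exponentially decaying
    average of past gradients, then step in that smoothed direction."""
    velocity = beta * velocity - lr * grad
    w = w + velocity
    return w, velocity

# Example: repeated steps on the 1-D quadratic f(w) = w**2, whose gradient is 2*w.
w, velocity = 5.0, 0.0
for _ in range(200):
    noisy_grad = 2.0 * w + np.random.randn() * 0.1  # simulate a noisy per-sample gradient
    w, velocity = sgd_momentum_step(w, velocity, noisy_grad, lr=0.1)
print(w)  # approaches the minimum at 0 despite the gradient noise
```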