By StatQuest with Josh Starmer
Stochastic Gradient Descent (SGD) Overview
📌 Stochastic Gradient Descent (SGD) is introduced as an alternative to standard Gradient Descent, especially for complex models with large datasets.
📐 Standard gradient descent uses the entire dataset to calculate derivatives and update parameters (like the intercept and slope in a line fit, $y = mx + b$).
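📈 For a model such as logistic regression with 23,000 genes and 1 million samples, each step of standard gradient descent evaluates one term per sample for each of the 23,000 derivatives, or roughly 23 billion terms per step (23,000 × 1,000,000). Over the many steps a fit typically needs, the total runs into the trillions, which makes the method computationally slow.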
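The cost described above can be made concrete with a minimal sketch. It assumes the loss is the sum of squared residuals for a line fit (the summary does not name the loss), and the function names and the three data points are illustrative, not taken from the video.

```python
def full_batch_gradient(xs, ys, intercept, slope):
    """Derivatives of the sum of squared residuals with respect to the
    intercept and the slope, using every sample.
    Each derivative contains one term per sample."""
    d_intercept = sum(-2 * (y - (intercept + slope * x)) for x, y in zip(xs, ys))
    d_slope = sum(-2 * x * (y - (intercept + slope * x)) for x, y in zip(xs, ys))
    return d_intercept, d_slope


def terms_per_step(n_parameters, n_samples):
    """Number of terms evaluated in one full-batch step:
    one term per sample for each parameter's derivative."""
    return n_parameters * n_samples


# Illustrative data (an assumption): three points lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0]
ys = [1.0, 3.0, 5.0]

grad_at_start = full_batch_gradient(xs, ys, intercept=0.0, slope=0.0)
small_example_terms = terms_per_step(2, 3)               # line fit, 3 points
large_example_terms = terms_per_step(23_000, 1_000_000)  # gene example
```

With two parameters and three points, a step needs only 6 terms; the same count for the gene example is 23,000,000,000 terms per step.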
Mechanism of Stochastic Gradient Descent
🎲 SGD addresses computational load by randomly picking one sample per step (or a small subset, known as a mini-batch) to calculate derivatives and update parameters.
🎯 In the simple example with 3 data points, SGD cuts the number of terms computed per step by a factor of 3; with 1 million samples, the reduction is a factor of 1 million.
📉 SGD is particularly beneficial when the data contain redundancies, such as clusters of similar points: a single sample drawn from a cluster carries much the same information as the rest of that cluster, so one sample is enough to compute the step.
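The single-sample mechanism can be sketched as follows. This is a minimal illustration, again assuming a squared-residual loss for a line fit; the starting guesses, the fixed learning rate, the step count, and the data are all assumptions chosen so the example converges.

```python
import random


def sgd_step(x, y, intercept, slope, learning_rate):
    """One SGD update computed from a single sample (x, y)."""
    residual = y - (intercept + slope * x)
    d_intercept = -2 * residual      # derivative of the squared residual
    d_slope = -2 * x * residual
    return (intercept - learning_rate * d_intercept,
            slope - learning_rate * d_slope)


def fit_sgd(xs, ys, n_steps=5000, learning_rate=0.01, seed=0):
    """Fit a line by picking one random sample per step."""
    rng = random.Random(seed)        # seeded so the run is repeatable
    intercept, slope = 0.0, 1.0      # initial guesses (an assumption)
    for _ in range(n_steps):
        i = rng.randrange(len(xs))   # randomly pick one sample
        intercept, slope = sgd_step(xs[i], ys[i], intercept, slope, learning_rate)
    return intercept, slope


# Illustrative data (an assumption): three points lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0]
ys = [1.0, 3.0, 5.0]
intercept, slope = fit_sgd(xs, ys)
```

Each step here evaluates 2 terms instead of 6, which is the factor-of-3 saving mentioned above. Because the illustrative points lie exactly on a line, the estimates settle close to an intercept of 1 and a slope of 2; noisy data would leave the estimates jittering around the best fit unless the learning rate shrinks.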
Learning Rate and Mini-Batching
🚦 SGD is sensitive to the learning rate. The general strategy is to start with a relatively large learning rate and make it smaller with each step; the rule for how it changes from step to step is called the schedule.
🧩 While the strict definition uses only one sample, it is more common to use a mini-batch (a small subset, e.g., 3 samples) per step.
🌟 Using a mini-batch balances the stability of using all the data with the speed of using a single sample, often giving more stable parameter estimates than single-sample updates and reaching them in fewer steps.
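A mini-batch update combined with a decreasing learning rate might look like the sketch below. The schedule formula, its constants, the batch size, and the data are hypothetical choices for illustration; the summary does not specify a particular schedule. The gradient is averaged over the batch here, which is one common convention and also an assumption.

```python
import random


def learning_rate_at(step, initial_rate=0.05, decay=0.001):
    """Hypothetical schedule: start relatively large, shrink as steps accumulate."""
    return initial_rate / (1.0 + decay * step)


def minibatch_gradient(batch, intercept, slope):
    """Average squared-residual gradient over the samples in the batch."""
    d_intercept = sum(-2 * (y - (intercept + slope * x)) for x, y in batch)
    d_slope = sum(-2 * x * (y - (intercept + slope * x)) for x, y in batch)
    n = len(batch)
    return d_intercept / n, d_slope / n


def fit_minibatch(data, batch_size=3, n_steps=3000, seed=0):
    """Fit a line using a small random subset of the data per step."""
    rng = random.Random(seed)
    intercept, slope = 0.0, 1.0                # initial guesses (an assumption)
    for step in range(n_steps):
        batch = rng.sample(data, batch_size)   # small random subset
        d_intercept, d_slope = minibatch_gradient(batch, intercept, slope)
        rate = learning_rate_at(step)
        intercept -= rate * d_intercept
        slope -= rate * d_slope
    return intercept, slope


# Illustrative data (an assumption): 30 points lying exactly on y = 1 + 2x.
data = [(k / 10.0, 1.0 + 2.0 * (k / 10.0)) for k in range(30)]
intercept, slope = fit_minibatch(data)
```

If convergence looks poor in a setup like this, `initial_rate` and `decay` are the two knobs to adjust first.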
Parameter Updates with New Data
🔄 A significant advantage of SGD is the ability to easily update parameter estimates when new data arrives, without restarting the entire process from the initial guesses.
🚀 The process picks up from the most recent estimates, using the new sample(s) to calculate the next step for the slope and intercept.
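Picking up from existing estimates when a new sample arrives can be sketched as a single extra step. The earlier estimates, the new data point, and the learning rate below are made-up values for illustration, and the loss is again assumed to be the squared residual.

```python
def update_with_new_sample(x, y, intercept, slope, learning_rate=0.01):
    """Take one more step from the current estimates using only the new sample."""
    residual = y - (intercept + slope * x)
    intercept += learning_rate * 2 * residual       # minus rate * (-2 * residual)
    slope += learning_rate * 2 * x * residual       # minus rate * (-2 * x * residual)
    return intercept, slope


# Hypothetical estimates left over from an earlier fit, plus one new observation.
intercept, slope = 0.9, 2.1
new_x, new_y = 3.0, 7.0

error_before = (new_y - (intercept + slope * new_x)) ** 2
intercept, slope = update_with_new_sample(new_x, new_y, intercept, slope)
error_after = (new_y - (intercept + slope * new_x)) ** 2
```

No earlier data is revisited: the update touches only the new sample, and the squared error on that sample shrinks after the step.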
Key Points & Insights
➡️ SGD is ideal for Big Data and complex models where standard Gradient Descent is computationally infeasible due to the massive number of terms required for each step.
➡️ If model convergence is poor, try adjusting the learning rate schedule, which dictates how the learning rate changes from large to small across steps.
➡️ Using a mini-batch (a small subset of data per step) provides a practical balance, offering faster computation than full batch updates while maintaining parameter stability.
➡️ SGD facilitates incremental learning, allowing model parameters to be easily updated using new data points without needing to reprocess the entire dataset.
📸 Video summarized with SummaryTube.com on Feb 08, 2026, 21:05 UTC
Full video URL: youtube.com/watch?v=vMh0zPT0tLI
Duration: 10:46
