
By StatQuest with Josh Starmer
Stochastic Gradient Descent (SGD) Overview
📌 Stochastic Gradient Descent (SGD) is introduced as an alternative to standard Gradient Descent, especially for complex models with large datasets.
📐 Standard gradient descent uses the entire dataset to calculate derivatives and update parameters (like the intercept and slope in a line fit, $y = mx + b$).
📈 For models like logistic regression with 23,000 genes and 1 million samples, standard gradient descent requires calculating roughly 23 billion terms per step (23,000 derivatives, each summing over 1 million samples), making it computationally slow.
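The computational cost described above is easiest to see in code. Below is a minimal sketch of standard (full-batch) gradient descent fitting a slope and intercept by least squares; the data points and hyperparameters are illustrative, not taken from the video. Note that each derivative is a sum over all n samples, which is exactly what becomes expensive at scale.

```python
# Full-batch gradient descent for fitting y = m*x + b by least squares.
# Every step uses ALL samples to compute the derivatives.

def full_batch_gd(xs, ys, lr=0.01, steps=5000):
    m, b = 0.0, 0.0  # initial guesses for slope and intercept
    for _ in range(steps):
        # Derivatives of the sum of squared residuals w.r.t. b and m:
        # each derivative is a sum over all n samples (n terms per step).
        db = sum(-2 * (y - (m * x + b)) for x, y in zip(xs, ys))
        dm = sum(-2 * x * (y - (m * x + b)) for x, y in zip(xs, ys))
        b -= lr * db
        m -= lr * dm
    return m, b

# Illustrative 3-point dataset:
m, b = full_batch_gd([0.5, 2.3, 2.9], [1.4, 1.9, 3.2])
```

With only 3 points this is cheap, but the two `sum(...)` calls are what blow up to billions of terms per step when n is in the millions.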
Mechanism of Stochastic Gradient Descent
🎲 SGD addresses computational load by randomly picking one sample per step (or a small subset, known as a mini-batch) to calculate derivatives and update parameters.
🎯 In the simple example (3 data points), SGD reduced the number of terms computed by a factor of 3 per step; for 1 million samples, this factor is 1 million.
📉 SGD is particularly beneficial when the data contain redundancies: a single sample drawn from a cluster of similar points can stand in for the whole cluster when computing a step.
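The single-sample mechanism above can be sketched as follows. This is an illustrative implementation, not the video's code: each step picks one random sample, so each derivative has a single term instead of n terms.

```python
import random

# Stochastic gradient descent for y = m*x + b:
# each step uses ONE randomly picked sample to update the parameters.

def sgd_one_sample(xs, ys, lr=0.01, steps=20000, seed=0):
    rng = random.Random(seed)
    m, b = 0.0, 0.0
    for _ in range(steps):
        i = rng.randrange(len(xs))      # pick one sample at random
        x, y = xs[i], ys[i]
        resid = y - (m * x + b)
        b -= lr * (-2 * resid)          # one-term derivative w.r.t. intercept
        m -= lr * (-2 * x * resid)      # one-term derivative w.r.t. slope
    return m, b

m, b = sgd_one_sample([0.5, 2.3, 2.9], [1.4, 1.9, 3.2])
```

With a constant learning rate the estimates hover near the least-squares fit rather than settling exactly on it, which is why the learning-rate schedule discussed next matters.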
Learning Rate and Mini-Batching
🚦 SGD is sensitive to the learning rate; the general strategy is to start relatively large and decrease it over time, following a learning-rate schedule.
🧩 While the strict definition uses only one sample, it is more common to use a mini-batch (a small subset, e.g., 3 samples) per step.
🌟 Using a mini-batch balances stability (like using all data) with speed (like using a single sample), often resulting in more stable parameter estimates in fewer steps.
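Putting the two ideas together, here is a minimal sketch of mini-batch SGD with a decaying learning-rate schedule. The batch size of 3, the `1 / (1 + 0.01*t)` decay, and the noiseless line used as data are all illustrative choices, not specifics from the video.

```python
import random

# Mini-batch SGD with a simple decaying learning-rate schedule:
# each step uses a small random subset, and the learning rate
# starts relatively large and shrinks over time.

def minibatch_sgd(xs, ys, batch=3, lr0=0.05, steps=5000, seed=0):
    rng = random.Random(seed)
    m, b = 0.0, 0.0
    for t in range(steps):
        lr = lr0 / (1 + 0.01 * t)       # schedule: large early, small later
        idx = rng.sample(range(len(xs)), batch)
        db = sum(-2 * (ys[i] - (m * xs[i] + b)) for i in idx) / batch
        dm = sum(-2 * xs[i] * (ys[i] - (m * xs[i] + b)) for i in idx) / batch
        b -= lr * db
        m -= lr * dm
    return m, b

xs = [i / 10 for i in range(20)]
ys = [2 * x + 1 for x in xs]            # noiseless line y = 2x + 1
m, b = minibatch_sgd(xs, ys)
```

Averaging the gradient over the mini-batch keeps each step cheap while damping the step-to-step noise of single-sample updates, which is the stability/speed trade-off described above.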
Parameter Updates with New Data
🔄 A significant advantage of SGD is the ability to easily update parameter estimates when new data arrives, without restarting the entire process from the initial guesses.
🚀 The process picks up from the most recent estimates, using the new sample(s) to calculate the next step for the slope and intercept.
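The incremental-update idea can be shown in a few lines. This sketch assumes we already have slope/intercept estimates from earlier training (the starting values and new sample below are made up for illustration): when a new point arrives, we take one SGD step from where we left off instead of refitting from scratch.

```python
# One SGD update on a single new sample, continuing from existing estimates.

def sgd_step(m, b, x, y, lr):
    resid = y - (m * x + b)
    m_new = m - lr * (-2 * x * resid)   # step for the slope
    b_new = b - lr * (-2 * resid)       # step for the intercept
    return m_new, b_new

# Suppose training so far left us with these estimates:
m, b = 0.6, 0.9
# A new sample arrives; pick up from the most recent estimates:
m, b = sgd_step(m, b, x=3.1, y=2.9, lr=0.01)
```

Nothing about the earlier data needs to be revisited; the new sample simply drives the next step.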
Key Points & Insights
➡️ SGD is ideal for Big Data and complex models where standard Gradient Descent is computationally infeasible due to the massive number of terms required for each step.
➡️ If model convergence is poor, try adjusting the learning rate schedule, which dictates how the learning rate changes from large to small across steps.
➡️ Using a mini-batch (a small subset of data per step) provides a practical balance, offering faster computation than full batch updates while maintaining parameter stability.
➡️ SGD facilitates incremental learning, allowing model parameters to be easily updated using new data points without needing to reprocess the entire dataset.
📸 Video summarized with SummaryTube.com on Feb 08, 2026, 21:05 UTC
Full video URL: youtube.com/watch?v=vMh0zPT0tLI
Duration: 10:46
