Specifically, for a parameter \(\theta\), gradient descent works by repeatedly stepping in the direction opposite to the gradient of the loss function \(L\) with respect to \(\theta\):
\[ \theta_{t+1} = \theta_{t} - \lambda \frac{\partial{L}} {\partial \theta_t} \]
where \(L\) is averaged over all \(N\) data points:
\[ L(\theta_t) = \frac{1}{N} \sum_i l(y_i, f(x_i,\theta_t)) \] In the above, \(l(y_i, f(x_i,\theta_t))\) is the error of a single point prediction \(f(x_i,\theta_t)\).
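As a minimal sketch of the full-batch update above, consider one-dimensional linear regression \(f(x_i,\theta) = \theta x_i\) with squared error. The synthetic data, learning rate, and iteration count below are illustrative assumptions, not part of the original text:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
x = rng.normal(size=N)
y = 3.0 * x + rng.normal(scale=0.1, size=N)  # true slope is 3

theta = 0.0
lam = 0.1  # learning rate (lambda in the text)

for _ in range(200):
    # L(theta) = (1/N) * sum_i (y_i - theta * x_i)^2
    # dL/dtheta = (1/N) * sum_i -2 * x_i * (y_i - theta * x_i)
    grad = (-2.0 / N) * np.sum(x * (y - theta * x))
    theta = theta - lam * grad  # one update uses the whole dataset

print(theta)  # converges near the true slope 3
```

Each iteration touches all \(N\) points once before making a single parameter update, which is exactly what makes full-batch gradient descent stable but expensive on large datasets.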
Using the same notation, in stochastic gradient descent the update uses the loss \(l_i = l(y_i, f(x_i,\theta_t))\) of a single randomly chosen point \(i\):
\[ \theta_{t+1} = \theta_{t} - \lambda \frac{\partial l_i}{\partial \theta_t} \] Stochastic gradient descent tends to converge faster, since it updates the parameters after every single point. However, exactly because it updates at every point, it may end up oscillating around the minimum without converging.
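The per-point update can be sketched on the same illustrative linear-regression setup (data, learning rate, and epoch count are assumptions for the example, not prescribed values):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
x = rng.normal(size=N)
y = 3.0 * x + rng.normal(scale=0.1, size=N)  # true slope is 3

theta = 0.0
lam = 0.05  # learning rate

for epoch in range(20):
    for i in rng.permutation(N):  # visit points in random order
        # gradient of the single-point loss l_i = (y_i - theta * x_i)^2
        grad_i = -2.0 * x[i] * (y[i] - theta * x[i])
        theta = theta - lam * grad_i  # update after every point

print(theta)  # hovers near 3, but with per-step noise
```

Because each step follows a noisy single-point gradient, \(\theta\) bounces around the minimum rather than settling exactly on it; in practice the learning rate is often decayed over time to damp this oscillation.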