Specifically, assuming a parameter \(\theta\), gradient descent works by repeatedly stepping in the direction opposite to the slope (gradient) of the loss function \(L\) with respect to \(\theta\):
\[ \theta_{t+1} = \theta_{t} - \lambda \frac{\partial{L}} {\partial \theta_t} \]
where \(L\) is calculated across all data points:
\[ L(\theta_t) = \frac{1}{N} \sum_i l(y_i, f(x_i,\theta_t)) \] In the above, \(l(y_i, f(x_i,\theta_t))\) is the error of the prediction \(f(x_i,\theta_t)\) for the single point \((x_i, y_i)\).
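To make this concrete, here is a minimal sketch in Python/NumPy. It assumes, purely for illustration, a linear model \(f(x,\theta) = x \cdot \theta\) and squared error for \(l\); the function names are hypothetical and not taken from any particular library.

```python
import numpy as np

def predict(X, theta):
    # f(x_i, theta): a linear model, standing in for any differentiable model
    return X @ theta

def full_loss_gradient(X, y, theta):
    # Gradient of L(theta) = (1/N) * sum_i (f(x_i, theta) - y_i)^2 wrt theta
    N = len(y)
    residuals = predict(X, theta) - y
    return (2.0 / N) * (X.T @ residuals)

def gradient_descent(X, y, theta, lr=0.01, steps=100):
    # One parameter update per pass over the *entire* dataset
    for _ in range(steps):
        theta = theta - lr * full_loss_gradient(X, y, theta)
    return theta
```

Note that every single update requires a full pass over all \(N\) points, which is what motivates the stochastic variant.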
Using the same notation, in stochastic gradient descent the update at step \(t\) uses the loss \(l_i = l(y_i, f(x_i,\theta_t))\) of a single, randomly chosen point \(i\):
\[ \theta_{t+1} = \theta_{t} - \lambda \frac{\partial{l_i}} {\partial \theta_t} \] Stochastic gradient descent tends to converge faster, since it updates the parameters after every single point rather than after a full pass over the data. However, exactly because it updates at every point, the individual gradients are noisy and it might end up oscillating around the minimum without ever settling.
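Continuing the sketch above (same illustrative linear model and squared error, hypothetical names), the stochastic version visits the points one at a time in random order:

```python
import numpy as np

def stochastic_gradient_descent(X, y, theta, lr=0.01, epochs=10, seed=0):
    # One parameter update per *single* data point, visited in random order
    rng = np.random.default_rng(seed)
    N = len(y)
    for _ in range(epochs):
        for i in rng.permutation(N):
            # Gradient of the single-point loss l_i = (f(x_i, theta) - y_i)^2
            grad_i = 2.0 * (X[i] @ theta - y[i]) * X[i]
            theta = theta - lr * grad_i
    return theta
```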
Minibatch gradient descent sits between the two: each update averages the gradient over a small batch of points, rather than over all \(N\) points (gradient descent) or a single point (stochastic gradient descent), which smooths out the noise of the per-point updates while keeping most of their speed.
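A sketch of the minibatch variant under the same assumptions (illustrative linear model, squared error, hypothetical names; the batch size of 32 is an arbitrary example):

```python
import numpy as np

def minibatch_gradient_descent(X, y, theta, lr=0.01, epochs=10, batch_size=32, seed=0):
    # One parameter update per small batch of points
    rng = np.random.default_rng(seed)
    N = len(y)
    for _ in range(epochs):
        order = rng.permutation(N)
        for start in range(0, N, batch_size):
            batch = order[start:start + batch_size]
            # Average gradient over the batch: a less noisy estimate of dL/dtheta
            residuals = X[batch] @ theta - y[batch]
            grad = (2.0 / len(batch)) * (X[batch].T @ residuals)
            theta = theta - lr * grad
    return theta
```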