Specifically, for a parameter \(\theta\), gradient descent works by stepping against the gradient of the loss function \(L\) with respect to \(\theta\), scaled by a learning rate \(\lambda\):
\[ \theta_{t+1} = \theta_{t} - \lambda \frac{\partial{L}} {\partial \theta_t} \]
where \(L\) is calculated across all data points:
\[ L(\theta_t) = \frac{1}{N} \sum_i l(y_i, f(x_i,\theta_t)) \]
In the above, \(N\) is the number of data points and \(l(y_i, f(x_i,\theta_t))\) is the error of the prediction \(f(x_i,\theta_t)\) for the single data point \((x_i, y_i)\).
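To make the update rule concrete, here is a minimal sketch of full-batch gradient descent. It assumes, purely for illustration, a one-parameter linear model \(f(x,\theta) = \theta x\) and a squared-error loss \(l = (f(x_i,\theta) - y_i)^2\); the function name, the toy data, and the learning rate are hypothetical choices, not something prescribed by the method.

```python
import numpy as np

def batch_gradient_descent(x, y, theta0=0.0, lr=0.05, steps=100):
    """Full-batch gradient descent for the assumed model f(x, theta) = theta * x."""
    theta = theta0
    for _ in range(steps):
        residual = theta * x - y            # f(x_i, theta) - y_i for every point
        grad = 2.0 * np.mean(residual * x)  # dL/dtheta, averaged over all N points
        theta = theta - lr * grad           # theta_{t+1} = theta_t - lambda * dL/dtheta
    return theta

# Toy data generated with a true parameter of 3 (noise-free, for illustration only).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x
theta_batch = batch_gradient_descent(x, y)
```

Note that every single update touches all \(N\) points, which is what makes each step expensive on large data sets.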
Using the same notation, in stochastic gradient descent the update uses the loss of a single data point \(i\) instead of the full loss \(L\):
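\[ \theta_{t+1} = \theta_{t} - \lambda \frac{\partial{l_i}} {\partial \theta_t} \]
where \(l_i = l(y_i, f(x_i,\theta_t))\). Stochastic gradient descent tends to make progress faster, since it updates at every point rather than once per pass over the whole data set. However, exactly because it updates at every point, each step follows a noisy estimate of the true gradient, so it might end up oscillating around a minimum without converging.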
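The per-point update can be sketched in the same way. As before, this assumes a hypothetical one-parameter model \(f(x,\theta) = \theta x\) with squared-error loss; visiting the points in a shuffled order each epoch, the fixed seed, and the learning rate are illustrative assumptions as well.

```python
import numpy as np

def stochastic_gradient_descent(x, y, theta0=0.0, lr=0.01, epochs=50, seed=0):
    """SGD for the assumed model f(x, theta) = theta * x, one point per update."""
    rng = np.random.default_rng(seed)
    theta = theta0
    for _ in range(epochs):
        # Visit the points in a fresh random order on every pass.
        for i in rng.permutation(len(x)):
            grad_i = 2.0 * (theta * x[i] - y[i]) * x[i]  # dl_i/dtheta for point i only
            theta = theta - lr * grad_i                  # update after every point
    return theta

# Same noise-free toy data as a sanity check (true parameter is 3).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x
theta_sgd = stochastic_gradient_descent(x, y)
```

On this noise-free toy data every per-point gradient vanishes at the same \(\theta\), so the iterates settle. With noisy data and a constant \(\lambda\), the per-point gradients disagree near the minimum and the iterates keep moving around it, which is the oscillation described above; shrinking \(\lambda\) over time is a common remedy.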