
Gradient descent vs. stochastic gradient descent

Machine Learning · Medium · Seen in real interview

Describe gradient descent and the motivation behind stochastic gradient descent.

In gradient descent, we use the whole training set for every update of the parameters of interest (one update per epoch). In stochastic gradient descent, however, we update the parameters after every single training example.

Specifically, assuming a parameter \(\theta\), gradient descent moves \(\theta\) in the direction opposite to the gradient (slope) of the loss function \(L\) with respect to \(\theta\):

\[ \theta_{t+1} = \theta_{t} - \lambda \frac{\partial{L}} {\partial \theta_t} \]

where \(\lambda\) is the learning rate and \(L\) is calculated across all \(N\) data points:

\[ L(\theta_t) = \frac{1}{N} \sum_i l(y_i, f(x_i,\theta_t)) \]

In the above, \(l(y_i, f(x_i,\theta_t))\) is the error of a single-point prediction \(f(x_i,\theta_t)\).
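As a concrete illustration, here is a minimal NumPy sketch of full-batch gradient descent for linear regression with squared loss. The synthetic data, learning rate, and epoch count are assumptions made purely for the example, not part of the question.

```python
import numpy as np

# Illustrative synthetic data (assumption): N = 100 points, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

theta = np.zeros(3)   # parameters of interest
lr = 0.1              # learning rate (lambda in the formula above)

for epoch in range(200):
    preds = X @ theta
    # Gradient of L(theta) = (1/N) * sum_i (y_i - x_i . theta)^2
    grad = (2.0 / len(X)) * X.T @ (preds - y)
    theta -= lr * grad  # one update per epoch, using ALL N points
```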

Using the same notation, in stochastic gradient descent the update is:

\[ \theta_{t+1} = \theta_{t} - \lambda \frac{\partial{l_i}} {\partial \theta_t} \]

where \(l_i = l(y_i, f(x_i,\theta_t))\) is the loss on a single (typically randomly chosen) training example. Stochastic gradient descent tends to converge faster, since it updates at every point. However, exactly because each update is based on a single point, it might end up oscillating around the minimum without converging.
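For comparison, here is a minimal sketch of stochastic gradient descent on the same assumed problem, updating \(\theta\) after each individual example:

```python
import numpy as np

# Same illustrative synthetic data as before (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

theta = np.zeros(3)
lr = 0.01

for epoch in range(20):
    for i in rng.permutation(len(X)):  # visit points in random order
        # Gradient of the single-point loss l_i = (y_i - x_i . theta)^2
        grad_i = 2.0 * X[i] * (X[i] @ theta - y[i])
        theta -= lr * grad_i           # one update per training example
```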


Mini-batch gradient descent lies between these two extremes: each parameter update uses a mini-batch (a small subset) of the training data. A common rule of thumb is to pick a mini-batch size that is a power of 2, such as 32, 64, or 128, because many vectorized operation implementations run faster when their inputs are sized in powers of 2.
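Finally, a minimal sketch of mini-batch gradient descent under the same assumptions, using an assumed batch size of 32:

```python
import numpy as np

# Same illustrative synthetic data as before (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

theta = np.zeros(3)
lr = 0.05
batch_size = 32  # power of 2, per the rule of thumb above

for epoch in range(50):
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        preds = X[idx] @ theta
        # Average gradient over the mini-batch only
        grad = (2.0 / len(idx)) * X[idx].T @ (preds - y[idx])
        theta -= lr * grad
```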



Topics

Gradient descent, Stochastic gradient descent, Minibatch