
Describe gradient descent and the motivations behind stochastic gradient descent.

In gradient descent, we use the whole training set per epoch to update the parameters of interest. In stochastic gradient descent, by contrast, we update the parameters after every single training example.
**Mini-batch** gradient descent lies between these two extremes: each parameter update uses a **mini-batch** (a small portion) of the training data. A common rule of thumb is to choose the mini-batch size as a power of 2, such as 32, 64, or 128, because many vectorized operation implementations run faster when their inputs are sized in powers of 2.
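
As an illustration, here is a minimal NumPy sketch of a mini-batch update loop. The toy data, the `grad_loss` helper, the learning rate, and the batch size of 64 are all hypothetical choices for this sketch, not prescribed by the question:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                              # toy features
y = X @ np.arange(1.0, 6.0) + 0.1 * rng.normal(size=1000)   # toy targets

def grad_loss(theta, Xb, yb):
    # Gradient of the mean squared error over the batch (Xb, yb).
    return 2.0 / len(yb) * Xb.T @ (Xb @ theta - yb)

theta, lr, batch_size = np.zeros(5), 0.01, 64   # 64: a power of 2
for epoch in range(50):
    idx = rng.permutation(len(y))               # reshuffle each epoch
    for start in range(0, len(y), batch_size):
        b = idx[start:start + batch_size]       # one mini-batch
        theta -= lr * grad_loss(theta, X[b], y[b])
```

Setting `batch_size = len(y)` recovers full-batch gradient descent, and `batch_size = 1` recovers stochastic gradient descent.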

Specifically, assuming a parameter \(\theta\), gradient descent works by moving against the gradient (i.e., down the slope) of the loss function \(L\) with respect to \(\theta\):

\[ \theta_{t+1} = \theta_{t} - \lambda \frac{\partial{L}} {\partial \theta_t} \]

where \(L\) is calculated across all data points:

\[ L(\theta_t) = \frac{1}{N} \sum_i l(y_i, f(x_i,\theta_t)) \]

Here, \(l(y_i, f(x_i,\theta_t))\) is the error of the prediction \(f(x_i,\theta_t)\) for a single point.
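
For instance, with squared error \(l(y_i, f(x_i,\theta_t)) = (y_i - \theta_t x_i)^2\) for a scalar model \(f(x_i,\theta_t) = \theta_t x_i\) (an illustrative choice, not implied by the question), the full-batch gradient is

\[ \frac{\partial L}{\partial \theta_t} = -\frac{2}{N} \sum_i x_i (y_i - \theta_t x_i), \]

so every single update has to touch all \(N\) points.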

Using the same notation, in stochastic gradient descent the update becomes:

\[ \theta_{t+1} = \theta_{t} - \lambda \frac{\partial{l_i}} {\partial \theta_t} \]

where \(l_i = l(y_i, f(x_i,\theta_t))\) is the loss at a single (typically randomly chosen) point \(i\). Stochastic gradient descent tends to converge faster, since it updates at every point. However, precisely because it updates at every point, it may end up oscillating around the minimum without converging.
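
A minimal NumPy sketch of the stochastic update, one example per step (the toy data and learning rate are illustrative assumptions, as above):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                              # toy features
y = X @ np.arange(1.0, 6.0) + 0.1 * rng.normal(size=1000)   # toy targets

theta, lr = np.zeros(5), 0.01
for epoch in range(10):
    for i in rng.permutation(len(y)):        # visit points in random order
        residual = X[i] @ theta - y[i]       # error of a single prediction
        theta -= lr * 2.0 * residual * X[i]  # gradient of l_i only
```

In practice, a decaying learning-rate schedule is the standard remedy for the oscillation mentioned above.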

Find more here: https://cs231n.github.io/optimization-1/#gd

Gradient descent, Stochastic gradient descent, Minibatch
