You can access a significantly larger sample of the platform's content for free by logging in with your Gmail account. Sign in now to explore.

Assume that you are trying to estimate a linear model of the form:

\[ y = \mathbf{w} X' + b \] where \(\mathbf{w} \in \mathbb{R}^{1\times M}\) and \(X \in \mathbb{R}^{N\times M}\) (i.e., \(N\) data points, \(M\) features). Estimate this model by using gradient descent. In particular:

- Derive the gradient descent updates
- Write a class that estimates these parameters and returns the final mean squared loss.

The goal in linear regression is to minimize the sum of squared errors. Hence, the loss function we want to minimize can be written as follows:

\[ L = \sum_i (y-\hat{y})^2 \]

where \(y\) is the true value, and \(\hat{y} = b + \mathbf{w}' X\) (\(b\) is a constant). Hence our goal is to find the values of \(b\) and \(\mathbf{w}\) that minimize \(L\).

One way to do this is with gradient descent. First, we need to estimate the gradients wrt \(b\) and \(\mathbf{w}\):

\[ \begin{align} \frac{\partial L}{\partial \mathbf{w}} &= - 2 \sum_i (y - b - \mathbf{w} X') X \\ \frac{\partial L}{\partial b} &= - 2 \sum_i (y - b - \mathbf{w} X') \end{align} \]

Then, once we know the gradients, we can estimate the optimal values for \(b\) and \(\mathbf{w}\) through gradient descent in an iterative matter as follows:

\[ \begin{align} \mathbf{w} &= \mathbf{w} - \lambda \frac{\partial L}{\partial \mathbf{w}} = \mathbf{w} + \lambda \sum_i (y - b - \mathbf{w} X') X \\ b &= b - \lambda \frac{\partial L}{\partial b} = b + \lambda \sum_i (y - b - \mathbf{w} X') \end{align} \] where \(\lambda\) is the learning rate.

**Vectorised approach:** Since we have written the gradient descent updates in the form of vectors, we can also write our code in a similar way, by using vectors and matrices. Numpy is a requirement for this approach. We will code it as follows:

- Create a custom class and name it
`LM_GD`

. Creating a class is not required but it is good practice. In an interview setting, you will certainly lose points if you do not create a class for this problem. - Compartmentalize the actions into functions within the class. In particular, looking in our breakdown equations, there are three main actions:
- Estimate the predictions based on the current values of \(\mathbf{w}\) and \(b\), \(\hat{y} = b + \mathbf{w} X'\).
- Estimate the error \(e = y - \hat{y}\)
- Estimate the loss (summer of squared errors) as per the question’s request

- Finally, we will need to define a max number of iterations and the learning rate \(\lambda\), along with an exit condition.

Based on these points, we design the solution below:

```
class LM_GD:
def __init__(self, learning_rate=0.1,max_iter=1000):
self.max_iter = max_iter
self.learning_rate = learning_rate
# the @ operator multiplies matrices:
# https://numpy.org/doc/stable/reference/generated/numpy.matmul.html
def predict(self,X):
return self.b + self.w @ X.T
# We use `__` in front of the method to show that
# it is not supposed to be accessed outside the class
# See https://stackoverflow.com/questions/17193457/private-methods-in-python
def __get_error(self,y,y_hat):
return y-y_hat
def __calculate_gradients (self,error,X):
return error.T @ X, np.sum(error)
def __calculate_loss(self,error):
return np.sum((error)**2)
def fit(self,X,y):
# initialize weights
self.w = np.random.uniform(low=-1,high=1, size = X.shape[1])
# initialize intercept
self.b = np.random.random()
prev_loss = float("inf")
for i in range(self.max_iter):
y_hat = self.predict(X)
error = self.__get_error(y,y_hat)
gradient_w, gradient_b = self.__calculate_gradients(error,X)
self.w += self.learning_rate * gradient_w/len(X)
self.b += self.learning_rate * gradient_b/len(X)
cur_loss = self.__calculate_loss(error)
if np.isclose(cur_loss,prev_loss,rtol=1E-9):
print(f"Converged in {i} iterations.")
return cur_loss
prev_loss = cur_loss
return cur_loss
```

To run it:

```
## Converged in 609 iterations.
## Final loss: 13.54
```

```
## Coefficients:
## X1 0.299
## X2 0.023
```

`## b -0.184`

If you are wondering whether these results are correct, let’s estimate these coefficients from the `sklearn`

implementation:

```
from sklearn import linear_model
lm = linear_model.LinearRegression()
m = lm.fit(X,y)
print(f"Coefficients:\n X1\t{round(m.coef_[0],3)}\n X2\t{round(m.coef_[1],3)}")
```

```
## Coefficients:
## X1 0.299
## X2 0.023
```

`## b -0.184`

There are a few points worth of mentioning:

- We dropped the “2” from the derivative as it can be absorbed by the learning rate
- We did not discuss the learning rate choice – you should expect that the interviewer will lean on a lot of how we choose the learning rate values etc. This is not the focus of this question so please search for learning rate problems at https://www.madinterview.com.
- We devide the gradient by the size of the input dataset to speed up convergence and avoid overflows.
- In a real interview, a quick and dirty approach to chek whether your code works is to print out the loss at each iteration. The loss should be decreasing over time.
- Unless you feel super comfortable with matrix multiplications, I would avoid implementing this approach during an interview. You might end up in a situation where you can’t figure out which matrix/vector to transpose, and combined with the stress of the interview, you might end up wasting too much time on making the code work. Furthermore, you might get an interviewer who will expect the non-vectorized approach as they might not be super comfortable with linear algebra.

**For-loops approach:** Following up on my last point, let’s see how we can implement this without linear algebra:

```
class LM_GD_For_Loops:
def __init__(self, learning_rate=0.1,iterations=1000):
self.iterations = iterations
self.learning_rate = learning_rate
# this is now a dot product between two vectors, w and X_i
def predict(self,X_i):
return self.b + np.dot(self.w,X_i)
# error is the difference between two scalars
def __get_error(self,y_i,y_hat_i):
return y_i-y_hat_i
# gradients are vector, scalar
def __calculate_gradients (self,error_i,X_i):
return error_i * X_i, error_i
# loss is calculated per point
def __calculate_loss(self,error_i):
return (error_i)**2
def fit(self,X,y):
self.w = np.random.uniform(low=-1,high=1, size = X.shape[1])
self.b = np.random.random()
prev_loss = float("inf")
for i in range(self.iterations):
cur_loss = 0
gradient_w, gradient_b = 0,0
for X_i,y_i in zip(X,y):
y_hat_i = self.predict(X_i)
error_i = self.__get_error(y_i,y_hat_i)
pointg_w, pointg_b = self.__calculate_gradients(error_i,X_i)
gradient_w += pointg_w
gradient_b += pointg_b
cur_loss += self.__calculate_loss(error_i)
self.w += self.learning_rate * gradient_w/len(X)
self.b += self.learning_rate * gradient_b/len(X)
if np.isclose(cur_loss,prev_loss,rtol=1E-9):
print(f"Converged in {i} iterations.")
return cur_loss
prev_loss = cur_loss
return cur_loss
```

To run it:

```
## Converged in 609 iterations.
## Final loss: 13.54
```

```
## Coefficients:
## X1 0.299
## X2 0.023
```

`## b -0.184`

**No-intercept approach:** For those of us who took an econometrics class, it might seem redundant that I am calculating the coefficients of \(b\) and \(\mathbf{w}\) separately. Indeed, we can add a column of ones in our matrix \(X\) and drop the \(b\) calculations. Adjusting the vectorized approach, the code would have been:

```
class LM_GD_No_Intercept:
def __init__(self, learning_rate=0.1,max_iter=1000):
self.max_iter = max_iter
self.learning_rate = learning_rate
def predict(self,X):
return self.w @ X.T
def __get_error(self,y,y_hat):
return y-y_hat
def __calculate_gradients (self,error,X):
return error.T @ X
def __calculate_loss(self,error):
return np.sum((error)**2)
def fit(self,X,y):
# initialize weights
self.w = np.random.uniform(low=-1,high=1, size = X.shape[1])
prev_loss = float("inf")
for i in range(self.max_iter):
y_hat = self.predict(X)
error = self.__get_error(y,y_hat)
gradient_w = self.__calculate_gradients(error,X)
self.w += self.learning_rate * gradient_w/len(X)
cur_loss = self.__calculate_loss(error)
if np.isclose(cur_loss,prev_loss,rtol=1E-9):
print(f"Converged in {i} iterations.")
return cur_loss
prev_loss = cur_loss
return cur_loss
```

To run it:

```
np.random.seed(1234)
c = LM_GD_No_Intercept()
X_with_ones = np.c_[np.ones(X.shape[0]), X]
print(f"Final loss: {c.fit(X_with_ones,y):.2f}")
```

```
## Converged in 467 iterations.
## Final loss: 13.54
```

```
## Coefficients:
## b -0.184
## X1 0.299
## X2 0.023
```

Despite the no-intercept approach being the best in my opinion, I would be very cautious using it in an interview. It is very likely your interviewer did not take an econometrics class and won’t follow your logic. It has happened to me personally and I had to switch mid-interview.

Gradient descent, Linear regression

- Principal Component Analysis (PCA) from scratch Medium (Principal Component Analysis (PCA))
- Naive Bayes from scratch Medium (Gaussian Naive Bayes)
- K-means from scratch Medium (k-means)
- Linear regression with stochastic gradient descent Medium (Stochastic Gradient descent, Linear regression)
- Logistic regression with gradient descent Medium (Gradient descent, Logistic regression)