
Linear regression with gradient descent

Machine Learning | Coding | Medium | Seen in a real interview

Assume that you are trying to estimate a linear model of the form:

\[ y = \mathbf{w} X' + b \] where \(\mathbf{w} \in \mathbb{R}^{1\times M}\) and \(X \in \mathbb{R}^{N\times M}\) (i.e., \(N\) data points, \(M\) features). Estimate this model by using gradient descent. In particular:

  • Derive the gradient descent updates
  • Write a class that estimates these parameters and returns the final mean squared loss.
The goal in linear regression is to minimize the sum of squared errors. Hence, the loss function we want to minimize can be written as follows:

\[ L = \sum_i (y-\hat{y})^2 \]

where \(y\) is the true value, and \(\hat{y} = b + \mathbf{w} X'\) (\(b\) is a constant intercept). Hence, our goal is to find the values of \(b\) and \(\mathbf{w}\) that minimize \(L\).

One way to do this is with gradient descent. First, we need to compute the gradients of \(L\) with respect to \(b\) and \(\mathbf{w}\):

\[ \begin{align} \frac{\partial L}{\partial \mathbf{w}} &= - 2 \sum_i (y - b - \mathbf{w} X') X \\ \frac{\partial L}{\partial b} &= - 2 \sum_i (y - b - \mathbf{w} X') \end{align} \]
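
For completeness (the question explicitly asks to derive the updates): the \(\mathbf{w}\) gradient above follows from applying the chain rule to each squared term of \(L\), and the \(b\) gradient is obtained in exactly the same way:

\[ \frac{\partial L}{\partial \mathbf{w}} = \sum_i \frac{\partial}{\partial \mathbf{w}} \left(y - b - \mathbf{w} X'\right)^2 = \sum_i 2 \left(y - b - \mathbf{w} X'\right) \frac{\partial \left(y - b - \mathbf{w} X'\right)}{\partial \mathbf{w}} = - 2 \sum_i \left(y - b - \mathbf{w} X'\right) X \]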

Then, once we know the gradients, we can estimate the optimal values for \(b\) and \(\mathbf{w}\) through gradient descent in an iterative manner as follows:

\[ \begin{align} \mathbf{w} &= \mathbf{w} - \lambda \frac{\partial L}{\partial \mathbf{w}} = \mathbf{w} + \lambda \sum_i (y - b - \mathbf{w} X') X \\ b &= b - \lambda \frac{\partial L}{\partial b} = b + \lambda \sum_i (y - b - \mathbf{w} X') \end{align} \] where \(\lambda\) is the learning rate.

Vectorised approach: Since we have written the gradient descent updates in vector form, we can write our code the same way, using vectors and matrices. NumPy is required for this approach. We will code it as follows:

  • Create a custom class and name it LM_GD. Creating a class is not strictly required, but it is good practice, and in an interview setting you will most likely lose points if you do not structure a problem like this as a class.
  • Compartmentalize the actions into functions within the class. In particular, looking at the equations above, there are three main actions:
    • Estimate the predictions based on the current values of \(\mathbf{w}\) and \(b\), \(\hat{y} = b + \mathbf{w} X'\).
    • Estimate the error \(e = y - \hat{y}\).
    • Estimate the loss (sum of squared errors), as per the question’s request.
  • Finally, we will need to define a max number of iterations and the learning rate \(\lambda\), along with an exit condition.

Based on these points, we design the solution below:

import numpy as np

class LM_GD:
    def __init__(self, learning_rate=0.1,max_iter=1000):
        self.max_iter = max_iter
        self.learning_rate = learning_rate

    # the @ operator multiplies matrices: 
    # https://numpy.org/doc/stable/reference/generated/numpy.matmul.html 
    def predict(self,X):
        return self.b + self.w @ X.T

    # We use `__` in front of the method to show that 
    # it is not supposed to be accessed outside the class
    # See https://stackoverflow.com/questions/17193457/private-methods-in-python 
    def __get_error(self,y,y_hat):
        return y-y_hat

    def __calculate_gradients (self,error,X):
        return  error.T @ X, np.sum(error)
    
    def __calculate_loss(self,error):
        return np.sum((error)**2)
    
    def fit(self,X,y):
        # initialize weights
        self.w = np.random.uniform(low=-1,high=1, size = X.shape[1])
        # initialize intercept
        self.b = np.random.random() 
        prev_loss = float("inf")
        for i in range(self.max_iter):
            y_hat = self.predict(X)
            error = self.__get_error(y,y_hat)
            gradient_w, gradient_b = self.__calculate_gradients(error,X)
            self.w += self.learning_rate * gradient_w/len(X)
            self.b += self.learning_rate * gradient_b/len(X)
            cur_loss = self.__calculate_loss(error)
            if np.isclose(cur_loss,prev_loss,rtol=1E-9):
                print(f"Converged in {i} iterations.")
                return cur_loss
            prev_loss = cur_loss

        return cur_loss
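
The dataset \(X, y\) used for the outputs below is not included in the question. If you want to exercise the class yourself, here is a minimal, purely hypothetical way to generate a two-feature synthetic dataset (the numbers printed below come from the original, unshown data, so they will not match):

import numpy as np

# Hypothetical synthetic data: N = 200 points, M = 2 features.
# This is NOT the dataset behind the printed outputs; it only lets you run the code.
np.random.seed(0)
X = np.random.normal(size=(200, 2))
true_w, true_b = np.array([1.5, -0.7]), 0.4            # arbitrary "true" parameters
y = X @ true_w + true_b + np.random.normal(scale=0.1, size=200)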

To run it:

np.random.seed(1234)
c = LM_GD()
print(f"Final loss: {c.fit(X,y):.2f}")
## Converged in 609 iterations.
## Final loss: 13.54
print(f"Coefficients:\n X1\t{round(c.w[0],3)}\n X2\t{round(c.w[1],3)}")
## Coefficients:
##  X1  0.299
##  X2  0.023
print(f" b\t{round(c.b,3)}")
##  b   -0.184

If you are wondering whether these results are correct, let’s estimate the same coefficients with the sklearn implementation:

from sklearn import linear_model
lm = linear_model.LinearRegression()
m = lm.fit(X,y)
print(f"Coefficients:\n X1\t{round(m.coef_[0],3)}\n X2\t{round(m.coef_[1],3)}")
## Coefficients:
##  X1  0.299
##  X2  0.023
print(f" b\t{round(m.intercept_,3)}")
##  b   -0.184
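
As an extra, hypothetical programmatic check (not in the original write-up), the gradient descent estimates should agree with sklearn’s closed-form solution to a couple of decimal places:

import numpy as np

# c is the fitted LM_GD object and m the fitted LinearRegression object from above.
print(np.allclose(c.w, m.coef_, atol=1e-2))        # True if the slopes match closely
print(np.isclose(c.b, m.intercept_, atol=1e-2))    # True if the intercepts match closely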


There are a few points worth mentioning:

  • We dropped the “2” from the derivative, as it can be absorbed by the learning rate.
  • We did not discuss how to choose the learning rate – you should expect the interviewer to probe how the learning rate is chosen and what happens when it is too large or too small. This is not the focus of this question, so please search for learning-rate problems at https://www.madinterview.com.
  • We divide the gradient by the size of the input dataset so that the update scale does not grow with the number of data points, which helps avoid divergence and numerical overflows.
  • In a real interview, a quick and dirty way to check whether your code works is to print out the loss at each iteration – it should be decreasing over time (see the sketch after this list).
  • Unless you feel super comfortable with matrix multiplications, I would avoid implementing this approach during an interview. You might end up in a situation where you can’t figure out which matrix/vector to transpose, and combined with the stress of the interview, you might end up wasting too much time on making the code work. Furthermore, you might get an interviewer who will expect the non-vectorized approach as they might not be super comfortable with linear algebra.
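
As a rough illustration of that loss-printing sanity check, here is a minimal sketch (not part of the original solution) that runs the same update rule in a bare-bones loop and records the loss at every iteration, assuming X and y are defined as above:

import numpy as np

# Hypothetical sanity check: run the same gradient-descent update in a
# bare-bones loop and record the loss at every iteration. If the code and
# learning rate are sane, the recorded losses should be decreasing.
def loss_curve(X, y, learning_rate=0.1, iters=200):
    w = np.zeros(X.shape[1])
    b = 0.0
    losses = []
    for _ in range(iters):
        error = y - (b + w @ X.T)          # residuals under current parameters
        losses.append(np.sum(error ** 2))  # sum of squared errors
        w += learning_rate * (error @ X) / len(X)
        b += learning_rate * np.sum(error) / len(X)
    return losses

losses = loss_curve(X, y)
print(losses[0], losses[-1])  # the final loss should be much smaller than the first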


For-loops approach: Following up on my last point, let’s see how we can implement this without linear algebra:




class LM_GD_For_Loops:
    def __init__(self, learning_rate=0.1,iterations=1000):
        self.iterations = iterations
        self.learning_rate = learning_rate
        
    # this is now a dot product between two vectors, w and X_i
    def predict(self,X_i):
        return self.b + np.dot(self.w,X_i)
    # error is the difference between two scalars
    def __get_error(self,y_i,y_hat_i):
        return y_i-y_hat_i
    # gradients are vector, scalar
    def __calculate_gradients (self,error_i,X_i):
        return  error_i * X_i, error_i
    
    # loss is calculated per point
    def __calculate_loss(self,error_i):
        return (error_i)**2
    
    def fit(self,X,y):
        self.w = np.random.uniform(low=-1,high=1, size = X.shape[1])
        self.b = np.random.random() 
        prev_loss = float("inf")
        for i in range(self.iterations):
            cur_loss = 0
            gradient_w, gradient_b = 0,0
            for X_i,y_i in zip(X,y):
                y_hat_i = self.predict(X_i)
                error_i = self.__get_error(y_i,y_hat_i)
                pointg_w, pointg_b = self.__calculate_gradients(error_i,X_i)
                gradient_w += pointg_w
                gradient_b += pointg_b
                cur_loss += self.__calculate_loss(error_i)
            self.w += self.learning_rate * gradient_w/len(X)
            self.b += self.learning_rate * gradient_b/len(X)
            if np.isclose(cur_loss,prev_loss,rtol=1E-9):
                print(f"Converged in {i} iterations.")
                return cur_loss
            prev_loss = cur_loss

        return cur_loss

To run it:

np.random.seed(1234)
c = LM_GD_For_Loops()
print(f"Final loss: {c.fit(X,y):.2f}")
## Converged in 609 iterations.
## Final loss: 13.54
print(f"Coefficients:\n X1\t{round(c.w[0],3)}\n X2\t{round(c.w[1],3)}")
## Coefficients:
##  X1  0.299
##  X2  0.023
print(f" b\t{round(c.b,3)}")
##  b   -0.184


No-intercept approach: For those of us who took an econometrics class, it might seem redundant that I am estimating \(b\) and \(\mathbf{w}\) separately. Indeed, we can add a column of ones to our matrix \(X\) and drop the \(b\) calculations, as shown below.
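In other words, writing \(\tilde{X}\) for \(X\) with a prepended column of ones and \(\tilde{\mathbf{w}}\) for the parameter vector with \(b\) prepended (notation introduced here only for illustration):

\[ \hat{y} = b + \mathbf{w} X' = \tilde{\mathbf{w}} \tilde{X}', \qquad \tilde{X} = [\mathbf{1} \;\; X] \in \mathbb{R}^{N \times (M+1)}, \quad \tilde{\mathbf{w}} = [b \;\; \mathbf{w}] \in \mathbb{R}^{1 \times (M+1)} \]

Adjusting the vectorized approach, the code becomes: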

class LM_GD_No_Intercept:
    def __init__(self, learning_rate=0.1,max_iter=1000):
        self.max_iter = max_iter
        self.learning_rate = learning_rate

    def predict(self,X):
        return  self.w @ X.T

    def __get_error(self,y,y_hat):
        return y-y_hat

    def __calculate_gradients (self,error,X):
        return  error.T @ X
    
    def __calculate_loss(self,error):
        return np.sum((error)**2)
    
    def fit(self,X,y):
        # initialize weights
        self.w = np.random.uniform(low=-1,high=1, size = X.shape[1])
        prev_loss = float("inf")
        for i in range(self.max_iter):
            y_hat = self.predict(X)
            error = self.__get_error(y,y_hat)
            gradient_w = self.__calculate_gradients(error,X)
            self.w += self.learning_rate * gradient_w/len(X)
            cur_loss = self.__calculate_loss(error)
            if np.isclose(cur_loss,prev_loss,rtol=1E-9):
                print(f"Converged in {i} iterations.")
                return cur_loss
            prev_loss = cur_loss

        return cur_loss

To run it:

np.random.seed(1234)
c = LM_GD_No_Intercept()
X_with_ones = np.c_[np.ones(X.shape[0]), X]
print(f"Final loss: {c.fit(X_with_ones,y):.2f}")
## Converged in 467 iterations.
## Final loss: 13.54
print(f"Coefficients:\n b \t{round(c.w[0],3)}\n X1\t{round(c.w[1],3)}\n X2\t{round(c.w[2],3)}")
## Coefficients:
##  b   -0.184
##  X1  0.299
##  X2  0.023
Despite the no-intercept approach being the cleanest in my opinion, I would be very cautious about using it in an interview. It is quite possible that your interviewer did not take an econometrics class and won’t follow your logic. It has happened to me personally, and I had to switch approaches mid-interview.


Topics: Gradient descent, Linear regression