\[ y = \mathbf{w} X' + b \] where \(\mathbf{w} \in \mathbb{R}^{1\times M}\) and \(X \in \mathbb{R}^{N\times M}\) (i.e., \(N\) data points, \(M\) features). Estimate this model by using gradient descent. In particular:
\[ L = \sum_i (y-\hat{y})^2 \]
where \(y\) is the true value, and \(\hat{y} = b + \mathbf{w}' X\) (\(b\) is a constant). Hence our goal is to find the values of \(b\) and \(\mathbf{w}\) that minimize \(L\).
One way to do this is with gradient descent. First, we need to estimate the gradients wrt \(b\) and \(\mathbf{w}\):
\[ \begin{align} \frac{\partial L}{\partial \mathbf{w}} &= - 2 \sum_i (y - b - \mathbf{w} X') X \\ \frac{\partial L}{\partial b} &= - 2 \sum_i (y - b - \mathbf{w} X') \end{align} \]
Then, once we know the gradients, we can estimate the optimal values for \(b\) and \(\mathbf{w}\) through gradient descent in an iterative matter as follows:
\[ \begin{align} \mathbf{w} &= \mathbf{w} - \lambda \frac{\partial L}{\partial \mathbf{w}} = \mathbf{w} + \lambda \sum_i (y - b - \mathbf{w} X') X \\ b &= b - \lambda \frac{\partial L}{\partial b} = b + \lambda \sum_i (y - b - \mathbf{w} X') \end{align} \] where \(\lambda\) is the learning rate.
Vectorised approach: Since we have written the gradient descent updates in the form of vectors, we can also write our code in a similar way, by using vectors and matrices. Numpy is a requirement for this approach. We will code it as follows:
LM_GD
. Creating a class is not required but it is good practice. In an interview setting, you will certainly lose points if you do not create a class for this problem.Based on these points, we design the solution below:
class LM_GD:
def __init__(self, learning_rate=0.1,max_iter=1000):
self.max_iter = max_iter
self.learning_rate = learning_rate
# the @ operator multiplies matrices:
# https://numpy.org/doc/stable/reference/generated/numpy.matmul.html
def predict(self,X):
return self.b + self.w @ X.T
# We use `__` in front of the method to show that
# it is not supposed to be accessed outside the class
# See https://stackoverflow.com/questions/17193457/private-methods-in-python
def __get_error(self,y,y_hat):
return y-y_hat
def __calculate_gradients (self,error,X):
return error.T @ X, np.sum(error)
def __calculate_loss(self,error):
return np.sum((error)**2)
def fit(self,X,y):
# initialize weights
self.w = np.random.uniform(low=-1,high=1, size = X.shape[1])
# initialize intercept
self.b = np.random.random()
prev_loss = float("inf")
for i in range(self.max_iter):
y_hat = self.predict(X)
error = self.__get_error(y,y_hat)
gradient_w, gradient_b = self.__calculate_gradients(error,X)
self.w += self.learning_rate * gradient_w/len(X)
self.b += self.learning_rate * gradient_b/len(X)
cur_loss = self.__calculate_loss(error)
if np.isclose(cur_loss,prev_loss,rtol=1E-9):
print(f"Converged in {i} iterations.")
return cur_loss
prev_loss = cur_loss
return cur_loss
To run it:
## Converged in 609 iterations.
## Final loss: 13.54
## Coefficients:
## X1 0.299
## X2 0.023
## b -0.184
If you are wondering whether these results are correct, let’s estimate these coefficients from the sklearn
implementation:
from sklearn import linear_model
lm = linear_model.LinearRegression()
m = lm.fit(X,y)
print(f"Coefficients:\n X1\t{round(m.coef_[0],3)}\n X2\t{round(m.coef_[1],3)}")
## Coefficients:
## X1 0.299
## X2 0.023
## b -0.184
There are a few points worth of mentioning:
For-loops approach: Following up on my last point, let’s see how we can implement this without linear algebra:
class LM_GD_For_Loops:
def __init__(self, learning_rate=0.1,iterations=1000):
self.iterations = iterations
self.learning_rate = learning_rate
# this is now a dot product between two vectors, w and X_i
def predict(self,X_i):
return self.b + np.dot(self.w,X_i)
# error is the difference between two scalars
def __get_error(self,y_i,y_hat_i):
return y_i-y_hat_i
# gradients are vector, scalar
def __calculate_gradients (self,error_i,X_i):
return error_i * X_i, error_i
# loss is calculated per point
def __calculate_loss(self,error_i):
return (error_i)**2
def fit(self,X,y):
self.w = np.random.uniform(low=-1,high=1, size = X.shape[1])
self.b = np.random.random()
prev_loss = float("inf")
for i in range(self.iterations):
cur_loss = 0
gradient_w, gradient_b = 0,0
for X_i,y_i in zip(X,y):
y_hat_i = self.predict(X_i)
error_i = self.__get_error(y_i,y_hat_i)
pointg_w, pointg_b = self.__calculate_gradients(error_i,X_i)
gradient_w += pointg_w
gradient_b += pointg_b
cur_loss += self.__calculate_loss(error_i)
self.w += self.learning_rate * gradient_w/len(X)
self.b += self.learning_rate * gradient_b/len(X)
if np.isclose(cur_loss,prev_loss,rtol=1E-9):
print(f"Converged in {i} iterations.")
return cur_loss
prev_loss = cur_loss
return cur_loss
To run it:
## Converged in 609 iterations.
## Final loss: 13.54
## Coefficients:
## X1 0.299
## X2 0.023
## b -0.184
No-intercept approach: For those of us who took an econometrics class, it might seem redundant that I am calculating the coefficients of \(b\) and \(\mathbf{w}\) separately. Indeed, we can add a column of ones in our matrix \(X\) and drop the \(b\) calculations. Adjusting the vectorized approach, the code would have been:
class LM_GD_No_Intercept:
def __init__(self, learning_rate=0.1,max_iter=1000):
self.max_iter = max_iter
self.learning_rate = learning_rate
def predict(self,X):
return self.w @ X.T
def __get_error(self,y,y_hat):
return y-y_hat
def __calculate_gradients (self,error,X):
return error.T @ X
def __calculate_loss(self,error):
return np.sum((error)**2)
def fit(self,X,y):
# initialize weights
self.w = np.random.uniform(low=-1,high=1, size = X.shape[1])
prev_loss = float("inf")
for i in range(self.max_iter):
y_hat = self.predict(X)
error = self.__get_error(y,y_hat)
gradient_w = self.__calculate_gradients(error,X)
self.w += self.learning_rate * gradient_w/len(X)
cur_loss = self.__calculate_loss(error)
if np.isclose(cur_loss,prev_loss,rtol=1E-9):
print(f"Converged in {i} iterations.")
return cur_loss
prev_loss = cur_loss
return cur_loss
To run it:
np.random.seed(1234)
c = LM_GD_No_Intercept()
X_with_ones = np.c_[np.ones(X.shape[0]), X]
print(f"Final loss: {c.fit(X_with_ones,y):.2f}")
## Converged in 467 iterations.
## Final loss: 13.54
## Coefficients:
## b -0.184
## X1 0.299
## X2 0.023