Linear Regression from Scratch using Python and its Time Complexity (Part I)

Hongnan Gao · Published in Analytics Vidhya · Feb 28, 2021 · 7 min read


Introduction

I love to write and document what I learn, and I recently decided to write on a bigger platform! This is the first article in a series on Linear Regression using Python, utilizing Object-Oriented Programming to keep the code clean and reusable.

We need to know some simple Linear Algebra in order to craft our code. Before writing any code, it is always necessary to have a clear mathematical outline of the Machine Learning algorithm.

On a high level, the Linear Regression algorithm takes in a 2-dimensional matrix (a 2-d numpy array) X and a 1-dimensional vector (a 1-d array) y, and solves for the optimal coefficients β using the Normal Equation:

β̂ = (XᵀX)⁻¹Xᵀy (The Normal Equation)

The mathematics

I won’t go too deep into the mathematical nuances and details of Linear Regression, but for the interested, you can follow my Kaggle profile for a more rigorous explanation of the math behind it. To make things more concrete, attached is a house-pricing dataset, where the first three columns are the predictor variables x₁, x₂, and x₃, and the last column, price, is the response variable y.

The idea is simple: given m samples (here there are only 5 rows) and n predictor variables (here there are only 3), we aim to find the coefficients that best fit the data, so that our model can be represented as follows:

y = Xβ + ε (Linear Regression Matrix Form)

Notation

The notation is important: many authors fail to give a clear definition of what each symbol represents, which can make their material difficult for beginners to read. So here goes:

  • X, the Design Matrix: Let X be the design matrix of dimensions m × (n + 1), where m is the number of observations (training samples) and n is the number of independent feature/input variables. Take note that the first column of this matrix is a column of 1’s; that is the intercept column, and we usually include it if we want to fit our linear model with an intercept term β₀ (a minimal construction sketch follows after this list).
The Design Matrix X
  • The ith row of X, written x⁽ⁱ⁾ (the superscript is an index, not a power), is known as the ith training sample, represented as an (n + 1) × 1 column vector (including the leading 1).
  • y, the output vector: The column vector y contains the outputs for the m observations.
  • β, the vector of coefficients/parameters: The column vector β contains all the coefficients of the linear model.
  • ε, the random vector of error terms: The column vector ε contains the m error terms corresponding to the m observations.
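
To make the notation concrete, here is a minimal sketch of how the design matrix with its intercept column of 1’s might be built in NumPy. The numbers are made-up toy values standing in for the house-pricing columns, not the actual dataset.

import numpy as np

# Toy example: m = 5 samples, n = 3 predictor variables (hypothetical values,
# standing in for the house-pricing columns x1, x2, x3).
X_raw = np.array([
    [2104.0, 3.0, 45.0],
    [1600.0, 3.0, 40.0],
    [2400.0, 4.0, 30.0],
    [1416.0, 2.0, 36.0],
    [3000.0, 4.0, 38.0],
])
y = np.array([399.9, 329.9, 369.0, 232.0, 539.9])  # response variable: price

m, n = X_raw.shape

# Design matrix: prepend a column of 1's so the model can fit an intercept term β0.
X = np.column_stack([np.ones(m), X_raw])  # shape (m, n + 1) = (5, 4)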

Normal Equation

With the above equation, we aim to find the best estimate of the true population parameter β. We call this best estimate beta-hat, written β̂. We find this optimal estimate by minimizing the cost function using the Ordinary Least Squares (OLS) method. In a nutshell, we minimize the sum of the squares of the differences between the observed y (y_true in the dataset) and the y predicted by our linear model (y_hat). Note that the cost function is usually denoted as follows:

J(β) = (y − Xβ)ᵀ(y − Xβ) = Σᵢ (yᵢ − ŷᵢ)² (Cost Function for Linear Regression)
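
As a quick illustration, here is a one-function sketch of that cost in NumPy, assuming X already carries the intercept column and beta is an (n + 1)-vector:

import numpy as np

def cost(beta, X, y):
    # Sum of squared residuals: J(β) = (y - Xβ)ᵀ(y - Xβ).
    residuals = y - X @ beta
    return residuals @ residuals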

In the usual Machine Learning course, we may use the Gradient Descent algorithm to optimize the cost function. But for Linear Regression there already exists a closed-form solution for the minimum of the above function, and for a start, where we do not really care about much else, having this handy closed-form solution is a godsend. It is the following:

β̂ = (XᵀX)⁻¹Xᵀy (The Normal Equation)
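
For contrast with the closed-form solution, here is a minimal sketch of what batch Gradient Descent on the same cost might look like; the learning rate and iteration count are arbitrary placeholders, and in practice you would tune them (and scale the features).

import numpy as np

def gradient_descent(X, y, lr=1e-7, n_iters=10_000):
    # Minimize J(β) iteratively; the gradient of J is -2 Xᵀ(y - Xβ).
    beta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        gradient = -2.0 * X.T @ (y - X @ beta)
        beta -= lr * gradient
    return beta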

Coding Linear Regression from Scratch

I know we are bored of the mathematical equations by now, so here is the code as promised:

Linear Regression code
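
Since the original gist is not reproduced here as text, below is a minimal sketch of what such an object-oriented implementation might look like. The class and attribute names (coef_, intercept_, fit, predict) follow scikit-learn conventions, and the four numbered comments mark the operations that the time-complexity section below refers to as lines 37–40 of the original gist.

import numpy as np

class LinearRegression:
    """Ordinary Least Squares via the Normal Equation (a sketch, not the original gist)."""

    def __init__(self, fit_intercept=True):
        self.fit_intercept = fit_intercept
        self.coef_ = None        # constant-time O(1) setup, ignored in the analysis below
        self.intercept_ = None

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        if self.fit_intercept:
            X = np.column_stack([np.ones(X.shape[0]), X])  # add the intercept column of 1's

        # The four steps analysed in the next section (lines 37-40 of the original gist):
        xtx = np.dot(X.T, X)                # (1) XᵀX, a square matrix
        xtx_inv = np.linalg.inv(xtx)        # (2) (XᵀX)⁻¹
        xtx_inv_xt = np.dot(xtx_inv, X.T)   # (3) (XᵀX)⁻¹Xᵀ
        betas = np.dot(xtx_inv_xt, y)       # (4) β̂ = (XᵀX)⁻¹Xᵀy

        if self.fit_intercept:
            self.intercept_, self.coef_ = betas[0], betas[1:]
        else:
            self.intercept_, self.coef_ = 0.0, betas
        return self

    def predict(self, X):
        return self.intercept_ + np.asarray(X, dtype=float) @ self.coef_

With the toy arrays from earlier, model = LinearRegression().fit(X_raw, y) followed by model.predict(X_raw) would exercise the whole pipeline.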

I won’t go into the details of the code in this article, but I would like to draw attention to the time complexity of this algorithm.

Time Complexity

Time complexity is an important topic: you do not want your code to run for a billion years, so efficient code is important to businesses. That is also why time complexity questions are becoming increasingly popular in Machine Learning and Data Science interviews!

The Linear Regression algorithm that we use here relies mainly on matrix multiplication. We will ignore the operations that take constant time, O(1). For example, self.coef_ = None in the constructor is O(1), and we do not really wish to consider it in the grand scheme of things.

The really important operations are in code lines 37–40. Let X be an m × n matrix/array, where m is the number of samples and n the number of features, and let y be an m × 1 vector. Refer to this Wikipedia page for a handy cheat sheet on the time complexity of various mathematical operations.

Line 37: np.dot(X.T, X). We first transpose the m × n matrix into an n × m matrix; conceptually this takes O(m × n) time, because we are effectively performing two for loops (in NumPy, X.T actually returns a view in O(1) time, but materializing the transposed copy would cost O(m × n)). Next comes the matrix multiplication. Note carefully that np.dot between two 2-d arrays does not compute a vector dot product; it performs matrix multiplication, which takes O(m × n²) time. The output matrix of this step is n × n.

Line 38: Inverting an n × n matrix takes O(n³) time. The output matrix is n × n.

Line 39: Now we multiply an n × n matrix with an n × m matrix, which takes O(m × n²) time; the output matrix is n × m.

Line 40: Lastly, multiplying the n × m matrix with the m × 1 vector y takes O(m × n) time.

Adding them all up gives O(2mn + 2mn² + n³). Since mn ≤ mn², we can drop the less dominant 2mn term, and the run-time complexity of this Linear Regression algorithm using the Normal Equation comes out to O(n²(m + n)). However, you may notice that there are two variables in the big-O notation and wonder whether we can reduce it to a single variable. Well, if the number of variables is small, i.e. n is kept small and constant, the time complexity reduces to O(m); however, if the number of variables grows, the time complexity will explode as n → ∞.
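
If you want to see the growth in n for yourself, here is a rough sketch of a timing experiment; the data is random and used purely to measure runtime, and the exact numbers will vary with your machine and BLAS backend.

import time
import numpy as np

rng = np.random.default_rng(0)
m = 5_000  # fixed number of samples

for n in (50, 100, 200, 400):
    X = rng.normal(size=(m, n))
    y = rng.normal(size=m)

    start = time.perf_counter()
    beta = np.linalg.inv(X.T @ X) @ X.T @ y   # the Normal Equation
    elapsed = time.perf_counter() - start

    print(f"n = {n:4d}: {elapsed:.4f} s")      # expect roughly O(n²(m + n)) growth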

This ends the first part of the series, and it is also my first published article. Stay tuned for updates and watch me code various Machine Learning algorithms from scratch.

Preamble for the next part of the series on Linear Regression

Just a heads-up: I may not be doing Part II of the Linear Regression series just yet, as I want to cover a wide variety of algorithms at a surface level, just enough for beginner (or intermediate) learners. However, as a preamble, I will definitely touch on the following topics that are not covered in today’s session.

Orthogonalization

We can speed up the Normal Equation by using a technique called Orthogonalization, whereby we make use of QR Factorization so that we do not need to invert the annoying XᵀX, which costs O(n³) time!
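
As a preview, here is a minimal sketch of the QR route with NumPy. Since X = QR, the least-squares solution satisfies Rβ = Qᵀy; R is upper triangular, so in practice you would use a dedicated triangular solver (e.g. scipy.linalg.solve_triangular) instead of the generic solve shown here.

import numpy as np

def fit_qr(X, y):
    # Least squares via QR factorization: no (XᵀX)⁻¹ needed.
    Q, R = np.linalg.qr(X)              # reduced QR: Q is m x n, R is n x n upper triangular
    return np.linalg.solve(R, Q.T @ y)  # solve Rβ = Qᵀy (back-substitution in practice)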

Regularization

You basically cannot leave linear models without knowing L1 and L2 Regularization: the Ridge, the Lasso, and the ElasticNet! Note that Regularization is a broad term that cuts across all Machine Learning models. Stay tuned to understand how Regularization can reduce overfitting. In addition, one caveat that I didn’t mention is: what if XᵀX in our Normal Equation is not invertible? This can happen if some columns of X are linearly dependent (redundancy in our feature variables), or if there are too many features, so that the number of training samples m is smaller than the number of features n. If you use, say, Ridge Regression, then the modified Normal Equation guarantees a solution. We will talk about it in Part II of Linear Regression.
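
As a teaser for Part II, here is a minimal sketch of that modified Normal Equation, β̂ = (XᵀX + λI)⁻¹Xᵀy. Here lam is the regularization strength, and a common refinement (not shown) is to leave the intercept term unpenalized.

import numpy as np

def fit_ridge(X, y, lam=1.0):
    # Ridge Regression: the added λI makes XᵀX + λI invertible even when XᵀX is not.
    n = X.shape[1]
    return np.linalg.inv(X.T @ X + lam * np.eye(n)) @ X.T @ y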

Statistical Interpretation of Linear Regression

I didn’t say much about how to interpret Linear Regression. This is important: even if you know how to code up a Linear Regression algorithm from scratch, if you do not know how to interpret the results in a statistically rigorous way, it is not very meaningful! Learn more about Hypothesis Testing, Standard Errors, and Confidence Levels. I may delve a bit into Maximum Likelihood Estimators as well!
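
For a taste of what that looks like in practice, here is a minimal sketch using the statsmodels library; the summary table reports the coefficients together with their standard errors, t-statistics, p-values, and confidence intervals. It assumes X_raw and y are arrays like the toy ones sketched earlier.

import statsmodels.api as sm

def ols_summary(X_raw, y):
    # Fit OLS and return the full inference table (a sketch).
    X_design = sm.add_constant(X_raw)    # statsmodels also expects an explicit intercept column
    results = sm.OLS(y, X_design).fit()  # ordinary least squares
    return results.summary()             # standard errors, t-stats, p-values, confidence intervals

For example, print(ols_summary(X_raw, y)) with the earlier toy arrays.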

Credits and References:

Statistics By Jim. (2020, July 6). Retrieved March 1, 2021, from https://statisticsbyjim.com/

Computational complexity of mathematical operations. (2021, February 4). In Wikipedia. Retrieved March 1, 2021, from https://en.wikipedia.org/wiki/Computational_complexity_of_mathematical_operations#Matrix_algebra
