06/05/2025

Linear Regression

Linear regression is a tool used to model the relationship between a dependent variable and one or more independent variables. To do this, we must assume that the relationship between these variables is linear. Linear regression can help us answer interesting questions such as:

  1. Does studying more result in higher test scores?
  2. Does spending more on advertisement result in higher revenue?
  3. Does a car’s price decrease over time?
  4. Does a house’s size influence its price?
  5. Does sun exposure reduce someone’s lifespan?

Example of Linear Regression

By assuming a linear relationship between the dependent and independent variables, linear regression can be used to find a line that best fits the observed data points, otherwise known as the line of best fit.

The following is an example showing the line of best fit that represents the linear relationship between a car’s price (dependent variable) and its mileage (independent variable).

The Math Behind Linear Regression

While a linear regression model is not obligated to use the least squares method, it is by far the most common choice. This method finds the parameters that minimize the sum of the squared vertical distances between the observed data points and the line of best fit.

These distances are referred to as residuals. Squaring the residuals keeps positive and negative errors from canceling (and penalizes large errors more heavily), so minimizing their sum produces the line that best fits the data.
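To make this concrete, here is a minimal Python sketch (the mileage and price data are made up for illustration) that computes the residuals and their sum of squares for an arbitrary candidate line:

```python
import numpy as np

# Hypothetical observations: mileage (x) and price (y)
x = np.array([20_000, 45_000, 60_000, 90_000, 120_000], dtype=float)
y = np.array([24_000, 19_500, 17_000, 13_500, 10_000], dtype=float)

# An arbitrary candidate line: price = b0 + b1 * mileage
b0, b1 = 26_000.0, -0.13

residuals = y - (b0 + b1 * x)   # vertical distances from points to the line
rss = np.sum(residuals ** 2)    # the quantity least squares minimizes
print(f"RSS for this candidate line: {rss:.2f}")
```

Least squares searches over all possible \(b_0\) and \(b_1\) for the pair that makes this sum as small as possible.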

For a dataset of \(n\) observations, we have

\[y_1 = \beta_0 + \beta_1 x_1\] \[y_2 = \beta_0 + \beta_1 x_2\] \[\vdots\] \[y_n = \beta_0 + \beta_1 x_n\]

where \(x_n\) and \(y_n\) are the \(n^{th}\) observed data points, and \(\beta_0\) and \(\beta_1\) are the parameters of the model that represent the intercept and slope, respectively.

To solve for \(\beta_0\) and \(\beta_1\), we first rewrite the system in vector form to get

\[\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} \beta_0 \\ \beta_0 \\ \vdots \\ \beta_0 \end{bmatrix} + \begin{bmatrix} \beta_1 x_1 \\ \beta_1 x_2 \\ \vdots \\ \beta_1 x_n \end{bmatrix}\]

which can be rewritten as

\[\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \beta_0 \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} + \beta_1 \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}\]

and once again rewritten as:

\[\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}\]

It’s now easy to see that there are three vectors of interest: \[\vec{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \vec{1} = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}, \vec{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}\]

Since \(\vec{y}\) generally does not lie in the subspace spanned by \(\vec{1}\) and \(\vec{x}\), the system has no exact solution, so we instead project \(\vec{y}\) onto this subspace. We call this projected vector \(\vec{\hat{y}}\), and it represents the least squares solution to the equation. The name comes from the fact that the orthogonal projection minimizes the distance between \(\vec{y}\) and \(\vec{\hat{y}}\), and minimizing that distance is equivalent to minimizing the sum of squared residuals.

To obtain \(\vec{\hat{y}}\), we can use the projection formula from linear algebra:

\[\vec{\hat{y}} = A (A^T A)^{-1} A^T \vec{y}\]

where:

\[A = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}\]

To obtain the intercept and slope, note that \(\vec{\hat{y}} = A \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}\), so the coefficient vector itself is given by:

\[\begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} = (A^T A)^{-1} A^T \vec{y}\]
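Here is a direct NumPy translation of these formulas, reusing the made-up data from the sketch above. Note that forming \((A^T A)^{-1}\) explicitly is numerically fragile; `np.linalg.solve` (or `np.linalg.lstsq`) is the safer route, and this is only a sketch:

```python
import numpy as np

x = np.array([20_000, 45_000, 60_000, 90_000, 120_000], dtype=float)
y = np.array([24_000, 19_500, 17_000, 13_500, 10_000], dtype=float)

# Design matrix A = [1 | x], exactly as in the derivation above
A = np.column_stack([np.ones_like(x), x])

# beta = (A^T A)^{-1} A^T y, computed without an explicit inverse
beta = np.linalg.solve(A.T @ A, A.T @ y)
b0, b1 = beta

# Projection of y onto the column space of A
y_hat = A @ beta

print(f"intercept b0 = {b0:.2f}, slope b1 = {b1:.6f}")
```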

Making Predictions via Linear Regression

Once a linear regression model has been obtained, it can be used to predict values within the range of the observed data points. For example, to predict the price of a car with 80,000 miles, I would simply plug that value into the following equation:

\[\hat{y} = \beta_0 + \beta_1 x\]

However, there are some important considerations when using linear models. You should not try to predict values far outside the range of your input data. This practice is called extrapolation, and it can produce highly inaccurate and misleading results. For example, if your dataset only contains cars with mileage between 10,000 and 150,000 miles, using the model to predict the price of a car with 300,000 miles may not be reliable, because the model hasn’t seen such data and can’t infer trends that far beyond its training domain.
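A small sketch of this prediction step, with a simple range check to flag extrapolation. The coefficient values and mileage bounds here are placeholders, not results from real data:

```python
# Placeholder coefficients standing in for a fitted model
b0, b1 = 26_000.0, -0.13

def predict_price(mileage, b0, b1, x_min=10_000, x_max=150_000):
    """Predict price from mileage, warning when extrapolating."""
    if not (x_min <= mileage <= x_max):
        print(f"Warning: {mileage} miles lies outside the observed range "
              f"[{x_min}, {x_max}]; the prediction may be unreliable.")
    return b0 + b1 * mileage

print(predict_price(80_000, b0, b1))   # within the observed range
print(predict_price(300_000, b0, b1))  # triggers the extrapolation warning
```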

Linear regression is only as good as the variables it considers. Other important variables that might influence a car’s price include:

  • Car age
  • Make and model
  • Title status

Omitting important variables like these can lead to underfitting, which is when the model is too simple to capture the true complexity of a car’s pricing behavior.

Multiple Linear Regression

We will now include a second independent variable: car age. By including this second variable, we obtain a multiple linear regression model that may or may not be better at predicting a car’s price. A 2D plot can visualize this new variable: the car’s mileage is plotted on the x-axis, the car’s price on the y-axis, and the car’s age is color-mapped.

Now that two independent variables are being used, we move up one dimension, resulting not in a line of best fit but rather a plane of best fit. A 3D plot can be used to display this plane of best fit.
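Extending the earlier sketch to two predictors only requires adding a column to the design matrix; the age values below are again made up for illustration:

```python
import numpy as np

mileage = np.array([20_000, 45_000, 60_000, 90_000, 120_000], dtype=float)
age     = np.array([2, 4, 5, 8, 11], dtype=float)   # years (hypothetical)
price   = np.array([24_000, 19_500, 17_000, 13_500, 10_000], dtype=float)

# Design matrix now has three columns: intercept, mileage, age
A = np.column_stack([np.ones_like(mileage), mileage, age])

# Same least squares solution as before, now with three coefficients
beta, *_ = np.linalg.lstsq(A, price, rcond=None)
b0, b1, b2 = beta

print(f"plane of best fit: price = {b0:.2f} + {b1:.6f}*mileage + {b2:.2f}*age")
```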

Comparing the Models via Residual Sum of Squares (RSS)

Single Linear Regression Model (Mileage vs Price)

The calculated RSS is: 2705108938.288

Multiple Linear Regression Model (Mileage & Age vs Price)

The calculated RSS is: 2465277563.618
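For your own data, the two RSS values can be computed with the same machinery; this sketch reuses the hypothetical mileage, age, and price arrays from above:

```python
import numpy as np

mileage = np.array([20_000, 45_000, 60_000, 90_000, 120_000], dtype=float)
age     = np.array([2, 4, 5, 8, 11], dtype=float)   # hypothetical
price   = np.array([24_000, 19_500, 17_000, 13_500, 10_000], dtype=float)

def rss(A, y):
    """Fit by least squares, then return the residual sum of squares."""
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sum((y - A @ beta) ** 2)

ones = np.ones_like(mileage)
print("RSS, mileage only:   ", rss(np.column_stack([ones, mileage]), price))
print("RSS, mileage and age:", rss(np.column_stack([ones, mileage, age]), price))
```

Adding a predictor can never increase the RSS on the data the model was fit to, so the comparison is most meaningful when paired with validation on held-out data.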

Conclusion

In this example, we see that including a car’s age in the linear regression model decreases the residual sum of squares, thus improving the model’s predictive power. This suggests that a car’s age provides meaningful additional information about the car’s price beyond mileage alone. In real-world applications, more features and careful validation are crucial to building reliable predictive models.

Thank You

Have a great day! :)