Intro to Simple Linear Regression

Linear regression is an important concept in the field of statistics and data analysis. The goal of linear regression is to accurately model the relationship between a dependent variable and an independent variable. This can be done with multiple independent variables, but we will be looking at cases of just one independent variable, which is called simple linear regression.

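Given a scatter plot of two variables (one independent, one dependent), the goal is to draw the line that lies as close as possible to all of the points at once. In the usual least-squares sense, this means choosing the line that minimizes the sum of the squared vertical distances between the points and the line. The resulting line models how the dependent variable changes with the independent variable.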

How Linear Regression is Useful

Take, for example, this plot of cars by weight and miles per gallon (mpg): there appears to be a downward trend in mpg as car weight increases, but it would be very helpful to see the relationship displayed as a continuous line. That way we can predict a car’s mpg from its weight with a reasonable degree of accuracy.

How Linear Regression is Useful Part 2

Now, with a line fitted as closely as possible to the points on the plot, we can predict with some accuracy what a car’s mpg is given its weight.

For example, given a car that weighs 4000 lbs, you could estimate its mpg to be around 15-18.
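
As a rough illustration of this kind of prediction, the sketch below fits a line to R's built-in mtcars data set and reads off an estimate for a 4000 lb car. Using mtcars is an assumption made for this example (the data behind the plot is not identified here), and the names fit and pred are illustrative.

```r
# Assumption: mtcars (bundled with R) stands in for the car data in the plot.
# In mtcars, wt is weight in units of 1000 lbs and mpg is miles per gallon.
fit <- lm(mpg ~ wt, data = mtcars)

# Scatter plot with the fitted line drawn on top.
plot(mpg ~ wt, data = mtcars,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
abline(fit)

# Predicted mpg for a 4000 lb car (wt = 4).
pred <- predict(fit, newdata = data.frame(wt = 4))
print(pred)
```

With this data set the estimate should land inside the 15-18 range quoted above.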

How to Find the Line of Best Fit Using Linear Algebra

To find a line of best fit, it is useful to write out the equation the line would have to satisfy in order to pass through each point. Say, for example, there are four points for which we want to find the best fit. Then we would write out four equations as follows:

\[x_1S + Y = y_1\]
\[x_2S + Y = y_2\]
\[x_3S + Y = y_3\]
\[x_4S + Y = y_4\]

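Each pair \(x_i\), \(y_i\) corresponds to one point on the graph. \(S\) and \(Y\) are the slope and y-intercept of the best fit line we want to find. With more than two points there is usually no single line through all of them, so these equations cannot all hold exactly. Instead, we look for the \(S\) and \(Y\) that come closest to satisfying them. One way to do this is to write the equations as a matrix system \(A\vec{x} = \vec{b}\), where \(A\) is a matrix and \(\vec{x}\) and \(\vec{b}\) are column vectors.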
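
One way to make "as close as possible" precise, assuming the usual least-squares criterion, is to choose \(S\) and \(Y\) so that the sum of the squared gaps between the two sides of the four equations is as small as possible:

\[\min_{S,\,Y} \sum_{i=1}^{4} (x_iS + Y - y_i)^2 = \min_{\vec{x}} \lVert A\vec{x} - \vec{b} \rVert^2\]

Here each row of \(A\) is \(\begin{bmatrix} x_i & 1 \end{bmatrix}\), \(\vec{x}\) holds \(S\) and \(Y\), and \(\vec{b}\) holds the \(y_i\) values.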

How to Find the Line of Best Fit Using Linear Algebra Part 2

The previous four equations can be written into one matrix equation as such:

\[\begin{bmatrix} x_1 & 1 \\ x_2 & 1 \\ x_3 & 1 \\ x_4 & 1 \end{bmatrix} \begin{bmatrix} S \\ Y \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{bmatrix}\]

To solve for \(\begin{bmatrix} S \\ Y \end{bmatrix}\), you must manipulate the equation to get the vector by itself. Because \(A\) is not square, it has no inverse, so the first step is to multiply both sides of the equation on the left by the transpose \(A^T\). The equation now looks like \(A^TA\vec{x} = A^T\vec{b}\), and \(A^TA\) is a square 2-by-2 matrix. As long as the \(x\) values are not all the same, \(A^TA\) is invertible, so you can multiply both sides on the left by \((A^TA)^{-1}\), which simplifies the equation to \(\vec{x} = (A^TA)^{-1}A^T\vec{b}\) (since a matrix multiplied by its inverse is the identity matrix, equivalent to multiplying by 1). Evaluating the right-hand side gives a column vector of size 2 containing the slope and y-intercept of our best fit line.
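
The steps above can be sketched in R as follows. The four points are made up for illustration and do not come from the text. The sketch calls solve() with two arguments, which solves the system \(A^TA\vec{x} = A^T\vec{b}\) directly and gives the same result as multiplying by the inverse.

```r
# Hypothetical data: four made-up points, not taken from the text.
x <- c(1, 2, 3, 4)
y <- c(2.1, 3.9, 6.2, 7.8)

# Matrix A: one row [x_i, 1] per point; b holds the y values.
A <- cbind(x, 1)
b <- y

# Normal equations: (A^T A) xhat = A^T b.
# solve(M, v) solves M %*% xhat = v without forming the inverse explicitly.
xhat <- solve(t(A) %*% A, t(A) %*% b)

S <- xhat[1]  # slope
Y <- xhat[2]  # y-intercept
print(c(slope = S, intercept = Y))
```

For these points, the result should agree with what lm(y ~ x) reports, which is a convenient sanity check.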

Simple Regression Problem Example

Find the line of best fit for this data (population by year, 1995 to 2000):

Regression Problem Solution (using Linear Algebra)

Putting the data into a matrix equation will look like this:

\[\begin{bmatrix} 1995 & 1 \\ 1996 & 1 \\ 1997 & 1 \\ 1998 & 1 \\ 1999 & 1 \\ 2000 & 1 \end{bmatrix} \begin{bmatrix} S \\ Y \end{bmatrix} = \begin{bmatrix} 18567343 \\ 18848350 \\ 19060850 \\ 19282965 \\ 19620692 \\ 20144584 \end{bmatrix}\]

Using a matrix calculator, you can find the solution to the problem:

\[(A^TA)^{-1}A^T\vec{b} \approx \begin{bmatrix} 297867 \\ -5.7574 \times 10^8 \end{bmatrix}\]
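
If no matrix calculator is at hand, the same computation can be sketched in R. The values are copied from the matrix equation above, and the variable names (year, population, xhat, slope, intercept) are chosen for this example.

```r
# Data copied from the matrix equation above.
year <- 1995:2000
population <- c(18567343, 18848350, 19060850, 19282965, 19620692, 20144584)

A <- cbind(year, 1)  # columns: x values, then a column of ones
b <- population

# Least-squares solution of A xhat = b via the normal equations.
xhat <- solve(t(A) %*% A, t(A) %*% b)
slope <- xhat[1]
intercept <- xhat[2]
print(c(slope = slope, intercept = intercept))
```

Because the years are large numbers close together, \(A^TA\) is poorly conditioned. The result is still accurate enough to match the rounded values shown, but subtracting a constant such as 1995 from every year before fitting would be a more robust design.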

Regression Problem Solution Part 2

Adding the line with this slope and y-intercept to the scatter plot results in this plot:

abline(a = -5.7574e8, b = 297867)  # a = y-intercept, b = slope

Checking the Answer using R

We can double-check our answer using R. The lm() function fits a linear model to the data it is given. Once plotted, the fitted line looks identical to the one calculated using linear algebra.

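# S is the data frame holding the year and population columns
regression <- lm(population ~ year, data = S)
abline(regression)  # add the fitted line to the existing plot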
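
For a check that runs on its own, the sketch below rebuilds the data frame and prints the fitted coefficients. It assumes S is a data frame with columns year and population, reconstructed here from the values in the matrix equation.

```r
# Assumption: S is a data frame with columns year and population, rebuilt
# here from the values in the matrix equation.
S <- data.frame(
  year = 1995:2000,
  population = c(18567343, 18848350, 19060850, 19282965, 19620692, 20144584)
)

regression <- lm(population ~ year, data = S)

# coef() returns the intercept first, then the slope.
print(coef(regression))

# Draw the data and overlay the fitted line.
plot(population ~ year, data = S)
abline(regression)
```

The printed slope and intercept should match the linear algebra result up to rounding.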