2026-03-08

Simple Linear Regression

Simple linear regression is all about statistical relationships, as opposed to deterministic ones. A statistical relationship is one where the relationship between the variables is imperfect. In a scatter plot of such data, the points follow a trend but also exhibit some scatter around it.

The Best Fitting Line

Let's introduce some notation.

  • \(y_i\) denotes the observed response for experimental unit \(i\)
  • \(x_i\) denotes the predictor value for experimental unit \(i\)
  • \(\hat{y_i}\) is the predicted response (or fitted value) for experimental unit \(i\)

Then the equation for the best-fitting line is:

\[ \hat{y_i} = b_0 + b_1 x_i \]

where \(b_0\) is the intercept and \(b_1\) is the slope. For each experimental unit we make a prediction error (residual) of size

\[e_i = y_i - \hat{y_i}\]

Estimating the best fit

We need to find the line that minimizes the sum of the squared residuals.

This means we need to find values for \(b_0\) and \(b_1\) which minimize the following:

\[ \sum_{i=1}^{n} (y_i - \hat{y_i})^2 \]
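The quantity above can be sketched as a small R function. This is only an illustration: the function name `ssr` and the toy data are assumptions made for the example and do not come from the original analysis.

```r
# Sum of squared residuals for a candidate line y_hat = b0 + b1 * x.
# `ssr` is a hypothetical helper name used only for this sketch.
ssr <- function(b0, b1, x, y) {
  y_hat <- b0 + b1 * x      # predicted responses
  sum((y - y_hat)^2)        # add up the squared prediction errors
}

# Toy data lying exactly on the line y = 1 + 2x
x <- c(0, 1, 2, 3)
y <- c(1, 3, 5, 7)

ssr(1, 2, x, y)  # 0: the true line leaves no residual
ssr(0, 2, x, y)  # 4: every prediction is off by 1
```

The smaller the value returned, the closer the candidate line is to the data in the least squares sense.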

What is the best-fitting line?

Let us consider the last 10 data points from the mtcars dataset:

cars = tail(mtcars, n = 10)
cars[, c("wt", "mpg")]
                    wt  mpg
AMC Javelin      3.435 15.2
Camaro Z28       3.840 13.3
Pontiac Firebird 3.845 19.2
Fiat X1-9        1.935 27.3
Porsche 914-2    2.140 26.0
Lotus Europa     1.513 30.4
Ford Pantera L   3.170 15.8
Ferrari Dino     2.770 19.7
Maserati Bora    3.570 15.0
Volvo 142E       2.780 21.4

The plot

Here we have two candidate regression lines, one red and one blue. The red line is \(\hat{y} = 39 - 6.5x\). The blue line is \(\hat{y} = 42 - 7.5x\).

Squared residuals of the red line

We will calculate the squared residuals for the red line.

                    wt  mpg    yhat residual squared_residual
AMC Javelin      3.435 15.2 16.6725  -1.4725        2.1682563
Camaro Z28       3.840 13.3 14.0400  -0.7400        0.5476000
Pontiac Firebird 3.845 19.2 14.0075   5.1925       26.9620562
Fiat X1-9        1.935 27.3 26.4225   0.8775        0.7700063
Porsche 914-2    2.140 26.0 25.0900   0.9100        0.8281000
Lotus Europa     1.513 30.4 29.1655   1.2345        1.5239902
Ford Pantera L   3.170 15.8 18.3950  -2.5950        6.7340250
Ferrari Dino     2.770 19.7 20.9950  -1.2950        1.6770250
Maserati Bora    3.570 15.0 15.7950  -0.7950        0.6320250
Volvo 142E       2.780 21.4 20.9300   0.4700        0.2209000

The sum of squared residuals is

[1] 42.06398
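A table like the one above can be reproduced with a few lines of R. This is a sketch, assuming the red line is \(\hat{y} = 39 - 6.5x\), which is what the `yhat` column implies; the data frame name `red` is chosen here for illustration.

```r
# Last 10 rows of the built-in mtcars data, as used throughout
cars <- tail(mtcars, n = 10)

# Build the residual table for the red line (intercept 39, slope -6.5)
red <- cars[, c("wt", "mpg")]
red$yhat <- 39 - 6.5 * red$wt             # predicted mpg
red$residual <- red$mpg - red$yhat        # observed minus predicted
red$squared_residual <- red$residual^2

red
sum(red$squared_residual)  # about 42.06398
```

Swapping in an intercept of 42 and a slope of -7.5 gives the corresponding table for the blue line.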

Squared residuals of the blue line

Likewise, we will calculate the squared residuals for the blue line.

                    wt  mpg    yhat residual squared_residual
AMC Javelin      3.435 15.2 16.2375  -1.0375       1.07640625
Camaro Z28       3.840 13.3 13.2000   0.1000       0.01000000
Pontiac Firebird 3.845 19.2 13.1625   6.0375      36.45140625
Fiat X1-9        1.935 27.3 27.4875  -0.1875       0.03515625
Porsche 914-2    2.140 26.0 25.9500   0.0500       0.00250000
Lotus Europa     1.513 30.4 30.6525  -0.2525       0.06375625
Ford Pantera L   3.170 15.8 18.2250  -2.4250       5.88062500
Ferrari Dino     2.770 19.7 21.2250  -1.5250       2.32562500
Maserati Bora    3.570 15.0 15.2250  -0.2250       0.05062500
Volvo 142E       2.780 21.4 21.1500   0.2500       0.06250000

The sum of squared residuals is

[1] 45.9586

Which is best?

The sum of squared residuals for the red line is \(42.06398\), while for the blue line it is \(45.9586\).

This means that the red line is better! Why? Because it has the smaller sum of squared residuals.
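The same comparison can be made programmatically, which scales to any number of candidate lines. The sketch below assumes the two candidates are \(39 - 6.5x\) (red) and \(42 - 7.5x\) (blue); the `candidates` data frame is a hypothetical structure introduced only for this example.

```r
cars <- tail(mtcars, n = 10)

# One row per candidate line: intercept b0 and slope b1
candidates <- data.frame(
  line = c("red", "blue"),
  b0   = c(39, 42),
  b1   = c(-6.5, -7.5)
)

# Sum of squared residuals for each candidate
candidates$ssr <- mapply(
  function(b0, b1) sum((cars$mpg - (b0 + b1 * cars$wt))^2),
  candidates$b0, candidates$b1
)

candidates
best <- candidates$line[which.min(candidates$ssr)]
best  # "red"
```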

What about the optimal solution?

The sum of squared residuals is a quadratic function of \(b_0\) and \(b_1\). Because it is a sum of squares, it is also convex, so it has a global minimum. The values of \(b_0\) and \(b_1\) at this minimum are known as the least squares solution: they give the smallest possible sum of squared residuals. We can use R to solve for the least squares solution as follows:

model = lm(mpg ~ wt, data = cars)
model
Call:
lm(formula = mpg ~ wt, data = cars)

Coefficients:
(Intercept)           wt  
     39.633       -6.657  

The coefficients of our red line (39 and -6.5) were really close!
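For simple linear regression the minimum can also be written in closed form. Setting the partial derivatives of the sum of squared residuals with respect to \(b_0\) and \(b_1\) to zero gives the standard result

\[ b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x} \]

where \(\bar{x}\) and \(\bar{y}\) are the sample means. The following sketch checks these formulas against `lm` on the same 10 cars; the variable names are chosen here for illustration.

```r
cars <- tail(mtcars, n = 10)
x <- cars$wt
y <- cars$mpg

# Closed-form least squares estimates
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
c(b0 = b0, b1 = b1)

# These should agree with the coefficients reported by lm()
coef(lm(mpg ~ wt, data = cars))
```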

Calculate the residuals

We will calculate the squared residuals for the least squares line.

                    wt  mpg     yhat   residual squared_residual
AMC Javelin      3.435 15.2 16.76731 -1.5673076       2.45645321
Camaro Z28       3.840 13.3 14.07132 -0.7713241       0.59494080
Pontiac Firebird 3.845 19.2 14.03804  5.1619597      26.64582786
Fiat X1-9        1.935 27.3 26.75243  0.5475680       0.29983073
Porsche 914-2    2.140 26.0 25.38780  0.6122017       0.37479089
Lotus Europa     1.513 30.4 29.56158  0.8384197       0.70294759
Ford Pantera L   3.170 15.8 18.53135 -2.7313463       7.46025243
Ferrari Dino     2.770 19.7 21.19405 -1.4940461       2.23217373
Maserati Bora    3.570 15.0 15.86865 -0.8686464       0.75454664
Volvo 142E       2.780 21.4 21.12748  0.2725214       0.07426791

The sum of squared residuals is

[1] 41.59603
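With a fitted model there is no need to compute the predictions by hand. As a sketch, the table and total above can be obtained from the model object using the base R accessors `fitted()`, `residuals()` and `deviance()`; the data frame name `fit` is an assumption made for the example.

```r
cars <- tail(mtcars, n = 10)
model <- lm(mpg ~ wt, data = cars)

fit <- cars[, c("wt", "mpg")]
fit$yhat <- fitted(model)              # predictions from the fitted line
fit$residual <- residuals(model)       # observed minus predicted
fit$squared_residual <- fit$residual^2

fit
sum(fit$squared_residual)  # about 41.59603
deviance(model)            # for an lm fit, the same residual sum of squares
```

This total is smaller than the totals for both the red and the blue line, as expected for the least squares solution.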

Least squares solution

Here is the least squares solution shown in hot pink.
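A plot of this kind can be drawn with base R graphics. This is only a sketch and not necessarily how the original figure was produced: it assumes the red and blue candidates are \(39 - 6.5x\) and \(42 - 7.5x\), and the axis labels and legend placement are choices made here.

```r
cars <- tail(mtcars, n = 10)
model <- lm(mpg ~ wt, data = cars)

# Scatter plot of the 10 cars
plot(cars$wt, cars$mpg, pch = 19,
     xlab = "wt (1000 lbs)", ylab = "mpg")

# Candidate lines: a is the intercept, b is the slope
abline(a = 39, b = -6.5, col = "red")
abline(a = 42, b = -7.5, col = "blue")

# Least squares line taken directly from the fitted model
abline(model, col = "hotpink", lwd = 2)

legend("topright",
       legend = c("red candidate", "blue candidate", "least squares"),
       col = c("red", "blue", "hotpink"), lwd = c(1, 1, 2))
```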