2026-03-08
Simple linear regression is all about statistical relationships as opposed to deterministic relationships: the relationship between the variables is imperfect. Notice how the plot has a trend, but also exhibits some scatter.
Let's introduce some notation: \(y_i\) is the observed response for the \(i\)-th data point, \(x_i\) is its predictor value, \(\hat{y_i}\) is the fitted (predicted) value, and \(b_0\) and \(b_1\) are the intercept and slope of the fitted line.
Then the equation for the best-fitting line is:
\[ \hat{y_i} = b_0 + b_1 x_i \]
where we make a prediction error (residual) of size
\[e_i = y_i - \hat{y_i}\]
We need to find the line that minimizes the sum of the squared residuals. This means we need to find values for \(b_0\) and \(b_1\) that minimize the following:
\[ \sum_{i=1}^{n} (y_i - \hat{y_i})^2 \]
Let us consider the last 10 data points from mtcars:
cars = tail(mtcars, n = 10)
cars[, c("wt", "mpg")]
                    wt  mpg
AMC Javelin      3.435 15.2
Camaro Z28       3.840 13.3
Pontiac Firebird 3.845 19.2
Fiat X1-9        1.935 27.3
Porsche 914-2    2.140 26.0
Lotus Europa     1.513 30.4
Ford Pantera L   3.170 15.8
Ferrari Dino     2.770 19.7
Maserati Bora    3.570 15.0
Volvo 142E       2.780 21.4
Here we have two candidate regression lines, the red and the blue line. The red line's equation is \(\hat{y} = 39 - 6.5x\). The blue line's is \(\hat{y} = 42 - 7.5x\).
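The original plot is not reproduced here, but a minimal base R sketch of it (using the cars data frame from the chunk above) would be:

plot(mpg ~ wt, data = cars)
abline(a = 39, b = -6.5, col = "red")   # red candidate line
abline(a = 42, b = -7.5, col = "blue")  # blue candidate line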
We will calculate the squared residuals for the red line.
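The chunk that builds the table below is not shown; a minimal sketch that reproduces it, using a hypothetical data frame named red, is:

red = cars[, c("wt", "mpg")]
red$yhat = 39 - 6.5 * red$wt           # predictions from the red line
red$residual = red$mpg - red$yhat      # prediction errors
red$squared_residual = red$residual^2
red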
                    wt  mpg    yhat residual squared_residual
AMC Javelin      3.435 15.2 16.6725  -1.4725        2.1682563
Camaro Z28       3.840 13.3 14.0400  -0.7400        0.5476000
Pontiac Firebird 3.845 19.2 14.0075   5.1925       26.9620562
Fiat X1-9        1.935 27.3 26.4225   0.8775        0.7700063
Porsche 914-2    2.140 26.0 25.0900   0.9100        0.8281000
Lotus Europa     1.513 30.4 29.1655   1.2345        1.5239902
Ford Pantera L   3.170 15.8 18.3950  -2.5950        6.7340250
Ferrari Dino     2.770 19.7 20.9950  -1.2950        1.6770250
Maserati Bora    3.570 15.0 15.7950  -0.7950        0.6320250
Volvo 142E       2.780 21.4 20.9300   0.4700        0.2209000
The sum of squared residuals is
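Continuing the sketch above (red is the hypothetical data frame built there):

sum(red$squared_residual)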
[1] 42.06398
Likewise, we will calculate the squared residuals for the blue line.
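Again as a sketch, with a hypothetical data frame named blue:

blue = cars[, c("wt", "mpg")]
blue$yhat = 42 - 7.5 * blue$wt         # predictions from the blue line
blue$residual = blue$mpg - blue$yhat   # prediction errors
blue$squared_residual = blue$residual^2
blue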
                    wt  mpg    yhat residual squared_residual
AMC Javelin      3.435 15.2 16.2375  -1.0375       1.07640625
Camaro Z28       3.840 13.3 13.2000   0.1000       0.01000000
Pontiac Firebird 3.845 19.2 13.1625   6.0375      36.45140625
Fiat X1-9        1.935 27.3 27.4875  -0.1875       0.03515625
Porsche 914-2    2.140 26.0 25.9500   0.0500       0.00250000
Lotus Europa     1.513 30.4 30.6525  -0.2525       0.06375625
Ford Pantera L   3.170 15.8 18.2250  -2.4250       5.88062500
Ferrari Dino     2.770 19.7 21.2250  -1.5250       2.32562500
Maserati Bora    3.570 15.0 15.2250  -0.2250       0.05062500
Volvo 142E       2.780 21.4 21.1500   0.2500       0.06250000
The sum of squared residuals is
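And its sum, continuing the sketch:

sum(blue$squared_residual)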
[1] 45.9586
The sum of squared residuals for the red line is \(42.06398\), while for the blue line it is \(45.9586\).
This means that the red line is better! Why? Because it has the smaller error.
Since the sum of squared residuals is a quadratic function of \(b_0\) and \(b_1\), it is convex and has a global minimum. Solving for this minimum is known as finding the least squares solution: the values of \(b_0\) and \(b_1\) that give the smallest sum of squared residuals. We can use R to solve for the least squares solution as follows:
model = lm(mpg ~ wt, data = cars)
model
Call:
lm(formula = mpg ~ wt, data = cars)

Coefficients:
(Intercept)           wt
     39.633       -6.657
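For reference, these coefficients agree with the standard closed-form least squares solution, obtained by setting the partial derivatives of the sum of squared residuals with respect to \(b_0\) and \(b_1\) to zero:
\[ b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x} \]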
Our red line's estimated coefficients were really close!
We will calculate the squared residuals for the least squares line.
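As before, a sketch that reproduces the table below, using a hypothetical data frame named lsq (the fitted values and residuals come straight from the model object):

lsq = cars[, c("wt", "mpg")]
lsq$yhat = predict(model)              # fitted values from the least squares line
lsq$residual = residuals(model)        # equivalently, mpg - yhat
lsq$squared_residual = lsq$residual^2
lsq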
                    wt  mpg     yhat   residual squared_residual
AMC Javelin      3.435 15.2 16.76731 -1.5673076       2.45645321
Camaro Z28       3.840 13.3 14.07132 -0.7713241       0.59494080
Pontiac Firebird 3.845 19.2 14.03804  5.1619597      26.64582786
Fiat X1-9        1.935 27.3 26.75243  0.5475680       0.29983073
Porsche 914-2    2.140 26.0 25.38780  0.6122017       0.37479089
Lotus Europa     1.513 30.4 29.56158  0.8384197       0.70294759
Ford Pantera L   3.170 15.8 18.53135 -2.7313463       7.46025243
Ferrari Dino     2.770 19.7 21.19405 -1.4940461       2.23217373
Maserati Bora    3.570 15.0 15.86865 -0.8686464       0.75454664
Volvo 142E       2.780 21.4 21.12748  0.2725214       0.07426791
The sum of squared residuals is
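Continuing the sketch (sum(residuals(model)^2) would give the same number):

sum(lsq$squared_residual)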
[1] 41.59603
This is smaller than both the red line's \(42.06398\) and the blue line's \(45.9586\), as it must be. Here is the least squares solution shown in hot pink.
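That plot is not reproduced here either; a minimal base R sketch:

plot(mpg ~ wt, data = cars)
abline(model, col = "hotpink")   # abline() accepts a fitted lm directly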