Linear Regression

2025-09-14

Introduction

Simple linear regression models the relationship between two variables.
One predictor (X): Independent variable
One response (Y): Dependent variable
Equation: \(Y = \beta_0 + \beta_1 X + \epsilon\)
We estimate the intercept (\(\beta_0\)) and slope (\(\beta_1\)).
\(Y\): Response variable
\(X\): Predictor variable
\(\beta_0\): Intercept (value of \(Y\) when \(X=0\))
\(\beta_1\): Slope (change in \(Y\) for a one-unit change in \(X\))
\(\epsilon\): Error term

Dataset

We will use the mtcars data set, a data set built into R.
It contains data on fuel consumption and car design.
For our purposes, we will see how car weight (wt) affects fuel efficiency (mpg).

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
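
The preview above looks like the first six rows of the data set. A minimal sketch that would reproduce it, assuming only base R (the name `cars_subset` is introduced here for illustration):

```r
# mtcars ships with base R (datasets package), so no download is needed
data(mtcars)

# First six rows, all columns
head(mtcars)

# Keep just the two columns used in this analysis
cars_subset <- mtcars[, c("wt", "mpg")]
dim(cars_subset)  # number of rows and columns
```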

Scatterplot showing Weight vs Miles per Gallon

Fitting the Model

Fitting the model is finding the values of the coefficients (\(\beta_0\), \(\beta_1\)) that best describe the relationship between the variables.
In Simple Linear Regression, we draw the line of “best fit” through our data.
The line that fits the “best” is the one that minimizes the sum of squared residuals. This is the Least Squares method.
A residual is the vertical distance between an observed data point and the predicted value on the line: Residual = Observed y - Predicted y
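
To make the Least Squares idea concrete, here is a small sketch that computes the slope and intercept directly from the standard closed-form formulas for simple linear regression, then the sum of squared residuals that the method minimizes. The variable names (`b0`, `b1`, `res`, `ssr`) are chosen for this illustration only.

```r
x <- mtcars$wt
y <- mtcars$mpg

# Closed-form least squares estimates for simple linear regression
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept

# Residuals (observed minus predicted) and the quantity being minimized
res <- y - (b0 + b1 * x)
ssr <- sum(res^2)

c(intercept = b0, slope = b1, ssr = ssr)
```

Any other choice of intercept and slope would give a larger sum of squared residuals on this data.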

Fitting the Model

Once we fit the model, we can use it to make predictions and to see how the predictor (weight, in our case) affects the response, miles per gallon.
In R, we can fit the model using this:

fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)

## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
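
As a sketch of how the fitted model can be used for prediction, the snippet below pulls out the coefficients and predicts mpg for three hypothetical weights. The data frame `new_cars` and its values are made up for illustration.

```r
fit <- lm(mpg ~ wt, data = mtcars)

# Estimated intercept and slope
coef(fit)

# Predicted mpg for hypothetical cars weighing 2500, 3000 and 3500 lbs
# (wt is recorded in units of 1000 lbs)
new_cars <- data.frame(wt = c(2.5, 3.0, 3.5))
pred <- predict(fit, newdata = new_cars)
pred
```

Because the slope is negative, the predicted mpg drops as the weight goes up.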

Regression Equation

From the fitted model:

\[ \widehat{\text{mpg}} = 37.285 - 5.344 \times \text{wt} \]

This means that for every 1000 lb increase in weight, predicted mpg decreases by about 5.3 miles per gallon.
It also tells us that a car with 0 weight (x = 0) would be predicted to get around 37 miles per gallon. No real car weighs nothing, so the intercept mainly anchors the line rather than describing a realistic car.
Let's look at the regression line on a graph.

Regression Line

## `geom_smooth()` using formula = 'y ~ x'
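
The message above comes from ggplot2's `geom_smooth()`. A plot of this kind could be drawn roughly as follows, assuming the ggplot2 package is installed. The axis labels and the choice to hide the confidence band are assumptions, not necessarily the original settings.

```r
library(ggplot2)  # assumed to be installed

# Scatterplot of weight vs mpg with the least squares line on top
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +  # se = FALSE hides the confidence band
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")
p
```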

Residuals

We introduced residuals earlier: Residual = Observed y - Predicted y
The residual represents the error in our prediction.
Let's look at an example from the mtcars data set.

Example: Mazda RX4

The observed y = 21.0 mpg and the weight x = 2.62 (in 1000 lbs). Let's plug the weight into our fitted equation: \[\widehat{\text{mpg}} = 37.285 - 5.344 \times \text{wt}\]
Multiply: \[-5.344 \times 2.62 = -14.00128\]
so \[\hat y = 37.285 - 14.00128 = 23.28372\]
Let's calculate the residual using the formula: \[r = y - \hat y = 21.0 - 23.28372 = -2.28372\]
The residual is about -2.28 mpg. This means our model overpredicts the miles per gallon for this car by about 2.28 mpg. If a residual is negative, the model overpredicts at that point; if it is positive, the model underpredicts.
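
The hand calculation can be checked in R. This sketch computes the residual for the Mazda RX4 two ways: as observed minus predicted, and by looking it up with `residuals()`. The names `y_obs`, `y_hat`, `r_manual` and `r_model` are introduced here for illustration.

```r
fit <- lm(mpg ~ wt, data = mtcars)

# Observed mpg and model prediction for the Mazda RX4
y_obs <- mtcars["Mazda RX4", "mpg"]
y_hat <- predict(fit, newdata = data.frame(wt = mtcars["Mazda RX4", "wt"]))
r_manual <- y_obs - y_hat

# The same residual, as stored by the fitted model
r_model <- residuals(fit)["Mazda RX4"]

c(manual = unname(r_manual), from_model = unname(r_model))
```

R uses the unrounded coefficients, so its result can differ slightly in the later decimal places from the hand calculation, which used coefficients rounded to three decimals.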

3D Visualization with Plotly

Here is a 3D plot showing weight on the x-axis, mpg on the y-axis, and the residuals on the z-axis. This graph gives a clearer view of which cars our model overpredicts (negative residuals, below 0) and which it underpredicts (positive residuals, above 0).
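
A plot along these lines could be built roughly as follows, assuming the plotly package is installed. The helper data frame `plot_data`, the coloring by residual, and the axis titles are assumptions made for this sketch.

```r
library(plotly)  # assumed to be installed

fit <- lm(mpg ~ wt, data = mtcars)

# One row per car: weight, mpg, and the model's residual
plot_data <- data.frame(
  car = rownames(mtcars),
  wt = mtcars$wt,
  mpg = mtcars$mpg,
  residual = residuals(fit)
)

# 3D scatter: weight (x), mpg (y), residual (z); hover text shows the car name
p3d <- plot_ly(
  plot_data,
  x = ~wt, y = ~mpg, z = ~residual,
  color = ~residual,
  text = ~car,
  type = "scatter3d", mode = "markers"
)
p3d <- layout(p3d, scene = list(
  xaxis = list(title = "Weight (1000 lbs)"),
  yaxis = list(title = "Miles per gallon"),
  zaxis = list(title = "Residual (mpg)")
))
p3d
```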