2025-09-14

Introduction

  • Simple linear regression models the relationship between two variables.
  • One predictor (X): Independent variable
  • One response (Y): Dependent variable
  • Equation: \(Y = \beta_0 + \beta_1 X + \epsilon\)
  • We estimate the intercept (\(\beta_0\)) and slope (\(\beta_1\)).
  • \(Y\): Response variable
  • \(X\): Predictor variable
  • \(\beta_0\): Intercept (value of \(Y\) when \(X=0\))
  • \(\beta_1\): Slope (change in \(Y\) for a one-unit change in \(X\))
  • \(\epsilon\): Error term

Dataset

  • We will use the mtcars data set, which is built into R.
  • It contains data on fuel consumption and car design.
  • For our purposes, we will look at how car weight (wt, in 1000 lbs) relates to fuel efficiency (mpg, miles per gallon).
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
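
The preview above looks like the first six rows of the data set. The command that produced it is not shown, so the following is only a minimal sketch, assuming the standard `head()` call was used:

```r
# mtcars ships with base R (datasets package), so no extra package is needed.
data(mtcars)

# Show the first six rows; head() defaults to n = 6.
head(mtcars)

# Number of cars (rows) and variables (columns).
dim(mtcars)
```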

Scatterplot showing Weight vs Miles per Gallon
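
The code behind this scatterplot is not shown. One way it could be drawn is sketched below, assuming the ggplot2 package is installed; the axis labels and the object name `scatter` are illustrative choices, not taken from the original.

```r
library(ggplot2)  # assumed plotting package; base R plot() would also work

# Scatterplot of weight (1000 lbs) against fuel efficiency (mpg).
scatter <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(
    title = "Weight vs Miles per Gallon",
    x = "Weight (1000 lbs)",
    y = "Miles per gallon"
  )

print(scatter)
```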

Fitting the Model

  • Fitting the model is finding the values of the coefficients (\(\beta_0\), \(\beta_1\)) that best describe the relationship between the variables.
  • In Simple Linear Regression, we draw the line of “best fit” through our data.
  • The line that fits the “best” is the one that minimizes the sum of squared residuals. This is the Least Squares method.
  • A residual is the vertical distance between an observed data point and the predicted value on the line: Residual = Observed y - Predicted y
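
For a single predictor, the least squares coefficients have a closed form: the slope is the sum of cross-deviations of X and Y divided by the sum of squared deviations of X, and the intercept makes the line pass through the point of means. The sketch below computes both by hand for wt and mpg; the variable names (`b0`, `b1`, `res`, `sse`) are illustrative.

```r
x <- mtcars$wt   # predictor: weight in 1000 lbs
y <- mtcars$mpg  # response: miles per gallon

# Closed-form least squares estimates for one predictor.
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept

# Residuals (observed minus predicted) and the sum of squares being minimized.
res <- y - (b0 + b1 * x)
sse <- sum(res^2)

c(intercept = b0, slope = b1, sse = sse)
```

These hand-computed values should agree with the coefficients that `lm()` reports for the same two variables.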

Fitting the Model

  • Once we fit the model, we can use it to make predictions and to see how the predictor (in our case, weight) affects the response (miles per gallon).
  • In R, we can fit the model using this:

fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)

## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
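
As a small illustration of using the fitted model, the sketch below pulls out the coefficients and predicts mpg for two hypothetical cars weighing 3000 and 4000 lbs. The data frame `new_cars` is made up for the example and is not part of the original analysis.

```r
fit <- lm(mpg ~ wt, data = mtcars)

# Named vector with the estimated intercept and slope.
coef(fit)

# Predicted mpg for two hypothetical weights (in 1000 lbs).
new_cars <- data.frame(wt = c(3, 4))
pred <- predict(fit, newdata = new_cars)
pred

# The gap between the two predictions equals the slope,
# because the weights differ by exactly one unit.
diff(pred)
```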

Regression Equation

From the fitted model:

\[ mpg = 37.285 - 5.344 * wt \]

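  • This means that for every 1000 lb increase in weight, predicted fuel efficiency decreases by about 5.3 miles per gallon.
  • It also tells us that a car with 0 weight (x = 0) would be predicted to get around 37 miles per gallon. No real car weighs nothing, so the intercept mainly anchors the line rather than describing an actual car.
  • Let's look at the regression line on a graph.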

Regression Line

## `geom_smooth()` using formula = 'y ~ x'
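
The message above suggests the line was drawn with ggplot2's `geom_smooth()`. The exact call is not shown, so this is a sketch under that assumption; `se = FALSE` and the axis labels are illustrative choices. Passing `formula = y ~ x` explicitly also avoids the message.

```r
library(ggplot2)  # assumed, based on the geom_smooth() message

# Scatterplot with the least squares line overlaid.
reg_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")

print(reg_plot)
```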

Residuals

  • We introduced residuals earlier: Residual = Observed y - Predicted y
  • The residual represents the error in our prediction.
  • Let's look at an example from the mtcars data set.

Example: Mazda RX4

The observed value is y = 21.0 mpg and the weight is x = 2.62 (in 1000 lbs). Let's plug the weight into our fitted equation: \[mpg = 37.285 - 5.344 * wt\]
Multiply: \[-5.344 * 2.62 = -14.00128\]
so \[\hat y = 37.285 - 14.00128 = 23.28372\]
Let's calculate the residual using the formula: \[r = y - \hat y = 21.0 - 23.28372 = -2.28372\]
The residual is about -2.28 mpg, which means our model overpredicts the fuel efficiency of this car by about 2.28 mpg. In general, a negative residual means the model overpredicts at that point, and a positive residual means it underpredicts.
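
The same check can be done in R rather than by hand. The sketch below looks up the Mazda RX4 by row name; the variable names are illustrative. Because R uses the unrounded coefficients, the result may differ from the hand calculation in the third decimal place.

```r
fit <- lm(mpg ~ wt, data = mtcars)

# Observed, fitted, and residual values for one car, looked up by row name.
car <- "Mazda RX4"
observed  <- mtcars[car, "mpg"]
predicted <- fitted(fit)[[car]]
residual  <- residuals(fit)[[car]]

c(observed = observed, predicted = predicted, residual = residual)

# The residual should equal observed minus predicted.
all.equal(residual, observed - predicted)
```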

3D Visualization with Plotly

Here is a 3D plot with weight on the x-axis, mpg on the y-axis, and the residuals on the z-axis. It gives a clearer view of which cars the model overpredicts (negative residuals, below 0) and which it underpredicts (positive residuals, above 0).
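
The plotting code is not included, so the following is only a sketch of how such a figure might be built, assuming the plotly package is installed. The `plot_data` frame, the `direction` labels, and the axis titles are made up for illustration.

```r
library(plotly)  # assumed; the slide title names Plotly

fit <- lm(mpg ~ wt, data = mtcars)

# Collect the three plotted quantities plus car names for hover labels.
plot_data <- data.frame(
  car      = rownames(mtcars),
  wt       = mtcars$wt,
  mpg      = mtcars$mpg,
  residual = unname(residuals(fit))
)

# Label each car by the sign of its residual so the two groups get different colors.
plot_data$direction <- ifelse(plot_data$residual > 0,
                              "under-predicted", "over-predicted")

fig <- plot_ly(
  plot_data,
  x = ~wt, y = ~mpg, z = ~residual,
  color = ~direction,
  text = ~car,
  type = "scatter3d", mode = "markers"
)

# Name the three axes of the 3D scene.
fig <- layout(fig, scene = list(
  xaxis = list(title = "Weight (1000 lbs)"),
  yaxis = list(title = "Miles per gallon"),
  zaxis = list(title = "Residual")
))

fig
```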