2025-10-22

Multiple Linear Regression - Predicting a car's fuel efficiency from its weight and horsepower using the mtcars dataset.

What is it? Linear regression is a statistical method for modeling the relationship between a dependent variable (here, miles per gallon) and one or more independent variables (here, weight and horsepower). It fits a straight line, or a plane when there are two predictors, and uses that fit to describe and predict the outcome.

We model miles per gallon as a linear function of weight and horsepower: \[ \text{mpg} = \beta_0 + \beta_1\,\text{wt} + \beta_2\,\text{hp} + \varepsilon \]
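library(tidyverse)  # as_tibble(), mutate(), %>%, ggplot()

data(mtcars)
# as_tibble() drops row names, so copy the car names into a column
mtcars <- as_tibble(mtcars) %>% mutate(car = rownames(mtcars))
head(mtcars)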
## # A tibble: 6 × 12
##     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb car         
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>       
## 1  21       6   160   110  3.9   2.62  16.5     0     1     4     4 Mazda RX4   
## 2  21       6   160   110  3.9   2.88  17.0     0     1     4     4 Mazda RX4 W…
## 3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1 Datsun 710  
## 4  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1 Hornet 4 Dr…
## 5  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2 Hornet Spor…
## 6  18.1     6   225   105  2.76  3.46  20.2     1     0     3     1 Valiant

ggplot 1 - miles per gallon vs weight

mtcars %>%
  ggplot(aes(wt, mpg)) +
  # map size to hp on the points only, so geom_smooth() does not inherit it
  geom_point(aes(size = hp), alpha = 0.7) +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "MPG vs Weight (point size = horsepower)",
       x = "Weight (1000 lbs)", y = "Miles per gallon") +
  guides(size = guide_legend(title = "HP"))

1st latex slide

We model MPG as a linear function of weight and horsepower plus random error: \[ \text{mpg} = \beta_0 + \beta_1\,\text{wt} + \beta_2\,\text{hp} + \varepsilon, \qquad \varepsilon \sim \text{i.i.d. } N(0,\sigma^2). \]

Independent Variables - weight and horsepower

Coefficient table from lm(mpg ~ wt + hp)

term         estimate  std.error  statistic  p.value
(Intercept)    37.227      1.599     23.285    0.000
wt             -3.878      0.633     -6.129    0.000
hp             -0.032      0.009     -3.519    0.001

Model fit (R^2, adj R^2, F, p-value)

r.squared  adj.r.squared  sigma  statistic  p.value  df   logLik      AIC      BIC  deviance  df.residual  nobs
    0.827          0.815  2.593     69.211        0   2  -74.326  156.652  162.515   195.048           29    32
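
The chunk that fits the model is not shown in these slides. A minimal sketch of how the two tables above could be produced, assuming `mod` is the `lm(mpg ~ wt + hp)` fit named in the table caption and that the tables come from `broom::tidy()` and `broom::glance()` (the column names match their output, but that is an assumption; `coef_table` and `fit_stats` are illustrative names):

```r
library(broom)

# Assumed model object; the name `mod` matches what the later code uses
mod <- lm(mpg ~ wt + hp, data = mtcars)

coef_table <- tidy(mod)    # one row per term: estimate, std.error, statistic, p.value
fit_stats  <- glance(mod)  # one row: r.squared, adj.r.squared, sigma, AIC, ...

knitr::kable(coef_table, digits = 3)
knitr::kable(fit_stats, digits = 3)
```

Each slope is a partial effect: the `wt` estimate is the expected change in MPG per additional 1000 lbs with horsepower held fixed, and the `hp` estimate is the expected change per additional horsepower with weight held fixed.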

ggplot 2 - how does it fit

# mod is the fitted lm(mpg ~ wt + hp) model summarized in the tables above
residual_data <- broom::augment(mod)

ggplot(residual_data, aes(.fitted, .resid)) +
  geom_hline(yintercept = 0, linetype = 2) +
  geom_point(alpha = 0.85) +
  labs(title = "Residuals vs Fitted - Checking the Model Fit",
       x = "Predicted MPG (Fitted Values)",
       y = "Residuals (Errors)")

3D Plotly Plot

weight_seq <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 20)
horsepower_seq <- seq(min(mtcars$hp), max(mtcars$hp), length.out = 20)

grid <- expand.grid(weight = weight_seq, horsepower = horsepower_seq)

coefs <- coef(mod)  # named vector in the order (Intercept), wt, hp
grid$predicted_mpg <- coefs[1] + coefs[2] * grid$weight + coefs[3] * grid$horsepower

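library(plotly)

plotly <- plot_ly()
plotly <- add_markers(plotly, data = mtcars,
                      x = ~wt, y = ~hp, z = ~mpg,
                      name = "Actual Cars",
                      marker = list(size = 4))

# A plotly surface reads z[i, j] as the height at (x[j], y[i]): rows follow y
# (horsepower) and columns follow x (weight). expand.grid() varies weight
# fastest, so fill the matrix by row to get that orientation.
z_plane <- matrix(grid$predicted_mpg,
                  nrow = length(horsepower_seq), byrow = TRUE)

plotly <- add_surface(plotly,
                      x = ~weight_seq,
                      y = ~horsepower_seq,
                      z = ~z_plane,
                      opacity = 0.5,
                      name = "Fitted Plane")

plotly <- layout(plotly, scene = list(
  xaxis = list(title = "Weight (1000 lbs)"),
  yaxis = list(title = "Horsepower"),
  zaxis = list(title = "Miles per Gallon (MPG)")))
plotly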
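
One way to double-check the fitted plane is to rebuild it with `predict()` and make the matrix orientation explicit. A sketch, assuming `mod` is `lm(mpg ~ wt + hp, data = mtcars)`; `plane_grid`, `z_check`, and `spot` are illustrative names. A plotly surface reads `z[i, j]` as the height at `x[j]`, `y[i]`, so rows should follow horsepower and columns should follow weight:

```r
mod <- lm(mpg ~ wt + hp, data = mtcars)  # assumed model, as in the coefficient table

weight_seq     <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 20)
horsepower_seq <- seq(min(mtcars$hp), max(mtcars$hp), length.out = 20)

# Column names must match the model's predictors so predict() can find them
plane_grid <- expand.grid(wt = weight_seq, hp = horsepower_seq)
plane_grid$predicted_mpg <- predict(mod, newdata = plane_grid)

# expand.grid() varies wt fastest, so filling by row gives
# rows = horsepower values, columns = weight values
z_check <- matrix(plane_grid$predicted_mpg,
                  nrow = length(horsepower_seq), byrow = TRUE)

# Spot check: row 2, column 5 should be the prediction at hp[2], wt[5]
spot <- predict(mod, newdata = data.frame(wt = weight_seq[5],
                                          hp = horsepower_seq[2]))
all.equal(z_check[2, 5], unname(spot))
```

Because both sequences have length 20, a transposed matrix would not raise an error, which is why a spot check like this is worth running.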

Confidence Interval and Prediction Interval - math slide

new_car <- tibble(wt = 3.0, hp = 110)

conf_interval <- as_tibble(predict(mod, newdata = new_car, 
interval = "confidence", level = 0.95))

pred_interval <- as_tibble(predict(mod, newdata = new_car, 
interval = "prediction", level = 0.95))


knitr::kable(
  dplyr::bind_cols(
    new_car,
    dplyr::rename(conf_interval, mean_fit = fit, lower_CI = lwr, upper_CI = upr),
    dplyr::rename(pred_interval, mean_PI = fit, lower_PI = lwr, upper_PI = upr)
  ),
  digits = 2,
  caption = "95% Confidence vs. Prediction Intervals for MPG"
)

95% Confidence vs. Prediction Intervals for MPG

wt   hp  mean_fit  lower_CI  upper_CI  mean_PI  lower_PI  upper_PI
 3  110      22.1     21.02     23.18     22.1     16.69     27.51
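
The two intervals share the same center and differ only in their standard error. For a new point \(x_0 = (1, \text{wt}, \text{hp})\), the confidence interval for the mean response is \[ \hat{y}_0 \pm t_{0.975,\,n-p}\,\hat\sigma\sqrt{x_0^\top (X^\top X)^{-1} x_0}, \] and the prediction interval for a single new car adds the error variance: \[ \hat{y}_0 \pm t_{0.975,\,n-p}\,\hat\sigma\sqrt{1 + x_0^\top (X^\top X)^{-1} x_0}. \] A sketch that reproduces `predict()` by hand, assuming `mod` is `lm(mpg ~ wt + hp, data = mtcars)`; the variable names are illustrative:

```r
mod <- lm(mpg ~ wt + hp, data = mtcars)  # assumed model, as in the coefficient table

x0        <- c(1, 3.0, 110)        # intercept, wt, hp for the new car
X         <- model.matrix(mod)     # design matrix with columns (Intercept), wt, hp
sigma_hat <- summary(mod)$sigma    # residual standard error
df_resid  <- df.residual(mod)      # n - p = 32 - 3

fit     <- sum(coef(mod) * x0)
# Standard error of the estimated mean response at x0
se_mean <- sigma_hat * sqrt(drop(t(x0) %*% solve(crossprod(X)) %*% x0))
# Standard error for a single new observation: adds the error variance
se_pred <- sqrt(sigma_hat^2 + se_mean^2)

t_crit   <- qt(0.975, df_resid)
conf_int <- fit + c(-1, 1) * t_crit * se_mean
pred_int <- fit + c(-1, 1) * t_crit * se_pred

# Compare with predict()
new_car <- data.frame(wt = 3.0, hp = 110)
predict(mod, newdata = new_car, interval = "confidence", level = 0.95)
predict(mod, newdata = new_car, interval = "prediction", level = 0.95)
```

The prediction interval is always the wider of the two, because it accounts for the scatter of individual cars around the mean as well as the uncertainty in the estimated mean.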

Model Assumptions

For linear regression, we assume:

\[ \begin{aligned} 1.\;& \text{Linearity: } E[Y|X] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 \\ 2.\;& \text{Independence: errors are independent of each other} \\ 3.\;& \text{Homoskedasticity: } \operatorname{Var}(\varepsilon_i) = \sigma^2 \text{ for all } i \\ 4.\;& \text{Normality: } \varepsilon_i \sim N(0, \sigma^2) \end{aligned} \]
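
These assumptions can be probed numerically as well as visually. A rough sketch using base R only, assuming `mod` is `lm(mpg ~ wt + hp, data = mtcars)`. The checks chosen here (a Shapiro-Wilk test, an n·R² auxiliary regression in the spirit of the Breusch-Pagan test, and a nested-model F test for curvature in weight) are illustrative choices, not the only options:

```r
mod <- lm(mpg ~ wt + hp, data = mtcars)  # assumed model, as in the coefficient table
res <- residuals(mod)

# Normality: Shapiro-Wilk test on the residuals
# (a small p-value suggests the residuals are not normal)
sw <- shapiro.test(res)

# Homoskedasticity: regress squared residuals on the predictors;
# n * R^2 is compared to a chi-squared distribution with 2 df (two predictors)
aux     <- lm(I(res^2) ~ wt + hp, data = mtcars)
bp_stat <- nrow(mtcars) * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 2, lower.tail = FALSE)

# Linearity: does adding a squared weight term improve the fit?
mod_quad <- lm(mpg ~ wt + I(wt^2) + hp, data = mtcars)
lin_test <- anova(mod, mod_quad)

sw$p.value
bp_p
lin_test
```

Independence is mostly a question of how the data were collected, so it is usually argued from the study design instead of tested from the residuals.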

Final Takeaways

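  • Linear regression predicts fuel efficiency from car weight and horsepower; the fitted model explains about 83% of the variation in MPG (R^2 = 0.827).
  • Heavier and more powerful cars get lower MPG: each additional 1000 lbs is associated with about 3.9 fewer mpg, and each additional horsepower with about 0.03 fewer mpg, holding the other predictor fixed.
  • The residuals-vs-fitted plot suggests a mostly linear relationship, with some remaining scatter.
  • 3D Plotly visuals show the effect of both predictors on MPG at once.
  • Confidence and prediction intervals quantify uncertainty: for a car with wt = 3 and hp = 110, the 95% confidence interval for mean MPG is 21.02 to 23.18, while the 95% prediction interval for a single car is wider, 16.69 to 27.51.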