Motivation

  • Goal: Predict a car’s fuel efficiency (mpg) from weight (wt) and horsepower (hp).
  • We’ll fit a multiple linear regression model, assess fit, and interpret results.
  • Data: built-in mtcars dataset in R.

The Model (Math)

We model the response \(y\) as a linear function of predictors \(\mathbf{x}\):

\[ \mathbf{y} = \mathbf{X}oldsymbol{eta} + oldsymbol{ arepsilon}, \quad oldsymbol{ arepsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}). \]

For two predictors \(x_1 = ext{wt}\) and \(x_2 = ext{hp}\):

\[ ext{mpg}_i = eta_0 + eta_1 \, ext{wt}_i + eta_2 \, ext{hp}_i + arepsilon_i. \]

Estimation (Math)

Ordinary Least Squares (OLS) estimate:

\[ \hat{oldsymbol{eta}} = (\mathbf{X}^ op \mathbf{X})^{-1}\mathbf{X}^ op \mathbf{y} \]

Inference uses the sampling distribution of \(\hat{oldsymbol{eta}}\):

\[ \hat{oldsymbol{eta}} \sim \mathcal{N}\!\left(oldsymbol{eta}, \; \sigma^2 (\mathbf{X}^ op \mathbf{X})^{-1} ight). \]

Fit the Model (R code)

# Packages
library(tidyverse)
library(broom)
library(plotly)

# Data
data(mtcars)
df <- mtcars %>% 
  mutate(wt = wt, hp = hp, mpg = mpg)

# Fit multiple linear regression
fit <- lm(mpg ~ wt + hp, data = df)
summary(fit)
tidy(fit)

Notes: - Coefficients show mpg decreases as wt or hp increases. - We’ll visualize the fitted plane and diagnostics next.

ggplot: Observed vs Fitted

library(ggplot2)
aug <- augment(fit)

ggplot(aug, aes(.fitted, mpg)) +
  geom_point(alpha = 0.8) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(title = "Observed vs Fitted MPG",
       x = "Fitted values", y = "Observed mpg")

ggplot: Residuals vs Fitted

ggplot(aug, aes(.fitted, .resid)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_point(alpha = 0.8) +
  labs(title = "Residuals vs Fitted",
       x = "Fitted values", y = "Residuals")

plotly 3D: Fitted Plane over Data

# Create grid for surface
wt_seq <- seq(min(df$wt), max(df$wt), length.out = 30)
hp_seq <- seq(min(df$hp), max(df$hp), length.out = 30)
grid <- expand.grid(wt = wt_seq, hp = hp_seq)
grid$mpg_hat <- predict(fit, newdata = grid)

# Matrix for surface
z_mat <- matrix(grid$mpg_hat, nrow = length(wt_seq), ncol = length(hp_seq))

# 3D interactive plot
plot_ly() |>
  add_markers(data = df, x = ~wt, y = ~hp, z = ~mpg,
              name = "Cars", marker = list(size = 4)) |>
  add_surface(x = wt_seq, y = hp_seq, z = z_mat, opacity = 0.6, name = "Fitted plane") |>
  layout(scene = list(xaxis = list(title = "Weight (1000 lbs)"),
                      yaxis = list(title = "Horsepower"),
                      zaxis = list(title = "MPG")))

Hypothesis Tests

We test the significance of each coefficient with a t test:

  • \(H_0: eta_j = 0\)
  • \(H_A: eta_j eq 0\)

In R:

tidy(fit, conf.int = TRUE)
  • Look at p values and confidence intervals for \(eta_1\) and \(eta_2\).

Confidence Intervals for Predictions

newcars <- tibble(
  wt = c(2.5, 3.5),
  hp = c(110, 180)
)

predict(fit, newdata = newcars, interval = "prediction")
  • Prediction intervals account for both parameter uncertainty and residual variance.

Takeaways

  • Multiple linear regression captures the effect of two predictors at once.
  • In our example, heavier and more powerful cars tend to have lower mpg.
  • Always check diagnostics and communicate uncertainty with intervals.