Multiple Linear Regression: From Estimation to Inference

Motivation

Goal: Predict a car’s fuel efficiency (mpg) from weight (wt) and horsepower (hp).
We’ll fit a multiple linear regression model, assess fit, and interpret results.
Data: built-in mtcars dataset in R.

The Model (Math)

We model the response \(y\) as a linear function of predictors \(\mathbf{x}\):

\[ \mathbf{y} = \mathbf{X}oldsymbol{eta} + oldsymbol{arepsilon}, \quad oldsymbol{arepsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}). \]

For two predictors \(x_1 = ext{wt}\) and \(x_2 = ext{hp}\):

\[ ext{mpg}_i = eta_0 + eta_1 \, ext{wt}_i + eta_2 \, ext{hp}_i + arepsilon_i. \]

Estimation (Math)

Ordinary Least Squares (OLS) estimate:

\[ \hat{oldsymbol{eta}} = (\mathbf{X}^ op \mathbf{X})^{-1}\mathbf{X}^ op \mathbf{y} \]

Inference uses the sampling distribution of \(\hat{oldsymbol{eta}}\):

\[ \hat{oldsymbol{eta}} \sim \mathcal{N}\!\left(oldsymbol{eta}, \; \sigma^2 (\mathbf{X}^ op \mathbf{X})^{-1} ight). \]

Fit the Model (R code)

# Packages
library(tidyverse)
library(broom)
library(plotly)

# Data
data(mtcars)
df <- mtcars %>% 
  mutate(wt = wt, hp = hp, mpg = mpg)

# Fit multiple linear regression
fit <- lm(mpg ~ wt + hp, data = df)
summary(fit)
tidy(fit)

Notes: - Coefficients show mpg decreases as wt or hp increases. - We’ll visualize the fitted plane and diagnostics next.

ggplot: Observed vs Fitted

library(ggplot2)
aug <- augment(fit)

ggplot(aug, aes(.fitted, mpg)) +
  geom_point(alpha = 0.8) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(title = "Observed vs Fitted MPG",
       x = "Fitted values", y = "Observed mpg")

ggplot: Residuals vs Fitted

ggplot(aug, aes(.fitted, .resid)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_point(alpha = 0.8) +
  labs(title = "Residuals vs Fitted",
       x = "Fitted values", y = "Residuals")

plotly 3D: Fitted Plane over Data

# Create grid for surface
wt_seq <- seq(min(df$wt), max(df$wt), length.out = 30)
hp_seq <- seq(min(df$hp), max(df$hp), length.out = 30)
grid <- expand.grid(wt = wt_seq, hp = hp_seq)
grid$mpg_hat <- predict(fit, newdata = grid)

# Matrix for surface
z_mat <- matrix(grid$mpg_hat, nrow = length(wt_seq), ncol = length(hp_seq))

# 3D interactive plot
plot_ly() |>
  add_markers(data = df, x = ~wt, y = ~hp, z = ~mpg,
              name = "Cars", marker = list(size = 4)) |>
  add_surface(x = wt_seq, y = hp_seq, z = z_mat, opacity = 0.6, name = "Fitted plane") |>
  layout(scene = list(xaxis = list(title = "Weight (1000 lbs)"),
                      yaxis = list(title = "Horsepower"),
                      zaxis = list(title = "MPG")))

Hypothesis Tests

We test the significance of each coefficient with a t test:

\(H_0: eta_j = 0\)
\(H_A: eta_j eq 0\)

In R:

tidy(fit, conf.int = TRUE)

Look at p values and confidence intervals for \(eta_1\) and \(eta_2\).

Confidence Intervals for Predictions

newcars <- tibble(
  wt = c(2.5, 3.5),
  hp = c(110, 180)
)

predict(fit, newdata = newcars, interval = "prediction")

Prediction intervals account for both parameter uncertainty and residual variance.

Takeaways

Multiple linear regression captures the effect of two predictors at once.
In our example, heavier and more powerful cars tend to have lower mpg.
Always check diagnostics and communicate uncertainty with intervals.