What we’re learning

  • Goal: model how car weight (wt) predicts fuel efficiency (mpg)
  • We’ll cover:
    • the linear regression model + assumptions
    • two ggplot visualizations
    • a 3D Plotly view of the loss function (SSE)
    • inference: slope, p-value, confidence interval

Data (mtcars)

We’ll use the built-in mtcars dataset (32 cars).

                     mpg    wt   hp
Mazda RX4           21.0  2.62  110
Mazda RX4 Wag       21.0  2.88  110
Datsun 710          22.8  2.32   93
Hornet 4 Drive      21.4  3.21  110
Hornet Sportabout   18.7  3.44  175
Valiant             18.1  3.46  105

Interpretation:

  • mpg = miles per gallon (response)
  • wt = weight in 1000 lbs (predictor)

The model (math)

The simple linear regression model is:

\[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \]

Assumptions (typical):

  • \(E(\varepsilon_i)=0\)
  • constant variance: \(\mathrm{Var}(\varepsilon_i)=\sigma^2\)
  • independent errors (often)
  • normal errors (mainly for inference)
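To connect the notation to R, here is a minimal sketch that fits the model with lm() and reads off the estimates of \(\beta_0\), \(\beta_1\), and \(\sigma\). The object name fit is just a convenient choice, and sigma() is used here as the usual residual-standard-error estimate of \(\sigma\).

```r
# Fit y_i = beta0 + beta1 * x_i + eps_i with y = mpg and x = wt
fit <- lm(mpg ~ wt, data = mtcars)

coef(fit)    # estimates of beta0 (intercept) and beta1 (slope)
sigma(fit)   # residual standard error, an estimate of sigma

# Each observation splits into a fitted part plus a residual,
# mirroring beta0 + beta1 * x_i and eps_i in the model
recon <- unname(fitted(fit) + resid(fit))
all.equal(recon, mtcars$mpg)
```

The residuals stand in for the unobserved errors \(\varepsilon_i\), which is why the assumptions are checked on them.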

ggplot #1: scatter + fitted line

ggplot #2: residuals vs fitted

What to look for:

  • random scatter around 0 is good
  • curvature/funnel shape suggests model issues
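One way to draw this diagnostic with ggplot2 is sketched below. The data frame name diag_df and the axis labels are illustrative choices, not taken from the original Rmd.

```r
library(ggplot2)

fit <- lm(mpg ~ wt, data = mtcars)

# Collect fitted values and residuals in one data frame for plotting
diag_df <- data.frame(fitted = fitted(fit), resid = resid(fit))

ggplot(diag_df, aes(fitted, resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +  # reference line at zero
  labs(x = "Fitted mpg", y = "Residual")
```

With an intercept in the model the residuals always sum to zero, so the thing to judge is the pattern around the dashed line, not its average height.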

Plotly 3D: SSE surface over (β0, β1)

The loss we minimize in OLS is:

\[ \mathrm{SSE}(\beta_0,\beta_1)=\sum_{i=1}^n (y_i-(\beta_0+\beta_1 x_i))^2 \]
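A possible way to build this surface is to evaluate the SSE on a grid with outer() and hand the matrix to Plotly. This is a sketch, not the Rmd's exact code: the grid ranges, the grid size, and the helper name sse are assumptions chosen so the OLS solution sits inside the plotted region.

```r
library(plotly)

x <- mtcars$wt
y <- mtcars$mpg

# SSE for a single (b0, b1) pair
sse <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

# Grid ranges are illustrative choices around the OLS solution
b0_grid <- seq(20, 55, length.out = 60)
b1_grid <- seq(-10, 0, length.out = 60)

# outer() needs a vectorized function, so wrap sse() with Vectorize()
sse_grid <- outer(b0_grid, b1_grid, Vectorize(sse))

# A surface trace reads z[i, j] at (x[j], y[i]), hence the transpose
plot_ly(x = b0_grid, y = b1_grid, z = t(sse_grid), type = "surface") %>%
  layout(scene = list(
    xaxis = list(title = "beta0"),
    yaxis = list(title = "beta1"),
    zaxis = list(title = "SSE")
  ))
```

The surface is a long, narrow bowl. Its lowest point sits at the coefficients lm() returns, and no grid value can fall below the SSE of the fitted model.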

Inference (math + example results)

Testing if weight matters:

\[ H_0:\beta_1=0 \quad\text{vs}\quad H_a:\beta_1\neq 0 \]

Test statistic:

\[ t=\frac{\hat\beta_1 - 0}{SE(\hat\beta_1)} \]

A 95% CI for the slope, where \(t^*\) is the 0.975 quantile of the t distribution with \(n-2\) degrees of freedom:

\[ \hat\beta_1 \pm t^* SE(\hat\beta_1) \]

term         estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)   37.2851     1.8776    19.8576  <0.0001   33.4505    41.1198
wt            -5.3445     0.5591    -9.5590  <0.0001   -6.4863    -4.2026
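The slope row of the table can be reproduced by hand from the formulas above. The sketch below assumes a two-sided test and uses \(n-2=30\) residual degrees of freedom. The variable names are illustrative.

```r
fit <- lm(mpg ~ wt, data = mtcars)

est <- coef(summary(fit))["wt", "Estimate"]     # slope estimate
se  <- coef(summary(fit))["wt", "Std. Error"]   # its standard error
df  <- df.residual(fit)                         # n - 2 = 30

t_stat <- (est - 0) / se                               # test statistic under H0
p_val  <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)  # two-sided p-value
t_crit <- qt(0.975, df)                                # t* for a 95% interval
ci     <- est + c(-1, 1) * t_crit * se

c(t = t_stat, p = p_val, low = ci[1], high = ci[2])
confint(fit, "wt")  # should match the hand-computed interval
```

The p-value is far below 0.0001, so a table rounded to four decimals shows it as 0.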

R code slide

library(ggplot2)
library(plotly)

fit <- lm(mpg ~ wt, data = mtcars)

# ggplot scatter + line
ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE)

# 3D Plotly SSE surface (outline)
# (See full grid + outer() code in the Rmd)

Key takeaways

  • The fitted line predicts mpg from wt: each extra 1000 lbs is associated with about 5.3 fewer mpg
  • OLS chooses coefficients that minimize SSE
  • The p-value for the slope measures how surprising the estimated slope would be if the true slope were zero; a small value is evidence of a nonzero relationship
  • Always check diagnostics (like residual plots)