Simple Linear Regression: mpg vs weight

What we’re learning

Goal: model how car weight (wt) predicts fuel efficiency (mpg)
We’ll cover:
- the linear regression model + assumptions
- two ggplot visualizations
- a 3D Plotly view of the loss function (SSE)
- inference: slope, p-value, confidence interval

Data (mtcars)

We’ll use the built-in mtcars dataset (32 cars).

	mpg	wt	hp
Mazda RX4	21.0	2.62	110
Mazda RX4 Wag	21.0	2.88	110
Datsun 710	22.8	2.32	93
Hornet 4 Drive	21.4	3.21	110
Hornet Sportabout	18.7	3.44	175
Valiant	18.1	3.46	105

Interpretation:
- mpg = miles per gallon (response)
- wt = weight in 1000 lbs (predictor)
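The preview above can be reproduced with a couple of lines of base R. This is a minimal sketch; the object name `preview` and the rounding to two decimals are assumptions made for display, not something the slides specify.

```r
# mtcars ships with base R (package "datasets"), so no download is needed
preview <- head(mtcars[, c("mpg", "wt", "hp")])  # first 6 of 32 cars

# Round for display only; the model should be fit on the unrounded data
print(round(preview, 2))
```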

The model (math)

The simple linear regression model is:

\[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \]

Assumptions (typical):
- \(E(\varepsilon_i)=0\)
- constant variance: \(\mathrm{Var}(\varepsilon_i)=\sigma^2\)
- independent errors (often)
- normal errors (mainly for inference)

ggplot #1: scatter + fitted line

ggplot #2: residuals vs fitted

What to look for:
- random scatter around 0 is good
- curvature/funnel shape suggests model issues
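One way to draw a residuals-vs-fitted plot with ggplot2 is sketched below. The names `diag_df` and `p_resid` are illustrative, and the dashed reference line at 0 is a styling choice rather than part of the original slide.

```r
library(ggplot2)

fit <- lm(mpg ~ wt, data = mtcars)

# Collect fitted values and residuals in one data frame for plotting
diag_df <- data.frame(
  fitted = fitted(fit),
  resid  = resid(fit)
)

p_resid <- ggplot(diag_df, aes(fitted, resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +  # reference line at 0
  labs(x = "Fitted mpg", y = "Residual")

p_resid
```

Because the model includes an intercept, the residuals sum to zero by construction, so the points should straddle the dashed line; what matters is whether their spread or trend changes along the x-axis.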

Plotly 3D: SSE surface over (β0, β1)

The loss we minimize in OLS is:

\[ \mathrm{SSE}(\beta_0,\beta_1)=\sum_{i=1}^n (y_i-(\beta_0+\beta_1 x_i))^2 \]
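A rough sketch of how such a surface could be built is shown below. It is not the code from the Rmd: the grid ranges and resolution are assumptions chosen to bracket the fitted coefficients, and the Plotly call is guarded so the grid computation still runs if plotly is not installed.

```r
x <- mtcars$wt
y <- mtcars$mpg

# SSE for a single candidate (b0, b1) pair, matching the formula above
sse <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

# Illustrative grid (assumed ranges), 60 x 60 candidate pairs
b0_grid <- seq(25, 50, length.out = 60)
b1_grid <- seq(-10, 0, length.out = 60)

# outer() expects a vectorized function, so wrap sse() with Vectorize()
sse_grid <- outer(b0_grid, b1_grid, Vectorize(sse))  # rows: b0, cols: b1

# Grid point with the smallest SSE; it should lie near the OLS estimates
best <- which(sse_grid == min(sse_grid), arr.ind = TRUE)
c(b0 = b0_grid[best[1, "row"]], b1 = b1_grid[best[1, "col"]])

if (requireNamespace("plotly", quietly = TRUE)) {
  # Plotly surfaces index z as z[y, x], hence the transpose
  surf <- plotly::plot_ly(
    x = b0_grid, y = b1_grid, z = t(sse_grid), type = "surface"
  )
  # print(surf) renders the surface in an interactive session
}
```

The surface is a long, narrow valley because the intercept and slope estimates are strongly correlated, so the grid minimum can sit a few grid steps away from the exact OLS solution.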

Inference (math + example results)

Testing if weight matters:

\[ H_0:\beta_1=0 \quad\text{vs}\quad H_a:\beta_1\neq 0 \]

Test statistic (under \(H_0\) and normal errors, it follows a t distribution with \(n-2\) degrees of freedom):

\[ t=\frac{\hat\beta_1 - 0}{SE(\hat\beta_1)} \]
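As a check on the formula, the t statistic and its two-sided p-value can be computed by hand from the fitted model. This is a minimal sketch using base R only; the variable names (`coefs`, `t_stat`, `p_value`) are illustrative.

```r
fit <- lm(mpg ~ wt, data = mtcars)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- summary(fit)$coefficients

b1    <- coefs["wt", "Estimate"]
se_b1 <- coefs["wt", "Std. Error"]

# t = (estimate - hypothesized value) / standard error
t_stat <- (b1 - 0) / se_b1

# Residual degrees of freedom: n - 2 = 30 for 32 cars and 2 coefficients
df <- df.residual(fit)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df)

c(t = t_stat, df = df, p = p_value)
```

The hand-computed values should agree with the `t value` and `Pr(>|t|)` columns that `summary(fit)` reports for `wt`.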

A 95% CI for the slope, where \(t^*\) is the 0.975 quantile of the t distribution with \(n-2\) degrees of freedom:

\[ \hat\beta_1 \pm t^* SE(\hat\beta_1) \]

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	37.2851	1.8776	19.8576	<0.0001	33.4505	41.1198
wt	-5.3445	0.5591	-9.5590	<0.0001	-6.4863	-4.2026
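The interval in the `wt` row can be rebuilt from the CI formula and compared with R's built-in `confint()`. A minimal sketch, with illustrative variable names:

```r
fit   <- lm(mpg ~ wt, data = mtcars)
coefs <- summary(fit)$coefficients

b1    <- coefs["wt", "Estimate"]
se_b1 <- coefs["wt", "Std. Error"]

# Critical value t*: 0.975 quantile of t with n - 2 = 30 degrees of freedom
t_star <- qt(0.975, df = df.residual(fit))

# Estimate plus/minus t* times the standard error
ci_manual <- b1 + c(-1, 1) * t_star * se_b1

# Built-in equivalent for comparison
ci_builtin <- confint(fit, "wt", level = 0.95)

ci_manual
ci_builtin
```

Both intervals should match the `conf.low` and `conf.high` values in the table. Since the interval excludes 0, it is consistent with rejecting \(H_0:\beta_1=0\) at the 5% level.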

R code slide

library(ggplot2)
library(plotly)

fit <- lm(mpg ~ wt, data = mtcars)

# ggplot scatter + line
ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE)

# 3D Plotly SSE surface (outline)
# (See full grid + outer() code in the Rmd)

Key takeaways

The fitted line predicts mpg from wt: each extra 1000 lbs is associated with about 5.3 fewer mpg
OLS chooses coefficients that minimize SSE
The p-value for the slope answers: “if the true slope were zero, how surprising would an estimate this far from zero be?”
Always check diagnostics (like residual plots)