We use the Auto dataset from the ISLR
package to investigate whether the relationship between mpg
(miles per gallon) and horsepower is linear or
nonlinear.
The key diagnostic tool is the Residual vs. Fitted Plot:
| Pattern | Interpretation |
|---|---|
| Residuals randomly scattered around 0 | ✅ Linearity holds |
| “U”-shape or systematic curvature | ❌ Nonlinear trend not captured |
All plots use base R graphics only — no ggplot2 required.
## mpg horsepower weight displacement
## 1 18 130 3504 307
## 2 15 165 3693 350
## 3 18 150 3436 318
## 4 16 150 3433 304
## 5 17 140 3449 302
## 6 15 198 4341 429
\[\text{mpg} = \beta_0 + \beta_1 \cdot \text{horsepower} + \varepsilon\]
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
fit1 <- fitted(lm1)
res1 <- residuals(lm1)
# LOESS smooth
lo1 <- loess(res1 ~ fit1)
lo1_seq <- seq(min(fit1), max(fit1), length.out = 200)
lo1_pred <- predict(lo1, newdata = data.frame(fit1 = lo1_seq), se = TRUE)
plot(fit1, res1,
pch = 16, col = adjustcolor("#2196F3", alpha.f = 0.5),
xlab = "Fitted Values", ylab = "Residuals",
main = "Residual vs. Fitted — Linear Model\nmpg ~ horsepower")
abline(h = 0, col = "red", lty = 2, lwd = 1.5)
# LOESS line + confidence band
polygon(
c(lo1_seq, rev(lo1_seq)),
c(lo1_pred$fit + 1.96 * lo1_pred$se.fit,
rev(lo1_pred$fit - 1.96 * lo1_pred$se.fit)),
col = adjustcolor("#FF5722", alpha.f = 0.15), border = NA
)
lines(lo1_seq, lo1_pred$fit, col = "#FF5722", lwd = 2)
legend("topright",
legend = c("Residuals", "LOESS smooth", "Zero line"),
pch = c(16, NA, NA),
lty = c(NA, 1, 2),
lwd = c(NA, 2, 1.5),
col = c(adjustcolor("#2196F3", 0.5), "#FF5722", "red"),
bty = "n")
text(max(fit1) * 0.65, max(res1) * 0.9,
"U-shape detected: nonlinearity present",
col = "#FF5722", cex = 0.85, font = 2)Figure 1: Residual vs. Fitted — Linear Model
Interpretation: The LOESS smooth shows a clear “U”-shape, indicating that the linear model under-predicts at low and high horsepower values but over-predicts in the middle range. The linear specification has failed to capture the nonlinear trend.
\[\text{mpg} = \beta_0 + \beta_1 \cdot \text{horsepower} + \beta_2 \cdot \text{horsepower}^2 + \varepsilon\]
##
## Call:
## lm(formula = mpg ~ poly(horsepower, 2), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.7135 -2.5943 -0.0859 2.2868 15.8961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.4459 0.2209 106.13 <2e-16 ***
## poly(horsepower, 2)1 -120.1377 4.3739 -27.47 <2e-16 ***
## poly(horsepower, 2)2 44.0895 4.3739 10.08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared: 0.6876, Adjusted R-squared: 0.686
## F-statistic: 428 on 2 and 389 DF, p-value: < 2.2e-16
fit2 <- fitted(lm2)
res2 <- residuals(lm2)
lo2 <- loess(res2 ~ fit2)
lo2_seq <- seq(min(fit2), max(fit2), length.out = 200)
lo2_pred <- predict(lo2, newdata = data.frame(fit2 = lo2_seq), se = TRUE)
plot(fit2, res2,
pch = 16, col = adjustcolor("#4CAF50", alpha.f = 0.5),
xlab = "Fitted Values", ylab = "Residuals",
main = "Residual vs. Fitted — Quadratic Model\nmpg ~ poly(horsepower, 2)")
abline(h = 0, col = "red", lty = 2, lwd = 1.5)
polygon(
c(lo2_seq, rev(lo2_seq)),
c(lo2_pred$fit + 1.96 * lo2_pred$se.fit,
rev(lo2_pred$fit - 1.96 * lo2_pred$se.fit)),
col = adjustcolor("#FF5722", alpha.f = 0.15), border = NA
)
lines(lo2_seq, lo2_pred$fit, col = "#FF5722", lwd = 2)
legend("topright",
legend = c("Residuals", "LOESS smooth", "Zero line"),
pch = c(16, NA, NA),
lty = c(NA, 1, 2),
lwd = c(NA, 2, 1.5),
col = c(adjustcolor("#4CAF50", 0.5), "#FF5722", "red"),
bty = "n")
text(max(fit2) * 0.55, max(res2) * 0.9,
"Residuals randomly scattered",
col = "#4CAF50", cex = 0.85, font = 2)Figure 2: Residual vs. Fitted — Quadratic Model
Interpretation: After adding the quadratic term, the LOESS smooth is much flatter and the residuals appear randomly scattered — the nonlinear relationship has been adequately captured.
par(mfrow = c(1, 2), mar = c(4, 4, 3, 1))
# --- Left: Linear ---
plot(fit1, res1,
pch = 16, col = adjustcolor("#2196F3", 0.4),
xlab = "Fitted Values", ylab = "Residuals",
main = "Linear Model")
abline(h = 0, col = "red", lty = 2, lwd = 1.5)
lines(lo1_seq, lo1_pred$fit, col = "#FF5722", lwd = 2)
text(mean(range(fit1)), max(res1) * 0.85,
"U-shape: nonlinear", col = "#FF5722", cex = 0.8, font = 2)
# --- Right: Quadratic ---
plot(fit2, res2,
pch = 16, col = adjustcolor("#4CAF50", 0.4),
xlab = "Fitted Values", ylab = "Residuals",
main = "Quadratic Model")
abline(h = 0, col = "red", lty = 2, lwd = 1.5)
lines(lo2_seq, lo2_pred$fit, col = "#FF5722", lwd = 2)
text(mean(range(fit2)), max(res2) * 0.85,
"Flat: linearity holds", col = "#4CAF50", cex = 0.8, font = 2)Figure 3: Side-by-Side Comparison
## === Linear Model ===
## R2 : 0.6059
## Adj. R2 : 0.6049
## RSE : 4.9058
##
## === Quadratic Model ===
## R2 : 0.6876
## Adj. R2 : 0.6860
## RSE : 4.3739
##
## === ANOVA (Linear vs Quadratic) ===
## Analysis of Variance Table
##
## Model 1: mpg ~ horsepower
## Model 2: mpg ~ poly(horsepower, 2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 390 9385.9
## 2 389 7442.0 1 1943.9 101.61 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
| Diagnostic | Linear Model | Quadratic Model |
|---|---|---|
| Residual pattern | ❌ Clear “U”-shape | ✅ Randomly scattered |
| R² | ~0.606 | ~0.688 |
| ANOVA p-value | — | < 0.001 (significant gain) |
The Residual vs. Fitted plot is a powerful tool for
detecting nonlinearity. The simple linear model shows an unmistakable
U-shaped pattern, revealing that mpg and
horsepower have a nonlinear (quadratic)
relationship. Adding the polynomial term resolves the pattern
and yields a significantly better fit.
Ready to publish: Knit to HTML in RStudio, then click Publish -> RPubs.