We use the Auto dataset from the ISLR
package to examine the relationship between horsepower
(predictor) and mpg (miles per gallon, response).
We fit two models:
- Model 1 (linear): mpg ~ horsepower
- Model 2 (quadratic): mpg ~ horsepower + horsepower²

We then use Residual vs. Fitted plots to judge whether the linear model is sufficient or whether a nonlinear term is needed.
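The setup chunk is not shown here, so the following is only a sketch of what could produce the output below. It assumes the `ISLR` package is installed; the `cat()` label and the two-column subset are guesses based on the printed output.

```r
# Load the Auto data shipped with the ISLR package
library(ISLR)
data(Auto)

# Print the dimensions (rows, columns) of the data set
cat("Dimensions:", dim(Auto), "\n")

# Quartiles, mean, and range for the two variables used in this analysis
summary(Auto[, c("mpg", "horsepower")])
```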
## Dimensions: 392 9
## mpg horsepower
## Min. : 9.00 Min. : 46.0
## 1st Qu.:17.00 1st Qu.: 75.0
## Median :22.75 Median : 93.5
## Mean :23.45 Mean :104.5
## 3rd Qu.:29.00 3rd Qu.:126.0
## Max. :46.60 Max. :230.0
ggplot(Auto, aes(x = horsepower, y = mpg)) +
  geom_point(alpha = 0.4, colour = "steelblue") +
  geom_smooth(method = "lm", se = FALSE, colour = "red", linetype = "dashed", linewidth = 0.8) +
  geom_smooth(method = "loess", se = FALSE, colour = "darkorange", linewidth = 0.8) +
  labs(title = "MPG vs Horsepower",
       subtitle = "Red dashed = linear fit | Orange = LOESS (flexible) fit",
       x = "Horsepower", y = "MPG") +
  theme_minimal()

The LOESS curve bends noticeably, a strong visual hint that the relationship is nonlinear.
\[\text{mpg} = \beta_0 + \beta_1 \cdot \text{horsepower} + \varepsilon\]
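The fitting call itself is not shown, but the `Call:` line in the output below and the later use of `fitted(lm1)` suggest something like the following. This is a minimal sketch, assuming the `ISLR` package is installed.

```r
library(ISLR)
data(Auto)

# Simple linear regression of mpg on horsepower
lm1 <- lm(mpg ~ horsepower, data = Auto)

# Coefficient table, residual standard error, R-squared, and F-statistic
summary(lm1)
```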
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
Auto$fitted_lm <- fitted(lm1)
Auto$resid_lm <- residuals(lm1)
ggplot(Auto, aes(x = fitted_lm, y = resid_lm)) +
  geom_point(alpha = 0.4, colour = "steelblue") +
  geom_hline(yintercept = 0, colour = "red", linetype = "dashed", linewidth = 0.8) +
  geom_smooth(method = "loess", se = FALSE, colour = "darkorange", linewidth = 0.8) +
  labs(title = "Model 1 (Linear): Residuals vs. Fitted",
       subtitle = "A clear U-shaped pattern → the linear model misses nonlinearity",
       x = "Fitted Values", y = "Residuals") +
  theme_minimal()

Interpretation: The residuals show a clear U shape: they are positive at low and high fitted values and negative in the middle. This systematic curvature is strong evidence that the linear model has failed to capture a nonlinear trend in the data.
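The direction of the curvature can also be checked numerically rather than by eye. One informal approach, sketched here as an addition to the analysis, is to regress the residuals on the fitted values and their square. A positive coefficient on the squared term points to a U shape, a negative one to an inverted U. The names `diag_df` and `curv` are introduced only for this illustration.

```r
library(ISLR)
data(Auto)

lm1 <- lm(mpg ~ horsepower, data = Auto)

# Auxiliary regression of the residuals on the fitted values and their square
diag_df <- data.frame(res = residuals(lm1), fit = fitted(lm1))
curv <- lm(res ~ fit + I(fit^2), data = diag_df)

# Sign of the squared term: > 0 means U shape, < 0 means inverted U
coef(curv)["I(fit^2)"]
```

This is a rough check on the plot, not a replacement for it: it only detects quadratic curvature and says nothing about other patterns such as changing variance.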
\[\text{mpg} = \beta_0 + \beta_1 \cdot \text{horsepower} + \beta_2 \cdot \text{horsepower}^2 + \varepsilon\]
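The `Call:` line in the output below shows `poly(horsepower, 2)`, which builds orthogonal polynomial terms. Its coefficients are therefore not the β₁ and β₂ of the equation above, although the fitted values are the same as for the raw-power form. A sketch of both parameterisations, assuming the `ISLR` package is installed; `lm2_raw` is a name introduced here for illustration.

```r
library(ISLR)
data(Auto)

# Quadratic fit with orthogonal polynomial terms (matches the Call: line)
lm2 <- lm(mpg ~ poly(horsepower, 2), data = Auto)

# Same model written with raw powers, matching beta_1 and beta_2 in the equation
lm2_raw <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)

# Both parameterisations give the same fitted values
all.equal(fitted(lm2), fitted(lm2_raw))

summary(lm2)
```

The orthogonal form keeps the two polynomial terms uncorrelated, which is why both rows of the coefficient table share the same standard error.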
##
## Call:
## lm(formula = mpg ~ poly(horsepower, 2), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.7135 -2.5943 -0.0859 2.2868 15.8961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.4459 0.2209 106.13 <2e-16 ***
## poly(horsepower, 2)1 -120.1377 4.3739 -27.47 <2e-16 ***
## poly(horsepower, 2)2 44.0895 4.3739 10.08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared: 0.6876, Adjusted R-squared: 0.686
## F-statistic: 428 on 2 and 389 DF, p-value: < 2.2e-16
Auto$fitted_lm2 <- fitted(lm2)
Auto$resid_lm2 <- residuals(lm2)
ggplot(Auto, aes(x = fitted_lm2, y = resid_lm2)) +
  geom_point(alpha = 0.4, colour = "steelblue") +
  geom_hline(yintercept = 0, colour = "red", linetype = "dashed", linewidth = 0.8) +
  geom_smooth(method = "loess", se = FALSE, colour = "darkorange", linewidth = 0.8) +
  labs(title = "Model 2 (Quadratic): Residuals vs. Fitted",
       subtitle = "Residuals are much more randomly scattered around zero",
       x = "Fitted Values", y = "Residuals") +
  theme_minimal()

Interpretation: After adding the quadratic term, the residuals scatter much more randomly around the zero line, with no strong systematic pattern. This suggests that the quadratic model captures most of the nonlinear relationship.
par(mfrow = c(1, 2))
# Linear model diagnostic
plot(lm1, which = 1,
     main = "Model 1 (Linear)\nResiduals vs Fitted",
     pch = 20, col = adjustcolor("steelblue", alpha.f = 0.5))
# Quadratic model diagnostic
plot(lm2, which = 1,
     main = "Model 2 (Quadratic)\nResiduals vs Fitted",
     pch = 20, col = adjustcolor("darkorange", alpha.f = 0.5))

comparison <- data.frame(
  Model = c("Linear", "Quadratic"),
  R_squared = c(summary(lm1)$r.squared, summary(lm2)$r.squared),
  Adj_R2 = c(summary(lm1)$adj.r.squared, summary(lm2)$adj.r.squared),
  RSE = c(summary(lm1)$sigma, summary(lm2)$sigma)
)
comparison[, 2:4] <- round(comparison[, 2:4], 4)
kable(comparison,
      col.names = c("Model", "R²", "Adjusted R²", "Residual Std Error"),
      caption = "Model Comparison: Linear vs. Quadratic") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(2, bold = TRUE, background = "#d4edda")

| Model | R² | Adjusted R² | Residual Std Error |
|---|---|---|---|
| Linear | 0.6059 | 0.6049 | 4.9058 |
| Quadratic | 0.6876 | 0.6860 | 4.3739 |
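
Because the linear model is nested in the quadratic one, the improvement in the table can also be checked with a partial F-test. This is an addition to the analysis, sketched under the assumption that `lm1` and `lm2` were fit as their `Call:` lines indicate; `cmp` is a name introduced here. With a single added term, the F statistic equals the square of that term's t value in the quadratic summary.

```r
library(ISLR)
data(Auto)

lm1 <- lm(mpg ~ horsepower, data = Auto)
lm2 <- lm(mpg ~ poly(horsepower, 2), data = Auto)

# Partial F-test: does the quadratic term reduce the residual sum of squares
# by more than chance alone would explain?
cmp <- anova(lm1, lm2)
print(cmp)

# F statistic and p-value for the added quadratic term
cmp$F[2]
cmp$`Pr(>F)`[2]
```

A small p-value here agrees with the residual plots: the quadratic term explains variation that the linear model leaves behind.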

| Criterion | Linear Model | Quadratic Model |
|---|---|---|
| Residual vs. Fitted | U-shaped pattern (nonlinearity missed) | Random scatter (nonlinearity captured) |
| R² | ~0.606 | ~0.688 |
| Residual Std Error | 4.906 (higher) | 4.374 (lower) |
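The Residual vs. Fitted plot is the key diagnostic tool here: the linear model leaves a U-shaped pattern in its residuals, while the quadratic model's residuals scatter roughly at random around zero.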
Final answer: The relationship between
horsepower and mpg in the Auto dataset is
nonlinear. A quadratic (polynomial degree 2) model is
more appropriate than a simple linear regression.
## R version 4.5.2 (2025-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## time zone: UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] kableExtra_1.4.0 knitr_1.51 ggplot2_4.0.2 ISLR_1.4
##
## loaded via a namespace (and not attached):
## [1] Matrix_1.7-4 gtable_0.3.6 jsonlite_2.0.0 compiler_4.5.2
## [5] xml2_1.5.2 stringr_1.6.0 jquerylib_0.1.4 splines_4.5.2
## [9] systemfonts_1.3.2 scales_1.4.0 textshaping_1.0.5 yaml_2.3.12
## [13] fastmap_1.2.0 lattice_0.22-7 R6_2.6.1 labeling_0.4.3
## [17] svglite_2.2.2 bslib_0.10.0 RColorBrewer_1.1-3 rlang_1.1.7
## [21] cachem_1.1.0 stringi_1.8.7 xfun_0.56 sass_0.4.10
## [25] S7_0.2.1 viridisLite_0.4.3 cli_3.6.5 withr_3.0.2
## [29] magrittr_2.0.4 mgcv_1.9-3 digest_0.6.39 grid_4.5.2
## [33] rstudioapi_0.18.0 lifecycle_1.0.5 nlme_3.1-168 vctrs_0.7.1
## [37] evaluate_1.0.5 glue_1.8.0 farver_2.1.2 rmarkdown_2.30
## [41] tools_4.5.2 htmltools_0.5.9