We use the Auto dataset from the
ISLR2 package to investigate the relationship between
horsepower and mpg. We fit:
Then we inspect Residual vs. Fitted plots to check whether a nonlinear relationship exists.
# Install ISLR2 if needed
# install.packages("ISLR2")
library(ISLR2)
library(ggplot2)
data(Auto)
head(Auto[, c("mpg", "horsepower", "weight", "cylinders")])## mpg horsepower weight cylinders
## 1 18 130 3504 8
## 2 15 165 3693 8
## 3 18 150 3436 8
## 4 16 150 3433 8
## 5 17 140 3449 8
## 6 15 198 4341 8
ggplot(Auto, aes(x = horsepower, y = mpg)) +
geom_point(alpha = 0.5, color = "#2c7bb6") +
geom_smooth(method = "lm", color = "red", se = FALSE, linetype = "dashed", linewidth = 1) +
geom_smooth(method = "loess", color = "#1a9641", se = FALSE, linewidth = 1) +
labs(
title = "MPG vs Horsepower",
subtitle = "Red dashed = linear fit | Green = LOESS (true trend)",
x = "Horsepower",
y = "Miles Per Gallon (mpg)"
) +
theme_minimal(base_size = 13)Observation: The LOESS curve bends — suggesting a nonlinear (curved) relationship between
horsepowerandmpg.
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
# Extract diagnostics
lm1_diag <- data.frame(
fitted = fitted(lm1),
residuals = residuals(lm1)
)
ggplot(lm1_diag, aes(x = fitted, y = residuals)) +
geom_point(alpha = 0.4, color = "#d7191c") +
geom_hline(yintercept = 0, linetype = "dashed", color = "black", linewidth = 0.8) +
geom_smooth(method = "loess", color = "#d7191c", se = TRUE, linewidth = 1.2) +
labs(
title = "Residual vs. Fitted — Linear Model",
subtitle = "⚠ U-shaped pattern → nonlinearity NOT captured",
x = "Fitted Values",
y = "Residuals"
) +
theme_minimal(base_size = 13)Interpretation: The residuals show a clear U-shaped (∪) curve — the model systematically over-predicts at low and high horsepower, and under-predicts in the middle. This is a classic signal that the linear model fails to capture the true nonlinear relationship.
##
## Call:
## lm(formula = mpg ~ poly(horsepower, 2), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.7135 -2.5943 -0.0859 2.2868 15.8961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.4459 0.2209 106.13 <2e-16 ***
## poly(horsepower, 2)1 -120.1377 4.3739 -27.47 <2e-16 ***
## poly(horsepower, 2)2 44.0895 4.3739 10.08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared: 0.6876, Adjusted R-squared: 0.686
## F-statistic: 428 on 2 and 389 DF, p-value: < 2.2e-16
lm2_diag <- data.frame(
fitted = fitted(lm2),
residuals = residuals(lm2)
)
ggplot(lm2_diag, aes(x = fitted, y = residuals)) +
geom_point(alpha = 0.4, color = "#1a9641") +
geom_hline(yintercept = 0, linetype = "dashed", color = "black", linewidth = 0.8) +
geom_smooth(method = "loess", color = "#1a9641", se = TRUE, linewidth = 1.2) +
labs(
title = "Residual vs. Fitted — Quadratic Model",
subtitle = "✓ Residuals randomly scattered around zero — nonlinearity captured",
x = "Fitted Values",
y = "Residuals"
) +
theme_minimal(base_size = 13)Interpretation: The residuals are now randomly scattered around the zero line with no systematic pattern. This confirms the quadratic term successfully captures the nonlinear relationship.
library(gridExtra) # comes with ggplot2 installation
p1 <- ggplot(lm1_diag, aes(x = fitted, y = residuals)) +
geom_point(alpha = 0.4, color = "#d7191c") +
geom_hline(yintercept = 0, linetype = "dashed") +
geom_smooth(method = "loess", color = "#d7191c", se = FALSE, linewidth = 1.2) +
labs(title = "Linear Model", subtitle = "U-shape = nonlinearity missed",
x = "Fitted Values", y = "Residuals") +
theme_minimal(base_size = 12)
p2 <- ggplot(lm2_diag, aes(x = fitted, y = residuals)) +
geom_point(alpha = 0.4, color = "#1a9641") +
geom_hline(yintercept = 0, linetype = "dashed") +
geom_smooth(method = "loess", color = "#1a9641", se = FALSE, linewidth = 1.2) +
labs(title = "Quadratic Model", subtitle = "Random scatter = good fit",
x = "Fitted Values", y = "Residuals") +
theme_minimal(base_size = 12)
grid.arrange(p1, p2, ncol = 2)summary_df <- data.frame(
Model = c("Linear", "Quadratic"),
R_Squared = c(summary(lm1)$r.squared, summary(lm2)$r.squared),
Adj_R2 = c(summary(lm1)$adj.r.squared, summary(lm2)$adj.r.squared),
Residual_SE = c(summary(lm1)$sigma, summary(lm2)$sigma),
Pattern = c("U-shape (nonlinearity)", "Random (good)")
)
knitr::kable(summary_df, digits = 4,
col.names = c("Model", "R²", "Adj. R²", "Residual SE", "Residual Pattern"),
caption = "Model Comparison: Linear vs. Quadratic")| Model | R² | Adj. R² | Residual SE | Residual Pattern |
|---|---|---|---|---|
| Linear | 0.6059 | 0.6049 | 4.9058 | U-shape (nonlinearity) |
| Quadratic | 0.6876 | 0.6860 | 4.3739 | Random (good) |
| Diagnostic Rule | What You See | Conclusion |
|---|---|---|
| Linear model residuals | U-shaped curve | Nonlinearity not captured |
| Quadratic model residuals | Random scatter around 0 | Nonlinearity captured ✓ |
poly(horsepower, 2) fixes this, as confirmed by
the flat residual plot and improved R².