This report uses the Auto dataset (from the
ISLR package) to demonstrate:
mpg on
horsepowermpg on horsepowerKey Diagnostic Rule
- Linear fit is adequate: residuals scatter randomly around 0 — no pattern.
- Nonlinearity present: residuals form a curved pattern (U-shape or inverted-U), indicating the model is systematically wrong.
# Install ISLR if needed: install.packages("ISLR")
library(ISLR) # Auto dataset
library(ggplot2) # Plotting
library(dplyr) # Data wrangling
library(gridExtra) # Arrange multiple plots
library(broom) # Tidy model outputs
# Load the Auto dataset
data(Auto)
# Quick look
glimpse(Auto)#> Rows: 392
#> Columns: 9
#> $ mpg <dbl> 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14, 2…
#> $ cylinders <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, 4, …
#> $ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383, 34…
#> $ horsepower <dbl> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170, 16…
#> $ weight <dbl> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, 385…
#> $ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8.5, …
#> $ year <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 7…
#> $ origin <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 3, …
#> $ name <fct> chevrolet chevelle malibu, buick skylark 320, plymouth sa…
ggplot(Auto, aes(x = horsepower, y = mpg)) +
geom_point(alpha = 0.5, color = "steelblue", size = 2) +
geom_smooth(method = "lm", formula = y ~ x,
aes(color = "Linear"), se = TRUE, linewidth = 1) +
geom_smooth(method = "lm", formula = y ~ poly(x, 2),
aes(color = "Quadratic"), se = TRUE, linewidth = 1) +
scale_color_manual(name = "Fit",
values = c("Linear" = "firebrick", "Quadratic" = "darkgreen")) +
labs(
title = "MPG vs. Horsepower",
subtitle = "Auto Dataset (n = 392)",
x = "Horsepower",
y = "Miles Per Gallon (MPG)"
) +
theme_bw(base_size = 13)Figure 1: mpg vs. horsepower with linear and quadratic fits.
Observation: The scatter plot already hints at a curved (nonlinear) relationship between
mpgandhorsepower. The quadratic fit (green) tracks the data better than the straight line (red).
\[\text{mpg} = \beta_0 + \beta_1 \cdot \text{horsepower} + \varepsilon\]
#>
#> Call:
#> lm(formula = mpg ~ horsepower, data = Auto)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -13.5710 -3.2592 -0.3435 2.7630 16.9240
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
#> horsepower -0.157845 0.006446 -24.49 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 4.906 on 390 degrees of freedom
#> Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
#> F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
\[\text{mpg} = \beta_0 + \beta_1 \cdot \text{horsepower} + \beta_2 \cdot \text{horsepower}^2 + \varepsilon\]
#>
#> Call:
#> lm(formula = mpg ~ poly(horsepower, 2, raw = TRUE), data = Auto)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -14.7135 -2.5943 -0.0859 2.2868 15.8961
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 56.9000997 1.8004268 31.60 <2e-16 ***
#> poly(horsepower, 2, raw = TRUE)1 -0.4661896 0.0311246 -14.98 <2e-16 ***
#> poly(horsepower, 2, raw = TRUE)2 0.0012305 0.0001221 10.08 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 4.374 on 389 degrees of freedom
#> Multiple R-squared: 0.6876, Adjusted R-squared: 0.686
#> F-statistic: 428 on 2 and 389 DF, p-value: < 2.2e-16
# Compare R-squared and RSE
tibble(
Model = c("Linear", "Quadratic"),
R_squared = c(summary(lm1)$r.squared, summary(lm2)$r.squared),
Adj_R_sq = c(summary(lm1)$adj.r.squared, summary(lm2)$adj.r.squared),
RSE = c(summary(lm1)$sigma, summary(lm2)$sigma),
AIC = c(AIC(lm1), AIC(lm2))
) %>%
mutate(across(where(is.numeric), ~ round(.x, 4))) %>%
knitr::kable(caption = "Table 1: Model Comparison Summary")| Model | R_squared | Adj_R_sq | RSE | AIC |
|---|---|---|---|---|
| Linear | 0.6059 | 0.6049 | 4.9058 | 2363.324 |
| Quadratic | 0.6876 | 0.6860 | 4.3739 | 2274.354 |
This is the key diagnostic section. We plot residuals on the y-axis against fitted (predicted) values on the x-axis for both models.
# ── Augment models with residuals & fitted values ──────────────────────────
aug1 <- augment(lm1) # Linear
aug2 <- augment(lm2) # Quadratic
# ── Helper: loess smooth line colour ───────────────────────────────────────
smooth_col <- "firebrick"
# ── Plot 1: Linear model ───────────────────────────────────────────────────
p1 <- ggplot(aug1, aes(x = .fitted, y = .resid)) +
geom_hline(yintercept = 0, linetype = "dashed", color = "grey40", linewidth = 0.8) +
geom_point(alpha = 0.45, color = "steelblue", size = 1.8) +
geom_smooth(method = "loess", formula = y ~ x,
se = FALSE, color = smooth_col, linewidth = 1.2) +
annotate("text", x = Inf, y = Inf,
label = "⚠ U-shaped curve → nonlinearity detected",
hjust = 1.05, vjust = 1.5, color = "firebrick", size = 4, fontface = "bold") +
labs(
title = "Model 1: Linear Regression — Residual vs. Fitted",
subtitle = "mpg ~ horsepower",
x = "Fitted Values (Predicted MPG)",
y = "Residuals"
) +
theme_bw(base_size = 12)
# ── Plot 2: Quadratic model ────────────────────────────────────────────────
p2 <- ggplot(aug2, aes(x = .fitted, y = .resid)) +
geom_hline(yintercept = 0, linetype = "dashed", color = "grey40", linewidth = 0.8) +
geom_point(alpha = 0.45, color = "steelblue", size = 1.8) +
geom_smooth(method = "loess", formula = y ~ x,
se = FALSE, color = "darkgreen", linewidth = 1.2) +
annotate("text", x = Inf, y = Inf,
label = "✔ Residuals randomly scattered → nonlinearity resolved",
hjust = 1.05, vjust = 1.5, color = "darkgreen", size = 4, fontface = "bold") +
labs(
title = "Model 2: Quadratic Regression — Residual vs. Fitted",
subtitle = "mpg ~ horsepower + horsepower²",
x = "Fitted Values (Predicted MPG)",
y = "Residuals"
) +
theme_bw(base_size = 12)
# ── Arrange side-by-side ───────────────────────────────────────────────────
grid.arrange(p1, p2, nrow = 2)Figure 2: Residual vs. Fitted plots for the linear (top) and quadratic (bottom) models.
| Feature | Observation |
|---|---|
| Pattern | Clear inverted-U / arch shape (LOESS curve bows upward then downward) |
| Meaning | The linear model under-predicts MPG for low and high horsepower, and over-predicts in the middle |
| Diagnosis | Nonlinearity is present — a straight line cannot capture the true relationship |
| Feature | Observation |
|---|---|
| Pattern | Residuals are randomly scattered around the zero line |
| LOESS curve | Nearly flat, hugging zero — no systematic curvature |
| Diagnosis | Nonlinearity resolved — adding
horsepower² captures the curved relationship |
R’s built-in plot() for lm objects provides
four panels. The first panel is the Residual vs. Fitted plot; the third
(Scale-Location) is also useful for heteroscedasticity.
par(mfrow = c(2, 2))
# Linear model — first diagnostic panel only
plot(lm1, which = 1,
main = "Linear Model: Residuals vs Fitted",
col = adjustcolor("steelblue", 0.5), pch = 16)
# Quadratic model — first diagnostic panel only
plot(lm2, which = 1,
main = "Quadratic Model: Residuals vs Fitted",
col = adjustcolor("darkgreen", 0.5), pch = 16)
# Scale-Location for both
plot(lm1, which = 3, main = "Linear: Scale-Location",
col = adjustcolor("steelblue", 0.5), pch = 16)
plot(lm2, which = 3, main = "Quadratic: Scale-Location",
col = adjustcolor("darkgreen", 0.5), pch = 16)Figure 3: Base R diagnostic plots for both models.
We can formally test whether the quadratic term significantly improves the model.
#> Analysis of Variance Table
#>
#> Model 1: mpg ~ horsepower
#> Model 2: mpg ~ poly(horsepower, 2, raw = TRUE)
#> Res.Df RSS Df Sum of Sq F Pr(>F)
#> 1 390 9385.9
#> 2 389 7442.0 1 1943.9 101.61 < 2.2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation: A very small p-value (< 0.05) confirms that the quadratic term significantly reduces residual error — the nonlinear model fits significantly better than the linear one.
| Step | Action | Finding |
|---|---|---|
| 1 | Fit linear model: mpg ~ horsepower | R² ≈ 0.606; statistically significant slope |
| 2 | Plot Residuals vs. Fitted (linear) | Arch/U-shape in residuals → nonlinearity detected |
| 3 | Fit quadratic model: mpg ~ horsepower + horsepower² | R² ≈ 0.688; both terms significant; lower AIC |
| 4 | Plot Residuals vs. Fitted (quadratic) | Residuals randomly scattered → nonlinearity resolved |
Key takeaways:
Generated with R Markdown · ISLR Auto dataset · March 16, 2026