This analysis uses the Auto dataset from the
ISLR package to demonstrate how Residual vs. Fitted
plots can reveal nonlinear relationships that a linear model
fails to capture.
We compare two models:
mpg ~ horsepowermpg ~ horsepower + horsepower²# Install ISLR if needed
if (!require(ISLR)) install.packages("ISLR", repos = "https://cloud.r-project.org")
if (!require(ggplot2)) install.packages("ggplot2", repos = "https://cloud.r-project.org")
library(ISLR)
library(ggplot2)
data(Auto)
head(Auto[, c("mpg", "horsepower", "weight", "year")])## mpg horsepower weight year
## 1 18 130 3504 70
## 2 15 165 3693 70
## 3 18 150 3436 70
## 4 16 150 3433 70
## 5 17 140 3449 70
## 6 15 198 4341 70
Before fitting, visualise the raw relationship between
mpg and horsepower.
ggplot(Auto, aes(x = horsepower, y = mpg)) +
geom_point(alpha = 0.5, colour = "#2c7bb6") +
geom_smooth(method = "loess", se = TRUE, colour = "#d7191c", linewidth = 1) +
labs(
title = "MPG vs Horsepower (Auto Dataset)",
subtitle = "LOESS smoother suggests a nonlinear (curved) relationship",
x = "Horsepower", y = "Miles Per Gallon (mpg)"
) +
theme_minimal(base_size = 13)Figure 1: Scatter plot — mpg vs horsepower
Observation: The LOESS smoother curves downward more steeply at low horsepower values, hinting at a nonlinear (concave) relationship.
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
df1 <- data.frame(
fitted = fitted(lm1),
residuals = residuals(lm1)
)
ggplot(df1, aes(x = fitted, y = residuals)) +
geom_point(alpha = 0.45, colour = "#2c7bb6") +
geom_hline(yintercept = 0, linetype = "dashed", colour = "grey40", linewidth = 0.8) +
geom_smooth(method = "loess", se = FALSE, colour = "#d7191c", linewidth = 1.2) +
labs(
title = "Residual vs. Fitted — Linear Model",
subtitle = "A clear U-shaped pattern → the linear model misses nonlinearity",
x = "Fitted Values (ŷ)", y = "Residuals (y − ŷ)"
) +
theme_minimal(base_size = 13)Figure 2: Residual vs. Fitted — Linear model
⚠️ Interpretation: The red LOESS curve forms a clear U-shape, bowing upward in the middle. This systematic curvature tells us that the linear model under-predicts mpg for cars with moderate horsepower and over-predicts for cars at the extremes. The linearity assumption is violated.
##
## Call:
## lm(formula = mpg ~ horsepower + I(horsepower^2), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.7135 -2.5943 -0.0859 2.2868 15.8961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.9000997 1.8004268 31.60 <2e-16 ***
## horsepower -0.4661896 0.0311246 -14.98 <2e-16 ***
## I(horsepower^2) 0.0012305 0.0001221 10.08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared: 0.6876, Adjusted R-squared: 0.686
## F-statistic: 428 on 2 and 389 DF, p-value: < 2.2e-16
df2 <- data.frame(
fitted = fitted(lm2),
residuals = residuals(lm2)
)
ggplot(df2, aes(x = fitted, y = residuals)) +
geom_point(alpha = 0.45, colour = "#1a9641") +
geom_hline(yintercept = 0, linetype = "dashed", colour = "grey40", linewidth = 0.8) +
geom_smooth(method = "loess", se = FALSE, colour = "#d7191c", linewidth = 1.2) +
labs(
title = "Residual vs. Fitted — Quadratic Model",
subtitle = "Residuals scatter randomly around zero → nonlinearity captured",
x = "Fitted Values (ŷ)", y = "Residuals (y − ŷ)"
) +
theme_minimal(base_size = 13)Figure 3: Residual vs. Fitted — Quadratic model
✅ Interpretation: The residuals now scatter randomly around the horizontal zero line with no obvious curvature. Adding the quadratic term has successfully absorbed the nonlinear trend.
df1$model <- "Linear: mpg ~ hp"
df2$model <- "Quadratic: mpg ~ hp + hp²"
df_all <- rbind(df1, df2)
ggplot(df_all, aes(x = fitted, y = residuals)) +
geom_point(alpha = 0.4, colour = "#2c7bb6") +
geom_hline(yintercept = 0, linetype = "dashed", colour = "grey40", linewidth = 0.8) +
geom_smooth(method = "loess", se = FALSE, colour = "#d7191c", linewidth = 1.2) +
facet_wrap(~model, scales = "free_x") +
labs(
title = "Residual vs. Fitted: Linear vs. Quadratic",
x = "Fitted Values (ŷ)", y = "Residuals (y − ŷ)"
) +
theme_minimal(base_size = 13)Figure 4: Side-by-side residual plots
## === Linear Model ===
## R² : 0.6059
## Adj. R² : 0.6049
## RSE : 4.9058
##
## === Quadratic Model ===
## R² : 0.6876
## Adj. R² : 0.6860
## RSE : 4.3739
## Analysis of Variance Table
##
## Model 1: mpg ~ horsepower
## Model 2: mpg ~ horsepower + I(horsepower^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 390 9385.9
## 2 389 7442.0 1 1943.9 101.61 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA F-test confirms that the quadratic term provides a statistically significant improvement over the linear model (p < 2.2e-16).
| Diagnostic | Linear Model | Quadratic Model |
|---|---|---|
| Residual vs. Fitted shape | U-shaped (nonlinearity) | Random scatter ✅ |
| R² | ~0.606 | ~0.688 |
| Linearity assumption | Violated ❌ | Satisfied ✅ |
A U-shape or systematic curve in the Residual vs. Fitted plot is the classic visual signal that your model has failed to capture a nonlinear relationship. The fix — adding polynomial terms, applying a log transformation, or using a nonparametric smoother — can be verified by checking that the pattern disappears in the updated plot.
Analysis performed using R with the ISLR and
ggplot2 packages.