We use the Auto dataset from the ISLR
package to examine whether the relationship between
horsepower and mpg is linear or
nonlinear. The key diagnostic tool is the Residual vs. Fitted
plot.
library(ISLR)
library(ggplot2)
data(Auto)
head(Auto[, c("mpg", "horsepower")])
## mpg horsepower
## 1 18 130
## 2 15 165
## 3 18 150
## 4 16 150
## 5 17 140
## 6 15 198
lm1 <- lm(mpg ~ horsepower, data = Auto)
summary(lm1)
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
ggplot(Auto, aes(x = horsepower, y = mpg)) +
geom_point(alpha = 0.4, color = "#378ADD") +
geom_smooth(method = "lm", se = TRUE, color = "#E24B4A") +
labs(title = "Linear fit: mpg ~ horsepower",
x = "Horsepower", y = "MPG") +
theme_minimal()
Scatter plot with linear fit
plot(lm1, which = 1,
main = "Residuals vs Fitted — Linear Model",
sub = "")
Residual vs. Fitted — Linear model
Interpretation: The residual plot shows a clear U-shape. Residuals are positive at low and high fitted values, and negative in the middle. This pattern confirms the linear model fails to capture a nonlinear trend.
lm2 <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)
summary(lm2)
##
## Call:
## lm(formula = mpg ~ horsepower + I(horsepower^2), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.7135 -2.5943 -0.0859 2.2868 15.8961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.9000997 1.8004268 31.60 <2e-16 ***
## horsepower -0.4661896 0.0311246 -14.98 <2e-16 ***
## I(horsepower^2) 0.0012305 0.0001221 10.08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared: 0.6876, Adjusted R-squared: 0.686
## F-statistic: 428 on 2 and 389 DF, p-value: < 2.2e-16
ggplot(Auto, aes(x = horsepower, y = mpg)) +
geom_point(alpha = 0.4, color = "#378ADD") +
geom_smooth(method = "lm", formula = y ~ x + I(x^2),
se = TRUE, color = "#1D9E75") +
labs(title = "Quadratic fit: mpg ~ hp + hp²",
x = "Horsepower", y = "MPG") +
theme_minimal()
Scatter plot with quadratic fit
plot(lm2, which = 1,
main = "Residuals vs Fitted — Quadratic Model",
sub = "")
Residual vs. Fitted — Quadratic model
Interpretation: The U-shape largely disappears. Residuals are now randomly scattered around zero, indicating the quadratic term successfully captures the nonlinear (diminishing returns) relationship.
par(mfrow = c(1, 2))
plot(lm1, which = 1, main = "Linear model", sub = "")
plot(lm2, which = 1, main = "Quadratic model", sub = "")
par(mfrow = c(1, 1))
data.frame(
Model = c("Linear", "Quadratic"),
R_squared = c(summary(lm1)$r.squared, summary(lm2)$r.squared),
Adj_R2 = c(summary(lm1)$adj.r.squared, summary(lm2)$adj.r.squared),
RMSE = c(
sqrt(mean(residuals(lm1)^2)),
sqrt(mean(residuals(lm2)^2))
)
)
## Model R_squared Adj_R2 RMSE
## 1 Linear 0.6059483 0.6049379 4.893226
## 2 Quadratic 0.6875590 0.6859527 4.357151
| Model | Residual plot | R² | Verdict |
|---|---|---|---|
| Linear | U-shaped curvature | ~0.60 | ❌ Nonlinearity missed |
| Quadratic | Random scatter | ~0.69 | ✅ Nonlinearity captured |
The residual vs. fitted plot is the key diagnostic: a systematic
curve (U-shape) in the linear model’s residuals signals that a
higher-order term is needed. Adding horsepower² removes the
pattern and substantially improves fit.