We regress fuel economy (mpg) on engine power
(horsepower) using the Auto data set, and
use the residual-vs-fitted plot to judge whether a
straight-line model is adequate.
The diagnostic rule:
The Auto data set ships with the ISLR / ISLR2 packages.
Install once with install.packages("ISLR2") if you have not
already.
# Use ISLR2 if available, otherwise fall back to ISLR
if (requireNamespace("ISLR2", quietly = TRUE)) {
data(Auto, package = "ISLR2")
} else {
data(Auto, package = "ISLR")
}
Auto <- na.omit(Auto) # drop the few missing-horsepower rows
nrow(Auto) # number of usable observations
## [1] 392
head(Auto[, c("mpg", "horsepower")])
## mpg horsepower
## 1 18 130
## 2 15 165
## 3 18 150
## 4 16 150
## 5 17 140
## 6 15 198
A quick scatter plot already hints at curvature — the points bend rather than following a straight line.
plot(Auto$horsepower, Auto$mpg,
xlab = "horsepower", ylab = "mpg",
main = "mpg vs. horsepower", pch = 20, col = "grey40")
abline(lm(mpg ~ horsepower, data = Auto), col = "red", lwd = 2) # straight-line fit
\[\text{mpg} = \beta_0 + \beta_1 \,\text{horsepower} + \varepsilon\]
fit_linear <- lm(mpg ~ horsepower, data = Auto)
summary(fit_linear)
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
The slope is large and highly significant (negative: more horsepower
→ lower mpg), and \(R^2 \approx
0.61\), so the line explains a lot of the variation. But
significance alone does not mean the functional form is
correct — we must inspect the residuals.
plot(fitted(fit_linear), resid(fit_linear),
xlab = "Fitted values", ylab = "Residuals",
main = "Model 1 (linear): Residuals vs Fitted",
pch = 20, col = "grey40")
abline(h = 0, lty = 2) # reference line at zero
lines(lowess(fitted(fit_linear), resid(fit_linear)), # smoother to reveal shape
col = "red", lwd = 2)
R’s built-in diagnostic produces the same plot (with the loess smoother drawn automatically):
plot(fit_linear, which = 1)
Interpretation. The residuals are not randomly scattered around zero. The red smoother traces a clear U shape: residuals are positive for small fitted values, dip negative in the middle, and rise positive again for large fitted values. This systematic curvature is exactly the signature of a nonlinear relationship the straight line has missed — the linear model is inadequate.
To capture the curvature we add horsepower\(^2\):
\[\text{mpg} = \beta_0 + \beta_1 \,\text{horsepower} + \beta_2 \,\text{horsepower}^2 + \varepsilon\]
fit_quad <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)
summary(fit_quad)
##
## Call:
## lm(formula = mpg ~ horsepower + I(horsepower^2), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.7135 -2.5943 -0.0859 2.2868 15.8961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.9000997 1.8004268 31.60 <2e-16 ***
## horsepower -0.4661896 0.0311246 -14.98 <2e-16 ***
## I(horsepower^2) 0.0012305 0.0001221 10.08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared: 0.6876, Adjusted R-squared: 0.686
## F-statistic: 428 on 2 and 389 DF, p-value: < 2.2e-16
The quadratic term I(horsepower^2) is highly significant
(its t-statistic is about 10), and \(R^2\) rises from roughly 0.61 to about 0.69
— the curved term genuinely improves the fit.
plot(fitted(fit_quad), resid(fit_quad),
xlab = "Fitted values", ylab = "Residuals",
main = "Model 2 (quadratic): Residuals vs Fitted",
pch = 20, col = "grey40")
abline(h = 0, lty = 2)
lines(lowess(fitted(fit_quad), resid(fit_quad)), col = "red", lwd = 2)
plot(fit_quad, which = 1)
Interpretation. The smoother is now close to flat and the residuals are much more randomly scattered around zero — the pronounced U shape has largely disappeared. The quadratic model captures the nonlinear trend that the straight line could not.
anova(fit_linear, fit_quad) # is the quadratic term needed?
## Analysis of Variance Table
##
## Model 1: mpg ~ horsepower
## Model 2: mpg ~ horsepower + I(horsepower^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 390 9385.9
## 2 389 7442.0 1 1943.9 101.61 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
c(linear_R2 = summary(fit_linear)$r.squared,
quad_R2 = summary(fit_quad)$r.squared)
## linear_R2 quad_R2
## 0.6059483 0.6875590
The anova test rejects the linear model in favour of the
quadratic one (p-value far below 0.05), confirming the visual evidence
from the residual plots.
mpg and
horsepower have a nonlinear relationship
the straight line cannot represent.