This analysis uses the Auto dataset from the
ISLR package to examine the relationship between
mpg (miles per gallon) and horsepower. We fit
a linear regression model and then use Residual
vs. Fitted plots to check whether a nonlinear relationship
exists.
# Install ISLR if needed
if (!require(ISLR)) install.packages("ISLR")
library(ISLR)
# Load Auto dataset
data(Auto)
head(Auto)
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
dim(Auto)
## [1] 392 9
plot(Auto$horsepower, Auto$mpg,
xlab = "Horsepower",
ylab = "MPG (Miles Per Gallon)",
main = "MPG vs. Horsepower",
pch = 19, col = "steelblue", cex = 0.7)
abline(lm(mpg ~ horsepower, data = Auto), col = "red", lwd = 2)
The scatter plot already hints at a curved
(nonlinear) relationship between horsepower and
mpg.
# Fit simple linear regression
lm_fit <- lm(mpg ~ horsepower, data = Auto)
summary(lm_fit)
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
Interpretation key: - Linear relationship → residuals scatter randomly around 0 (no pattern) - Nonlinear relationship → residuals show a “U” shape or systematic curvature
plot(fitted(lm_fit), residuals(lm_fit),
xlab = "Fitted Values",
ylab = "Residuals",
main = "Residual vs. Fitted — Linear Model (mpg ~ horsepower)",
pch = 19, col = "steelblue", cex = 0.7)
abline(h = 0, col = "red", lwd = 2, lty = 2)
lines(lowess(fitted(lm_fit), residuals(lm_fit)), col = "orange", lwd = 2)
legend("topright", legend = c("Zero line", "LOWESS smooth"),
col = c("red", "orange"), lty = c(2, 1), lwd = 2)
Observation: The LOWESS smoother traces a clear “U” shape (inverted curve), indicating the linear model fails to capture the nonlinear trend. This is evidence of a nonlinear relationship.
To address nonlinearity, we fit a quadratic (degree-2) polynomial model:
lm_poly <- lm(mpg ~ poly(horsepower, 2), data = Auto)
summary(lm_poly)
##
## Call:
## lm(formula = mpg ~ poly(horsepower, 2), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.7135 -2.5943 -0.0859 2.2868 15.8961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.4459 0.2209 106.13 <2e-16 ***
## poly(horsepower, 2)1 -120.1377 4.3739 -27.47 <2e-16 ***
## poly(horsepower, 2)2 44.0895 4.3739 10.08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared: 0.6876, Adjusted R-squared: 0.686
## F-statistic: 428 on 2 and 389 DF, p-value: < 2.2e-16
plot(fitted(lm_poly), residuals(lm_poly),
xlab = "Fitted Values",
ylab = "Residuals",
main = "Residual vs. Fitted — Polynomial Model (degree 2)",
pch = 19, col = "darkorange", cex = 0.7)
abline(h = 0, col = "red", lwd = 2, lty = 2)
lines(lowess(fitted(lm_poly), residuals(lm_poly)), col = "purple", lwd = 2)
legend("topright", legend = c("Zero line", "LOWESS smooth"),
col = c("red", "purple"), lty = c(2, 1), lwd = 2)
Observation: With the polynomial model, residuals are much more randomly scattered around zero — the curvature is largely removed, confirming the quadratic term captures the nonlinear pattern.
par(mfrow = c(1, 2))
# Linear
plot(fitted(lm_fit), residuals(lm_fit),
xlab = "Fitted Values", ylab = "Residuals",
main = "Linear Model\n(Shows U-shape = Nonlinearity)",
pch = 19, col = "steelblue", cex = 0.7)
abline(h = 0, col = "red", lwd = 2, lty = 2)
lines(lowess(fitted(lm_fit), residuals(lm_fit)), col = "orange", lwd = 2)
# Polynomial
plot(fitted(lm_poly), residuals(lm_poly),
xlab = "Fitted Values", ylab = "Residuals",
main = "Polynomial Model (degree 2)\n(Residuals more random)",
pch = 19, col = "darkorange", cex = 0.7)
abline(h = 0, col = "red", lwd = 2, lty = 2)
lines(lowess(fitted(lm_poly), residuals(lm_poly)), col = "purple", lwd = 2)
par(mfrow = c(1, 1))
cat("=== Linear Model ===\n")
## === Linear Model ===
cat("R-squared:", summary(lm_fit)$r.squared, "\n")
## R-squared: 0.6059483
cat("RSE:", summary(lm_fit)$sigma, "\n\n")
## RSE: 4.905757
cat("=== Polynomial Model (degree 2) ===\n")
## === Polynomial Model (degree 2) ===
cat("R-squared:", summary(lm_poly)$r.squared, "\n")
## R-squared: 0.687559
cat("RSE:", summary(lm_poly)$sigma, "\n")
## RSE: 4.373921
| Model | R² | Residual Pattern |
|---|---|---|
Linear (mpg ~ horsepower) |
~0.606 | U-shape → nonlinearity detected |
Polynomial (mpg ~ poly(hp, 2)) |
~0.688 | Random scatter → good fit |
Key Takeaway: The Residual vs. Fitted plot for the linear model shows a clear “U” shape, which is the classic diagnostic sign of an unmodeled nonlinear relationship. Fitting a degree-2 polynomial removes this pattern, yielding better model fit (higher R², lower RSE) and randomly scattered residuals.
This approach — inspecting residual plots and then upgrading to polynomial regression — is a standard workflow for detecting and addressing nonlinearity in regression analysis.