In this analysis, we use the Auto dataset from the
ISLR package to examine the relationship between
mpg (miles per gallon) and horsepower. We fit
both a linear and a polynomial
(nonlinear) regression model, and use Residual
vs. Fitted plots to assess whether the linear model captures
the true relationship.
# Install ISLR if needed: install.packages("ISLR")
library(ISLR)
library(ggplot2)
# Load Auto dataset
data(Auto)
# Preview the data
head(Auto)## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
## [1] 392 9
Let’s first visualize the relationship between
horsepower and mpg.
ggplot(Auto, aes(x = horsepower, y = mpg)) +
geom_point(alpha = 0.5, color = "steelblue") +
geom_smooth(method = "lm", se = TRUE, color = "red", linetype = "dashed",
aes(linetype = "Linear")) +
geom_smooth(method = "loess", se = TRUE, color = "darkgreen",
aes(linetype = "LOESS")) +
labs(title = "MPG vs. Horsepower",
subtitle = "Red dashed = Linear fit | Green = LOESS (nonparametric) fit",
x = "Horsepower", y = "Miles Per Gallon (MPG)") +
theme_minimal()Observation: The LOESS curve suggests a curved (nonlinear) relationship — a linear fit may not be adequate.
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
# Extract fitted values and residuals
lm_diag <- data.frame(
fitted = fitted(lm_fit),
residuals = residuals(lm_fit)
)
ggplot(lm_diag, aes(x = fitted, y = residuals)) +
geom_point(alpha = 0.5, color = "steelblue") +
geom_hline(yintercept = 0, linetype = "dashed", color = "red", linewidth = 1) +
geom_smooth(method = "loess", se = FALSE, color = "orange", linewidth = 1.2) +
labs(
title = "Residual vs. Fitted Plot — Linear Model",
subtitle = "A curved (U-shaped) pattern indicates the linear model misses nonlinearity",
x = "Fitted Values",
y = "Residuals"
) +
theme_minimal()Interpretation:
The residuals show a clear U-shaped (curved) pattern around the zero line. This tells us the linear model fails to capture the true nonlinear trend in the data. Residuals are systematically positive at low and high fitted values, and negative in the middle — a classic sign of a missed curve.
To address the nonlinearity, we fit a quadratic (degree-2) polynomial model.
# Fit quadratic model: mpg ~ horsepower + horsepower^2
poly_fit <- lm(mpg ~ poly(horsepower, 2), data = Auto)
summary(poly_fit)##
## Call:
## lm(formula = mpg ~ poly(horsepower, 2), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.7135 -2.5943 -0.0859 2.2868 15.8961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.4459 0.2209 106.13 <2e-16 ***
## poly(horsepower, 2)1 -120.1377 4.3739 -27.47 <2e-16 ***
## poly(horsepower, 2)2 44.0895 4.3739 10.08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared: 0.6876, Adjusted R-squared: 0.686
## F-statistic: 428 on 2 and 389 DF, p-value: < 2.2e-16
poly_diag <- data.frame(
fitted = fitted(poly_fit),
residuals = residuals(poly_fit)
)
ggplot(poly_diag, aes(x = fitted, y = residuals)) +
geom_point(alpha = 0.5, color = "darkorchid") +
geom_hline(yintercept = 0, linetype = "dashed", color = "red", linewidth = 1) +
geom_smooth(method = "loess", se = FALSE, color = "orange", linewidth = 1.2) +
labs(
title = "Residual vs. Fitted Plot — Polynomial Model (Degree 2)",
subtitle = "Residuals should now scatter randomly around zero with no pattern",
x = "Fitted Values",
y = "Residuals"
) +
theme_minimal()Interpretation:
The residuals are now more randomly scattered around the zero line with no strong systematic pattern. This indicates the quadratic model has successfully captured much of the nonlinear trend.
library(gridExtra)
p1 <- ggplot(lm_diag, aes(x = fitted, y = residuals)) +
geom_point(alpha = 0.4, color = "steelblue") +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
geom_smooth(method = "loess", se = FALSE, color = "orange") +
labs(title = "Linear Model",
subtitle = "U-shaped pattern → nonlinearity missed",
x = "Fitted Values", y = "Residuals") +
theme_minimal()
p2 <- ggplot(poly_diag, aes(x = fitted, y = residuals)) +
geom_point(alpha = 0.4, color = "darkorchid") +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
geom_smooth(method = "loess", se = FALSE, color = "orange") +
labs(title = "Polynomial Model (Degree 2)",
subtitle = "Random scatter → nonlinearity captured",
x = "Fitted Values", y = "Residuals") +
theme_minimal()
grid.arrange(p1, p2, ncol = 2)## === Linear Model ===
## R-squared: 0.6059
## Adj R-sq: 0.6049
## RSE: 4.9058
## === Polynomial Model (Degree 2) ===
## R-squared: 0.6876
## Adj R-sq: 0.686
## RSE: 4.3739
| Diagnostic | Linear Model | Polynomial Model (Degree 2) |
|---|---|---|
| Residual vs. Fitted pattern | U-shaped (nonlinear trend present) | Random scatter (good fit) |
| R² | ~0.606 | ~0.688 |
| Conclusion | Underfits the nonlinear relationship | Captures the curvature well |
Key Takeaways:
mpg and horsepower
have a nonlinear (concave) relationship — as horsepower
increases, mpg decreases rapidly at first, then levels off.Analysis performed in R using the ISLR,
ggplot2, and gridExtra packages.