We examine whether mpg (miles per gallon) has a
linear or nonlinear relationship with
horsepower in the Auto dataset from the
ISLR package.
Key diagnostic tool: the Residual vs. Fitted plot
| Pattern | Interpretation |
|---|---|
| Random scatter around zero | Model captures the relationship |
| U-shape / inverted-U / systematic curve | Nonlinearity missed by the model |
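The two patterns in the table can also be checked numerically. The following is a minimal sketch on simulated data (hypothetical values, not the Auto dataset), assuming a true quadratic relationship; the helper `band_means` is an illustrative name, not part of any package. It averages the residuals within the lower, middle, and upper third of the fitted values.

```r
# Simulated data (hypothetical, not the Auto dataset): the true relationship is quadratic.
set.seed(1)
x <- seq(-3, 3, length.out = 200)
y <- 1 + 2 * x + 1.5 * x^2 + rnorm(length(x), sd = 1)

fit_line  <- lm(y ~ x)            # misspecified: straight line only
fit_curve <- lm(y ~ x + I(x^2))   # includes the curvature term

# Mean residual in the lower, middle, and upper third of the fitted values.
# A U-shape appears as positive means at both ends and a negative mean in the middle.
band_means <- function(fit) {
  bands <- cut(fitted(fit),
               breaks = quantile(fitted(fit), c(0, 1/3, 2/3, 1)),
               include.lowest = TRUE,
               labels = c("low", "middle", "high"))
  tapply(residuals(fit), bands, mean)
}

print(round(band_means(fit_line), 2))
print(round(band_means(fit_curve), 2))
```

For the straight-line fit the band means should change sign across the bands, while for the fit with the squared term they should all sit near zero.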
library(ISLR)
library(ggplot2)
library(gridExtra)
data("Auto")
cat("Observations:", nrow(Auto), "\n")
## Observations: 392
## Variables: 9
## mpg horsepower cylinders weight
## 1 18 130 8 3504
## 2 15 165 8 3693
## 3 18 150 8 3436
## 4 16 150 8 3433
## 5 17 140 8 3449
ggplot(Auto, aes(x = horsepower, y = mpg)) +
  geom_point(alpha = 0.45, color = "#2166ac", size = 1.8) +
  geom_smooth(method = "lm", se = TRUE,
              color = "#d73027", linetype = "dashed", linewidth = 1) +
  geom_smooth(method = "loess", se = TRUE,
              color = "#1a9641", linewidth = 1) +
  labs(
    title = "MPG vs Horsepower - Auto dataset",
    subtitle = "Red dashed = Linear fit | Green solid = LOESS (flexible) fit",
    x = "Horsepower",
    y = "Miles Per Gallon (mpg)"
  ) +
  theme_minimal(base_size = 13)
Observation: The flexible LOESS curve (green) clearly bends away from the straight red dashed line. This is the first visual hint that the relationship between mpg and horsepower is nonlinear.
We first fit a straight-line model:
\[\text{mpg} = \beta_0 + \beta_1 \cdot \text{horsepower} + \varepsilon\]
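The call that produced the summary is not shown in the text. A minimal sketch, assuming the object name `model_linear` used by the later plotting code and the formula printed in the `Call:` line of the output:

```r
library(ISLR)   # provides the Auto data frame
data("Auto")

# Simple linear regression of mpg on horsepower.
# The name model_linear is taken from the residual-plot code later in the document.
model_linear <- lm(mpg ~ horsepower, data = Auto)
summary(model_linear)
```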
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
df_linear <- data.frame(
  fitted    = fitted(model_linear),
  residuals = residuals(model_linear)
)
ggplot(df_linear, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.4, color = "#d73027", size = 1.8) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "black", linewidth = 0.8) +
  geom_smooth(method = "loess", se = FALSE, color = "#08306b", linewidth = 1.2) +
  labs(
    title = "Residual vs. Fitted - Linear Model",
    subtitle = "WARNING: Clear U-shape means nonlinearity is NOT captured",
    x = "Fitted Values",
    y = "Residuals"
  ) +
  theme_minimal(base_size = 13)
Interpretation: The smooth blue curve forms a distinct U-shape. Because a residual is the observed value minus the fitted value, the positive residuals at both ends mean the model under-predicts mpg at low and high horsepower values, and the negative residuals in the middle mean it over-predicts there. This systematic pattern reveals that the linear model has failed to capture the nonlinear trend in the data.
We add a squared term to model the curvature:
\[\text{mpg} = \beta_0 + \beta_1 \cdot \text{horsepower} + \beta_2 \cdot \text{horsepower}^2 + \varepsilon\]
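As with the linear model, the fitting call is not shown. A minimal sketch, assuming the object name `model_quad` used by the later plotting code and the formula printed in the `Call:` line of the output:

```r
library(ISLR)   # provides the Auto data frame
data("Auto")

# I() makes R treat horsepower^2 as arithmetic rather than formula syntax,
# so the model gains a genuine squared-horsepower column.
model_quad <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)
summary(model_quad)
```

Writing `horsepower^2` without `I()` would be interpreted by the formula parser as a crossing of `horsepower` with itself, which collapses to the plain linear term.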
##
## Call:
## lm(formula = mpg ~ horsepower + I(horsepower^2), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.7135 -2.5943 -0.0859 2.2868 15.8961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.9000997 1.8004268 31.60 <2e-16 ***
## horsepower -0.4661896 0.0311246 -14.98 <2e-16 ***
## I(horsepower^2) 0.0012305 0.0001221 10.08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared: 0.6876, Adjusted R-squared: 0.686
## F-statistic: 428 on 2 and 389 DF, p-value: < 2.2e-16
df_quad <- data.frame(
  fitted    = fitted(model_quad),
  residuals = residuals(model_quad)
)
ggplot(df_quad, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.4, color = "#1a9641", size = 1.8) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "black", linewidth = 0.8) +
  geom_smooth(method = "loess", se = FALSE, color = "#08306b", linewidth = 1.2) +
  labs(
    title = "Residual vs. Fitted - Quadratic Model",
    subtitle = "Residuals scatter randomly: nonlinearity captured",
    x = "Fitted Values",
    y = "Residuals"
  ) +
  theme_minimal(base_size = 13)
Interpretation: The smooth curve now stays close to zero with no systematic pattern. The residuals scatter randomly above and below the dashed line. The quadratic term has successfully absorbed the curvature.
p_lin <- ggplot(df_linear, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.35, color = "#d73027", size = 1.5) +
  geom_hline(yintercept = 0, linetype = "dashed", linewidth = 0.7) +
  geom_smooth(method = "loess", se = FALSE, color = "#08306b", linewidth = 1.1) +
  labs(title = "Linear Model",
       subtitle = "U-shape: nonlinearity present",
       x = "Fitted Values", y = "Residuals") +
  theme_minimal(base_size = 12)
p_quad <- ggplot(df_quad, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.35, color = "#1a9641", size = 1.5) +
  geom_hline(yintercept = 0, linetype = "dashed", linewidth = 0.7) +
  geom_smooth(method = "loess", se = FALSE, color = "#08306b", linewidth = 1.1) +
  labs(title = "Quadratic Model",
       subtitle = "Random scatter: good fit",
       x = "Fitted Values", y = "Residuals") +
  theme_minimal(base_size = 12)
grid.arrange(p_lin, p_quad, ncol = 2)
rsq_lin <- round(summary(model_linear)$r.squared, 4)
rsq_quad <- round(summary(model_quad)$r.squared, 4)
sse_lin <- round(sum(residuals(model_linear)^2), 2)
sse_quad <- round(sum(residuals(model_quad)^2), 2)
cat("=== Linear Model ===\n")
## === Linear Model ===
## R-squared = 0.6059
## SSE = 9385.92
## === Quadratic Model ===
## R-squared = 0.6876
## SSE = 7442.03
| Diagnostic | Linear Model | Quadratic Model |
|---|---|---|
| Residual pattern | U-shape (bad) | Random scatter (good) |
| Nonlinearity missed | Yes | No |
| R-squared | 0.6059 | 0.6876 |
| SSE | 9385.92 | 7442.03 |
| Verdict | Insufficient | Better fit |
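The comparison in the table can be backed by a formal test. Because the linear model is nested inside the quadratic one, a partial F-test applies. The sketch below is one possible check rather than part of the original analysis; it refits both models so that it runs on its own, and assumes the same object names as above.

```r
library(ISLR)   # provides the Auto data frame
data("Auto")

model_linear <- lm(mpg ~ horsepower, data = Auto)
model_quad   <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)

# Nested-model F-test: the null hypothesis is that the coefficient
# on horsepower^2 is zero, i.e. the straight line is sufficient.
comparison <- anova(model_linear, model_quad)
print(comparison)

# The RSS column holds the residual sums of squares (SSE) of the two models.
round(comparison$RSS, 2)
```

With a single added term, the F statistic equals the square of that term's t value in the quadratic model summary, so a small p-value here agrees with the significance reported for I(horsepower^2).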
The Residual vs. Fitted plot is a powerful and simple diagnostic for detecting nonlinearity in regression models.
The relationship between mpg and horsepower is better described by a curve than a straight line.
Rule of thumb: