1. Background

We examine whether mpg (miles per gallon) has a linear or nonlinear relationship with horsepower in the Auto dataset from the ISLR package.

Key diagnostic tool: the Residual vs. Fitted plot

Pattern Interpretation
Random scatter around zero Model captures the relationship
U-shape / inverted-U / systematic curve Nonlinearity missed by the model

2. Load Data

library(ISLR)
library(ggplot2)
library(gridExtra)

data("Auto")
cat("Observations:", nrow(Auto), "\n")
## Observations: 392
cat("Variables:   ", ncol(Auto), "\n")
## Variables:    9
head(Auto[, c("mpg", "horsepower", "cylinders", "weight")], 5)
##   mpg horsepower cylinders weight
## 1  18        130         8   3504
## 2  15        165         8   3693
## 3  18        150         8   3436
## 4  16        150         8   3433
## 5  17        140         8   3449

3. Exploratory Plot

ggplot(Auto, aes(x = horsepower, y = mpg)) +
  geom_point(alpha = 0.45, color = "#2166ac", size = 1.8) +
  geom_smooth(method = "lm",    se = TRUE,
              color = "#d73027", linetype = "dashed", linewidth = 1) +
  geom_smooth(method = "loess", se = TRUE,
              color = "#1a9641", linewidth = 1) +
  labs(
    title    = "MPG vs Horsepower - Auto dataset",
    subtitle = "Red dashed = Linear fit | Green solid = LOESS (flexible) fit",
    x = "Horsepower",
    y = "Miles Per Gallon (mpg)"
  ) +
  theme_minimal(base_size = 13)

Observation: The flexible LOESS curve (green) clearly bends away from the straight red dashed line. This is the first visual hint that the relationship between mpg and horsepower is nonlinear.


4. Model 1: Simple Linear Regression

We first fit a straight-line model:

\[\text{mpg} = \beta_0 + \beta_1 \cdot \text{horsepower} + \varepsilon\]

model_linear <- lm(mpg ~ horsepower, data = Auto)
summary(model_linear)
## 
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

Residual vs. Fitted Plot — Linear Model

df_linear <- data.frame(
  fitted    = fitted(model_linear),
  residuals = residuals(model_linear)
)

ggplot(df_linear, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.4, color = "#d73027", size = 1.8) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "black", linewidth = 0.8) +
  geom_smooth(method = "loess", se = FALSE, color = "#08306b", linewidth = 1.2) +
  labs(
    title    = "Residual vs. Fitted - Linear Model",
    subtitle = "WARNING: Clear U-shape means nonlinearity is NOT captured",
    x = "Fitted Values",
    y = "Residuals"
  ) +
  theme_minimal(base_size = 13)

Interpretation: The smooth blue curve forms a distinct U-shape. The model over-predicts mpg at low and high horsepower values, and under-predicts in the middle. This systematic pattern reveals that the linear model has failed to capture the nonlinear trend in the data.


5. Model 2: Quadratic (Polynomial) Regression

We add a squared term to model the curvature:

\[\text{mpg} = \beta_0 + \beta_1 \cdot \text{horsepower} + \beta_2 \cdot \text{horsepower}^2 + \varepsilon\]

model_quad <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)
summary(model_quad)
## 
## Call:
## lm(formula = mpg ~ horsepower + I(horsepower^2), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.7135  -2.5943  -0.0859   2.2868  15.8961 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     56.9000997  1.8004268   31.60   <2e-16 ***
## horsepower      -0.4661896  0.0311246  -14.98   <2e-16 ***
## I(horsepower^2)  0.0012305  0.0001221   10.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared:  0.6876, Adjusted R-squared:  0.686 
## F-statistic:   428 on 2 and 389 DF,  p-value: < 2.2e-16

Residual vs. Fitted Plot — Quadratic Model

df_quad <- data.frame(
  fitted    = fitted(model_quad),
  residuals = residuals(model_quad)
)

ggplot(df_quad, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.4, color = "#1a9641", size = 1.8) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "black", linewidth = 0.8) +
  geom_smooth(method = "loess", se = FALSE, color = "#08306b", linewidth = 1.2) +
  labs(
    title    = "Residual vs. Fitted - Quadratic Model",
    subtitle = "Residuals scatter randomly: nonlinearity captured",
    x = "Fitted Values",
    y = "Residuals"
  ) +
  theme_minimal(base_size = 13)

Interpretation: The smooth curve now stays close to zero with no systematic pattern. The residuals scatter randomly above and below the dashed line. The quadratic term has successfully absorbed the curvature.


6. Side-by-Side Comparison

p_lin <- ggplot(df_linear, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.35, color = "#d73027", size = 1.5) +
  geom_hline(yintercept = 0, linetype = "dashed", linewidth = 0.7) +
  geom_smooth(method = "loess", se = FALSE, color = "#08306b", linewidth = 1.1) +
  labs(title    = "Linear Model",
       subtitle = "U-shape: nonlinearity present",
       x = "Fitted Values", y = "Residuals") +
  theme_minimal(base_size = 12)

p_quad <- ggplot(df_quad, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.35, color = "#1a9641", size = 1.5) +
  geom_hline(yintercept = 0, linetype = "dashed", linewidth = 0.7) +
  geom_smooth(method = "loess", se = FALSE, color = "#08306b", linewidth = 1.1) +
  labs(title    = "Quadratic Model",
       subtitle = "Random scatter: good fit",
       x = "Fitted Values", y = "Residuals") +
  theme_minimal(base_size = 12)

grid.arrange(p_lin, p_quad, ncol = 2)


7. Model Comparison

rsq_lin  <- round(summary(model_linear)$r.squared, 4)
rsq_quad <- round(summary(model_quad)$r.squared,   4)
sse_lin  <- round(sum(residuals(model_linear)^2),  2)
sse_quad <- round(sum(residuals(model_quad)^2),    2)

cat("=== Linear Model ===\n")
## === Linear Model ===
cat("  R-squared =", rsq_lin,  "\n")
##   R-squared = 0.6059
cat("  SSE       =", sse_lin,  "\n\n")
##   SSE       = 9385.92
cat("=== Quadratic Model ===\n")
## === Quadratic Model ===
cat("  R-squared =", rsq_quad, "\n")
##   R-squared = 0.6876
cat("  SSE       =", sse_quad, "\n")
##   SSE       = 7442.03

8. Summary and Conclusion

Diagnostic Linear Model Quadratic Model
Residual pattern U-shape (bad) Random scatter (good)
Nonlinearity missed Yes No
R-squared 0.6059 0.6876
SSE 9385.92 7442.03
Verdict Insufficient Better fit

Key Takeaway

The Residual vs. Fitted plot is a powerful and simple diagnostic for detecting nonlinearity in regression models.

  • A U-shape or systematic curve in the residuals means the model is missing a nonlinear trend — the errors are structured, not random.
  • Adding a polynomial (quadratic) term flattens the residual pattern, reduces SSE, and raises R-squared — confirming the relationship between mpg and horsepower is better described by a curve than a straight line.

Rule of thumb:

  • If your residuals look random around zero → the model is working well.
  • If you see a U-shape, arch, or wave → try adding a polynomial or log transformation.