Detecting Nonlinearity in Auto Data

1. Background

We examine whether mpg (miles per gallon) has a linear or nonlinear relationship with horsepower in the Auto dataset from the ISLR package.

Key diagnostic tool: the Residual vs. Fitted plot

Pattern	Interpretation
Random scatter around zero	Model captures the relationship
U-shape / inverted-U / systematic curve	Nonlinearity missed by the model

2. Load Data

library(ISLR)
library(ggplot2)
library(gridExtra)

data("Auto")
cat("Observations:", nrow(Auto), "\n")

## Observations: 392

cat("Variables:   ", ncol(Auto), "\n")

## Variables:    9

head(Auto[, c("mpg", "horsepower", "cylinders", "weight")], 5)

##   mpg horsepower cylinders weight
## 1  18        130         8   3504
## 2  15        165         8   3693
## 3  18        150         8   3436
## 4  16        150         8   3433
## 5  17        140         8   3449

3. Exploratory Plot

ggplot(Auto, aes(x = horsepower, y = mpg)) +
  geom_point(alpha = 0.45, color = "#2166ac", size = 1.8) +
  geom_smooth(method = "lm",    se = TRUE,
              color = "#d73027", linetype = "dashed", linewidth = 1) +
  geom_smooth(method = "loess", se = TRUE,
              color = "#1a9641", linewidth = 1) +
  labs(
    title    = "MPG vs Horsepower - Auto dataset",
    subtitle = "Red dashed = Linear fit | Green solid = LOESS (flexible) fit",
    x = "Horsepower",
    y = "Miles Per Gallon (mpg)"
  ) +
  theme_minimal(base_size = 13)

Observation: The flexible LOESS curve (green) clearly bends away from the straight red dashed line. This is the first visual hint that the relationship between mpg and horsepower is nonlinear.

4. Model 1: Simple Linear Regression

We first fit a straight-line model:

\[\text{mpg} = \beta_0 + \beta_1 \cdot \text{horsepower} + \varepsilon\]

model_linear <- lm(mpg ~ horsepower, data = Auto)
summary(model_linear)

## 
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

Residual vs. Fitted Plot — Linear Model

df_linear <- data.frame(
  fitted    = fitted(model_linear),
  residuals = residuals(model_linear)
)

ggplot(df_linear, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.4, color = "#d73027", size = 1.8) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "black", linewidth = 0.8) +
  geom_smooth(method = "loess", se = FALSE, color = "#08306b", linewidth = 1.2) +
  labs(
    title    = "Residual vs. Fitted - Linear Model",
    subtitle = "WARNING: Clear U-shape means nonlinearity is NOT captured",
    x = "Fitted Values",
    y = "Residuals"
  ) +
  theme_minimal(base_size = 13)

Interpretation: The smooth blue curve forms a distinct U-shape. The model over-predicts mpg at low and high horsepower values, and under-predicts in the middle. This systematic pattern reveals that the linear model has failed to capture the nonlinear trend in the data.

5. Model 2: Quadratic (Polynomial) Regression

We add a squared term to model the curvature:

\[\text{mpg} = \beta_0 + \beta_1 \cdot \text{horsepower} + \beta_2 \cdot \text{horsepower}^2 + \varepsilon\]

model_quad <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)
summary(model_quad)

## 
## Call:
## lm(formula = mpg ~ horsepower + I(horsepower^2), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.7135  -2.5943  -0.0859   2.2868  15.8961 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     56.9000997  1.8004268   31.60   <2e-16 ***
## horsepower      -0.4661896  0.0311246  -14.98   <2e-16 ***
## I(horsepower^2)  0.0012305  0.0001221   10.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared:  0.6876, Adjusted R-squared:  0.686 
## F-statistic:   428 on 2 and 389 DF,  p-value: < 2.2e-16

Residual vs. Fitted Plot — Quadratic Model

df_quad <- data.frame(
  fitted    = fitted(model_quad),
  residuals = residuals(model_quad)
)

ggplot(df_quad, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.4, color = "#1a9641", size = 1.8) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "black", linewidth = 0.8) +
  geom_smooth(method = "loess", se = FALSE, color = "#08306b", linewidth = 1.2) +
  labs(
    title    = "Residual vs. Fitted - Quadratic Model",
    subtitle = "Residuals scatter randomly: nonlinearity captured",
    x = "Fitted Values",
    y = "Residuals"
  ) +
  theme_minimal(base_size = 13)

Interpretation: The smooth curve now stays close to zero with no systematic pattern. The residuals scatter randomly above and below the dashed line. The quadratic term has successfully absorbed the curvature.

6. Side-by-Side Comparison

p_lin <- ggplot(df_linear, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.35, color = "#d73027", size = 1.5) +
  geom_hline(yintercept = 0, linetype = "dashed", linewidth = 0.7) +
  geom_smooth(method = "loess", se = FALSE, color = "#08306b", linewidth = 1.1) +
  labs(title    = "Linear Model",
       subtitle = "U-shape: nonlinearity present",
       x = "Fitted Values", y = "Residuals") +
  theme_minimal(base_size = 12)

p_quad <- ggplot(df_quad, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.35, color = "#1a9641", size = 1.5) +
  geom_hline(yintercept = 0, linetype = "dashed", linewidth = 0.7) +
  geom_smooth(method = "loess", se = FALSE, color = "#08306b", linewidth = 1.1) +
  labs(title    = "Quadratic Model",
       subtitle = "Random scatter: good fit",
       x = "Fitted Values", y = "Residuals") +
  theme_minimal(base_size = 12)

grid.arrange(p_lin, p_quad, ncol = 2)

7. Model Comparison

rsq_lin  <- round(summary(model_linear)$r.squared, 4)
rsq_quad <- round(summary(model_quad)$r.squared,   4)
sse_lin  <- round(sum(residuals(model_linear)^2),  2)
sse_quad <- round(sum(residuals(model_quad)^2),    2)

cat("=== Linear Model ===\n")

## === Linear Model ===

cat("  R-squared =", rsq_lin,  "\n")

##   R-squared = 0.6059

cat("  SSE       =", sse_lin,  "\n\n")

##   SSE       = 9385.92

cat("=== Quadratic Model ===\n")

## === Quadratic Model ===

cat("  R-squared =", rsq_quad, "\n")

##   R-squared = 0.6876

cat("  SSE       =", sse_quad, "\n")

##   SSE       = 7442.03

8. Summary and Conclusion

Diagnostic	Linear Model	Quadratic Model
Residual pattern	U-shape (bad)	Random scatter (good)
Nonlinearity missed	Yes	No
R-squared	0.6059	0.6876
SSE	9385.92	7442.03
Verdict	Insufficient	Better fit

Key Takeaway

The Residual vs. Fitted plot is a powerful and simple diagnostic for detecting nonlinearity in regression models.

A U-shape or systematic curve in the residuals means the model is missing a nonlinear trend — the errors are structured, not random.
Adding a polynomial (quadratic) term flattens the residual pattern, reduces SSE, and raises R-squared — confirming the relationship between mpg and horsepower is better described by a curve than a straight line.

Rule of thumb:

If your residuals look random around zero → the model is working well.
If you see a U-shape, arch, or wave → try adding a polynomial or log transformation.