Detecting Nonlinearity: Regression on Auto Data

Overview

We use the Auto dataset from the ISLR2 package to investigate the relationship between horsepower and mpg. We fit:

A linear model
A quadratic (polynomial) model

Then we inspect Residual vs. Fitted plots to check whether a nonlinear relationship exists.

Load Data

# Install ISLR2 if needed
# install.packages("ISLR2")
library(ISLR2)
library(ggplot2)

data(Auto)
head(Auto[, c("mpg", "horsepower", "weight", "cylinders")])

##   mpg horsepower weight cylinders
## 1  18        130   3504         8
## 2  15        165   3693         8
## 3  18        150   3436         8
## 4  16        150   3433         8
## 5  17        140   3449         8
## 6  15        198   4341         8

Exploratory Plot

ggplot(Auto, aes(x = horsepower, y = mpg)) +
  geom_point(alpha = 0.5, color = "#2c7bb6") +
  geom_smooth(method = "lm",  color = "red",    se = FALSE, linetype = "dashed", linewidth = 1) +
  geom_smooth(method = "loess", color = "#1a9641", se = FALSE, linewidth = 1) +
  labs(
    title    = "MPG vs Horsepower",
    subtitle = "Red dashed = linear fit | Green = LOESS (true trend)",
    x        = "Horsepower",
    y        = "Miles Per Gallon (mpg)"
  ) +
  theme_minimal(base_size = 13)

Observation: The LOESS curve bends — suggesting a nonlinear (curved) relationship between horsepower and mpg.

Model 1 — Linear Regression

lm1 <- lm(mpg ~ horsepower, data = Auto)
summary(lm1)

## 
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

Residual vs. Fitted Plot (Linear Model)

# Extract diagnostics
lm1_diag <- data.frame(
  fitted    = fitted(lm1),
  residuals = residuals(lm1)
)

ggplot(lm1_diag, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.4, color = "#d7191c") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "black", linewidth = 0.8) +
  geom_smooth(method = "loess", color = "#d7191c", se = TRUE, linewidth = 1.2) +
  labs(
    title    = "Residual vs. Fitted — Linear Model",
    subtitle = "⚠ U-shaped pattern → nonlinearity NOT captured",
    x        = "Fitted Values",
    y        = "Residuals"
  ) +
  theme_minimal(base_size = 13)

Interpretation: The residuals show a clear U-shaped (∪) curve — the model systematically over-predicts at low and high horsepower, and under-predicts in the middle. This is a classic signal that the linear model fails to capture the true nonlinear relationship.

Model 2 — Polynomial (Quadratic) Regression

lm2 <- lm(mpg ~ poly(horsepower, 2), data = Auto)
summary(lm2)

## 
## Call:
## lm(formula = mpg ~ poly(horsepower, 2), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.7135  -2.5943  -0.0859   2.2868  15.8961 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            23.4459     0.2209  106.13   <2e-16 ***
## poly(horsepower, 2)1 -120.1377     4.3739  -27.47   <2e-16 ***
## poly(horsepower, 2)2   44.0895     4.3739   10.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared:  0.6876, Adjusted R-squared:  0.686 
## F-statistic:   428 on 2 and 389 DF,  p-value: < 2.2e-16

Residual vs. Fitted Plot (Quadratic Model)

lm2_diag <- data.frame(
  fitted    = fitted(lm2),
  residuals = residuals(lm2)
)

ggplot(lm2_diag, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.4, color = "#1a9641") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "black", linewidth = 0.8) +
  geom_smooth(method = "loess", color = "#1a9641", se = TRUE, linewidth = 1.2) +
  labs(
    title    = "Residual vs. Fitted — Quadratic Model",
    subtitle = "✓ Residuals randomly scattered around zero — nonlinearity captured",
    x        = "Fitted Values",
    y        = "Residuals"
  ) +
  theme_minimal(base_size = 13)

Interpretation: The residuals are now randomly scattered around the zero line with no systematic pattern. This confirms the quadratic term successfully captures the nonlinear relationship.

Side-by-Side Comparison

library(gridExtra)  # comes with ggplot2 installation

p1 <- ggplot(lm1_diag, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.4, color = "#d7191c") +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_smooth(method = "loess", color = "#d7191c", se = FALSE, linewidth = 1.2) +
  labs(title = "Linear Model", subtitle = "U-shape = nonlinearity missed",
       x = "Fitted Values", y = "Residuals") +
  theme_minimal(base_size = 12)

p2 <- ggplot(lm2_diag, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.4, color = "#1a9641") +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_smooth(method = "loess", color = "#1a9641", se = FALSE, linewidth = 1.2) +
  labs(title = "Quadratic Model", subtitle = "Random scatter = good fit",
       x = "Fitted Values", y = "Residuals") +
  theme_minimal(base_size = 12)

grid.arrange(p1, p2, ncol = 2)

Summary Table

summary_df <- data.frame(
  Model       = c("Linear",    "Quadratic"),
  R_Squared   = c(summary(lm1)$r.squared, summary(lm2)$r.squared),
  Adj_R2      = c(summary(lm1)$adj.r.squared, summary(lm2)$adj.r.squared),
  Residual_SE = c(summary(lm1)$sigma, summary(lm2)$sigma),
  Pattern     = c("U-shape (nonlinearity)", "Random (good)")
)

knitr::kable(summary_df, digits = 4,
             col.names = c("Model", "R²", "Adj. R²", "Residual SE", "Residual Pattern"),
             caption   = "Model Comparison: Linear vs. Quadratic")

Model Comparison: Linear vs. Quadratic
Model	R²	Adj. R²	Residual SE	Residual Pattern
Linear	0.6059	0.6049	4.9058	U-shape (nonlinearity)
Quadratic	0.6876	0.6860	4.3739	Random (good)

Key Takeaways

Diagnostic Rule	What You See	Conclusion
Linear model residuals	U-shaped curve	Nonlinearity not captured
Quadratic model residuals	Random scatter around 0	Nonlinearity captured ✓

A U-shape (or inverted U) in the Residual vs. Fitted plot is the strongest visual signal that your model is misspecified — it’s missing a nonlinear term.
Adding poly(horsepower, 2) fixes this, as confirmed by the flat residual plot and improved R².