Checking Nonlinearity in Auto Data Regression

1. Introduction

We use the Auto dataset from the ISLR package to investigate whether the relationship between mpg (miles per gallon) and horsepower is linear or nonlinear.

The key diagnostic tool is the Residual vs. Fitted Plot:

Pattern	Interpretation
Residuals randomly scattered around 0	✅ Linearity holds
“U”-shape or systematic curvature	❌ Nonlinear trend not captured

All plots use base R graphics only — no ggplot2 required.

2. Load Data

library(ISLR)

data(Auto)
head(Auto[, c("mpg", "horsepower", "weight", "displacement")])

##   mpg horsepower weight displacement
## 1  18        130   3504          307
## 2  15        165   3693          350
## 3  18        150   3436          318
## 4  16        150   3433          304
## 5  17        140   3449          302
## 6  15        198   4341          429

3. Model 1 — Simple Linear Regression

\[\text{mpg} = \beta_0 + \beta_1 \cdot \text{horsepower} + \varepsilon\]

lm1 <- lm(mpg ~ horsepower, data = Auto)
summary(lm1)

## 
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

Residual vs. Fitted Plot (Linear Model)

fit1 <- fitted(lm1)
res1 <- residuals(lm1)

# LOESS smooth
lo1 <- loess(res1 ~ fit1)
lo1_seq <- seq(min(fit1), max(fit1), length.out = 200)
lo1_pred <- predict(lo1, newdata = data.frame(fit1 = lo1_seq), se = TRUE)

plot(fit1, res1,
     pch  = 16, col  = adjustcolor("#2196F3", alpha.f = 0.5),
     xlab = "Fitted Values", ylab = "Residuals",
     main = "Residual vs. Fitted — Linear Model\nmpg ~ horsepower")
abline(h = 0, col = "red", lty = 2, lwd = 1.5)

# LOESS line + confidence band
polygon(
  c(lo1_seq, rev(lo1_seq)),
  c(lo1_pred$fit + 1.96 * lo1_pred$se.fit,
    rev(lo1_pred$fit - 1.96 * lo1_pred$se.fit)),
  col = adjustcolor("#FF5722", alpha.f = 0.15), border = NA
)
lines(lo1_seq, lo1_pred$fit, col = "#FF5722", lwd = 2)

legend("topright",
       legend = c("Residuals", "LOESS smooth", "Zero line"),
       pch    = c(16, NA, NA),
       lty    = c(NA, 1, 2),
       lwd    = c(NA, 2, 1.5),
       col    = c(adjustcolor("#2196F3", 0.5), "#FF5722", "red"),
       bty    = "n")

text(max(fit1) * 0.65, max(res1) * 0.9,
     "U-shape detected: nonlinearity present",
     col = "#FF5722", cex = 0.85, font = 2)

Figure 1: Residual vs. Fitted — Linear Model

Interpretation: The LOESS smooth shows a clear “U”-shape, indicating that the linear model under-predicts at low and high horsepower values but over-predicts in the middle range. The linear specification has failed to capture the nonlinear trend.

4. Model 2 — Polynomial Regression (Degree 2)

\[\text{mpg} = \beta_0 + \beta_1 \cdot \text{horsepower} + \beta_2 \cdot \text{horsepower}^2 + \varepsilon\]

lm2 <- lm(mpg ~ poly(horsepower, 2), data = Auto)
summary(lm2)

## 
## Call:
## lm(formula = mpg ~ poly(horsepower, 2), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.7135  -2.5943  -0.0859   2.2868  15.8961 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            23.4459     0.2209  106.13   <2e-16 ***
## poly(horsepower, 2)1 -120.1377     4.3739  -27.47   <2e-16 ***
## poly(horsepower, 2)2   44.0895     4.3739   10.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared:  0.6876, Adjusted R-squared:  0.686 
## F-statistic:   428 on 2 and 389 DF,  p-value: < 2.2e-16

Residual vs. Fitted Plot (Quadratic Model)

fit2 <- fitted(lm2)
res2 <- residuals(lm2)

lo2      <- loess(res2 ~ fit2)
lo2_seq  <- seq(min(fit2), max(fit2), length.out = 200)
lo2_pred <- predict(lo2, newdata = data.frame(fit2 = lo2_seq), se = TRUE)

plot(fit2, res2,
     pch  = 16, col  = adjustcolor("#4CAF50", alpha.f = 0.5),
     xlab = "Fitted Values", ylab = "Residuals",
     main = "Residual vs. Fitted — Quadratic Model\nmpg ~ poly(horsepower, 2)")
abline(h = 0, col = "red", lty = 2, lwd = 1.5)

polygon(
  c(lo2_seq, rev(lo2_seq)),
  c(lo2_pred$fit + 1.96 * lo2_pred$se.fit,
    rev(lo2_pred$fit - 1.96 * lo2_pred$se.fit)),
  col = adjustcolor("#FF5722", alpha.f = 0.15), border = NA
)
lines(lo2_seq, lo2_pred$fit, col = "#FF5722", lwd = 2)

legend("topright",
       legend = c("Residuals", "LOESS smooth", "Zero line"),
       pch    = c(16, NA, NA),
       lty    = c(NA, 1, 2),
       lwd    = c(NA, 2, 1.5),
       col    = c(adjustcolor("#4CAF50", 0.5), "#FF5722", "red"),
       bty    = "n")

text(max(fit2) * 0.55, max(res2) * 0.9,
     "Residuals randomly scattered",
     col = "#4CAF50", cex = 0.85, font = 2)

Figure 2: Residual vs. Fitted — Quadratic Model

Interpretation: After adding the quadratic term, the LOESS smooth is much flatter and the residuals appear randomly scattered — the nonlinear relationship has been adequately captured.

5. Side-by-Side Comparison

par(mfrow = c(1, 2), mar = c(4, 4, 3, 1))

# --- Left: Linear ---
plot(fit1, res1,
     pch  = 16, col  = adjustcolor("#2196F3", 0.4),
     xlab = "Fitted Values", ylab = "Residuals",
     main = "Linear Model")
abline(h = 0, col = "red", lty = 2, lwd = 1.5)
lines(lo1_seq, lo1_pred$fit, col = "#FF5722", lwd = 2)
text(mean(range(fit1)), max(res1) * 0.85,
     "U-shape: nonlinear", col = "#FF5722", cex = 0.8, font = 2)

# --- Right: Quadratic ---
plot(fit2, res2,
     pch  = 16, col  = adjustcolor("#4CAF50", 0.4),
     xlab = "Fitted Values", ylab = "Residuals",
     main = "Quadratic Model")
abline(h = 0, col = "red", lty = 2, lwd = 1.5)
lines(lo2_seq, lo2_pred$fit, col = "#FF5722", lwd = 2)
text(mean(range(fit2)), max(res2) * 0.85,
     "Flat: linearity holds", col = "#4CAF50", cex = 0.8, font = 2)

Figure 3: Side-by-Side Comparison

par(mfrow = c(1, 1))  # reset layout

6. Model Fit Comparison

cat("=== Linear Model ===\n")

## === Linear Model ===

cat(sprintf("R2      : %.4f\n", summary(lm1)$r.squared))

## R2      : 0.6059

cat(sprintf("Adj. R2 : %.4f\n", summary(lm1)$adj.r.squared))

## Adj. R2 : 0.6049

cat(sprintf("RSE     : %.4f\n", summary(lm1)$sigma))

## RSE     : 4.9058

cat("\n=== Quadratic Model ===\n")

## 
## === Quadratic Model ===

cat(sprintf("R2      : %.4f\n", summary(lm2)$r.squared))

## R2      : 0.6876

cat(sprintf("Adj. R2 : %.4f\n", summary(lm2)$adj.r.squared))

## Adj. R2 : 0.6860

cat(sprintf("RSE     : %.4f\n", summary(lm2)$sigma))

## RSE     : 4.3739

cat("\n=== ANOVA (Linear vs Quadratic) ===\n")

## 
## === ANOVA (Linear vs Quadratic) ===

print(anova(lm1, lm2))

## Analysis of Variance Table
## 
## Model 1: mpg ~ horsepower
## Model 2: mpg ~ poly(horsepower, 2)
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    390 9385.9                                  
## 2    389 7442.0  1    1943.9 101.61 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

7. Conclusion

Diagnostic	Linear Model	Quadratic Model
Residual pattern	❌ Clear “U”-shape	✅ Randomly scattered
R²	~0.606	~0.688
ANOVA p-value	—	< 0.001 (significant gain)

The Residual vs. Fitted plot is a powerful tool for detecting nonlinearity. The simple linear model shows an unmistakable U-shaped pattern, revealing that mpg and horsepower have a nonlinear (quadratic) relationship. Adding the polynomial term resolves the pattern and yields a significantly better fit.

Ready to publish: Knit to HTML in RStudio, then click Publish -> RPubs.