Regression on Auto Data: Detecting Nonlinearity via Residual vs. Fitted Plots

1. Overview

This report uses the Auto dataset (from the ISLR package) to demonstrate:

A simple linear regression of mpg on horsepower
A polynomial regression (quadratic) of mpg on horsepower
Residual vs. Fitted plots to diagnose whether a linear model adequately captures the relationship, or whether a nonlinear term is needed.

Key Diagnostic Rule
- Linear fit is adequate: residuals scatter randomly around 0 — no pattern.
- Nonlinearity present: residuals form a curved pattern (U-shape or inverted-U), indicating the model is systematically wrong.

2. Load Packages & Data

# Install ISLR if needed: install.packages("ISLR")
library(ISLR)      # Auto dataset
library(ggplot2)   # Plotting
library(dplyr)     # Data wrangling
library(gridExtra) # Arrange multiple plots
library(broom)     # Tidy model outputs

# Load the Auto dataset
data(Auto)

# Quick look
glimpse(Auto)

#> Rows: 392
#> Columns: 9
#> $ mpg          <dbl> 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14, 2…
#> $ cylinders    <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, 4, …
#> $ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383, 34…
#> $ horsepower   <dbl> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170, 16…
#> $ weight       <dbl> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, 385…
#> $ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8.5, …
#> $ year         <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 7…
#> $ origin       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 3, …
#> $ name         <fct> chevrolet chevelle malibu, buick skylark 320, plymouth sa…

3. Exploratory Data Analysis

3.1 Scatter Plot: mpg vs. horsepower

ggplot(Auto, aes(x = horsepower, y = mpg)) +
  geom_point(alpha = 0.5, color = "steelblue", size = 2) +
  geom_smooth(method = "lm",  formula = y ~ x,
              aes(color = "Linear"),    se = TRUE, linewidth = 1) +
  geom_smooth(method = "lm",  formula = y ~ poly(x, 2),
              aes(color = "Quadratic"), se = TRUE, linewidth = 1) +
  scale_color_manual(name = "Fit",
                     values = c("Linear" = "firebrick", "Quadratic" = "darkgreen")) +
  labs(
    title    = "MPG vs. Horsepower",
    subtitle = "Auto Dataset (n = 392)",
    x        = "Horsepower",
    y        = "Miles Per Gallon (MPG)"
  ) +
  theme_bw(base_size = 13)

Figure 1: mpg vs. horsepower with linear and quadratic fits.

Observation: The scatter plot already hints at a curved (nonlinear) relationship between mpg and horsepower. The quadratic fit (green) tracks the data better than the straight line (red).

4. Model Fitting

4.1 Model 1: Simple Linear Regression

\[\text{mpg} = \beta_0 + \beta_1 \cdot \text{horsepower} + \varepsilon\]

lm1 <- lm(mpg ~ horsepower, data = Auto)
summary(lm1)

#> 
#> Call:
#> lm(formula = mpg ~ horsepower, data = Auto)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -13.5710  -3.2592  -0.3435   2.7630  16.9240 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
#> horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 4.906 on 390 degrees of freedom
#> Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
#> F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

4.2 Model 2: Quadratic (Polynomial) Regression

\[\text{mpg} = \beta_0 + \beta_1 \cdot \text{horsepower} + \beta_2 \cdot \text{horsepower}^2 + \varepsilon\]

lm2 <- lm(mpg ~ poly(horsepower, 2, raw = TRUE), data = Auto)
summary(lm2)

#> 
#> Call:
#> lm(formula = mpg ~ poly(horsepower, 2, raw = TRUE), data = Auto)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -14.7135  -2.5943  -0.0859   2.2868  15.8961 
#> 
#> Coefficients:
#>                                    Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)                      56.9000997  1.8004268   31.60   <2e-16 ***
#> poly(horsepower, 2, raw = TRUE)1 -0.4661896  0.0311246  -14.98   <2e-16 ***
#> poly(horsepower, 2, raw = TRUE)2  0.0012305  0.0001221   10.08   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 4.374 on 389 degrees of freedom
#> Multiple R-squared:  0.6876, Adjusted R-squared:  0.686 
#> F-statistic:   428 on 2 and 389 DF,  p-value: < 2.2e-16

4.3 Model Comparison

# Compare R-squared and RSE
tibble(
  Model        = c("Linear", "Quadratic"),
  R_squared    = c(summary(lm1)$r.squared,  summary(lm2)$r.squared),
  Adj_R_sq     = c(summary(lm1)$adj.r.squared, summary(lm2)$adj.r.squared),
  RSE          = c(summary(lm1)$sigma, summary(lm2)$sigma),
  AIC          = c(AIC(lm1), AIC(lm2))
) %>%
  mutate(across(where(is.numeric), ~ round(.x, 4))) %>%
  knitr::kable(caption = "Table 1: Model Comparison Summary")

Table 1: Model Comparison Summary
Model	R_squared	Adj_R_sq	RSE	AIC
Linear	0.6059	0.6049	4.9058	2363.324
Quadratic	0.6876	0.6860	4.3739	2274.354

5. Residual vs. Fitted Plots — Diagnosing Nonlinearity

This is the key diagnostic section. We plot residuals on the y-axis against fitted (predicted) values on the x-axis for both models.

# ── Augment models with residuals & fitted values ──────────────────────────
aug1 <- augment(lm1)  # Linear
aug2 <- augment(lm2)  # Quadratic

# ── Helper: loess smooth line colour ───────────────────────────────────────
smooth_col <- "firebrick"

# ── Plot 1: Linear model ───────────────────────────────────────────────────
p1 <- ggplot(aug1, aes(x = .fitted, y = .resid)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "grey40", linewidth = 0.8) +
  geom_point(alpha = 0.45, color = "steelblue", size = 1.8) +
  geom_smooth(method = "loess", formula = y ~ x,
              se = FALSE, color = smooth_col, linewidth = 1.2) +
  annotate("text", x = Inf, y = Inf,
           label = "⚠ U-shaped curve → nonlinearity detected",
           hjust = 1.05, vjust = 1.5, color = "firebrick", size = 4, fontface = "bold") +
  labs(
    title    = "Model 1: Linear Regression — Residual vs. Fitted",
    subtitle = "mpg ~ horsepower",
    x        = "Fitted Values (Predicted MPG)",
    y        = "Residuals"
  ) +
  theme_bw(base_size = 12)

# ── Plot 2: Quadratic model ────────────────────────────────────────────────
p2 <- ggplot(aug2, aes(x = .fitted, y = .resid)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "grey40", linewidth = 0.8) +
  geom_point(alpha = 0.45, color = "steelblue", size = 1.8) +
  geom_smooth(method = "loess", formula = y ~ x,
              se = FALSE, color = "darkgreen", linewidth = 1.2) +
  annotate("text", x = Inf, y = Inf,
           label = "✔ Residuals randomly scattered → nonlinearity resolved",
           hjust = 1.05, vjust = 1.5, color = "darkgreen", size = 4, fontface = "bold") +
  labs(
    title    = "Model 2: Quadratic Regression — Residual vs. Fitted",
    subtitle = "mpg ~ horsepower + horsepower²",
    x        = "Fitted Values (Predicted MPG)",
    y        = "Residuals"
  ) +
  theme_bw(base_size = 12)

# ── Arrange side-by-side ───────────────────────────────────────────────────
grid.arrange(p1, p2, nrow = 2)

Figure 2: Residual vs. Fitted plots for the linear (top) and quadratic (bottom) models.

6. Interpretation

6.1 Linear Model — Residual Plot (Top)

Feature	Observation
Pattern	Clear inverted-U / arch shape (LOESS curve bows upward then downward)
Meaning	The linear model under-predicts MPG for low and high horsepower, and over-predicts in the middle
Diagnosis	Nonlinearity is present — a straight line cannot capture the true relationship

6.2 Quadratic Model — Residual Plot (Bottom)

Feature	Observation
Pattern	Residuals are randomly scattered around the zero line
LOESS curve	Nearly flat, hugging zero — no systematic curvature
Diagnosis	Nonlinearity resolved — adding `horsepower²` captures the curved relationship

7. Additional Diagnostic Plots (Base R)

R’s built-in plot() for lm objects provides four panels. The first panel is the Residual vs. Fitted plot; the third (Scale-Location) is also useful for heteroscedasticity.

par(mfrow = c(2, 2))

# Linear model — first diagnostic panel only
plot(lm1, which = 1,
     main = "Linear Model: Residuals vs Fitted",
     col  = adjustcolor("steelblue", 0.5), pch = 16)

# Quadratic model — first diagnostic panel only
plot(lm2, which = 1,
     main = "Quadratic Model: Residuals vs Fitted",
     col  = adjustcolor("darkgreen", 0.5), pch = 16)

# Scale-Location for both
plot(lm1, which = 3, main = "Linear: Scale-Location",
     col = adjustcolor("steelblue", 0.5), pch = 16)
plot(lm2, which = 3, main = "Quadratic: Scale-Location",
     col = adjustcolor("darkgreen", 0.5), pch = 16)

Figure 3: Base R diagnostic plots for both models.

par(mfrow = c(1, 1))

8. Formal Test for Nonlinearity (ANOVA)

We can formally test whether the quadratic term significantly improves the model.

anova(lm1, lm2)

#> Analysis of Variance Table
#> 
#> Model 1: mpg ~ horsepower
#> Model 2: mpg ~ poly(horsepower, 2, raw = TRUE)
#>   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
#> 1    390 9385.9                                  
#> 2    389 7442.0  1    1943.9 101.61 < 2.2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation: A very small p-value (< 0.05) confirms that the quadratic term significantly reduces residual error — the nonlinear model fits significantly better than the linear one.

9. Summary & Conclusions

Table 2: Step-by-step workflow and key findings
Step	Action	Finding
1	Fit linear model: mpg ~ horsepower	R² ≈ 0.606; statistically significant slope
2	Plot Residuals vs. Fitted (linear)	Arch/U-shape in residuals → nonlinearity detected
3	Fit quadratic model: mpg ~ horsepower + horsepower²	R² ≈ 0.688; both terms significant; lower AIC
4	Plot Residuals vs. Fitted (quadratic)	Residuals randomly scattered → nonlinearity resolved

Key takeaways:

The Residual vs. Fitted plot is the primary visual tool for detecting nonlinearity.
A U-shape or arch pattern in the residuals signals the model is missing a curve.
Adding a quadratic (polynomial) term corrects the nonlinearity and flattens the residual pattern.
The ANOVA F-test confirms the improvement is statistically significant.

Generated with R Markdown · ISLR Auto dataset · March 16, 2026