Overview

We use the Auto dataset from the ISLR package to examine whether the relationship between horsepower and mpg is linear or nonlinear. The key diagnostic tool is the Residual vs. Fitted plot.


Load Data

library(ISLR)
library(ggplot2)

data(Auto)
head(Auto[, c("mpg", "horsepower")])
##   mpg horsepower
## 1  18        130
## 2  15        165
## 3  18        150
## 4  16        150
## 5  17        140
## 6  15        198

1. Linear Model: mpg ~ horsepower

lm1 <- lm(mpg ~ horsepower, data = Auto)
summary(lm1)
## 
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16
ggplot(Auto, aes(x = horsepower, y = mpg)) +
  geom_point(alpha = 0.4, color = "#378ADD") +
  geom_smooth(method = "lm", se = TRUE, color = "#E24B4A") +
  labs(title = "Linear fit: mpg ~ horsepower",
       x = "Horsepower", y = "MPG") +
  theme_minimal()
Scatter plot with linear fit

Scatter plot with linear fit

plot(lm1, which = 1,
     main = "Residuals vs Fitted — Linear Model",
     sub = "")
Residual vs. Fitted — Linear model

Residual vs. Fitted — Linear model

Interpretation: The residual plot shows a clear U-shape. Residuals are positive at low and high fitted values, and negative in the middle. This pattern confirms the linear model fails to capture a nonlinear trend.


2. Quadratic Model: mpg ~ horsepower + horsepower²

lm2 <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)
summary(lm2)
## 
## Call:
## lm(formula = mpg ~ horsepower + I(horsepower^2), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.7135  -2.5943  -0.0859   2.2868  15.8961 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     56.9000997  1.8004268   31.60   <2e-16 ***
## horsepower      -0.4661896  0.0311246  -14.98   <2e-16 ***
## I(horsepower^2)  0.0012305  0.0001221   10.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared:  0.6876, Adjusted R-squared:  0.686 
## F-statistic:   428 on 2 and 389 DF,  p-value: < 2.2e-16
ggplot(Auto, aes(x = horsepower, y = mpg)) +
  geom_point(alpha = 0.4, color = "#378ADD") +
  geom_smooth(method = "lm", formula = y ~ x + I(x^2),
              se = TRUE, color = "#1D9E75") +
  labs(title = "Quadratic fit: mpg ~ hp + hp²",
       x = "Horsepower", y = "MPG") +
  theme_minimal()
Scatter plot with quadratic fit

Scatter plot with quadratic fit

plot(lm2, which = 1,
     main = "Residuals vs Fitted — Quadratic Model",
     sub = "")
Residual vs. Fitted — Quadratic model

Residual vs. Fitted — Quadratic model

Interpretation: The U-shape largely disappears. Residuals are now randomly scattered around zero, indicating the quadratic term successfully captures the nonlinear (diminishing returns) relationship.


3. Side-by-Side Comparison

par(mfrow = c(1, 2))
plot(lm1, which = 1, main = "Linear model", sub = "")
plot(lm2, which = 1, main = "Quadratic model", sub = "")

par(mfrow = c(1, 1))

4. Model Comparison

data.frame(
  Model     = c("Linear", "Quadratic"),
  R_squared = c(summary(lm1)$r.squared, summary(lm2)$r.squared),
  Adj_R2    = c(summary(lm1)$adj.r.squared, summary(lm2)$adj.r.squared),
  RMSE      = c(
    sqrt(mean(residuals(lm1)^2)),
    sqrt(mean(residuals(lm2)^2))
  )
)
##       Model R_squared    Adj_R2     RMSE
## 1    Linear 0.6059483 0.6049379 4.893226
## 2 Quadratic 0.6875590 0.6859527 4.357151

Conclusion

Model Residual plot Verdict
Linear U-shaped curvature ~0.60 ❌ Nonlinearity missed
Quadratic Random scatter ~0.69 ✅ Nonlinearity captured

The residual vs. fitted plot is the key diagnostic: a systematic curve (U-shape) in the linear model’s residuals signals that a higher-order term is needed. Adding horsepower² removes the pattern and substantially improves fit.