Regression on Auto Data – Checking for Nonlinearity

Overview

We use the Auto dataset from the ISLR package to examine the relationship between horsepower (predictor) and mpg (miles per gallon, response).

We fit two models:

Linear model: mpg ~ horsepower
Polynomial (quadratic) model: mpg ~ horsepower + horsepower²

We then use Residual vs. Fitted plots to judge whether the linear model is sufficient or whether a nonlinear term is needed.

1. Explore the Data

data(Auto)

# Basic summary
cat("Dimensions:", dim(Auto), "\n")

## Dimensions: 392 9

summary(Auto[, c("mpg", "horsepower")])

##       mpg          horsepower   
##  Min.   : 9.00   Min.   : 46.0  
##  1st Qu.:17.00   1st Qu.: 75.0  
##  Median :22.75   Median : 93.5  
##  Mean   :23.45   Mean   :104.5  
##  3rd Qu.:29.00   3rd Qu.:126.0  
##  Max.   :46.60   Max.   :230.0

ggplot(Auto, aes(x = horsepower, y = mpg)) +
  geom_point(alpha = 0.4, colour = "steelblue") +
  geom_smooth(method = "lm",     se = FALSE, colour = "red",       linetype = "dashed", linewidth = 0.8) +
  geom_smooth(method = "loess",  se = FALSE, colour = "darkorange", linewidth = 0.8) +
  labs(title    = "MPG vs Horsepower",
       subtitle = "Red dashed = linear fit  |  Orange = LOESS (flexible) fit",
       x = "Horsepower", y = "MPG") +
  theme_minimal()

The LOESS curve bends noticeably — a strong visual hint that the relationship is nonlinear.

2. Model 1 – Simple Linear Regression

\[\text{mpg} = \beta_0 + \beta_1 \cdot \text{horsepower} + \varepsilon\]

lm1 <- lm(mpg ~ horsepower, data = Auto)
summary(lm1)

## 
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

Residual vs. Fitted Plot – Linear Model

Auto$fitted_lm  <- fitted(lm1)
Auto$resid_lm   <- residuals(lm1)

ggplot(Auto, aes(x = fitted_lm, y = resid_lm)) +
  geom_point(alpha = 0.4, colour = "steelblue") +
  geom_hline(yintercept = 0, colour = "red", linetype = "dashed", linewidth = 0.8) +
  geom_smooth(method = "loess", se = FALSE, colour = "darkorange", linewidth = 0.8) +
  labs(title    = "Model 1 (Linear): Residuals vs. Fitted",
       subtitle = "A clear U-shaped pattern → the linear model misses nonlinearity",
       x = "Fitted Values", y = "Residuals") +
  theme_minimal()

Interpretation: The residuals show a clear inverted-U (arch) shape — residuals are negative at low and high fitted values and positive in the middle. This systematic curvature is strong evidence that the linear model has failed to capture a nonlinear trend in the data.

3. Model 2 – Polynomial (Quadratic) Regression

\[\text{mpg} = \beta_0 + \beta_1 \cdot \text{horsepower} + \beta_2 \cdot \text{horsepower}^2 + \varepsilon\]

lm2 <- lm(mpg ~ poly(horsepower, 2), data = Auto)
summary(lm2)

## 
## Call:
## lm(formula = mpg ~ poly(horsepower, 2), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.7135  -2.5943  -0.0859   2.2868  15.8961 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            23.4459     0.2209  106.13   <2e-16 ***
## poly(horsepower, 2)1 -120.1377     4.3739  -27.47   <2e-16 ***
## poly(horsepower, 2)2   44.0895     4.3739   10.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared:  0.6876, Adjusted R-squared:  0.686 
## F-statistic:   428 on 2 and 389 DF,  p-value: < 2.2e-16

Residual vs. Fitted Plot – Quadratic Model

Auto$fitted_lm2 <- fitted(lm2)
Auto$resid_lm2  <- residuals(lm2)

ggplot(Auto, aes(x = fitted_lm2, y = resid_lm2)) +
  geom_point(alpha = 0.4, colour = "steelblue") +
  geom_hline(yintercept = 0, colour = "red", linetype = "dashed", linewidth = 0.8) +
  geom_smooth(method = "loess", se = FALSE, colour = "darkorange", linewidth = 0.8) +
  labs(title    = "Model 2 (Quadratic): Residuals vs. Fitted",
       subtitle = "Residuals are much more randomly scattered around zero",
       x = "Fitted Values", y = "Residuals") +
  theme_minimal()

Interpretation: After adding the quadratic term, the residuals are much more randomly scattered around the zero line with no strong systematic pattern. This confirms that the quadratic model adequately captures the nonlinear relationship.

4. Side-by-Side Comparison

par(mfrow = c(1, 2))

# Linear model diagnostic
plot(lm1, which = 1,
     main  = "Model 1 (Linear)\nResiduals vs Fitted",
     pch   = 20, col = adjustcolor("steelblue", alpha.f = 0.5))

# Quadratic model diagnostic
plot(lm2, which = 1,
     main  = "Model 2 (Quadratic)\nResiduals vs Fitted",
     pch   = 20, col = adjustcolor("darkorange", alpha.f = 0.5))

par(mfrow = c(1, 1))

5. Model Comparison

comparison <- data.frame(
  Model       = c("Linear", "Quadratic"),
  R_squared   = c(summary(lm1)$r.squared,  summary(lm2)$r.squared),
  Adj_R2      = c(summary(lm1)$adj.r.squared, summary(lm2)$adj.r.squared),
  RSE         = c(summary(lm1)$sigma,       summary(lm2)$sigma)
)
comparison[, 2:4] <- round(comparison[, 2:4], 4)

kable(comparison,
      col.names = c("Model", "R²", "Adjusted R²", "Residual Std Error"),
      caption   = "Model Comparison: Linear vs. Quadratic") %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) %>%
  row_spec(2, bold = TRUE, background = "#d4edda")

Model Comparison: Linear vs. Quadratic
Model	R²	Adjusted R²	Residual Std Error
Linear	0.6059	0.6049	4.9058
Quadratic	0.6876	0.6860	4.3739

6. Conclusion

Criterion	Linear Model	Quadratic Model
Residual vs. Fitted	U-shaped pattern (nonlinearity missed)	Random scatter (nonlinearity captured)
R²	~0.606	~0.688
Residual Std Error	Higher	Lower

The Residual vs. Fitted plot is the key diagnostic tool here:

Linear model: The inverted-U arch in the residuals clearly signals that a nonlinear relationship exists and is being ignored.
Quadratic model: The residuals scatter randomly around zero — the curvature is accounted for. The quadratic term is highly significant (p < 0.001) and improves both R² and RSE.

Final answer: The relationship between horsepower and mpg in the Auto dataset is nonlinear. A quadratic (polynomial degree 2) model is more appropriate than a simple linear regression.

Session Info

sessionInfo()

## R version 4.5.2 (2025-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## time zone: UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] kableExtra_1.4.0 knitr_1.51       ggplot2_4.0.2    ISLR_1.4        
## 
## loaded via a namespace (and not attached):
##  [1] Matrix_1.7-4       gtable_0.3.6       jsonlite_2.0.0     compiler_4.5.2    
##  [5] xml2_1.5.2         stringr_1.6.0      jquerylib_0.1.4    splines_4.5.2     
##  [9] systemfonts_1.3.2  scales_1.4.0       textshaping_1.0.5  yaml_2.3.12       
## [13] fastmap_1.2.0      lattice_0.22-7     R6_2.6.1           labeling_0.4.3    
## [17] svglite_2.2.2      bslib_0.10.0       RColorBrewer_1.1-3 rlang_1.1.7       
## [21] cachem_1.1.0       stringi_1.8.7      xfun_0.56          sass_0.4.10       
## [25] S7_0.2.1           viridisLite_0.4.3  cli_3.6.5          withr_3.0.2       
## [29] magrittr_2.0.4     mgcv_1.9-3         digest_0.6.39      grid_4.5.2        
## [33] rstudioapi_0.18.0  lifecycle_1.0.5    nlme_3.1-168       vctrs_0.7.1       
## [37] evaluate_1.0.5     glue_1.8.0         farver_2.1.2       rmarkdown_2.30    
## [41] tools_4.5.2        htmltools_0.5.9