Homework_4: Detecting Nonlinearity in Auto Data

Overview

This analysis uses the Auto dataset from the ISLR package to demonstrate how Residual vs. Fitted plots can reveal nonlinear relationships that a linear model fails to capture.

We compare two models:

Model 1 (Linear): mpg ~ horsepower
Model 2 (Quadratic): mpg ~ horsepower + horsepower²

1. Load Data & Packages

# Install ISLR if needed
if (!require(ISLR))  install.packages("ISLR",  repos = "https://cloud.r-project.org")
if (!require(ggplot2)) install.packages("ggplot2", repos = "https://cloud.r-project.org")

library(ISLR)
library(ggplot2)

data(Auto)
head(Auto[, c("mpg", "horsepower", "weight", "year")])

##   mpg horsepower weight year
## 1  18        130   3504   70
## 2  15        165   3693   70
## 3  18        150   3436   70
## 4  16        150   3433   70
## 5  17        140   3449   70
## 6  15        198   4341   70

2. Exploratory Plot

Before fitting, visualise the raw relationship between mpg and horsepower.

ggplot(Auto, aes(x = horsepower, y = mpg)) +
  geom_point(alpha = 0.5, colour = "#2c7bb6") +
  geom_smooth(method = "loess", se = TRUE, colour = "#d7191c", linewidth = 1) +
  labs(
    title  = "MPG vs Horsepower (Auto Dataset)",
    subtitle = "LOESS smoother suggests a nonlinear (curved) relationship",
    x = "Horsepower", y = "Miles Per Gallon (mpg)"
  ) +
  theme_minimal(base_size = 13)

Figure 1: Scatter plot — mpg vs horsepower

Observation: The LOESS smoother curves downward more steeply at low horsepower values, hinting at a nonlinear (concave) relationship.

3. Model 1 — Simple Linear Regression

lm1 <- lm(mpg ~ horsepower, data = Auto)
summary(lm1)

## 
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

Residual vs. Fitted Plot — Linear Model

df1 <- data.frame(
  fitted    = fitted(lm1),
  residuals = residuals(lm1)
)

ggplot(df1, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.45, colour = "#2c7bb6") +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "grey40", linewidth = 0.8) +
  geom_smooth(method = "loess", se = FALSE, colour = "#d7191c", linewidth = 1.2) +
  labs(
    title    = "Residual vs. Fitted — Linear Model",
    subtitle = "A clear U-shaped pattern → the linear model misses nonlinearity",
    x = "Fitted Values (ŷ)", y = "Residuals (y − ŷ)"
  ) +
  theme_minimal(base_size = 13)

Figure 2: Residual vs. Fitted — Linear model

⚠️ Interpretation: The red LOESS curve forms a clear U-shape, bowing upward in the middle. This systematic curvature tells us that the linear model under-predicts mpg for cars with moderate horsepower and over-predicts for cars at the extremes. The linearity assumption is violated.

4. Model 2 — Quadratic (Polynomial) Regression

lm2 <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)
summary(lm2)

## 
## Call:
## lm(formula = mpg ~ horsepower + I(horsepower^2), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.7135  -2.5943  -0.0859   2.2868  15.8961 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     56.9000997  1.8004268   31.60   <2e-16 ***
## horsepower      -0.4661896  0.0311246  -14.98   <2e-16 ***
## I(horsepower^2)  0.0012305  0.0001221   10.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared:  0.6876, Adjusted R-squared:  0.686 
## F-statistic:   428 on 2 and 389 DF,  p-value: < 2.2e-16

Residual vs. Fitted Plot — Quadratic Model

df2 <- data.frame(
  fitted    = fitted(lm2),
  residuals = residuals(lm2)
)

ggplot(df2, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.45, colour = "#1a9641") +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "grey40", linewidth = 0.8) +
  geom_smooth(method = "loess", se = FALSE, colour = "#d7191c", linewidth = 1.2) +
  labs(
    title    = "Residual vs. Fitted — Quadratic Model",
    subtitle = "Residuals scatter randomly around zero → nonlinearity captured",
    x = "Fitted Values (ŷ)", y = "Residuals (y − ŷ)"
  ) +
  theme_minimal(base_size = 13)

Figure 3: Residual vs. Fitted — Quadratic model

✅ Interpretation: The residuals now scatter randomly around the horizontal zero line with no obvious curvature. Adding the quadratic term has successfully absorbed the nonlinear trend.

5. Side-by-Side Comparison

df1$model <- "Linear:  mpg ~ hp"
df2$model <- "Quadratic:  mpg ~ hp + hp²"
df_all <- rbind(df1, df2)

ggplot(df_all, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.4, colour = "#2c7bb6") +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "grey40", linewidth = 0.8) +
  geom_smooth(method = "loess", se = FALSE, colour = "#d7191c", linewidth = 1.2) +
  facet_wrap(~model, scales = "free_x") +
  labs(
    title = "Residual vs. Fitted: Linear vs. Quadratic",
    x = "Fitted Values (ŷ)", y = "Residuals (y − ŷ)"
  ) +
  theme_minimal(base_size = 13)

Figure 4: Side-by-side residual plots

6. Model Fit Statistics

cat("=== Linear Model ===\n")

## === Linear Model ===

cat(sprintf("  R²       : %.4f\n", summary(lm1)$r.squared))

##   R²       : 0.6059

cat(sprintf("  Adj. R²  : %.4f\n", summary(lm1)$adj.r.squared))

##   Adj. R²  : 0.6049

cat(sprintf("  RSE      : %.4f\n", summary(lm1)$sigma))

##   RSE      : 4.9058

cat("\n=== Quadratic Model ===\n")

## 
## === Quadratic Model ===

cat(sprintf("  R²       : %.4f\n", summary(lm2)$r.squared))

##   R²       : 0.6876

cat(sprintf("  Adj. R²  : %.4f\n", summary(lm2)$adj.r.squared))

##   Adj. R²  : 0.6860

cat(sprintf("  RSE      : %.4f\n", summary(lm2)$sigma))

##   RSE      : 4.3739

anova(lm1, lm2)

## Analysis of Variance Table
## 
## Model 1: mpg ~ horsepower
## Model 2: mpg ~ horsepower + I(horsepower^2)
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    390 9385.9                                  
## 2    389 7442.0  1    1943.9 101.61 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA F-test confirms that the quadratic term provides a statistically significant improvement over the linear model (p < 2.2e-16).

7. Summary

Diagnostic	Linear Model	Quadratic Model
Residual vs. Fitted shape	U-shaped (nonlinearity)	Random scatter ✅
R²	~0.606	~0.688
Linearity assumption	Violated ❌	Satisfied ✅

Key Takeaway

A U-shape or systematic curve in the Residual vs. Fitted plot is the classic visual signal that your model has failed to capture a nonlinear relationship. The fix — adding polynomial terms, applying a log transformation, or using a nonparametric smoother — can be verified by checking that the pattern disappears in the updated plot.

Analysis performed using R with the ISLR and ggplot2 packages.