Regression on Auto Data: Checking Nonlinear Relationships

1. Introduction

In this analysis, we use the Auto dataset from the ISLR package to examine the relationship between mpg (miles per gallon) and horsepower. We fit both a linear and a polynomial (nonlinear) regression model, and use Residual vs. Fitted plots to assess whether the linear model captures the true relationship.

2. Load Libraries and Data

# Install ISLR if needed: install.packages("ISLR")
library(ISLR)
library(ggplot2)

# Load Auto dataset
data(Auto)

# Preview the data
head(Auto)

##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500

dim(Auto)

## [1] 392   9

3. Exploratory Data Analysis

Let’s first visualize the relationship between horsepower and mpg.

ggplot(Auto, aes(x = horsepower, y = mpg)) +
  geom_point(alpha = 0.5, color = "steelblue") +
  geom_smooth(method = "lm", se = TRUE, color = "red", linetype = "dashed", 
              aes(linetype = "Linear")) +
  geom_smooth(method = "loess", se = TRUE, color = "darkgreen", 
              aes(linetype = "LOESS")) +
  labs(title = "MPG vs. Horsepower",
       subtitle = "Red dashed = Linear fit | Green = LOESS (nonparametric) fit",
       x = "Horsepower", y = "Miles Per Gallon (MPG)") +
  theme_minimal()

Observation: The LOESS curve suggests a curved (nonlinear) relationship — a linear fit may not be adequate.

4. Model 1: Simple Linear Regression

# Fit linear model: mpg ~ horsepower
lm_fit <- lm(mpg ~ horsepower, data = Auto)
summary(lm_fit)

## 
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

Residual vs. Fitted Plot — Linear Model

# Extract fitted values and residuals
lm_diag <- data.frame(
  fitted    = fitted(lm_fit),
  residuals = residuals(lm_fit)
)

ggplot(lm_diag, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.5, color = "steelblue") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red", linewidth = 1) +
  geom_smooth(method = "loess", se = FALSE, color = "orange", linewidth = 1.2) +
  labs(
    title    = "Residual vs. Fitted Plot — Linear Model",
    subtitle = "A curved (U-shaped) pattern indicates the linear model misses nonlinearity",
    x        = "Fitted Values",
    y        = "Residuals"
  ) +
  theme_minimal()

Interpretation:
The residuals show a clear U-shaped (curved) pattern around the zero line. This tells us the linear model fails to capture the true nonlinear trend in the data. Residuals are systematically positive at low and high fitted values, and negative in the middle — a classic sign of a missed curve.

5. Model 2: Polynomial Regression (Degree 2)

To address the nonlinearity, we fit a quadratic (degree-2) polynomial model.

# Fit quadratic model: mpg ~ horsepower + horsepower^2
poly_fit <- lm(mpg ~ poly(horsepower, 2), data = Auto)
summary(poly_fit)

## 
## Call:
## lm(formula = mpg ~ poly(horsepower, 2), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.7135  -2.5943  -0.0859   2.2868  15.8961 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            23.4459     0.2209  106.13   <2e-16 ***
## poly(horsepower, 2)1 -120.1377     4.3739  -27.47   <2e-16 ***
## poly(horsepower, 2)2   44.0895     4.3739   10.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared:  0.6876, Adjusted R-squared:  0.686 
## F-statistic:   428 on 2 and 389 DF,  p-value: < 2.2e-16

Residual vs. Fitted Plot — Polynomial Model

poly_diag <- data.frame(
  fitted    = fitted(poly_fit),
  residuals = residuals(poly_fit)
)

ggplot(poly_diag, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.5, color = "darkorchid") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red", linewidth = 1) +
  geom_smooth(method = "loess", se = FALSE, color = "orange", linewidth = 1.2) +
  labs(
    title    = "Residual vs. Fitted Plot — Polynomial Model (Degree 2)",
    subtitle = "Residuals should now scatter randomly around zero with no pattern",
    x        = "Fitted Values",
    y        = "Residuals"
  ) +
  theme_minimal()

Interpretation:
The residuals are now more randomly scattered around the zero line with no strong systematic pattern. This indicates the quadratic model has successfully captured much of the nonlinear trend.

6. Side-by-Side Comparison

library(gridExtra)

p1 <- ggplot(lm_diag, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.4, color = "steelblue") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  geom_smooth(method = "loess", se = FALSE, color = "orange") +
  labs(title = "Linear Model",
       subtitle = "U-shaped pattern → nonlinearity missed",
       x = "Fitted Values", y = "Residuals") +
  theme_minimal()

p2 <- ggplot(poly_diag, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.4, color = "darkorchid") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  geom_smooth(method = "loess", se = FALSE, color = "orange") +
  labs(title = "Polynomial Model (Degree 2)",
       subtitle = "Random scatter → nonlinearity captured",
       x = "Fitted Values", y = "Residuals") +
  theme_minimal()

grid.arrange(p1, p2, ncol = 2)

7. Model Fit Comparison

cat("=== Linear Model ===\n")

## === Linear Model ===

cat("R-squared:  ", round(summary(lm_fit)$r.squared, 4), "\n")

## R-squared:   0.6059

cat("Adj R-sq:   ", round(summary(lm_fit)$adj.r.squared, 4), "\n")

## Adj R-sq:    0.6049

cat("RSE:        ", round(summary(lm_fit)$sigma, 4), "\n\n")

## RSE:         4.9058

cat("=== Polynomial Model (Degree 2) ===\n")

## === Polynomial Model (Degree 2) ===

cat("R-squared:  ", round(summary(poly_fit)$r.squared, 4), "\n")

## R-squared:   0.6876

cat("Adj R-sq:   ", round(summary(poly_fit)$adj.r.squared, 4), "\n")

## Adj R-sq:    0.686

cat("RSE:        ", round(summary(poly_fit)$sigma, 4), "\n")

## RSE:         4.3739

8. Conclusion

Diagnostic	Linear Model	Polynomial Model (Degree 2)
Residual vs. Fitted pattern	U-shaped (nonlinear trend present)	Random scatter (good fit)
R²	~0.606	~0.688
Conclusion	Underfits the nonlinear relationship	Captures the curvature well

Key Takeaways:

The Residual vs. Fitted plot is a powerful diagnostic tool. A U-shape or inverted U-shape in the residuals is a clear signal of a missed nonlinear relationship.
In the Auto dataset, mpg and horsepower have a nonlinear (concave) relationship — as horsepower increases, mpg decreases rapidly at first, then levels off.
A polynomial regression (degree 2) significantly improves the fit and produces residuals that are more randomly distributed around zero.

Analysis performed in R using the ISLR, ggplot2, and gridExtra packages.