Overview

This analysis uses the Auto dataset from the ISLR package to examine the relationship between mpg (miles per gallon) and horsepower. We fit a linear regression model and then use Residual vs. Fitted plots to check whether a nonlinear relationship exists.


1. Load Data

# Install ISLR if needed
if (!require(ISLR)) install.packages("ISLR")
library(ISLR)

# Load Auto dataset
data(Auto)
head(Auto)
##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500
dim(Auto)
## [1] 392   9

2. Exploratory Plot

plot(Auto$horsepower, Auto$mpg,
     xlab = "Horsepower",
     ylab = "MPG (Miles Per Gallon)",
     main = "MPG vs. Horsepower",
     pch = 19, col = "steelblue", cex = 0.7)
abline(lm(mpg ~ horsepower, data = Auto), col = "red", lwd = 2)

The scatter plot already hints at a curved (nonlinear) relationship between horsepower and mpg.


3. Linear Regression Model

# Fit simple linear regression
lm_fit <- lm(mpg ~ horsepower, data = Auto)
summary(lm_fit)
## 
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

4. Residual vs. Fitted Plot — Linear Model

Interpretation key: - Linear relationship → residuals scatter randomly around 0 (no pattern) - Nonlinear relationship → residuals show a “U” shape or systematic curvature

plot(fitted(lm_fit), residuals(lm_fit),
     xlab = "Fitted Values",
     ylab = "Residuals",
     main = "Residual vs. Fitted — Linear Model (mpg ~ horsepower)",
     pch = 19, col = "steelblue", cex = 0.7)
abline(h = 0, col = "red", lwd = 2, lty = 2)
lines(lowess(fitted(lm_fit), residuals(lm_fit)), col = "orange", lwd = 2)
legend("topright", legend = c("Zero line", "LOWESS smooth"),
       col = c("red", "orange"), lty = c(2, 1), lwd = 2)

Observation: The LOWESS smoother traces a clear “U” shape (inverted curve), indicating the linear model fails to capture the nonlinear trend. This is evidence of a nonlinear relationship.


5. Polynomial Regression Model (Degree 2)

To address nonlinearity, we fit a quadratic (degree-2) polynomial model:

lm_poly <- lm(mpg ~ poly(horsepower, 2), data = Auto)
summary(lm_poly)
## 
## Call:
## lm(formula = mpg ~ poly(horsepower, 2), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.7135  -2.5943  -0.0859   2.2868  15.8961 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            23.4459     0.2209  106.13   <2e-16 ***
## poly(horsepower, 2)1 -120.1377     4.3739  -27.47   <2e-16 ***
## poly(horsepower, 2)2   44.0895     4.3739   10.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared:  0.6876, Adjusted R-squared:  0.686 
## F-statistic:   428 on 2 and 389 DF,  p-value: < 2.2e-16

6. Residual vs. Fitted Plot — Polynomial Model

plot(fitted(lm_poly), residuals(lm_poly),
     xlab = "Fitted Values",
     ylab = "Residuals",
     main = "Residual vs. Fitted — Polynomial Model (degree 2)",
     pch = 19, col = "darkorange", cex = 0.7)
abline(h = 0, col = "red", lwd = 2, lty = 2)
lines(lowess(fitted(lm_poly), residuals(lm_poly)), col = "purple", lwd = 2)
legend("topright", legend = c("Zero line", "LOWESS smooth"),
       col = c("red", "purple"), lty = c(2, 1), lwd = 2)

Observation: With the polynomial model, residuals are much more randomly scattered around zero — the curvature is largely removed, confirming the quadratic term captures the nonlinear pattern.


7. Side-by-Side Comparison

par(mfrow = c(1, 2))

# Linear
plot(fitted(lm_fit), residuals(lm_fit),
     xlab = "Fitted Values", ylab = "Residuals",
     main = "Linear Model\n(Shows U-shape = Nonlinearity)",
     pch = 19, col = "steelblue", cex = 0.7)
abline(h = 0, col = "red", lwd = 2, lty = 2)
lines(lowess(fitted(lm_fit), residuals(lm_fit)), col = "orange", lwd = 2)

# Polynomial
plot(fitted(lm_poly), residuals(lm_poly),
     xlab = "Fitted Values", ylab = "Residuals",
     main = "Polynomial Model (degree 2)\n(Residuals more random)",
     pch = 19, col = "darkorange", cex = 0.7)
abline(h = 0, col = "red", lwd = 2, lty = 2)
lines(lowess(fitted(lm_poly), residuals(lm_poly)), col = "purple", lwd = 2)

par(mfrow = c(1, 1))

8. Model Comparison

cat("=== Linear Model ===\n")
## === Linear Model ===
cat("R-squared:", summary(lm_fit)$r.squared, "\n")
## R-squared: 0.6059483
cat("RSE:", summary(lm_fit)$sigma, "\n\n")
## RSE: 4.905757
cat("=== Polynomial Model (degree 2) ===\n")
## === Polynomial Model (degree 2) ===
cat("R-squared:", summary(lm_poly)$r.squared, "\n")
## R-squared: 0.687559
cat("RSE:", summary(lm_poly)$sigma, "\n")
## RSE: 4.373921

9. Conclusion

Model Residual Pattern
Linear (mpg ~ horsepower) ~0.606 U-shape → nonlinearity detected
Polynomial (mpg ~ poly(hp, 2)) ~0.688 Random scatter → good fit

Key Takeaway: The Residual vs. Fitted plot for the linear model shows a clear “U” shape, which is the classic diagnostic sign of an unmodeled nonlinear relationship. Fitting a degree-2 polynomial removes this pattern, yielding better model fit (higher R², lower RSE) and randomly scattered residuals.

This approach — inspecting residual plots and then upgrading to polynomial regression — is a standard workflow for detecting and addressing nonlinearity in regression analysis.