HW 04 - Nonlinear Relationship Check on Auto Data

Overview

In this assignment, we examine the Auto dataset from the ISLR package to check whether the relationship between mpg and horsepower is linear or nonlinear. We use Residuals vs. Fitted plots to diagnose model fit.

Load Data

library(ISLR)
library(ggplot2)

# Load Auto dataset
data(Auto)
head(Auto)

##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500

1. Linear Model: mpg ~ horsepower

# Fit linear regression
model_linear <- lm(mpg ~ horsepower, data = Auto)
summary(model_linear)

## 
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

Residuals vs. Fitted Plot (Linear Model)

# Extract fitted values and residuals
df_linear <- data.frame(
  Fitted = fitted(model_linear),
  Residuals = residuals(model_linear)
)

ggplot(df_linear, aes(x = Fitted, y = Residuals)) +
  geom_point(alpha = 0.5, color = "steelblue") +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  geom_smooth(method = "loess", color = "orange", se = FALSE) +
  labs(
    title = "Residuals vs. Fitted - Linear Model",
    x = "Fitted Values",
    y = "Residuals"
  ) +
  theme_minimal()

Interpretation: The residuals show a clear “U” shape (curved pattern), meaning the linear model fails to capture a nonlinear trend between mpg and horsepower.

2. Nonlinear Model: mpg ~ horsepower + horsepower²

# Fit polynomial regression (degree 2)
model_poly <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)
summary(model_poly)

## 
## Call:
## lm(formula = mpg ~ horsepower + I(horsepower^2), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.7135  -2.5943  -0.0859   2.2868  15.8961 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     56.9000997  1.8004268   31.60   <2e-16 ***
## horsepower      -0.4661896  0.0311246  -14.98   <2e-16 ***
## I(horsepower^2)  0.0012305  0.0001221   10.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared:  0.6876, Adjusted R-squared:  0.686 
## F-statistic:   428 on 2 and 389 DF,  p-value: < 2.2e-16

Residuals vs. Fitted Plot (Nonlinear Model)

df_poly <- data.frame(
  Fitted = fitted(model_poly),
  Residuals = residuals(model_poly)
)

ggplot(df_poly, aes(x = Fitted, y = Residuals)) +
  geom_point(alpha = 0.5, color = "darkorchid") +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  geom_smooth(method = "loess", color = "orange", se = FALSE) +
  labs(
    title = "Residuals vs. Fitted - Polynomial Model (Degree 2)",
    x = "Fitted Values",
    y = "Residuals"
  ) +
  theme_minimal()

Interpretation: The residuals are now more randomly scattered around zero with no clear systematic pattern, indicating the polynomial model better captures the nonlinear relationship.

3. Side-by-Side Comparison

df_linear$Model <- "Linear"
df_poly$Model <- "Polynomial (Degree 2)"

df_combined <- rbind(df_linear, df_poly)

ggplot(df_combined, aes(x = Fitted, y = Residuals, color = Model)) +
  geom_point(alpha = 0.4) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "black") +
  geom_smooth(method = "loess", se = FALSE) +
  facet_wrap(~Model, scales = "free_x") +
  labs(
    title = "Residuals vs. Fitted: Linear vs. Polynomial Model",
    x = "Fitted Values",
    y = "Residuals"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Conclusion

Model	Residual Pattern	Conclusion
Linear (`mpg ~ horsepower`)	U-shaped curve	Nonlinear trend NOT captured
Polynomial (`mpg ~ horsepower + horsepower²`)	Randomly scattered	Better fit, nonlinearity addressed

The Residuals vs. Fitted plot is a powerful diagnostic tool:

Random scatter around zero → model is appropriate (linear relationship holds)
U-shape or systematic curvature → model is missing a nonlinear trend

In the Auto data, adding a quadratic term for horsepower significantly improves the model fit.