In this assignment, we examine the Auto dataset from
the ISLR package to check whether the relationship between
mpg and horsepower is linear or nonlinear. We
use Residuals vs. Fitted plots to diagnose model
fit.
library(ISLR)
library(ggplot2)
# Load Auto dataset
data(Auto)
head(Auto)
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
# Fit linear regression
model_linear <- lm(mpg ~ horsepower, data = Auto)
summary(model_linear)
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
# Extract fitted values and residuals
df_linear <- data.frame(
Fitted = fitted(model_linear),
Residuals = residuals(model_linear)
)
ggplot(df_linear, aes(x = Fitted, y = Residuals)) +
geom_point(alpha = 0.5, color = "steelblue") +
geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
geom_smooth(method = "loess", color = "orange", se = FALSE) +
labs(
title = "Residuals vs. Fitted - Linear Model",
x = "Fitted Values",
y = "Residuals"
) +
theme_minimal()
Interpretation: The residuals show a clear “U” shape (curved pattern), meaning the linear model fails to capture a nonlinear trend between
mpgandhorsepower.
# Fit polynomial regression (degree 2)
model_poly <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)
summary(model_poly)
##
## Call:
## lm(formula = mpg ~ horsepower + I(horsepower^2), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.7135 -2.5943 -0.0859 2.2868 15.8961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.9000997 1.8004268 31.60 <2e-16 ***
## horsepower -0.4661896 0.0311246 -14.98 <2e-16 ***
## I(horsepower^2) 0.0012305 0.0001221 10.08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared: 0.6876, Adjusted R-squared: 0.686
## F-statistic: 428 on 2 and 389 DF, p-value: < 2.2e-16
df_poly <- data.frame(
Fitted = fitted(model_poly),
Residuals = residuals(model_poly)
)
ggplot(df_poly, aes(x = Fitted, y = Residuals)) +
geom_point(alpha = 0.5, color = "darkorchid") +
geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
geom_smooth(method = "loess", color = "orange", se = FALSE) +
labs(
title = "Residuals vs. Fitted - Polynomial Model (Degree 2)",
x = "Fitted Values",
y = "Residuals"
) +
theme_minimal()
Interpretation: The residuals are now more randomly scattered around zero with no clear systematic pattern, indicating the polynomial model better captures the nonlinear relationship.
df_linear$Model <- "Linear"
df_poly$Model <- "Polynomial (Degree 2)"
df_combined <- rbind(df_linear, df_poly)
ggplot(df_combined, aes(x = Fitted, y = Residuals, color = Model)) +
geom_point(alpha = 0.4) +
geom_hline(yintercept = 0, linetype = "dashed", color = "black") +
geom_smooth(method = "loess", se = FALSE) +
facet_wrap(~Model, scales = "free_x") +
labs(
title = "Residuals vs. Fitted: Linear vs. Polynomial Model",
x = "Fitted Values",
y = "Residuals"
) +
theme_minimal() +
theme(legend.position = "none")
| Model | Residual Pattern | Conclusion |
|---|---|---|
Linear (mpg ~ horsepower) |
U-shaped curve | Nonlinear trend NOT captured |
Polynomial (mpg ~ horsepower + horsepower²) |
Randomly scattered | Better fit, nonlinearity addressed |
The Residuals vs. Fitted plot is a powerful diagnostic tool:
In the Auto data, adding a quadratic term for horsepower significantly improves the model fit.