This report looks at the classic Auto dataset to check
whether a simple linear regression of fuel efficiency (mpg)
on engine horsepower is actually appropriate, or whether
there is a nonlinear pattern hiding in the residuals.
The main tool I use is the residual vs. fitted (RVF) plot. The idea is straightforward: if the model is correctly specified, residuals should look like random noise centered at zero. If you spot a curve, a funnel, or any repeating pattern, that is the model telling you it is missing something.
# load packages
library(ISLR2) # contains the Auto dataset
library(ggplot2)
# grab the data, drop the name column (character, not useful here)
df <- Auto[, -9]
# quick look
str(df)#> 'data.frame': 392 obs. of 8 variables:
#> $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
#> $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
#> $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
#> $ horsepower : int 130 165 150 150 140 198 220 215 225 190 ...
#> $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
#> $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
#> $ year : int 70 70 70 70 70 70 70 70 70 70 ...
#> $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
#> mpg horsepower
#> Min. : 9.00 Min. : 46.0
#> 1st Qu.:17.00 1st Qu.: 75.0
#> Median :22.75 Median : 93.5
#> Mean :23.45 Mean :104.5
#> 3rd Qu.:29.00 3rd Qu.:126.0
#> Max. :46.60 Max. :230.0
Let’s first just visualise the raw relationship.
ggplot(df, aes(x = horsepower, y = mpg)) +
geom_point(color = "steelblue", alpha = 0.5, size = 2) +
geom_smooth(method = "lm", color = "tomato", se = FALSE,
linetype = "dashed", linewidth = 1.1) +
geom_smooth(method = "loess", color = "darkorange", se = FALSE,
linewidth = 1.1) +
labs(
title = "Fuel efficiency vs engine size",
x = "Horsepower",
y = "MPG",
caption = "Dashed red = OLS line | Solid orange = LOESS smoother"
) +
theme_bw(base_size = 12)The orange LOESS curve clearly does not follow the OLS line — it drops steeply at first, then levels off. This is a classic diminishing-returns shape and already suggests a linear model will not do a great job.
#>
#> Call:
#> lm(formula = mpg ~ horsepower, data = df)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -13.5710 -3.2592 -0.3435 2.7630 16.9240
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
#> horsepower -0.157845 0.006446 -24.49 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 4.906 on 390 degrees of freedom
#> Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
#> F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
The coefficient on horsepower is negative and highly
significant — more power means worse fuel economy, no surprise. But
significance alone does not tell us whether the functional form is
right. That is where residual diagnostics come in.
# build a small data frame of diagnostics
diag1 <- data.frame(
yhat = fitted(fit1),
e = resid(fit1)
)
ggplot(diag1, aes(x = yhat, y = e)) +
geom_point(alpha = 0.45, color = "tomato", size = 1.7) +
geom_hline(yintercept = 0, linetype = "dotted", linewidth = 0.9) +
geom_smooth(method = "loess", se = FALSE,
color = "black", linewidth = 1, linetype = "solid") +
labs(
title = "RVF plot — simple linear model",
x = "Fitted values",
y = "Residuals",
caption = "A curved smoother = the model misses nonlinearity"
) +
theme_bw(base_size = 12)The black smoother traces a clear inverted-U arc across the fitted values. Residuals start negative, rise to a peak in the middle, then fall again. That shape tells us the linear model systematically under-predicts mpg for cars in the mid-range of horsepower and over-predicts at the extremes — textbook evidence of a missed nonlinear trend.
Since the scatter plot hinted at diminishing returns, a quadratic term is a natural first fix. The model becomes:
\[\widehat{\text{mpg}} = \hat\beta_0 + \hat\beta_1\,hp + \hat\beta_2\,hp^2\]
#>
#> Call:
#> lm(formula = mpg ~ poly(horsepower, 2, raw = TRUE), data = df)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -14.7135 -2.5943 -0.0859 2.2868 15.8961
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 56.9000997 1.8004268 31.60 <2e-16 ***
#> poly(horsepower, 2, raw = TRUE)1 -0.4661896 0.0311246 -14.98 <2e-16 ***
#> poly(horsepower, 2, raw = TRUE)2 0.0012305 0.0001221 10.08 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 4.374 on 389 degrees of freedom
#> Multiple R-squared: 0.6876, Adjusted R-squared: 0.686
#> F-statistic: 428 on 2 and 389 DF, p-value: < 2.2e-16
Both the linear and quadratic terms are significant. Adjusted R² went from 0.605 to 0.686 — a meaningful improvement.
diag2 <- data.frame(
yhat = fitted(fit2),
e = resid(fit2)
)
ggplot(diag2, aes(x = yhat, y = e)) +
geom_point(alpha = 0.45, color = "steelblue", size = 1.7) +
geom_hline(yintercept = 0, linetype = "dotted", linewidth = 0.9) +
geom_smooth(method = "loess", se = FALSE,
color = "black", linewidth = 1) +
labs(
title = "RVF plot — quadratic model",
x = "Fitted values",
y = "Residuals",
caption = "Smoother stays near zero: no obvious pattern remaining"
) +
theme_bw(base_size = 12)Much better. The smoother is nearly flat across the whole range. There is still some spread in the residuals (heteroscedasticity is a separate issue), but there is no longer a systematic curve. The quadratic term has absorbed the nonlinearity.
diag1$model <- "1. Linear"
diag2$model <- "2. Quadratic"
both <- rbind(diag1, diag2)
ggplot(both, aes(x = yhat, y = e, color = model)) +
geom_point(alpha = 0.3, size = 1.5) +
geom_hline(yintercept = 0, linetype = "dotted") +
geom_smooth(method = "loess", se = FALSE, color = "black", linewidth = 1) +
facet_wrap(~model, scales = "free_x") +
scale_color_manual(values = c("tomato", "steelblue")) +
labs(
title = "Residual vs. Fitted — linear vs. quadratic",
x = "Fitted values",
y = "Residuals"
) +
theme_bw(base_size = 12) +
theme(legend.position = "none")| Model | R² | Adj. R² | RVF shape |
|---|---|---|---|
| Linear (hp) | 0.606 | 0.605 | Inverted-U arc (bad) |
| Quadratic (hp + hp^2) | 0.688 | 0.686 | Roughly flat (good) |
The residual vs. fitted plot is one of the simplest and most informative tools in regression diagnostics. In this case:
For the Auto data, fuel efficiency has a diminishing
relationship with horsepower — each additional unit of horsepower hurts
mpg less at the high end than at the low end. A quadratic polynomial
captures this adequately, though one could also try a log transformation
of horsepower as an alternative.