1 Overview

This report looks at the classic Auto dataset to check whether a simple linear regression of fuel efficiency (mpg) on engine horsepower is actually appropriate, or whether there is a nonlinear pattern hiding in the residuals.

The main tool I use is the residual vs. fitted (RVF) plot. The idea is straightforward: if the model is correctly specified, residuals should look like random noise centered at zero. If you spot a curve, a funnel, or any repeating pattern, that is the model telling you it is missing something.


2 Setup and Data

# load packages
library(ISLR2)   # contains the Auto dataset
library(ggplot2)

# grab the data, drop the name column (character, not useful here)
df <- Auto[, -9]

# quick look
str(df)
#> 'data.frame':    392 obs. of  8 variables:
#>  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
#>  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
#>  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
#>  $ horsepower  : int  130 165 150 150 140 198 220 215 225 190 ...
#>  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
#>  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
#>  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
#>  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
summary(df[, c("mpg", "horsepower")])
#>       mpg          horsepower   
#>  Min.   : 9.00   Min.   : 46.0  
#>  1st Qu.:17.00   1st Qu.: 75.0  
#>  Median :22.75   Median : 93.5  
#>  Mean   :23.45   Mean   :104.5  
#>  3rd Qu.:29.00   3rd Qu.:126.0  
#>  Max.   :46.60   Max.   :230.0

Let’s first just visualise the raw relationship.

ggplot(df, aes(x = horsepower, y = mpg)) +
  geom_point(color = "steelblue", alpha = 0.5, size = 2) +
  geom_smooth(method = "lm",    color = "tomato",      se = FALSE,
              linetype = "dashed", linewidth = 1.1) +
  geom_smooth(method = "loess", color = "darkorange",  se = FALSE,
              linewidth = 1.1) +
  labs(
    title = "Fuel efficiency vs engine size",
    x     = "Horsepower",
    y     = "MPG",
    caption = "Dashed red = OLS line  |  Solid orange = LOESS smoother"
  ) +
  theme_bw(base_size = 12)

The orange LOESS curve clearly does not follow the OLS line — it drops steeply at first, then levels off. This is a classic diminishing-returns shape and already suggests a linear model will not do a great job.


3 Linear Regression (baseline)

fit1 <- lm(mpg ~ horsepower, data = df)
summary(fit1)
#> 
#> Call:
#> lm(formula = mpg ~ horsepower, data = df)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -13.5710  -3.2592  -0.3435   2.7630  16.9240 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
#> horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 4.906 on 390 degrees of freedom
#> Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
#> F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

The coefficient on horsepower is negative and highly significant — more power means worse fuel economy, no surprise. But significance alone does not tell us whether the functional form is right. That is where residual diagnostics come in.

3.1 Residual vs. Fitted Plot

# build a small data frame of diagnostics
diag1 <- data.frame(
  yhat = fitted(fit1),
  e    = resid(fit1)
)

ggplot(diag1, aes(x = yhat, y = e)) +
  geom_point(alpha = 0.45, color = "tomato", size = 1.7) +
  geom_hline(yintercept = 0, linetype = "dotted", linewidth = 0.9) +
  geom_smooth(method = "loess", se = FALSE,
              color = "black", linewidth = 1, linetype = "solid") +
  labs(
    title   = "RVF plot — simple linear model",
    x       = "Fitted values",
    y       = "Residuals",
    caption = "A curved smoother = the model misses nonlinearity"
  ) +
  theme_bw(base_size = 12)

The black smoother traces a clear inverted-U arc across the fitted values. Residuals start negative, rise to a peak in the middle, then fall again. That shape tells us the linear model systematically under-predicts mpg for cars in the mid-range of horsepower and over-predicts at the extremes — textbook evidence of a missed nonlinear trend.


4 Polynomial Regression (degree 2)

Since the scatter plot hinted at diminishing returns, a quadratic term is a natural first fix. The model becomes:

\[\widehat{\text{mpg}} = \hat\beta_0 + \hat\beta_1\,hp + \hat\beta_2\,hp^2\]

fit2 <- lm(mpg ~ poly(horsepower, 2, raw = TRUE), data = df)
summary(fit2)
#> 
#> Call:
#> lm(formula = mpg ~ poly(horsepower, 2, raw = TRUE), data = df)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -14.7135  -2.5943  -0.0859   2.2868  15.8961 
#> 
#> Coefficients:
#>                                    Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)                      56.9000997  1.8004268   31.60   <2e-16 ***
#> poly(horsepower, 2, raw = TRUE)1 -0.4661896  0.0311246  -14.98   <2e-16 ***
#> poly(horsepower, 2, raw = TRUE)2  0.0012305  0.0001221   10.08   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 4.374 on 389 degrees of freedom
#> Multiple R-squared:  0.6876, Adjusted R-squared:  0.686 
#> F-statistic:   428 on 2 and 389 DF,  p-value: < 2.2e-16

Both the linear and quadratic terms are significant. Adjusted R² went from 0.605 to 0.686 — a meaningful improvement.

4.1 Residual vs. Fitted Plot

diag2 <- data.frame(
  yhat = fitted(fit2),
  e    = resid(fit2)
)

ggplot(diag2, aes(x = yhat, y = e)) +
  geom_point(alpha = 0.45, color = "steelblue", size = 1.7) +
  geom_hline(yintercept = 0, linetype = "dotted", linewidth = 0.9) +
  geom_smooth(method = "loess", se = FALSE,
              color = "black", linewidth = 1) +
  labs(
    title   = "RVF plot — quadratic model",
    x       = "Fitted values",
    y       = "Residuals",
    caption = "Smoother stays near zero: no obvious pattern remaining"
  ) +
  theme_bw(base_size = 12)

Much better. The smoother is nearly flat across the whole range. There is still some spread in the residuals (heteroscedasticity is a separate issue), but there is no longer a systematic curve. The quadratic term has absorbed the nonlinearity.


5 Putting It Side by Side

diag1$model <- "1. Linear"
diag2$model <- "2. Quadratic"
both <- rbind(diag1, diag2)

ggplot(both, aes(x = yhat, y = e, color = model)) +
  geom_point(alpha = 0.3, size = 1.5) +
  geom_hline(yintercept = 0, linetype = "dotted") +
  geom_smooth(method = "loess", se = FALSE, color = "black", linewidth = 1) +
  facet_wrap(~model, scales = "free_x") +
  scale_color_manual(values = c("tomato", "steelblue")) +
  labs(
    title = "Residual vs. Fitted — linear vs. quadratic",
    x     = "Fitted values",
    y     = "Residuals"
  ) +
  theme_bw(base_size = 12) +
  theme(legend.position = "none")


6 Summary of Results

Model Adj. R² RVF shape
Linear (hp) 0.606 0.605 Inverted-U arc (bad)
Quadratic (hp + hp^2) 0.688 0.686 Roughly flat (good)

7 Conclusion

The residual vs. fitted plot is one of the simplest and most informative tools in regression diagnostics. In this case:

For the Auto data, fuel efficiency has a diminishing relationship with horsepower — each additional unit of horsepower hurts mpg less at the high end than at the low end. A quadratic polynomial captures this adequately, though one could also try a log transformation of horsepower as an alternative.