Backward selection

Importing necessary libraries

library(tidymodels)
library(tidyverse)
library(ggplot2)
library(GGally)
library(ggformula)
library(ISLR)
library(dplyr)
library(car)

Use the Auto dataset (can be found in ISLR R package). Our objectives: - Predict acceleration using the remaining quantitative variables. - Which, if any, of the variables are significant predictors of acceleration. - How good is the resulting prediction?

1.

Remove the name and origin columns from the dataset. Then - Produce a ggpairs plot. Write a short discussion (2–3 sentences) about what the plot suggests for multiple linear regression (collinearity, pairwise relationships, etc.).

auto <- Auto

auto <- auto |>
  dplyr::select(-name,-origin)

glimpse(auto)
## Rows: 392
## Columns: 7
## $ mpg          <dbl> 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14, 2…
## $ cylinders    <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, 4, …
## $ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383, 34…
## $ horsepower   <dbl> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170, 16…
## $ weight       <dbl> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, 385…
## $ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8.5, …
## $ year         <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 7…
summary(auto)
##       mpg          cylinders      displacement     horsepower        weight    
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
##   acceleration        year      
##  Min.   : 8.00   Min.   :70.00  
##  1st Qu.:13.78   1st Qu.:73.00  
##  Median :15.50   Median :76.00  
##  Mean   :15.54   Mean   :75.98  
##  3rd Qu.:17.02   3rd Qu.:79.00  
##  Max.   :24.80   Max.   :82.00

As suggested by the question, we have 7 variables after removing name and origin columns from the Auto dataset. The summary() function gives us the five number summary (min, Q1, median, Q3, max) for all the variables in the dataset. For variable mpg, values range from 9.0 to 46.6 mpg. Similarly, for our response variable acceleration, values range from 8.0 to 24.8, with a mean of 15.54, which indicates that a typical vehicle in the dataset takes about 15.5 seconds to accelerate from 0 to 60 mph, with both faster- and slower-accelerating vehicles present in the data.

auto |>
  ggpairs()

ggpairs() function shows the pairwise relationship among the variables. From the above plot, it is clear that acceleration shows a moderate negative correlations with several predictors, including cylinders, displacement, horsepower, and weight. Among these, horsepower has the strongest negative association with acceleration (i.e −0.689). On the other hand, mpg shows a moderate positive linear relationship with acceleration, whereas year shows a little bit weaker positive relationship with acceleration i.e. (0.290)

2.

Start from the full model with all quantitative predictors. Do a manual backward selection using p-values to determine what to remove until all predictors are significant at the α= 0.05 level. Filter the tidy() output to show only candidates for removal, i.e. p-values >α. Remove the predictor with the largest p-value at each step. Compare the full (unfiltered) tidy() results and adjusted R2 at each step to the previous step.

#model instantiation
lm_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")

lm_spec
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm

Full model with all the quantitative predictors

Training our model with all the predictors

model <- lm_spec |>
  fit(acceleration ~ ., data = auto)
model.metrics <- glance(model)
tidy(model) |>
  arrange(p.value) |>
  mutate(remove = (p.value > 0.05))
## # A tibble: 7 × 6
##   term         estimate std.error statistic  p.value remove
##   <chr>           <dbl>     <dbl>     <dbl>    <dbl> <lgl> 
## 1 horsepower   -0.0857   0.00535    -16.0   1.14e-44 FALSE 
## 2 weight        0.00331  0.000337     9.83  1.76e-20 FALSE 
## 3 (Intercept)  20.9      2.16         9.67  6.24e-20 FALSE 
## 4 displacement -0.00815  0.00365     -2.23  2.63e- 2 FALSE 
## 5 year         -0.0563   0.0324      -1.74  8.29e- 2 TRUE  
## 6 cylinders    -0.155    0.166       -0.935 3.51e- 1 TRUE  
## 7 mpg           0.0212   0.0254       0.836 4.04e- 1 TRUE

tidy() function gives us the estimate of our model parameters along with their associated p.value. Among all the statistically insignificant variables (i.e variables with p.value > 0.05), mpg has the highest p.value. Hence, we are removing mpg from the consideration in our backward elimination process.

Remove mpg

model <- lm_spec |>
  fit(acceleration ~ ., data = dplyr::select(auto, -mpg))
model.metrics<- rbind(model.metrics, glance(model))
tidy(model) |> 
  arrange(p.value) |>
  mutate(remove = (p.value > 0.05))
## # A tibble: 6 × 6
##   term         estimate std.error statistic  p.value remove
##   <chr>           <dbl>     <dbl>     <dbl>    <dbl> <lgl> 
## 1 horsepower   -0.0859   0.00534    -16.1   6.81e-45 FALSE 
## 2 weight        0.00317  0.000293    10.8   5.13e-24 FALSE 
## 3 (Intercept)  20.6      2.13         9.65  6.79e-20 FALSE 
## 4 displacement -0.00800  0.00365     -2.19  2.89e- 2 FALSE 
## 5 year         -0.0404   0.0262      -1.54  1.24e- 1 TRUE  
## 6 cylinders    -0.162    0.165       -0.980 3.28e- 1 TRUE
model.metrics
## # A tibble: 2 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.620         0.614  1.71      105. 1.02e-77     6  -764. 1544. 1576.
## 2     0.619         0.614  1.71      125. 1.23e-78     5  -764. 1543. 1570.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Among all the statistically insignificant variables (i.e variables with p.value > 0.05), cylinders has the highest p.value. Hence, we are removing cylinders from the consideration in our backward elimination process.

Remove cylinders

model <- lm_spec |>
  fit(acceleration ~ ., data = dplyr::select(auto, -mpg, -cylinders))
model.metrics<- rbind(model.metrics, glance(model))
tidy(model) |> 
  arrange(p.value) |>
  mutate(remove = (p.value > 0.05))
## # A tibble: 5 × 6
##   term         estimate std.error statistic  p.value remove
##   <chr>           <dbl>     <dbl>     <dbl>    <dbl> <lgl> 
## 1 horsepower   -0.0854   0.00532     -16.1  7.91e-45 FALSE 
## 2 weight        0.00314  0.000291     10.8  7.19e-24 FALSE 
## 3 (Intercept)  20.2      2.09          9.64 7.39e-20 FALSE 
## 4 displacement -0.0104   0.00267      -3.90 1.12e- 4 FALSE 
## 5 year         -0.0401   0.0262       -1.53 1.26e- 1 TRUE
model.metrics
## # A tibble: 3 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.620         0.614  1.71      105. 1.02e-77     6  -764. 1544. 1576.
## 2     0.619         0.614  1.71      125. 1.23e-78     5  -764. 1543. 1570.
## 3     0.618         0.614  1.71      157. 1.48e-79     4  -765. 1542. 1565.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Among all the statistically insignificant variables (i.e variables with p.value > 0.05), year has the highest p.value. Hence, we are removing year from the consideration in our backward elimination process.

Remove year

model <- lm_spec |>
  fit(acceleration ~ ., data = dplyr::select(auto, -mpg, -cylinders, -year))
model.metrics<- rbind(model.metrics, glance(model))
tidy(model) |> 
  arrange(p.value) |>
  mutate(remove = (p.value > 0.05))
## # A tibble: 4 × 6
##   term         estimate std.error statistic   p.value remove
##   <chr>           <dbl>     <dbl>     <dbl>     <dbl> <lgl> 
## 1 (Intercept)  17.1      0.484        35.3  4.10e-123 FALSE 
## 2 horsepower   -0.0835   0.00519     -16.1  4.59e- 45 FALSE 
## 3 weight        0.00307  0.000288     10.7  2.05e- 23 FALSE 
## 4 displacement -0.0100   0.00266      -3.76 1.93e-  4 FALSE
model.metrics
## # A tibble: 4 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.620         0.614  1.71      105. 1.02e-77     6  -764. 1544. 1576.
## 2     0.619         0.614  1.71      125. 1.23e-78     5  -764. 1543. 1570.
## 3     0.618         0.614  1.71      157. 1.48e-79     4  -765. 1542. 1565.
## 4     0.616         0.613  1.72      207. 3.04e-80     3  -766. 1542. 1562.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Checking the p.value for all the remaining predictors, we can say that every predictor is significant because the p.value is greater than 0.05 for remaining predictors. Hence our final multiple linear model will be:

\[\widehat{acceleration} = 17.07 - 0.08 \times horsepower + 0.003 \times weight - 0.01 \times displacement \] Interpret the model:

The value of the intercept (\(beta_0\)) is 17.07, which represents the predicted value of acceleration for a hypothetical vehicle with horsepower, weight, and displacement all equal to 0. While such a vehicle is not realistic, the intercept serves as a baseline from which the effects of the predictors are measured.

The coefficient of horsepower (\(beta_1\)) is −0.08, which suggests that holding weight and displacement constant, if horsepower increases by 1 unit, the predicted value of acceleration decreases by 0.08 units. This indicates that vehicles with higher horsepower tend to accelerate faster (i.e., take less time to accelerate).

The coefficient of weight (\(beta_2\)) is 0.003, which suggests that holding horsepower and displacement constant, if weight (in pounds) increases by 1 unit, the predicted value of acceleration increases by 0.003 units. This implies that heavier vehicles tend to accelerate more slowly.

The coefficient of displacement (\(beta_3\)) is −0.01, which suggests that holding horsepower and weight constant, if displacement increases by 1 unit, the predicted value of acceleration decreases by 0.01 units.

model.metrics
## # A tibble: 4 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.620         0.614  1.71      105. 1.02e-77     6  -764. 1544. 1576.
## 2     0.619         0.614  1.71      125. 1.23e-78     5  -764. 1543. 1570.
## 3     0.618         0.614  1.71      157. 1.48e-79     4  -765. 1542. 1565.
## 4     0.616         0.613  1.72      207. 3.04e-80     3  -766. 1542. 1562.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

From the table, we can see that as the degree of freedom of each model decreases, the value of \(R^2\) is also decreasing each selection step. However, Adjusted \(R^2\) increases until the last step. This is the expected behaviour because when we remove predictors, the model becomes less flexibility, so it can explain less variation in the acceleration. Therefore, \(R^2\) goes down as predictors are removed. On the other hand, when you remove predictors that don’t add much explanatory power, adjusted \(R^2\) increases because it penalizes unnecessary predictors. We can see a minor decrease in the adjusted \(R^2\) values when the degree of freedom reduces from 4 to 3. However, the decrease is very minor which is insignificant in comparison to the reduced model complexity. That’s why, it is not much concerning in our case.

ggpairs(data = dplyr::select(auto,-c(mpg, cylinders, year)) |>
          na.omit())

After the completion of backward elimination process, we can see that the there is a moderate negative correlation between all the remaining predictors (i.e displacement, horsepower, weight) and our response variables acceleration, which means that if any of our predictor decreases the acceleration increases. However, the coefficient of weight is positive, which tells a different story, .i,e if weight increases the acceleration also increases. This instability in the coefficient is due to multicollinearity between weight, displacement and horsepower, we cannot increase the value of weight holding displacement and horsepower constant since they are highly correlated with each other.

displacement, horsepower, weight are all measuring closely related physical characteristics of the same vehicle, so they provide overlapping information in the regression model.

vif(model$fit)
## displacement   horsepower       weight 
##    10.310539     5.287295     7.957383

In this case displacement has the higher VIF(Variance Inflation Factor) value in comparison to horsepower and weight, so we remove it from our model.

model <- lm_spec |>
  fit(acceleration ~ ., data = dplyr::select(auto, -mpg, -cylinders, -year, -displacement))

model.metrics <- rbind(model.metrics, glance(model))
model.metrics
## # A tibble: 5 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.620         0.614  1.71      105. 1.02e-77     6  -764. 1544. 1576.
## 2     0.619         0.614  1.71      125. 1.23e-78     5  -764. 1543. 1570.
## 3     0.618         0.614  1.71      157. 1.48e-79     4  -765. 1542. 1565.
## 4     0.616         0.613  1.72      207. 3.04e-80     3  -766. 1542. 1562.
## 5     0.602         0.600  1.75      294. 1.62e-78     2  -773. 1554. 1570.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

We can also notice the decrease in adjusted \(R^2\) mean that elimination of displacement didn’t improve the predictive power of our model.

vif(model$fit)
## horsepower     weight 
##   3.959228   3.959228

Since both the predictors have VIF smaller than 5, I am stopping the elimination process here because our model is likely stable.

ggpairs(data = dplyr::select(auto,-c(mpg, cylinders, year, displacement)) |>
          na.omit())

tidy(model, conf.int = TRUE)
## # A tibble: 3 × 7
##   term        estimate std.error statistic   p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
## 1 (Intercept) 18.4      0.326         56.5 1.60e-189 17.8      19.1    
## 2 horsepower  -0.0933   0.00456      -20.5 1.21e- 63 -0.102    -0.0843 
## 3 weight       0.00230  0.000207      11.1 3.55e- 25  0.00190   0.00271

Our final model is: \[\widehat{acceleration} = 18.436 - 0.093 \times horsepower + 0.002 \times weight\]

We are 95% confident that for the vehicles having 0 weight and 0 horsepower year, the acceleration lies between (17.793887623, 19.077694704). (This scenario is not meaningful in practice).

Holding the weight constant, we are 95% confident that, on average, a one-unit increase in vehicle horsepower is associated with a decrease in acceleration by an amount between -0.102283508 and -0.084342022

Holding the horsepower constant, we are 95% confident that, on average, a one-unit increase in vehicle weight is associated with a increase in acceleration by an amount between 0.66005097 and 0.854585595.

Residuals check

aug = augment(model, new_data = auto)
aug
## # A tibble: 392 × 9
##    .pred  .resid   mpg cylinders displacement horsepower weight acceleration
##    <dbl>   <dbl> <dbl>     <dbl>        <dbl>      <dbl>  <dbl>        <dbl>
##  1 14.4  -2.37      18         8          307        130   3504         12  
##  2 11.5  -0.0398    15         8          350        165   3693         11.5
##  3 12.3  -1.35      18         8          318        150   3436         11  
##  4 12.3  -0.341     16         8          304        150   3433         12  
##  5 13.3  -2.81      17         8          302        140   3449         10.5
##  6  9.95  0.0479    15         8          429        198   4341         10  
##  7  7.93  1.07      14         8          454        220   4354          9  
##  8  8.30  0.201     14         8          440        215   4312          8.5
##  9  7.63  2.37      14         8          455        225   4425         10  
## 10  9.57 -1.07      15         8          390        190   3850          8.5
## # ℹ 382 more rows
## # ℹ 1 more variable: year <dbl>

Residuals vs fitted plot

ggplot(aug, aes(x = .pred, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  geom_hline(yintercept = -5, linetype = "dashed", color = "red") +
  geom_hline(yintercept = 5, linetype = "dashed", color = "red") +
  xlab("Fitted values") +
  ylab("Residuals")

From the Residuals vs fitted plot above, we can see that residuals are generally centered around zero, indicating no strong systematic bias. However, the increasing spread of residuals at higher fitted values suggests heteroscedasticity, and there is mild evidence of non - linearity because of a little curve like pattern formed by our data points, which implies that while the linear model provides a reasonable first approximation, it may not fully capture the relationship between predictors and acceleration.

Histogram

ggplot(data = aug, aes(x = .resid)) +
  geom_histogram(binwidth = 0.5) +
  xlab("Residuals")

We can see a slightly lonfer right tail, which might indicate a minor deviation from the normality but not much concerning.

Q-Q plot

aug |>
  gf_qq(~.resid) |>
  gf_qqline()

From the histogram and Q-Q plot, we can see that the majority of points in the middle of the plot closely follow the dashed line, which indicates the residual distribution is roughly normal with a slightly longer right tail.