Importing necessary libraries
library(tidymodels)
library(tidyverse)
library(ggplot2)
library(GGally)
library(ggformula)
library(ISLR)
library(dplyr)
library(car)
Use the Auto dataset (can be found in ISLR R package). Our objectives: - Predict acceleration using the remaining quantitative variables. - Which, if any, of the variables are significant predictors of acceleration. - How good is the resulting prediction?
Remove the name and origin columns from the dataset. Then - Produce a ggpairs plot. Write a short discussion (2–3 sentences) about what the plot suggests for multiple linear regression (collinearity, pairwise relationships, etc.).
auto <- Auto
auto <- auto |>
dplyr::select(-name,-origin)
glimpse(auto)
## Rows: 392
## Columns: 7
## $ mpg <dbl> 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14, 2…
## $ cylinders <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, 4, …
## $ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383, 34…
## $ horsepower <dbl> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170, 16…
## $ weight <dbl> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, 385…
## $ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8.5, …
## $ year <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 7…
summary(auto)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
## acceleration year
## Min. : 8.00 Min. :70.00
## 1st Qu.:13.78 1st Qu.:73.00
## Median :15.50 Median :76.00
## Mean :15.54 Mean :75.98
## 3rd Qu.:17.02 3rd Qu.:79.00
## Max. :24.80 Max. :82.00
As suggested by the question, we have 7 variables after removing
name and origin columns from the
Auto dataset. The summary() function gives us
the five number summary (min, Q1, median, Q3, max) for all the variables
in the dataset. For variable mpg, values range from 9.0 to
46.6 mpg. Similarly, for our response variable
acceleration, values range from 8.0 to 24.8, with a mean of
15.54, which indicates that a typical vehicle in the dataset takes about
15.5 seconds to accelerate from 0 to 60 mph, with both faster- and
slower-accelerating vehicles present in the data.
auto |>
ggpairs()
ggpairs() function shows the pairwise relationship among
the variables. From the above plot, it is clear that
acceleration shows a moderate negative correlations with
several predictors, including cylinders,
displacement, horsepower, and
weight. Among these, horsepower has the
strongest negative association with acceleration (i.e
−0.689). On the other hand, mpg shows a moderate positive
linear relationship with acceleration, whereas
year shows a little bit weaker positive relationship with
acceleration i.e. (0.290)
Start from the full model with all quantitative predictors. Do a
manual backward selection using p-values to determine what to remove
until all predictors are significant at the α= 0.05 level. Filter the
tidy() output to show only candidates for removal, i.e. p-values >α.
Remove the predictor with the largest p-value at each step. Compare the
full (unfiltered) tidy() results and adjusted R2 at each
step to the previous step.
#model instantiation
lm_spec <- linear_reg() %>%
set_mode("regression") %>%
set_engine("lm")
lm_spec
## Linear Regression Model Specification (regression)
##
## Computational engine: lm
Training our model with all the predictors
model <- lm_spec |>
fit(acceleration ~ ., data = auto)
model.metrics <- glance(model)
tidy(model) |>
arrange(p.value) |>
mutate(remove = (p.value > 0.05))
## # A tibble: 7 × 6
## term estimate std.error statistic p.value remove
## <chr> <dbl> <dbl> <dbl> <dbl> <lgl>
## 1 horsepower -0.0857 0.00535 -16.0 1.14e-44 FALSE
## 2 weight 0.00331 0.000337 9.83 1.76e-20 FALSE
## 3 (Intercept) 20.9 2.16 9.67 6.24e-20 FALSE
## 4 displacement -0.00815 0.00365 -2.23 2.63e- 2 FALSE
## 5 year -0.0563 0.0324 -1.74 8.29e- 2 TRUE
## 6 cylinders -0.155 0.166 -0.935 3.51e- 1 TRUE
## 7 mpg 0.0212 0.0254 0.836 4.04e- 1 TRUE
tidy() function gives us the estimate of our model
parameters along with their associated p.value. Among all the
statistically insignificant variables (i.e variables with p.value >
0.05), mpg has the highest p.value. Hence, we are removing
mpg from the consideration in our backward elimination
process.
mpgmodel <- lm_spec |>
fit(acceleration ~ ., data = dplyr::select(auto, -mpg))
model.metrics<- rbind(model.metrics, glance(model))
tidy(model) |>
arrange(p.value) |>
mutate(remove = (p.value > 0.05))
## # A tibble: 6 × 6
## term estimate std.error statistic p.value remove
## <chr> <dbl> <dbl> <dbl> <dbl> <lgl>
## 1 horsepower -0.0859 0.00534 -16.1 6.81e-45 FALSE
## 2 weight 0.00317 0.000293 10.8 5.13e-24 FALSE
## 3 (Intercept) 20.6 2.13 9.65 6.79e-20 FALSE
## 4 displacement -0.00800 0.00365 -2.19 2.89e- 2 FALSE
## 5 year -0.0404 0.0262 -1.54 1.24e- 1 TRUE
## 6 cylinders -0.162 0.165 -0.980 3.28e- 1 TRUE
model.metrics
## # A tibble: 2 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.620 0.614 1.71 105. 1.02e-77 6 -764. 1544. 1576.
## 2 0.619 0.614 1.71 125. 1.23e-78 5 -764. 1543. 1570.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Among all the statistically insignificant variables (i.e variables
with p.value > 0.05), cylinders has the highest p.value.
Hence, we are removing cylinders from the consideration in
our backward elimination process.
cylindersmodel <- lm_spec |>
fit(acceleration ~ ., data = dplyr::select(auto, -mpg, -cylinders))
model.metrics<- rbind(model.metrics, glance(model))
tidy(model) |>
arrange(p.value) |>
mutate(remove = (p.value > 0.05))
## # A tibble: 5 × 6
## term estimate std.error statistic p.value remove
## <chr> <dbl> <dbl> <dbl> <dbl> <lgl>
## 1 horsepower -0.0854 0.00532 -16.1 7.91e-45 FALSE
## 2 weight 0.00314 0.000291 10.8 7.19e-24 FALSE
## 3 (Intercept) 20.2 2.09 9.64 7.39e-20 FALSE
## 4 displacement -0.0104 0.00267 -3.90 1.12e- 4 FALSE
## 5 year -0.0401 0.0262 -1.53 1.26e- 1 TRUE
model.metrics
## # A tibble: 3 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.620 0.614 1.71 105. 1.02e-77 6 -764. 1544. 1576.
## 2 0.619 0.614 1.71 125. 1.23e-78 5 -764. 1543. 1570.
## 3 0.618 0.614 1.71 157. 1.48e-79 4 -765. 1542. 1565.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Among all the statistically insignificant variables (i.e variables
with p.value > 0.05), year has the highest p.value.
Hence, we are removing year from the consideration in our
backward elimination process.
yearmodel <- lm_spec |>
fit(acceleration ~ ., data = dplyr::select(auto, -mpg, -cylinders, -year))
model.metrics<- rbind(model.metrics, glance(model))
tidy(model) |>
arrange(p.value) |>
mutate(remove = (p.value > 0.05))
## # A tibble: 4 × 6
## term estimate std.error statistic p.value remove
## <chr> <dbl> <dbl> <dbl> <dbl> <lgl>
## 1 (Intercept) 17.1 0.484 35.3 4.10e-123 FALSE
## 2 horsepower -0.0835 0.00519 -16.1 4.59e- 45 FALSE
## 3 weight 0.00307 0.000288 10.7 2.05e- 23 FALSE
## 4 displacement -0.0100 0.00266 -3.76 1.93e- 4 FALSE
model.metrics
## # A tibble: 4 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.620 0.614 1.71 105. 1.02e-77 6 -764. 1544. 1576.
## 2 0.619 0.614 1.71 125. 1.23e-78 5 -764. 1543. 1570.
## 3 0.618 0.614 1.71 157. 1.48e-79 4 -765. 1542. 1565.
## 4 0.616 0.613 1.72 207. 3.04e-80 3 -766. 1542. 1562.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Checking the p.value for all the remaining predictors, we can say
that every predictor is significant because the p.value is
greater than 0.05 for remaining predictors. Hence our final multiple
linear model will be:
\[\widehat{acceleration} = 17.07 - 0.08 \times horsepower + 0.003 \times weight - 0.01 \times displacement \] Interpret the model:
The value of the intercept (\(beta_0\)) is 17.07, which represents the predicted value of acceleration for a hypothetical vehicle with horsepower, weight, and displacement all equal to 0. While such a vehicle is not realistic, the intercept serves as a baseline from which the effects of the predictors are measured.
The coefficient of horsepower (\(beta_1\)) is −0.08, which suggests that holding weight and displacement constant, if horsepower increases by 1 unit, the predicted value of acceleration decreases by 0.08 units. This indicates that vehicles with higher horsepower tend to accelerate faster (i.e., take less time to accelerate).
The coefficient of weight (\(beta_2\)) is 0.003, which suggests that holding horsepower and displacement constant, if weight (in pounds) increases by 1 unit, the predicted value of acceleration increases by 0.003 units. This implies that heavier vehicles tend to accelerate more slowly.
The coefficient of displacement (\(beta_3\)) is −0.01, which suggests that holding horsepower and weight constant, if displacement increases by 1 unit, the predicted value of acceleration decreases by 0.01 units.
model.metrics
## # A tibble: 4 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.620 0.614 1.71 105. 1.02e-77 6 -764. 1544. 1576.
## 2 0.619 0.614 1.71 125. 1.23e-78 5 -764. 1543. 1570.
## 3 0.618 0.614 1.71 157. 1.48e-79 4 -765. 1542. 1565.
## 4 0.616 0.613 1.72 207. 3.04e-80 3 -766. 1542. 1562.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
From the table, we can see that as the degree of freedom of each
model decreases, the value of \(R^2\)
is also decreasing each selection step. However, Adjusted \(R^2\) increases until the last step. This
is the expected behaviour because when we remove predictors, the model
becomes less flexibility, so it can explain less variation in the
acceleration. Therefore, \(R^2\) goes down as predictors are removed.
On the other hand, when you remove predictors that don’t add much
explanatory power, adjusted \(R^2\)
increases because it penalizes unnecessary predictors. We can see a
minor decrease in the adjusted \(R^2\)
values when the degree of freedom reduces from 4 to 3. However, the
decrease is very minor which is insignificant in comparison to the
reduced model complexity. That’s why, it is not much concerning in our
case.
ggpairs(data = dplyr::select(auto,-c(mpg, cylinders, year)) |>
na.omit())
After the completion of backward elimination process, we can see that
the there is a moderate negative correlation between all the remaining
predictors (i.e displacement, horsepower,
weight) and our response variables
acceleration, which means that if any of our predictor
decreases the acceleration increases. However, the
coefficient of weight is positive, which tells a different
story, .i,e if weight increases the
acceleration also increases. This instability in the
coefficient is due to multicollinearity between weight,
displacement and horsepower, we cannot
increase the value of weight holding
displacement and horsepower constant since
they are highly correlated with each other.
displacement, horsepower,
weight are all measuring closely related physical
characteristics of the same vehicle, so they provide overlapping
information in the regression model.
vif(model$fit)
## displacement horsepower weight
## 10.310539 5.287295 7.957383
In this case displacement has the higher VIF(Variance
Inflation Factor) value in comparison to horsepower and
weight, so we remove it from our model.
model <- lm_spec |>
fit(acceleration ~ ., data = dplyr::select(auto, -mpg, -cylinders, -year, -displacement))
model.metrics <- rbind(model.metrics, glance(model))
model.metrics
## # A tibble: 5 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.620 0.614 1.71 105. 1.02e-77 6 -764. 1544. 1576.
## 2 0.619 0.614 1.71 125. 1.23e-78 5 -764. 1543. 1570.
## 3 0.618 0.614 1.71 157. 1.48e-79 4 -765. 1542. 1565.
## 4 0.616 0.613 1.72 207. 3.04e-80 3 -766. 1542. 1562.
## 5 0.602 0.600 1.75 294. 1.62e-78 2 -773. 1554. 1570.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
We can also notice the decrease in adjusted \(R^2\) mean that elimination of displacement didn’t improve the predictive power of our model.
vif(model$fit)
## horsepower weight
## 3.959228 3.959228
Since both the predictors have VIF smaller than 5, I am stopping the elimination process here because our model is likely stable.
ggpairs(data = dplyr::select(auto,-c(mpg, cylinders, year, displacement)) |>
na.omit())
tidy(model, conf.int = TRUE)
## # A tibble: 3 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 18.4 0.326 56.5 1.60e-189 17.8 19.1
## 2 horsepower -0.0933 0.00456 -20.5 1.21e- 63 -0.102 -0.0843
## 3 weight 0.00230 0.000207 11.1 3.55e- 25 0.00190 0.00271
Our final model is: \[\widehat{acceleration} = 18.436 - 0.093 \times horsepower + 0.002 \times weight\]
We are 95% confident that for the vehicles having 0 weight and 0
horsepower year, the acceleration lies between
(17.793887623, 19.077694704). (This scenario is not meaningful in
practice).
Holding the weight constant, we are 95% confident that,
on average, a one-unit increase in vehicle horsepower is
associated with a decrease in acceleration by an amount between
-0.102283508 and -0.084342022
Holding the horsepower constant, we are 95% confident
that, on average, a one-unit increase in vehicle weight is
associated with a increase in acceleration by an amount between
0.66005097 and 0.854585595.
aug = augment(model, new_data = auto)
aug
## # A tibble: 392 × 9
## .pred .resid mpg cylinders displacement horsepower weight acceleration
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 14.4 -2.37 18 8 307 130 3504 12
## 2 11.5 -0.0398 15 8 350 165 3693 11.5
## 3 12.3 -1.35 18 8 318 150 3436 11
## 4 12.3 -0.341 16 8 304 150 3433 12
## 5 13.3 -2.81 17 8 302 140 3449 10.5
## 6 9.95 0.0479 15 8 429 198 4341 10
## 7 7.93 1.07 14 8 454 220 4354 9
## 8 8.30 0.201 14 8 440 215 4312 8.5
## 9 7.63 2.37 14 8 455 225 4425 10
## 10 9.57 -1.07 15 8 390 190 3850 8.5
## # ℹ 382 more rows
## # ℹ 1 more variable: year <dbl>
ggplot(aug, aes(x = .pred, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
geom_hline(yintercept = -5, linetype = "dashed", color = "red") +
geom_hline(yintercept = 5, linetype = "dashed", color = "red") +
xlab("Fitted values") +
ylab("Residuals")
From the Residuals vs fitted plot above, we can see that residuals are generally centered around zero, indicating no strong systematic bias. However, the increasing spread of residuals at higher fitted values suggests heteroscedasticity, and there is mild evidence of non - linearity because of a little curve like pattern formed by our data points, which implies that while the linear model provides a reasonable first approximation, it may not fully capture the relationship between predictors and acceleration.
ggplot(data = aug, aes(x = .resid)) +
geom_histogram(binwidth = 0.5) +
xlab("Residuals")
We can see a slightly lonfer right tail, which might indicate a minor deviation from the normality but not much concerning.
aug |>
gf_qq(~.resid) |>
gf_qqline()
From the histogram and Q-Q plot, we can see that the majority of points in the middle of the plot closely follow the dashed line, which indicates the residual distribution is roughly normal with a slightly longer right tail.