I use the built-in mtcars dataset (32 cars, 11
variables). The goal is to predict fuel efficiency (mpg)
from multiple predictors: weight (wt), horsepower
(hp), number of cylinders (cyl), and
displacement (disp).
data("mtcars")
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
pairs(mtcars[c("mpg","wt","hp","cyl","disp")], main="Scatterplot Matrix")
All four predictors show a negative relationship with
mpg. Weight and displacement are strongly correlated with
each other, which hints at potential multicollinearity.
fit.full <- lm(mpg ~ wt + hp + cyl + disp, data = mtcars)
summary(fit.full)
##
## Call:
## lm(formula = mpg ~ wt + hp + cyl + disp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0562 -1.4636 -0.4281 1.2854 5.8269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.82854 2.75747 14.807 1.76e-14 ***
## wt -3.85390 1.01547 -3.795 0.000759 ***
## hp -0.02054 0.01215 -1.691 0.102379
## cyl -1.29332 0.65588 -1.972 0.058947 .
## disp 0.01160 0.01173 0.989 0.331386
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.513 on 27 degrees of freedom
## Multiple R-squared: 0.8486, Adjusted R-squared: 0.8262
## F-statistic: 37.84 on 4 and 27 DF, p-value: 1.061e-10
Interpretation:
mpg (Adjusted R-squared = 0.8262).wt is the only clearly significant predictor (p <
0.001): holding other variables constant, each 1,000 lb increase in
weight reduces fuel efficiency by 3.85 mpg.cyl is marginally non-significant (p = 0.059) and
hp approaches significance (p = 0.102), but neither crosses
the 0.05 threshold – likely because all three overlap in what they
measure (engine size and power).disp contributes nothing beyond what the other three
already capture (p = 0.331).par(mfrow = c(2,2))
plot(fit.full)
Variable selection identifies the simplest model with the best predictive accuracy. Three main approaches are demonstrated below.
All-subsets evaluates every possible combination of predictors and
ranks them by a criterion. Here we use adjusted R-squared via the
leaps package.
library(leaps)
reg.sub <- regsubsets(mpg ~ wt + hp + cyl + disp, data = mtcars, nvmax = 4)
reg.sum <- summary(reg.sub)
reg.sum
## Subset selection object
## Call: regsubsets.formula(mpg ~ wt + hp + cyl + disp, data = mtcars,
## nvmax = 4)
## 4 Variables (and intercept)
## Forced in Forced out
## wt FALSE FALSE
## hp FALSE FALSE
## cyl FALSE FALSE
## disp FALSE FALSE
## 1 subsets of each size up to 4
## Selection Algorithm: exhaustive
## wt hp cyl disp
## 1 ( 1 ) "*" " " " " " "
## 2 ( 1 ) "*" " " "*" " "
## 3 ( 1 ) "*" "*" "*" " "
## 4 ( 1 ) "*" "*" "*" "*"
par(mfrow = c(1,2))
plot(reg.sum$adjr2, xlab = "Number of predictors", ylab = "Adjusted R-squared", type = "b")
points(which.max(reg.sum$adjr2), reg.sum$adjr2[which.max(reg.sum$adjr2)], col = "red", pch = 19)
plot(reg.sub, scale = "adjr2", main = "Best subsets by Adjusted R-squared")
cat("Best number of predictors:", which.max(reg.sum$adjr2), "\n")
## Best number of predictors: 3
cat("Variables included:\n")
## Variables included:
print(reg.sum$which[which.max(reg.sum$adjr2), ])
## (Intercept) wt hp cyl disp
## TRUE TRUE TRUE TRUE FALSE
The highest adjusted R-squared is achieved with 3
predictors. The best 3-variable model includes wt,
hp, and cyl – disp is excluded.
Adding it as a fourth predictor does not improve adjusted R-squared,
confirming it adds noise rather than signal.
Stepwise selection adds or removes one predictor at a time using AIC. Three directions are shown.
fit.step.back <- step(fit.full, direction = "backward")
## Start: AIC=63.53
## mpg ~ wt + hp + cyl + disp
##
## Df Sum of Sq RSS AIC
## - disp 1 6.176 176.62 62.665
## <none> 170.44 63.526
## - hp 1 18.048 188.49 64.746
## - cyl 1 24.546 194.99 65.831
## - wt 1 90.925 261.37 75.206
##
## Step: AIC=62.66
## mpg ~ wt + hp + cyl
##
## Df Sum of Sq RSS AIC
## <none> 176.62 62.665
## - hp 1 14.551 191.17 63.198
## - cyl 1 18.427 195.05 63.840
## - wt 1 115.354 291.98 76.750
summary(fit.step.back)
##
## Call:
## lm(formula = mpg ~ wt + hp + cyl, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9290 -1.5598 -0.5311 1.1850 5.8986
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.75179 1.78686 21.687 < 2e-16 ***
## wt -3.16697 0.74058 -4.276 0.000199 ***
## hp -0.01804 0.01188 -1.519 0.140015
## cyl -0.94162 0.55092 -1.709 0.098480 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.512 on 28 degrees of freedom
## Multiple R-squared: 0.8431, Adjusted R-squared: 0.8263
## F-statistic: 50.17 on 3 and 28 DF, p-value: 2.184e-11
Backward stepwise drops disp (AIC drops from 63.53 to
62.66) and stops, yielding mpg ~ wt + hp + cyl.
fit.null <- lm(mpg ~ 1, data = mtcars)
fit.step.fwd <- step(fit.null, scope = list(lower = fit.null, upper = fit.full), direction = "forward")
## Start: AIC=115.94
## mpg ~ 1
##
## Df Sum of Sq RSS AIC
## + wt 1 847.73 278.32 73.217
## + cyl 1 817.71 308.33 76.494
## + disp 1 808.89 317.16 77.397
## + hp 1 678.37 447.67 88.427
## <none> 1126.05 115.943
##
## Step: AIC=73.22
## mpg ~ wt
##
## Df Sum of Sq RSS AIC
## + cyl 1 87.150 191.17 63.198
## + hp 1 83.274 195.05 63.840
## + disp 1 31.639 246.68 71.356
## <none> 278.32 73.217
##
## Step: AIC=63.2
## mpg ~ wt + cyl
##
## Df Sum of Sq RSS AIC
## + hp 1 14.5514 176.62 62.665
## <none> 191.17 63.198
## + disp 1 2.6796 188.49 64.746
##
## Step: AIC=62.66
## mpg ~ wt + cyl + hp
##
## Df Sum of Sq RSS AIC
## <none> 176.62 62.665
## + disp 1 6.1762 170.44 63.526
summary(fit.step.fwd)
##
## Call:
## lm(formula = mpg ~ wt + cyl + hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9290 -1.5598 -0.5311 1.1850 5.8986
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.75179 1.78686 21.687 < 2e-16 ***
## wt -3.16697 0.74058 -4.276 0.000199 ***
## cyl -0.94162 0.55092 -1.709 0.098480 .
## hp -0.01804 0.01188 -1.519 0.140015
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.512 on 28 degrees of freedom
## Multiple R-squared: 0.8431, Adjusted R-squared: 0.8263
## F-statistic: 50.17 on 3 and 28 DF, p-value: 2.184e-11
Forward stepwise adds wt first (biggest AIC drop), then
cyl, then hp, and stops before adding
disp. Same final model.
fit.step.both <- step(fit.null, scope = list(lower = fit.null, upper = fit.full), direction = "both")
## Start: AIC=115.94
## mpg ~ 1
##
## Df Sum of Sq RSS AIC
## + wt 1 847.73 278.32 73.217
## + cyl 1 817.71 308.33 76.494
## + disp 1 808.89 317.16 77.397
## + hp 1 678.37 447.67 88.427
## <none> 1126.05 115.943
##
## Step: AIC=73.22
## mpg ~ wt
##
## Df Sum of Sq RSS AIC
## + cyl 1 87.15 191.17 63.198
## + hp 1 83.27 195.05 63.840
## + disp 1 31.64 246.68 71.356
## <none> 278.32 73.217
## - wt 1 847.73 1126.05 115.943
##
## Step: AIC=63.2
## mpg ~ wt + cyl
##
## Df Sum of Sq RSS AIC
## + hp 1 14.551 176.62 62.665
## <none> 191.17 63.198
## + disp 1 2.680 188.49 64.746
## - cyl 1 87.150 278.32 73.217
## - wt 1 117.162 308.33 76.494
##
## Step: AIC=62.66
## mpg ~ wt + cyl + hp
##
## Df Sum of Sq RSS AIC
## <none> 176.62 62.665
## - hp 1 14.551 191.17 63.198
## + disp 1 6.176 170.44 63.526
## - cyl 1 18.427 195.05 63.840
## - wt 1 115.354 291.98 76.750
summary(fit.step.both)
##
## Call:
## lm(formula = mpg ~ wt + cyl + hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9290 -1.5598 -0.5311 1.1850 5.8986
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.75179 1.78686 21.687 < 2e-16 ***
## wt -3.16697 0.74058 -4.276 0.000199 ***
## cyl -0.94162 0.55092 -1.709 0.098480 .
## hp -0.01804 0.01188 -1.519 0.140015
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.512 on 28 degrees of freedom
## Multiple R-squared: 0.8431, Adjusted R-squared: 0.8263
## F-statistic: 50.17 on 3 and 28 DF, p-value: 2.184e-11
The bidirectional search also converges on
mpg ~ wt + hp + cyl. All three stepwise variants agree,
which strengthens confidence in this model.
AIC comparison across stepwise models:
AIC(fit.full, fit.step.back, fit.step.fwd, fit.step.both)
## df AIC
## fit.full 6 156.3376
## fit.step.back 5 155.4766
## fit.step.fwd 5 155.4766
## fit.step.both 5 155.4766
The three stepwise models share the same AIC (155.48), all lower than
the full 4-predictor model (156.34). wt is the only
individually significant predictor (p < 0.001): each extra 1,000 lb
cuts mpg by 3.17. hp (p = 0.140) and
cyl (p = 0.098) stay in by AIC even without reaching 0.05 –
their combined presence still improves fit. Adjusted R-squared holds at
0.8263, identical to the full model, confirming disp
contributed nothing.
LASSO adds a penalty to the regression that shrinks small coefficients toward zero and can set them exactly to zero, effectively removing variables. The penalty strength is chosen by cross-validation.
library(glmnet)
x <- model.matrix(mpg ~ wt + hp + cyl + disp, data = mtcars)[, -1]
y <- mtcars$mpg
set.seed(1)
cv.lasso <- cv.glmnet(x, y, alpha = 1)
plot(cv.lasso)
cat("Optimal lambda (min CV error):", cv.lasso$lambda.min, "\n")
## Optimal lambda (min CV error): 0.1034148
coef(cv.lasso, s = "lambda.min")
## 5 x 1 sparse Matrix of class "dgCMatrix"
## lambda.min
## (Intercept) 38.39711409
## wt -3.10670322
## hp -0.01718842
## cyl -0.93577732
## disp .
cat("Lambda 1se (more regularized):", cv.lasso$lambda.1se, "\n")
## Lambda 1se (more regularized): 1.161684
coef(cv.lasso, s = "lambda.1se")
## 5 x 1 sparse Matrix of class "dgCMatrix"
## lambda.1se
## (Intercept) 34.755554740
## wt -2.499948724
## hp -0.008841051
## cyl -0.860621916
## disp .
At lambda.min, LASSO retains all four predictors but
shrinks disp close to zero. At the more conservative
lambda.1se, disp is shrunk to exactly zero,
leaving wt, hp, and cyl –
consistent with what all-subsets and stepwise methods also found. LASSO
confirms wt as the dominant predictor with the largest
retained coefficient in magnitude.
All three variable selection methods point to removing
disp. The backward stepwise model is used here to check
whether regression assumptions hold after variable reduction.
par(mfrow = c(2,2))
plot(fit.step.back)
No major violations of linearity, normality, or homoscedasticity. Toyota Corolla and Chrysler Imperial remain the highest leverage points but stay within Cook’s distance bounds.