In the previous post, we fitted a linear regression model using all the variables to predict the response variable “lead”. Although our model has a very large R squared of 0.96, many of the predictors are not significant. In this section, we select the top predictors without losing too much of the model’s predicability. To do this, we use the model selection method of “bestsubset”, “forward” and “backward” in the package “leaps”. We begin by loading the previous model into our console.
library(leaps)
library(sp)
data(meuse)
meuse <- na.omit(meuse)
model_bench <- lm(lead~.,data = meuse) ## notice: the "." indicates that we are using all other variables in the dataset except the response variable lead
summary(model_bench)
##
## Call:
## lm(formula = lead ~ ., data = meuse)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69.330 -10.755 0.244 9.086 64.421
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.832e+02 1.318e+03 0.594 0.553418
## x -2.291e-03 8.331e-03 -0.275 0.783809
## y -1.038e-03 7.031e-03 -0.148 0.882832
## cadmium -1.180e+01 2.059e+00 -5.731 7.26e-08 ***
## copper -3.956e-01 3.225e-01 -1.227 0.222309
## zinc 4.427e-01 1.721e-02 25.726 < 2e-16 ***
## elev -1.027e+00 3.626e+00 -0.283 0.777465
## dist -9.036e+01 6.289e+01 -1.437 0.153310
## om -1.855e+00 1.004e+00 -1.848 0.066968 .
## ffreq2 -1.352e+01 8.368e+00 -1.616 0.108670
## ffreq3 -2.359e+01 1.108e+01 -2.129 0.035238 *
## soil2 -4.029e+00 7.016e+00 -0.574 0.566804
## soil3 -2.485e+00 1.103e+01 -0.225 0.822070
## lime1 -2.352e+01 6.935e+00 -3.391 0.000937 ***
## landuseAb 8.548e+00 1.906e+01 0.448 0.654587
## landuseAg 2.180e+00 2.036e+01 0.107 0.914875
## landuseAh 7.162e+00 1.719e+01 0.417 0.677598
## landuseAm 7.322e+00 1.743e+01 0.420 0.675256
## landuseB -3.199e+00 2.218e+01 -0.144 0.885542
## landuseBw 1.183e+01 2.085e+01 0.567 0.571505
## landuseDEN -2.456e+01 3.001e+01 -0.819 0.414646
## landuseFh 2.169e+01 2.865e+01 0.757 0.450588
## landuseFw 1.577e+00 1.855e+01 0.085 0.932428
## landuseGa 6.300e+01 2.673e+01 2.357 0.020019 *
## landuseSPO -1.184e+01 2.948e+01 -0.402 0.688625
## landuseSTA 1.366e+01 2.415e+01 0.565 0.572810
## landuseTv 3.089e+01 2.877e+01 1.074 0.285076
## landuseW -8.839e-01 1.762e+01 -0.050 0.960071
## dist.m 9.603e-02 5.309e-02 1.809 0.072911 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.63 on 123 degrees of freedom
## Multiple R-squared: 0.9664, Adjusted R-squared: 0.9587
## F-statistic: 126.2 on 28 and 123 DF, p-value: < 2.2e-16
names(meuse)
## [1] "x" "y" "cadmium" "copper" "lead" "zinc" "elev"
## [8] "dist" "om" "ffreq" "soil" "lime" "landuse" "dist.m"
head(meuse)
## x y cadmium copper lead zinc elev dist om ffreq soil
## 1 181072 333611 11.7 85 299 1022 7.909 0.00135803 13.6 1 1
## 2 181025 333558 8.6 81 277 1141 6.983 0.01222430 14.0 1 1
## 3 181165 333537 6.5 68 199 640 7.800 0.10302900 13.0 1 1
## 4 181298 333484 2.6 81 116 257 7.655 0.19009400 8.0 1 2
## 5 181307 333330 2.8 48 117 269 7.480 0.27709000 8.7 1 2
## 6 181390 333260 3.0 61 137 281 7.791 0.36406700 7.8 1 2
## lime landuse dist.m
## 1 1 Ah 50
## 2 1 Ah 30
## 3 1 Ah 150
## 4 0 Ga 270
## 5 0 Ah 380
## 6 0 Ga 470
From inspecting the top 6 observations and their storage type, we find that column 11 - 14 are stored as factors. We convert the factors into numeric/dummy for later analysis
meuse$ffreq <- as.numeric(meuse$ffreq)
meuse$soil <- as.numeric(meuse$soil)
meuse$lime <- as.numeric(meuse$lime)
meuse$landuse <- as.numeric(meuse$landuse)
Best Subsets compares all possible models using a specified set of predictors, and displays the best-fitting models that contain one predictor, two predictors, and so on. We begin by building a complete model with all the predictors. Before using regsubset, let’s convert the categorical variables into dummy
variables <- meuse[ , -which(names(meuse) == "lead")]
outcome <- meuse$lead
out_1 <- regsubsets(x = variables, y = outcome, method = "exhaustive", nvmax = dim(variables)[2])
## We output the summary based on best subset
summary(out_1)
## Subset selection object
## 13 Variables (and intercept)
## Forced in Forced out
## x FALSE FALSE
## y FALSE FALSE
## cadmium FALSE FALSE
## copper FALSE FALSE
## zinc FALSE FALSE
## elev FALSE FALSE
## dist FALSE FALSE
## om FALSE FALSE
## ffreq FALSE FALSE
## soil FALSE FALSE
## lime FALSE FALSE
## landuse FALSE FALSE
## dist.m FALSE FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
## x y cadmium copper zinc elev dist om ffreq soil lime
## 1 ( 1 ) " " " " " " " " "*" " " " " " " " " " " " "
## 2 ( 1 ) " " " " "*" " " "*" " " " " " " " " " " " "
## 3 ( 1 ) " " " " "*" " " "*" " " " " " " " " " " "*"
## 4 ( 1 ) " " " " "*" " " "*" " " " " " " "*" " " "*"
## 5 ( 1 ) " " " " "*" " " "*" " " " " "*" "*" " " "*"
## 6 ( 1 ) " " "*" "*" " " "*" " " " " "*" "*" " " "*"
## 7 ( 1 ) " " " " "*" " " "*" " " " " "*" "*" "*" "*"
## 8 ( 1 ) " " " " "*" " " "*" " " "*" "*" "*" "*" "*"
## 9 ( 1 ) " " "*" "*" " " "*" " " "*" "*" "*" " " "*"
## 10 ( 1 ) " " "*" "*" " " "*" " " "*" "*" "*" "*" "*"
## 11 ( 1 ) " " "*" "*" " " "*" "*" "*" "*" "*" "*" "*"
## 12 ( 1 ) " " "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## landuse dist.m
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) " " " "
## 6 ( 1 ) " " " "
## 7 ( 1 ) " " "*"
## 8 ( 1 ) " " "*"
## 9 ( 1 ) "*" "*"
## 10 ( 1 ) "*" "*"
## 11 ( 1 ) "*" "*"
## 12 ( 1 ) "*" "*"
## 13 ( 1 ) "*" "*"
The summary plot shows which variables we want to select given the number of desired variables we want to select to be in our model. However, how do we know how many variables we want to have in our model? We need to establish some criteria for selecting.
Criteria:
## plotting BIC
plot(1:dim(variables)[2],summary(out_1)$bic)
lines(1:dim(variables)[2],summary(out_1)$bic)
## plotting cp
plot(1:dim(variables)[2],summary(out_1)$cp)
lines(1:dim(variables)[2],summary(out_1)$cp)
## plotting adjRsquared
plot(1:dim(variables)[2],summary(out_1)$adjr2)
lines(1:dim(variables)[2],summary(out_1)$adjr2)
## choosing the variables that lead to lowest BIC, Adjusted Rsquared and cp
which.min(summary(out_1)$bic)
## [1] 5
which.min(summary(out_1)$cp)
## [1] 5
which.max(summary(out_1)$adjr2)
## [1] 9
Based on bic and cp, we should select the model with 5 variablesm and according to adjusted Rsquared, we should select model with 9 variables. However, a closer inspection of the graph reveals that the adjusted Rsquared reaches plateau after var = 5. So we select top 5 variables and construct a new linear model.
var_list <- rep(NA,dim(variables)[2] )
for (i in 1: dim(variables)[2]) {
if (summary(out_1)$outmat[which.min(summary(out_1)$bic),i] == "*") {var_list[i] = TRUE}
else {var_list[i] = FALSE}
}
var_selec <- variables[,var_list]
Now we have selected the best 5 variables based on the best subset output, we build the model again
data_selec <- cbind(var_selec, meuse$lead)
colnames(data_selec)[6] <- "lead"
model_1 <- lm(lead~. ,data = data_selec)
summary(model_1)
##
## Call:
## lm(formula = lead ~ ., data = data_selec)
##
## Residuals:
## Min 1Q Median 3Q Max
## -77.514 -12.536 -0.082 13.660 59.689
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.5786 9.1269 6.637 5.85e-10 ***
## cadmium -13.0918 1.5459 -8.469 2.43e-14 ***
## zinc 0.4300 0.0128 33.606 < 2e-16 ***
## om -2.3889 0.8193 -2.916 0.004110 **
## ffreq -10.5112 2.9432 -3.571 0.000481 ***
## lime -25.0169 5.8700 -4.262 3.62e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.72 on 146 degrees of freedom
## Multiple R-squared: 0.9597, Adjusted R-squared: 0.9584
## F-statistic: 696.1 on 5 and 146 DF, p-value: < 2.2e-16
From the above summary, we can see that all of our variables are significant (we can see that from the * at the summary). Moreover, our Rsquared is 0.9597, not an extreme drop from the Rsquared in model1 (0.9664), but we have sigificantly reduced the number of variables.
Besides best subset, we can also use the forward and backward stepwise selection method in the leap package. We demonstrate the variables selected by using forward and backward stepwise selection below following similar procedures as best subset function
## Forward stepwise
out_2 <- regsubsets(x = variables, y = outcome, method = "forward", nvmax = dim(variables)[2])
summary(out_2)
## Subset selection object
## 13 Variables (and intercept)
## Forced in Forced out
## x FALSE FALSE
## y FALSE FALSE
## cadmium FALSE FALSE
## copper FALSE FALSE
## zinc FALSE FALSE
## elev FALSE FALSE
## dist FALSE FALSE
## om FALSE FALSE
## ffreq FALSE FALSE
## soil FALSE FALSE
## lime FALSE FALSE
## landuse FALSE FALSE
## dist.m FALSE FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: forward
## x y cadmium copper zinc elev dist om ffreq soil lime
## 1 ( 1 ) " " " " " " " " "*" " " " " " " " " " " " "
## 2 ( 1 ) " " " " "*" " " "*" " " " " " " " " " " " "
## 3 ( 1 ) " " " " "*" " " "*" " " " " " " " " " " "*"
## 4 ( 1 ) " " " " "*" " " "*" " " " " " " "*" " " "*"
## 5 ( 1 ) " " " " "*" " " "*" " " " " "*" "*" " " "*"
## 6 ( 1 ) " " "*" "*" " " "*" " " " " "*" "*" " " "*"
## 7 ( 1 ) " " "*" "*" " " "*" " " " " "*" "*" " " "*"
## 8 ( 1 ) " " "*" "*" " " "*" " " " " "*" "*" "*" "*"
## 9 ( 1 ) " " "*" "*" " " "*" " " " " "*" "*" "*" "*"
## 10 ( 1 ) " " "*" "*" " " "*" " " "*" "*" "*" "*" "*"
## 11 ( 1 ) " " "*" "*" " " "*" "*" "*" "*" "*" "*" "*"
## 12 ( 1 ) " " "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## landuse dist.m
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) " " " "
## 6 ( 1 ) " " " "
## 7 ( 1 ) "*" " "
## 8 ( 1 ) "*" " "
## 9 ( 1 ) "*" "*"
## 10 ( 1 ) "*" "*"
## 11 ( 1 ) "*" "*"
## 12 ( 1 ) "*" "*"
## 13 ( 1 ) "*" "*"
## plotting BIC
plot(1:dim(variables)[2],summary(out_2)$bic)
lines(1:dim(variables)[2],summary(out_2)$bic)
## plotting cp
plot(1:dim(variables)[2],summary(out_2)$cp)
lines(1:dim(variables)[2],summary(out_2)$cp)
## plotting adjRsquared
plot(1:dim(variables)[2],summary(out_2)$adjr2)
lines(1:dim(variables)[2],summary(out_2)$adjr2)
## choosing the variables that lead to lowest BIC, Adjusted Rsquared and cp
which.min(summary(out_2)$bic)
## [1] 5
which.min(summary(out_2)$cp)
## [1] 5
which.max(summary(out_2)$adjr2)
## [1] 10
out_3 <- regsubsets(x = variables, y = outcome, method = "backward", nvmax = dim(variables)[2])
summary(out_3)
## Subset selection object
## 13 Variables (and intercept)
## Forced in Forced out
## x FALSE FALSE
## y FALSE FALSE
## cadmium FALSE FALSE
## copper FALSE FALSE
## zinc FALSE FALSE
## elev FALSE FALSE
## dist FALSE FALSE
## om FALSE FALSE
## ffreq FALSE FALSE
## soil FALSE FALSE
## lime FALSE FALSE
## landuse FALSE FALSE
## dist.m FALSE FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: backward
## x y cadmium copper zinc elev dist om ffreq soil lime
## 1 ( 1 ) " " " " " " " " "*" " " " " " " " " " " " "
## 2 ( 1 ) " " " " "*" " " "*" " " " " " " " " " " " "
## 3 ( 1 ) " " " " "*" " " "*" " " " " " " " " " " "*"
## 4 ( 1 ) " " " " "*" " " "*" " " " " " " "*" " " "*"
## 5 ( 1 ) " " " " "*" " " "*" " " " " "*" "*" " " "*"
## 6 ( 1 ) " " " " "*" " " "*" " " " " "*" "*" " " "*"
## 7 ( 1 ) " " " " "*" " " "*" " " "*" "*" "*" " " "*"
## 8 ( 1 ) " " " " "*" " " "*" " " "*" "*" "*" " " "*"
## 9 ( 1 ) " " "*" "*" " " "*" " " "*" "*" "*" " " "*"
## 10 ( 1 ) " " "*" "*" " " "*" " " "*" "*" "*" "*" "*"
## 11 ( 1 ) " " "*" "*" " " "*" "*" "*" "*" "*" "*" "*"
## 12 ( 1 ) " " "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## landuse dist.m
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) " " " "
## 6 ( 1 ) " " "*"
## 7 ( 1 ) " " "*"
## 8 ( 1 ) "*" "*"
## 9 ( 1 ) "*" "*"
## 10 ( 1 ) "*" "*"
## 11 ( 1 ) "*" "*"
## 12 ( 1 ) "*" "*"
## 13 ( 1 ) "*" "*"
## plotting BIC
plot(1:dim(variables)[2],summary(out_3)$bic)
lines(1:dim(variables)[2],summary(out_3)$bic)
## plotting cp
plot(1:dim(variables)[2],summary(out_3)$cp)
lines(1:dim(variables)[2],summary(out_3)$cp)
## plotting adjRsquared
plot(1:dim(variables)[2],summary(out_3)$adjr2)
lines(1:dim(variables)[2],summary(out_3)$adjr2)
## choosing the variables that lead to lowest BIC, Adjusted Rsquared and cp
which.min(summary(out_3)$bic)
## [1] 5
which.min(summary(out_3)$cp)
## [1] 5
which.max(summary(out_3)$adjr2)
## [1] 9
Similarly, forward and backward stepwise selection also selects 5 variable (Ajusted Rsquared suggests 10 but we reject it and choose 5 because of similar reason as best subset) However, do we choose the same variables based on these methods?
var_list <- rep(NA,dim(variables)[2] )
for (i in 1: dim(variables)[2]) {
if (summary(out_2)$outmat[which.min(summary(out_2)$bic),i] == "*") {var_list[i] = TRUE}
else {var_list[i] = FALSE}
}
var_selec_2 <- variables[,var_list]
var_list <- rep(NA,dim(variables)[2] )
for (i in 1: dim(variables)[2]) {
if (summary(out_3)$outmat[which.min(summary(out_3)$bic),i] == "*") {var_list[i] = TRUE}
else {var_list[i] = FALSE}
}
var_selec_3 <- variables[,var_list]
colnames(var_selec)
## [1] "cadmium" "zinc" "om" "ffreq" "lime"
colnames(var_selec_2)
## [1] "cadmium" "zinc" "om" "ffreq" "lime"
colnames(var_selec_3)
## [1] "cadmium" "zinc" "om" "ffreq" "lime"
We found that the best 5 are all the same across the three different variable selection methods. In next section, we investigate whether we need to conduct any more variable transformations.