Variable Selection in R

In the previous post, we fitted a linear regression model using all the variables to predict the response variable “lead”. Although our model has a very large R squared of 0.96, many of the predictors are not significant. In this section, we select the top predictors without losing too much of the model’s predicability. To do this, we use the model selection method of “bestsubset”, “forward” and “backward” in the package “leaps”. We begin by loading the previous model into our console.

library(leaps)
library(sp)
data(meuse)
meuse <- na.omit(meuse)
model_bench <- lm(lead~.,data = meuse) ## notice: the "." indicates that we are using all other variables in the dataset except the response variable lead
summary(model_bench)

## 
## Call:
## lm(formula = lead ~ ., data = meuse)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -69.330 -10.755   0.244   9.086  64.421 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.832e+02  1.318e+03   0.594 0.553418    
## x           -2.291e-03  8.331e-03  -0.275 0.783809    
## y           -1.038e-03  7.031e-03  -0.148 0.882832    
## cadmium     -1.180e+01  2.059e+00  -5.731 7.26e-08 ***
## copper      -3.956e-01  3.225e-01  -1.227 0.222309    
## zinc         4.427e-01  1.721e-02  25.726  < 2e-16 ***
## elev        -1.027e+00  3.626e+00  -0.283 0.777465    
## dist        -9.036e+01  6.289e+01  -1.437 0.153310    
## om          -1.855e+00  1.004e+00  -1.848 0.066968 .  
## ffreq2      -1.352e+01  8.368e+00  -1.616 0.108670    
## ffreq3      -2.359e+01  1.108e+01  -2.129 0.035238 *  
## soil2       -4.029e+00  7.016e+00  -0.574 0.566804    
## soil3       -2.485e+00  1.103e+01  -0.225 0.822070    
## lime1       -2.352e+01  6.935e+00  -3.391 0.000937 ***
## landuseAb    8.548e+00  1.906e+01   0.448 0.654587    
## landuseAg    2.180e+00  2.036e+01   0.107 0.914875    
## landuseAh    7.162e+00  1.719e+01   0.417 0.677598    
## landuseAm    7.322e+00  1.743e+01   0.420 0.675256    
## landuseB    -3.199e+00  2.218e+01  -0.144 0.885542    
## landuseBw    1.183e+01  2.085e+01   0.567 0.571505    
## landuseDEN  -2.456e+01  3.001e+01  -0.819 0.414646    
## landuseFh    2.169e+01  2.865e+01   0.757 0.450588    
## landuseFw    1.577e+00  1.855e+01   0.085 0.932428    
## landuseGa    6.300e+01  2.673e+01   2.357 0.020019 *  
## landuseSPO  -1.184e+01  2.948e+01  -0.402 0.688625    
## landuseSTA   1.366e+01  2.415e+01   0.565 0.572810    
## landuseTv    3.089e+01  2.877e+01   1.074 0.285076    
## landuseW    -8.839e-01  1.762e+01  -0.050 0.960071    
## dist.m       9.603e-02  5.309e-02   1.809 0.072911 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.63 on 123 degrees of freedom
## Multiple R-squared:  0.9664, Adjusted R-squared:  0.9587 
## F-statistic: 126.2 on 28 and 123 DF,  p-value: < 2.2e-16

names(meuse)

##  [1] "x"       "y"       "cadmium" "copper"  "lead"    "zinc"    "elev"   
##  [8] "dist"    "om"      "ffreq"   "soil"    "lime"    "landuse" "dist.m"

head(meuse)

##        x      y cadmium copper lead zinc  elev       dist   om ffreq soil
## 1 181072 333611    11.7     85  299 1022 7.909 0.00135803 13.6     1    1
## 2 181025 333558     8.6     81  277 1141 6.983 0.01222430 14.0     1    1
## 3 181165 333537     6.5     68  199  640 7.800 0.10302900 13.0     1    1
## 4 181298 333484     2.6     81  116  257 7.655 0.19009400  8.0     1    2
## 5 181307 333330     2.8     48  117  269 7.480 0.27709000  8.7     1    2
## 6 181390 333260     3.0     61  137  281 7.791 0.36406700  7.8     1    2
##   lime landuse dist.m
## 1    1      Ah     50
## 2    1      Ah     30
## 3    1      Ah    150
## 4    0      Ga    270
## 5    0      Ah    380
## 6    0      Ga    470

From inspecting the top 6 observations and their storage type, we find that column 11 - 14 are stored as factors. We convert the factors into numeric/dummy for later analysis

meuse$ffreq <- as.numeric(meuse$ffreq)
meuse$soil <- as.numeric(meuse$soil)
meuse$lime <- as.numeric(meuse$lime)
meuse$landuse <- as.numeric(meuse$landuse)

Part One: Best Subset Selection

Best Subsets compares all possible models using a specified set of predictors, and displays the best-fitting models that contain one predictor, two predictors, and so on. We begin by building a complete model with all the predictors. Before using regsubset, let’s convert the categorical variables into dummy

variables <- meuse[ , -which(names(meuse) == "lead")]
outcome <- meuse$lead
out_1 <- regsubsets(x = variables, y = outcome, method = "exhaustive", nvmax = dim(variables)[2])

## We output the summary based on best subset
summary(out_1)

## Subset selection object
## 13 Variables  (and intercept)
##         Forced in Forced out
## x           FALSE      FALSE
## y           FALSE      FALSE
## cadmium     FALSE      FALSE
## copper      FALSE      FALSE
## zinc        FALSE      FALSE
## elev        FALSE      FALSE
## dist        FALSE      FALSE
## om          FALSE      FALSE
## ffreq       FALSE      FALSE
## soil        FALSE      FALSE
## lime        FALSE      FALSE
## landuse     FALSE      FALSE
## dist.m      FALSE      FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
##           x   y   cadmium copper zinc elev dist om  ffreq soil lime
## 1  ( 1 )  " " " " " "     " "    "*"  " "  " "  " " " "   " "  " " 
## 2  ( 1 )  " " " " "*"     " "    "*"  " "  " "  " " " "   " "  " " 
## 3  ( 1 )  " " " " "*"     " "    "*"  " "  " "  " " " "   " "  "*" 
## 4  ( 1 )  " " " " "*"     " "    "*"  " "  " "  " " "*"   " "  "*" 
## 5  ( 1 )  " " " " "*"     " "    "*"  " "  " "  "*" "*"   " "  "*" 
## 6  ( 1 )  " " "*" "*"     " "    "*"  " "  " "  "*" "*"   " "  "*" 
## 7  ( 1 )  " " " " "*"     " "    "*"  " "  " "  "*" "*"   "*"  "*" 
## 8  ( 1 )  " " " " "*"     " "    "*"  " "  "*"  "*" "*"   "*"  "*" 
## 9  ( 1 )  " " "*" "*"     " "    "*"  " "  "*"  "*" "*"   " "  "*" 
## 10  ( 1 ) " " "*" "*"     " "    "*"  " "  "*"  "*" "*"   "*"  "*" 
## 11  ( 1 ) " " "*" "*"     " "    "*"  "*"  "*"  "*" "*"   "*"  "*" 
## 12  ( 1 ) " " "*" "*"     "*"    "*"  "*"  "*"  "*" "*"   "*"  "*" 
## 13  ( 1 ) "*" "*" "*"     "*"    "*"  "*"  "*"  "*" "*"   "*"  "*" 
##           landuse dist.m
## 1  ( 1 )  " "     " "   
## 2  ( 1 )  " "     " "   
## 3  ( 1 )  " "     " "   
## 4  ( 1 )  " "     " "   
## 5  ( 1 )  " "     " "   
## 6  ( 1 )  " "     " "   
## 7  ( 1 )  " "     "*"   
## 8  ( 1 )  " "     "*"   
## 9  ( 1 )  "*"     "*"   
## 10  ( 1 ) "*"     "*"   
## 11  ( 1 ) "*"     "*"   
## 12  ( 1 ) "*"     "*"   
## 13  ( 1 ) "*"     "*"

The summary plot shows which variables we want to select given the number of desired variables we want to select to be in our model. However, how do we know how many variables we want to have in our model? We need to establish some criteria for selecting.

Criteria:

BIC: a criterion for model selection among a finite set of models; the model with the lowest BIC is preferred
cp: Cp is a ratio of the specification spread to the process spread.
AdjRsquared: a modified version of R-squared that has been adjusted for the number of predictors in the model. The adjusted R-squared increases only if the new term improves the model more than would be expected by chance.

## plotting BIC
plot(1:dim(variables)[2],summary(out_1)$bic)
lines(1:dim(variables)[2],summary(out_1)$bic)

## plotting cp
plot(1:dim(variables)[2],summary(out_1)$cp)
lines(1:dim(variables)[2],summary(out_1)$cp)

## plotting adjRsquared
plot(1:dim(variables)[2],summary(out_1)$adjr2)
lines(1:dim(variables)[2],summary(out_1)$adjr2)

## choosing the variables that lead to lowest BIC, Adjusted Rsquared and cp
which.min(summary(out_1)$bic)

## [1] 5

which.min(summary(out_1)$cp)

## [1] 5

which.max(summary(out_1)$adjr2)

## [1] 9

Based on bic and cp, we should select the model with 5 variablesm and according to adjusted Rsquared, we should select model with 9 variables. However, a closer inspection of the graph reveals that the adjusted Rsquared reaches plateau after var = 5. So we select top 5 variables and construct a new linear model.

var_list <- rep(NA,dim(variables)[2] )
for (i in 1: dim(variables)[2]) {
  if (summary(out_1)$outmat[which.min(summary(out_1)$bic),i] == "*") {var_list[i] = TRUE}
  else {var_list[i] = FALSE}
}
var_selec <- variables[,var_list]

Now we have selected the best 5 variables based on the best subset output, we build the model again

data_selec <- cbind(var_selec, meuse$lead)
colnames(data_selec)[6] <- "lead"
model_1 <- lm(lead~. ,data = data_selec) 
summary(model_1)

## 
## Call:
## lm(formula = lead ~ ., data = data_selec)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -77.514 -12.536  -0.082  13.660  59.689 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  60.5786     9.1269   6.637 5.85e-10 ***
## cadmium     -13.0918     1.5459  -8.469 2.43e-14 ***
## zinc          0.4300     0.0128  33.606  < 2e-16 ***
## om           -2.3889     0.8193  -2.916 0.004110 ** 
## ffreq       -10.5112     2.9432  -3.571 0.000481 ***
## lime        -25.0169     5.8700  -4.262 3.62e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.72 on 146 degrees of freedom
## Multiple R-squared:  0.9597, Adjusted R-squared:  0.9584 
## F-statistic: 696.1 on 5 and 146 DF,  p-value: < 2.2e-16

From the above summary, we can see that all of our variables are significant (we can see that from the * at the summary). Moreover, our Rsquared is 0.9597, not an extreme drop from the Rsquared in model1 (0.9664), but we have sigificantly reduced the number of variables.

Part TWO: Forward and backward stepwise:

Besides best subset, we can also use the forward and backward stepwise selection method in the leap package. We demonstrate the variables selected by using forward and backward stepwise selection below following similar procedures as best subset function

## Forward stepwise
out_2 <- regsubsets(x = variables, y = outcome, method = "forward", nvmax = dim(variables)[2])
summary(out_2)

## Subset selection object
## 13 Variables  (and intercept)
##         Forced in Forced out
## x           FALSE      FALSE
## y           FALSE      FALSE
## cadmium     FALSE      FALSE
## copper      FALSE      FALSE
## zinc        FALSE      FALSE
## elev        FALSE      FALSE
## dist        FALSE      FALSE
## om          FALSE      FALSE
## ffreq       FALSE      FALSE
## soil        FALSE      FALSE
## lime        FALSE      FALSE
## landuse     FALSE      FALSE
## dist.m      FALSE      FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: forward
##           x   y   cadmium copper zinc elev dist om  ffreq soil lime
## 1  ( 1 )  " " " " " "     " "    "*"  " "  " "  " " " "   " "  " " 
## 2  ( 1 )  " " " " "*"     " "    "*"  " "  " "  " " " "   " "  " " 
## 3  ( 1 )  " " " " "*"     " "    "*"  " "  " "  " " " "   " "  "*" 
## 4  ( 1 )  " " " " "*"     " "    "*"  " "  " "  " " "*"   " "  "*" 
## 5  ( 1 )  " " " " "*"     " "    "*"  " "  " "  "*" "*"   " "  "*" 
## 6  ( 1 )  " " "*" "*"     " "    "*"  " "  " "  "*" "*"   " "  "*" 
## 7  ( 1 )  " " "*" "*"     " "    "*"  " "  " "  "*" "*"   " "  "*" 
## 8  ( 1 )  " " "*" "*"     " "    "*"  " "  " "  "*" "*"   "*"  "*" 
## 9  ( 1 )  " " "*" "*"     " "    "*"  " "  " "  "*" "*"   "*"  "*" 
## 10  ( 1 ) " " "*" "*"     " "    "*"  " "  "*"  "*" "*"   "*"  "*" 
## 11  ( 1 ) " " "*" "*"     " "    "*"  "*"  "*"  "*" "*"   "*"  "*" 
## 12  ( 1 ) " " "*" "*"     "*"    "*"  "*"  "*"  "*" "*"   "*"  "*" 
## 13  ( 1 ) "*" "*" "*"     "*"    "*"  "*"  "*"  "*" "*"   "*"  "*" 
##           landuse dist.m
## 1  ( 1 )  " "     " "   
## 2  ( 1 )  " "     " "   
## 3  ( 1 )  " "     " "   
## 4  ( 1 )  " "     " "   
## 5  ( 1 )  " "     " "   
## 6  ( 1 )  " "     " "   
## 7  ( 1 )  "*"     " "   
## 8  ( 1 )  "*"     " "   
## 9  ( 1 )  "*"     "*"   
## 10  ( 1 ) "*"     "*"   
## 11  ( 1 ) "*"     "*"   
## 12  ( 1 ) "*"     "*"   
## 13  ( 1 ) "*"     "*"

## plotting BIC
plot(1:dim(variables)[2],summary(out_2)$bic)
lines(1:dim(variables)[2],summary(out_2)$bic)

## plotting cp
plot(1:dim(variables)[2],summary(out_2)$cp)
lines(1:dim(variables)[2],summary(out_2)$cp)

## plotting adjRsquared
plot(1:dim(variables)[2],summary(out_2)$adjr2)
lines(1:dim(variables)[2],summary(out_2)$adjr2)

## choosing the variables that lead to lowest BIC, Adjusted Rsquared and cp
which.min(summary(out_2)$bic)

## [1] 5

which.min(summary(out_2)$cp)

## [1] 5

which.max(summary(out_2)$adjr2)

## [1] 10

out_3 <- regsubsets(x = variables, y = outcome, method = "backward", nvmax = dim(variables)[2])
summary(out_3)

## Subset selection object
## 13 Variables  (and intercept)
##         Forced in Forced out
## x           FALSE      FALSE
## y           FALSE      FALSE
## cadmium     FALSE      FALSE
## copper      FALSE      FALSE
## zinc        FALSE      FALSE
## elev        FALSE      FALSE
## dist        FALSE      FALSE
## om          FALSE      FALSE
## ffreq       FALSE      FALSE
## soil        FALSE      FALSE
## lime        FALSE      FALSE
## landuse     FALSE      FALSE
## dist.m      FALSE      FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: backward
##           x   y   cadmium copper zinc elev dist om  ffreq soil lime
## 1  ( 1 )  " " " " " "     " "    "*"  " "  " "  " " " "   " "  " " 
## 2  ( 1 )  " " " " "*"     " "    "*"  " "  " "  " " " "   " "  " " 
## 3  ( 1 )  " " " " "*"     " "    "*"  " "  " "  " " " "   " "  "*" 
## 4  ( 1 )  " " " " "*"     " "    "*"  " "  " "  " " "*"   " "  "*" 
## 5  ( 1 )  " " " " "*"     " "    "*"  " "  " "  "*" "*"   " "  "*" 
## 6  ( 1 )  " " " " "*"     " "    "*"  " "  " "  "*" "*"   " "  "*" 
## 7  ( 1 )  " " " " "*"     " "    "*"  " "  "*"  "*" "*"   " "  "*" 
## 8  ( 1 )  " " " " "*"     " "    "*"  " "  "*"  "*" "*"   " "  "*" 
## 9  ( 1 )  " " "*" "*"     " "    "*"  " "  "*"  "*" "*"   " "  "*" 
## 10  ( 1 ) " " "*" "*"     " "    "*"  " "  "*"  "*" "*"   "*"  "*" 
## 11  ( 1 ) " " "*" "*"     " "    "*"  "*"  "*"  "*" "*"   "*"  "*" 
## 12  ( 1 ) " " "*" "*"     "*"    "*"  "*"  "*"  "*" "*"   "*"  "*" 
## 13  ( 1 ) "*" "*" "*"     "*"    "*"  "*"  "*"  "*" "*"   "*"  "*" 
##           landuse dist.m
## 1  ( 1 )  " "     " "   
## 2  ( 1 )  " "     " "   
## 3  ( 1 )  " "     " "   
## 4  ( 1 )  " "     " "   
## 5  ( 1 )  " "     " "   
## 6  ( 1 )  " "     "*"   
## 7  ( 1 )  " "     "*"   
## 8  ( 1 )  "*"     "*"   
## 9  ( 1 )  "*"     "*"   
## 10  ( 1 ) "*"     "*"   
## 11  ( 1 ) "*"     "*"   
## 12  ( 1 ) "*"     "*"   
## 13  ( 1 ) "*"     "*"

## plotting BIC
plot(1:dim(variables)[2],summary(out_3)$bic)
lines(1:dim(variables)[2],summary(out_3)$bic)

## plotting cp
plot(1:dim(variables)[2],summary(out_3)$cp)
lines(1:dim(variables)[2],summary(out_3)$cp)

## plotting adjRsquared
plot(1:dim(variables)[2],summary(out_3)$adjr2)
lines(1:dim(variables)[2],summary(out_3)$adjr2)

## choosing the variables that lead to lowest BIC, Adjusted Rsquared and cp
which.min(summary(out_3)$bic)

## [1] 5

which.min(summary(out_3)$cp)

## [1] 5

which.max(summary(out_3)$adjr2)

## [1] 9

Similarly, forward and backward stepwise selection also selects 5 variable (Ajusted Rsquared suggests 10 but we reject it and choose 5 because of similar reason as best subset) However, do we choose the same variables based on these methods?

var_list <- rep(NA,dim(variables)[2] )
for (i in 1: dim(variables)[2]) {
  if (summary(out_2)$outmat[which.min(summary(out_2)$bic),i] == "*") {var_list[i] = TRUE}
  else {var_list[i] = FALSE}
}
var_selec_2 <- variables[,var_list]

var_list <- rep(NA,dim(variables)[2] )
for (i in 1: dim(variables)[2]) {
  if (summary(out_3)$outmat[which.min(summary(out_3)$bic),i] == "*") {var_list[i] = TRUE}
  else {var_list[i] = FALSE}
}
var_selec_3 <- variables[,var_list]

colnames(var_selec)

## [1] "cadmium" "zinc"    "om"      "ffreq"   "lime"

colnames(var_selec_2)

## [1] "cadmium" "zinc"    "om"      "ffreq"   "lime"

colnames(var_selec_3)

## [1] "cadmium" "zinc"    "om"      "ffreq"   "lime"

We found that the best 5 are all the same across the three different variable selection methods. In next section, we investigate whether we need to conduct any more variable transformations.

Variable Selection in R

Huizi Yu

10/26/2019

Part One: Best Subset Selection

Part TWO: Forward and backward stepwise: