Intro to Machine Learning

This is intended to be one of a series of bite-size documents to help you understand what machine learning is and how to do it in R, without knowing a lot of math or computation. The goal is to get you doing machine learning right away, with enough knowledge to get you interested in doing more.

Machine learning has many applications and meanings, but for our purposes, we’ll define it this way: “Machine learning consists of taking some data with a known result, and coming up with good rules about how the result is related to the rest of the data, so that we can apply those rules and make predictions about data for which we don’t know the result.”

This is super simple, obviously, and leaves out lots of important concepts, like signal vs. noise, overfitting, etc. But it’s enough to get started.

It’s also important to note that we’re starting here with a classification question with two possible outcomes (a binomial, or binary, outcome): hazardous or non-hazardous conditions. There are other kinds of questions, like multi-class classification (more than two bins) and continuous outcomes (like predicting salary or IQ). Also, we’re starting with a fairly tough classification problem: the classes are unbalanced, with hazardous conditions much rarer than non-hazardous ones.

Feature Selection

If you’ve already tried the mini-tutorials for Trees, Linear Models, or Random Forests, you may have been surprised at how poor the predictions were. After all, we have almost 2600 measurements of 19 variables – surely we can find some pattern that indicates possible seismic danger!

In our Linear Models document, we realized that our data had some problems that kept the model from working at all, and we removed a few variables. In this document, we’re going to take a closer look at the data and do some feature selection – removing unnecessary variables that just make the model more complex (and increase variance, but that’s too much statistics for now).

There are lots of places to get data – from your own research, or online. For this example, I’m using some data from http://archive.ics.uci.edu/ml/datasets/seismic-bumps. It’s intended to help predict when dangerous seismic events might happen, in order to keep miners safe.

First I’ll get the data. It’s in arff format, so I’ll use the “foreign” package to get it.

library(foreign)
seismic_data<-read.arff("http://archive.ics.uci.edu/ml/machine-learning-databases/00266/seismic-bumps.arff")

Next, let me take a look at the attribute information, in order to understand what the result or outcome is, and what my predictors are. The list below comes from the original arff file and can also be found on the website mentioned above.

Attribute information:

  1. seismic: result of shift seismic hazard assessment in the mine working obtained by the seismic method (a - lack of hazard, b - low hazard, c - high hazard, d - danger state);

  2. seismoacoustic: result of shift seismic hazard assessment in the mine working obtained by the seismoacoustic method;

  3. shift: information about type of a shift (W - coal-getting, N -preparation shift);

  4. genergy: seismic energy recorded within previous shift by the most active geophone (GMax) out of geophones monitoring the longwall;

  5. gpuls: a number of pulses recorded within previous shift by GMax;

  6. gdenergy: a deviation of energy recorded within previous shift by GMax from average energy recorded during eight previous shifts;

  7. gdpuls: a deviation of a number of pulses recorded within previous shift by GMax from average number of pulses recorded during eight previous shifts;

  8. ghazard: result of shift seismic hazard assessment in the mine working obtained by the seismoacoustic method based on registration coming from GMax only;

  9. nbumps: the number of seismic bumps recorded within previous shift;

  10. nbumps2: the number of seismic bumps (in energy range [10^2,10^3)) registered within previous shift;

  11. nbumps3: the number of seismic bumps (in energy range [10^3,10^4)) registered within previous shift;

  12. nbumps4: the number of seismic bumps (in energy range [10^4,10^5)) registered within previous shift;

  13. nbumps5: the number of seismic bumps (in energy range [10^5,10^6)) registered within the last shift;

  14. nbumps6: the number of seismic bumps (in energy range [10^6,10^7)) registered within previous shift;

  15. nbumps7: the number of seismic bumps (in energy range [10^7,10^8)) registered within previous shift;

  16. nbumps89: the number of seismic bumps (in energy range [10^8,10^10)) registered within previous shift;

  17. energy: total energy of seismic bumps registered within previous shift;

  18. maxenergy: the maximum energy of the seismic bumps registered within previous shift;

  19. class: the decision attribute - ‘1’ means that high energy seismic bump occurred in the next shift (‘hazardous state’), ‘0’ means that no high energy seismic bumps occurred in the next shift (‘non-hazardous state’).
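
Since class is our outcome, and the intro noted that it’s unbalanced, it’s worth a quick look at just how unbalanced it is. A small sketch (output not shown):

# how many shifts fall in each class, as counts and as proportions?
table(seismic_data$class)
prop.table(table(seismic_data$class))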

We want to look at attributes 1-18 and make some educated guesses about which variables are the most important and have the most predictive power.

Let’s do some thinking about what would make a variable not very useful (without looking at the result or outcome, just at the variable data). Three cases come to mind:

  1. The variable barely varies: if nearly every row has the same value, it can’t help separate one outcome from the other.

  2. The variable is mostly missing (NA): there’s very little actual information to build rules from.

  3. The variable is highly correlated with another variable we’re keeping: it adds complexity without adding much new information.

Let’s use some tools to eliminate cases like the three above.

First, let’s look for zero or near-zero variance in our variables. The caret package has a nice “nearZeroVar” function that helps find these.

library(caret)
theseDontChangeMuch<-nearZeroVar(seismic_data, saveMetrics= TRUE)
theseDontChangeMuch
##                 freqRatio percentUnique zeroVar   nzv
## seismic          1.864745    0.07739938   FALSE FALSE
## seismoacoustic   1.652720    0.11609907   FALSE FALSE
## shift            1.805646    0.07739938   FALSE FALSE
## genergy          1.250000   85.60371517   FALSE FALSE
## gpuls            1.333333   43.65325077   FALSE FALSE
## gdenergy         1.035714   12.92569659   FALSE FALSE
## gdpuls           1.000000   11.30030960   FALSE FALSE
## ghazard         11.047170    0.11609907   FALSE FALSE
## nbumps           2.448161    0.38699690   FALSE FALSE
## nbumps2          4.178022    0.27089783   FALSE FALSE
## nbumps3          3.780684    0.27089783   FALSE FALSE
## nbumps4         16.965035    0.15479876   FALSE FALSE
## nbumps5        214.333333    0.07739938   FALSE  TRUE
## nbumps6          0.000000    0.03869969    TRUE  TRUE
## nbumps7          0.000000    0.03869969    TRUE  TRUE
## nbumps89         0.000000    0.03869969    TRUE  TRUE
## energy          20.914286    9.36532508   FALSE  TRUE
## maxenergy       11.437500    1.27708978   FALSE FALSE
## class           14.200000    0.07739938   FALSE FALSE

Looks like we need to remove nbumps6, nbumps7, and nbumps89 for certain, as they have zero variance. Let’s keep nbumps5 and energy in our sights as well, as they have near-zero variance, but leave them in for now.

library(dplyr)
seismic_data<-seismic_data %>% select(-c(nbumps6, nbumps7, nbumps89))
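
As an aside, instead of typing the names by hand, we could have dropped the zero-variance columns programmatically, using the metrics we saved from nearZeroVar. A sketch of that equivalent alternative (run one version or the other, not both):

# equivalent to the select() above: drop whatever nearZeroVar flagged as zero variance
seismic_data <- seismic_data[, !theseDontChangeMuch$zeroVar]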

So, now we’re down to 16 variables. The next question is whether we have lots of NA’s anywhere. Let’s take a look to see what percentage of each column is NA. If we have a column with a lot of NA’s, it might make sense to leave that out of our model.

colMeans(is.na(seismic_data))
##        seismic seismoacoustic          shift        genergy          gpuls 
##              0              0              0              0              0 
##       gdenergy         gdpuls        ghazard         nbumps        nbumps2 
##              0              0              0              0              0 
##        nbumps3        nbumps4        nbumps5         energy      maxenergy 
##              0              0              0              0              0 
##          class 
##              0
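
Had one of these columns turned out to be mostly NA, one way to drop everything above a threshold (just a sketch, using a hypothetical 50% cutoff) would be:

# hypothetical: keep only columns that are less than 50% NA
seismic_data <- seismic_data[, colMeans(is.na(seismic_data)) < 0.5]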

OK, great, no problem with NAs! Now, on to the “are they closely related” question. In this, it makes sense not only to look at the relationships mathematically but also in terms of their context or meaning. Let’s first see if we have any highly correlated variables.

First, we’ll create a correlation matrix. Keep in mind that we can only correlate numbers, not category factors or text, so we have to do some column selection:

# select correlatable vars
toCorrelate<-seismic_data %>% select(-c(seismic, seismoacoustic, shift, ghazard, class))
# calculate correlation matrix
correlationMatrix <- cor(toCorrelate)
# summarize the correlation matrix
print(correlationMatrix)
##                genergy      gpuls    gdenergy     gdpuls     nbumps
## genergy    1.000000000 0.74802035  0.04851437 0.07155398 0.22072050
## gpuls      0.748020352 1.00000000  0.29303605 0.38290587 0.30092338
## gdenergy   0.048514373 0.29303605  1.00000000 0.81194443 0.03003934
## gdpuls     0.071553983 0.38290587  0.81194443 1.00000000 0.05799571
## nbumps     0.220720499 0.30092338  0.03003934 0.05799571 1.00000000
## nbumps2    0.143587473 0.20738990  0.04124631 0.05110624 0.80497842
## nbumps3    0.191752999 0.22569479 -0.01218890 0.01473474 0.80336382
## nbumps4    0.150589156 0.25654726  0.03691597 0.06619534 0.39505175
## nbumps5   -0.009862519 0.04944977  0.12322867 0.14104427 0.06961347
## energy     0.080827619 0.18735024  0.10597113 0.14327679 0.34785197
## maxenergy  0.064405366 0.16426324  0.10857184 0.14364583 0.27371367
##                nbumps2     nbumps3     nbumps4      nbumps5     energy
## genergy    0.143587473  0.19175300  0.15058916 -0.009862519 0.08082762
## gpuls      0.207389902  0.22569479  0.25654726  0.049449769 0.18735024
## gdenergy   0.041246306 -0.01218890  0.03691597  0.123228668 0.10597113
## gdpuls     0.051106244  0.01473474  0.06619534  0.141044269 0.14327679
## nbumps     0.804978416  0.80336382  0.39505175  0.069613473 0.34785197
## nbumps2    1.000000000  0.35072388  0.16129607 -0.005251127 0.12464986
## nbumps3    0.350723876  1.00000000  0.17530147  0.046497699 0.24408299
## nbumps4    0.161296068  0.17530147  1.00000000 -0.016580165 0.48982804
## nbumps5   -0.005251127  0.04649770 -0.01658016  1.000000000 0.77360515
## energy     0.124649861  0.24408299  0.48982804  0.773605146 1.00000000
## maxenergy  0.085032084  0.17648018  0.41648307  0.808406380 0.98954662
##            maxenergy
## genergy   0.06440537
## gpuls     0.16426324
## gdenergy  0.10857184
## gdpuls    0.14364583
## nbumps    0.27371367
## nbumps2   0.08503208
## nbumps3   0.17648018
## nbumps4   0.41648307
## nbumps5   0.80840638
## energy    0.98954662
## maxenergy 1.00000000

Some people would suggest 0.75 as a cutoff, but we don’t have that many features to lose, so let’s raise the bar for elimination to ±0.85. Since this is only a few features, we can eyeball the matrix and see if we have any correlations (other than the 1.0000 self-correlation) that go above 0.85 or below -0.85. But let’s make this easier on our eyes.

correlationMatrix[upper.tri(correlationMatrix)]<-0  # zero the upper triangle, so each highly correlated pair shows up only once
diag(correlationMatrix)<-0 # zero the diagonal, so variables aren't flagged just for correlating perfectly with themselves
# Let's see what it looks like now...
print(correlationMatrix)
##                genergy      gpuls    gdenergy     gdpuls     nbumps
## genergy    0.000000000 0.00000000  0.00000000 0.00000000 0.00000000
## gpuls      0.748020352 0.00000000  0.00000000 0.00000000 0.00000000
## gdenergy   0.048514373 0.29303605  0.00000000 0.00000000 0.00000000
## gdpuls     0.071553983 0.38290587  0.81194443 0.00000000 0.00000000
## nbumps     0.220720499 0.30092338  0.03003934 0.05799571 0.00000000
## nbumps2    0.143587473 0.20738990  0.04124631 0.05110624 0.80497842
## nbumps3    0.191752999 0.22569479 -0.01218890 0.01473474 0.80336382
## nbumps4    0.150589156 0.25654726  0.03691597 0.06619534 0.39505175
## nbumps5   -0.009862519 0.04944977  0.12322867 0.14104427 0.06961347
## energy     0.080827619 0.18735024  0.10597113 0.14327679 0.34785197
## maxenergy  0.064405366 0.16426324  0.10857184 0.14364583 0.27371367
##                nbumps2   nbumps3     nbumps4   nbumps5    energy maxenergy
## genergy    0.000000000 0.0000000  0.00000000 0.0000000 0.0000000         0
## gpuls      0.000000000 0.0000000  0.00000000 0.0000000 0.0000000         0
## gdenergy   0.000000000 0.0000000  0.00000000 0.0000000 0.0000000         0
## gdpuls     0.000000000 0.0000000  0.00000000 0.0000000 0.0000000         0
## nbumps     0.000000000 0.0000000  0.00000000 0.0000000 0.0000000         0
## nbumps2    0.000000000 0.0000000  0.00000000 0.0000000 0.0000000         0
## nbumps3    0.350723876 0.0000000  0.00000000 0.0000000 0.0000000         0
## nbumps4    0.161296068 0.1753015  0.00000000 0.0000000 0.0000000         0
## nbumps5   -0.005251127 0.0464977 -0.01658016 0.0000000 0.0000000         0
## energy     0.124649861 0.2440830  0.48982804 0.7736051 0.0000000         0
## maxenergy  0.085032084 0.1764802  0.41648307 0.8084064 0.9895466         0
apply(correlationMatrix,2, function(x) any(abs(x)>=0.85))
##   genergy     gpuls  gdenergy    gdpuls    nbumps   nbumps2   nbumps3 
##     FALSE     FALSE     FALSE     FALSE     FALSE     FALSE     FALSE 
##   nbumps4   nbumps5    energy maxenergy 
##     FALSE     FALSE      TRUE     FALSE

OK, so we have very high correlations somewhere with the energy field. Let’s look more closely!

cor(toCorrelate$energy, toCorrelate)
##         genergy     gpuls  gdenergy    gdpuls   nbumps   nbumps2  nbumps3
## [1,] 0.08082762 0.1873502 0.1059711 0.1432768 0.347852 0.1246499 0.244083
##       nbumps4   nbumps5 energy maxenergy
## [1,] 0.489828 0.7736051      1 0.9895466

Not surprisingly (once we read the data dictionary above), energy is extremely highly correlated (essentially collinear) with maxenergy.

What do we do with this information? Especially without being subject matter experts? Well, the first thing we can do is choose just one of energy and maxenergy to keep. Since energy has already been identified as problematic because of its lack of variance, we’ll get rid of it.

seismic_data<-seismic_data %>% select(-energy)
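
For what it’s worth, caret can automate this kind of search: its findCorrelation() function takes a full correlation matrix and a cutoff and suggests one member of each highly correlated pair to drop. A sketch (we already made our choice by hand above, so this is just for reference):

# automated alternative: ask caret which highly correlated columns it would drop
findCorrelation(cor(toCorrelate), cutoff = 0.85, names = TRUE)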

Let’s keep in mind that nbumps5 also had very low variance. Since we’ve already gotten rid of its neighbors nbumps6, nbumps7, and nbumps89, we could consider getting rid of this one too. Let’s compare its variance with the other “numbered” nbumps:

sd(seismic_data$nbumps5)
## [1] 0.06800137
sd(seismic_data$nbumps4)
## [1] 0.2790589
sd(seismic_data$nbumps3)
## [1] 0.7697097
sd(seismic_data$nbumps2)
## [1] 0.7837721

What we see is a huge difference: the standard deviation of nbumps5 is roughly an order of magnitude smaller than that of its neighbors (about two orders of magnitude in variance). nbumps5 barely varies at all compared to its neighbors. It might make sense to drop it.

seismic_data<-seismic_data %>% select(-nbumps5)

So now we’re down to 14 variables from the original 19. That should make model creation much easier. Let’s see if it improves our predictions, especially our linear predictions!
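
A quick confirmation of what’s left (output not shown):

# how many columns remain, and which ones?
dim(seismic_data)
names(seismic_data)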

Currently, seismic_data is the culled data set. Let’s re-read the original, so we can see whether certain algorithms (like glm, rpart2, or rf) work better with our original data or the pared-down data we decided on. That way, we can see if feature selection actually helped us or not.

original_seismic_data<-read.arff("http://archive.ics.uci.edu/ml/machine-learning-databases/00266/seismic-bumps.arff")

Let’s split into training and test.

# We set a seed for the sake of reproducibility
set.seed(42)

# First, we'll pick off the training data: 
inTrain<-createDataPartition(y=seismic_data$class, p = 0.60, list=FALSE)
smaller_training<-seismic_data[inTrain,]
orig_training<-original_seismic_data[inTrain,]

# Then what's left is testing data.
smaller_testing<-seismic_data[-inTrain,]
orig_testing<-original_seismic_data[-inTrain,]
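
Because createDataPartition samples within each level of class, the hazardous/non-hazardous proportions should be roughly the same in training and testing. A quick check (output not shown):

# the outcome proportions should look similar in both partitions
prop.table(table(smaller_training$class))
prop.table(table(smaller_testing$class))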

GLM is a robust model even with “bad” variables, so let’s try it – we’ll generate two models, one using the original columns, and one with the reduced columns. We’ll also see how each model does at predicting on the training data itself.

set.seed(42)
glm_model_orig<-train(class ~ ., data=orig_training, method="glm")
confusionMatrix(orig_training$class, predict(glm_model_orig, orig_training), positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1440    9
##          1   98    4
##                                           
##                Accuracy : 0.931           
##                  95% CI : (0.9172, 0.9431)
##     No Information Rate : 0.9916          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0555          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.307692        
##             Specificity : 0.936281        
##          Pos Pred Value : 0.039216        
##          Neg Pred Value : 0.993789        
##              Prevalence : 0.008382        
##          Detection Rate : 0.002579        
##    Detection Prevalence : 0.065764        
##       Balanced Accuracy : 0.621987        
##                                           
##        'Positive' Class : 1               
## 

So, the GLM model on the training data (original columns) has sensitivity of 0.3077 and PPV of 0.0392.

set.seed(42)
glm_model_small<-train(class ~ ., data=smaller_training, method="glm")
confusionMatrix(smaller_training$class, predict(glm_model_small, smaller_training), positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1440    9
##          1   98    4
##                                           
##                Accuracy : 0.931           
##                  95% CI : (0.9172, 0.9431)
##     No Information Rate : 0.9916          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0555          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.307692        
##             Specificity : 0.936281        
##          Pos Pred Value : 0.039216        
##          Neg Pred Value : 0.993789        
##              Prevalence : 0.008382        
##          Detection Rate : 0.002579        
##    Detection Prevalence : 0.065764        
##       Balanced Accuracy : 0.621987        
##                                           
##        'Positive' Class : 1               
## 

The GLM model on the training data (smaller columns) has sensitivity of 0.3077 and PPV of 0.0392. Identical, right? After feature selection, we often won’t see improved performance on the training data, and we may even see a slight drop, because part of the point of feature selection is to reduce overfitting to the training set. The real measure of a model’s success is its ability to work on previously unseen data that played no part in building the model. Let’s see if there are any differences between the original and smaller variable sets when we use the models to predict on the testing data.

confusionMatrix(orig_testing$class, predict(glm_model_orig, orig_testing), positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 956   9
##          1  65   3
##                                           
##                Accuracy : 0.9284          
##                  95% CI : (0.9109, 0.9433)
##     No Information Rate : 0.9884          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0564          
##  Mcnemar's Test P-Value : 1.62e-10        
##                                           
##             Sensitivity : 0.250000        
##             Specificity : 0.936337        
##          Pos Pred Value : 0.044118        
##          Neg Pred Value : 0.990674        
##              Prevalence : 0.011617        
##          Detection Rate : 0.002904        
##    Detection Prevalence : 0.065828        
##       Balanced Accuracy : 0.593168        
##                                           
##        'Positive' Class : 1               
## 

Sensitivity of 0.25 and PPV of 0.0441. The sensitivity has decreased by nearly six percentage points compared to the training data.

confusionMatrix(smaller_testing$class, predict(glm_model_small, smaller_testing), positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 957   8
##          1  64   4
##                                          
##                Accuracy : 0.9303         
##                  95% CI : (0.913, 0.9451)
##     No Information Rate : 0.9884         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.0819         
##  Mcnemar's Test P-Value : 9.063e-11      
##                                          
##             Sensitivity : 0.333333       
##             Specificity : 0.937316       
##          Pos Pred Value : 0.058824       
##          Neg Pred Value : 0.991710       
##              Prevalence : 0.011617       
##          Detection Rate : 0.003872       
##    Detection Prevalence : 0.065828       
##       Balanced Accuracy : 0.635325       
##                                          
##        'Positive' Class : 1              
## 

The strength of feature selection comes through when we see the smaller variable set model applied to testing data: Sensitivity of 0.3333 and PPV of 0.0588. With careful variable selection, our model looked the same when applied to training data, but the more curated feature set gave us a better outcome in what really matters – application to unknown data.

It’s tricky sometimes to know what the right balance is of keeping a rich variety of variables while eliminating highly correlated variables, but in this case, a somewhat conservative approach seems to do well for us.

Let’s do the same thing with a tree and see if the principle holds – has feature reduction helped us create a better tree?

set.seed(42)
tree_model_orig<-train(class ~ ., data=orig_training, method="rpart2")
## note: only 2 possible values of the max tree depth from the initial fit.
##  Truncating the grid to 2 .
confusionMatrix(orig_training$class, predict(tree_model_orig, orig_training), positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1433   16
##          1   65   37
##                                           
##                Accuracy : 0.9478          
##                  95% CI : (0.9355, 0.9583)
##     No Information Rate : 0.9658          
##     P-Value [Acc > NIR] : 0.9999          
##                                           
##                   Kappa : 0.4528          
##  Mcnemar's Test P-Value : 9.643e-08       
##                                           
##             Sensitivity : 0.69811         
##             Specificity : 0.95661         
##          Pos Pred Value : 0.36275         
##          Neg Pred Value : 0.98896         
##              Prevalence : 0.03417         
##          Detection Rate : 0.02386         
##    Detection Prevalence : 0.06576         
##       Balanced Accuracy : 0.82736         
##                                           
##        'Positive' Class : 1               
## 
set.seed(42)
tree_model_small<-train(class ~ ., data=smaller_training, method="rpart2")
## note: only 2 possible values of the max tree depth from the initial fit.
##  Truncating the grid to 2 .
confusionMatrix(smaller_training$class, predict(tree_model_small, smaller_training), positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1438   11
##          1   74   28
##                                          
##                Accuracy : 0.9452         
##                  95% CI : (0.9327, 0.956)
##     No Information Rate : 0.9749         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.3744         
##  Mcnemar's Test P-Value : 1.758e-11      
##                                          
##             Sensitivity : 0.71795        
##             Specificity : 0.95106        
##          Pos Pred Value : 0.27451        
##          Neg Pred Value : 0.99241        
##              Prevalence : 0.02515        
##          Detection Rate : 0.01805        
##    Detection Prevalence : 0.06576        
##       Balanced Accuracy : 0.83450        
##                                          
##        'Positive' Class : 1              
## 

At least in training data, we see sensitivity go up and positive predictive value go down when we remove some features. Again, just as in the glm case above, we’ll now apply the two models to testing data to see what the “real” effect is (“real” meaning here what we predict will happen with brand new data that we don’t know about yet).

confusionMatrix(orig_testing$class, predict(tree_model_orig, orig_testing), positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 942  23
##          1  62   6
##                                           
##                Accuracy : 0.9177          
##                  95% CI : (0.8993, 0.9337)
##     No Information Rate : 0.9719          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0878          
##  Mcnemar's Test P-Value : 3.761e-05       
##                                           
##             Sensitivity : 0.206897        
##             Specificity : 0.938247        
##          Pos Pred Value : 0.088235        
##          Neg Pred Value : 0.976166        
##              Prevalence : 0.028074        
##          Detection Rate : 0.005808        
##    Detection Prevalence : 0.065828        
##       Balanced Accuracy : 0.572572        
##                                           
##        'Positive' Class : 1               
## 
confusionMatrix(smaller_testing$class, predict(tree_model_small, smaller_testing), positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 945  20
##          1  64   4
##                                           
##                Accuracy : 0.9187          
##                  95% CI : (0.9003, 0.9346)
##     No Information Rate : 0.9768          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0545          
##  Mcnemar's Test P-Value : 2.71e-06        
##                                           
##             Sensitivity : 0.166667        
##             Specificity : 0.936571        
##          Pos Pred Value : 0.058824        
##          Neg Pred Value : 0.979275        
##              Prevalence : 0.023233        
##          Detection Rate : 0.003872        
##    Detection Prevalence : 0.065828        
##       Balanced Accuracy : 0.551619        
##                                           
##        'Positive' Class : 1               
## 

It looks like reducing our features had a negative impact on the performance of the tree algorithm we used. Clearly, feature selection is a bit of an art, and can have both positive and negative consequences. Let’s do a random forest example.

set.seed(42)
rf_model_orig<-train(class ~ ., data=orig_training, method="rf")
confusionMatrix(orig_training$class, predict(rf_model_orig, orig_training), positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1449    0
##          1  101    1
##                                           
##                Accuracy : 0.9349          
##                  95% CI : (0.9214, 0.9466)
##     No Information Rate : 0.9994          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0182          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 1.0000000       
##             Specificity : 0.9348387       
##          Pos Pred Value : 0.0098039       
##          Neg Pred Value : 1.0000000       
##              Prevalence : 0.0006447       
##          Detection Rate : 0.0006447       
##    Detection Prevalence : 0.0657640       
##       Balanced Accuracy : 0.9674194       
##                                           
##        'Positive' Class : 1               
## 
set.seed(42)
rf_model_small<-train(class ~ ., data=smaller_training, method="rf")
confusionMatrix(smaller_training$class, predict(rf_model_small, smaller_training), positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1449    0
##          1   47   55
##                                           
##                Accuracy : 0.9697          
##                  95% CI : (0.9599, 0.9777)
##     No Information Rate : 0.9645          
##     P-Value [Acc > NIR] : 0.1511          
##                                           
##                   Kappa : 0.6862          
##  Mcnemar's Test P-Value : 1.949e-11       
##                                           
##             Sensitivity : 1.00000         
##             Specificity : 0.96858         
##          Pos Pred Value : 0.53922         
##          Neg Pred Value : 1.00000         
##              Prevalence : 0.03546         
##          Detection Rate : 0.03546         
##    Detection Prevalence : 0.06576         
##       Balanced Accuracy : 0.98429         
##                                           
##        'Positive' Class : 1               
## 

Wow, in the random forest example, the PPV went way up when possibly extraneous features were removed! Let’s see, however, how the two models perform on testing data:

confusionMatrix(orig_testing$class, predict(rf_model_orig, orig_testing), positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 965   0
##          1  68   0
##                                           
##                Accuracy : 0.9342          
##                  95% CI : (0.9173, 0.9485)
##     No Information Rate : 1               
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0               
##  Mcnemar's Test P-Value : 4.476e-16       
##                                           
##             Sensitivity :      NA         
##             Specificity : 0.93417         
##          Pos Pred Value :      NA         
##          Neg Pred Value :      NA         
##              Prevalence : 0.00000         
##          Detection Rate : 0.00000         
##    Detection Prevalence : 0.06583         
##       Balanced Accuracy :      NA         
##                                           
##        'Positive' Class : 1               
## 
confusionMatrix(smaller_testing$class, predict(rf_model_small, smaller_testing), positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 965   0
##          1  68   0
##                                           
##                Accuracy : 0.9342          
##                  95% CI : (0.9173, 0.9485)
##     No Information Rate : 1               
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0               
##  Mcnemar's Test P-Value : 4.476e-16       
##                                           
##             Sensitivity :      NA         
##             Specificity : 0.93417         
##          Pos Pred Value :      NA         
##          Neg Pred Value :      NA         
##              Prevalence : 0.00000         
##          Detection Rate : 0.00000         
##    Detection Prevalence : 0.06583         
##       Balanced Accuracy :      NA         
##                                           
##        'Positive' Class : 1               
## 

Alas, it looks as though in both cases we have identical, awful results: neither model predicts a single hazardous shift in the testing data. The random forest algorithm itself may not be the right choice for this data, or we may need to do some model optimization.
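
What might “model optimization” look like? One possible direction (just a sketch, not run here, and no guarantee it helps with this unbalanced outcome) is to give train() an explicit resampling scheme and a tuning grid instead of the defaults:

# sketch: cross-validated tuning of the random forest's mtry parameter
ctrl <- trainControl(method = "cv", number = 5)
set.seed(42)
rf_tuned <- train(class ~ ., data = smaller_training, method = "rf",
                  trControl = ctrl, tuneGrid = data.frame(mtry = c(2, 4, 8)))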

Feature selection also includes far more complex operations than the mere culling we’ve done here. To go deeper, learn all about it in Advanced Feature Engineering.

Learn more!

Catch all the intro to machine learning documents: