This is intended to be one of a series of bite-size documents to help you understand what machine learning is and how to do it in R, without knowing a lot of math or computation. The goal is to get you doing machine learning right away, with enough knowledge to get you interested in doing more.
Machine learning has many applications and meanings, but for our purposes, we’ll define it this way: “Machine learning consists of taking some data with a known result, and coming up with good rules about how the result is related to the rest of the data, so that we can apply those rules and make predictions about data for which we don’t know the result.”
This is super simple, obviously, and leaves out lots of important concepts, like signal vs. noise, overfitting, etc. But it’s enough to get started.
It’s also important to note that we’re starting here with a classification question with two possible outcomes (we could say a binomial or binary outcome): hazardous or non-hazardous conditions. There are other kinds of questions, like multi-class classification (with more than two bins) and continuous outcomes (like predicting salary or IQ). Also, we’re starting with a fairly tough classification problem – the classes are unbalanced, with hazardous conditions much rarer than non-hazardous ones.
If you’ve already tried the mini-tutorials for Trees, Linear Models, or random forests, you may have been surprised at how poor the predictions were. After all, we have almost 2600 measurements of 19 variables – surely we can find some pattern that indicates possible seismic danger!
In our Linear Models document, we realized that our data had some problems that kept the model from working at all, and we removed a few variables. In this document, we’re going to take a closer look at the data and do some feature selection – removing unnecessary variables that just make the model more complex (and increase variance, but that’s more statistics than we need right now).
There are lots of places to get data – from your own research, or online. For this example, I’m using some data from http://archive.ics.uci.edu/ml/datasets/seismic-bumps. It’s intended to help predict when dangerous seismic events might happen, in order to keep miners safe.
First I’ll get the data. It’s in arff format, so I’ll use the “foreign” package to get it.
library(foreign)
seismic_data<-read.arff("http://archive.ics.uci.edu/ml/machine-learning-databases/00266/seismic-bumps.arff")
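Before diving into the data dictionary, it doesn’t hurt to take a quick peek at what we just loaded – how big it is, and how unbalanced the outcome is (a quick sanity check; output not shown here).
# A few quick sanity checks on the freshly loaded data (output omitted):
dim(seismic_data)          # how many rows and columns do we have?
table(seismic_data$class)  # how unbalanced are the two outcome classes?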
Next, let me take a look at the attribute information, in order to understand what the result or outcome is, and what my predictors are. The list below comes from the original arff file; it’s also on the website mentioned above.
Attribute information:
seismic: result of shift seismic hazard assessment in the mine working obtained by the seismic method (a - lack of hazard, b - low hazard, c - high hazard, d - danger state);
seismoacoustic: result of shift seismic hazard assessment in the mine working obtained by the seismoacoustic method;
shift: information about type of a shift (W - coal-getting, N - preparation shift);
genergy: seismic energy recorded within previous shift by the most active geophone (GMax) out of geophones monitoring the longwall;
gpuls: a number of pulses recorded within previous shift by GMax;
gdenergy: a deviation of energy recorded within previous shift by GMax from average energy recorded during eight previous shifts;
gdpuls: a deviation of a number of pulses recorded within previous shift by GMax from average number of pulses recorded during eight previous shifts;
ghazard: result of shift seismic hazard assessment in the mine working obtained by the seismoacoustic method based on registration coming from GMax only;
nbumps: the number of seismic bumps recorded within previous shift;
nbumps2: the number of seismic bumps (in energy range [10^2,10^3)) registered within previous shift;
nbumps3: the number of seismic bumps (in energy range [10^3,10^4)) registered within previous shift;
nbumps4: the number of seismic bumps (in energy range [10^4,10^5)) registered within previous shift;
nbumps5: the number of seismic bumps (in energy range [10^5,10^6)) registered within the last shift;
nbumps6: the number of seismic bumps (in energy range [10^6,10^7)) registered within previous shift;
nbumps7: the number of seismic bumps (in energy range [10^7,10^8)) registered within previous shift;
nbumps89: the number of seismic bumps (in energy range [10^8,10^10)) registered within previous shift;
energy: total energy of seismic bumps registered within previous shift;
maxenergy: the maximum energy of the seismic bumps registered within previous shift;
class: the decision attribute - ‘1’ means that high energy seismic bump occurred in the next shift (‘hazardous state’), ‘0’ means that no high energy seismic bumps occurred in the next shift (‘non-hazardous state’).
We want to look at attributes 1-18 and make some educated guesses about which variables are the most important and have the most predictive power.
Let’s do some thinking about what would make a variable not very useful (without looking at the result or outcome, just at the variable data).
The variable never (or hardly ever) changes – for example, if we looked at a data set of male lung cancer patients, their biological sex wouldn’t be a good predictor of disease outcome, because it doesn’t vary at all.
The variable is missing for much of the data – if we only know the lead exposure levels of one-third of rats, it doesn’t make much sense to create a race completion speed model that relies on that variable.
The variable is very closely related to another variable. For example, if I take the hand width and foot width of subjects, those will be very closely related, and I will probably want to throw one out so that together, they don’t exert undue influence. Or if one variable contributes to the calculation of another variable (like weight and BMI), they might be strongly correlated and could mess up model prediction.
Let’s use some tools to eliminate cases like the three above.
First, let’s look at zero or near-zero variance in variables. Caret has a nice “nearZeroVar” function that helps find these.
library(caret)
theseDontChangeMuch<-nearZeroVar(seismic_data, saveMetrics= TRUE)
theseDontChangeMuch
## freqRatio percentUnique zeroVar nzv
## seismic 1.864745 0.07739938 FALSE FALSE
## seismoacoustic 1.652720 0.11609907 FALSE FALSE
## shift 1.805646 0.07739938 FALSE FALSE
## genergy 1.250000 85.60371517 FALSE FALSE
## gpuls 1.333333 43.65325077 FALSE FALSE
## gdenergy 1.035714 12.92569659 FALSE FALSE
## gdpuls 1.000000 11.30030960 FALSE FALSE
## ghazard 11.047170 0.11609907 FALSE FALSE
## nbumps 2.448161 0.38699690 FALSE FALSE
## nbumps2 4.178022 0.27089783 FALSE FALSE
## nbumps3 3.780684 0.27089783 FALSE FALSE
## nbumps4 16.965035 0.15479876 FALSE FALSE
## nbumps5 214.333333 0.07739938 FALSE TRUE
## nbumps6 0.000000 0.03869969 TRUE TRUE
## nbumps7 0.000000 0.03869969 TRUE TRUE
## nbumps89 0.000000 0.03869969 TRUE TRUE
## energy 20.914286 9.36532508 FALSE TRUE
## maxenergy 11.437500 1.27708978 FALSE FALSE
## class 14.200000 0.07739938 FALSE FALSE
Looks like we need to remove nbumps6, nbumps7, and nbumps89 for certain, as they have zero variance. Let’s keep nbumps5 and energy in our sights as well, as they have near-zero variance, but leave them in for now.
library(dplyr)
seismic_data<-seismic_data %>% select(-c(nbumps6, nbumps7, nbumps89))
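If we had many more columns, we could drive the removal from the nearZeroVar() metrics instead of typing the names by hand. A small sketch, equivalent to the select() call above (so don’t run both); it assumes a dplyr version recent enough to have any_of():
# Drop whatever nearZeroVar() flagged as zero variance, without naming columns by hand.
# (Equivalent to the select() call above -- don't run both.)
zeroVarCols <- rownames(theseDontChangeMuch)[theseDontChangeMuch$zeroVar]
seismic_data <- seismic_data %>% select(-any_of(zeroVarCols))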
So, now we’re down to 16 variables. The next question is whether we have lots of NAs anywhere. Let’s take a look to see what percentage of each column is NA. If we have a column with a lot of NAs, it might make sense to leave that out of our model.
colMeans(is.na(seismic_data))
## seismic seismoacoustic shift genergy gpuls
## 0 0 0 0 0
## gdenergy gdpuls ghazard nbumps nbumps2
## 0 0 0 0 0
## nbumps3 nbumps4 nbumps5 energy maxenergy
## 0 0 0 0 0
## class
## 0
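Had any of these columns come back with lots of NAs, we could have dropped them with a simple threshold rule. Here’s a sketch – the one-third cutoff is purely illustrative, and with this data nothing would be dropped:
# Drop any column where more than a third of the values are missing.
# (Illustrative only; in this data set, no column meets that condition.)
naFraction <- colMeans(is.na(seismic_data))
seismic_data <- seismic_data %>% select(-any_of(names(naFraction[naFraction > 1/3])))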
OK, great, no problem with NAs! Now, on to the “are they closely related” question. In this, it makes sense not only to look at the relationships mathematically but also in terms of their context or meaning. Let’s first see if we have any highly correlated variables.
First, we’ll create a correlation matrix. Keep in mind that we can only correlate numbers, not category factors or text, so we have to do some column selection:
# select correlatable vars
toCorrelate<-seismic_data %>% select(-c(seismic, seismoacoustic, shift, ghazard, class))
# calculate correlation matrix
correlationMatrix <- cor(toCorrelate)
# summarize the correlation matrix
print(correlationMatrix)
## genergy gpuls gdenergy gdpuls nbumps
## genergy 1.000000000 0.74802035 0.04851437 0.07155398 0.22072050
## gpuls 0.748020352 1.00000000 0.29303605 0.38290587 0.30092338
## gdenergy 0.048514373 0.29303605 1.00000000 0.81194443 0.03003934
## gdpuls 0.071553983 0.38290587 0.81194443 1.00000000 0.05799571
## nbumps 0.220720499 0.30092338 0.03003934 0.05799571 1.00000000
## nbumps2 0.143587473 0.20738990 0.04124631 0.05110624 0.80497842
## nbumps3 0.191752999 0.22569479 -0.01218890 0.01473474 0.80336382
## nbumps4 0.150589156 0.25654726 0.03691597 0.06619534 0.39505175
## nbumps5 -0.009862519 0.04944977 0.12322867 0.14104427 0.06961347
## energy 0.080827619 0.18735024 0.10597113 0.14327679 0.34785197
## maxenergy 0.064405366 0.16426324 0.10857184 0.14364583 0.27371367
## nbumps2 nbumps3 nbumps4 nbumps5 energy
## genergy 0.143587473 0.19175300 0.15058916 -0.009862519 0.08082762
## gpuls 0.207389902 0.22569479 0.25654726 0.049449769 0.18735024
## gdenergy 0.041246306 -0.01218890 0.03691597 0.123228668 0.10597113
## gdpuls 0.051106244 0.01473474 0.06619534 0.141044269 0.14327679
## nbumps 0.804978416 0.80336382 0.39505175 0.069613473 0.34785197
## nbumps2 1.000000000 0.35072388 0.16129607 -0.005251127 0.12464986
## nbumps3 0.350723876 1.00000000 0.17530147 0.046497699 0.24408299
## nbumps4 0.161296068 0.17530147 1.00000000 -0.016580165 0.48982804
## nbumps5 -0.005251127 0.04649770 -0.01658016 1.000000000 0.77360515
## energy 0.124649861 0.24408299 0.48982804 0.773605146 1.00000000
## maxenergy 0.085032084 0.17648018 0.41648307 0.808406380 0.98954662
## maxenergy
## genergy 0.06440537
## gpuls 0.16426324
## gdenergy 0.10857184
## gdpuls 0.14364583
## nbumps 0.27371367
## nbumps2 0.08503208
## nbumps3 0.17648018
## nbumps4 0.41648307
## nbumps5 0.80840638
## energy 0.98954662
## maxenergy 1.00000000
Some people would suggest 0.75 as a cutoff, but we don’t have that many features to lose, so let’s raise the bar for elimination to +/- 0.85. Since we only have a few features, we could eyeball the matrix and look for correlations (other than the perfect 1.0 self-correlations) that go above 0.85 or below -0.85. But let’s make this easier on our eyes.
correlationMatrix[upper.tri(correlationMatrix)]<-0 # this way I only pick one of the mirror-imaged highly correlated pair
diag(correlationMatrix)<-0 # and I don't flag variables just for being perfectly correlated with themselves
# Let's see what it looks like now...
print(correlationMatrix)
## genergy gpuls gdenergy gdpuls nbumps
## genergy 0.000000000 0.00000000 0.00000000 0.00000000 0.00000000
## gpuls 0.748020352 0.00000000 0.00000000 0.00000000 0.00000000
## gdenergy 0.048514373 0.29303605 0.00000000 0.00000000 0.00000000
## gdpuls 0.071553983 0.38290587 0.81194443 0.00000000 0.00000000
## nbumps 0.220720499 0.30092338 0.03003934 0.05799571 0.00000000
## nbumps2 0.143587473 0.20738990 0.04124631 0.05110624 0.80497842
## nbumps3 0.191752999 0.22569479 -0.01218890 0.01473474 0.80336382
## nbumps4 0.150589156 0.25654726 0.03691597 0.06619534 0.39505175
## nbumps5 -0.009862519 0.04944977 0.12322867 0.14104427 0.06961347
## energy 0.080827619 0.18735024 0.10597113 0.14327679 0.34785197
## maxenergy 0.064405366 0.16426324 0.10857184 0.14364583 0.27371367
## nbumps2 nbumps3 nbumps4 nbumps5 energy maxenergy
## genergy 0.000000000 0.0000000 0.00000000 0.0000000 0.0000000 0
## gpuls 0.000000000 0.0000000 0.00000000 0.0000000 0.0000000 0
## gdenergy 0.000000000 0.0000000 0.00000000 0.0000000 0.0000000 0
## gdpuls 0.000000000 0.0000000 0.00000000 0.0000000 0.0000000 0
## nbumps 0.000000000 0.0000000 0.00000000 0.0000000 0.0000000 0
## nbumps2 0.000000000 0.0000000 0.00000000 0.0000000 0.0000000 0
## nbumps3 0.350723876 0.0000000 0.00000000 0.0000000 0.0000000 0
## nbumps4 0.161296068 0.1753015 0.00000000 0.0000000 0.0000000 0
## nbumps5 -0.005251127 0.0464977 -0.01658016 0.0000000 0.0000000 0
## energy 0.124649861 0.2440830 0.48982804 0.7736051 0.0000000 0
## maxenergy 0.085032084 0.1764802 0.41648307 0.8084064 0.9895466 0
apply(correlationMatrix,2, function(x) any(abs(x)>=0.85))
## genergy gpuls gdenergy gdpuls nbumps nbumps2 nbumps3
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## nbumps4 nbumps5 energy maxenergy
## FALSE FALSE TRUE FALSE
OK, so we have very high correlations somewhere with the energy field. Let’s look more closely!
cor(toCorrelate$energy, toCorrelate)
## genergy gpuls gdenergy gdpuls nbumps nbumps2 nbumps3
## [1,] 0.08082762 0.1873502 0.1059711 0.1432768 0.347852 0.1246499 0.244083
## nbumps4 nbumps5 energy maxenergy
## [1,] 0.489828 0.7736051 1 0.9895466
Not surprisingly (once we read the data dictionary above), energy is extremely highly correlated (essentially collinear) with maxenergy.
What do we do with this information? Especially without being subject matter experts? Well, the first thing we can do is choose just one of energy and maxenergy to keep. Since energy has already been identified as problematic because of its lack of variance, we’ll get rid of it.
seismic_data<-seismic_data %>% select(-energy)
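As an aside, caret can automate this kind of cutoff-based culling with its findCorrelation() function. A sketch, recomputing the matrix since we zeroed ours out above; with a 0.85 cutoff we’d expect it to flag one of the energy/maxenergy pair:
# Ask caret which columns exceed the correlation cutoff; findCorrelation()
# suggests columns to drop, working from the full correlation matrix.
tooCorrelated <- findCorrelation(cor(toCorrelate), cutoff = 0.85, names = TRUE)
tooCorrelated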
Let’s keep in mind that nbumps5 was also very low variance. Since we’ve already gotten rid of its neighbors nbumps6, nbumps7, and nbumps89, we could consider getting rid of this one too. Let’s compare its variance with the other “numbered” nbumps:
sd(seismic_data$nbumps5)
## [1] 0.06800137
sd(seismic_data$nbumps4)
## [1] 0.2790589
sd(seismic_data$nbumps3)
## [1] 0.7697097
sd(seismic_data$nbumps2)
## [1] 0.7837721
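For reference, we could have computed all of these in one call (a quick sketch; output omitted):
# Standard deviation of every remaining nbumps* column at once.
sapply(seismic_data %>% select(starts_with("nbumps")), sd)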
What we see is a huge difference: nbumps5’s standard deviation is roughly a tenth of its neighbors’ (about two orders of magnitude difference in variance). nbumps5 barely varies at all, compared to its neighbors, so it might make sense to drop it.
seismic_data<-seismic_data %>% select(-nbumps5)
So now we’re down to 14 variables from the original 19. That should make model creation much easier. Let’s see if it improves our predictions, especially our linear predictions!
Currently, seismic_data is the culled data set. Let’s re-load the original, so we can see whether certain algorithms (like glm, rpart2, or rf) work better with our original data or with the pared-down data we decided on. That way, we can tell whether feature selection actually helped us.
original_seismic_data<-read.arff("http://archive.ics.uci.edu/ml/machine-learning-databases/00266/seismic-bumps.arff")
Let’s split into training and test.
# We set a seed for the sake of reproducibility
set.seed(42)
# First, we'll pick off the training data. The same row indices will work for both data frames, since their rows are in the same order:
inTrain<-createDataPartition(y=seismic_data$class, p = 0.60, list=FALSE)
smaller_training<-seismic_data[inTrain,]
orig_training<-original_seismic_data[inTrain,]
# Then what's left is testing data.
smaller_testing<-seismic_data[-inTrain,]
orig_testing<-original_seismic_data[-inTrain,]
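Because createDataPartition() samples within each class, the proportion of hazardous shifts should be similar in the training and testing sets. A quick check (output omitted):
# The outcome prevalence should be roughly the same in both splits,
# since createDataPartition() stratifies on the class variable.
prop.table(table(smaller_training$class))
prop.table(table(smaller_testing$class))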
GLM tends to be fairly robust even when some “bad” variables are included, so let’s try it – we’ll generate two models, one using the original columns, and one with the reduced columns. We’ll also see how each model does at predicting on the training data itself.
set.seed(42)
glm_model_orig<-train(class ~ ., data=orig_training, method="glm")
confusionMatrix(orig_training$class, predict(glm_model_orig, orig_training), positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1440 9
## 1 98 4
##
## Accuracy : 0.931
## 95% CI : (0.9172, 0.9431)
## No Information Rate : 0.9916
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0555
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.307692
## Specificity : 0.936281
## Pos Pred Value : 0.039216
## Neg Pred Value : 0.993789
## Prevalence : 0.008382
## Detection Rate : 0.002579
## Detection Prevalence : 0.065764
## Balanced Accuracy : 0.621987
##
## 'Positive' Class : 1
##
So, the GLM model on the training data (original columns) has sensitivity of 0.3077 and PPV of 0.0392.
set.seed(42)
glm_model_small<-train(class ~ ., data=smaller_training, method="glm")
confusionMatrix(smaller_training$class, predict(glm_model_small, smaller_training), positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1440 9
## 1 98 4
##
## Accuracy : 0.931
## 95% CI : (0.9172, 0.9431)
## No Information Rate : 0.9916
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0555
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.307692
## Specificity : 0.936281
## Pos Pred Value : 0.039216
## Neg Pred Value : 0.993789
## Prevalence : 0.008382
## Detection Rate : 0.002579
## Detection Prevalence : 0.065764
## Balanced Accuracy : 0.621987
##
## 'Positive' Class : 1
##
The GLM model on the training data (reduced columns) has sensitivity of 0.3077 and PPV of 0.0392 – identical to the original model. Sometimes, after feature selection, we won’t see any improvement against our training data; keep in mind that part of the point of feature selection is to reduce overfitting to the training set, so we may even see reduced effectiveness in training. But a model’s real measure of success is its ability (we hope) to work on previously unseen data that played no part in its creation. Let’s see if there are any differences between the original and reduced variable sets when we use each model to predict on the testing data.
confusionMatrix(orig_testing$class, predict(glm_model_orig, orig_testing), positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 956 9
## 1 65 3
##
## Accuracy : 0.9284
## 95% CI : (0.9109, 0.9433)
## No Information Rate : 0.9884
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0564
## Mcnemar's Test P-Value : 1.62e-10
##
## Sensitivity : 0.250000
## Specificity : 0.936337
## Pos Pred Value : 0.044118
## Neg Pred Value : 0.990674
## Prevalence : 0.011617
## Detection Rate : 0.002904
## Detection Prevalence : 0.065828
## Balanced Accuracy : 0.593168
##
## 'Positive' Class : 1
##
Sensitivity of 0.25 and PPV of 0.0441. The sensitivity has dropped almost six percentage points compared to the training data.
confusionMatrix(smaller_testing$class, predict(glm_model_small, smaller_testing), positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 957 8
## 1 64 4
##
## Accuracy : 0.9303
## 95% CI : (0.913, 0.9451)
## No Information Rate : 0.9884
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0819
## Mcnemar's Test P-Value : 9.063e-11
##
## Sensitivity : 0.333333
## Specificity : 0.937316
## Pos Pred Value : 0.058824
## Neg Pred Value : 0.991710
## Prevalence : 0.011617
## Detection Rate : 0.003872
## Detection Prevalence : 0.065828
## Balanced Accuracy : 0.635325
##
## 'Positive' Class : 1
##
The strength of feature selection comes through when we see the smaller variable set model applied to testing data: Sensitivity of 0.3333 and PPV of 0.0588. With careful variable selection, our model looked the same when applied to training data, but the more curated feature set gave us a better outcome in what really matters – application to unknown data.
It’s tricky sometimes to know what the right balance is of keeping a rich variety of variables while eliminating highly correlated variables, but in this case, a somewhat conservative approach seems to do well for us.
Let’s do the same thing with a tree and see if the principle holds – has feature reduction helped us create a better tree?
set.seed(42)
tree_model_orig<-train(class ~ ., data=orig_training, method="rpart2")
## note: only 2 possible values of the max tree depth from the initial fit.
## Truncating the grid to 2 .
confusionMatrix(orig_training$class, predict(tree_model_orig, orig_training), positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1433 16
## 1 65 37
##
## Accuracy : 0.9478
## 95% CI : (0.9355, 0.9583)
## No Information Rate : 0.9658
## P-Value [Acc > NIR] : 0.9999
##
## Kappa : 0.4528
## Mcnemar's Test P-Value : 9.643e-08
##
## Sensitivity : 0.69811
## Specificity : 0.95661
## Pos Pred Value : 0.36275
## Neg Pred Value : 0.98896
## Prevalence : 0.03417
## Detection Rate : 0.02386
## Detection Prevalence : 0.06576
## Balanced Accuracy : 0.82736
##
## 'Positive' Class : 1
##
set.seed(42)
tree_model_small<-train(class ~ ., data=smaller_training, method="rpart2")
## note: only 2 possible values of the max tree depth from the initial fit.
## Truncating the grid to 2 .
confusionMatrix(smaller_training$class, predict(tree_model_small, smaller_training), positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1438 11
## 1 74 28
##
## Accuracy : 0.9452
## 95% CI : (0.9327, 0.956)
## No Information Rate : 0.9749
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3744
## Mcnemar's Test P-Value : 1.758e-11
##
## Sensitivity : 0.71795
## Specificity : 0.95106
## Pos Pred Value : 0.27451
## Neg Pred Value : 0.99241
## Prevalence : 0.02515
## Detection Rate : 0.01805
## Detection Prevalence : 0.06576
## Balanced Accuracy : 0.83450
##
## 'Positive' Class : 1
##
At least in training data, we see sensitivity go up and positive predictive value go down when we remove some features. Again, just as in the glm case above, we’ll now apply the two models to testing data to see what the “real” effect is (“real” meaning here what we predict will happen with brand new data that we don’t know about yet).
confusionMatrix(orig_testing$class, predict(tree_model_orig, orig_testing), positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 942 23
## 1 62 6
##
## Accuracy : 0.9177
## 95% CI : (0.8993, 0.9337)
## No Information Rate : 0.9719
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0878
## Mcnemar's Test P-Value : 3.761e-05
##
## Sensitivity : 0.206897
## Specificity : 0.938247
## Pos Pred Value : 0.088235
## Neg Pred Value : 0.976166
## Prevalence : 0.028074
## Detection Rate : 0.005808
## Detection Prevalence : 0.065828
## Balanced Accuracy : 0.572572
##
## 'Positive' Class : 1
##
confusionMatrix(smaller_testing$class, predict(tree_model_small, smaller_testing), positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 945 20
## 1 64 4
##
## Accuracy : 0.9187
## 95% CI : (0.9003, 0.9346)
## No Information Rate : 0.9768
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0545
## Mcnemar's Test P-Value : 2.71e-06
##
## Sensitivity : 0.166667
## Specificity : 0.936571
## Pos Pred Value : 0.058824
## Neg Pred Value : 0.979275
## Prevalence : 0.023233
## Detection Rate : 0.003872
## Detection Prevalence : 0.065828
## Balanced Accuracy : 0.551619
##
## 'Positive' Class : 1
##
It looks like reducing our features had a negative impact on the performance of the tree algorithm we used. Clearly, feature selection is a bit of an art, and can have both positive and negative consequences. Let’s do a random forest example.
set.seed(42)
rf_model_orig<-train(class ~ ., data=orig_training, method="rf")
confusionMatrix(orig_training$class, predict(rf_model_orig, orig_training), positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1449 0
## 1 101 1
##
## Accuracy : 0.9349
## 95% CI : (0.9214, 0.9466)
## No Information Rate : 0.9994
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0182
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 1.0000000
## Specificity : 0.9348387
## Pos Pred Value : 0.0098039
## Neg Pred Value : 1.0000000
## Prevalence : 0.0006447
## Detection Rate : 0.0006447
## Detection Prevalence : 0.0657640
## Balanced Accuracy : 0.9674194
##
## 'Positive' Class : 1
##
set.seed(42)
rf_model_small<-train(class ~ ., data=smaller_training, method="rf")
confusionMatrix(smaller_training$class, predict(rf_model_small, smaller_training), positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1449 0
## 1 47 55
##
## Accuracy : 0.9697
## 95% CI : (0.9599, 0.9777)
## No Information Rate : 0.9645
## P-Value [Acc > NIR] : 0.1511
##
## Kappa : 0.6862
## Mcnemar's Test P-Value : 1.949e-11
##
## Sensitivity : 1.00000
## Specificity : 0.96858
## Pos Pred Value : 0.53922
## Neg Pred Value : 1.00000
## Prevalence : 0.03546
## Detection Rate : 0.03546
## Detection Prevalence : 0.06576
## Balanced Accuracy : 0.98429
##
## 'Positive' Class : 1
##
Wow, in the random forest example, the PPV went way up when possibly extraneous features were removed! Let’s see, however, how the two models perform on testing data:
confusionMatrix(orig_testing$class, predict(rf_model_orig, orig_testing), positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 965 0
## 1 68 0
##
## Accuracy : 0.9342
## 95% CI : (0.9173, 0.9485)
## No Information Rate : 1
## P-Value [Acc > NIR] : 1
##
## Kappa : 0
## Mcnemar's Test P-Value : 4.476e-16
##
## Sensitivity : NA
## Specificity : 0.93417
## Pos Pred Value : NA
## Neg Pred Value : NA
## Prevalence : 0.00000
## Detection Rate : 0.00000
## Detection Prevalence : 0.06583
## Balanced Accuracy : NA
##
## 'Positive' Class : 1
##
confusionMatrix(smaller_testing$class, predict(rf_model_small, smaller_testing), positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 965 0
## 1 68 0
##
## Accuracy : 0.9342
## 95% CI : (0.9173, 0.9485)
## No Information Rate : 1
## P-Value [Acc > NIR] : 1
##
## Kappa : 0
## Mcnemar's Test P-Value : 4.476e-16
##
## Sensitivity : NA
## Specificity : 0.93417
## Pos Pred Value : NA
## Neg Pred Value : NA
## Prevalence : 0.00000
## Detection Rate : 0.00000
## Detection Prevalence : 0.06583
## Balanced Accuracy : NA
##
## 'Positive' Class : 1
##
Alas, it looks as though in both cases, we have identical, awful results. The random forest algorithm itself may not be the right choice for this data, or we may need to do some model optimization.
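One common tweak for data this unbalanced is to resample inside the training loop, for example by down-sampling the majority class within each resampling iteration. Here’s a sketch of what that might look like with caret’s trainControl() (not run here, and no guarantee it helps):
# Down-sample the non-hazardous class within each resample, and use
# 5-fold cross-validation instead of the default bootstrap.
# (A sketch only -- results will differ from the runs above.)
downCtrl <- trainControl(method = "cv", number = 5, sampling = "down")
set.seed(42)
rf_model_down <- train(class ~ ., data = smaller_training, method = "rf", trControl = downCtrl)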
Feature selection also includes far more complex operations than the mere culling we’ve done here. To learn about those more advanced topics, see Advanced Feature Engineering.
Catch all the intro to machine learning documents: