This is intended to be one of a series of bite-size documents to help you understand what machine learning is and how to do it in R, without knowing a lot of math or computation. The goal is to get you doing machine learning right away, with enough knowledge to get you interested in doing more.
Machine learning has many applications and meanings, but for our purposes, we’ll define it this way: “Machine learning consists of taking some data with a known result, and coming up with good rules about how the result is related to the rest of the data, so that we can apply those rules and make predictions about data for which we don’t know the result.”
This is super simple, obviously, and leaves out lots of important concepts, like signal vs. noise, overfitting, etc. But it’s enough to get started.
It’s also important to note that we’re starting here with a classification question with two possible outcomes (we could say a binomial or binary outcome): hazardous or non-hazardous conditions. There are other kinds of questions, like multi-class classification (with more than two bins) and continuous outcomes (like predicting salary or IQ). Also, we’re starting with a kind of tough classification: the classes are unbalanced, with hazardous conditions much rarer than non-hazardous ones.
Before reading further, you should consider giving Trees a read, to understand what a tree is and how to read a dendrogram. In the trees document, we not only came up with a (not great) model for predicting dangerous situations in a mine, but we could also see that model easily in a dendrogram and explain the decision tree in human language. With random forests, we’ll have many trees, and we’ll (we hope!) get a more accurate model (better at prediction), but at the cost of understandability. There’s no single diagram in a random forest to point to as “the” model.
In the trees document, we talked about the crying baby and the decision tree for figuring out why the baby was crying:
You may have thought my decision tree was a bit flawed: I didn’t include some factors you thought were important, and the “splits” I made (like time since last diaper change) were too high or too low. You would have come up with a different decision tree. Your family members and friends would similarly all have different decision trees, ones that might include different factors (maybe entirely different ones than you or I came up with), different splits, and different numbers of final decisions (leaves). However, given a certain crying baby, if all of us came together in a room with our various methods and decision trees, we could probably come up with a way to decide as a group why the baby is crying: perhaps by consensus, or by voting, or we might establish that some people are better at predicting the reason for crying in older babies and others in younger ones, and so on. This combining of lots of different trees is what makes a “random forest”.
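To make the voting idea concrete, here is a toy, completely made-up illustration in R: five imaginary “trees” each guess why the baby is crying, and the “forest” answers with the majority vote.
# A toy, made-up illustration of the forest-as-vote idea:
# five imaginary "trees" each guess why the baby is crying...
tree_guesses <- c("hungry", "tired", "hungry", "diaper", "hungry")
# ...and the "forest" answers with the majority vote ("hungry" in this case)
forest_vote <- names(which.max(table(tree_guesses)))
forest_vote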
Let’s do a quick example.
There are lots of places to get data – from your own research, or online. For this example, I’m using some data from http://archive.ics.uci.edu/ml/datasets/seismic-bumps. It’s intended to help predict when dangerous seismic events might happen, in order to keep miners safe.
First I’ll get the data. It’s in arff format, so I’ll use the “foreign” package to get it.
library(foreign)
seismic_data<-read.arff("http://archive.ics.uci.edu/ml/machine-learning-databases/00266/seismic-bumps.arff")
Next, let me take a look at the attribute information, in order to understand what the result (or outcome) is and what my predictors are. The list below comes from the original arff file and can also be found on the website mentioned above.
Attribute information:
seismic: result of shift seismic hazard assessment in the mine working obtained by the seismic method (a - lack of hazard, b - low hazard, c - high hazard, d - danger state);
seismoacoustic: result of shift seismic hazard assessment in the mine working obtained by the seismoacoustic method;
shift: information about type of a shift (W - coal-getting, N - preparation shift);
genergy: seismic energy recorded within previous shift by the most active geophone (GMax) out of geophones monitoring the longwall;
gpuls: a number of pulses recorded within previous shift by GMax;
gdenergy: a deviation of energy recorded within previous shift by GMax from average energy recorded during eight previous shifts;
gdpuls: a deviation of a number of pulses recorded within previous shift by GMax from average number of pulses recorded during eight previous shifts;
ghazard: result of shift seismic hazard assessment in the mine working obtained by the seismoacoustic method based on registration coming from GMax only;
nbumps: the number of seismic bumps recorded within previous shift;
nbumps2: the number of seismic bumps (in energy range [10^2,10^3)) registered within previous shift;
nbumps3: the number of seismic bumps (in energy range [10^3,10^4)) registered within previous shift;
nbumps4: the number of seismic bumps (in energy range [10^4,10^5)) registered within previous shift;
nbumps5: the number of seismic bumps (in energy range [10^5,10^6)) registered within the last shift;
nbumps6: the number of seismic bumps (in energy range [10^6,10^7)) registered within previous shift;
nbumps7: the number of seismic bumps (in energy range [10^7,10^8)) registered within previous shift;
nbumps89: the number of seismic bumps (in energy range [10^8,10^10)) registered within previous shift;
energy: total energy of seismic bumps registered within previous shift;
maxenergy: the maximum energy of the seismic bumps registered within previous shift;
class: the decision attribute - ‘1’ means that high energy seismic bump occurred in the next shift (‘hazardous state’), ‘0’ means that no high energy seismic bumps occurred in the next shift (‘non-hazardous state’).
We want to look at attributes 1-18, to see if we can come up with a rule set (a “model”) that will allow us to predict whether attribute 19 turns out to be hazardous or non-hazardous.
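Before we go any further, it can’t hurt to peek at the data as R sees it and confirm the class imbalance mentioned earlier. This is just an optional sanity check you can run yourself (output not shown here):
# Optional sanity check: the structure of the data and the balance of the outcome
str(seismic_data)
table(seismic_data$class)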
Let’s first split up our data. We want to “train” (or create the model) on some data, and test it on a different set of data.
First off, let’s load some libraries we’ll want.
library(lattice)
library(ggplot2)
library(caret)
Then, let’s create a data partition. We have over 2500 discrete data points, so we can split them up into training and testing sets. We’ll do a 60%/40% split:
# We set a seed for the sake of reproducibility
set.seed(42)
# First, we'll pick off the training data:
inTrain<-createDataPartition(y=seismic_data$class, p = 0.60, list=FALSE)
training<-seismic_data[inTrain,]
# Then what's left is testing data.
testing<-seismic_data[-inTrain,]
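One nice thing about createDataPartition() is that it samples within the levels of the outcome, so training and testing should end up with roughly the same proportion of hazardous shifts. If you want to check that for yourself, something like this will do it (output not shown here):
# Check the sizes of the two sets and that the class proportions are similar
nrow(training)
nrow(testing)
prop.table(table(training$class))
prop.table(table(testing$class))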
Our plan is to first train a model on our training data and then measure its accuracy on the same data used to generate it (which carries the risk of overfitting: making a model too contingent on this particular data’s random noise). We will do this a few times using different tree-generation algorithms. Once we get a good enough model, we’ll apply it to the testing data just once (this is super important) and see what kind of error we get, and whether we have a model that’s going to help miners stay safe!
First, we’ll start with the easiest possible model: predicting class from every other variable, using all the default settings and the “rf”, or random forest, algorithm. We’ll use the caret package, because it makes it so easy to try different random forest algorithms (see http://topepo.github.io/caret/train-models-by-tag.html#Random_Forest). Additionally, caret’s defaults include repeated resampling and some automatic tuning of each model. Best of all, caret wraps existing R machine learning packages in a single, standardized approach, so you only have to learn one syntax. It’s handy!
To see how good our model is, we’ll use confusionMatrix. confusionMatrix takes your model’s predictions, compares them to reality (the “reference”), and reports how correct they are. We tell the confusion matrix which category counts as “positive” for things like positive predictive value, sensitivity, specificity, etc. In our case, positive equals 1, or a hazardous condition.
Note that forests take a long time to grow, so be patient if you run this in your own R environment!
set.seed(42)
rf_model_1<-train(class ~ ., data=training, method="rf")
confusionMatrix(predict(rf_model_1, training), reference=training$class, positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1449 101
## 1 0 1
##
## Accuracy : 0.9349
## 95% CI : (0.9214, 0.9466)
## No Information Rate : 0.9342
## P-Value [Acc > NIR] : 0.4855
##
## Kappa : 0.0182
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.0098039
## Specificity : 1.0000000
## Pos Pred Value : 1.0000000
## Neg Pred Value : 0.9348387
## Prevalence : 0.0657640
## Detection Rate : 0.0006447
## Detection Prevalence : 0.0006447
## Balanced Accuracy : 0.5049020
##
## 'Positive' Class : 1
##
93% accuracy? 100% positive predictive value?! That’s awesome! BUT… looking more closely… we have exceptionally low sensitivity. The model took one really clear case of a hazardous condition, called it positive, and ruled everything else non-hazardous. We want to do better than the “eh, it’s probably not hazardous” guess a human could make by ruling almost everything non-hazardous.
We can pause here to go over some of the values displayed in the confusionMatrix output. Which of these values are most important to us?
We could favor sensitivity – catching every hazardous event, even if that means we tag some non-hazardous situations as hazardous. But what if people get so used to false alarms that they disregard alerts that a situation is hazardous?
Or we could favor, say, positive predictive value – the probability we got it right when we said a situation was dangerous. Or negative predictive value – the probability we got it right when we said things were safe.
Or just overall accuracy – the overall percentage of correct predictions.
Ideally, we’d want perfect scores on everything, but in the real world that’s impossible. We have to judge a model’s “success” by one (or more) of these measures.
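To make these measures concrete, here’s how they fall out of the counts in the first confusion matrix above (1449 true negatives, 101 false negatives, 0 false positives, 1 true positive):
# Hand-computing the measures from the first model's confusion matrix
TN <- 1449; FN <- 101; FP <- 0; TP <- 1
TP / (TP + FN)                    # sensitivity: about 0.0098
TN / (TN + FP)                    # specificity: 1
TP / (TP + FP)                    # positive predictive value: 1
TN / (TN + FN)                    # negative predictive value: about 0.9348
(TP + TN) / (TP + TN + FP + FN)   # accuracy: about 0.9349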
So much depends on what our priorities are. Let’s say we want high sensitivity and high positive predictive value. We could do some sophisticated stuff to our train() command to make the model selection more tuned to those metrics, but in order to keep things simple, let’s just manually keep an eye out for the models that do best on sensitivity!
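For the curious, here’s a rough sketch of what that “sophisticated stuff” could look like. This isn’t run anywhere in this post; it assumes we first relabel the outcome, because caret’s twoClassSummary wants class levels that are valid R names:
# A sketch (not run in this post) of asking train() to optimize sensitivity.
# Relabel "1"/"0" as "hazard"/"safe", with "hazard" first, since caret's
# Sens metric is computed for the first factor level.
training_relabeled <- training
training_relabeled$class <- factor(ifelse(training$class == "1", "hazard", "safe"),
                                   levels = c("hazard", "safe"))
# Use 5-fold cross-validation with class probabilities and twoClassSummary,
# then let train() pick the candidate model with the best sensitivity.
ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)
set.seed(42)
rf_sens <- train(class ~ ., data = training_relabeled, method = "rf",
                 metric = "Sens", trControl = ctrl)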
Since our simple rf approach didn’t work that well, let’s try some other random forest models here. We’ll try the random ferns algorithm, a simplified relative of the random forest originally developed for computer vision, which might be helpful here as well:
set.seed(42)
rf_model_2<-train(class ~ ., data=training, method="rFerns")
confusionMatrix(predict(rf_model_2, training), reference=training$class, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1036 24
## 1 413 78
##
## Accuracy : 0.7182
## 95% CI : (0.6951, 0.7405)
## No Information Rate : 0.9342
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.173
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.76471
## Specificity : 0.71498
## Pos Pred Value : 0.15886
## Neg Pred Value : 0.97736
## Prevalence : 0.06576
## Detection Rate : 0.05029
## Detection Prevalence : 0.31657
## Balanced Accuracy : 0.73984
##
## 'Positive' Class : 1
##
This model has sensitivity at 76% and 16% PPV. That’s a good start!
What about weighted subspace random forests? This one might take a long time:
set.seed(42)
rf_model_3<-train(class ~ ., data=training, method="wsrf")
confusionMatrix(predict(rf_model_3, training), reference=training$class, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1449 28
## 1 0 74
##
## Accuracy : 0.9819
## 95% CI : (0.974, 0.988)
## No Information Rate : 0.9342
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8316
## Mcnemar's Test P-Value : 3.352e-07
##
## Sensitivity : 0.72549
## Specificity : 1.00000
## Pos Pred Value : 1.00000
## Neg Pred Value : 0.98104
## Prevalence : 0.06576
## Detection Rate : 0.04771
## Detection Prevalence : 0.04771
## Balanced Accuracy : 0.86275
##
## 'Positive' Class : 1
##
OK, this one is pretty awesome, and we’ll call it our winner in the random forest category. Positive predictive value is 100%, and sensitivity is a respectable 73%!
Let’s apply this model to our testing data and see how we do.
confusionMatrix(predict(rf_model_3, testing), reference = testing$class, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 962 67
## 1 3 1
##
## Accuracy : 0.9322
## 95% CI : (0.9152, 0.9468)
## No Information Rate : 0.9342
## P-Value [Acc > NIR] : 0.6293
##
## Kappa : 0.0206
## Mcnemar's Test P-Value : 5.076e-14
##
## Sensitivity : 0.0147059
## Specificity : 0.9968912
## Pos Pred Value : 0.2500000
## Neg Pred Value : 0.9348882
## Prevalence : 0.0658277
## Detection Rate : 0.0009681
## Detection Prevalence : 0.0038722
## Balanced Accuracy : 0.5057985
##
## 'Positive' Class : 1
##
Drat. This really promising random forest seems very overfitted to the noise in the training data. At only 1.5% sensitivity and 25% PPV, we’ve fallen quite a bit in performance!
One thing to avoid doing here is continuing to tune the random forest and trying other models to see if we can get the testing performance to improve. If we did that, we’d be tuning to the signal and noise of the testing data, essentially turning our testing data into training data.
Instead, if we want to get better at machine learning, we should consider learning a bit about feature selection, advanced feature engineering, and model optimization.
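As a small first step in that direction, caret’s varImp() will tell you which predictors a model leaned on most heavily. For example (a sketch, output not shown here):
# Which predictors mattered most to our first random forest model?
# A starting point for thinking about feature selection.
varImp(rf_model_1)
plot(varImp(rf_model_1))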
To understand a bit more about what we did in this post, you can try:
?wsrf
summary(rf_model_3)
plot(rf_model_3) (What will it plot, I wonder?)
rf_model_2$results
Catch all the intro to machine learning documents: