Intro to Machine Learning for Classification: Logistic Models

This is intended to be one of a series of bite-size documents to help you understand what machine learning is and how to do it in R, without knowing a lot of math or computation. The goal is to get you doing machine learning right away, with enough knowledge to get you interested in doing more.

Machine learning has many applications and meanings, but for our purposes, we’ll define it this way: “Machine learning consists of taking some data with a known result, and coming up with good rules about how the result is related to the rest of the data, so that we can apply those rules and make predictions about data for which we don’t know the result.”

This is super simple, obviously, and leaves out lots of important concepts, like signal vs. noise, overfitting, etc. But it’s enough to get started.

It’s also important to note that we’re starting here with a classification question, with two possible outcomes (we could say a binomial or binary outcome): hazardous or non-hazardous conditions. There are other kinds of questions, like multi-class classification (with more than 2 bins) and continuous outcomes (like predicting salary or IQ). Also, we’re starting with a kind of tough classification problem: the hazardous and non-hazardous classes are quite unbalanced.

Logistic Models

Sometimes we just have two outcomes: male/female (predicting sex through text analysis), disease/healthy (probability of lung cancer following asbestos exposure), pass/fail (comparison of study methods). If we were to plot, say, pass/fail rates based on hours of pre-exam study, we might have data that looks like this:
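
To make that concrete, here’s a minimal sketch using simulated (purely hypothetical) study data: each row is a student, with hours of pre-exam study and a 0/1 pass outcome.

library(ggplot2)

# Simulate 100 hypothetical students: hours studied and a 0/1 pass outcome
set.seed(1)
study <- data.frame(hours = runif(100, min = 0, max = 10))
study$pass <- rbinom(100, size = 1, prob = plogis(-4 + 0.9 * study$hours))

# Plot pass (1) / fail (0) against hours studied
ggplot(study, aes(x = hours, y = pass)) + geom_point(alpha = 0.4)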

Normal linear regression would draw some sort of diagonal line through our points to try to fit a model. We’d have to figure out some point at which to cut the line such that values above some point were considered “passes” and values below that cut “failures”. This is called thresholding, and it can be useful, but in this case, it seems like a poor fit to the data.

Logistic regression, on the other hand, creates a sigmoid (s-shaped) curve that captures the probability of one of two outcome values, like what we see below. Note that we have to make our outcomes equal to 0 and 1, which we did here by making “fail” be 0 and “pass” be 1. The nice thing is that our line never goes “out of bounds” – below 0 or past 1. That mimics our data more closely than linear regression does. We’ll still have some thresholding in the middle bit of the S curve, but the relationship of our model to our data is much better and more intuitive.
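
Continuing with the simulated study data from the sketch above (again, purely an illustration), base R’s glm() with family = binomial fits exactly this kind of logistic curve, and ggplot2 can overlay it on the points:

# Fit a logistic regression to the simulated study data
fit <- glm(pass ~ hours, data = study, family = binomial)
coef(fit)  # the intercept and slope that feed the s-curve

# Scatterplot with the fitted sigmoid overlaid
ggplot(study, aes(x = hours, y = pass)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE)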

If you dig math, you might recognize the shape of this curve as being related to exponentials and logarithms, especially the “natural log”. The actual formula for this curved line is:

\[\sigma(t)=\frac{e^t}{e^t+1} =\frac{1}{1+e^{-t}}\]

Where t is the familiar linear predictor from linear regression: a constant plus a coefficient times our variable (it looks a lot like old-fashioned linear regression, y = mx + b):

\[t=\beta_0 + \beta_1x\]
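
If you want to poke at the sigmoid itself, it’s easy to write as an R function (base R already provides it as plogis()):

# The sigmoid: maps any t to a probability between 0 and 1
sigmoid <- function(t) 1 / (1 + exp(-t))

sigmoid(c(-4, 0, 4))  # very negative t -> near 0; t = 0 -> exactly 0.5; large t -> near 1
curve(sigmoid, from = -6, to = 6)  # the s-shaped curve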

Ok, math mode OFF.

Let’s do a quick example of using logistic models to classify.

There are lots of places to get data – from your own research, or online. For this example, I’m using some data from http://archive.ics.uci.edu/ml/datasets/seismic-bumps. It’s intended to help predict when dangerous seismic events might happen, in order to keep miners safe.

First I’ll get the data. It’s in arff format, so I’ll use the “foreign” package to get it.

library(foreign)
seismic_data <- read.arff("http://archive.ics.uci.edu/ml/machine-learning-databases/00266/seismic-bumps.arff")

Next, let me take a look at the attribute information, in order to understand what the result or outcome is and what my predictors are. The list below comes from the original arff file; it’s also available on the website mentioned above.

Attribute information:

  1. seismic: result of shift seismic hazard assessment in the mine working obtained by the seismic method (a - lack of hazard, b - low hazard, c - high hazard, d - danger state);

  2. seismoacoustic: result of shift seismic hazard assessment in the mine working obtained by the seismoacoustic method;

  3. shift: information about type of a shift (W - coal-getting, N - preparation shift);

  4. genergy: seismic energy recorded within previous shift by the most active geophone (GMax) out of geophones monitoring the longwall;

  5. gpuls: a number of pulses recorded within previous shift by GMax;

  6. gdenergy: a deviation of energy recorded within previous shift by GMax from average energy recorded during eight previous shifts;

  7. gdpuls: a deviation of a number of pulses recorded within previous shift by GMax from average number of pulses recorded during eight previous shifts;

  8. ghazard: result of shift seismic hazard assessment in the mine working obtained by the seismoacoustic method based on registration coming from GMax only;

  9. nbumps: the number of seismic bumps recorded within previous shift;

  10. nbumps2: the number of seismic bumps (in energy range [10^2,10^3)) registered within previous shift;

  11. nbumps3: the number of seismic bumps (in energy range [10^3,10^4)) registered within previous shift;

  12. nbumps4: the number of seismic bumps (in energy range [10^4,10^5)) registered within previous shift;

  13. nbumps5: the number of seismic bumps (in energy range [10^5,10^6)) registered within the last shift;

  14. nbumps6: the number of seismic bumps (in energy range [10^6,10^7)) registered within previous shift;

  15. nbumps7: the number of seismic bumps (in energy range [10^7,10^8)) registered within previous shift;

  16. nbumps89: the number of seismic bumps (in energy range [10^8,10^10)) registered within previous shift;

  17. energy: total energy of seismic bumps registered within previous shift;

  18. maxenergy: the maximum energy of the seismic bumps registered within previous shift;

  19. class: the decision attribute - ‘1’ means that high energy seismic bump occurred in the next shift (‘hazardous state’), ‘0’ means that no high energy seismic bumps occurred in the next shift (‘non-hazardous state’).

We want to look at attributes 1-18, to see if we can come up with a rule set (a “model”) that will allow us to predict whether attribute 19 turns out to be hazardous or non-hazardous.
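
Before modeling, it’s worth a quick peek at just how unbalanced that outcome is (output not shown here; the counts come straight from the downloaded data):

# How many shifts fall into each class? ("1" = hazardous is the rare one)
table(seismic_data$class)
prop.table(table(seismic_data$class))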

Let’s first split up our data. We want to “train” (or create the model) on some data, and test it on a different set of data.

First off, let’s load some libraries we’ll want.

library(lattice)
library(ggplot2)
library(caret)

Then, let’s create a data partition. We have over 2500 observations, so we can afford to split them into a training set and a testing set. We’ll do a 60/40 split:

# We set a seed for the sake of reproducibility
set.seed(42)

# First, we'll pick off the training data: 
inTrain <- createDataPartition(y = seismic_data$class, p = 0.60, list = FALSE)
training <- seismic_data[inTrain, ]

# Then what's left is testing data.
testing <- seismic_data[-inTrain, ]
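
An optional sanity check: createDataPartition() samples within each class, so the hazardous/non-hazardous proportions should come out roughly the same in both pieces.

# Check the sizes and class balance of the two pieces
nrow(training)
nrow(testing)
prop.table(table(training$class))
prop.table(table(testing$class))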

Our plan is to first train a logistic model on our training data and then measure its accuracy on that same data (which is where the risk of overfitting comes in: making a model too contingent on this particular data’s random noise). We’ll do this a few times using different algorithms. Once we have a good enough model, we’ll apply it to the testing data just once (this is super important), see what kind of error we get, and find out whether we have a model that’s going to help miners stay safe!

Generalized Linear Model Algorithm (glm)

First, we’ll start with the easiest possible model: predicting class from every other variable, using all the default settings. In our case, we want to try with a “glm” algorithm.

The nice thing about GLM (a generalized linear model) is that it can handle different “families” of outcomes (in our case, a binomial, or two-class, family), and caret figures out the right family from the data. So even though the method is called “linear”, with a two-level outcome it actually fits a logistic model.

We’ll use the caret package, because it makes it so easy to try different logistic algorithms (see http://topepo.github.io/caret/train-models-by-tag.html#Logistic_Regression). And to see how good our model is, we’ll use confusionMatrix. Confusion matrix takes reality, compares it to your model’s prediction, and sees how correct it is. We tell the confusion matrix which category counts as “positive” for things like positive predictive value, specificity, etc. In our case, positive equals 1, or a hazardous condition.

set.seed(42)
logistic_model_1 <- train(class ~ ., data = training, method = "glm")
confusionMatrix(predict(logistic_model_1, training), reference=training$class, positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1440   98
##          1    9    4
##                                           
##                Accuracy : 0.931           
##                  95% CI : (0.9172, 0.9431)
##     No Information Rate : 0.9342          
##     P-Value [Acc > NIR] : 0.7169          
##                                           
##                   Kappa : 0.0555          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.039216        
##             Specificity : 0.993789        
##          Pos Pred Value : 0.307692        
##          Neg Pred Value : 0.936281        
##              Prevalence : 0.065764        
##          Detection Rate : 0.002579        
##    Detection Prevalence : 0.008382        
##       Balanced Accuracy : 0.516502        
##                                           
##        'Positive' Class : 1               
## 

93% accuracy? That’s awesome! But what do all those values mean? Which is the most important?

We could favor sensitivity – catching every hazardous event, even if that means we tag some non-hazardous situations as hazardous. But what if people get so used to false alarms that they disregard alerts that a situation is hazardous?

Or we could favor, say, positive predictive value – the probability we got it right when we said a situation was dangerous. Or negative predictive value – the probability we got it right when we said things were safe.

Or just overall accuracy – the overall percentage of correct predictions.

Ideally, we’d want perfect scores on everything, but in the real world, this is impossible. We have to measure model “success” by one (or more) of these measures.

So much depends on what our priorities are. Let’s say we want high sensitivity and high positive predictive value. We could do some sophisticated stuff to our train() command to make the model selection more tuned to those metrics (there’s a sketch of that just below), but in order to keep things simple, let’s just manually keep an eye out for the models that do best on sensitivity!
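
For the curious, here’s a hedged sketch (not run here) of what that more sophisticated approach could look like: caret can be told to rank candidate models by sensitivity via twoClassSummary and metric = "Sens". Note that classProbs = TRUE requires factor levels that are valid R variable names, so the 0/1 outcome has to be relabeled first; the "hazard"/"nohazard" labels below are just placeholders I’m making up for illustration.

# Sketch only: steer caret's model selection toward sensitivity.
# classProbs = TRUE needs valid R names as factor levels, so relabel 0/1 first.
training2 <- training
training2$class <- factor(ifelse(training$class == "1", "hazard", "nohazard"),
                          levels = c("hazard", "nohazard"))  # first level = the "event"

ctrl <- trainControl(classProbs = TRUE, summaryFunction = twoClassSummary)

# With glm there's nothing to tune, but for methods with tuning parameters
# this would pick the most sensitive candidate.
set.seed(42)
sens_model <- train(class ~ ., data = training2, method = "glm",
                    trControl = ctrl, metric = "Sens")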

So far, with this model, we have 3.9% sensitivity (so we’re catching less than 4% of the hazardous events in our predictions!) and 31% positive predictive value (when the alarm sounds, which isn’t nearly often enough given that low sensitivity, it has a 31% chance of being right). We’re still exposing our miners to lots of un-caught hazardous conditions, so let’s keep trying to see if we can get a better model.
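
Those two numbers come straight from the confusion matrix above: sensitivity is the true positives divided by all of the actual hazards, and positive predictive value is the true positives divided by all of the predicted hazards.

# Reading off the matrix above: 4 true positives, 98 missed hazards, 9 false alarms
4 / (4 + 98)  # sensitivity, about 0.039
4 / (4 + 9)   # positive predictive value, about 0.31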

Penalized Logistic Regression Algorithm (plr)

Let’s try another logistic model: penalized logistic regression. Fitting this one produces a lot of console output, so I’ve suppressed the messages from the model-fitting step (so you don’t have to look at them here) and kept just the confusion matrix.

set.seed(42)
logistic_model_2 <- train(class ~ ., data = training, method = "plr")
confusionMatrix(predict(logistic_model_2, training), reference=training$class, positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1403  101
##          1   46    1
##                                           
##                Accuracy : 0.9052          
##                  95% CI : (0.8895, 0.9193)
##     No Information Rate : 0.9342          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : -0.0293         
##  Mcnemar's Test P-Value : 8.435e-06       
##                                           
##             Sensitivity : 0.0098039       
##             Specificity : 0.9682540       
##          Pos Pred Value : 0.0212766       
##          Neg Pred Value : 0.9328457       
##              Prevalence : 0.0657640       
##          Detection Rate : 0.0006447       
##    Detection Prevalence : 0.0303030       
##       Balanced Accuracy : 0.4890289       
##                                           
##        'Positive' Class : 1               
## 

Oh, wow, this is really bad. Minimal sensitivity and positive predictive value. Eeek.

Let’s try one more, based on Bayesian stats:

Bayesian GLM Algorithm (bayesglm)

set.seed(42)
logistic_model_3 <- train(class ~ ., data = training, method = "bayesglm")
confusionMatrix(predict(logistic_model_3, training), reference=training$class, positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1440   99
##          1    9    3
##                                           
##                Accuracy : 0.9304          
##                  95% CI : (0.9165, 0.9425)
##     No Information Rate : 0.9342          
##     P-Value [Acc > NIR] : 0.7499          
##                                           
##                   Kappa : 0.0393          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.029412        
##             Specificity : 0.993789        
##          Pos Pred Value : 0.250000        
##          Neg Pred Value : 0.935673        
##              Prevalence : 0.065764        
##          Detection Rate : 0.001934        
##    Detection Prevalence : 0.007737        
##       Balanced Accuracy : 0.511600        
##                                           
##        'Positive' Class : 1               
## 

Yikes, not great: 3% sensitivity and 25% PPV. In a fuller exposition of machine learning, we’d want to consider doing some feature engineering, or optimizing our model with cross-validation or pre-processing of our data. That’s a bit out of scope for this document, though. So let’s just call our first attempt a qualified “winner” (but still not fantastic!).
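
If you do want to go down that road, here’s a minimal sketch (not run or tuned here) of what adding cross-validation and pre-processing to the train() call might look like:

# Sketch only: 10-fold cross-validation plus centering and scaling of the predictors
set.seed(42)
cv_model <- train(class ~ ., data = training, method = "glm",
                  trControl = trainControl(method = "cv", number = 10),
                  preProcess = c("center", "scale"))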

Our final step, once we’ve come up with a “winner”, is to apply it to our testing data. Our model won’t be as accurate on the testing data as it is on the training data. Let’s think about why: any model is going to do well on the data it trained on, since it’s tuned to both the signal (the predictive relationship we’re trying to identify, which should be the same in training and testing) and the noise (the random variation, which will differ between training and testing). The more our model is tuned to random noise, the bigger the drop in performance between training and testing data.

Let’s see how much sensitivity and positive predictive value we lose, when we apply our model to the testing data.

confusionMatrix(predict(logistic_model_1, testing), reference=testing$class, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 956  65
##          1   9   3
##                                           
##                Accuracy : 0.9284          
##                  95% CI : (0.9109, 0.9433)
##     No Information Rate : 0.9342          
##     P-Value [Acc > NIR] : 0.7944          
##                                           
##                   Kappa : 0.0564          
##  Mcnemar's Test P-Value : 1.62e-10        
##                                           
##             Sensitivity : 0.044118        
##             Specificity : 0.990674        
##          Pos Pred Value : 0.250000        
##          Neg Pred Value : 0.936337        
##              Prevalence : 0.065828        
##          Detection Rate : 0.002904        
##    Detection Prevalence : 0.011617        
##       Balanced Accuracy : 0.517396        
##                                           
##        'Positive' Class : 1               
## 

Our sensitivity went up a tiny bit, to 4.4%, once we applied our model to testing data, but the positive predictive value declined, to 25%. None of these models seemed to do the trick for us.

One thing to avoid here is continuing to tune the logistic approach and trying other models to see if we can get the testing performance to improve. If we did that, we’d be tuning to signal and noise again, essentially turning our testing data into training data.

Learn more!

Want to know more about logistic regression? The Wikipedia page is actually pretty good!

You may also be interested in seeing the logistic regression models available in caret, listed at http://topepo.github.io/caret/train-models-by-tag.html#Logistic_Regression.

Catch all the intro to machine learning documents in this series.