Intro to Machine Learning for Classification: Linear Models

This is intended to be one of a series of bite-size documents to help you understand what machine learning is and how to do it in R, without knowing a lot of math or computation. The goal is to get you doing machine learning right away, with enough knowledge to get you interested in doing more.

Machine learning has many applications and meanings, but for our purposes, we’ll define it this way: “Machine learning consists of taking some data with a known result, and coming up with good rules about how the result is related to the rest of the data, so that we can apply those rules and make predictions about data for which we don’t know the result.”

This is super simple, obviously, and leaves out lots of important concepts, like signal vs. noise, overfitting, etc. But it’s enough to get started.

It’s also important to note that we’re starting here with a classification question with two possible outcomes (a binomial, or binary, outcome): hazardous or non-hazardous conditions. There are other kinds of questions, like multi-class classification (with more than two classes) and continuous outcomes (like predicting salary or IQ). Also, we’re starting with a somewhat tough classification problem: the prevalences of hazardous and non-hazardous conditions are unbalanced.

Linear Models

Linear models are probably familiar to you from algebra: y = mx + b. Want to know how much a waiter makes? Take his hourly wage, multiply it by the number of hours worked, and add the tips. Want to predict your risk of lung cancer? Take your number of pack-years of smoking and multiply it by some value, then add a certain amount if you’ve been exposed to asbestos, then add the number of chest x-rays and CT scans you’ve had in your life multiplied by some small number. Obviously I’m simplifying, but the application is (hopefully) clear. Linear models are constructed by taking variables, weighting their importance by multiplying them by larger or smaller coefficients, adding a constant value or three, and taking the cumulative sum of all of that. You get the idea.

While linear models are usually used for continuous outcomes rather than for classification (that’s the realm of logistic regression), you can use linear models for classification as well.
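To make that concrete, here’s a minimal sketch in R, with entirely made-up numbers and variable names, just to illustrate the idea: a prediction is a weighted sum of inputs, and if you want a classification out of it, you can threshold the result.

# A linear model is just a weighted sum: multiply each input by a coefficient,
# add a constant, and total it up. (All numbers here are invented.)
hours_worked <- 30
tips <- 150
waiter_pay <- 12 * hours_worked + tips   # y = mx + b: 12 * 30 + 150 = 510

# One simple way to use a linear model for classification: fit an ordinary
# linear regression against a 0/1 outcome, then threshold the prediction.
toy <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                  y = c(0, 0, 0, 1, 1, 1))
fit <- lm(y ~ x, data = toy)
predicted_class <- ifelse(predict(fit, toy) >= 0.5, 1, 0)
predicted_class   # for this toy data: 0 0 0 1 1 1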

Let’s do a quick example.

There are lots of places to get data – from your own research, or online. For this example, I’m using some data from http://archive.ics.uci.edu/ml/datasets/seismic-bumps. It’s intended to help predict when dangerous seismic events might happen, in order to keep miners safe.

First I’ll get the data. It’s in arff format, so I’ll use the “foreign” package to read it.

library(foreign)
seismic_data<-read.arff("http://archive.ics.uci.edu/ml/machine-learning-databases/00266/seismic-bumps.arff")
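Before going further, it’s worth a quick sanity check on what read.arff handed back (output omitted here):

# Quick look at the dimensions and the column types of the data we just read
dim(seismic_data)
str(seismic_data)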

Next, let me take a look at the attribute information, in order to understand what the result or outcome is, and what my predictors are. The list below comes from the original arff file; it’s also available on the website mentioned above.

Attribute information:

  1. seismic: result of shift seismic hazard assessment in the mine working obtained by the seismic method (a - lack of hazard, b - low hazard, c - high hazard, d - danger state);

  2. seismoacoustic: result of shift seismic hazard assessment in the mine working obtained by the seismoacoustic method;

  3. shift: information about type of a shift (W - coal-getting, N - preparation shift);

  4. genergy: seismic energy recorded within previous shift by the most active geophone (GMax) out of geophones monitoring the longwall;

  5. gpuls: a number of pulses recorded within previous shift by GMax;

  6. gdenergy: a deviation of energy recorded within previous shift by GMax from average energy recorded during eight previous shifts;

  7. gdpuls: a deviation of a number of pulses recorded within previous shift by GMax from average number of pulses recorded during eight previous shifts;

  8. ghazard: result of shift seismic hazard assessment in the mine working obtained by the seismoacoustic method based on registration coming from GMax only;

  9. nbumps: the number of seismic bumps recorded within previous shift;

  10. nbumps2: the number of seismic bumps (in energy range [10^2, 10^3)) registered within previous shift;

  11. nbumps3: the number of seismic bumps (in energy range [10^3, 10^4)) registered within previous shift;

  12. nbumps4: the number of seismic bumps (in energy range [10^4, 10^5)) registered within previous shift;

  13. nbumps5: the number of seismic bumps (in energy range [10^5, 10^6)) registered within the last shift;

  14. nbumps6: the number of seismic bumps (in energy range [10^6, 10^7)) registered within previous shift;

  15. nbumps7: the number of seismic bumps (in energy range [10^7, 10^8)) registered within previous shift;

  16. nbumps89: the number of seismic bumps (in energy range [10^8, 10^10)) registered within previous shift;

  17. energy: total energy of seismic bumps registered within previous shift;

  18. maxenergy: the maximum energy of the seismic bumps registered within previous shift;

  19. class: the decision attribute - ‘1’ means that a high energy seismic bump occurred in the next shift (‘hazardous state’), ‘0’ means that no high energy seismic bumps occurred in the next shift (‘non-hazardous state’).

We want to look at attributes 1-18, to see if we can come up with a rule set (a “model”) that will allow us to predict whether attribute 19 turns out to be hazardous or non-hazardous.

Let’s first split up our data. We want to “train” (or create the model) on some data, and test it on a different set of data.

First off, let’s load some libraries we’ll want.

library(lattice)
library(ggplot2)
library(caret)

Then, let’s create a data partition. We have over 2500 data points, so we can comfortably split them into training and testing sets. We’ll do a 60/40 split:

# We set a seed for the sake of reproducibility
set.seed(42)

# First, we'll pick off the training data: 
inTrain<-createDataPartition(y=seismic_data$class, p = 0.60, list=FALSE)
training<-seismic_data[inTrain,]

# Then what's left is testing data.
testing<-seismic_data[-inTrain,]
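Because createDataPartition samples within each level of class, the split should roughly preserve the (very lopsided) proportion of hazardous shifts in both sets. A quick, optional check:

# The class balance should look about the same in training and testing
prop.table(table(training$class))
prop.table(table(testing$class))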

Our plan is to first train a linear model on our training data and then measure its accuracy on the same data used to generate the model (which carries the risk of overfitting: making a model too contingent on this particular data’s random noise). We will do this a few times using different algorithms. Once we have a good enough model, we’ll apply it to the testing data just once (this is super important) and see what kind of error we get, and whether we have a model that’s going to help miners stay safe!

LDA

First, we’ll start with the easiest possible model: predicting class from every other variable, using all the default settings. We’ll use the caret package, because it makes it so easy to try different linear algorithms (see http://topepo.github.io/caret/Linear_Classifier.html). To see how good our model is, we’ll use confusionMatrix, which compares your model’s predictions to reality and reports how correct they are. We tell it which category counts as “positive” for statistics like positive predictive value, sensitivity, and specificity. In our case, positive equals 1, a hazardous condition.

set.seed(42)
linear_model_1<-train(class ~ ., data=training, method="lda")
confusionMatrix(training$class, predict(linear_model_1, training), positive="1")

Argh, we got an error! Let’s see why. Google is your friend in things like this! In this case, the problem is that I have some variables that are all the same, with zero variance. I figured this out by looking at the summary:

summary(training)
##  seismic  seismoacoustic shift       genergy            gpuls       
##  a:1028   a:940          N: 549   Min.   :    100   Min.   :   3.0  
##  b: 523   b:582          W:1002   1st Qu.:  12075   1st Qu.: 194.0  
##           c: 29                   Median :  25710   Median : 379.0  
##                                   Mean   :  91877   Mean   : 548.1  
##                                   3rd Qu.:  53795   3rd Qu.: 674.5  
##                                   Max.   :2595650   Max.   :4518.0  
##     gdenergy           gdpuls        ghazard      nbumps      
##  Min.   : -96.00   Min.   :-96.000   a:1399   Min.   :0.0000  
##  1st Qu.: -37.00   1st Qu.:-36.000   b: 134   1st Qu.:0.0000  
##  Median :  -6.00   Median : -6.000   c:  18   Median :0.0000  
##  Mean   :  11.94   Mean   :  5.282            Mean   :0.8627  
##  3rd Qu.:  37.50   3rd Qu.: 33.000            3rd Qu.:1.0000  
##  Max.   :1245.00   Max.   :838.000            Max.   :9.0000  
##     nbumps2         nbumps3          nbumps4           nbumps5        
##  Min.   :0.000   Min.   :0.0000   Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.000000  
##  Median :0.000   Median :0.0000   Median :0.00000   Median :0.000000  
##  Mean   :0.392   Mean   :0.3978   Mean   :0.06899   Mean   :0.003224  
##  3rd Qu.:1.000   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:0.000000  
##  Max.   :8.000   Max.   :7.0000   Max.   :3.00000   Max.   :1.000000  
##     nbumps6     nbumps7     nbumps89     energy         maxenergy     
##  Min.   :0   Min.   :0   Min.   :0   Min.   :     0   Min.   :     0  
##  1st Qu.:0   1st Qu.:0   1st Qu.:0   1st Qu.:     0   1st Qu.:     0  
##  Median :0   Median :0   Median :0   Median :     0   Median :     0  
##  Mean   :0   Mean   :0   Mean   :0   Mean   :  4787   Mean   :  4115  
##  3rd Qu.:0   3rd Qu.:0   3rd Qu.:0   3rd Qu.:  2950   3rd Qu.:  2000  
##  Max.   :0   Max.   :0   Max.   :0   Max.   :300000   Max.   :300000  
##  class   
##  0:1449  
##  1: 102  
##          
##          
##          
## 

See how nbumps6, nbumps7, and nbumps89 are uniform (every value is 0)? Let’s remove them from the model, since LDA can’t do anything with zero-variance predictors. We’re dipping a bit into feature selection here, which is an important topic, but we’ll leave further discussion aside for now.
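As an aside, caret can flag columns like these for us. Here’s a short sketch using its nearZeroVar() function; the zeroVar column marks predictors that take only a single value, which for this training set should be exactly our three offenders.

# Flag zero-variance (and near-zero-variance) predictors programmatically
nzv <- nearZeroVar(training, saveMetrics = TRUE)
nzv[nzv$zeroVar, ]   # should list nbumps6, nbumps7, and nbumps89

Either way, let’s refit the LDA with those three columns dropped from the formula.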

set.seed(42)
linear_model_1<-train(class ~ . - nbumps6 - nbumps7 - nbumps89, data=training, method="lda")
confusionMatrix(training$class, predict(linear_model_1, training), positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1415   34
##          1   82   20
##                                          
##                Accuracy : 0.9252         
##                  95% CI : (0.911, 0.9378)
##     No Information Rate : 0.9652         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.2209         
##  Mcnemar's Test P-Value : 1.278e-05      
##                                          
##             Sensitivity : 0.37037        
##             Specificity : 0.94522        
##          Pos Pred Value : 0.19608        
##          Neg Pred Value : 0.97654        
##              Prevalence : 0.03482        
##          Detection Rate : 0.01289        
##    Detection Prevalence : 0.06576        
##       Balanced Accuracy : 0.65780        
##                                          
##        'Positive' Class : 1              
## 

Well, we have reasonable sensitivity and PPV, at 37% and 20%.
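By the way, if you’d rather pull those numbers out programmatically than read them off the printout, confusionMatrix() returns an object that stores them in a named vector called byClass:

# Save the confusion matrix object and extract individual statistics from it
cm <- confusionMatrix(training$class, predict(linear_model_1, training), positive = "1")
cm$byClass[c("Sensitivity", "Pos Pred Value")]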

Let’s do one more, shall we? We’ll keep the improved feature set and try a “stabilized LDA”.

set.seed(42)
linear_model_2<-train(class ~ . - nbumps6 - nbumps7 - nbumps89, data=training, method="slda")
confusionMatrix(training$class, predict(linear_model_2, training), positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1421   28
##          1   87   15
##                                           
##                Accuracy : 0.9259          
##                  95% CI : (0.9117, 0.9384)
##     No Information Rate : 0.9723          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1747          
##  Mcnemar's Test P-Value : 6.354e-08       
##                                           
##             Sensitivity : 0.348837        
##             Specificity : 0.942308        
##          Pos Pred Value : 0.147059        
##          Neg Pred Value : 0.980676        
##              Prevalence : 0.027724        
##          Detection Rate : 0.009671        
##    Detection Prevalence : 0.065764        
##       Balanced Accuracy : 0.645572        
##                                           
##        'Positive' Class : 1               
## 

OK, this time around we have almost 35% sensitivity and 14.7% PPV, so we’re close to the regular LDA’s performance, but the regular LDA (linear_model_1) is still better. We’ll call it our winner.

Our final step, once we’ve come up with a “winner”, is to apply it to our testing data. Our model won’t be as accurate on the testing data as it was on the training data. Let’s think about why: any model will do well on the data it trained on, since it’s tuned to both the signal (the predictive relationship we’re trying to identify, which is the same in training and testing data) and the noise (random variation, which differs between training and testing data). The more our model is tuned to random noise, the bigger the drop in performance between training and testing data.

Let’s see how much sensitivity and positive predictive value we lose, when we apply our model to the testing data.

confusionMatrix(testing$class, predict(linear_model_1, testing), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 933  32
##          1  56  12
##                                           
##                Accuracy : 0.9148          
##                  95% CI : (0.8961, 0.9311)
##     No Information Rate : 0.9574          
##     P-Value [Acc > NIR] : 1.00000         
##                                           
##                   Kappa : 0.1714          
##  Mcnemar's Test P-Value : 0.01421         
##                                           
##             Sensitivity : 0.27273         
##             Specificity : 0.94338         
##          Pos Pred Value : 0.17647         
##          Neg Pred Value : 0.96684         
##              Prevalence : 0.04259         
##          Detection Rate : 0.01162         
##    Detection Prevalence : 0.06583         
##       Balanced Accuracy : 0.60805         
##                                           
##        'Positive' Class : 1               
## 

Our sensitivity went down pretty dramatically, to 27%, once we applied our model to testing data, but the positive predictive value is still pretty close to what we had before, at 17.6%. It’s not perfect, to be sure, but it’s not a bad performance based on what we’ve seen up to now.

One thing to avoid here is continuing to tune the linear approach and trying other models just to see if we can get the testing performance to improve. If we did that, we’d be tuning to signal and noise all over again, and we’d essentially be turning our testing data into training data.
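If you do want to keep comparing candidate models, the safer approach is to estimate out-of-sample performance using only the training data, for example with cross-validation, and to touch the testing data exactly once at the very end. Here’s a sketch of what that might look like in caret (the settings are just illustrative):

# Estimate performance with 10-fold cross-validation on the training data only
set.seed(42)
cv_control <- trainControl(method = "cv", number = 10)
cv_model <- train(class ~ . - nbumps6 - nbumps7 - nbumps89,
                  data = training, method = "lda",
                  trControl = cv_control)
cv_model   # accuracy and kappa here are averaged over the held-out folds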

Learn more!

Catch all the intro to machine learning documents: