Lab 12: Decision Trees II: Toy Example and Detecting Fraudulent Transactions

Learning Objectives

  • understanding the nature of the decision tree algorithm
  • estimating a decision tree using the C5.0 function
  • making predictions based on a decision tree
  • improving algorithm performance by boosting

1. A toy example

Let’s first apply decision trees to a toy data set. We have two variables: Insp, which we want to predict/classify, and rel_uprice, which we will use to classify whether a transaction is “ok” or “fraud.”

library(C50)
library(dplyr)
toy <- data.frame(Insp=c("ok","ok","ok", "fraud","ok", "fraud"),
                  rel_uprice=c(1,2,2,90,100,110))
toy
##    Insp rel_uprice
## 1    ok          1
## 2    ok          2
## 3    ok          2
## 4 fraud         90
## 5    ok        100
## 6 fraud        110

The decision tree algorithm uses the predictor variable to split the data set into parts such that each part is relatively homogeneous in terms of the classes it contains. Our data is already sorted by relative unit price. Notice that if we split the data at a unit price of 2, one part (price less than or equal to 2) is completely homogeneous (all ok transactions) and the other part (price greater than 2) is two-thirds fraudulent. Let’s see what the C5.0 algorithm, part of the C50 package, thinks. It takes a data frame of predictors as its first argument and a vector of classes as the second argument.
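
As a quick sanity check before running C5.0, we can tally the class mix on each side of a split at rel_uprice = 2 ourselves. A small dplyr sketch:

toy %>%
  mutate(side = ifelse(rel_uprice <= 2, "price <= 2", "price > 2")) %>% # label each row by side of the split
  count(side, Insp)                                                     # rows per side/class combination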

tree <- C5.0(select(toy, rel_uprice),toy$Insp)
summary(tree)
## 
## Call:
## C5.0.default(x = select(toy, rel_uprice), y = toy$Insp)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Wed Feb 08 09:17:35 2017
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 6 cases (2 attributes) from undefined.data
## 
## Decision tree:
## 
## rel_uprice <= 2: ok (3)
## rel_uprice > 2: fraud (3/1)
## 
## 
## Evaluation on training data (6 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##       2    1(16.7%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##       2          (a): class fraud
##       1     3    (b): class ok
## 
## 
##  Attribute usage:
## 
##  100.00% rel_uprice
## 
## 
## Time: 0.0 secs

Well, we were right. The algorithm says that if the price is less than or equal to 2, the transaction is classified as ok; if it is greater than 2, it is classified as fraud. Thus, we have a tree with two branches. The numbers in parentheses tell us how many observations fall into each branch and, following the “/”, how many of those were classified incorrectly. There are 3 observations in the first (price <= 2) branch, none of which were incorrectly classified. In the second branch (price > 2) there are 3 observations, of which 1 was incorrectly classified. Overall, we have one incorrectly classified observation (an error rate of 16.7%).
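
If you prefer a picture, C50 also supplies a plot() method for fitted trees (it draws the tree via the partykit package):

plot(tree) # draws the two-branch tree with class counts in the leaves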

IN-CLASS EXERCISE 1: Take a look at the data below. What do you think the decision tree should look like if we use rel_uprice and missing as predictors?

toy <- data.frame(Insp=c("ok","ok","ok", "fraud", "ok", "fraud", "fraud", "fraud"),
                  rel_uprice=c(1,2,2,90,100,110, NA, NA ),
                  missing=c("no missing", "no missing", "no missing", "no missing", 
                            "no missing", "no missing", "Val missing", "Val missing"))
toy
##    Insp rel_uprice     missing
## 1    ok          1  no missing
## 2    ok          2  no missing
## 3    ok          2  no missing
## 4 fraud         90  no missing
## 5    ok        100  no missing
## 6 fraud        110  no missing
## 7 fraud         NA Val missing
## 8 fraud         NA Val missing

2. Application to detecting fraudulent transactions

Let’s load data on sales reports from lab 11. Here we will use only transactions that have been inspected.

inspected <- read.csv("https://www.dropbox.com/s/1kt7nibpth1e7pi/inspected.csv?raw=1")

Let’s now create training and testing data sets. We will use the runif() function to assign each observation a random number, as we did in the past, and use an 80/20 split.

set.seed(364) #set the seed for reproducibility
inspected$rand <- runif(nrow(inspected))
train <- filter(inspected, rand<0.8)
test <-  filter(inspected, rand>=0.8)

Since fraudulent transactions are pretty infrequent, and our inspected data set is not particularly large, we should check that we have roughly the same proportion of fraudulent transactions in both the training and testing data sets.

prop.table(table(train$Insp))
## 
##      fraud         ok 
## 0.08165539 0.91834461
prop.table(table(test$Insp))
## 
##      fraud         ok 
## 0.07704452 0.92295548

3. Estimate a decision tree

The function C5.0 takes a data frame of predictor variables as its first argument. (Note that C5.0 sometimes needs the first argument to be explicitly declared a data frame using the data.frame() function; select() returns something very similar to a data frame but not always exactly of class data.frame.) The second argument is a vector of classes in the training data set.
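
You can see why the wrapper matters by checking the class of each object; depending on your dplyr version and data, select() may return a tbl rather than a plain data frame:

class(select(train, missing, rel_uprice))             # possibly a tbl_df/tibble
class(data.frame(select(train, missing, rel_uprice))) # always "data.frame"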

tree <- C5.0(data.frame(select(train, missing, rel_uprice)),train$Insp)
summary(tree)
## 
## Call:
## C5.0.default(x = data.frame(select(train, missing, rel_uprice)), y
##  = train$Insp)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Wed Feb 08 09:17:38 2017
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 12565 cases (3 attributes) from undefined.data
## 
## Decision tree:
## 
## rel_uprice <= -146.3251:
## :...missing in {both missing,Quant missing}: ok (4.8/1)
## :   missing = Val missing: fraud (1.5/0.2)
## :   missing = no missing:
## :   :...rel_uprice <= -154.9556: fraud (449/70)
## :       rel_uprice > -154.9556:
## :       :...rel_uprice <= -147.258: ok (47/16)
## :           rel_uprice > -147.258: fraud (5)
## rel_uprice > -146.3251:
## :...missing = Val missing: fraud (35.5/3.8)
##     missing in {both missing,no missing,Quant missing}:
##     :...rel_uprice <= 162.6781: ok (11408.3/230.7)
##         rel_uprice > 162.6781:
##         :...rel_uprice > 187.716: fraud (255.4/17.9)
##             rel_uprice <= 187.716:
##             :...rel_uprice <= 177.4104: ok (233.2/59.5)
##                 rel_uprice > 177.4104: fraud (125.2/60.9)
## 
## 
## Evaluation on training data (12565 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##      10  458( 3.6%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     718   308    (a): class fraud
##     150 11389    (b): class ok
## 
## 
##  Attribute usage:
## 
##  100.00% missing
##   98.75% rel_uprice
## 
## 
## Time: 0.0 secs

We grew a tree with 10 branches, a pretty big tree that is hard to interpret. Moreover, some of the splits may be specific to the training data set, which would cause overfitting. Still, notice that the biggest branch is where the price is greater than -146 and less than or equal to 162. It is classified as ok and includes over 11 thousand cases, of which only about 230 are actually fraudulent and therefore misclassified. This is very consistent with our data exploration in lab 11.

We could get a simpler tree by including the minCases= option. This option requires that at least two of the splits contain at least the specified number of cases.

tree2 <- C5.0(data.frame(select(train, missing, rel_uprice)),train$Insp,
             control=C5.0Control(minCases=500))
summary(tree2)
## 
## Call:
## C5.0.default(x = data.frame(select(train, missing, rel_uprice)), y
##  = train$Insp, control = C5.0Control(minCases = 500))
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Wed Feb 08 09:17:38 2017
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 12565 cases (3 attributes) from undefined.data
## 
## Decision tree:
## 
## rel_uprice <= -146.3251: fraud (507.3/105)
## rel_uprice > -146.3251:
## :...missing = Val missing: fraud (35.5/3.8)
##     missing in {both missing,no missing,Quant missing}:
##     :...rel_uprice <= 162.6781: ok (11408.3/230.7)
##         rel_uprice > 162.6781: fraud (613.9/252.6)
## 
## 
## Evaluation on training data (12565 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##       4  586( 4.7%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     793   233    (a): class fraud
##     353 11186    (b): class ok
## 
## 
##  Attribute usage:
## 
##   98.75% rel_uprice
##   96.01% missing
## 
## 
## Time: 0.0 secs

We have a simpler tree, but it has a few more errors. The advantage is that we can describe this tree in words: if the price is less than or equal to -146, classify the transaction as fraud; if Val is missing, classify as fraud; if the price is greater than -146 but less than or equal to 162, classify as ok; otherwise classify as fraud. The case counts include fractions because rel_uprice has missing observations. The algorithm cannot determine whether such an observation has rel_uprice above or below -146, so it assigns a fraction of each missing case to each branch.
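
We can count how many training cases have rel_uprice missing; these are the cases that get split fractionally across the branches:

sum(is.na(train$rel_uprice)) # number of cases the algorithm must apportion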

IN-CLASS EXERCISE 2: Rules are often easier to interpret than decision trees. Estimate the decision tree again with the option rules=TRUE added to the C5.0 function.

IN-CLASS EXERCISE 3: Estimate a decision tree using missing, Val, and Quant as predictors. Which tree do you find more intuitive?

4. Make and evaluate predictions

We will use the predict() function to make predictions. It takes the classifier (in our case tree2) as the first argument and the test data set with the predictor variables only as the second argument.

pred <- predict(tree2, data.frame(select(test,missing, rel_uprice)))
summary(pred)
## fraud    ok 
##   294  2873

For evaluation we will use the confusionMatrix() function from the caret package. It has three main arguments: the first is a vector of class predictions, the second is the vector of actual classes, and the third tells the function which class is considered “positive.” Since we intend to detect fraud, “fraud” is the positive class.

library(caret)
confusionMatrix(pred, test$Insp, positive = "fraud")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction fraud   ok
##      fraud   199   95
##      ok       45 2828
##                                          
##                Accuracy : 0.9558         
##                  95% CI : (0.948, 0.9627)
##     No Information Rate : 0.923          
##     P-Value [Acc > NIR] : 5.001e-14      
##                                          
##                   Kappa : 0.7159         
##  Mcnemar's Test P-Value : 3.454e-05      
##                                          
##             Sensitivity : 0.81557        
##             Specificity : 0.96750        
##          Pos Pred Value : 0.67687        
##          Neg Pred Value : 0.98434        
##              Prevalence : 0.07704        
##          Detection Rate : 0.06284        
##    Detection Prevalence : 0.09283        
##       Balanced Accuracy : 0.89154        
##                                          
##        'Positive' Class : fraud          
## 

We achieved 96% accuracy, which is better than the 92% baseline we would get by declaring every transaction ‘ok’. Notably, Kappa is 0.72, suggesting that our predictors make a lot of difference.
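
The 92% baseline (the “No Information Rate” in the output above) is just the share of ok transactions in the test set, which we can verify directly:

mean(test$Insp == "ok") # proportion of ok transactions, about 0.923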

IN-CLASS EXERCISE 4: Does the 10-branch tree do better than the 4-branch tree? Does the 10-branch tree suffer from overfitting?

pred <- predict(tree, data.frame(select(test,missing, rel_uprice)))
confusionMatrix(pred, test$Insp, positive = "fraud")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction fraud   ok
##      fraud   179   47
##      ok       65 2876
##                                           
##                Accuracy : 0.9646          
##                  95% CI : (0.9576, 0.9708)
##     No Information Rate : 0.923           
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7426          
##  Mcnemar's Test P-Value : 0.1082          
##                                           
##             Sensitivity : 0.73361         
##             Specificity : 0.98392         
##          Pos Pred Value : 0.79204         
##          Neg Pred Value : 0.97790         
##              Prevalence : 0.07704         
##          Detection Rate : 0.05652         
##    Detection Prevalence : 0.07136         
##       Balanced Accuracy : 0.85876         
##                                           
##        'Positive' Class : fraud           
## 

5. Improving model performance by boosting

The C5.0 algorithm offers an option for boosting accuracy: it estimates several trees and assigns to each case the class chosen by the largest number of trees. This is done by adding the trials= option.

tree <- C5.0(data.frame(select(train, missing, rel_uprice)),train$Insp,
             control=C5.0Control(minCases=500), trials = 10)
#summary(tree) #this prints 10 different decision trees
pred <- predict(tree, data.frame(select(test,missing, rel_uprice)))
confusionMatrix(pred, test$Insp, positive = "fraud")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction fraud   ok
##      fraud   167   40
##      ok       77 2883
##                                           
##                Accuracy : 0.9631          
##                  95% CI : (0.9559, 0.9694)
##     No Information Rate : 0.923           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7208          
##  Mcnemar's Test P-Value : 0.0008741       
##                                           
##             Sensitivity : 0.68443         
##             Specificity : 0.98632         
##          Pos Pred Value : 0.80676         
##          Neg Pred Value : 0.97399         
##              Prevalence : 0.07704         
##          Detection Rate : 0.05273         
##    Detection Prevalence : 0.06536         
##       Balanced Accuracy : 0.83537         
##                                           
##        'Positive' Class : fraud           
## 

Compared to the 4-branch tree, our accuracy improved a little (from 95.6% to 96.3%), as did Kappa, but we lost some sensitivity (from 0.82 to 0.68).
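
Whether more boosting rounds would help is an empirical question. A quick sketch comparing test accuracy across a few values of trials, using the objects defined above:

for (tr in c(1, 5, 10, 20)) {
  m <- C5.0(data.frame(select(train, missing, rel_uprice)), train$Insp,
            control = C5.0Control(minCases = 500), trials = tr)
  p <- predict(m, data.frame(select(test, missing, rel_uprice)))
  cat("trials =", tr, " test accuracy =", round(mean(p == test$Insp), 4), "\n")
}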

Exercises:

These exercises are based on a famous Kaggle competition in which competitors predict who lives and who dies on the Titanic. Download the data set from this address, and load it into R. It has the following variables:
  • Survived: Survival (0 = No; 1 = Yes)
  • Pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
  • Name: Name
  • Sex: Sex
  • Age: Age (in years)
  • SibSp: Number of Siblings/Spouses Aboard
  • Parch: Number of Parents/Children Aboard
  • Ticket: Ticket Number
  • Fare: Passenger Fare
  • Cabin: Cabin
  • Embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
  1. What percentage of passengers in our data set survived?

  2. Which variables do you think may be good predictors of the survival on the Titanic? Document your exploration. (Hint: You may want to turn the Survived variable into a factor using the factor() function.)

  3. Estimate a decision tree predicting survival using age and sex as predictors. Describe your results.

  4. Estimate a decision tree using age, sex and passenger class. Describe your results.

  5. Estimate your own decision tree with your own set of predictors (you are, of course, free to include the predictors we used above). How accurate is your model on the training data? How does it compare to the models above?

  6. Download test data from this link. This is the test data from Kaggle, we actually don’t know the true fate of the passengers in this data set. Use this data to make predictions for these passengers.

  7. Even though we don’t know the fate of the passengers in the test data set, Kaggle does. In fact, Kaggle will evaluate our predictions and compare the accuracy of our predictions to those of other participants in the competition. All we have to do is register with Kaggle and create a .csv file that contains two columns: PassengerId and Survived. Where Survived contains our predictions 0 for did not survive, and 1 for survived. We can do this by first creating a data frame (let’s call it submit) using function data.frame() with the two columns. It should look something like this:submit <- data.frame(PassengerId = test$PassengerId, Survived = prediction) Second, we need to write a .csv file using function write.csv() This function takes a data frame to be written as its first argument, name of .csv file to be created as the second argument. We also need to use option row.names = FALSE to prevent the function from adding an additional column with row numbers. It should look something like this: write.csv(submit, "C:/business analytics/labs/lab 12/Submission.csv", row.names = FALSE) Submit your predictions and report on your accuracy and rank compared to other participants. Take a screenshot and attach to your lab.