Lab 8: Decision Trees: Detecting Fraudulent Transactions

Learning Objectives

  • explore the data (a lot)
  • estimate a decision tree
  • make and evaluate predictions
  • estimate decision rules

1. Introduction

This lab is adapted from Chapter 4 of Data Mining with R by Luis Torgo. The data contains over 400 thousand sales reports with the following variables:

  • ID - ID of the sales person
  • Prod - ID of the product sold
  • Quant - number of reported units sold
  • Val - reported total value of the sale
  • Insp - this variable has three possible values: ok if the report was inspected and found valid, fraud if the report was inspected but found fraudulent, and unkn if the report was not inspected.

2. Exploring the data

library(dplyr)
library(ggplot2)
library(stargazer)
#data <- read.csv("https://www.dropbox.com/s/xla1590d698xvur/sales.csv?raw=1")
data <- read.csv("C:/Users/dvorakt/Dropbox/ba data/sales.csv")
str(data)
## 'data.frame':    401146 obs. of  5 variables:
##  $ ID   : Factor w/ 6016 levels "v1","v10","v100",..: 1 1109 2216 3317 2216 4411 5505 5684 5795 5906 ...
##  $ Prod : Factor w/ 4548 levels "p1","p10","p100",..: 1 1 1 1 1 1112 1112 1112 1112 1112 ...
##  $ Quant: int  182 3072 20393 112 6164 104 350 200 233 118 ...
##  $ Val  : int  1665 8780 76990 1100 20260 1155 5680 4010 2855 1175 ...
##  $ Insp : Factor w/ 3 levels "fraud","ok","unkn": 3 3 3 3 3 3 3 3 3 3 ...

We see that there are lots of different sales people (6,016 of them) and lots of different products (4,548 of them). Let’s look at the quantitative variables and Insp.

summary(data[,c("Val", "Quant", "Insp")])
##       Val              Quant              Insp       
##  Min.   :   1005   Min.   :      100   fraud:  1270  
##  1st Qu.:   1345   1st Qu.:      107   ok   : 14462  
##  Median :   2675   Median :      168   unkn :385414  
##  Mean   :  14617   Mean   :     8442                 
##  3rd Qu.:   8680   3rd Qu.:      738                 
##  Max.   :4642955   Max.   :473883883                 
##  NA's   :1182      NA's   :13842

There is a pretty wide range in the quantity variable, but it appears that the minimum sale is 100 units. The distribution of the total value of the sales reports is, not surprisingly, also wide and right-skewed. Also, there are missing values (NA’s) in both quantity and value.

Importantly, it appears that only a small fraction of reports has been inspected, as over 385 thousand reports have unknown status. Of the roughly 15 thousand inspected reports only a small fraction, about 8% (1,270 of 15,732), is fraudulent. Thus, there is significant class imbalance in the variable we are trying to predict. Let’s compute a few measures of central tendency by Insp.

IN-CLASS EXERCISE 1: Calculate the median of value and quantity for ‘ok’, ‘fraud’ and ‘unkn’ transactions.

Hopefully, the above exercise showed that we need to be careful about missing values. Since there are missing values in both Val and Quant we will use the option na.rm=TRUE inside the mean() and median() functions. This option (rm stands for remove) tells mean() and median() to ignore missing values. Otherwise, if there was any missing value within a group, mean() and median() would return an NA. In other words, we calculate mean and median over all the non-missing values within each group.
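
To see what na.rm=TRUE does, here is a tiny made-up example:

x <- c(1, 2, NA)
mean(x)              #returns NA because of the missing value
## [1] NA
mean(x, na.rm=TRUE)  #ignores the NA and averages the rest
## [1] 1.5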

sum <- data %>% group_by(Insp) %>% 
  summarize(av_Val=mean(Val,na.rm=TRUE), av_Quant=mean(Quant,na.rm=TRUE),
            med_Val=median(Val,na.rm=TRUE), med_Quant=median(Quant,na.rm=TRUE))
sum
## Source: local data frame [3 x 5]
## 
##     Insp   av_Val  av_Quant med_Val med_Quant
##   (fctr)    (dbl)     (dbl)   (dbl)     (dbl)
## 1  fraud 93200.02 945503.78    6790     737.0
## 2     ok 60797.37  35784.17   13635     431.5
## 3   unkn 12628.77   4260.34    2620     166.0

We see that inspected transactions (whether fraud or ok) are decidedly larger in terms of value and quantity than uninspected ones (status unkn). We see that in terms of average value fraudulent transactions are bigger than ok ones, but in terms of median value they appear smaller. Quantity is bigger for fraudulent transactions in terms of both average and median. Clearly, there are some complex relationships among value, quantity and fraudulent/ok status. Let’s plot the data to see if we can shed some light.

We have two quantitative variables, Quant and Val, which we can plot on the x and y axes. We also have the qualitative variable Insp which we can map to color. Since we saw a pretty big range for both value and quantity, we will use log scales.

ggplot(data, aes(x=Quant, y=Val, color=Insp)) + geom_point() +
  scale_x_continuous(trans="log", breaks=c(1000,10000,100000,1000000,10000000,100000000))+
  scale_y_continuous(trans="log", breaks=c(1000,10000,100000,1000000,10000000)) 

It looks like the fraudulent transactions are scattered around the edges of the ‘cloud’ of observations. They either have high value and low quantity or low value and relatively high quantity. This suggests that unit price (Val/Quant) may be unusual for fraudulent transactions.

Let’s examine the unit price for each product. Identical products should cost roughly the same. If there is a big deviation of the unit price from what the product typically sells for, we should probably examine that sales report. To calculate a ‘typical unit price’ we should ideally use only inspected sales reports that were deemed ‘ok’ so that we know the price is accurate. However, only 798 out of a total of 4,548 products have reports that were inspected and deemed ‘ok’. Since the vast majority of reports is presumably legitimate, we will instead use all reports that were not found fraudulent. Note that we again use the na.rm=TRUE option so that we get a typical price (mean or median) for each product even if some reports for a product had missing values for Val or Quant.

uprice <- data %>% filter(Insp!="fraud") %>% group_by(Prod) %>%
  summarize(av_uprice=mean(Val/Quant,na.rm=TRUE), med_uprice=median(Val/Quant,na.rm=TRUE))
summary(uprice)
##       Prod        av_uprice          med_uprice      
##  p1     :   1   Min.   :   0.062   Min.   :   0.017  
##  p10    :   1   1st Qu.:   6.843   1st Qu.:   6.048  
##  p100   :   1   Median :  13.282   Median :  11.236  
##  p1000  :   1   Mean   :  19.650   Mean   :  15.022  
##  p1001  :   1   3rd Qu.:  20.363   3rd Qu.:  15.705  
##  p1002  :   1   Max.   :8157.291   Max.   :9204.195  
##  (Other):4542   NA's   :2          NA's   :2

Given that some of the unit prices may come from fraudulent reports, we will use the median unit price as our measure of a typical price for a product. The median should limit the undue influence of fraudulent reports. Let’s merge the median product price back into our data and calculate a relative price as the deviation from the median price. We will calculate this deviation as the difference between the unit price and the median unit price, divided by the average of the unit price and the median unit price. This method keeps the relative price between -200 and +200 percent. (This will make it easier to visualize.)
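
To see why this formula stays between -200 and +200 percent, here is a quick check using a few hypothetical unit prices against a hypothetical median price of 10 (these numbers are made up purely for illustration):

p <- c(0.01, 5, 10, 20, 10000)  #hypothetical unit prices
med <- 10                       #hypothetical median unit price
round((p - med)/((p + med)/2)*100, 1)
## [1] -199.6  -66.7    0.0   66.7  199.6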

data <- full_join(data, uprice, by="Prod")
data$rel_uprice <- (data$Val/data$Quant-data$med_uprice)/((data$Val/data$Quant+data$med_uprice)/2)*100
stargazer(select(filter(data, Insp=="ok"),rel_uprice), median=TRUE, type="text")
## 
## =========================================================
## Statistic    N     Mean  St. Dev.   Min    Median   Max  
## ---------------------------------------------------------
## rel_uprice 14,347 14.623  72.591  -197.683 0.412  196.462
## ---------------------------------------------------------
stargazer(select(filter(data, Insp=="fraud"),rel_uprice), median=TRUE, type="text")
## 
## =========================================================
## Statistic    N    Mean  St. Dev.   Min    Median    Max  
## ---------------------------------------------------------
## rel_uprice 1,199 -0.088 173.075  -200.000 -32.933 199.687
## ---------------------------------------------------------

We see that ok transactions tend to have relatively higher unit prices than fraudulent transactions. (I would expect the opposite, i.e. sales people inflating prices.) There is also a big difference in the standard deviation of the relative unit price among the fraudulent transactions versus the ok transactions: relative prices vary a lot more among the fraudulent transactions. Let’s plot the relative unit prices to get a better sense of how they are distributed within ok and fraudulent transactions.

ggplot(data, aes(x=Insp, y=rel_uprice, color=Insp)) + geom_jitter() #geom_jitter() plots points with a little random noise so they don't overlap

Wow, this graph clearly shows that unit prices of fraudulent transactions are off.

Finally, let’s focus on the missing values - we have lots of them. Let’s create a new variable missing that describes whether Val, Quant, both, or neither is missing. We will use the function ifelse(). The first argument of this function is a logical condition, the second is the value the function returns if the condition is TRUE, and the third argument is the value the function returns if the condition is FALSE. We will also make this variable a factor (it will be useful when we make predictions).
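
Before we nest several calls, here is ifelse() on its own:

ifelse(c(TRUE, FALSE, TRUE), "yes", "no")
## [1] "yes" "no"  "yes"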

data$missing <- ifelse(is.na(data$Val) & is.na(data$Quant), "both missing",
                       ifelse(is.na(data$Val), "Val missing",
                              ifelse(is.na(data$Quant), "Quant missing", "no missing")))
data$missing <- as.factor(data$missing)
table(data$missing)
## 
##  both missing    no missing Quant missing   Val missing 
##           888        387010         12954           294

We see that missing quantity is much more frequent than missing value. Still, there is probably something suspicious about reports with missing values.

IN-CLASS EXERCISE 2: Create a visual representation of the distribution of observations across values of Insp and simultaneously across values of missing. (One possible approach is sketched below.)
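
One possible approach (a sketch, not the only valid answer) is a stacked bar chart showing the proportion of each missing pattern within each inspection status:

ggplot(data, aes(x=Insp, fill=missing)) + geom_bar(position="fill") + ylab("proportion of reports")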

We have at least two solid candidates, rel_uprice and missing, for predicting fraudulent reports.

3. Create train and test data sets

In order to train and test our predictors we need to use only the inspected portion of the data set. We also need to ‘re-level’ the factor variable Insp: in the inspected portion there are only two possible values of Insp, and we need to tell R that. The function factor() accomplishes that by dropping the unused ‘unkn’ level.

inspected <- filter(data, Insp!="unkn")
inspected$Insp <- factor(inspected$Insp)

Let’s now create training and testing data sets. We will use the function sample() as we did in the past, with an 80/20 split.

set.seed(364) #set the seed for reproducibility
sample <- sample(nrow(inspected),floor(nrow(inspected)*0.8)) #create a vector of random row indices
train <- inspected[sample,]
test <- inspected[-sample,]

Since fraudulent transactions are pretty infrequent and our inspected data set is not particularly large, we should check that we have roughly the same proportion of fraudulent transactions in both the train and test data sets.

prop.table(table(train$Insp))
## 
##      fraud         ok 
## 0.08073103 0.91926897
prop.table(table(test$Insp))
## 
##      fraud         ok 
## 0.08071179 0.91928821
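
The proportions are nearly identical, so our random split worked out. If they had differed noticeably, a stratified split would guarantee balance. Here is a sketch using createDataPartition() from the caret package (which we load later in this lab anyway); strat_train and strat_test are illustrative names:

library(caret)
set.seed(364)
in_train <- createDataPartition(inspected$Insp, p=0.8, list=FALSE) #samples within each class
strat_train <- inspected[in_train,]
strat_test <- inspected[-in_train,]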

4. Estimate a decision tree

We will use an algorithm called C5.0 which is in the package C50. The algorithm takes a data frame of predictor variables as the first argument. (Note that C5.0 needs the first argument to be explicitly declared as a data frame using the data.frame() function; select() returns something very similar to a data frame but not exactly of class data.frame.) The second argument is a vector of classes in the training data set.
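
If you are curious, you can check the classes yourself (the exact class select() returns depends on your dplyr version, so treat this as a sketch):

class(select(train, missing, rel_uprice))             #may be a tbl rather than a plain data frame
class(data.frame(select(train, missing, rel_uprice))) #"data.frame"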

library(C50)
tree <- C5.0(data.frame(select(train, missing, rel_uprice)),train$Insp)
tree
## 
## Call:
## C5.0.default(x = data.frame(select(train, missing, rel_uprice)), y
##  = train$Insp)
## 
## Classification Tree
## Number of samples: 12585 
## Number of predictors: 2 
## 
## Tree size: 5 
## 
## Non-standard options: attempt to group attributes
summary(tree)
## 
## Call:
## C5.0.default(x = data.frame(select(train, missing, rel_uprice)), y
##  = train$Insp)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Mon Feb 01 11:05:35 2016
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 12585 cases (3 attributes) from undefined.data
## 
## Decision tree:
## 
## rel_uprice <= -144.4802: fraud (510/113.7)
## rel_uprice > -144.4802:
## :...missing = Val missing: fraud (34.5/4.8)
##     missing in {both missing,no missing,Quant missing}:
##     :...rel_uprice <= 162.1818: ok (11434.1/226.9)
##         rel_uprice > 162.1818:
##         :...rel_uprice <= 185.1613: ok (327.9/115.6)
##             rel_uprice > 185.1613: fraud (278.5/30.9)
## 
## 
## Evaluation on training data (12585 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##       5  488( 3.9%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     672   344    (a): class fraud
##     144 11425    (b): class ok
## 
## 
##  Attribute usage:
## 
##   98.83% rel_uprice
##   96.00% missing
## 
## 
## Time: 0.0 secs

We grew a tree with 5 leaves. It says that if the deviation from the median unit price is less than -144, the report is classified as fraudulent. Otherwise, if Val is missing, it is also classified as fraudulent. If Val is not missing and the deviation from the median unit price is less than 162, the transaction is ‘ok’; if the deviation is between 162 and 185, the transaction is still ‘ok’; and if it is greater than 185, the transaction is fraudulent. The split at 162 is therefore not strictly necessary: essentially, if the deviation from the median unit price is between -144 and 185, the transaction is ok, and if Val is missing, the transaction is fraudulent. These results are very consistent with our data exploration above.

IN-CLASS EXERCISE 3: Estimate a decision tree using missing, Val and Quant as predictors. Which tree do you find more intuitive?

5. Make and evaluate predictions

We will use the function predict() to make predictions. It takes the classifier (in our case tree) as the first argument and the test data set with the predictor variables only as the second argument.

pred <- predict(tree, data.frame(select(test,missing, rel_uprice)))
summary(pred)
## fraud    ok 
##   223  2924

For evaluation we will use the confusionMatrix() function from the caret package. It has three main arguments: the first is a vector of class predictions, the second is the vector of actual classes, and the third tells the function which class is considered “positive.” Here we intend to detect fraud, so “fraud” is the positive class.

library(caret)
confusionMatrix(pred, test$Insp, positive = "fraud")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction fraud   ok
##      fraud   184   39
##      ok       70 2854
##                                           
##                Accuracy : 0.9654          
##                  95% CI : (0.9584, 0.9715)
##     No Information Rate : 0.9193          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.7528          
##  Mcnemar's Test P-Value : 0.00406         
##                                           
##             Sensitivity : 0.72441         
##             Specificity : 0.98652         
##          Pos Pred Value : 0.82511         
##          Neg Pred Value : 0.97606         
##              Prevalence : 0.08071         
##          Detection Rate : 0.05847         
##    Detection Prevalence : 0.07086         
##       Balanced Accuracy : 0.85546         
##                                           
##        'Positive' Class : fraud           
## 

We achieved 96% accuracy, which is better than the 92% base rate we would get by declaring every transaction ‘ok’. Amazingly, Kappa is 0.75, suggesting that our predictors make a lot of difference.
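
To demystify Kappa, we can reproduce it by hand from the confusion matrix above. Kappa = (po - pe)/(1 - pe), where po is the observed accuracy and pe is the accuracy we would expect by chance given the row and column totals:

tp <- 184; fn <- 70; fp <- 39; tn <- 2854             #counts from the confusion matrix above
n <- tp + fn + fp + tn
po <- (tp + tn)/n                                     #observed accuracy
pe <- ((tp + fp)*(tp + fn) + (fn + tn)*(fp + tn))/n^2 #accuracy expected by chance
(po - pe)/(1 - pe)                                    #approximately 0.7528, matching Kappa above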

6. Improving model performance

The C5.0 algorithm offers an option for boosting accuracy: it estimates several trees and assigns the class chosen by the largest number of trees (a majority vote). This is done by adding the option trials=.

tree <- C5.0(data.frame(select(train, missing, rel_uprice)),train$Insp, trials = 10)
#summary(tree) #this prints 10 different decision trees
pred <- predict(tree, data.frame(select(test,missing, rel_uprice)))
confusionMatrix(pred, test$Insp, positive = "fraud")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction fraud   ok
##      fraud   174   31
##      ok       80 2862
##                                           
##                Accuracy : 0.9647          
##                  95% CI : (0.9577, 0.9709)
##     No Information Rate : 0.9193          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7394          
##  Mcnemar's Test P-Value : 5.215e-06       
##                                           
##             Sensitivity : 0.68504         
##             Specificity : 0.98928         
##          Pos Pred Value : 0.84878         
##          Neg Pred Value : 0.97281         
##              Prevalence : 0.08071         
##          Detection Rate : 0.05529         
##    Detection Prevalence : 0.06514         
##       Balanced Accuracy : 0.83716         
##                                           
##        'Positive' Class : fraud           
## 

Although I did not print the results, boosting improved the accuracy of the algorithm in the training data. However, in our case we see that it did not improve performance in the test data.
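
You can verify the claim about the training data without printing all ten trees by predicting on the training set:

train_pred <- predict(tree, data.frame(select(train, missing, rel_uprice)))
mean(train_pred == train$Insp) #share of correct predictions in the training data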

7. Creating rules

Another option offered by the C5.0 algorithm is rules=. This option derives decision rules from the estimated decision tree.

tree <- C5.0(data.frame(select(train, missing, rel_uprice)), train$Insp, rules = TRUE)
summary(tree)
## 
## Call:
## C5.0.default(x = data.frame(select(train, missing, rel_uprice)), y
##  = train$Insp, rules = TRUE)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Mon Feb 01 11:05:39 2016
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 12585 cases (3 attributes) from undefined.data
## 
## Rules:
## 
## Rule 1: (276/29, lift 11.1)
##  rel_uprice > 185.1613
##  ->  class fraud  [0.892]
## 
## Rule 2: (36/5, lift 10.4)
##  missing = Val missing
##  ->  class fraud  [0.842]
## 
## Rule 3: (504/110, lift 9.7)
##  rel_uprice <= -144.4802
##  ->  class fraud  [0.781]
## 
## Rule 4: (11658/320, lift 1.1)
##  rel_uprice > -144.4802
##  rel_uprice <= 185.1613
##  ->  class ok  [0.972]
## 
## Default class: ok
## 
## 
## Evaluation on training data (12585 cases):
## 
##          Rules     
##    ----------------
##      No      Errors
## 
##       4  488( 3.9%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     672   344    (a): class fraud
##     144 11425    (b): class ok
## 
## 
##  Attribute usage:
## 
##   98.83% rel_uprice
##    0.29% missing
## 
## 
## Time: 0.0 secs

Rules are often easier to interpret than decision trees, as this example shows.

Exercises:

  1. These exercises are based on a famous kaggle competition in which competitors predict who lives and who dies on the Titanic. Download the data set from this address, and load it into R. It has the following variables:
  • Survived - survival (0 = No; 1 = Yes)
  • Pclass - passenger class (1 = 1st; 2 = 2nd; 3 = 3rd)
  • Name - name
  • Sex - sex
  • Age - age (in years)
  • SibSp - number of siblings/spouses aboard
  • Parch - number of parents/children aboard
  • Ticket - ticket number
  • Fare - passenger fare
  • Cabin - cabin
  • Embarked - port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
  2. What percentage of passengers in our data set survived?

  3. Which variables do you think may be good predictors of the survival on the Titanic? Document your exploration. (Hint: You may want to turn the Survived variable into a factor using the factor() function.)

  4. Estimate a decision tree predicting survival using age and sex as predictors. Describe your results.

  5. Estimate a decision tree using age, sex and passenger class. Describe your results.

  6. Estimate your own decision tree with your own set of predictors (you are, of course, free to include the predictors we used above). How accurate is your model on the training data? How does it compare to the models above?

  7. Download test data from this link. This is the test data from Kaggle; we don’t actually know the true fate of the passengers in this data set. Use this data to make predictions for these passengers.

  8. Even though we don’t know the fate of the passengers in the test data set, Kaggle does. In fact, Kaggle will evaluate our predictions and compare their accuracy to those of other participants in the competition. All we have to do is register with Kaggle and create a .csv file that contains two columns, PassengerId and Survived, where Survived contains our predictions: 0 for did not survive, and 1 for survived. We can do this by first creating a data frame (let’s call it submit) using the function data.frame() with the two columns. It should look something like this: submit <- data.frame(PassengerId = test$PassengerId, Survived = prediction). Second, we need to write a .csv file using the function write.csv(). This function takes the data frame to be written as its first argument and the name of the .csv file to be created as the second argument. We also need to use the option row.names = FALSE to prevent the function from adding an additional column with row numbers. It should look something like this: write.csv(submit, "C:/business analytics/labs/lab 8/Submission.csv", row.names = FALSE). Submit your predictions and report on your accuracy and rank compared to other participants. A sketch of the full submission step appears below.
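
For reference, here is a minimal end-to-end sketch of the submission step. It assumes tree is your fitted C5.0 model, test holds the Kaggle test data, and your Survived factor uses the labels "0" and "1" (all of these names are assumptions, not part of the lab’s code):

pred <- predict(tree, test)                         #predicted classes for the Kaggle test set
submit <- data.frame(PassengerId = test$PassengerId,
                     Survived = as.character(pred)) #the "0"/"1" labels Kaggle expects
write.csv(submit, "Submission.csv", row.names = FALSE)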