Lab 11: Decision Trees I: Detecting Fraudulent Transactions

Learning Objectives

  • exploring data depends on context (domain knowledge)
  • review of group_by(), summarize(), inner_join()
  • dealing with NAs in summarize() using na.rm=TRUE
  • avoid over-plotting with the alpha= aesthetic and geom_jitter()

1. Introduction

This lab is adapted from chapter 4 in Data Mining with R by Luis Torgo. The data contains over 400 thousand sales reports with the following variables:

  • ID - ID of the sales person
  • Prod - ID of the product sold
  • Quant - number of reported units sold
  • Val - reported total value of the sale
  • Insp - this variable has three possible values: ok if the report was inspected and found valid, fraud if the report was inspected but found fraudulent, and unkn if the report was not inspected.

2. Exploring the data

library(dplyr)
library(ggplot2)
library(stargazer)
data <- read.csv("https://www.dropbox.com/s/xla1590d698xvur/sales.csv?raw=1")
data <- read.csv("C:/Users/dvorakt/Dropbox/ba data/sales.csv")
str(data)
## 'data.frame':    401146 obs. of  5 variables:
##  $ ID   : Factor w/ 6016 levels "v1","v10","v100",..: 1 1109 2216 3317 2216 4411 5505 5684 5795 5906 ...
##  $ Prod : Factor w/ 4548 levels "p1","p10","p100",..: 1 1 1 1 1 1112 1112 1112 1112 1112 ...
##  $ Quant: int  182 3072 20393 112 6164 104 350 200 233 118 ...
##  $ Val  : int  1665 8780 76990 1100 20260 1155 5680 4010 2855 1175 ...
##  $ Insp : Factor w/ 3 levels "fraud","ok","unkn": 3 3 3 3 3 3 3 3 3 3 ...

We see that there are over 400 thousand sales reports. There are lots of different sales people (6,016 of them) and lots of different products (4,548 of them). Let's look at the quantitative variables and at Insp.

summary(data[,c("Val", "Quant", "Insp")])
##       Val              Quant              Insp       
##  Min.   :   1005   Min.   :      100   fraud:  1270  
##  1st Qu.:   1345   1st Qu.:      107   ok   : 14462  
##  Median :   2675   Median :      168   unkn :385414  
##  Mean   :  14617   Mean   :     8442                 
##  3rd Qu.:   8680   3rd Qu.:      738                 
##  Max.   :4642955   Max.   :473883883                 
##  NA's   :1182      NA's   :13842

The quantity variable has a pretty wide range, though the minimum sale appears to be 100 units. The range in the total value of the sales reports is, not surprisingly, also wide and skewed. Also, there are missing values (NA's) in both quantity and value.

Importantly, it appears that only a small fraction of reports has been inspected: over 385 thousand reports have unknown status. Of the nearly 16 thousand inspected reports, only a small fraction, about 8%, is fraudulent. Thus, there is significant class imbalance in the variable we are trying to predict. Let's compute a few measures of central tendency by Insp.
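
To make the class imbalance concrete, here is a quick tally of the inspected reports (a minimal sketch using the dplyr verbs loaded above):

data %>% filter(Insp != "unkn") %>% count(Insp) %>% mutate(share = n / sum(n))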

IN-CLASS EXERCISE 1: Calculate the median of value and quantity for 'ok', 'fraud', and 'unkn' transactions.

Hopefully, the above exercise showed that we need to be careful about missing values. Since there are missing values in both Val and Quant, we will use the option na.rm=TRUE inside the mean() and median() functions. This option (rm stands for remove) tells mean() and median() to ignore missing values. Otherwise, if there were any missing values within a group, mean() and median() would return an NA. In other words, we calculate the mean and median over all the non-missing values within each group.
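
A quick toy illustration of this behavior:

mean(c(1, 2, NA))               # returns NA because one value is missing
mean(c(1, 2, NA), na.rm = TRUE) # returns 1.5, the mean of the non-missing values

With na.rm=TRUE in mind, here are the grouped means and medians: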

sum <- data %>% group_by(Insp) %>% 
  summarize(av_Val=mean(Val,na.rm=TRUE), av_Quant=mean(Quant,na.rm=TRUE),
            med_Val=median(Val,na.rm=TRUE), med_Quant=median(Quant,na.rm=TRUE))
sum
## # A tibble: 3 × 5
##     Insp   av_Val  av_Quant med_Val med_Quant
##   <fctr>    <dbl>     <dbl>   <dbl>     <dbl>
## 1  fraud 93200.02 945503.78    6790     737.0
## 2     ok 60797.37  35784.17   13635     431.5
## 3   unkn 12628.77   4260.34    2620     166.0

We see that inspected transactions (both ok and fraudulent) are decidedly larger in terms of value and quantity than uninspected ones (status unkn). In terms of average value, fraudulent transactions are bigger than ok ones, but in terms of median value they appear smaller. Quantity is bigger for fraudulent transactions in terms of both the average and the median. Clearly, there are some complex relationships among value, quantity, and fraudulent/ok status. Let's plot the data to see if we can shed some light on them.

We have two quantitative variables, Val and Quant, which we can plot on the x and y axes. We also have the qualitative variable Insp, which we can map to color. Since we saw a pretty big range for both value and quantity, we will use log scales. Also, since we have about 400 thousand observations, we should reduce over-plotting by making the points transparent using the alpha= aesthetic.

ggplot(data, aes(x=Quant, y=Val, color=Insp)) + geom_point(alpha=0.25) +
  scale_x_continuous(trans="log", breaks=c(1000,10000,100000,1000000,10000000,100000000))+
  scale_y_continuous(trans="log", breaks=c(1000,10000,100000,1000000,10000000)) 

It looks like the fraudulent transactions are scattered around the edges of the ‘cloud’ of observations. They either have high value and low quantity or low value and relatively high quantity. This suggests that unit price (Val/Quant) may be unusual for fraudulent transactions.

Let's examine the unit price for each product. Identical products should cost roughly the same. If there is a big deviation of the unit price from what the product typically sells for, we should probably examine that sales report. To calculate a 'typical unit price' we would ideally use only inspected sales reports that were deemed 'ok', so that we know the price is accurate. However, only 798 out of a total of 4,548 products have an inspected 'ok' report. Since the vast majority of reports is presumably legitimate, we will use all reports that were not flagged as fraudulent. Note that we again use the na.rm=TRUE option so that we get a typical price (mean or median) for each product even if some reports for a product had missing values for Val or Quant.
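
If you want to verify the 798 figure, one quick check (a sketch, not part of the main pipeline) is to count the distinct products that have at least one inspected 'ok' report:

data %>% filter(Insp == "ok") %>% summarize(n_products = n_distinct(Prod))

Now we compute a typical unit price for each product: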

uprice <- data %>% filter(Insp!="fraud") %>% group_by(Prod) %>%
  summarize(av_uprice=mean(Val/Quant,na.rm=TRUE), med_uprice=median(Val/Quant,na.rm=TRUE))
summary(uprice)
##       Prod        av_uprice          med_uprice      
##  p1     :   1   Min.   :   0.062   Min.   :   0.017  
##  p10    :   1   1st Qu.:   6.843   1st Qu.:   6.048  
##  p100   :   1   Median :  13.282   Median :  11.236  
##  p1000  :   1   Mean   :  19.650   Mean   :  15.022  
##  p1001  :   1   3rd Qu.:  20.363   3rd Qu.:  15.705  
##  p1002  :   1   Max.   :8157.291   Max.   :9204.195  
##  (Other):4542   NA's   :2          NA's   :2

Given that some of the unit prices may come from fraudulent reports, we will use the median unit price as the measure of a typical price for a product. This should reduce the undue influence of fraudulent reports. Let's merge the median product price back into our data and calculate a relative price as the deviation from the median price. We will calculate the deviation as the difference between the unit price and the median unit price, divided by the average of the unit price and the median unit price. This keeps the relative price between -200 and +200 percent, which will make it easier to visualize.
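
To see why this measure stays between -200 and +200 percent, here is a quick numeric check (rel() is a throwaway helper used only for this illustration):

rel <- function(uprice, med) (uprice - med) / ((uprice + med) / 2) * 100
rel(10, 10)   # 0: the unit price equals the median
rel(100, 10)  # about 164: the unit price is well above the median
rel(1e6, 10)  # about 200: approaches the upper bound

Now we apply the same formula to the data: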

data <- full_join(data, uprice, by="Prod")
data$rel_uprice <- (data$Val/data$Quant-data$med_uprice)/((data$Val/data$Quant+data$med_uprice)/2)*100
stargazer(select(filter(data, Insp=="ok"),rel_uprice), median=TRUE, type="text")
## 
## =========================================================
## Statistic    N     Mean  St. Dev.   Min    Median   Max  
## ---------------------------------------------------------
## rel_uprice 14,347 14.623  72.591  -197.683 0.412  196.462
## ---------------------------------------------------------
stargazer(select(filter(data, Insp=="fraud"),rel_uprice), median=TRUE, type="text")
## 
## =========================================================
## Statistic    N    Mean  St. Dev.   Min    Median    Max  
## ---------------------------------------------------------
## rel_uprice 1,199 -0.088 173.075  -200.000 -32.933 199.687
## ---------------------------------------------------------

We see that ok transactions tend to have relatively higher unit prices than fraudulent transactions. (I would expect the opposite, i.e. sales people inflating prices.) There is also a big difference in the standard deviation of the relative unit price: relative prices vary a lot more among the fraudulent transactions than among the ok transactions. Let's plot the relative unit prices to get a better sense of how they are distributed within ok and fraudulent transactions. In addition to the alpha= aesthetic, we can reduce over-plotting by using geom_jitter(), which adds small random shifts to the positions of our data points. Here we restrict the jitter to horizontal shifts by setting height=0.

ggplot(data, aes(x=Insp, y=rel_uprice, color=Insp)) + geom_jitter(height = 0, alpha = 0.25)

Wow, this graph clearly shows that unit prices of fraudulent transactions are off.
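
One rough way to quantify how far off they are is to compute, for each inspection status, the share of reports whose unit price deviates from the product's median by more than 50 percentage points (the cutoff of 50 is arbitrary and chosen only for illustration):

data %>% filter(Insp != "unkn") %>% group_by(Insp) %>%
  summarize(share_far = mean(abs(rel_uprice) > 50, na.rm = TRUE))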

Finally, let's focus on the missing values - we have lots of them. Let's create a new variable missing that describes whether Val, Quant, both, or neither is missing. We will use nested calls to ifelse(). We will also make this variable a factor (this will be useful when we make predictions).

data$missing <- ifelse(is.na(data$Val) & is.na(data$Quant), "both missing",
                       ifelse(is.na(data$Val), "Val missing",
                              ifelse(is.na(data$Quant), "Quant missing", "no missing")))
data$missing <- as.factor(data$missing)
table(data$missing)
## 
##  both missing    no missing Quant missing   Val missing 
##           888        387010         12954           294

We see that missing quantity is much more frequent than missing value. Still, there is probably something suspicious about reports with missing values.
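
A simple cross-tabulation of missing against Insp is one (non-visual) way to check this; the exercise below asks for a graphical version:

table(data$missing, data$Insp)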

IN-CLASS EXERCISE 2: Create a visual representation of the distribution of observations across values of Insp and simultaneously across values of missing.

We have at least two solid candidates, rel_uprice and missing, for predicting fraudulent reports. We will do just that in lab 12, using a new algorithm called decision trees.