R Markdown

This case study is on fraud detection, i.e. detecting unusual activities, or outliers, in a dataset. Here we have the transaction reports of the salesmen of a company. Salesmen are free to set the selling price of a series of products. For each transaction, salesmen report back the product, the quantity sold and the overall value of the transaction.

Load Libraries

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(DMwR)
## Loading required package: lattice
## Loading required package: grid
library(DMwR2)
## 
## Attaching package: 'DMwR2'
## The following objects are masked from 'package:DMwR':
## 
##     .Eq, .St, algae, algae.sols, centralImputation, centralValue,
##     GSPC, kNN, knnImputation, lofactor, manyNAs, outliers.ranking,
##     rpartXse, rt.prune, sales, SelfTrain, sigs.PR, SoftMax,
##     test.algae, tradeRecord, trading.signals, trading.simulator,
##     tradingEvaluation
library(forcats) 
library(ggplot2)
library(ROCR)
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
library(performanceEstimation)
library(UBL)
## Loading required package: MBA
## Loading required package: gstat
## Loading required package: automap
## Loading required package: sp
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
library(e1071)
library(adabag)
## Loading required package: rpart
## Loading required package: mlbench
## Loading required package: caret
library(gbm)
## Loading required package: survival
## 
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
## 
##     cluster
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.3
library(xgboost)
## 
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
## 
##     slice
library(RWeka)

Load Data

data(sales, package="DMwR2")
sales
## # A tibble: 401,146 × 5
##        ID   Prod Quant   Val   Insp
##    <fctr> <fctr> <int> <dbl> <fctr>
## 1      v1     p1   182  1665   unkn
## 2      v2     p1  3072  8780   unkn
## 3      v3     p1 20393 76990   unkn
## 4      v4     p1   112  1100   unkn
## 5      v3     p1  6164 20260   unkn
## 6      v5     p2   104  1155   unkn
## 7      v6     p2   350  5680   unkn
## 8      v7     p2   200  4010   unkn
## 9      v8     p2   233  2855   unkn
## 10     v9     p2   118  1175   unkn
## # ... with 401,136 more rows

Explore Data

dim(sales)
## [1] 401146      5
summary(sales)
##        ID              Prod            Quant                Val         
##  v431   : 10159   p1125  :  3923   Min.   :      100   Min.   :   1005  
##  v54    :  6017   p3774  :  1824   1st Qu.:      107   1st Qu.:   1345  
##  v426   :  3902   p1437  :  1720   Median :      168   Median :   2675  
##  v1679  :  3016   p1917  :  1702   Mean   :     8442   Mean   :  14617  
##  v1085  :  3001   p4089  :  1598   3rd Qu.:      738   3rd Qu.:   8680  
##  v1183  :  2642   p2742  :  1519   Max.   :473883883   Max.   :4642955  
##  (Other):372409   (Other):388860   NA's   :13842       NA's   :1182     
##     Insp       
##  ok   : 14462  
##  unkn :385414  
##  fraud:  1270  
##                
##                
##                
## 

We can see that the data frame has 401,146 rows and 5 columns: ID, Prod, Quant, Val and Insp.

There are quite a lot of NAs for the variables Quant and Val. Let’s take a look at how many cases are missing both variables.

nrow(filter(sales, is.na(Quant), is.na(Val))) # transactions with both quantity and value missing
## [1] 888

Let’s take a look at the distribution of the Insp.

table(sales$Insp)/nrow(sales)*100
## 
##        ok      unkn     fraud 
##  3.605171 96.078236  0.316593

Most of the cases have not been inspected.

The search for frauds should be done by product, since a transaction can only be regarded as abnormal in the context of the transactions of the same product. The unit price is a better basis for comparing transactions of the same product.

sales <- mutate(sales, Uprice = Val / Quant) # create a new column for the data set with the claimed unit price for each transaction
summary(sales$Uprice)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##     0.00     8.46    11.89    20.30    19.11 26460.00    14136

The maximum unit price is very large, which suggests something fishy is going on.

Let’s take a look at the variability among salesmen and products.

ids <- group_by(sales, ID) 
ggplot(summarize(ids, nTrans=n()), aes(x=ID,y=nTrans)) + 
  geom_bar(stat="identity") + 
  theme(axis.text.x = element_blank(), axis.ticks.x=element_blank()) + 
  xlab("Salesmen") + 
  ylab("Nr. of Transactions") + 
  ggtitle("Nr. of Transactions per Salesman")

prods <- group_by(sales, Prod) 
ggplot(summarize(prods, nTrans=n()), aes(x=Prod,y=nTrans)) + 
  geom_bar(stat="identity") + 
  theme(axis.text.x = element_blank(), axis.ticks.x=element_blank()) + 
  xlab("Product") + 
  ylab("Nr. of Transactions") + 
  ggtitle("Nr. of Transactions per Product")

nrow(summarise(prods,nTrans=n()) %>% filter(nTrans < 20)) 
## [1] 982

Of the 4,548 products, 982 have fewer than 20 transactions. Declaring a transaction as unusual based on comparing it to fewer than 20 other transactions seems too risky.

Let’s take a look at the typical price of each product.

mpProds <- summarize(prods, medianPrice=median(Uprice,na.rm=TRUE)) 
bind_cols(mpProds %>% 
            arrange(medianPrice) %>% dplyr::slice(1:5), 
          mpProds %>% 
            arrange(desc(medianPrice)) %>% dplyr::slice(1:5))
## # A tibble: 5 × 4
##     Prod medianPrice   Prod medianPrice
##   <fctr>       <dbl> <fctr>       <dbl>
## 1   p560  0.01688455  p3689   9204.1954
## 2   p559  0.01884438  p2453    456.0784
## 3  p4195  0.03025914  p2452    329.3137
## 4   p601  0.05522265  p2456    304.8515
## 5   p563  0.05576406  p2459    283.8119

We see that the median price of the top 5 products is significantly different from that of the bottom 5 products.

Next we show in the boxplot below the unit price distribution of the cheapest (p560) and the most expensive (p3689) product. p560 shows three outliers, which warrants further investigation.

ggplot(filter(sales, Prod %in% c("p3689","p560")), aes(x=fct_drop(Prod), y=Uprice)) + 
  geom_boxplot() + 
  scale_y_log10() + 
  xlab("") + 
  ylab("log10(UnitPrice)")

Let us take a look at the number of transactions per salesman.

nrow(summarise(ids,nTrans=n()) %>% 
       filter(nTrans < 10))  
## [1] 1539

We see that there are 1,539 salesmen with very few transactions. As with the products, the total sales brought in by the salesmen differ significantly, as seen below. The problem here is arguably less serious: it is quite common for the top salesmen to bring in more sales. In our case, the top 100 salesmen account for almost 40% of the revenue of the company, while the bottom 2,000 out of the 6,016 salesmen generate less than 2% of the revenue.

tvIDs <- summarize(ids,totalVal=sum(Val,na.rm=TRUE))
bind_cols(tvIDs %>% 
            arrange(totalVal) %>% dplyr::slice(1:5), 
          tvIDs %>% 
            arrange(desc(totalVal)) %>% dplyr::slice(1:5))
## # A tibble: 5 × 4
##       ID totalVal     ID  totalVal
##   <fctr>    <dbl> <fctr>     <dbl>
## 1  v3355     1050   v431 211489170
## 2  v6069     1080    v54 139322315
## 3  v5876     1115    v19  71983200
## 4  v6058     1115  v4520  64398195
## 5  v4515     1125   v955  63182215
arrange(tvIDs,desc(totalVal)) %>% # The top 100 salesmen account for almost 40% of the revenue of the company 
  dplyr::slice(1:100) %>% 
  summarize(t100=sum(totalVal)) / (summarize(tvIDs,sum(totalVal))) * 100
##       t100
## 1 38.33277
arrange(tvIDs,totalVal) %>% # bottom 2,000 out of the 6,016 salesmen generate less than 2% of the revenue
  dplyr::slice(1:2000) %>% 
  summarize(b2000=sum(totalVal)) / (summarize(tvIDs,sum(totalVal))) * 100
##      b2000
## 1 1.988716

The key assumption we are going to make to find frauds is that the transactions of the same product should not vary a lot in terms of unit price. More precisely, we will assume that the unit price of the transactions of each product should follow a normal distribution. This immediately provides a nice and intuitive baseline rule for finding frauds: transactions whose declared unit prices do not fit the assumption of a normal distribution are suspicious.

We can use the boxplot rule as a quick way of identifying the outliers. Specifically, values outside of the following interval are outliers: [Q1 - 1.5 × IQR, Q3 + 1.5 × IQR] where Q1(Q3) is the 1st (3rd) quartile and IQR is the inter-quartile range.
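
Before applying this rule across all products, here is a quick sketch of what it amounts to for a single product (p3689 is used purely as an example; boxplot.stats() uses Tukey’s hinges, so its count may differ slightly from the manual quartile-based one):

bpx <- filter(sales, Prod == "p3689")$Uprice
qs <- quantile(bpx, c(0.25, 0.75), na.rm=TRUE)
iqr <- qs[2] - qs[1]
limits <- c(qs[1] - 1.5*iqr, qs[2] + 1.5*iqr)   # values outside this interval are outliers
sum(bpx < limits[1] | bpx > limits[2], na.rm=TRUE)
length(boxplot.stats(bpx)$out)                  # should broadly agree with the manual count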

nouts <- function(x) length(boxplot.stats(x)$out) # this function extracts the outliers
noutsProds <- summarise(prods,nOut=nouts(Uprice)) 
arrange(noutsProds,desc(nOut))
## # A tibble: 4,548 × 2
##      Prod  nOut
##    <fctr> <int>
## 1   p1125   376
## 2   p1437   181
## 3   p2273   165
## 4   p1917   156
## 5   p1918   156
## 6   p4089   137
## 7    p538   129
## 8   p3774   125
## 9   p2742   120
## 10  p3338   117
## # ... with 4,538 more rows
summarize(noutsProds,totalOuts=sum(nOut)) 
## # A tibble: 1 × 1
##   totalOuts
##       <int>
## 1     29446
summarize(noutsProds,totalOuts=sum(nOut))/nrow(sales)*100
##   totalOuts
## 1   7.34047

29,446 transactions are considered outliers, which corresponds to approximately 7% of the total number of transactions.

As we have mentioned, there are lots of unknown values. Particularly serious are the 888 transactions with both Quant and Val unknown. Let’s take a look at these problematic transactions.

nas <- filter(sales, is.na(Quant), is.na(Val)) %>% select(ID,Prod) # Products and salesmen involved in the 888 problematic cases

# The most problematic products and salesmen
prop.naQandV <- function(q,v) 100*sum(is.na(q) & is.na(v))/length(q) 
summarise(ids,nProbs=prop.naQandV(Quant,Val)) %>% 
  arrange(desc(nProbs))
## # A tibble: 6,016 × 2
##        ID    nProbs
##    <fctr>     <dbl>
## 1   v1237 13.793103
## 2   v4254  9.523810
## 3   v4038  8.333333
## 4   v5248  8.333333
## 5   v3666  6.666667
## 6   v4433  6.250000
## 7   v4170  5.555556
## 8   v4926  5.555556
## 9   v4664  5.494505
## 10  v4642  4.761905
## # ... with 6,006 more rows
summarise(prods,nProbs=prop.naQandV(Quant,Val)) %>% 
  arrange(desc(nProbs))
## # A tibble: 4,548 × 2
##      Prod   nProbs
##    <fctr>    <dbl>
## 1   p2689 39.28571
## 2   p2675 35.41667
## 3   p4061 25.00000
## 4   p2780 22.72727
## 5   p4351 18.18182
## 6   p2686 16.66667
## 7   p2707 14.28571
## 8   p2690 14.08451
## 9   p2691 12.90323
## 10  p2670 12.76596
## # ... with 4,538 more rows

Tidy Data

With regard to the salesmen, removing these transactions means removing only a small percentage of each salesman’s reports, so that is OK. In terms of the products, however, things are not that simple, because some would have more than 20% of their transactions removed. Still, we will see that their unit price distribution is similar to that of other products. Overall, the best strategy seems to be to remove these transactions, so let’s go ahead and do so.

sales <- filter(sales,!(is.na(Quant) & is.na(Val))) # Remove the rows with both null values
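
As a quick sanity check, no transaction should now have both Quant and Val missing:

nrow(filter(sales, is.na(Quant), is.na(Val))) # should now be 0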

Let us check the transactions with either the quantity or value unknown. The following obtains the proportion of transactions of each product that have the quantity unknown:

prop.nas <- function(x) 100*sum(is.na(x))/length(x) 
propNAsQp <- summarise(prods,Proportion=prop.nas(Quant)) 
arrange(propNAsQp,desc(Proportion))
## # A tibble: 4,548 × 2
##      Prod Proportion
##    <fctr>      <dbl>
## 1   p2442  100.00000
## 2   p2443  100.00000
## 3   p1653   90.90909
## 4   p4101   85.71429
## 5   p4243   68.42105
## 6    p903   66.66667
## 7   p3678   66.66667
## 8   p4061   66.66667
## 9   p3955   64.28571
## 10  p4313   63.63636
## # ... with 4,538 more rows

Two products (p2442 and p2443) have the quantity unknown in all of their transactions (54 in total). Let’s remove them.

nlevels(sales$Prod)
## [1] 4548
sales <- droplevels(filter(sales,!(Prod %in% c("p2442", "p2443")))) # also drop the unused factor levels; otherwise the Prod factor "remembers" the removed products and plots may show empty categories
nlevels(sales$Prod)
## [1] 4546

Let us now check the remaining unknowns, both per salesman and for the Val column. If the proportion of transactions with unknown values is reasonable, we may try to fill in these unknowns.

summarise(ids,Proportion=prop.nas(Quant)) %>% 
  arrange(desc(Proportion)) # looks reasonable; we can try to fill in the unknowns
## # A tibble: 6,016 × 2
##        ID Proportion
##    <fctr>      <dbl>
## 1   v2925  100.00000
## 2   v4356  100.00000
## 3   v5537  100.00000
## 4   v5836  100.00000
## 5   v6044  100.00000
## 6   v6058  100.00000
## 7   v6065  100.00000
## 8   v2923   90.00000
## 9   v4368   88.88889
## 10  v2920   85.71429
## # ... with 6,006 more rows
summarise(prods,Proportion=prop.nas(Val)) %>% 
  arrange(desc(Proportion)) # looks reasonable; we can try to fill in the unknowns
## # A tibble: 4,548 × 2
##      Prod Proportion
##    <fctr>      <dbl>
## 1   p2689   39.28571
## 2   p2675   35.41667
## 3   p1110   25.00000
## 4   p4061   25.00000
## 5   p2780   22.72727
## 6   p4351   18.18182
## 7   p4491   18.18182
## 8   p2707   17.85714
## 9   p1462   17.77778
## 10  p1022   17.64706
## # ... with 4,538 more rows
summarise(ids,Proportion=prop.nas(Val)) %>% 
  arrange(desc(Proportion)) # looks reasonable; we can try to fill in the unknowns
## # A tibble: 6,016 × 2
##        ID Proportion
##    <fctr>      <dbl>
## 1   v5647  37.500000
## 2     v74  22.222222
## 3   v5946  20.000000
## 4   v5290  15.384615
## 5   v4022  13.953488
## 6   v1237  13.793103
## 7   v4472  12.500000
## 8    v975   9.574468
## 9   v4254   9.523810
## 10  v2814   9.090909
## # ... with 6,006 more rows

The basic assumption we make is that transactions of the same product should have a similar unit price. Let’s calculate this typical price for each product. We will skip the transactions already found fraudulent in this calculation.

tPrice <- filter(sales, Insp != "fraud") %>% # skip the transactions found fraudulent 
  group_by(Prod) %>% 
  summarise(medianPrice = median(Uprice,na.rm=TRUE))

Let’s go ahead and fill in either the quantity or the value, given that we no longer have any transaction with both values unknown.

noQuantMedPrices <- filter(sales, is.na(Quant)) %>% 
  inner_join(tPrice) %>% 
  select(medianPrice)
## Joining, by = "Prod"
noValMedPrices <- filter(sales, is.na(Val)) %>% 
  inner_join(tPrice) %>% 
  select(medianPrice)
## Joining, by = "Prod"
noQuant <- which(is.na(sales$Quant)) # Find index that are unknowns

noVal <- which(is.na(sales$Val)) # Find index that are unknowns

sales[noQuant,'Quant'] <- ceiling(sales[noQuant,'Val']/noQuantMedPrices) 

sales[noVal,'Val'] <- sales[noVal,'Quant'] * noValMedPrices

sales$Uprice <- sales$Val/sales$Quant
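
A quick check that the imputation left no missing values behind:

sapply(sales[, c("Quant","Val","Uprice")], function(x) sum(is.na(x))) # should be all zeros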

We are done. Let’s go ahead and save the cleaned data.

save(sales,file="mySalesObj.Rdata") 

Modelling & Model Evaluation

As noted above, the target variable (the Insp column) has three values: ok, fraud and unkn. The value unkn means that those transactions were not inspected and thus may be either frauds or normal transactions (OK). In a way this represents an unknown value of the target variable. In these contexts there are essentially 3 ways of addressing these tasks: unsupervised, supervised, or semi-supervised techniques.

Unsupervised Techniques

Unsupervised learning essentially assumes complete ignorance of the target variable values. We try to find the natural groupings of the data, the reasoning being that fraudulent transactions should be different from normal transactions, i.e. should belong to different groups of the data. Using these methods would mean throwing away all the information contained in the Insp column.

Supervised Techniques

Supervised learning assumes all observations have a value for the target variable and tries to approximate the unknown function Y = f(X1, ..., Xp) using the available data. Using these methods would mean throwing away all data with Insp = unkn, which are the majority (96%).

100*sum(sales$Insp == "unkn")/nrow(sales)
## [1] 96.0705

Semi-Supervised Techniques

Semi-supervised learning tries to take advantage of all existing data. There are two main approaches:

  • Use unsupervised methods trying to incorporate the knowledge about the target variable (Insp) available in some cases, into the criteria guiding the formation of the groups
  • Use supervised methods trying to “guess” the target variable of some cases, in order to increase the number of available training cases

Unfortunately, not too many methods exist for this type of approach.

The Goals of our Data Mining Task

Inspection/auditing activities are typically constrained by limited resources, and inspecting all cases is usually not possible. In this context, the best outcome from a data mining tool is an inspection ranking. This type of result allows users to allocate their (limited) resources to the most promising cases and stop when necessary.

Before checking how to obtain inspection rankings, let us see how to evaluate this type of output. Our application has an additional difficulty - labeled and unlabeled cases. If an unlabeled case appears in the top positions of the inspection ranking, how do we know whether this is correct?

Precision/Recall

Recall the concepts of precision and recall. Frauds are rare events. The precision/recall evaluation setting allows us to focus on the performance of a model on a specific class (in this case the frauds). Precision will be measured as the proportion of the cases the model says are fraudulent that actually are frauds. Recall will be measured as the proportion of the actual frauds that the model signals as such.

Usually there is a trade-off between precision and recall - it is easy to get 100% recall by forecasting all cases as frauds. In a setting with limited inspection resources, what we really aim at is maximizing the use of these resources. Recall is the key issue in this application - we want to capture as many frauds as possible with our limited inspection resources.
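
As a tiny worked example (the confusion matrix below is made up), precision and recall can be read off a confusion matrix as follows:

cm <- matrix(c(30, 70,     # predicted fraud: 30 real frauds, 70 false alarms
               10, 890),   # predicted ok:    10 missed frauds, 890 real oks
             nrow=2, byrow=TRUE,
             dimnames=list(pred=c('fraud','ok'), true=c('fraud','ok')))
prec <- cm['fraud','fraud'] / sum(cm['fraud',])  # 30/100 = 0.30
rec  <- cm['fraud','fraud'] / sum(cm[,'fraud'])  # 30/40  = 0.75
c(Precision=prec, Recall=rec)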

Precision/Recall curves show the value of precision for different values of recall. They can be easily obtained for two-class problems, where one class is the target class (e.g. the frauds). They can be obtained with the function PRcurve() from package DMwR (which uses functions from package ROCR). This function takes as arguments two equal-length vectors: the predicted scores and the true class labels. As an illustration, let’s take a look at these curves.

data(ROCR.simple) 
head(ROCR.simple$predictions)
## [1] 0.6125478 0.3642710 0.4321361 0.1402911 0.3848959 0.2444155
head(ROCR.simple$labels)
## [1] 1 1 0 0 0 1
PRcurve(ROCR.simple$predictions, ROCR.simple$labels, main="Precision/Recall Curve")

Lift Charts

Lift charts provide a different perspective on the model predictions, giving more importance to the values of recall. On the x-axis they have the rate of positive predictions (RPP), which is the probability that the model predicts a positive class. On the y-axis they have the value of recall divided by the value of RPP. Lift charts are almost what we want for our application: what we really want to know is the value of recall for different inspection effort levels (which is captured by RPP). We will call these graphs Cumulative Recall charts; they are obtained with the function CRchart() of package DMwR (again using functionality of package ROCR). An example is shown below.

CRchart(ROCR.simple$predictions, ROCR.simple$labels , main='Cumulative Recall Chart')
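
For reference, a similar chart can be built directly with ROCR, which is, to the best of my understanding, essentially what CRchart() does internally:

pred <- prediction(ROCR.simple$predictions, ROCR.simple$labels)
plot(performance(pred, "rec", "rpp"), main="Recall vs. Rate of Positive Predictions")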

Turning back to our case, the previous measures only evaluate the quality of the rankings in terms of the labeled transactions - they are supervised metrics. The obtained rankings will surely contain unlabeled transaction reports - are these good inspection recommendations? We can compare their unit price with the typical price of the reports of the same product. We would expect this difference to be high for good recommendations, as a large difference is an indication that something is wrong with the report.

We use the Normalized Distance to Typical Price (NDTP) to measure this difference. For a transaction of product p with unit price u it is defined as NDTP_p(u) = |u - U_p| / IQR_p, where U_p is the typical price of product p (measured by the median) and IQR_p is the respective inter-quartile range. The higher the value of NDTP, the more atypical the unit price of a transaction, and thus the better the recommendation for inspection. The following function calculates the average NDTP:

avgNDTP <- function(toInsp,train,stats){
  if (missing(train) && missing(stats)) 
    stop('Provide either the training data or the product stats') 
  if (missing(stats)){
    stats <- as.matrix(filter(train,Insp != 'fraud') %>% 
                         group_by(Prod) %>% 
                         summarise(median=median(Uprice),iqr=IQR(Uprice)) %>% 
                         select(median,iqr)) 
    rownames(stats) <- levels(train$Prod) 
    stats[which(stats[,'iqr']==0),'iqr'] <- stats[which(stats[,'iqr']==0),'median'] 
  }
  return(mean(abs(toInsp$Uprice-stats[toInsp$Prod,'median']) / stats[toInsp$Prod,'iqr']))
}

Given the size of the data set, the Holdout method is an adequate methodology. We will split the data into two partitions - 70% for training and 30% for testing. We can repeat the random split several times to increase the statistical significance of our estimates. Our data set has a “problem” - class imbalance. To avoid sampling bias we should use stratified sampling, which consists of making sure the train and test sets have the same class distribution as the original data.
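
Just to illustrate what stratified sampling means here (the performanceEstimation package will handle this for us via Holdout(..., strat=TRUE)), a minimal sketch of a stratified 70/30 split could look like this:

set.seed(1234)
trIdx <- unlist(lapply(split(seq_len(nrow(sales)), sales$Insp),
                       function(i) sample(i, round(0.7 * length(i)))))  # sample within each class
round(100 * table(sales$Insp[trIdx]) / length(trIdx), 2)                  # train distribution
round(100 * table(sales$Insp[-trIdx]) / (nrow(sales) - length(trIdx)), 2) # test distribution (should match)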

In summary, we will evaluate the inspection rankings proposed by the alternative models in terms of precision, recall and average NDTP. We will obtain reliable estimates of these evaluation metrics using several repetitions of a 70%/30% Holdout with stratified sampling.

The following function evaluates an inspection ranking proposal (returns precision, recall and average NDTP):

evalOutlierRanking <- function(testSet,rankOrder,Threshold,statsProds,...){
  ordTS <- testSet[rankOrder,] # ordered test set
  N <- nrow(testSet) 
  nF <- if (Threshold < 1) as.integer(Threshold*N) else Threshold 
  cm <- table(c(rep('fraud',nF),rep('ok',N-nF)),ordTS$Insp) 
  prec <- cm['fraud','fraud']/sum(cm['fraud',]) 
  rec <- cm['fraud','fraud']/sum(cm[,'fraud']) 
  AVGndtp <- avgNDTP(ordTS[1:nF,],stats=statsProds) 
  return(c(Precision=prec,Recall=rec,avgNDTP=AVGndtp)) 
  }

We will call the previous function with the following information on the typical prices of each product (parameter statsProds):

globalStats <- as.matrix(filter(sales,Insp != 'fraud') %>% 
                           group_by(Prod) %>% 
                           summarise(median=median(Uprice),iqr=IQR(Uprice)) %>% 
                           select(median,iqr)) 
rownames(globalStats) <- levels(sales$Prod) 
globalStats[which(globalStats[,'iqr']==0), 'iqr'] <- 
  globalStats[which(globalStats[,'iqr']==0), 'median'] 
head(globalStats,3)
##      median      iqr
## p1 11.34615 8.563580
## p2 10.87786 5.609731
## p3 10.00000 4.809092

Modified Box-Plot Rule

Recall that the simple baseline approach uses a box-plot rule to tag as suspicious the transactions whose unit price is an outlier according to this rule. This rule only tags each transaction as an outlier or a non-outlier - but we need outlier rankings. We will use the NDTP values as outlier scores and thus produce a ranking.

We will use the performanceEstimation package to estimate the scores of the selected metrics (precision, recall and average NDTP). Estimates will be obtained using a Holdout experiment leaving 30% of the cases as the test set. We will repeat the (random) train/test split 3 times to increase the significance of our estimates. We will calculate our metrics assuming that we can only audit 10% of the test set.

The Workflow implementing the modified Box-Plot Rule is as follows:

BPrule.wf <- function(form,train,test,...){
  require(dplyr, quietly=TRUE)
  ms <- as.matrix(filter(train,Insp != 'fraud') %>% 
                    group_by(Prod) %>% 
                    summarise(median=median(Uprice),iqr=IQR(Uprice)) %>% 
                    select(median,iqr)) 
  rownames(ms) <- levels(train$Prod) 
  ms[which(ms[,'iqr']==0),'iqr'] <- ms[which(ms[,'iqr']==0),'median'] 
  ORscore <- abs(test$Uprice-ms[test$Prod,'median']) / ms[test$Prod,'iqr'] 
  rankOrder <- order(ORscore,decreasing=T) 
  res <- list(testSet=test, rankOrder=rankOrder, 
              probs=matrix(c(ORscore,ifelse(test$Insp=='fraud',1,0)), ncol=2)) 
  res 
}

Let’s go ahead and evaluate the modified Box-Plot Rule.

bp.res <- performanceEstimation( 
  PredTask(Insp ~ .,sales), 
  Workflow("BPrule.wf"), 
  EstimationTask(metrics=c("Precision","Recall","avgNDTP"),
                 method=Holdout(nReps=3,hldSz=0.3,strat=TRUE), # stratified sampling 
                 evaluator="evalOutlierRanking", 
                 evaluator.pars=list(Threshold=0.1, statsProds=globalStats)) 
  )
## 
## 
## ##### PERFORMANCE ESTIMATION USING  HOLD OUT  #####
## 
## ** PREDICTIVE TASK :: sales.Insp
## 
## ++ MODEL/WORKFLOW :: BPrule.wf 
## Task for estimating  Precision,Recall,avgNDTP  using
## Stratified  3 x 70 % / 30 % Holdout
##   Run with seed =  1234 
## Iteration :  1  2  3
summary(bp.res)
## 
## == Summary of a  Hold Out Performance Estimation Experiment ==
## 
## Task for estimating  Precision,Recall,avgNDTP  using
## Stratified  3 x 70 % / 30 % Holdout
##   Run with seed =  1234 
## 
## * Predictive Tasks ::  sales.Insp
## * Workflows  ::  BPrule.wf 
## 
## -> Task:  sales.Insp
##   *Workflow: BPrule.wf 
##            Precision     Recall    avgNDTP
## avg     0.0166583375 0.52631579 11.1748886
## std     0.0006505289 0.02055329  0.9004590
## med     0.0163251707 0.51578947 10.7913476
## iqr     0.0005830418 0.01842105  0.8369579
## min     0.0162418791 0.51315789 10.5297012
## max     0.0174079627 0.55000000 12.2036171
## invalid 0.0000000000 0.00000000  0.0000000

The precision is low. The recall is reasonable given the threshold of 0.1. The NDTP is notably high (likely hard to beat!).

Let’s now take a look at the performance for different effort levels.

par(mfrow=c(1,2)) 
ps.bp <- sapply(getIterationsInfo(bp.res),function(i) i$probs[,1]) 
ts.bp <- sapply(getIterationsInfo(bp.res),function(i) i$probs[,2]) 
PRcurve(ps.bp,ts.bp,main="PR curve",avg="vertical") 
CRchart(ps.bp,ts.bp,main='Cumulative Recall curve',avg='vertical')

Local Outlier Factor (LOF)

LOF is probably the best-known outlier detection algorithm. The result of LOF is a local outlier score for each case, which means the result can be regarded as a kind of ranking of the outlyingness of the cases in the data set. LOF is based on the notion of the local distance of each case to its nearest neighbors. It analyzes this distance in the context of the typical distances within that neighborhood, and this way it can cope with data sets of varying density.

The k-distance of an object o (distk(o)) is the distance from o to its kth nearest neighbor. The k-distance neighborhood of o (Nk(o)) is the set of the k nearest neighbors of o. The reachability-distance of an object o with respect to another object p is defined as reach.distk(o,p) = max{distk(p), d(o,p)}. If p and o are far away from each other, the reachability-distance equals their actual distance; if they are close to each other, it is substituted by distk(p). The local reachability density of an object is the inverse of the average reachability-distance over its k-neighborhood. The LOF score of an observation then captures the degree to which we can consider it an outlier.
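
As a tiny illustration of what LOF scores look like (the unit prices below are made up: one clearly deviant value among otherwise similar ones), using the lofactor() function introduced next:

up <- c(10, 11, 9.5, 10.2, 10.8, 11.1, 9.9, 50)  # the last value is clearly deviant
round(lofactor(up, k=3), 2)                      # the deviant case should get by far the largest score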

The function lofactor() in package DMwR2 implements LOF. The following implements a workflow using this function:

LOF.wf <- function(form, train, test, k, ...) {
  require(DMwR2, quietly=TRUE)
  ntr <- nrow(train)
  all <- as.data.frame(rbind(train,test))
  N <- nrow(all)
  ups <- split(all$Uprice,all$Prod)
  r <- list(length=ups)
  for(u in seq(along=ups)) 
    r[[u]] <- if (NROW(ups[[u]]) > 3) 
      lofactor(ups[[u]],min(k,NROW(ups[[u]]) %/% 2)) 
  else if (NROW(ups[[u]])) rep(0,NROW(ups[[u]])) 
  else NULL
  all$lof <- vector(length=N)
  split(all$lof,all$Prod) <- r
  all$lof[which(!(is.infinite(all$lof) | is.nan(all$lof)))] <- 
    SoftMax(all$lof[which(!(is.infinite(all$lof) | is.nan(all$lof)))])

  res <- list(testSet=test,
              rankOrder=order(all[(ntr+1):N,'lof'],decreasing=T),
              probs=as.matrix(cbind(all[(ntr+1):N,'lof'],
                                    ifelse(test$Insp=='fraud',1,0))))
  res
}

Let’s evaluate the LOF.

lof.res <- performanceEstimation( 
  PredTask(Insp ~ .,sales), 
  Workflow("LOF.wf",k=7), 
  EstimationTask(metrics=c("Precision","Recall","avgNDTP"),
                 method=Holdout(nReps=3,hldSz=0.3,strat=TRUE), 
                 evaluator="evalOutlierRanking", 
                 evaluator.pars=list(Threshold=0.1, 
                                     statsProds=globalStats)) 
  )
## 
## 
## ##### PERFORMANCE ESTIMATION USING  HOLD OUT  #####
## 
## ** PREDICTIVE TASK :: sales.Insp
## 
## ++ MODEL/WORKFLOW :: LOF.wf 
## Task for estimating  Precision,Recall,avgNDTP  using
## Stratified  3 x 70 % / 30 % Holdout
##   Run with seed =  1234 
## Iteration :  1  2  3
summary(lof.res)
## 
## == Summary of a  Hold Out Performance Estimation Experiment ==
## 
## Task for estimating  Precision,Recall,avgNDTP  using
## Stratified  3 x 70 % / 30 % Holdout
##   Run with seed =  1234 
## 
## * Predictive Tasks ::  sales.Insp
## * Workflows  ::  LOF.wf 
## 
## -> Task:  sales.Insp
##   *Workflow: LOF.wf 
##            Precision     Recall   avgNDTP
## avg     0.0221000611 0.69824561 8.7661376
## std     0.0006251502 0.01975146 0.9362724
## med     0.0220722972 0.69736842 8.4358879
## iqr     0.0006246877 0.01973684 0.8915197
## min     0.0214892554 0.67894737 8.0397428
## max     0.0227386307 0.71842105 9.8227821
## invalid 0.0000000000 0.00000000 0.0000000

Compared to the Box-Plot rule, the LOF produces slightly higher precision and recall, but a lower average NDTP.

Let us compare the Box-Plot Rule vs the LOF.

par(mfrow=c(1,2)) 
ps.lof <- sapply(getIterationsInfo(lof.res), function(i) i$probs[,1]) 
ts.lof <- sapply(getIterationsInfo(lof.res), function(i) i$probs[,2])

PRcurve(ps.bp,ts.bp,main="PR curve",lty=1, 
        xlim=c(0,1),ylim=c(0,1),avg="vertical") 
PRcurve(ps.lof,ts.lof,add=T,lty=2,avg='vertical') 
legend('topright',c('BPrule','LOF'),lty=c(1,2))

CRchart(ps.bp,ts.bp,main='Cumulative Recall curve', 
        lty=1,xlim=c(0,1),ylim=c(0,1),avg='vertical') 
CRchart(ps.lof,ts.lof,add=T,lty=2,avg='vertical') 
legend('bottomright',c('BPrule','LOF'),lty=c(1,2))

The Box-Plot rule has higher precision at lower levels of recall than the LOF, while the LOF achieves higher recall at lower rates of positive predictions.

Let us now repeat the experiments with the BP rule and LOF, this time using 1% as the available inspection effort.

bp.res1 <- performanceEstimation( 
  PredTask(Insp ~ .,sales), 
  Workflow("BPrule.wf"), 
  EstimationTask(metrics=c("Precision","Recall","avgNDTP"),
                 method=Holdout(nReps=3,hldSz=0.3,strat=TRUE), # stratified sampling 
                 evaluator="evalOutlierRanking", 
                 evaluator.pars=list(Threshold=0.01, statsProds=globalStats)) 
  )
## 
## 
## ##### PERFORMANCE ESTIMATION USING  HOLD OUT  #####
## 
## ** PREDICTIVE TASK :: sales.Insp
## 
## ++ MODEL/WORKFLOW :: BPrule.wf 
## Task for estimating  Precision,Recall,avgNDTP  using
## Stratified  3 x 70 % / 30 % Holdout
##   Run with seed =  1234 
## Iteration :  1  2  3
summary(bp.res1)
## 
## == Summary of a  Hold Out Performance Estimation Experiment ==
## 
## Task for estimating  Precision,Recall,avgNDTP  using
## Stratified  3 x 70 % / 30 % Holdout
##   Run with seed =  1234 
## 
## * Predictive Tasks ::  sales.Insp
## * Workflows  ::  BPrule.wf 
## 
## -> Task:  sales.Insp
##   *Workflow: BPrule.wf 
##           Precision      Recall   avgNDTP
## avg     0.116111111 0.366666667 79.430446
## std     0.001272938 0.004019803  8.621318
## med     0.115833333 0.365789474 75.743465
## iqr     0.001250000 0.003947368  8.008230
## min     0.115000000 0.363157895 73.265706
## max     0.117500000 0.371052632 89.282167
## invalid 0.000000000 0.000000000  0.000000
lof.res1 <- performanceEstimation( 
  PredTask(Insp ~ .,sales), 
  Workflow("LOF.wf",k=7), 
  EstimationTask(metrics=c("Precision","Recall","avgNDTP"),
                 method=Holdout(nReps=3,hldSz=0.3,strat=TRUE), 
                 evaluator="evalOutlierRanking", 
                 evaluator.pars=list(Threshold=0.01, 
                                     statsProds=globalStats)) 
  )
## 
## 
## ##### PERFORMANCE ESTIMATION USING  HOLD OUT  #####
## 
## ** PREDICTIVE TASK :: sales.Insp
## 
## ++ MODEL/WORKFLOW :: LOF.wf 
## Task for estimating  Precision,Recall,avgNDTP  using
## Stratified  3 x 70 % / 30 % Holdout
##   Run with seed =  1234 
## Iteration :  1  2  3
summary(lof.res1)
## 
## == Summary of a  Hold Out Performance Estimation Experiment ==
## 
## Task for estimating  Precision,Recall,avgNDTP  using
## Stratified  3 x 70 % / 30 % Holdout
##   Run with seed =  1234 
## 
## * Predictive Tasks ::  sales.Insp
## * Workflows  ::  LOF.wf 
## 
## -> Task:  sales.Insp
##   *Workflow: LOF.wf 
##           Precision     Recall   avgNDTP
## avg     0.068611111 0.21666667 60.529592
## std     0.004589642 0.01449361  7.963378
## med     0.070833333 0.22368421 56.054047
## iqr     0.004166667 0.01315789  6.956473
## min     0.063333333 0.20000000 55.810891
## max     0.071666667 0.22631579 69.723837
## invalid 0.000000000 0.00000000  0.000000

At 1% inspection effort, the precision is higher while the recall is notably lower, and the average NDTP is much higher as well. This makes sense: the ranking itself does not change, so lowering the threshold means inspecting only the top-ranked cases, for which the model is most confident. Thus we would expect precision to increase and recall to decrease. We are also computing the NDTP only on the largest outliers, hence we should expect a large increase.

Let us compare the Box-Plot Rule vs the LOF based on the 1% inspection effort.

par(mfrow=c(1,2)) 
ps.bp1 <- sapply(getIterationsInfo(bp.res1),function(i) i$probs[,1]) 
ts.bp1 <- sapply(getIterationsInfo(bp.res1),function(i) i$probs[,2]) 
ps.lof1 <- sapply(getIterationsInfo(lof.res1), function(i) i$probs[,1]) 
ts.lof1 <- sapply(getIterationsInfo(lof.res1), function(i) i$probs[,2])

PRcurve(ps.bp1,ts.bp1,main="PR curve (Threshold = 1%)",lty=1, 
        xlim=c(0,1),ylim=c(0,1),avg="vertical") 
PRcurve(ps.lof1,ts.lof1,add=T,lty=2,avg='vertical') 
legend('topright',c('BPrule','LOF'),lty=c(1,2))

CRchart(ps.bp1,ts.bp1,main='Cumulative Recall curve (Threshold = 1%)', 
        lty=1,xlim=c(0,1),ylim=c(0,1),avg='vertical') 
CRchart(ps.lof1,ts.lof1,add=T,lty=2,avg='vertical') 
legend('bottomright',c('BPrule','LOF'),lty=c(1,2))

We could also create a new workflow for the boxplot rule that uses the mean and the standard deviation to calculate the outlier scores, instead of the median and IQR. We could also re-run the experiments with the LOF workflow using different values of the k parameter (e.g. 5 and 10) to see their impact on the results. This is left as an exercise for the interested reader.
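
As a starting point for that exercise, a sketch of the mean/standard-deviation variant might look like the following (untested; it simply swaps the statistics used in BPrule.wf and guards against zero or undefined spreads):

BPrule.meansd.wf <- function(form,train,test,...){
  require(dplyr, quietly=TRUE)
  ms <- as.matrix(filter(train,Insp != 'fraud') %>% 
                    group_by(Prod) %>% 
                    summarise(center=mean(Uprice),spread=sd(Uprice)) %>% 
                    select(center,spread)) 
  rownames(ms) <- levels(train$Prod) 
  bad <- which(ms[,'spread']==0 | is.na(ms[,'spread']))  # e.g. single-transaction products
  ms[bad,'spread'] <- ms[bad,'center'] 
  ORscore <- abs(test$Uprice-ms[test$Prod,'center']) / ms[test$Prod,'spread'] 
  rankOrder <- order(ORscore,decreasing=TRUE) 
  list(testSet=test, rankOrder=rankOrder, 
       probs=matrix(c(ORscore,ifelse(test$Insp=='fraud',1,0)), ncol=2)) 
}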

Classification Approaches

If we consider only the OK and fraud cases we have a supervised classification task. However, we have a highly unbalanced class distribution.

table(sales[sales$Insp != 'unkn','Insp'])/ 
  nrow(sales[sales$Insp != 'unkn',])
## 
##        ok      unkn     fraud 
## 0.9193692 0.0000000 0.0806308

This type of problem creates serious difficulties for most classification algorithms: they tend to focus on the prevailing class value. The problem is that this class is the least important one in this application. There are essentially two ways of addressing this: we can change the learning algorithms, namely their preference bias criteria, or we can change the distribution of the data to make it more balanced.

Sampling Methods - SMOTE (Synthetic Minority Over-sampling Technique)

Sampling methods change the distribution of the data by sampling over the available data set. The goal is to make the training sample more adequate for our objectives. SMOTE is amongst the most successful sampling methods. It over-samples the minority class, creating new synthetic cases by interpolating between existing minority cases, and (in the variants used here) it also under-samples the majority class. The result is a less unbalanced data set. Package UBL implements several sampling approaches, including SMOTE, through the function SmoteClassif(). The result of this function is a new (more balanced) data set.
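
A small illustration of SmoteClassif() on made-up unbalanced data (the data frame, column names and proportions here are invented for the example):

set.seed(1234)
toy <- data.frame(x1 = c(rnorm(200), rnorm(20, mean=3)),
                  x2 = c(rnorm(200), rnorm(20, mean=3)),
                  cl = factor(rep(c("ok","fraud"), c(200, 20))))
table(toy$cl)
toyBal <- SmoteClassif(cl ~ ., toy, C.perc="balance")
table(toyBal$cl)  # the classes should now be (approximately) balanced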

Bayesian Classification

Bayesian classifiers are statistical classifiers - they predict the probability that a case belongs to a certain class. Bayesian classification is based on Bayes’ theorem. More specifically, if Y is the domain of the target variable y, we want to estimate the probability of each of its possible values given the test case (the evidence) x. A particular class of Bayesian classifiers - the Naive Bayes classifier - has shown rather competitive performance on several problems, even when compared to more “sophisticated” methods. The Naive Bayes classifier is “naive” in that it assumes class conditional independence, i.e. that there is no dependence relationship among the predictors of the problem.
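
Concretely, for a class value c and a case x, Bayes’ theorem gives P(c | x) = P(x | c) P(c) / P(x), and the “naive” assumption lets P(x | c) factor into a product over the individual predictors. Below is a minimal naiveBayes() illustration on the built-in iris data, just to show the type='raw' class-probability output that the workflow below relies on:

nb <- naiveBayes(Species ~ ., data = iris)
head(predict(nb, iris, type = "raw"), 3)  # one probability per class, per case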

Naive Bayes is available in R in the package e1071, through the function naiveBayes(). Let us use Naive Bayes for our task, together with SMOTE. The workflow is as follows:

NBsm.wf <- function(form,train,test,C.perc="balance",dist="HEOM",...) {
    require(e1071,quietly=TRUE)
    require(UBL,quietly=TRUE)

    sup <- which(train$Insp != 'unkn')
    data <- as.data.frame(train[sup,c('ID','Prod','Uprice','Insp')])
    data$Insp <- factor(data$Insp,levels=c('ok','fraud'))
    newData <- SmoteClassif(Insp ~ .,data,C.perc=C.perc,dist=dist,...)
    model <- naiveBayes(Insp ~ .,newData)
    preds <- predict(model,test[,c('ID','Prod','Uprice','Insp')],type='raw')
    rankOrder <- order(preds[,'fraud'],decreasing=T)
    rankScore <- preds[,'fraud']
    
    res <- list(testSet=test,
              rankOrder=rankOrder,
              probs=as.matrix(cbind(rankScore,
                                    ifelse(test$Insp=='fraud',1,0))))
    res
}

Let us evaluate the Naive Bayes model.

nbs.res <- performanceEstimation( 
  PredTask(Insp ~ ., sales), 
  Workflow("NBsm.wf"), 
  EstimationTask(metrics = c("Precision","Recall","avgNDTP"), 
                 method = Holdout(nReps=3, hldSz=0.3, strat=TRUE), 
                 evaluator = "evalOutlierRanking", 
                 evaluator.pars = list(Threshold=0.1, statsProds=globalStats)) 
  )
## 
## 
## ##### PERFORMANCE ESTIMATION USING  HOLD OUT  #####
## 
## ** PREDICTIVE TASK :: sales.Insp
## 
## ++ MODEL/WORKFLOW :: NBsm.wf 
## Task for estimating  Precision,Recall,avgNDTP  using
## Stratified  3 x 70 % / 30 % Holdout
##   Run with seed =  1234 
## Iteration :  1  2  3
summary(nbs.res)
## 
## == Summary of a  Hold Out Performance Estimation Experiment ==
## 
## Task for estimating  Precision,Recall,avgNDTP  using
## Stratified  3 x 70 % / 30 % Holdout
##   Run with seed =  1234 
## 
## * Predictive Tasks ::  sales.Insp
## * Workflows  ::  NBsm.wf 
## 
## -> Task:  sales.Insp
##   *Workflow: NBsm.wf 
##           Precision     Recall   avgNDTP
## avg     0.013909712 0.43947368 6.4017351
## std     0.001332667 0.04210526 0.8850754
## med     0.013909712 0.43947368 6.0882813
## iqr     0.001332667 0.04210526 0.8424183
## min     0.012577045 0.39736842 5.7160437
## max     0.015242379 0.48157895 7.4008803
## invalid 0.000000000 0.00000000 0.0000000

We compare the Naive Bayes against the LOF.

par(mfrow=c(1,2)) 
ps.nbs <- sapply(getIterationsInfo(nbs.res), function(i) i$probs[,1]) 
ts.nbs <- sapply(getIterationsInfo(nbs.res), function(i) i$probs[,2]) 

PRcurve(ps.nbs,ts.nbs,main="PR curve",
        lty=1,xlim=c(0,1), ylim=c(0,1),avg="vertical") 
PRcurve(ps.lof,ts.lof,add=T,lty=1,col="red",avg='vertical') 
legend('topright',c('smote Naive Bayes','LOF'),lty=c(1,1),col=c("black","red")) 

CRchart(ps.nbs,ts.nbs,main='Cumulative Recall curve',
        lty=1,xlim=c(0,1),ylim=c(0,1),avg='vertical') 
CRchart(ps.lof,ts.lof,add=T,lty=1,col="red",avg='vertical') 
legend('bottomright',c('smote Naive Bayes','LOF'),lty=c(1,1),col=c("black","red"))

The LOF is superior to the smote Naive Bayes.

AdaBoost Algorithm

The AdaBoost method is a boosting algorithm that belongs to the class of ensemble methods. Boosting was developed with the goal of answering the question: can a set of weak learners create a single strong learner? Here a “weak” learner is a model that, on its own, has poor predictive performance (only slightly better than random guessing).

Boosting algorithms iteratively build a strong learner by adding a new weak learner to the ensemble at each iteration. Weak learners are added with weights that reflect their accuracy. After each addition the data are re-weighted so that cases that are still poorly predicted gain more weight in the next iteration. This means that each new weak learner focuses on the errors of the previous ones.

AdaBoost (Adaptive Boosting) is an ensemble algorithm that can be used to improve the performance of a base algorithm. It consists of an iterative process where new models are added to form an ensemble. It is adaptive in the sense that at each new iteration the new models are built to try to overcome the errors made in the previous iterations: the weights of the training cases are adjusted so that cases that were wrongly predicted get their weight increased, making the new models focus on predicting them accurately. AdaBoost was created for classification, although variants for regression exist.

In terms of accuracy, ensemble models generally outperform single models, but they are usually not as interpretable. Boosting is also more robust to unbalanced data.
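
To make the re-weighting idea concrete, below is a minimal hand-rolled sketch of the mechanism using threshold “stumps” on made-up data (illustrative only; the real implementations used later are far more general):

set.seed(1234)
x <- runif(200); y <- ifelse(x > 0.6, 1, -1)
noisy <- sample(200, 20); y[noisy] <- -y[noisy]       # flip 10% of the labels
w <- rep(1/200, 200)                                  # start with uniform case weights
M <- 10; alphas <- numeric(M); cuts <- numeric(M)
for (m in 1:M) {
  ths <- seq(0.05, 0.95, by=0.05)                     # weak learner: best threshold stump
  errs <- sapply(ths, function(t) sum(w[ifelse(x > t, 1, -1) != y]))
  cuts[m] <- ths[which.min(errs)]; err <- min(errs)
  alphas[m] <- 0.5 * log((1 - err)/err)               # more accurate learner => bigger vote
  miss <- ifelse(x > cuts[m], 1, -1) != y
  w <- w * exp(alphas[m] * ifelse(miss, 1, -1))       # up-weight the still-misclassified cases
  w <- w / sum(w)
}
ens <- sign(sapply(x, function(xi) sum(alphas * ifelse(xi > cuts, 1, -1))))
mean(ens == y)                                        # ensemble accuracy on the training data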

AdaBoost is available in several R packages. The package adabag is an example, with the function boosting(). The packages gbm and xgboost also provide boosting implementations (including for regression tasks). We will use the package RWeka, which provides the function AdaBoostM1(). The workflow is as follows:

ab.wf <- function(form,train,test,ntrees=100,...) {
    require(RWeka,quietly=TRUE)
    sup <- which(train$Insp != 'unkn')
    data <- as.data.frame(train[sup,c('ID','Prod','Uprice','Insp')])
    data$Insp <- factor(data$Insp,levels=c('ok','fraud'))
    model <- AdaBoostM1(Insp ~ .,data,
                        control=Weka_control(I=ntrees))
    preds <- predict(model,test[,c('ID','Prod','Uprice','Insp')],
                     type='probability')
    rankOrder <- order(preds[,"fraud"],decreasing=TRUE)
    rankScore <- preds[,"fraud"]
    res <- list(testSet=test,
                rankOrder=rankOrder,
                probs=as.matrix(cbind(rankScore,
                                      ifelse(test$Insp=='fraud',1,0))))
    res
}

Let us evaluate the Adaboost model.

ab.res <- performanceEstimation( 
  PredTask(Insp ~ .,sales), 
  Workflow("ab.wf"), 
  EstimationTask(metrics=c("Precision","Recall","avgNDTP"),
                 method=Holdout(nReps=3,hldSz=0.3,strat=TRUE), 
                 evaluator="evalOutlierRanking", 
                 evaluator.pars=list(Threshold=0.1, statsProds=globalStats)) 
  )
## 
## 
## ##### PERFORMANCE ESTIMATION USING  HOLD OUT  #####
## 
## ** PREDICTIVE TASK :: sales.Insp
## 
## ++ MODEL/WORKFLOW :: ab.wf 
## Task for estimating  Precision,Recall,avgNDTP  using
## Stratified  3 x 70 % / 30 % Holdout
##   Run with seed =  1234 
## Iteration :  1  2  3
summary(ab.res)
## 
## == Summary of a  Hold Out Performance Estimation Experiment ==
## 
## Task for estimating  Precision,Recall,avgNDTP  using
## Stratified  3 x 70 % / 30 % Holdout
##   Run with seed =  1234 
## 
## * Predictive Tasks ::  sales.Insp
## * Workflows  ::  ab.wf 
## 
## -> Task:  sales.Insp
##   *Workflow: ab.wf 
##            Precision     Recall   avgNDTP
## avg     0.0204897551 0.64736842 7.0543305
## std     0.0004997501 0.01578947 0.3744884
## med     0.0204897551 0.64736842 6.9309492
## iqr     0.0004997501 0.01578947 0.3589210
## min     0.0199900050 0.63157895 6.7571001
## max     0.0209895052 0.66315789 7.4749422
## invalid 0.0000000000 0.00000000 0.0000000

We now compare the Naive Bayes and LOF against AdaBoost.

par(mfrow=c(1,2)) 
ps.ab <- sapply(getIterationsInfo(ab.res), function(i) i$probs[,1]) 
ts.ab <- sapply(getIterationsInfo(ab.res), function(i) i$probs[,2]) 

PRcurve(ps.nbs,ts.nbs,main="PR curve",lty=1,xlim=c(0,1), ylim=c(0,1),avg="vertical") 
PRcurve(ps.lof,ts.lof,add=T,lty=1,col="red",avg='vertical') 
PRcurve(ps.ab,ts.ab,add=T,lty=1,col="green",avg='vertical') 
legend('topright',c('smote Naive Bayes','LOF','AdaBoost'), lty=1,col=c("black","red","green")) 
CRchart(ps.nbs,ts.nbs,main='Cumulative Recall curve', lty=1,xlim=c(0,1),ylim=c(0,1),avg='vertical') 
CRchart(ps.lof,ts.lof,add=T,lty=1,col="red",avg='vertical') 
CRchart(ps.ab,ts.ab,add=T,lty=1,col="green",avg='vertical') 
legend('bottomright',c('smote Naive Bayes','LOF','AdaBoost'), lty=1,col=c("black","red","green"))

The LOF remains superior. The cumulative recall curves are similar for the LOF and AdaBoost.

The Naive Bayes workflow above used a more balanced distribution obtained through SMOTE. Let us explore different SMOTE settings to see their impact on the results. We try the “extreme” setting, where the sampling percentages are automatically estimated so as to invert the distribution of examples across the existing classes, transforming the majority class into a minority and vice-versa. The modified workflow is as follows:

NBsm.wf1 <- function(form,train,test,C.perc="extreme",dist="HEOM",...) {
    require(e1071,quietly=TRUE)
    require(UBL,quietly=TRUE)

    sup <- which(train$Insp != 'unkn')
    data <- as.data.frame(train[sup,c('ID','Prod','Uprice','Insp')])
    data$Insp <- factor(data$Insp,levels=c('ok','fraud'))
    newData <- SmoteClassif(Insp ~ .,data,C.perc=C.perc,dist=dist,...)
    model <- naiveBayes(Insp ~ .,newData)
    preds <- predict(model,test[,c('ID','Prod','Uprice','Insp')],type='raw')
    rankOrder <- order(preds[,'fraud'],decreasing=T)
    rankScore <- preds[,'fraud']
    
    res <- list(testSet=test,
              rankOrder=rankOrder,
              probs=as.matrix(cbind(rankScore,
                                    ifelse(test$Insp=='fraud',1,0))))
    res
}

Let us evaluate this “extreme” Naive Bayes model.

nbs.res1 <- performanceEstimation( 
  PredTask(Insp ~ ., sales), 
  Workflow("NBsm.wf1"), 
  EstimationTask(metrics = c("Precision","Recall","avgNDTP"), 
                 method = Holdout(nReps=3, hldSz=0.3, strat=TRUE), 
                 evaluator = "evalOutlierRanking", 
                 evaluator.pars = list(Threshold=0.1, statsProds=globalStats)) 
  )
## 
## 
## ##### PERFORMANCE ESTIMATION USING  HOLD OUT  #####
## 
## ** PREDICTIVE TASK :: sales.Insp
## 
## ++ MODEL/WORKFLOW :: NBsm.wf1 
## Task for estimating  Precision,Recall,avgNDTP  using
## Stratified  3 x 70 % / 30 % Holdout
##   Run with seed =  1234 
## Iteration :  1  2  3
summary(nbs.res1)
## 
## == Summary of a  Hold Out Performance Estimation Experiment ==
## 
## Task for estimating  Precision,Recall,avgNDTP  using
## Stratified  3 x 70 % / 30 % Holdout
##   Run with seed =  1234 
## 
## * Predictive Tasks ::  sales.Insp
## * Workflows  ::  NBsm.wf1 
## 
## -> Task:  sales.Insp
##   *Workflow: NBsm.wf1 
##            Precision     Recall   avgNDTP
## avg     0.0143539341 0.45350877 6.3175151
## std     0.0005669548 0.01791279 0.8585546
## med     0.0141595869 0.44736842 5.9035335
## iqr     0.0005413960 0.01710526 0.7801156
## min     0.0139097118 0.43947368 5.7443903
## max     0.0149925037 0.47368421 7.3046215
## invalid 0.0000000000 0.00000000 0.0000000

The precision and recall improve slightly, while NDTP decreases.

We could also modify the Naive Bayes workflow to eliminate SMOTE altogether, i.e. use the original unbalanced data, and we could change the number of trees used in the ensemble obtained with the AdaBoost workflow. This is left as an exercise for the interested reader.
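
For the first part of that exercise, a sketch of the no-SMOTE variant could be as simple as dropping the SmoteClassif() step from NBsm.wf (untested):

NB.wf <- function(form,train,test,...) {
    require(e1071,quietly=TRUE)
    sup <- which(train$Insp != 'unkn')
    data <- as.data.frame(train[sup,c('ID','Prod','Uprice','Insp')])
    data$Insp <- factor(data$Insp,levels=c('ok','fraud'))
    model <- naiveBayes(Insp ~ .,data)              # trained on the original unbalanced data
    preds <- predict(model,test[,c('ID','Prod','Uprice','Insp')],type='raw')
    res <- list(testSet=test,
                rankOrder=order(preds[,'fraud'],decreasing=TRUE),
                probs=as.matrix(cbind(preds[,'fraud'],
                                      ifelse(test$Insp=='fraud',1,0))))
    res
}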

Semi-Supervised Approaches - Self Training

Self-training is a generic semi-supervised method that can be applied to any classification method. It is an iterative process consisting of the following main steps:

  • Build a classifier with the current labeled data
  • Use the model to classify the unlabeled data
  • Select the highest confidence classifications and add the respective cases with that classification to the labeled data
  • Repeat the process until some stopping criteria are met

The last classifier of this iterative process is the result of the semi-supervised process. The function SelfTrain() in package DMwR2 implements these ideas for any probabilistic classifier.

Self-training can be risky if the initial model is poor, as we may end up training the model on wrongly assigned labels.
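
To make the loop concrete, here is a conceptual sketch of self-training (this is not the actual SelfTrain() implementation; the learner/predictor interface and the target column name are invented for illustration):

selfTrainSketch <- function(labeled, unlabeled, learner, predictor,
                            thrConf=0.9, maxIts=10) {
  for (it in seq_len(maxIts)) {
    model <- learner(labeled)                     # 1. build a classifier on the labeled data
    p <- predictor(model, unlabeled)              # 2. data frame with $class and $conf per case
    sel <- which(p$conf >= thrConf)               # 3. keep only the confident classifications
    if (length(sel) == 0) break
    unlabeled$target[sel] <- p$class[sel]
    labeled <- rbind(labeled, unlabeled[sel, ])   # 4. move them into the labeled data
    unlabeled <- unlabeled[-sel, ]
    if (nrow(unlabeled) == 0) break
  }
  learner(labeled)                                # the last classifier is the final result
}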

Let us try to improve the AdaBoost with self-training.

pred.ada <- function(m,d) {
    p <- predict(m,d,type='probability')
    data.frame(cl=colnames(p)[apply(p,1,which.max)],
               p=apply(p,1,max)
               )
}

ab.st.wf <- function(form,train,test,ntrees=100,...) {
    require(RWeka,quietly=TRUE)
    require(DMwR2,quietly=TRUE)
    train <- as.data.frame(train[,c('ID','Prod','Uprice','Insp')])
    train[which(train$Insp == 'unkn'),'Insp'] <- NA
    train$Insp <- factor(train$Insp,levels=c('ok','fraud'))
    model <- SelfTrain(form,train,
                       learner='AdaBoostM1',
                       learner.pars=list(control=Weka_control(I=ntrees)),
                       pred='pred.ada')
    preds <- predict(model,test[,c('ID','Prod','Uprice','Insp')],
                     type='probability')

    rankOrder <- order(preds[,'fraud'],decreasing=T)
    rankScore <- preds[,"fraud"]
    
    res <- list(testSet=test,
              rankOrder=rankOrder,
              probs=as.matrix(cbind(rankScore,
                                    ifelse(test$Insp=='fraud',1,0))))
    res
}

Let’s evaluate the model.

ab.st.res <- performanceEstimation( 
  PredTask(Insp ~ .,sales), 
  Workflow("ab.st.wf"), 
  EstimationTask(metrics=c("Precision","Recall","avgNDTP"),
                 method=Holdout(nReps=3,hldSz=0.3,strat=TRUE), 
                 evaluator="evalOutlierRanking", 
                 evaluator.pars=list(Threshold=0.1, statsProds=globalStats)) 
  )
## 
## 
## ##### PERFORMANCE ESTIMATION USING  HOLD OUT  #####
## 
## ** PREDICTIVE TASK :: sales.Insp
## 
## ++ MODEL/WORKFLOW :: ab.st.wf 
## Task for estimating  Precision,Recall,avgNDTP  using
## Stratified  3 x 70 % / 30 % Holdout
##   Run with seed =  1234 
## Iteration :  1  2  3
summary(ab.st.res)
## 
## == Summary of a  Hold Out Performance Estimation Experiment ==
## 
## Task for estimating  Precision,Recall,avgNDTP  using
## Stratified  3 x 70 % / 30 % Holdout
##   Run with seed =  1234 
## 
## * Predictive Tasks ::  sales.Insp
## * Workflows  ::  ab.st.wf 
## 
## -> Task:  sales.Insp
##   *Workflow: ab.st.wf 
##           Precision     Recall   avgNDTP
## avg     0.021294908 0.67280702 7.9596032
## std     0.001087055 0.03434521 0.8194148
## med     0.021655839 0.68421053 7.8507389
## iqr     0.001041146 0.03289474 0.8139729
## min     0.020073297 0.63421053 7.2000624
## max     0.022155589 0.70000000 8.8280083
## invalid 0.000000000 0.00000000 0.0000000

There is a slight improvement over AdaBoost. We can see this in the charts below.

par(mfrow=c(1,2))

ps.ab.st <- sapply(getIterationsInfo(ab.st.res), function(i) i$probs[,1]) 
ts.ab.st <- sapply(getIterationsInfo(ab.st.res), function(i) i$probs[,2])
PRcurve(ps.nbs,ts.nbs,main="PR curve",lty=1,xlim=c(0,1), ylim=c(0,1),avg="vertical") 
PRcurve(ps.lof,ts.lof,add=T,lty=1,col="red",avg='vertical') 
PRcurve(ps.ab,ts.ab,add=T,lty=1,col="green",avg='vertical') 
PRcurve(ps.ab.st,ts.ab.st,add=T,lty=1,col="blue",avg='vertical') 
legend('topright',c('smote Naive Bayes','LOF','AdaBoost','AdaBoost-ST'),
       lty=1,col=c("black","red","green","blue"))

CRchart(ps.nbs,ts.nbs,main='Cumulative Recall curve', lty=1,xlim=c(0,1),ylim=c(0,1),avg='vertical') 
CRchart(ps.lof,ts.lof,add=T,lty=1,col="red",avg='vertical') 
CRchart(ps.ab,ts.ab,add=T,lty=1,col="green",avg='vertical') 
CRchart(ps.ab.st,ts.ab.st,add=T,lty=1,col="blue",avg='vertical') 
legend('bottomright',c('smote Naive Bayes','LOF','AdaBoost','AdaBoost-ST'),
       lty=1,col=c("black","red","green","blue"))