keywords: KS statistics, lift, gain, area under receiver operating characteristic (ROC) curve, true postive rate, false positive rate, true negative rate, false negative rate, sensitivity, specificity, confusion/contingency matrix

A common way to evaluate the performance of a binary classifier is simply through overall accuracy or misclassification rate. But this is often not sufficient when the power of picking up the trues as `true`

and/or picking up the falses as `false`

is of interest. Examples of metrics that consider these true positive and true negative predictability power are area under receiver operating characteristic (AUROC) curve, Kolmogorov Smirnov (KS) statistics, Gini coefficients etc. In this post, we will see the definitions and how to calculate these metrics. Then we will give an overview of the gains table and discuss how to construct a gains table.

A binary classifier is simply a classification model where the response has just two outcomes(Yes/No, 1/0, True/False, Male/Female, Good/Bad etc). A binary classifier can be made via logistic regression, regression tree, random forest, discriminant analysis, neural network, support vector machine and such. A good model, rather giving a straight forward decision (good/bad, 1/0), gives a probability being good or bad. One must decide a cutoff to label the predictions as good/bad or 1/0.

For this demonstration, we will work with the `mictrotus`

data from `Flury`

package. While this is an ideal data for demonstrating multocollinearity, we will use the data to demostrate metrics of binary classifier.

There are in total 288 observations, each representing a vole. 43 of them are classified as `multiplex`

and 46 are classified as `subterraneus`

. The rest are unknown. We will work with the first 80 observations for simplicity. For simplicity, and for ease of calculations, let us code the two classes as 1 and 0.

```
library(Flury)
data("microtus")
mdata=microtus[c(1:80),c(1,2, 5)] ## read only first 80, keep only three variables
## code 1 and 0
mdata$Group=ifelse(mdata$Group=="multiplex", 1, 0)
summary(mdata)
```

```
## Group M1Left Foramen
## Min. :0.0000 Min. :1640 Min. :3490
## 1st Qu.:0.0000 1st Qu.:1772 1st Qu.:3791
## Median :1.0000 Median :1896 Median :3958
## Mean :0.5375 Mean :1926 Mean :3948
## 3rd Qu.:1.0000 3rd Qu.:2070 3rd Qu.:4097
## Max. :1.0000 Max. :2479 Max. :4662
```

The aim of this post is not covering all the classifiers. So we will work with simple logistic regression. Now from several class project I know that many predictors are correlated. Using only two predictors gives a reasonably good model. So only two predictors were kept in the earlier step. Let us make a logistic model for our chosen data.

```
model=glm(Group~M1Left+Foramen, data=mdata, family = binomial())
summary(model$fitted.values)
```

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000871 0.0115191 0.6049780 0.5375000 0.9997978 1.0000000
```

Now what this model gives us, a bunch of values in between 1 and 0. These are actually probabilities of being the class coded with 1 (multiplex). Now, to find an overall accuracy or error rate, one must choose the cutoff value. 0.5 seems an intuitive and obvious choice for the cut off. Let us use this cut off and find the error rate:

```
## a function for error rate
get_Error_Rate=function(trues, predicted_prb, cutoff){
preds=ifelse(predicted_prb<cutoff,0,1)
tab=table(preds, trues)
round((tab[1,2]+tab[2,1])/sum(tab), 4)
}
get_Error_Rate(mdata$Group,model$fitted.values, 0.5)
```

`## [1] 0.05`

So, using 0.5 as cutoff gives us a 5% error rate. But how to choose the cutoff? One potential solution is get error rate at every possible cutoff and choose one gives the lowest error rate.

```
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## cutoff_value 0.1 0.15 0.2 0.25 0.3 0.3500 0.4000 0.45 0.50 0.5500
## errRate 0.1 0.10 0.1 0.10 0.1 0.0875 0.0625 0.05 0.05 0.0625
## [,11] [,12] [,13] [,14] [,15] [,16] [,17]
## cutoff_value 0.6000 0.65 0.70 0.75 0.80 0.8500 0.900
## errRate 0.0375 0.05 0.05 0.05 0.05 0.0625 0.075
```

Above figure shows that using 0.6 as cutoff actually gives a better overall accuracy. So using this value as cutoff is tempting.

Now one problem associated with the overall error rate is that it does not account anything about the power of picking trues as trues and falses as false. In certain cases true positive rate might be of more interest than the overall error rate. To consider the power of predicting trues as true and falses as false, we need different metric. Two such metrics and AUROC and KS statisitc.

To get to the ROC curve, let us discuss the true positive rate, true negative rate, false positive rate and false negative rate first. These metrics really depend on what we define as `positive`

and what we define as `negative`

. For the current example, let us define code 1 as positive.

It is the probability that the model detects an actual positve as postive. In other words, it is the proportion of the actual postives that are detected as positive by the model. This is also known as `sensitivity`

. Mathematically: \[
TPR=P(\hat{x}=1|x=1)
\]

It is the probability that the model detects an actual negative as negative. In other words, it is the proportion of the actual negatives that are detected as negative by the model. This is also known as `specificity`

. Mathematically: \[
TNR=P(\hat{x}=0|x=0)
\]

It is the probability that the model detects an actual negative as postive. In other words, it is the proportion of the actual negatives that are detected as positive by the model. This is also known as `Type I error`

. Mathematically: \[
FPR=P(\hat{x}=1|x=0)
\]

It is the probability that the model detects an actual positive as negative. In other words, it is the proportion of the actual negatives that are detected as positive by the model. This is also known as `Type II error`

. Mathematically: \[
FNR=P(\hat{x}=0|x=1)
\]

One good way to present the numbers is confusion matrix/contigency table. It shows the number of true postives, true negatives, false postives and false negatives. One can calculate TRP, FPR, TNR, FNR from the confusion matrix.

```
createConfusionMatrix=function(actual, preds, cutoff_value){
predClass=ifelse(preds<cutoff_value, 0, 1)
message("Cutoff is : ", cutoff_value)
table(actual,predClass)
}
## Confusion matrix for cutoff=0.5
createConfusionMatrix(mdata$Group,model$fitted.values, 0.5)
```

`## Cutoff is : 0.5`

```
## predClass
## actual 0 1
## 0 35 2
## 1 2 41
```

The confusion matrix shown above is for cutoff=0.5. Here, \[ TP=35,\ TN=41, \ FP=2, \ FN=2 \]

These numbers are sensitive to the cutoff used. For example, for cutoff 0.8, the confusion matirx is shown below. Here the FPR decreased, while the overall error rate remained same.

`createConfusionMatrix(mdata$Group,model$fitted.values, 0.8)`

`## Cutoff is : 0.8`

```
## predClass
## actual 0 1
## 0 37 0
## 1 4 39
```

A good model aims to icrease the TPR while decreasing the FPR. But these two change in opposite direction. For example, if all the observations are predicted to be positive then TPR is 100%, which is good. But on the other hand, FPR is also 100% which is not desirable. Similarly if all the observations are predicted to be negative, then both FPR and TPR becomes 0%. So using alone TPR or TNR does not seem very intuitive. So need other measure that cares about both.

Now that TPR and FPR changes as does cut off, one can calculate a bunch of TPR and FPR for different cutoff values. Then if we plot the TPR along y-axis and FPR along x-axis then resulting curve is known as ROC curve. For our case, let us get the ROC curve.

```
getTPRFPR=function(actual, predProb, cutoff){
if(cutoff==1){tpr=0; fpr=0}else{
if(cutoff==0){tpr=1; fpr=1}else{
predClass=ifelse(predProb<cutoff,0,1); tab=table(actual, predClass)
tpr=tab[2,2]/ ( tab[2,2]+tab[2,1] )
tnr=tab[1,1]/ ( tab[1,1]+tab[1,2] ); fpr=1-tnr
}
}
return(c(TPR=tpr, FPR=fpr))
}
cutoff_values=seq(0,1,0.05)->TPR->FPR
for(i in 1: length(TPR)){
metrics=getTPRFPR(mdata$Group,model$fitted.values,cutoff_values[i])
TPR[i]=metrics[1]
FPR[i]=metrics[2]
}
head(cbind(cutoff_values, TPR, FPR))
```

```
## cutoff_values TPR FPR
## [1,] 0.00 1.0000000 1.0000000
## [2,] 0.05 1.0000000 0.2972973
## [3,] 0.10 0.9767442 0.1891892
## [4,] 0.15 0.9767442 0.1891892
## [5,] 0.20 0.9534884 0.1621622
## [6,] 0.25 0.9534884 0.1621622
```

Above we plotted the ROC curve. The black line represents the ROC if the model predicted randomly. It is not hard to say that one would like to pull the curve northwest as this indicates high TPR and low FPR. The area under this curve is called the AUROC. Ideally it can be maximum 1. The higher the value, the better it is.

So, if we have two or more competitive models, AUROC can be used as a performance metric to choose one model.

We manually generated the ROC chart above. But we can do this using `ROCR`

package.

`library(ROCR)`

`## Warning: package 'ROCR' was built under R version 3.4.3`

`## Loading required package: gplots`

```
##
## Attaching package: 'gplots'
```

```
## The following object is masked from 'package:stats':
##
## lowess
```

```
pred=prediction(model$fitted.values, mdata$Group)
perf=performance(pred,"tpr","fpr")
TPRfromROCR=unlist(perf@y.values)
FPRfromROCR=unlist(perf@x.values)
plot(TPRfromROCR~FPRfromROCR, col="deepskyblue",
main="ROC plot- from ROCR package",type="o", pch=16);abline(0,1);grid()
```

```
## Get the AUC
print(performance(pred,"auc")@y.values[[1]])
```

`## [1] 0.9886864`

Above plot is made utilizing the `ROCR`

package. We see that there are more cutoff points taken by the in built `performance`

function.

The AUROC is calcualted to be 0.99. This is actually very close to the ideal value.

Along with AUROC, another measure is the KS statistic. It is the maximum difference between TPR and FPR. For our case, it is calculated to be 0.9302326, which is at cutoff = 0.6262503,

```
diff_TPRFPR=TPRfromROCR-FPRfromROCR
KS=max(diff_TPRFPR)
cutoffAtKS=unlist(perf@alpha.values)[which.max(diff_TPRFPR)]
print(c(KS, cutoffAtKS))
```

`## [1] 0.9302326 0.6262503`

In the figure below, a graphics of KS statistics is shown. Higher KS stat value is indicative of better model. If we have more than one models, then KS statistics can be used as a performance measure.

The above figure for KS statistics can be shown in a different way. In the above figure, we took the cutoff along X-axis, and TPR and FPR along Y-axis. Instead we could take the population depth along X and proportion of 1s and 0s along Y. The max difference of two plots will be the KS stat. A more detailed explanation is given below:

From the model, we will have some probabilities of success. First, we need to sort the observations in descending order of these probabilities.

Now we will start with the first observation. This is almost guaranteed that the first observation will have an actual label 1. So for depth=1, the proporiton of success will be \(\frac{1}{number\ of \ total \ 1s}\). And the proportion of failure would be \(\frac{0}{number\ of \ total \ 1s}\).

Now for each possible depth, calculate the success and failure proportion. We have to keep in mind that this is cumulative proportion. So, at depth=n, success proportion would be \(\frac{number\ of\ 1s\ in\ first\ n\ rank\ ordered\ observation}{number\ of \ total \ 1s}\) and failure proportion would be \(\frac{number\ of\ 0s\ in\ first\ n\ rank\ ordered\ observation}{number\ of \ total \ 1s}\)

Now we will plot the cumulative success and failure proportion against depth and calculate the KS stat as the maximum difference of the two.

Let us do it by hand.

```
AppendedData=as.data.frame(cbind(actuals=mdata$Group, probs=model$fitted.values))
orderedData=AppendedData[order(AppendedData$probs, decreasing = T),]
totalOnes=sum(orderedData$actuals)
totalZeros=nrow(orderedData)-sum(orderedData$actuals)
getprop=function(n){
myarray=orderedData$actuals[1:n]
summyarray=sum(myarray)
success_prop=summyarray/totalOnes
failure_prop=(n-summyarray)/totalZeros
return(c(success_prop, failure_prop))
}
nroworder=nrow(orderedData)
orderedData$cum_Success_Prop=rep(NA, nroworder)
orderedData$cum_Failure_Prop=rep(NA, nroworder)
for(i in 0:nroworder){
k=getprop(i)
orderedData$cum_Success_Prop[i]=k[1]
orderedData$cum_Failure_Prop[i]=k[2]
}
orderedData$Diff_Cum_Prop=orderedData$cum_Success_Prop-orderedData$cum_Failure_Prop
head(orderedData)
```

```
## actuals probs cum_Success_Prop cum_Failure_Prop Diff_Cum_Prop
## 26 1 1.0000000 0.02325581 0 0.02325581
## 11 1 1.0000000 0.04651163 0 0.04651163
## 17 1 0.9999995 0.06976744 0 0.06976744
## 5 1 0.9999989 0.09302326 0 0.09302326
## 14 1 0.9999987 0.11627907 0 0.11627907
## 42 1 0.9999958 0.13953488 0 0.13953488
```

`message("The KS stat as follows")`

`## The KS stat as follows`

`max(orderedData$Diff_Cum_Prop)`

`## [1] 0.9302326`

We can see that the same KS stat is returned.

We plotted the cumulative proportions against depth below. We see this plot has a different shape than the previous one. TPR and FPR started at 1 and ended at 0. In the later plot, the cumulative proportions started at zero and ended at 1. However, both has the same fundamental hysteresis-loop like shape. It is for the variable we are plotting along X, which is making the difference of the two plots.

To get the ROC curve, all that we need a set of TPR-FPR values at different settings. Previously we changed this setting by changing the cutoff. There is another approach, which is very efficient in a spread sheet. This requires to arrange the onbservation to be rank ordered (ordered by probability), just like we did to get the upward KS curve. The process is summarused as follows:

Rank order the observations, ordered by fitted/predicted probability.

Now for \(depth=n\), treat first n predictions as success, the rest as failures. With these predictions, get TPR and FPR.

Find TPR and FPR for all possible depth.

Now we have a number of TPR-FPR pairs. Plot TPR against FPR.

The resulting ROC curve will be same as before.

Let us hand code what we discussed above.. :)

All that we need just the original observations and the fitted/predicted probabilities.

```
## firsrt combine the actual class and fitted probabilities
AppendedData=as.data.frame(cbind(actuals=mdata$Group, probs=model$fitted.values))
## Rank order them, ordered by probs
ndata=AppendedData[order(AppendedData$probs, decreasing = T),]
## Create two empty vectors to save TPR and FPR values
TPRvals=rep(NA, nrow(ndata))->FPRvals
## get number of total ones and zeros. we will need these
totalOnes=sum(ndata$actuals)
nrows=nrow(ndata)
totalZeros=nrows-sum(ndata$actuals)
## run loop
for(n in 1:nrows){
preds=c(rep(1,n), rep(0, (nrows-n))) ## first n predicted 1, rest are zero
TruePositive=sum(preds*ndata$actuals) ## retain only if both actual and pred 1
## to get false positive
## we take only n
## multiply actuals and preds
## False postive will be zero
## this is sum-sum(product)
FalsePositive=n-sum( (preds*ndata$actuals) [1:n])
TPRvals[n]=TruePositive/totalOnes
FPRvals[n]=FalsePositive/totalZeros
}
## Now plot the ROC
plot(TPRvals~FPRvals, type="o", col="#7CAE00", pch=16,
main="ROC plot, hand calculated by rank ordering"); abline(0,1, lwd=2, col="navyblue"); grid()
```

Above ROC curve is exactly the same as we had before.

Gini coefficient is another measure of goodness of a binary classifier. The Gini coefficient is a ratio of two areas. The areas are:

The area between the ROC curve and the random model line (\(y=x\) line, passing through \((0,0)\) and \((1,1)\))

The top left triangle above the random model line (\(y=x\) line).

Now, the second area is just 0.5, since the domain and range of ROC are \([0,1]\). The first area is 0.5 less than the AUROC. So, Gini coefficient can be reduced to:

\[ Gini\ Coefficient=\frac{AUROC-0.5}{0.5}=\frac{AUROC}{0.5}-1=2\times AUROC-1 \]

So, Gini coefficient is just a scaled version of AUROC. So it is not a rocket science that, higher the Gini coefficient, better the model.

- will complete later