1.0 Introduction

As the world of machine learning continues to expand, new methods for evaluating classifiers are continually being developed. Among the most commonly used in practice are accuracy, sensitivity & specificity, ROC/AUC, Kappa, and lift. This report will explore these methods, highlight their differences, and analyze their usefulness in specific situations.

The spam dataset from the library kernlab will be used as an example throughout this report. The model in use will be logistic regression (unless otherwise specified), trained on the spam data and evaluated on a held-out test set of 500 observations.

Throughout, \(y \in \{0,1\}\) denotes the true class label. When \(y_{i} = 1\), we will say the observation belongs to the positive class, or that the observation is an event. Likewise, when \(y_{i} = 0\), we will say the observation does not belong to the positive class, or that the observation is a non-event.

library(kernlab)
library(MASS)
data(spam)

# Create the response Y (1 = spam, 0 = nonspam) and the predictor matrix X
Y <- as.numeric(spam[, ncol(spam)]) - 1
X <- spam[, -ncol(spam)]

# Create test and train indices: 500 records are held out as the test set
set.seed(150)
n <- length(Y)
i <- sample.int(n, size = 500, replace = FALSE)
train <- (1:n)[-i]
test <- (1:n)[i]

# Train logistic model
logistic.fit.train <- glm(Y[train] ~ ., data = X[train, ], family = binomial)

# Train LDA model
lda.result <- lda(x = X[train, ], grouping = Y[train])


# Create probability vectors for logistic (fitted on train, predicted on train and test)
prob.fit.train <- predict(logistic.fit.train, newdata = X[train, ], type = "response")
prob.fit.test <- predict(logistic.fit.train, newdata = X[test, ], type = "response")

# Create probability vectors for LDA (posterior probability of the positive class)
prob.lda <- predict(lda.result, newdata = X[test, ])$posterior[, 2]
predicted.spam.lda <- as.numeric(prob.lda > 0.5)


# Create the test labels
Yt <- Y[test]

1.1 Accuracy, Sensitivity & Specificity

Before analyzing cumulative gains and kappa, one must first understand the confusion matrix and the important rates and probabilities that can be derived from it. Below is a standard confusion matrix:

Table 1: Confusion Matrix
                     Predicted = Positive    Predicted = Negative
Actual = Positive    TP (True Positive)      FN (False Negative)
Actual = Negative    FP (False Positive)     TN (True Negative)

As an example, this is the confusion matrix for the logistic model fit to the spam data, evaluated on the test set:

Table 2: Spam CM
                     Predicted = Positive    Predicted = Negative
Actual = Positive    173                     15
Actual = Negative    19                      293
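Assuming the usual 0.5 probability cutoff, this matrix can be reproduced as a quick sketch from the objects defined in the introduction:

# Sketch: confusion matrix for the logistic model on the test set,
# assuming a 0.5 probability cutoff (counts should match Table 2)
pred.logistic <- as.numeric(prob.fit.test > 0.5)
table(Actual = Yt, Predicted = pred.logistic)
# (rows and columns are ordered 0 then 1, so the layout is flipped relative to Table 2)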

1.1.1 Accuracy

Accuracy of the model is the rate at which observations (events and non-events alike) are classified correctly, or

\[\mathrm{Accuracy = \frac{TP + TN}{TP + FP + TN + FN}}\] Although a useful overall measure, accuracy alone can be misleading, particularly when the classes are imbalanced, and it says nothing about which kinds of errors are being made. This is where the following rates, sensitivity and specificity, come in.


1.1.2 Sensitivity

Sensitivity of the model is the rate at which the event of interest is predicted correctly among all samples having the event, or

\[\mathrm{Sensitivity = \frac{TP}{TP + FN}}\] This is sometimes referred to as the True Positive Rate.


1.1.3 Specificity

Specificity of the model is the rate at which non-event samples are predicted correctly among all samples not having the event, or

\[\mathrm{Specificity = \frac{TN}{FP + TN}}\] 1 - Specificity is referred to as the False Positive Rate.


1.1.4 Prevalence

Prevalence is a measure of the proportion of the positive class in the population.

\[\mathrm{Prevalence = \frac{TP + FN}{TP + FN + TN + FP}}\]
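As a quick sketch, all four of these rates can be computed directly from the counts in Table 2:

# Sketch: rates derived from the Table 2 confusion matrix
TP <- 173; FN <- 15; FP <- 19; TN <- 293

accuracy    <- (TP + TN) / (TP + FP + TN + FN)   # 0.932
sensitivity <- TP / (TP + FN)                    # 0.920
specificity <- TN / (FP + TN)                    # 0.939
prevalence  <- (TP + FN) / (TP + FN + TN + FP)   # 0.376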


Note that there is a trade-off between sensitivity and specificity: as the classification threshold is adjusted to increase one, the other will typically decrease. These rates are useful when evaluating classifiers because, in practice, the costs associated with making FP and FN errors are very different. Typically, FN errors are more costly. As an example, consider screening for cancer: we would rather falsely identify cancer than fail to identify cancer in a patient.

Note that these measures are conditioned on the true class membership, which isn't always known. Sensitivity is interpreted as the rate at which we predict the positive class correctly, and this measure only applies to observations where the event has actually happened. The same logic applies to specificity; thus these are known as conditional measures.

1.2 PPV and NPV

What if we want a measure that is not conditioned on the true class membership, but rather on the prediction itself? Such measures are no longer intrinsic to the test (they depend on the prevalence), and they can be derived with Bayes' theorem.

Let \(K\) denote the event that a sample point is predicted to be positive, and let \(R\) denote the event that the sample point actually belongs to the positive class (i.e. a positive event occurred):

\[\begin{aligned} \mathrm{P}(R|K) &= \frac{\mathrm{P}(K|R)\mathrm{P}(R)}{\mathrm{P}(K|R)\mathrm{P}(R) + \mathrm{P}(K|R^{c})\mathrm{P}(R^{c})} \\ &= \frac{Sensitivity \times Prevalence}{Sensitivity \times Prevalence + (1-Specificity) \times (1- Prevalence)} \\ &= PPV \text{ (Positive Predictive Value)} \end{aligned}\]

This is interpreted as the probability of a sample point belonging to the positive class, given it was predicted to be in the positive class.

Similarly, we can develop this for the probability of a sample point not belonging to the positive class, given it was not predicted to be in the positive class:

\[\begin{aligned} \mathrm{P}(R^{c}|K^{c}) &= \frac{\mathrm{P}(K^{c}|R^{c})\mathrm{P}(R^{c})}{\mathrm{P}(K^{c}|R^{c})\mathrm{P}(R^{c}) + \mathrm{P}(K^{c}|R)\mathrm{P}(R)} \\ &= \frac{Specificity \times (1 - Prevalence)}{Specificity \times (1 - Prevalence) + (1 - Sensitivity) \times (Prevalence)} \\ &= NPV \text{ (Negative Predictive Value)} \end{aligned}\]
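Continuing the sketch above, PPV and NPV follow directly from sensitivity, specificity, and prevalence:

# Sketch: PPV and NPV via Bayes' theorem (sensitivity, specificity and prevalence
# as computed above from the Table 2 counts)
ppv <- (sensitivity * prevalence) /
       (sensitivity * prevalence + (1 - specificity) * (1 - prevalence))   # about 0.901
npv <- (specificity * (1 - prevalence)) /
       (specificity * (1 - prevalence) + (1 - sensitivity) * prevalence)   # about 0.951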

Note that PPV increases as prevalence increases, while NPV decreases as prevalence increases. This makes sense intuitively: if our population consists of more observations from the positive class, then a positive prediction is more likely to be correct.

Continuing with the spam data set example,

Table 3: Spam CM
                     Predicted = Positive    Predicted = Negative
Actual = Positive    173                     15                      Sensitivity: 0.92
Actual = Negative    19                      293                     Specificity: 0.939
                     PPV: 0.901              NPV: 0.951              Accuracy: 0.932

As can be seen from Table 3, the overall accuracy is 0.932, which is good. By these rates, we have a seemingly good classifier.

1.3 ROC Curves

The logical next step is to ask: “How can we use these measures to evaluate a classification model?” One answer is the Receiver Operating Characteristic (ROC) curve, which plots \(sensitivity\) against \(1 - specificity\) (the False Positive Rate, FPR). Using this curve, we can choose the trade-off between \(sensitivity\) and \(specificity\) that best fits our situation, and derive a single metric (the area under the curve) for comparing classifiers.
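As a rough sketch (not necessarily the code used to produce the figures in this report), an ROC curve can be drawn by sweeping a grid of probability cutoffs and plotting sensitivity against 1 - specificity:

# Sketch: manual ROC curve for the logistic model on the test set
cutoffs <- seq(0, 1, length = 1000)
tpr <- fpr <- numeric(length(cutoffs))
for (i in seq_along(cutoffs)) {
  pred <- as.numeric(prob.fit.test > cutoffs[i])
  tpr[i] <- sum(pred == 1 & Yt == 1) / sum(Yt == 1)   # sensitivity
  fpr[i] <- sum(pred == 1 & Yt == 0) / sum(Yt == 0)   # 1 - specificity
}
plot(fpr, tpr, type = "l", xlab = "1 - Specificity (FPR)",
     ylab = "Sensitivity (TPR)", main = "ROC Curve (Logistic)")
abline(0, 1, lty = 2)   # chance line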

Above is the ROC curve for the spam data set. Ideally, we want a point on this curve close to (0,1), as this would imply a perfect model: 100% specificity and 100% sensitivity. In practice this is rarely attainable, so we need to decide on an optimal operating point for our situation. Does the situation call for higher specificity, or higher sensitivity?

Of course, this is only part of the problem. In the situation where we have multiple classifiers, how would one determine which ROC curve is the best?

Above are two ROC curves for the spam data: one from the logistic regression model and one from the LDA model. Although the logistic model appears better than the LDA, the curves are very close together and overlap in several places.

1.3.1 AUC


To get around this problem, one can calculate the area under the ROC curve (AUC) to arrive at one number that can be used to determine the better model. In our example,

\[\mathrm{AUC(Logistic) = 0.9692444}\] \[\mathrm{AUC(LDA) = 0.949093}\]


So by AUC, we would conclude that the logistic regression model is the better fitting model.
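These values can be checked with a short sketch using the rank-based (Mann-Whitney) form of the AUC; the helper function below is our own, not from any package:

# Sketch: AUC via the rank-based (Mann-Whitney) identity
auc <- function(y, p) {
  n1 <- sum(y == 1)   # number of positives
  n0 <- sum(y == 0)   # number of negatives
  r  <- rank(p)       # ranks of the predicted probabilities (ties averaged)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
auc(Yt, prob.fit.test)   # should come out close to 0.969
auc(Yt, prob.lda)        # should come out close to 0.949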

2.0 Cumulative Gain & Lift Charts

2.1 Introduction

Often in the business world, we come across situations where we must create a priority list from our data in order to optimize some criterion, and then act on only a certain number of these records (usually the top \(n\)). This is referred to as resource-constrained batch action classification (Shmueli, 2019). A popular example comes from marketing, where we must choose to advertise to a select group of individuals so as to maximize revenue subject to budget constraints. Note that here we have a list of individuals (the test data), we want to order it by “priority” (those most likely to respond positively), and we act only on those with the highest priority (e.g. the top 70%).

2.2 What is a Cumulative Gain and Lift Chart

Note that in the world of machine learning, the terms “lift chart” and “cumulative gains chart” are often confused or used interchangeably.

Cumulative gains charts were made specifically to evaluate classifiers in this resource-constrained situation. These charts show the proportion of the total positives captured as a function of the proportion of test-set records targeted. Thus we can read off what proportion of the test set is needed to capture a given percentage of the total positives. Below is an example of a cumulative gains chart:

As can be seen from this chart, using 50% of the test data gives around 80% of the total positives!

Lift charts, on the other hand, indicate how much better our classifier performs relative to random targeting. Below is an example of a lift chart:
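As a rough sketch, assuming lift at targeting depth \(n/N\) is defined as the ratio of the model's cumulative gains to the gains expected under random targeting, such a chart could be produced for the spam test set as follows:

# Sketch: lift curve for the logistic model on the spam test set
# (lift = proportion of positives captured / proportion of records targeted)
ord <- order(prob.fit.test, decreasing = TRUE)   # rank test records by predicted probability
y_ranked <- Yt[ord]
N <- length(y_ranked)
depth <- (1:N) / N                           # proportion of records targeted
gains <- cumsum(y_ranked) / sum(y_ranked)    # proportion of positives captured
plot(depth, gains / depth, type = "l",
     xlab = "Proportion of Targeted Records", ylab = "Lift",
     main = "Lift Chart (Logistic)")
abline(h = 1, lty = 2)   # random-targeting baseline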

2.3 How to Make a Cumulative Gains Chart

Before we dive into building cumulative gains charts, we will define some notation that follows (Shmueli, 2019).


2.3.1 Basic Notation

Consider a test set with \(N\) records. Then:
\(y_{i}\): True class of record \(i\), where \(y_{i} \in \{0,1\}\)
\(\hat{p}_{i} = \hat{\mathrm{P}}(Y_{i} = 1 \mid \mathbf{X}_{i} = \mathbf{x}_{i})\): Predicted probability of record \(i\) belonging to the positive class

In order to begin examining the construction of the cumulative gains plot, we must:

  • Sort the probabilities into descending order
  • Rank the probabilities such that the highest probability has rank 1

We therefore also introduce \(n\), the number of top-ranked records being targeted.


2.3.2 Confusion Matrix

For convenience, and to best understand the computations and definitions to follow, the standard confusion matrix is stated again:

Table 4: Confusion Matrix
                     Predicted = Positive    Predicted = Negative
Actual = Positive    TP                      FN
Actual = Negative    FP                      TN

where \(TP\), \(FP\), \(FN\) and \(TN\) denote the number of observations in the respective cell. We will also use \(TP_{n}\), \(FP_{n}\), \(FN_{n}\) and \(TN_{n}\) to denote the corresponding counts among the top \(n\) ranked records.

Note that since we have ordered our data by descending probabilities and target the top \(n\) records, every record in this group is treated as a predicted positive (1). Therefore, \(FN_{n}\) and \(TN_{n}\) are both 0.


2.3.3 Cumulative Gains

Cumulative Gains for a given number of records \(n\) is the total number of true positive records among the top-\(n\) ranked records:

\[CumGains(n) = \sum_{j = 1}^{n} y_{(j)} = TP_{n}\] where \(y_{(j)}\) denotes the true class of the record ranked \(j\)-th by predicted probability. We can also express cumulative gains as a proportion of the total possible gains (\(TP + FN\)):

\[\text{p-CumGains}(n/N) = \frac{CumGains(n)}{TP + FN} = \frac{\sum_{j = 1}^{n} y_{(j)}}{\sum_{i = 1}^{N} y_{i}} = \frac{TP_{n}}{TP + FN}\]
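As a tiny sketch of these two quantities on the spam test set (taking \(n = 100\) purely as an arbitrary example):

# Sketch: cumulative gains for the top-n ranked test records (n = 100 chosen arbitrarily)
ord <- order(prob.fit.test, decreasing = TRUE)   # rank by predicted probability, highest first
y_ranked <- Yt[ord]
n <- 100
CumGains_n <- sum(y_ranked[1:n])      # TP_n: positives among the top-n records
p_CumGains <- CumGains_n / sum(Yt)    # as a proportion of total positives (TP + FN)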


2.3.4 Random Targeting

In order to create an effective lift chart, we need to understand the idea behind random targeting, which appears as the diagonal reference line on the cumulative gains chart. If we use only 20% of our data, how many data points from the positive class would we expect to capture? If the records are chosen completely at random, we would expect to capture 20% of the positives.

Intuitively, the probability of randomly selecting an individual from the positive class is:

\[K_{p} = \frac{\sum_{i = 1}^{N} y_{i}}{N} = \frac{TP + FN}{N}\]

If we target \(n\) individuals chosen at random, we would expect to capture \(n \times K_{p}\) positives.
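A small sketch of this expectation on the spam test set (again taking \(n = 100\) as an arbitrary example):

# Sketch: expected number of positives captured by random targeting of n records
K_p <- mean(Yt)   # (TP + FN) / N, the prevalence of the positive class in the test set
n <- 100          # hypothetical number of targeted records
n * K_p           # expected positives captured at random, versus CumGains(n) for the model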

2.3.5 Cumulative Gains Chart

A cumulative gains chart plots \(\text{p-CumGains}(n/N)\) as a function of the proportion \(n/N\) of targeted records.

cumu_chart <- function(Y_test, test_probs, k)
{
  N <- length(Y_test)                   # number of test records
  total.positives <- sum(Y_test == 1)   # TP + FN
  
  # Probability cutoff vector
  cutoff.vector <- seq(0, 1, length = 1000)
  
  # Proportion of cumulative gains and proportion of records targeted at each cutoff
  p_CumGain <- numeric(length(cutoff.vector))
  p_data <- numeric(length(cutoff.vector))
  
  for (i in 1:length(cutoff.vector))
  {
    targeted <- test_probs > cutoff.vector[i]   # records ranked above the cutoff
    p_CumGain[i] <- sum(targeted & Y_test == 1) / total.positives
    p_data[i] <- sum(targeted) / N
  }
  
  plot(p_data, p_CumGain, type = "l",
       xlim = c(0, 1),
       ylim = c(0, 1),
       xlab = "Proportion of Targeted Records",
       ylab = "Proportion of Cumulative Gains",
       col = "blue",
       main = paste0("Cumulative Gains Chart ", k))
  abline(0, 1, lty = 2, col = "red")   # random-targeting baseline
}
cumu_chart(Yt, prob.fit.test, "Logistic")

cumu_chart(Yt, prob.lda, "LDA")

Above are the cumulative gains charts for the spam data set, one for the logistic model and one for the LDA model. For the logistic model, targeting only 20% of the test records would capture about 60% of the positive class!

2.4 Difference Between Cumulative Gain Charts and The ROC Curve

ROC curves and cumulative gains charts are similar, but they are distinct in the problems that they address.

(Shmueli, 2019) clearly outlines two uses for cumulative gains charts:

  1. When the resource constraint is fixed in advance, then we can evaluate classifier performance when used for targeting the top \(n\)-records.
  2. When the resource constraint is not fixed, we can determine an optimal \(n/N\).

The ROC curve, on the other hand, was created to visualize the trade-off between sensitivity and specificity and to find an optimal operating point.

Selecting a model based on ROC will not necessarily mean that the same model would be selected based on cumulative gains. Thus one must carefully consider the problem they are trying to solve before beginning to evaluate classifiers.

3.0 Kappa Statistic

3.1 Introduction

The Kappa statistic is used to measure the magnitude of agreement between two “observers” or “raters”. Another way to think about it is as a measure of how consistently the two observers rate the same items, beyond what would be expected by chance. The formula for the Kappa statistic is as follows:

\[\kappa = \frac{O - E}{1 - E}\]

Where:

  • O: Observed Agreement
  • E: Expected Agreement

Thus kappa is the difference between the agreement we observed and the agreement we would expect by chance, scaled by the maximum possible agreement beyond chance so that it falls between -1 and 1 (the same logic as max-min standardization).

In the case of classification, O and E can be calculated using the confusion matrix:

                     Predicted = Positive    Predicted = Negative
Actual = Positive    TP                      FN
Actual = Negative    FP                      TN

as follows:

\[\begin{aligned} O &= \frac{TP + TN}{n} \\\\ E &= [\mathrm{P(Observer \ 1 \ is \ positive)} \times \mathrm{P(Observer \ 2 \ is \ positive)}] + [\mathrm{P(Observer \ 1 \ is \ negative)} \times \mathrm{P(Observer \ 2 \ is \ negative)}] \\ &= (\frac{TP + FP}{n} \times \frac{TP + FN}{n}) + (\frac{FN + TN}{n} \times \frac{FP + TN}{n}) \end{aligned}\]

with observer 1 being the predicted class, and observer 2 being the actual class.

3.2 An Example

Note that we have the following confusion matrix for the spam data set:

Table 5: Spam CM
                     Predicted = Positive    Predicted = Negative
Actual = Positive    173                     15
Actual = Negative    19                      293

\[\begin{aligned} O &= \frac{173 + 293}{500} = 0.932 \\\\ E &= (\frac{173 + 19}{500} \times \frac{173 + 15}{500}) + (\frac{15 + 293}{500} \times \frac{19 + 293}{500}) = 0.528768 \\\\ \kappa &= \frac{0.932 - 0.528768}{1 - 0.528768} = 0.8557 \end{aligned}\]
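As a quick sketch, this hand calculation can be checked in R:

# Sketch: verifying the hand calculation of kappa (counts from Table 5)
TP <- 173; FN <- 15; FP <- 19; TN <- 293
n <- TP + FN + FP + TN                                # 500

O <- (TP + TN) / n                                    # observed agreement, 0.932
E <- ((TP + FP) / n) * ((TP + FN) / n) +
     ((FN + TN) / n) * ((FP + TN) / n)                # expected agreement, 0.528768
kappa <- (O - E) / (1 - E)                            # about 0.856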

3.3 Interpretation

Kappa is standardized and can take any value between -1 and 1, where 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate agreement worse than chance. Although there is no universal scale for judging what level of kappa is “good”, (Viera & Garrett, 2005) cites the following scale:

Table 7: Kappa Scale
             Poor     Slight     Fair        Moderate     Substantial     Almost perfect
Kappa        < 0      0 - 0.2    0.2 - 0.4   0.4 - 0.6    0.6 - 0.8       0.8 - 1

As we can see, the kappa statistic calculated for the spam data set (about 0.856) falls in the “almost perfect” range, just above the “substantial” cutoff.

3.4 Drawbacks

As with any method for evaluating classification and regression models, the Kappa statistic suffers from several drawbacks.


3.4.1 Prevalence

If the prevalence of one class is very high, then chance agreement is also high. (Feinstein & Cicchetti, 1990) explores this paradox rigorously, showing that small changes in the prevalence between classes (with the same overall agreement rate) can drastically change kappa. For rare findings, a low value of kappa does not necessarily imply a low agreement rate. One way to check whether kappa is susceptible to this problem is the Prevalence Index (\(PI\)), defined as:

\[PI = \frac{|TP - TN|}{n}\]

If the \(PI\) is high, then the prevalence of a positive rating is either very high, or very low. In the case of the spam data set:

\[PI = \frac{|173 - 293|}{500} = 0.24\]

Thus we can see that we have a slight prevalence issue, but maybe not enough to skew kappa dramatically.


3.4.2 Bias

When there is a large bias, kappa is higher than when bias is low or absent. Here, bias refers to a difference between the two observers' marginal rates of positive classification, which in the confusion matrix shows up as an imbalance between false positives and false negatives. Similar to prevalence, we can check whether kappa is susceptible to this problem by calculating a Bias Index (\(BI\)):

\[BI = \frac{FP - FN}{n}\]

This index quantifies the imbalance between the two types of disagreement (false positives versus false negatives). If it is relatively large, there may be a bias issue with the model. In the case of the spam data set:

\[BI = \frac{19 - 15}{500} = 0.008\]

Thus there is not a significant bias with our model as measured by this index.
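As a small sketch, both indices can be computed from the Table 5 counts:

# Sketch: prevalence and bias indices from the Table 5 counts
TP <- 173; FN <- 15; FP <- 19; TN <- 293
n  <- TP + FN + FP + TN

PI <- abs(TP - TN) / n   # prevalence index, 0.24
BI <- (FP - FN) / n      # bias index, 0.008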

3.5 Solutions

Solutions are beyond the scope of this report, but are explored in Cicchetti DV, Feinstein AR. High agreement but low kappa: II. Resolving the paradoxes. J Clin Epidemiol 1990;43:551-8.

References

  1. Galit Shmueli. 2019. Lift Up and Act! Classifier Performance in Resource-Constrained Applications. preprint. National Tsing Hua University: Institute of Service Science.

  2. Rajul Parikh, Annie Mathai, Shefali Parikh, G Sekhar, Ravi Thomas. 2008. Understanding and using sensitivity, specificity and predictive values. 1st ed. Indian Journal of Ophthalmology.

  3. Anthony Viera and Joanne Garrett. 2005. Understanding Interobserver Agreement: The Kappa Statistic. 1st ed. Family Medicine Research Series.

  4. Miha Vuk and Tomaz Curk. 2006. ROC Curve, Lift Chart and Calibration Plot. 1st ed. Metodoloski zvezki, vol. 3.

  5. Alvan R. Feinstein and Domenic V. Cicchetti. 1990. High agreement but low kappa: I. The problems of two paradoxes. Journal of Clinical Epidemiology, vol. 43.