ID5059 Lecture 16 - ROC, AUC, Gain & Lift

C. Donovan
13 April 2018

Administrivia

  • Resolved the compiling problem blighting Weds
    • Briefly cover that - detailed calcs will be on moodle
  • Industrial action: appears to be looming (I have a plan B, TBA, if this eventuates)
  • Some alterations under current 'Plan A':
    • Today we'll do ROC/AUC/Lift & Gain
    • Then Clustering
    • Then CNNs

NB: If it's not in the lecture or lab, it's not in the exam

Today

Evaluating and comparing classifiers

  • Turning predicted probabilities into classes
  • Receiver Operating Characteristic (ROC) curves
    • Relatedly, Area Under the Curve (AUC)
    • Calculating these by hand
  • Lift charts

Comparing classifiers

  • We wish to compare two (binary) classification models, e.g. logistic regression, tree, NN, naive Bayes, …
  • We need two things
    1. Test/validation data – actual \( \mathbf{y} \in \left\{C_0, C_1\right\} \) for unseen \( \mathbf{X} \)
    2. The predicted probabilities obtained by applying \( M_1 \) and \( M_2 \) to the test set

\[ P(C_1 | \mathbf{X_1}, \mathbf{X_2}, \ldots , \mathbf{X_p}, M_1) \mathrm{~and~} P(C_1 | \mathbf{X_1}, \mathbf{X_2}, \ldots , \mathbf{X_p}, M_2) \]

Problem: predicted probs \( \hat{p} \) don't automatically give a class prediction.

Motivating example

Consider a logistic regression:

  • how different \( \hat{p} \) thresholds give different class predictions
  • the resulting confusion matrices (see the sketch below)
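
A minimal sketch of this step in Python, using made-up probabilities, classes and a 0.5 cutoff (none of these numbers come from the lecture example):

```python
import numpy as np

# Hypothetical predicted probabilities from a fitted logistic regression,
# with the corresponding true classes (1 = C_1, 0 = C_0).
p_hat  = np.array([0.81, 0.62, 0.55, 0.40, 0.33, 0.10])
y_true = np.array([1,    1,    0,    1,    0,    0])

threshold = 0.5                            # the cutoff we choose
y_pred = (p_hat > threshold).astype(int)   # probabilities -> class labels

# Entries of the 2x2 confusion matrix (actual class vs predicted class)
tp = np.sum((y_true == 1) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))
fp = np.sum((y_true == 0) & (y_pred == 1))
tn = np.sum((y_true == 0) & (y_pred == 0))
print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
```

Changing threshold changes y_pred, and hence the confusion matrix - which is exactly why a single cutoff is not enough to compare models.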

An example - comparing classifiers

Since this is a two-class problem

\[ P(C_1) = 1 - P(C_0) \]

and

\[ P(C_1 | \mathbf{X_1}, \mathbf{X_2}, \ldots , \mathbf{X_p}, M_k) = 1 - P(C_0 | \mathbf{X_1}, \mathbf{X_2}, \ldots , \mathbf{X_p}, M_k) \]

We need to define which class counts as a positive prediction. Let's say \( C_1 \) is the positive class.

An example - comparing classifiers

Instance  True class  \( P(C_1|x_1,x_2,\ldots,x_p,M_1) \)  \( P(C_1|x_1,x_2,\ldots,x_p,M_2) \)
1 \( C_1 \) 0.73 0.61
2 \( C_1 \) 0.69 0.03
3 \( C_0 \) 0.44 0.68
4 \( C_0 \) 0.55 0.31
5 \( C_1 \) 0.67 0.45
6 \( C_1 \) 0.47 0.09
7 \( C_0 \) 0.08 0.38
8 \( C_0 \) 0.15 0.05
9 \( C_1 \) 0.45 0.01
10 \( C_0 \) 0.35 0.04

Test data and predicted probabilities for models \( M_1 \) and \( M_2 \)
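
For the hand calculations that follow, it may help to have this table in code. A small sketch, coding \( C_1 \) as 1 and \( C_0 \) as 0 (the later sketches repeat these definitions so that each stands alone):

```python
import numpy as np

# Test data from the table above: true class and P(C_1 | x, M_k) for instances 1..10.
y_true = np.array([1, 1, 0, 0, 1, 1, 0, 0, 1, 0])   # 1 = C_1, 0 = C_0
p_m1   = np.array([0.73, 0.69, 0.44, 0.55, 0.67, 0.47, 0.08, 0.15, 0.45, 0.35])
p_m2   = np.array([0.61, 0.03, 0.68, 0.31, 0.45, 0.09, 0.38, 0.05, 0.01, 0.04])
```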

Receiver Operating Characteristic (ROC) curves

These are plots:

  • In a unit square
  • The \( x \)-axis is 1 - specificity, or the False Positive Rate (FPR)
  • The \( y \)-axis is sensitivity, or the True Positive Rate (TPR)

Models will give rise to a line that:

  • passes through (0, 0) and (1, 1)
  • is generally above the 1-to-1 reference line

The Area Under the Curve (AUC) is the integral under a model's ROC curve; it is at most 1 and usually greater than 0.5. Bigger is better.

ROC curves

An example - comparing classifiers: ROC

  • Plot the ROC curves for both \( M_1 \) and \( M_2 \)
  • Using cutoff thresholds \( p = 0, 0.25, 0.5, 0.75 \) and \( 1 \) (say)
  • So that any test instance whose predicted probability is greater than \( p \) is classified as a positive example
    • For \( M_1 \) and \( M_2 \) we generate a sequence of contingency tables, and hence TPR and FPR values (confusion matrices = contingency tables)
  • TPR is sensitivity & FPR is 1 - specificity

\[ TPR = \frac{TP}{TP + FN} \qquad FPR = \frac{FP}{FP + TN} \]
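
A sketch of these calculations in Python (the helper tpr_fpr is purely illustrative): an instance is classified as positive when its predicted probability exceeds the threshold, which reproduces the contingency tables on the following slides.

```python
import numpy as np

# Data from the earlier table (repeated so this block stands alone).
y_true = np.array([1, 1, 0, 0, 1, 1, 0, 0, 1, 0])   # 1 = C_1, 0 = C_0
p_m1   = np.array([0.73, 0.69, 0.44, 0.55, 0.67, 0.47, 0.08, 0.15, 0.45, 0.35])
p_m2   = np.array([0.61, 0.03, 0.68, 0.31, 0.45, 0.09, 0.38, 0.05, 0.01, 0.04])

def tpr_fpr(y_true, p_hat, threshold):
    """Classify as positive when p_hat > threshold; return (TPR, FPR)."""
    y_pred = p_hat > threshold
    tp = np.sum((y_true == 1) & y_pred)
    fn = np.sum((y_true == 1) & ~y_pred)
    fp = np.sum((y_true == 0) & y_pred)
    tn = np.sum((y_true == 0) & ~y_pred)
    return tp / (tp + fn), fp / (fp + tn)

for name, p_hat in [("M1", p_m1), ("M2", p_m2)]:
    for t in [0, 0.25, 0.5, 0.75, 1]:
        tpr, fpr = tpr_fpr(y_true, p_hat, t)
        print(f"{name}  p = {t:<4}:  TPR = {tpr:.1f}, FPR = {fpr:.1f}")
```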

Contingency Tables for \(M_1\)

Threshold \( p \) = ?

\( \hat{C}_1 \) \( \hat{C}_0 \)
\( C_1 \) TP FN
\( C_0 \) FP TN
TPR ?
FPR ?

Threshold \( p \) = 0

\( \hat{C}_1 \) \( \hat{C}_0 \)
\( C_1 \) 5 0
\( C_0 \) 5 0
TPR 1
FPR 1

Contingency Tables for \(M_1\)

Threshold \( p \) = 0.25

\( \hat{C}_1 \) \( \hat{C}_0 \)
\( C_1 \) 5 0
\( C_0 \) 3 2
TPR 1
FPR 0.6

Threshold \( p \) = 0.5

\( \hat{C}_1 \) \( \hat{C}_0 \)
\( C_1 \) 3 2
\( C_0 \) 1 4
TPR 0.6
FPR 0.2

Contingency Tables for \(M_1\)

Threshold \( p \) = 0.75

\( \hat{C}_1 \) \( \hat{C}_0 \)
\( C_1 \) 0 5
\( C_0 \) 0 5
TPR 0
FPR 0

Threshold \( p \) = 1

\( \hat{C}_1 \) \( \hat{C}_0 \)
\( C_1 \) 0 5
\( C_0 \) 0 5
TPR 0
FPR 0

ROC curve for \(M_1\)

  • x-axis is FPR
  • y-axis is TPR
  • Enter the pairs of values for each cutoff \( p \) value
  • Join the points with lines
  • Calculate (estimate) the area under the lines (the AUC)

This demonstrates the method. We would have a better estimate of the AUC for \( M_1 \) if we had calculated TPR and FPR at more \( p \)-values.
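
For example, a trapezoidal estimate from the (FPR, TPR) pairs derived above for \( M_1 \) can be sketched as follows; with only these five thresholds it comes out at roughly 0.78:

```python
import numpy as np

# (FPR, TPR) pairs for M1 from the contingency tables, ordered by increasing FPR
# (thresholds 1 and 0.75 both give the point (0, 0)).
fpr = np.array([0.0, 0.2, 0.6, 1.0])
tpr = np.array([0.0, 0.6, 1.0, 1.0])

auc_m1 = np.trapz(tpr, fpr)   # area under the piecewise-linear ROC curve
print(auc_m1)                 # approximately 0.78
```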

Contingency Tables for \(M_2\)

Threshold \( p \) = ?

\( \hat{C}_1 \) \( \hat{C}_0 \)
\( C_1 \) TP FN
\( C_0 \) FP TN
TPR ?
FPR ?

Threshold \( p \) = 0

\( \hat{C}_1 \) \( \hat{C}_0 \)
\( C_1 \) 5 0
\( C_0 \) 5 0
TPR 1
FPR 1

Contingency Tables for \(M_2\)

Threshold \( p \) = 0.25

\( \hat{C}_1 \) \( \hat{C}_0 \)
\( C_1 \) 2 3
\( C_0 \) 3 2
TPR 0.4
FPR 0.6

Threshold \( p \) = 0.5

\( \hat{C}_1 \) \( \hat{C}_0 \)
\( C_1 \) 1 4
\( C_0 \) 1 4
TPR 0.2
FPR 0.2

Contingency Tables for \(M_2\)

Threshold \( p \) = 0.75

\( \hat{C}_1 \) \( \hat{C}_0 \)
\( C_1 \) 0 5
\( C_0 \) 0 5
TPR 0
FPR 0

Threshold \( p \) = 1

\( \hat{C}_1 \) \( \hat{C}_0 \)
\( C_1 \) 0 5
\( C_0 \) 0 5
TPR 0
FPR 0

ROC curve for \(M_2\)

  • x-axis is FPR
  • y-axis is TPR
  • Enter the pairs of values for each cutoff \( p \) value
  • Join the points with lines
  • Calculate (estimate) the AUC
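
The same trapezoidal sketch applied to \( M_2 \)'s (FPR, TPR) pairs gives a noticeably smaller area - roughly 0.42 with these coarse thresholds:

```python
import numpy as np

# (FPR, TPR) pairs for M2 from its contingency tables, ordered by increasing FPR.
fpr = np.array([0.0, 0.2, 0.6, 1.0])
tpr = np.array([0.0, 0.2, 0.4, 1.0])

print(np.trapz(tpr, fpr))   # approximately 0.42
```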

Which is the better Model?

From the ROC

  • The one with the greater AUC, noting that an AUC below 0.5 is worse than a random guess
  • However, we get more useful information from the ROC curves themselves, e.g. does one curve generally dominate the other?

We can also determine:

  • Cutoff values that give the “best” performance
  • Values that give the best trade-off between TPR and FPR for our particular circumstances (see the sketch below)
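
As a sketch of how this might be done in practice, scikit-learn's roc_curve and roc_auc_score evaluate every distinct threshold rather than our coarse five-value grid; one common rule for a "best" cutoff, used here purely for illustration, is to maximise the difference TPR - FPR (Youden's J):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Data from the earlier table (repeated so this block stands alone).
y_true = np.array([1, 1, 0, 0, 1, 1, 0, 0, 1, 0])   # 1 = C_1, 0 = C_0
p_m1   = np.array([0.73, 0.69, 0.44, 0.55, 0.67, 0.47, 0.08, 0.15, 0.45, 0.35])
p_m2   = np.array([0.61, 0.03, 0.68, 0.31, 0.45, 0.09, 0.38, 0.05, 0.01, 0.04])

# AUC from every distinct threshold - finer than the coarse hand calculation.
print("AUC M1:", roc_auc_score(y_true, p_m1))
print("AUC M2:", roc_auc_score(y_true, p_m2))

# Cutoff for M1 that maximises TPR - FPR (Youden's J).
fpr, tpr, thresholds = roc_curve(y_true, p_m1)
best = np.argmax(tpr - fpr)
print("Best M1 cutoff:", thresholds[best], " TPR:", tpr[best], " FPR:", fpr[best])
```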

Summary

  • I claim that my classifier is better than yours
  • We have a simple method of seeing if I'm right
  • Useful validation technique for any binary classifier
  • Can be extended to non-binary classifiers
  • ROC and AUC used extensively in data mining

Lift charts

Another type of summary plot for comparing classifiers

Lift charts

  • Suppose we are going to approach a pool of customers (phone, mail-out) in the hope that they will respond in some way e.g. buy a new mortgage product
  • Produce models based on historical data that will identify customers likely to give the desired response. Typically a classification model quantifies this as a predicted probability of being in the response class.
  • If we had a limited budget, or are interested in optimising our spending, we should focus first on those most likely to respond, i.e. the more profitable customers.
  • Lift and gain charts address this, and there are a large number of variants

Lift charts

  • There are many variants of 'score ranking' charts available.
  • In every case the \( x \)-axis will be Percentile, with the \( y \)-axis being something like the following:

    • Lift, Cumulative lift
    • Gain
    • % response, cumulative % response
    • % captured response, cumulative % captured response
    • Profit, Cumulative profit

Determining the \(x\)-axis

The common \( x \)-axis can be determined relatively easily.

  • Assume a simple 2-class classification problem that we have constructed a model for.
  • The model at some level will produce a predicted probability for each observation, e.g. \( \hat{p}_i=\text{Predicted } P(y_i=1|\mathbf{x}_i) \)
  • Take the input data \( \mathbf{X} \) and produce \( \hat{p} \) for every observation (for the training, validation or test datasets, as all we require is an input vector \( \mathbf{x} \)).
  • Put these in descending order - these are the values that give us our \( x \)-axis (see the sketch below).
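
A small sketch of this ordering step, with made-up probabilities and responses and the ordered database split into decile bins (all numbers here are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up example: 1000 'customers' with predicted probabilities and observed responses.
p_hat = rng.uniform(0, 1, size=1000)
y     = rng.binomial(1, p_hat)        # responses, loosely driven by p_hat

# Sort the database by descending predicted probability ...
order    = np.argsort(-p_hat)
y_sorted = y[order]

# ... and split it into 10 equal-sized percentile bins (deciles).
deciles = np.array_split(y_sorted, 10)
for i, d in enumerate(deciles, start=1):
    print(f"decile {i}: {100 * d.mean():.0f}% response in bin")
```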

General chart structure

Imagine a customer database ordered by \( \hat{p} \): from left to right we are 'penetrating' deeper into the database, from high \( \hat{p} \) observations to low \( \hat{p} \) observations.

An implementation

  • Assume the model is intended to predict a particular class which we'll refer to as 'positives' or 'responses'
  • In most cases it is informative to consider a baseline case where there is no model:
    • In this case our observations are arranged randomly with respect to the \( x \)-axis
    • All observations have the same predicted probability of being a positive - the proportion of positives in the training dataset.

Lift

Lift

  • The ratio of % captured responses to the baseline % response, in each bin.
  • The example model does about three times as well as guessing over the first few bins of the database.
  • After about 35% of the database we do worse than guessing (we have already captured most of the positives - beyond this point the flagged cases are mostly false positives).

Lift, Cumulative Lift

Cumulative Lift

  • The cumulative ratio of % captured responses to the baseline % response, across the percentile bins (see the sketch after this list).
  • The example model does about twice as well as guessing over the first 50% of the database.
  • If we sampled/targeted the entire database, we would do no better than guessing - we would identify the same number of positives (all of them).
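
A sketch of one common way to compute these quantities, on made-up data: here lift per decile is the bin response rate divided by the baseline response rate, which for equal-sized bins is equivalent to the % captured response relative to random targeting (the lecture's plots may use a slightly different variant):

```python
import numpy as np

rng = np.random.default_rng(1)
p_hat = rng.uniform(0, 1, size=1000)          # made-up predicted probabilities
y     = rng.binomial(1, p_hat)                # made-up responses

y_sorted = y[np.argsort(-p_hat)]              # database ordered by descending p_hat
deciles  = np.array_split(y_sorted, 10)

baseline = y.mean()                           # overall response rate: the 'no model' case
bin_rate = np.array([d.mean() for d in deciles])
lift     = bin_rate / baseline                # lift per decile

# Cumulative lift: responses captured so far vs what random targeting would capture.
cum_resp = np.cumsum([d.sum() for d in deciles])
cum_size = np.cumsum([len(d) for d in deciles])
cum_lift = cum_resp / (baseline * cum_size)

for i in range(10):
    print(f"decile {i + 1}: lift = {lift[i]:.2f}, cumulative lift = {cum_lift[i]:.2f}")
```

The last cumulative-lift value is always 1: targeting the whole database captures every positive, exactly as random targeting would.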

Why use these plots?

The utility of these charts is hopefully clear:

  • if we had a limited budget, we can see what level of response this would buy by targeting the (modelled) most likely responders
  • we can see how much value our model has brought to the problem (compared to a random sample of customers) - in direct monetary terms if costs are included
  • perhaps we can do a smaller campaign, as the returns diminish beyond some percentage of customers targeted
  • we can see where a level of customer targeting becomes unprofitable if the costs are known.