ID5059 Lecture 16 - ROC, AUC, Gain & Lift

C. Donovan
13 April 2018

Administrivia

  • Resolved the compiling problem blighting Weds
    • Briefly cover that - detailed calcs will be on moodle
  • Industrial action: appears to be looming (I have a plan B, TBA, if this eventuates)
  • Some alterations under current 'Plan A':
    • Today we'll do ROC/AUC/Lift & Gain
    • Then Clustering
    • Then CNNs

NB: If it's not in the lecture or lab, it's not in the exam

Today

Evaluating and comparing classifiers

  • Turning predicted probabilities into classes
  • Receiver Operating Characteristic (ROC) curves
    • Relatedly, Area Under the Curve (AUC)
    • Calculating these by hand
  • Lift charts

Comparing classifiers

  • We wish to compare two (binary) classification models, e.g. logistic regression, tree, NN, naive Bayes, …
  • We need two things
    1. Test/validation data – actual \( \mathbf{y} \in \left\{C_0, C_1\right\} \) for unseen \( \mathbf{X} \)
    2. The predicted probabilities obtained by applying \( M_1 \) and \( M_2 \) to the test set

\[ P(C_1 | \mathbf{X_1}, \mathbf{X_2}, \ldots , \mathbf{X_p}, M_1) \mathrm{~and~} P(C_1 | \mathbf{X_1}, \mathbf{X_2}, \ldots , \mathbf{X_p}, M_2) \]

Problem: predicted probs \( \hat{p} \) don't automatically give a class prediction.

Motivating example

Consider a logistic regression:

  • how different \( \hat{p} \) thresholds give different class predictions
  • the resulting confusion matrices (see the sketch below)
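
A minimal sketch of this step in Python, using made-up probabilities, classes and a 0.5 cutoff (none of these numbers come from the lecture example):

```python
import numpy as np

# Hypothetical predicted probabilities from a fitted logistic regression,
# with the corresponding true classes (1 = C_1, 0 = C_0).
p_hat  = np.array([0.81, 0.62, 0.55, 0.40, 0.33, 0.10])
y_true = np.array([1,    1,    0,    1,    0,    0])

threshold = 0.5                            # the cutoff we choose
y_pred = (p_hat > threshold).astype(int)   # probabilities -> class labels

# Entries of the 2x2 confusion matrix (actual class vs predicted class)
tp = np.sum((y_true == 1) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))
fp = np.sum((y_true == 0) & (y_pred == 1))
tn = np.sum((y_true == 0) & (y_pred == 0))
print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
```

Changing threshold changes y_pred, and hence the confusion matrix - which is exactly why a single cutoff is not enough to compare models.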

An example - comparing classifiers

Since this is a two-class problem

\[ P(C_1) = 1 - P(C_0) \]

and

\[ P(C_1 | \mathbf{X_1}, \mathbf{X_2}, \ldots , \mathbf{X_p}, M_k) = 1 - P(C_0 | \mathbf{X_1}, \mathbf{X_2}, \ldots , \mathbf{X_p}, M_k) \]

We need to define which class counts as a positive prediction. Let's say \( C_1 \) is the positive class.

An example - comparing classifiers

Instance  True class  \( P(C_1|x_1,x_2,\ldots,x_p,M_1) \)  \( P(C_1|x_1,x_2,\ldots,x_p,M_2) \)
1 \( C_1 \) 0.73 0.61
2 \( C_1 \) 0.69 0.03
3 \( C_0 \) 0.44 0.68
4 \( C_0 \) 0.55 0.31
5 \( C_1 \) 0.67 0.45
6 \( C_1 \) 0.47 0.09
7 \( C_0 \) 0.08 0.38
8 \( C_0 \) 0.15 0.05
9 \( C_1 \) 0.45 0.01
10 \( C_0 \) 0.35 0.04

Test data and predicted probabilities for models \( M_1 \) and \( M_2 \)
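
For the hand calculations that follow, it may help to have this table in code. A small sketch, coding \( C_1 \) as 1 and \( C_0 \) as 0 (the later sketches repeat these definitions so that each stands alone):

```python
import numpy as np

# Test data from the table above: true class and P(C_1 | x, M_k) for instances 1..10.
y_true = np.array([1, 1, 0, 0, 1, 1, 0, 0, 1, 0])   # 1 = C_1, 0 = C_0
p_m1   = np.array([0.73, 0.69, 0.44, 0.55, 0.67, 0.47, 0.08, 0.15, 0.45, 0.35])
p_m2   = np.array([0.61, 0.03, 0.68, 0.31, 0.45, 0.09, 0.38, 0.05, 0.01, 0.04])
```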

Receiver Operating Characteristic (ROC) curves

These are plots:

  • In a unit square
  • The \( x \)-axis is 1 - specificity, or the False Positive Rate (FPR)
  • The \( y \)-axis is sensitivity, or the True Positive Rate (TPR)

Models will give rise to a line that:

  • passes through (0, 0) and (1, 1)
  • is generally above the 1-to-1 reference line

The Area Under the Curve (AUC) is the integral under a model's ROC curve; it is at most 1 and usually greater than 0.5. Bigger is better.

ROC curves

An example - comparing classifiers: ROC

  • Plot the ROC curves for both \( M_1 \) and \( M_2 \)
  • Using cutoff thresholds \( p = 0, 0.25, 0.5, 0.75 \) and \( 1 \) (say)
  • So that any test instance whose predicted probability is greater than \( p \) is classified as a positive example
    • For \( M_1 \) and \( M_2 \) we generate a sequence of contingency tables, and hence TPR and FPR values (confusion matrices = contingency tables)
  • TPR is sensitivity & FPR is 1 - specificity

\[ TPR = \frac{TP}{TP + FN} \qquad FPR = \frac{FP}{FP + TN} \]
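
A sketch of these calculations in Python (the helper tpr_fpr is purely illustrative): an instance is classified as positive when its predicted probability exceeds the threshold, which reproduces the contingency tables on the following slides.

```python
import numpy as np

# Data from the earlier table (repeated so this block stands alone).
y_true = np.array([1, 1, 0, 0, 1, 1, 0, 0, 1, 0])   # 1 = C_1, 0 = C_0
p_m1   = np.array([0.73, 0.69, 0.44, 0.55, 0.67, 0.47, 0.08, 0.15, 0.45, 0.35])
p_m2   = np.array([0.61, 0.03, 0.68, 0.31, 0.45, 0.09, 0.38, 0.05, 0.01, 0.04])

def tpr_fpr(y_true, p_hat, threshold):
    """Classify as positive when p_hat > threshold; return (TPR, FPR)."""
    y_pred = p_hat > threshold
    tp = np.sum((y_true == 1) & y_pred)
    fn = np.sum((y_true == 1) & ~y_pred)
    fp = np.sum((y_true == 0) & y_pred)
    tn = np.sum((y_true == 0) & ~y_pred)
    return tp / (tp + fn), fp / (fp + tn)

for name, p_hat in [("M1", p_m1), ("M2", p_m2)]:
    for t in [0, 0.25, 0.5, 0.75, 1]:
        tpr, fpr = tpr_fpr(y_true, p_hat, t)
        print(f"{name}  p = {t:<4}:  TPR = {tpr:.1f}, FPR = {fpr:.1f}")
```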

Contingency Tables for \(M_1\)

Threshold \( p \) = ?

\( \hat{C}_1 \) \( \hat{C}_0 \)
\( C_1 \) TP FN
\( C_0 \) FP TN
TPR ?
FPR ?

Threshold \( p \) = 0

\( \hat{C}_1 \) \( \hat{C}_0 \)
\( C_1 \) 5 0
\( C_0 \) 5 0
TPR 1
FPR 1

Contingency Tables for \(M_1\)

Threshold \( p \) = 0.25

\( \hat{C}_1 \) \( \hat{C}_0 \)
\( C_1 \) 5 0
\( C_0 \) 3 2
TPR 1
FPR 0.6

Threshold \( p \) = 0.5

\( \hat{C}_1 \) \( \hat{C}_0 \)
\( C_1 \) 3 2
\( C_0 \) 1 4
TPR 0.6
FPR 0.2

Contingency Tables for \(M_1\)

Threshold \( p \) = 0.75

\( \hat{C}_1 \) \( \hat{C}_0 \)
\( C_1 \) 0 5
\( C_0 \) 0 5
TPR 0
FPR 0

Threshold \( p \) = 1

\( \hat{C}_1 \) \( \hat{C}_0 \)
\( C_1 \) 0 5
\( C_0 \) 0 5
TPR 0
FPR 0

ROC curve for \(M_1\)

  • x-axis is FPR
  • y-axis is TPR
  • Enter the pairs of values for each cutoff \( p \) value
  • Join the points with lines
  • Calculate (estimate) the area under the lines (the AUC)

This demonstrates the method. We would have a better estimate of the AUC for \( M_1 \) if we had calculated TPR and FPR at more \( p \)-values.
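
For example, a trapezoidal estimate from the (FPR, TPR) pairs derived above for \( M_1 \) can be sketched as follows; with only these five thresholds it comes out at roughly 0.78:

```python
import numpy as np

# (FPR, TPR) pairs for M1 from the contingency tables, ordered by increasing FPR
# (thresholds 1 and 0.75 both give the point (0, 0)).
fpr = np.array([0.0, 0.2, 0.6, 1.0])
tpr = np.array([0.0, 0.6, 1.0, 1.0])

auc_m1 = np.trapz(tpr, fpr)   # area under the piecewise-linear ROC curve
print(auc_m1)                 # approximately 0.78
```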

Contingency Tables for \(M_2\)

Threshold \( p \) = ?

\( \hat{C}_1 \) \( \hat{C}_0 \)
\( C_1 \) TP FN
\( C_0 \) FP TN
TPR ?
FPR ?

Threshold \( p \) = 0

\( \hat{C}_1 \) \( \hat{C}_0 \)
\( C_1 \) 5 0
\( C_0 \) 5 0
TPR 1
FPR 1

Contingency Tables for \(M_2\)

Threshold \( p \) = 0.25

\( \hat{C}_1 \) \( \hat{C}_0 \)
\( C_1 \) 2 3
\( C_0 \) 3 2
TPR 0.4
FPR 0.6

Threshold \( p \) = 0.5

\( \hat{C}_1 \) \( \hat{C}_0 \)
\( C_1 \) 1 4
\( C_0 \) 1 4
TPR 0.2
FPR 0.2

Contingency Tables for \(M_2\)

Threshold \( p \) = 0.75

\( \hat{C}_1 \) \( \hat{C}_0 \)
\( C_1 \) 0 5
\( C_0 \) 0 5
TPR 0
FPR 0

Threshold \( p \) = 1

\( \hat{C}_1 \) \( \hat{C}_0 \)
\( C_1 \) 0 5
\( C_0 \) 0 5
TPR 0
FPR 0

ROC curve for \(M_2\)

  • x-axis is FPR
  • y-axis is TPR
  • Enter the pairs of values for each cutoff \( p \) value
  • Join the points with lines
  • Calculate (estimate) the AUC
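
The same trapezoidal sketch applied to \( M_2 \)'s (FPR, TPR) pairs gives a noticeably smaller area - roughly 0.42 with these coarse thresholds:

```python
import numpy as np

# (FPR, TPR) pairs for M2 from its contingency tables, ordered by increasing FPR.
fpr = np.array([0.0, 0.2, 0.6, 1.0])
tpr = np.array([0.0, 0.2, 0.4, 1.0])

print(np.trapz(tpr, fpr))   # approximately 0.42
```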

Which is the better Model?

From the ROC

  • The one with the greater AUC, noting that an AUC below 0.5 is worse than a random guess
  • However, we get more useful information from the ROC curves themselves, e.g. does one curve generally dominate the other?

We can also determine:

  • Cutoff values that give the “best” performance
  • Values that give the best trade-off between TPR and FPR for our particular circumstances (see the sketch below)
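
As a sketch of how this might be done in practice, scikit-learn's roc_curve and roc_auc_score evaluate every distinct threshold rather than our coarse five-value grid; one common rule for a "best" cutoff, used here purely for illustration, is to maximise the difference TPR - FPR (Youden's J):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Data from the earlier table (repeated so this block stands alone).
y_true = np.array([1, 1, 0, 0, 1, 1, 0, 0, 1, 0])   # 1 = C_1, 0 = C_0
p_m1   = np.array([0.73, 0.69, 0.44, 0.55, 0.67, 0.47, 0.08, 0.15, 0.45, 0.35])
p_m2   = np.array([0.61, 0.03, 0.68, 0.31, 0.45, 0.09, 0.38, 0.05, 0.01, 0.04])

# AUC from every distinct threshold - finer than the coarse hand calculation.
print("AUC M1:", roc_auc_score(y_true, p_m1))
print("AUC M2:", roc_auc_score(y_true, p_m2))

# Cutoff for M1 that maximises TPR - FPR (Youden's J).
fpr, tpr, thresholds = roc_curve(y_true, p_m1)
best = np.argmax(tpr - fpr)
print("Best M1 cutoff:", thresholds[best], " TPR:", tpr[best], " FPR:", fpr[best])
```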

Summary

  • I claim that my classifier is better than yours
  • We have a simple method of seeing if I'm right
  • Useful validation technique for any binary classifier
  • Can be extended to non-binary classifiers
  • ROC and AUC used extensively in data mining

Lift charts

Another type of summary plot for comparing classifiers

Lift charts

  • Suppose we are going to approach a pool of customers (phone, mail-out) in the hope that they will respond in some way e.g. buy a new mortgage product
  • Produce models based on historical data that will identify customers likely to give the desired response. Typically a classification model quantifies this as a predicted probability of being in the response class.
  • If we had a limited budget, or are interested in optimising our spending, we should focus first on those most likely to respond, i.e. the more profitable customers.
  • Lift and gain charts address this, and there are a large number of variants

Lift charts

  • There are many variants of 'score ranking' charts available.
  • In every case the \( x \)-axis will be Percentile, with the \( y \)-axis being something like the following:

    • Lift, Cumulative lift
    • Gain
    • % response, cumulative % response
    • % captured response, cumulative % captured response
    • Profit, Cumulative profit

Determining the \(x\)-axis

The common \( x \)-axis can be determined relatively easily.

  • Assume a simple 2-class classification problem that we have constructed a model for.
  • The model at some level will produce a predicted probability for each observation, e.g. \( \hat{p}_i=\text{Predicted } P(y_i=1|\mathbf{x}_i) \)
  • Take the input data \( \mathbf{X} \) and produce \( \hat{p} \) for every observation (for the training, validation or test datasets, as all we require is an input vector \( \mathbf{x} \)).
  • Put these in descending order - these are the values that give us our \( x \)-axis (see the sketch below).
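
A small sketch of this ordering step, with made-up probabilities and responses and the ordered database split into decile bins (all numbers here are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up example: 1000 'customers' with predicted probabilities and observed responses.
p_hat = rng.uniform(0, 1, size=1000)
y     = rng.binomial(1, p_hat)        # responses, loosely driven by p_hat

# Sort the database by descending predicted probability ...
order    = np.argsort(-p_hat)
y_sorted = y[order]

# ... and split it into 10 equal-sized percentile bins (deciles).
deciles = np.array_split(y_sorted, 10)
for i, d in enumerate(deciles, start=1):
    print(f"decile {i}: {100 * d.mean():.0f}% response in bin")
```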

General chart structure

Imagine a customer database ordered by \( \hat{p} \): from left to right we are 'penetrating' deeper into the database, from high \( \hat{p} \) observations to low \( \hat{p} \) observations.

An implementation

  • Assume the model is intended to predict a particular class which we'll refer to as 'positives' or 'responses'
  • In most cases it is informative to consider a baseline case where there is no model:
    • In this case our observations are arranged randomly with respect to the \( x \)-axis
    • All observations have the same predicted probability of being a positive - the proportion of positives in the training dataset.

Lift

Lift

  • The ratio of % captured responses to the baseline % response, in each bin.
  • The example model does about three times as well as guessing over the first few bins of the database.
  • After about 35% of the database we do worse than guessing (we have already captured most of the positives - beyond this point the flagged cases are mostly false positives).

Lift, Cumulative Lift

Cumulative Lift

  • The cumulative ratio of % captured responses to the baseline % response, across the percentile bins (see the sketch after this list).
  • The example model does about twice as well as guessing over the first 50% of the database.
  • If we sampled/targeted the entire database, we would do no better than guessing - we would identify the same number of positives (all of them).
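
A sketch of one common way to compute these quantities, on made-up data: here lift per decile is the bin response rate divided by the baseline response rate, which for equal-sized bins is equivalent to the % captured response relative to random targeting (the lecture's plots may use a slightly different variant):

```python
import numpy as np

rng = np.random.default_rng(1)
p_hat = rng.uniform(0, 1, size=1000)          # made-up predicted probabilities
y     = rng.binomial(1, p_hat)                # made-up responses

y_sorted = y[np.argsort(-p_hat)]              # database ordered by descending p_hat
deciles  = np.array_split(y_sorted, 10)

baseline = y.mean()                           # overall response rate: the 'no model' case
bin_rate = np.array([d.mean() for d in deciles])
lift     = bin_rate / baseline                # lift per decile

# Cumulative lift: responses captured so far vs what random targeting would capture.
cum_resp = np.cumsum([d.sum() for d in deciles])
cum_size = np.cumsum([len(d) for d in deciles])
cum_lift = cum_resp / (baseline * cum_size)

for i in range(10):
    print(f"decile {i + 1}: lift = {lift[i]:.2f}, cumulative lift = {cum_lift[i]:.2f}")
```

The last cumulative-lift value is always 1: targeting the whole database captures every positive, exactly as random targeting would.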

Why use these plots?

The utility of these charts is hopefully clear:

  • if we had a limited budget, we can see what level of response this would buy by targeting the (modelled) most likely responders
  • we can see how much value our model has brought to the problem (compared to a random sample of customers) - in direct monetary terms if costs are included
  • perhaps we can do a smaller campaign, as the returns diminish beyond some percentage of customers targeted
  • we can see where a level of customer targeting becomes unprofitable if the costs are known.