Optimal Classification Cutoff For Weighted Sum of TPR and TNR

Mark Sandstrom
Tue Dec 22 12:24:12 2015

SUMMARY

Classifying a model's predictions into 'pos' and 'neg' classes depends on the cutoff value applied to the prediction scores.
When the true pos/neg values of the observations are known, prediction accuracy can be computed as a function of the classification cutoff.
The accuracies for the pos and neg cases, i.e. the True Positive Rate (TPR) and True Negative Rate (TNR), can be given differing weights: a classification quality score can be computed as the weighted sum a*TPR + (1-a)*TNR, where the weight a is between 0 and 1.
The app visualizes the optimal classification cutoff for such a user-defined weighted sum of TPR and TNR.
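
As a minimal sketch of the weighted sum described above, the following base R function computes a*TPR + (1-a)*TNR at a single cutoff. The function name weighted_score and the toy vectors s_toy and l_toy are hypothetical, introduced only for illustration; this is not necessarily how the app's server.R computes the score.

```r
# Toy scores and true labels, made up for illustration (not the app's data)
s_toy <- c(0.92, 0.81, 0.42, 0.33, 0.61, 0.12)
l_toy <- c(1, 1, 1, 0, 0, 0)

# Weighted sum of TPR and TNR at a given cutoff.
# s: prediction scores, l: true 0/1 labels, a: weight on TPR, between 0 and 1.
weighted_score <- function(s, l, cutoff, a) {
  pred <- as.integer(s > cutoff)                 # scores above the cutoff are 'pos'
  tpr <- sum(pred == 1 & l == 1) / sum(l == 1)   # share of actual 'pos' caught
  tnr <- sum(pred == 0 & l == 0) / sum(l == 0)   # share of actual 'neg' rejected
  a * tpr + (1 - a) * tnr
}

weighted_score(s_toy, l_toy, cutoff = 0.5, a = 0.7)
```

Setting a = 1 reduces the score to TPR alone, and a = 0 reduces it to TNR alone.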

In the example app, server.R forms a set of prediction scores s (approximately in [0..1]) and the corresponding true 0/1 class labels l:

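library(ROCR)                    # provides the ROCR.simple example data set
data(ROCR.simple)
dr = as.data.frame(ROCR.simple)
l = dr$labels                    # true 0/1 class labels
L = length(l)                    # number of observations
# example scores: ROCR predictions plus Gaussian noise, minus a small index-dependent offset
# (':' binds tighter than '/' in R, so the parentheses only make the original grouping explicit)
s = dr$predictions + rnorm(L)*.1 - (L:1)/(L*10)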

In real use, the prediction scores s would be produced e.g. by a machine learning algorithm from the observed feature values of the objects to be classified.
The code above would then be replaced by code that computes or loads the actual prediction scores and true classes of the observed objects.

The prediction model gives a score s in [0..1] for each observed object. A score greater than the cutoff is classified into the 'pos' class 1, and into the 'neg' class 0 otherwise.
The proportion of classifications that are correct with respect to the true labels l, i.e. the accuracy of the classifier, is thus a function of the cutoff value.
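
The dependence of accuracy on the cutoff can be sketched in base R as follows. The names accuracy_at, s_toy and l_toy are hypothetical and the data are made up; with the example data above, s and l would take the place of the toy vectors.

```r
# Toy scores and true labels, made up for illustration (not the app's data)
s_toy <- c(0.92, 0.81, 0.42, 0.33, 0.61, 0.12)
l_toy <- c(1, 1, 1, 0, 0, 0)

# Fraction of objects whose predicted class (score > cutoff) matches the true label
accuracy_at <- function(s, l, cutoff) {
  mean(as.integer(s > cutoff) == l)
}

# Evaluate accuracy on a grid of candidate cutoffs
cutoffs <- seq(0, 1, by = 0.05)
acc <- sapply(cutoffs, function(k) accuracy_at(s_toy, l_toy, k))

# Accuracy as a function of the cutoff, one row per candidate cutoff
print(data.frame(cutoff = cutoffs, accuracy = acc))
```

Calling plot(cutoffs, acc, type = "s") on the result would draw the accuracy curve as a step function.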

In many scenarios, correctly classifying 'pos' objects and correctly classifying 'neg' objects differ in relative importance.
For instance, in a population of people exposed to a dangerous contagious disease, correctly classifying an actual positive case, i.e. the True Positive Rate (TPR), can matter more than correctly classifying an actual negative case.
In other scenarios, e.g. due to an abundance of likely 'pos' objects (such as investment or customer prospects), resource constraints, or a high cost of false positives, the True Negative Rate (TNR) can be given greater weight.

Moreover, the relative importance of correctly classifying 'pos' and 'neg' cases can vary over time, location etc., as the characteristics of the object population, the availability and cost of resources, the success payoffs and other circumstances change.
A viable solution is thus to let a user who understands such dynamic operational conditions adjust the relative importance of catching the actual 'pos' cases vs. correctly rejecting the actual 'neg' cases, i.e. the weights of the TPR and TNR components of the weighted accuracy score, and to provide a visualization of the estimated optimal cutoff that maximizes that score: https://marksandstrom.shinyapps.io/dataproducts
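
One simple way to estimate such an optimal cutoff is a grid search over candidate cutoffs, sketched below in base R. This is an assumption about a workable approach, not a description of the app's actual implementation; weighted_score, best_cutoff and the toy data are hypothetical names and values.

```r
# Toy scores and true labels, made up for illustration (not the app's data)
s_toy <- c(0.92, 0.81, 0.42, 0.33, 0.61, 0.12)
l_toy <- c(1, 1, 1, 0, 0, 0)

# Weighted sum a*TPR + (1-a)*TNR at a given cutoff
weighted_score <- function(s, l, cutoff, a) {
  pred <- as.integer(s > cutoff)
  tpr <- sum(pred == 1 & l == 1) / sum(l == 1)
  tnr <- sum(pred == 0 & l == 0) / sum(l == 0)
  a * tpr + (1 - a) * tnr
}

# Grid search: evaluate the score at each candidate cutoff and keep the best one
best_cutoff <- function(s, l, a, cutoffs = seq(0, 1, by = 0.05)) {
  scores <- sapply(cutoffs, function(k) weighted_score(s, l, k, a))
  i <- which.max(scores)   # first maximum if several cutoffs tie
  list(cutoff = cutoffs[i], score = scores[i])
}

tpr_heavy <- best_cutoff(s_toy, l_toy, a = 0.9)   # catching 'pos' cases matters most
tnr_heavy <- best_cutoff(s_toy, l_toy, a = 0.1)   # rejecting 'neg' cases matters most
print(c(tpr_heavy$cutoff, tnr_heavy$cutoff))
```

With these toy values, the TPR-heavy weight selects a lower cutoff than the TNR-heavy weight, which matches the intuition that a lower cutoff catches more actual 'pos' cases at the price of more false positives.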