Optimal Classification Cutoff For Weighted Sum of TPR and TNR

Mark Sandstrom
Tue Dec 22 12:24:12 2015

Optimal cutoff for user-defined weighting of True Positive Rate (TPR) vs. True Negative Rate (TNR)

SUMMARY

  • Classification of the predictions produced by a given model into 'pos' and 'neg' classes depends on the cutoff value.
  • When the true pos/neg values for the observations are known, the accuracy of the predictions can be computed as a function of the pos/neg classification cutoff.
  • The prediction accuracies for pos and neg cases, i.e. TPR and TNR, can have differing weights: a classification quality score can be computed as the weighted sum a*TPR + (1-a)*TNR, where a is the weight given to TPR.
  • The app visualizes the optimal classification cutoff for such a user-defined weighted sum of TPR and TNR.
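
As a rough illustration of the weighted sum above, the following sketch computes a*TPR + (1-a)*TNR at a single cutoff. The toy labels and scores, and the helper name weighted_score, are hypothetical and are not taken from the app's server.R.

```r
# Hypothetical toy data: 4 actual positives followed by 4 actual negatives.
labels <- c(1, 1, 1, 1, 0, 0, 0, 0)
scores <- c(0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1)

# Weighted sum a*TPR + (1-a)*TNR at a given cutoff (hypothetical helper).
weighted_score <- function(scores, labels, cutoff, a) {
  pred <- as.integer(scores > cutoff)      # 1 = 'pos', 0 = 'neg'
  tpr  <- mean(pred[labels == 1] == 1)     # share of actual pos classified pos
  tnr  <- mean(pred[labels == 0] == 0)     # share of actual neg classified neg
  a * tpr + (1 - a) * tnr
}

weighted_score(scores, labels, cutoff = 0.25, a = 0.5)  # 0.75 (TPR = 1, TNR = 0.5)
weighted_score(scores, labels, cutoff = 0.25, a = 0.8)  # 0.9
```

With the same cutoff, raising a from 0.5 to 0.8 raises the score here because this cutoff catches every actual positive while misclassifying half of the actual negatives.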

The data

In the example app, server.R forms a set of prediction scores s in [0..1] and the corresponding true 0/1 class labels l:

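library(ROCR)                      # provides the ROCR.simple example data set
data(ROCR.simple)
dr <- as.data.frame(ROCR.simple)   # columns: predictions, labels
l  <- dr$labels                    # true 0/1 class labels
L  <- length(l)
# Perturb the stock scores with Gaussian noise and a small index-dependent offset.
# Note: the perturbation can push a few scores slightly outside [0, 1].
s  <- dr$predictions + rnorm(L) * 0.1 - (L:1) / (L * 10)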
  • In real usage, the prediction scores s would be produced e.g. by a machine learning algorithm based on the observed feature values of the objects to be classified.
  • For real user scenarios, the above code is to be replaced by code that forms or brings in the actual prediction scores and true classes of the observed objects.

Accuracy as function of cutoff

  • The prediction model gives a score s in [0..1] for each observed object; scores > cutoff are classified into the 'pos' class 1, and all other scores into the 'neg' class 0.
  • The portion of correct classifications of such scores s w.r.t. their true labels l, i.e., the accuracy of the classifier, is thus a function of the cutoff value. [Plot: accuracy as a function of cutoff]
  • Cool.
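
The accuracy curve described above can be sketched in base R as follows. This is a minimal illustration on simulated data, not the app's own code: the data-generating step, the seed, and the 0.01 cutoff grid are all assumptions made for the example.

```r
set.seed(1)                                  # reproducible simulated data
n      <- 200
labels <- rbinom(n, 1, 0.5)                  # hypothetical true 0/1 classes
# Hypothetical scores: positives tend to score higher; clamp into [0, 1].
scores <- pmin(pmax(0.35 + 0.3 * labels + rnorm(n, sd = 0.2), 0), 1)

# Accuracy at each cutoff: share of objects whose predicted class matches the label.
cutoffs  <- seq(0, 1, by = 0.01)
accuracy <- sapply(cutoffs, function(cutoff) {
  mean(as.integer(scores > cutoff) == labels)
})

# Grid cutoff with the highest accuracy (first one in case of ties).
best <- cutoffs[which.max(accuracy)]

plot(cutoffs, accuracy, type = "l", xlab = "cutoff", ylab = "accuracy")
abline(v = best, lty = 2)                    # mark the accuracy-maximizing cutoff
```

At cutoff 1 no score exceeds the cutoff, so everything is classified 'neg' and the accuracy equals the share of actual negatives; the curve typically peaks somewhere between the extremes.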

Weighted accuracy of 'pos' and 'neg' cases

  • In many scenarios, correctly classifying the 'pos' objects and correctly classifying the 'neg' objects differ in relative importance.
  • For instance, in a population of people exposed to a dangerous contagious disease, correctly classifying an actual positive candidate (reflected in the True Positive Rate, TPR) can be more important than correctly classifying an actual negative candidate.
  • In other scenarios, e.g. due to an abundance of likely 'pos' objects (e.g. investment or customer prospects), resource constraints, or a high cost of false positives, the True Negative Rate (TNR) can be given greater weight.

The solution

  • What's more, the relative importance of correctly classifying 'pos' and 'neg' cases can vary over time, location etc., as circumstances such as the object population characteristics, the availability/cost of resources, and success payoffs change.
  • A viable solution thus is to allow the user, who understands such dynamic operational conditions, to adjust the relative importance of catching the actual 'pos' cases vs. correctly rejecting the actual 'neg' cases, i.e., the weights for the TPR and TNR components of the weighted accuracy score,
  • and to provide a visualization of the estimated optimal cutoff for maximizing such a weighted accuracy score: https://marksandstrom.shinyapps.io/dataproducts
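
One way to estimate such an optimal cutoff is a simple grid search over candidate cutoffs, sketched below. This is an illustration under stated assumptions rather than the app's actual implementation: the simulated data, the helper name optimal_cutoff, and the 0.01 grid are hypothetical.

```r
set.seed(2)                                  # reproducible simulated data
n      <- 200
labels <- rbinom(n, 1, 0.5)                  # hypothetical true 0/1 classes
# Hypothetical scores: positives tend to score higher; clamp into [0, 1].
scores <- pmin(pmax(0.35 + 0.3 * labels + rnorm(n, sd = 0.2), 0), 1)

# Grid search for the cutoff maximizing a*TPR + (1-a)*TNR (hypothetical helper).
optimal_cutoff <- function(scores, labels, a, cutoffs = seq(0, 1, by = 0.01)) {
  score <- sapply(cutoffs, function(cutoff) {
    pred <- scores > cutoff                  # TRUE = classified 'pos'
    tpr  <- mean(pred[labels == 1])          # actual pos classified pos
    tnr  <- mean(!pred[labels == 0])         # actual neg classified neg
    a * tpr + (1 - a) * tnr
  })
  best <- which.max(score)                   # first maximizer in case of ties
  c(cutoff = cutoffs[best], score = score[best])
}

# Compare the estimated optimal cutoff for a few user-chosen weights a.
for (a in c(0.2, 0.5, 0.8)) print(c(a = a, optimal_cutoff(scores, labels, a)))
```

Giving more weight to TPR (larger a) favors lower cutoffs, since lowering the cutoff can only classify more objects as 'pos'; giving more weight to TNR favors higher cutoffs.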