Finding the optimal cutoff point for a binary classification model is a common task. cutpointr is an R package for tidy calculation of “optimal” cutpoints. It supports several methods for calculating cutpoints and includes several metrics that can be maximized or minimized by selecting a cutpoint. Some of these methods are designed to be more robust than simple empirical optimization of a metric. Additionally, cutpointr can automatically bootstrap the variability of the optimal cutpoints and return out-of-bag estimates of various performance metrics.
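As a quick preview of the interface covered in the rest of this post, a minimal call looks roughly like this (a sketch only; the suicide example data and all arguments are introduced in detail below):
library(cutpointr)
data("suicide")  # example dataset shipped with cutpointr

# Sketch: maximize a metric over all candidate cutpoints and bootstrap the
# cutpoint's variability (boot_runs is optional)
cp <- cutpointr(suicide, dsi, suicide,
                method = maximize_metric, metric = sum_sens_spec,
                boot_runs = 100)
summary(cp)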
The included methods for calculating cutpoints are:
maximize_metric
: Maximize the metric function
minimize_metric
: Minimize the metric function
maximize_loess_metric
: Maximize the metric function after LOESS smoothing
minimize_loess_metric
: Minimize the metric function after LOESS smoothing
maximize_spline_metric
: Maximize the metric function after spline smoothing
minimize_spline_metric
: Minimize the metric function after spline smoothing
maximize_gam_metric
: Maximize the metric function after smoothing via Generalized Additive Models
minimize_gam_metric
: Minimize the metric function after smoothing via Generalized Additive Models
maximize_boot_metric
: Bootstrap the optimal cutpoint when maximizing a metric
minimize_boot_metric
: Bootstrap the optimal cutpoint when minimizing a metric
oc_manual
: Specify the cutoff value manually
oc_mean
: Use the sample mean as the “optimal” cutpoint
oc_median
: Use the sample median as the “optimal” cutpoint
oc_youden_kernel
: Maximize the Youden-Index after kernel smoothing the distributions of the two classes
oc_youden_normal
: Maximize the Youden-Index parametrically assuming normally distributed data in both classes
The included metrics to be used with the minimization and maximization methods are:
accuracy
: Fraction correctly classified
abs_d_sens_spec
: The absolute difference of sensitivity and specificity
abs_d_ppv_npv
: The absolute difference between positive predictive value (PPV) and negative predictive value (NPV)
roc01
: Distance to the point (0,1) on ROC space
cohens_kappa
: Cohen’s Kappa
sum_sens_spec
: sensitivity + specificity
sum_ppv_npv
: The sum of positive predictive value (PPV) and negative predictive value (NPV)
prod_sens_spec
: sensitivity * specificity
prod_ppv_npv
: The product of positive predictive value (PPV) and negative predictive value (NPV)
youden
: Youden- or J-Index = sensitivity + specificity - 1
odds_ratio
: (Diagnostic) odds ratio
risk_ratio
: Risk ratio (relative risk)
p_chisquared
: The p-value of a chi-squared test on the confusion matrix
misclassification_cost
: The sum of the misclassification cost of false positives and false negatives. Additional arguments: cost_fp, cost_fn
total_utility
: The total utility of true / false positives / negatives. Additional arguments: utility_tp, utility_tn, cost_fp, cost_fn (see the sketch after this list)
F1_score
: The F1-score (2 * TP) / (2 * TP + FP + FN)
metric_constrain
: Maximize a selected metric given a minimal value of another selected metric
sens_constrain
: Maximize sensitivity given a minimal value of specificity
spec_constrain
: Maximize specificity given a minimal value of sensitivity
acc_constrain
: Maximize accuracy given a minimal value of sensitivity
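Metric-specific arguments such as cost_fp, cost_fn, utility_tp, and utility_tn are simply passed on by cutpointr() to the metric function. A minimal sketch with arbitrary placeholder utilities and costs, using the suicide example data that ships with cutpointr and is introduced below:
library(cutpointr)
data("suicide")

# Sketch: the utility and cost values are arbitrary placeholders; cutpointr()
# forwards them to the total_utility metric function.
cutpointr(suicide, dsi, suicide, method = maximize_metric,
          metric = total_utility,
          utility_tp = 10, utility_tn = 1, cost_fp = 1, cost_fn = 10)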
Furthermore, the following functions are included which can be used as metric functions but are more useful for plotting purposes, for example in plot_cutpointr, or for defining new metric functions: tp, fp, tn, fn, tpr, fpr, tnr, fnr, false_omission_rate, false_discovery_rate, ppv, npv, precision, recall, sensitivity, and specificity.
The inputs to the arguments method and metric are functions, so that user-defined functions can easily be supplied instead of the built-in ones.
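For example, a user-defined metric can be written as a plain R function. The sketch below assumes the convention used by the built-in metrics: the function receives the vectors tp, fp, tn, fn (one element per candidate cutpoint) plus ..., and returns a numeric vector or a one-column matrix whose column name is used as the metric's name in the output.
library(cutpointr)
data("suicide")

# Sketch of a user-defined metric: balanced accuracy = (sensitivity + specificity) / 2
balanced_accuracy <- function(tp, fp, tn, fn, ...) {
  bacc <- (tp / (tp + fn) + tn / (tn + fp)) / 2
  bacc <- matrix(bacc, ncol = 1)
  colnames(bacc) <- "balanced_accuracy"
  bacc
}

cutpointr(suicide, dsi, suicide, method = maximize_metric,
          metric = balanced_accuracy)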
library(tidyverse)
## ── Attaching packages ──────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.3
## ✓ tibble 3.0.1 ✓ dplyr 1.0.0
## ✓ tidyr 1.0.0 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## Warning: package 'ggplot2' was built under R version 3.6.2
## Warning: package 'tibble' was built under R version 3.6.2
## Warning: package 'dplyr' was built under R version 3.6.2
## ── Conflicts ─────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(cutpointr)
## Warning: package 'cutpointr' was built under R version 3.6.2
data("suicide")
head(suicide)
## age gender dsi suicide
## 1 29 female 1 no
## 2 26 male 0 no
## 3 26 female 0 no
## 4 27 female 0 no
## 5 28 female 0 no
## 6 53 male 2 no
cp <- cutpointr(suicide, dsi, suicide, pos_class = "yes", neg_class = "no", direction = ">=",
method = maximize_metric, metric = sum_sens_spec)
summary(cp)
## Method: maximize_metric
## Predictor: dsi
## Outcome: suicide
## Direction: >=
##
## AUC n n_pos n_neg
## 0.9238 532 36 496
##
## optimal_cutpoint sum_sens_spec acc sensitivity specificity tp fn fp tn
## 2 1.7518 0.8647 0.8889 0.8629 32 4 68 428
##
## Predictor summary:
## Data Min. 5% 1st Qu. Median Mean 3rd Qu. 95% Max. SD NAs
## Overall 0 0.00 0 0 0.9210526 1 5.00 11 1.852714 0
## no 0 0.00 0 0 0.6330645 0 4.00 10 1.412225 0
## yes 0 0.75 4 5 4.8888889 6 9.25 11 2.549821 0
plot(cp)
With the oc_manual function, the optimal cutpoint is not determined by, for example, maximizing a metric, but is instead set manually via the cutpoint argument. This is useful for supplying and evaluating cutpoints that were found in the literature or in other external sources.
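For instance, a cutoff reported in the literature could be evaluated like this (a sketch; the value 4 is a hypothetical placeholder, not a published recommendation):
library(cutpointr)
data("suicide")

# Sketch: evaluate a fixed, externally chosen cutoff (4 is a hypothetical value)
cutpointr(suicide, dsi, suicide, method = oc_manual, cutpoint = 4,
          pos_class = "yes", neg_class = "no", direction = ">=")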
The oc_manual function could also be used to set the cutpoint to the sample mean using cutpoint = mean(data$x). However, this may introduce bias into the bootstrap validation procedure, since the actual mean of the population is not known and thus the mean to be used as the cutpoint should be automatically determined in every resample. To do so, the oc_mean and oc_median functions can be used.
set.seed(100)
opt_cut_manual <- cutpointr(suicide, dsi, suicide, method = oc_manual,
cutpoint = mean(suicide$dsi), boot_runs = 100)
## Assuming the positive class is yes
## Assuming the positive class has higher x values
## Running bootstrap...
set.seed(100)
opt_cut_mean <- cutpointr(suicide, dsi, suicide, method = oc_mean, boot_runs = 100)
## Assuming the positive class is yes
## Assuming the positive class has higher x values
## Running bootstrap...
summary(opt_cut_manual)
## Method: oc_manual
## Predictor: dsi
## Outcome: suicide
## Direction: >=
## Nr. of bootstraps: 100
##
## AUC n n_pos n_neg
## 0.9238 532 36 496
##
## optimal_cutpoint sum_sens_spec acc sensitivity specificity tp fn fp tn
## 0.9211 1.7025 0.7707 0.9444 0.7581 34 2 120 376
##
## Predictor summary:
## Data Min. 5% 1st Qu. Median Mean 3rd Qu. 95% Max. SD NAs
## Overall 0 0.00 0 0 0.9210526 1 5.00 11 1.852714 0
## no 0 0.00 0 0 0.6330645 0 4.00 10 1.412225 0
## yes 0 0.75 4 5 4.8888889 6 9.25 11 2.549821 0
##
## Bootstrap summary:
## Variable Min. 5% 1st Qu. Median Mean 3rd Qu. 95% Max. SD NAs
## optimal_cutpoint 0.92 0.92 0.92 0.92 0.92 0.92 0.92 0.92 0.00 0
## AUC_b 0.85 0.88 0.91 0.92 0.92 0.94 0.96 0.98 0.02 0
## AUC_oob 0.83 0.86 0.90 0.93 0.93 0.95 0.97 0.99 0.03 0
## sum_sens_spec_b 1.55 1.63 1.67 1.70 1.70 1.73 1.76 1.79 0.04 0
## sum_sens_spec_oob 1.52 1.61 1.67 1.71 1.71 1.75 1.78 1.81 0.06 0
## acc_b 0.73 0.74 0.76 0.77 0.77 0.78 0.80 0.81 0.02 0
## acc_oob 0.71 0.73 0.75 0.77 0.77 0.78 0.82 0.84 0.03 0
## sensitivity_b 0.82 0.88 0.92 0.94 0.94 0.97 1.00 1.00 0.04 0
## sensitivity_oob 0.78 0.86 0.92 0.94 0.95 1.00 1.00 1.00 0.05 0
## specificity_b 0.71 0.72 0.74 0.76 0.76 0.77 0.79 0.80 0.02 0
## specificity_oob 0.69 0.71 0.74 0.76 0.76 0.77 0.80 0.84 0.03 0
## cohens_kappa_b 0.14 0.21 0.24 0.27 0.27 0.30 0.35 0.37 0.04 0
## cohens_kappa_oob 0.16 0.19 0.25 0.28 0.28 0.31 0.38 0.47 0.06 0
opt_cut <- cutpointr(suicide, dsi, suicide, direction = ">=", pos_class = "yes",
neg_class = "no", method = maximize_metric, metric = youden)
summary(opt_cut)
## Method: maximize_metric
## Predictor: dsi
## Outcome: suicide
## Direction: >=
##
## AUC n n_pos n_neg
## 0.9238 532 36 496
##
## optimal_cutpoint youden acc sensitivity specificity tp fn fp tn
## 2 0.7518 0.8647 0.8889 0.8629 32 4 68 428
##
## Predictor summary:
## Data Min. 5% 1st Qu. Median Mean 3rd Qu. 95% Max. SD NAs
## Overall 0 0.00 0 0 0.9210526 1 5.00 11 1.852714 0
## no 0 0.00 0 0 0.6330645 0 4.00 10 1.412225 0
## yes 0 0.75 4 5 4.8888889 6 9.25 11 2.549821 0
opt_cut <- cutpointr(suicide, dsi, suicide, direction = ">=", pos_class = "yes",
neg_class = "no", method = minimize_metric, metric = roc01)
summary(opt_cut)
## Method: minimize_metric
## Predictor: dsi
## Outcome: suicide
## Direction: >=
##
## AUC n n_pos n_neg
## 0.9238 532 36 496
##
## optimal_cutpoint roc01 acc sensitivity specificity tp fn fp tn
## 2 0.1765 0.8647 0.8889 0.8629 32 4 68 428
##
## Predictor summary:
## Data Min. 5% 1st Qu. Median Mean 3rd Qu. 95% Max. SD NAs
## Overall 0 0.00 0 0 0.9210526 1 5.00 11 1.852714 0
## no 0 0.00 0 0 0.6330645 0 4.00 10 1.412225 0
## yes 0 0.75 4 5 4.8888889 6 9.25 11 2.549821 0
Since false negatives (a suicide attempt that was not anticipated) can be regarded as much more severe than false positives, we can set the cost of a false negative (cost_fn) to, for example, ten times the cost of a false positive.
opt_cut <- cutpointr(suicide, dsi, suicide, gender, method = minimize_metric,
metric = misclassification_cost, cost_fp = 1, cost_fn = 10)
## Assuming the positive class is yes
## Assuming the positive class has higher x values
summary(opt_cut)
## Method: minimize_metric
## Predictor: dsi
## Outcome: suicide
## Direction: >=
## Subgroups: female, male
##
## Subgroup: female
## --------------------------------------------------------------------------------
## AUC n n_pos n_neg
## 0.9446 392 27 365
##
## optimal_cutpoint misclassification_cost acc sensitivity specificity tp fn
## 2 63 0.8852 0.9259 0.8822 25 2
## fp tn
## 43 322
##
## Predictor summary:
## Data Min. 5% 1st Qu. Median Mean 3rd Qu. 95% Max. SD NAs
## Overall 0 0.0 0 0 0.8392857 1 5 10 1.745198 0
## no 0 0.0 0 0 0.5479452 0 4 10 1.318102 0
## yes 0 1.3 4 5 4.7777778 6 7 9 2.044379 0
##
## Subgroup: male
## --------------------------------------------------------------------------------
## AUC n n_pos n_neg
## 0.8617 140 9 131
##
## optimal_cutpoint misclassification_cost acc sensitivity specificity tp fn
## 3 40 0.8429 0.7778 0.8473 7 2
## fp tn
## 20 111
##
## Predictor summary:
## Data Min. 5% 1st Qu. Median Mean 3rd Qu. 95% Max. SD NAs
## Overall 0 0.0 0 0 1.150000 1 6.0 11 2.115122 0
## no 0 0.0 0 0 0.870229 1 5.0 6 1.628576 0
## yes 0 0.4 3 4 5.222222 8 10.6 11 3.833333 0
plot_metric(opt_cut)
cutpointr includes several convenience functions for plotting data from a cutpointr object. These include:
plot_cutpointr
: General purpose plotting function for cutpointr or roc_cutpointr objects
plot_cut_boot
: Plot the bootstrapped distribution of optimal cutpoints
plot_metric
: If maximize_metric or minimize_metric was used, this function plots all possible cutoffs on the x-axis vs. the respective metric values on the y-axis. If bootstrapping was run, a confidence interval based on the bootstrapped distribution of metric values at each cutpoint can be displayed. To display no confidence interval, set conf_lvl = 0.
plot_metric_boot
: Plot the distribution of out-of-bag metric values
plot_precision_recall
: Plot the precision recall curve
plot_sensitivity_specificity
: Plot all cutpoints vs. sensitivity and specificity
plot_roc
: Plot the ROC curve
plot_x
: Plot the distribution of the predictor variable
plot_precision_recall(opt_cut)
plot_sensitivity_specificity(opt_cut)
plot_roc(opt_cut)
plot_cutpointr(opt_cut, xvar = cutpoints, yvar = youden, conf_lvl = 0.9)
plot_cutpointr(opt_cut, xvar = fpr, yvar = tpr, aspect_ratio = 1, conf_lvl = 0)
By default, the output of cutpointr includes the optimized metric and several other metrics. The add_metric function adds further metrics.
cutpointr(suicide, dsi, suicide, gender, metric = youden, silent = TRUE) %>%
  add_metric(list(tpr, precision, recall)) %>%
  select(subgroup, optimal_cutpoint, youden, tpr, precision, recall)
## # A tibble: 2 x 6
## subgroup optimal_cutpoint youden tpr precision recall
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 female 2 0.808118 0.925926 0.367647 0.925926
## 2 male 3 0.625106 0.777778 0.259259 0.777778
roc(data = suicide, x = dsi, class = suicide, pos_class = "yes",
neg_class = "no", direction = ">=") %>%
  add_metric(list(cohens_kappa, F1_score, roc01, youden, precision, recall)) %>%
  select(x.sorted, tp, fp, tn, fn, cohens_kappa, F1_score, roc01, youden, precision, recall)
## # A tibble: 13 x 11
## x.sorted tp fp tn fn cohens_kappa F1_score roc01 youden precision
## <dbl> <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Inf 0 0 496 36 0 0 1 0 NaN
## 2 11 1 0 496 35 0.0506 0.0541 0.972 0.0278 1
## 3 10 2 1 495 34 0.0931 0.103 0.944 0.0535 0.667
## 4 9 3 1 495 33 0.138 0.15 0.917 0.0813 0.75
## 5 8 4 1 495 32 0.182 0.195 0.889 0.109 0.8
## 6 7 7 1 495 29 0.301 0.318 0.806 0.192 0.875
## 7 6 16 6 490 20 0.527 0.552 0.556 0.432 0.727
## 8 5 20 16 480 16 0.523 0.556 0.446 0.523 0.556
## 9 4 28 44 452 8 0.471 0.519 0.239 0.689 0.389
## 10 3 29 56 440 7 0.425 0.479 0.225 0.693 0.341
## 11 2 32 68 428 4 0.412 0.471 0.176 0.752 0.32
## 12 1 34 120 376 2 0.279 0.358 0.248 0.703 0.221
## 13 0 36 496 0 0 0 0.127 1 0 0.0677
## # … with 1 more variable: recall <dbl>
opt_cut <- cutpointr(suicide, dsi, suicide, metric = youden, silent = TRUE) %>%
  add_metric(list(tpr, fpr, precision, recall))
plot_roc(opt_cut) +
  labs(title = paste0("ROC Curve - Optimal: ", opt_cut$optimal_cutpoint),
       subtitle = "based on Youden's Index") +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  geom_segment(aes(x = opt_cut$fpr, y = opt_cut$fpr,
                   xend = opt_cut$fpr, yend = opt_cut$tpr),
               linetype = "dotted", color = "red")
plot_precision_recall(opt_cut) +
  geom_segment(aes(x = opt_cut$recall, y = 0,
                   xend = opt_cut$recall, yend = opt_cut$precision),
               linetype = "dotted", color = "red") +
  geom_segment(aes(x = 0, y = opt_cut$precision,
                   xend = opt_cut$recall, yend = opt_cut$precision),
               linetype = "dotted", color = "red") +
  annotate(geom = "text",
           x = opt_cut$recall - 0.05,
           y = opt_cut$precision + 0.2,
           label = paste0("precision: ", round(opt_cut$precision, 2),
                          "\nrecall: ", round(opt_cut$recall, 2),
                          "\n threshold: ", round(opt_cut$optimal_cutpoint, 2)))
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:cutpointr':
##
## auc, roc
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
r <- suicide %>% pROC::roc(suicide,dsi)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
coords(r, "best")
## Warning in coords.roc(r, "best"): The 'transpose' argument to FALSE by
## default since pROC 1.16. Set transpose = TRUE explicitly to revert to
## the previous behavior, or transpose = TRUE to silence this warning. Type
## help(coords_transpose) for additional information.
## threshold specificity sensitivity
## 1 1.5 0.8629032 0.8888889
coords(r, x="best", input="threshold", best.method="youden", transpose = F)
## threshold specificity sensitivity
## 1 1.5 0.8629032 0.8888889
suicide %>% pROC::roc(suicide, dsi) %>%
  coords(ret = "all", transpose = FALSE) %>%
  select(threshold, precision, recall, tpr, fpr, youden, closest.topleft)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
## threshold precision recall tpr fpr youden
## 1 -Inf 0.06766917 1.00000000 1.00000000 1.000000000 1.000000
## 2 0.5 0.22077922 0.94444444 0.94444444 0.241935484 1.702509
## 3 1.5 0.32000000 0.88888889 0.88888889 0.137096774 1.751792
## 4 2.5 0.34117647 0.80555556 0.80555556 0.112903226 1.692652
## 5 3.5 0.38888889 0.77777778 0.77777778 0.088709677 1.689068
## 6 4.5 0.55555556 0.55555556 0.55555556 0.032258065 1.523297
## 7 5.5 0.72727273 0.44444444 0.44444444 0.012096774 1.432348
## 8 6.5 0.87500000 0.19444444 0.19444444 0.002016129 1.192428
## 9 7.5 0.80000000 0.11111111 0.11111111 0.002016129 1.109095
## 10 8.5 0.75000000 0.08333333 0.08333333 0.002016129 1.081317
## 11 9.5 0.66666667 0.05555556 0.05555556 0.002016129 1.053539
## 12 10.5 1.00000000 0.02777778 0.02777778 0.000000000 1.027778
## 13 Inf NaN 0.00000000 0.00000000 0.000000000 1.000000
## closest.topleft
## 1 1.00000000
## 2 0.06161920
## 3 0.03114120
## 4 0.05055578
## 5 0.05725212
## 6 0.19857145
## 7 0.30878831
## 8 0.64892382
## 9 0.79012752
## 10 0.84028184
## 11 0.89197937
## 12 0.94521605
## 13 1.00000000
gl <- ggroc(r, legacy.axes = TRUE)
gl
gl + xlab("FPR") + ylab("TPR") +
geom_segment(aes(x = 0, xend = 1, y = 0, yend = 1), color="darkgrey", linetype="dashed")
plot(r, print.thres="best", print.thres.best.method="youden")
# You can use coords to plot for instance a sensitivity + specificity vs. cut-off diagram
plot(specificity + sensitivity ~ threshold,
coords(r, "all", transpose = FALSE),
type = "l", log="x",
subset = is.finite(threshold))
# Plot the Precision-Recall curve
plot(precision ~ recall,
coords(r, "all", ret = c("recall", "precision"), transpose = FALSE),
type="l", ylim = c(0, 1))
# Alternatively plot the curve with TPR and FPR instead of SE/SP
# (identical curve, only the axis change)
plot(tpr ~ fpr,
coords(r, "all", ret = c("tpr", "fpr"), transpose = FALSE),
type="l")
# Print the thresholds on the plot
# Plot some thresholds, add them to the same plot
plot(r, print.thres="best", print.thres.best.method="youden")
plot(r, print.thres="best", print.thres.best.method="closest.topleft",
add = TRUE)