Introduction

There is a common task to find optimal cutoff point for a binary classification model. cutpointr is a R package for tidy calculation of “optimal” cutpoints. It supports several methods for calculating cutpoints and includes several metrics that can be maximized or minimized by selecting a cutpoint. Some of these methods are designed to be more robust than the simple empirical optimization of a metric. Additionally, cutpointr can automatically bootstrap the variability of the optimal cutpoints and return out-of-bag estimates of various performance metrics.

Method functions for cutpoint estimation

The included methods for calculating cutpoints are:

Metric functions

The included metrics to be used with the minimization and maximization methods are:

Furthermore, the following functions are included which can be used as metric functions but are more useful for plotting purposes, for example in plot_cutpointr, or for defining new metric functions: tp, fp, tn, fn, tpr, fpr, tnr, fnr, false_omission_rate, false_discovery_rate, ppv, npv, precision, recall, sensitivity, and specificity.

The inputs to the arguments method and metric are functions so that user-defined functions can easily be supplied instead of the built-in ones.

Examples

library(tidyverse)
## ── Attaching packages ──────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.3
## ✓ tibble  3.0.1     ✓ dplyr   1.0.0
## ✓ tidyr   1.0.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## Warning: package 'ggplot2' was built under R version 3.6.2
## Warning: package 'tibble' was built under R version 3.6.2
## Warning: package 'dplyr' was built under R version 3.6.2
## ── Conflicts ─────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(cutpointr)
## Warning: package 'cutpointr' was built under R version 3.6.2
data("suicide")
head(suicide)
##   age gender dsi suicide
## 1  29 female   1      no
## 2  26   male   0      no
## 3  26 female   0      no
## 4  27 female   0      no
## 5  28 female   0      no
## 6  53   male   2      no
cp <- cutpointr(suicide, dsi, suicide, pos_class = "yes", neg_class = "no", direction = ">=",
                method = maximize_metric, metric = sum_sens_spec)
summary(cp)
## Method: maximize_metric 
## Predictor: dsi 
## Outcome: suicide 
## Direction: >= 
## 
##     AUC   n n_pos n_neg
##  0.9238 532    36   496
## 
##  optimal_cutpoint sum_sens_spec    acc sensitivity specificity tp fn fp  tn
##                 2        1.7518 0.8647      0.8889      0.8629 32  4 68 428
## 
## Predictor summary: 
##     Data Min.   5% 1st Qu. Median      Mean 3rd Qu.  95% Max.       SD NAs
##  Overall    0 0.00       0      0 0.9210526       1 5.00   11 1.852714   0
##       no    0 0.00       0      0 0.6330645       0 4.00   10 1.412225   0
##      yes    0 0.75       4      5 4.8888889       6 9.25   11 2.549821   0
plot(cp)

Using the oc_manual function the optimal cutpoint will not be determined based on, for example, a metric but is instead set manually using the cutpoint argument. This is useful for supplying and evaluating cutpoints that were found in the literature or in other external sources.

The oc_manual function could also be used to set the cutpoint to the sample mean using cutpoint = mean(data$x). However, this may introduce bias into the bootstrap validation procedure, since the actual mean of the population is not known and thus the mean to be used as the cutpoint should be automatically determined in every resample. To do so, the oc_mean and oc_median functions can be used.

set.seed(100)
opt_cut_manual <- cutpointr(suicide, dsi, suicide, method = oc_manual, 
                       cutpoint = mean(suicide$dsi), boot_runs = 100)
## Assuming the positive class is yes
## Assuming the positive class has higher x values
## Running bootstrap...
set.seed(100)
opt_cut_mean <- cutpointr(suicide, dsi, suicide, method = oc_mean, boot_runs = 100)
## Assuming the positive class is yes
## Assuming the positive class has higher x values
## Running bootstrap...
summary(opt_cut_manual)
## Method: oc_manual 
## Predictor: dsi 
## Outcome: suicide 
## Direction: >= 
## Nr. of bootstraps: 100 
## 
##     AUC   n n_pos n_neg
##  0.9238 532    36   496
## 
##  optimal_cutpoint sum_sens_spec    acc sensitivity specificity tp fn  fp  tn
##            0.9211        1.7025 0.7707      0.9444      0.7581 34  2 120 376
## 
## Predictor summary: 
##     Data Min.   5% 1st Qu. Median      Mean 3rd Qu.  95% Max.       SD NAs
##  Overall    0 0.00       0      0 0.9210526       1 5.00   11 1.852714   0
##       no    0 0.00       0      0 0.6330645       0 4.00   10 1.412225   0
##      yes    0 0.75       4      5 4.8888889       6 9.25   11 2.549821   0
## 
## Bootstrap summary: 
##           Variable Min.   5% 1st Qu. Median Mean 3rd Qu.  95% Max.   SD NAs
##   optimal_cutpoint 0.92 0.92    0.92   0.92 0.92    0.92 0.92 0.92 0.00   0
##              AUC_b 0.85 0.88    0.91   0.92 0.92    0.94 0.96 0.98 0.02   0
##            AUC_oob 0.83 0.86    0.90   0.93 0.93    0.95 0.97 0.99 0.03   0
##    sum_sens_spec_b 1.55 1.63    1.67   1.70 1.70    1.73 1.76 1.79 0.04   0
##  sum_sens_spec_oob 1.52 1.61    1.67   1.71 1.71    1.75 1.78 1.81 0.06   0
##              acc_b 0.73 0.74    0.76   0.77 0.77    0.78 0.80 0.81 0.02   0
##            acc_oob 0.71 0.73    0.75   0.77 0.77    0.78 0.82 0.84 0.03   0
##      sensitivity_b 0.82 0.88    0.92   0.94 0.94    0.97 1.00 1.00 0.04   0
##    sensitivity_oob 0.78 0.86    0.92   0.94 0.95    1.00 1.00 1.00 0.05   0
##      specificity_b 0.71 0.72    0.74   0.76 0.76    0.77 0.79 0.80 0.02   0
##    specificity_oob 0.69 0.71    0.74   0.76 0.76    0.77 0.80 0.84 0.03   0
##     cohens_kappa_b 0.14 0.21    0.24   0.27 0.27    0.30 0.35 0.37 0.04   0
##   cohens_kappa_oob 0.16 0.19    0.25   0.28 0.28    0.31 0.38 0.47 0.06   0
opt_cut <- cutpointr(suicide, dsi, suicide, direction = ">=", pos_class = "yes",
                     neg_class = "no", method = maximize_metric, metric = youden)

summary(opt_cut)
## Method: maximize_metric 
## Predictor: dsi 
## Outcome: suicide 
## Direction: >= 
## 
##     AUC   n n_pos n_neg
##  0.9238 532    36   496
## 
##  optimal_cutpoint youden    acc sensitivity specificity tp fn fp  tn
##                 2 0.7518 0.8647      0.8889      0.8629 32  4 68 428
## 
## Predictor summary: 
##     Data Min.   5% 1st Qu. Median      Mean 3rd Qu.  95% Max.       SD NAs
##  Overall    0 0.00       0      0 0.9210526       1 5.00   11 1.852714   0
##       no    0 0.00       0      0 0.6330645       0 4.00   10 1.412225   0
##      yes    0 0.75       4      5 4.8888889       6 9.25   11 2.549821   0
opt_cut <- cutpointr(suicide, dsi, suicide, direction = ">=", pos_class = "yes",
                     neg_class = "no", method = minimize_metric, metric = roc01)

summary(opt_cut)
## Method: minimize_metric 
## Predictor: dsi 
## Outcome: suicide 
## Direction: >= 
## 
##     AUC   n n_pos n_neg
##  0.9238 532    36   496
## 
##  optimal_cutpoint  roc01    acc sensitivity specificity tp fn fp  tn
##                 2 0.1765 0.8647      0.8889      0.8629 32  4 68 428
## 
## Predictor summary: 
##     Data Min.   5% 1st Qu. Median      Mean 3rd Qu.  95% Max.       SD NAs
##  Overall    0 0.00       0      0 0.9210526       1 5.00   11 1.852714   0
##       no    0 0.00       0      0 0.6330645       0 4.00   10 1.412225   0
##      yes    0 0.75       4      5 4.8888889       6 9.25   11 2.549821   0

Since false negatives (a suicide attempt was not anticipated) can be regarded as much more severe than false positives we can set the cost of a false negative cost_fn for example to ten times the cost of a false positive.

opt_cut <- cutpointr(suicide, dsi, suicide, gender, method = minimize_metric,
                     metric = misclassification_cost, cost_fp = 1, cost_fn = 10)
## Assuming the positive class is yes
## Assuming the positive class has higher x values
summary(opt_cut)
## Method: minimize_metric 
## Predictor: dsi 
## Outcome: suicide 
## Direction: >= 
## Subgroups: female, male 
## 
## Subgroup: female 
## -------------------------------------------------------------------------------- 
##     AUC   n n_pos n_neg
##  0.9446 392    27   365
## 
##  optimal_cutpoint misclassification_cost    acc sensitivity specificity tp fn
##                 2                     63 0.8852      0.9259      0.8822 25  2
##  fp  tn
##  43 322
## 
## Predictor summary: 
##     Data Min.  5% 1st Qu. Median      Mean 3rd Qu. 95% Max.       SD NAs
##  Overall    0 0.0       0      0 0.8392857       1   5   10 1.745198   0
##       no    0 0.0       0      0 0.5479452       0   4   10 1.318102   0
##      yes    0 1.3       4      5 4.7777778       6   7    9 2.044379   0
## 
## Subgroup: male 
## -------------------------------------------------------------------------------- 
##     AUC   n n_pos n_neg
##  0.8617 140     9   131
## 
##  optimal_cutpoint misclassification_cost    acc sensitivity specificity tp fn
##                 3                     40 0.8429      0.7778      0.8473  7  2
##  fp  tn
##  20 111
## 
## Predictor summary: 
##     Data Min.  5% 1st Qu. Median     Mean 3rd Qu.  95% Max.       SD NAs
##  Overall    0 0.0       0      0 1.150000       1  6.0   11 2.115122   0
##       no    0 0.0       0      0 0.870229       1  5.0    6 1.628576   0
##      yes    0 0.4       3      4 5.222222       8 10.6   11 3.833333   0
plot_metric(opt_cut)

Plotting

cutpointr includes several convenience functions for plotting data from a cutpointr object. These include:

plot_precision_recall(opt_cut)

plot_sensitivity_specificity(opt_cut)

plot_roc(opt_cut)

plot_cutpointr(opt_cut, xvar = cutpoints, yvar = youden, conf_lvl = 0.9)

plot_cutpointr(opt_cut, xvar = fpr, yvar = tpr, aspect_ratio = 1, conf_lvl = 0)

Adding metrics to the result of cutpointr() or roc()

By default, the output of cutpointr includes the optimized metric and several other metrics. The add_metric function adds further metrics.

cutpointr(suicide, dsi, suicide, gender, metric = youden, silent = TRUE) %>% 
    add_metric(list(tpr,precision,recall)) %>% 
    select(subgroup, optimal_cutpoint, youden,tpr,precision,recall)
## # A tibble: 2 x 6
##   subgroup optimal_cutpoint   youden      tpr precision   recall
##   <chr>               <dbl>    <dbl>    <dbl>     <dbl>    <dbl>
## 1 female                  2 0.808118 0.925926  0.367647 0.925926
## 2 male                    3 0.625106 0.777778  0.259259 0.777778
roc(data = suicide, x = dsi, class = suicide, pos_class = "yes",
    neg_class = "no", direction = ">=") %>% 
  add_metric(list(cohens_kappa, F1_score,roc01,youden,precision,recall)) %>% 
  select(x.sorted, tp, fp, tn, fn, cohens_kappa, F1_score,roc01,youden,precision,recall)
## # A tibble: 13 x 11
##    x.sorted    tp    fp    tn    fn cohens_kappa F1_score roc01 youden precision
##       <dbl> <dbl> <dbl> <int> <int>        <dbl>    <dbl> <dbl>  <dbl>     <dbl>
##  1      Inf     0     0   496    36       0        0      1     0       NaN     
##  2       11     1     0   496    35       0.0506   0.0541 0.972 0.0278    1     
##  3       10     2     1   495    34       0.0931   0.103  0.944 0.0535    0.667 
##  4        9     3     1   495    33       0.138    0.15   0.917 0.0813    0.75  
##  5        8     4     1   495    32       0.182    0.195  0.889 0.109     0.8   
##  6        7     7     1   495    29       0.301    0.318  0.806 0.192     0.875 
##  7        6    16     6   490    20       0.527    0.552  0.556 0.432     0.727 
##  8        5    20    16   480    16       0.523    0.556  0.446 0.523     0.556 
##  9        4    28    44   452     8       0.471    0.519  0.239 0.689     0.389 
## 10        3    29    56   440     7       0.425    0.479  0.225 0.693     0.341 
## 11        2    32    68   428     4       0.412    0.471  0.176 0.752     0.32  
## 12        1    34   120   376     2       0.279    0.358  0.248 0.703     0.221 
## 13        0    36   496     0     0       0        0.127  1     0         0.0677
## # … with 1 more variable: recall <dbl>

Customise the plot

opt_cut <- cutpointr(suicide, dsi, suicide, metric = youden, silent = TRUE) %>% 
    add_metric(list(tpr,fpr,precision,recall)) 
plot_roc(opt_cut) +
  labs(title=paste0("ROC Curve - Optimal: ",opt_cut$optimal_cutpoint),subtitle = "based on Youden's Index") +
  geom_abline(intercept = 0,slope = 1,linetype="dashed") +
  geom_segment(aes(x=opt_cut$fpr,y=opt_cut$fpr,
                   xend=opt_cut$fpr,yend=opt_cut$tpr),linetype="dotted",color="red")

plot_precision_recall(opt_cut) +
  geom_segment(aes(x=opt_cut$recall,y=0,
                   xend=opt_cut$recall,
                   yend=opt_cut$precision),linetype="dotted",color="red") +
  geom_segment(aes(x=0,y=opt_cut$precision,
                   xend = opt_cut$recall,
                   yend=opt_cut$precision),linetype="dotted",color='red') +
  annotate(geom = "text",
           x=opt_cut$recall-0.05,
           y=opt_cut$precision+0.2,
           label=paste0("precision: ",round(opt_cut$precision,2),
                        "\nrecall: ",round(opt_cut$recall,2),
                        "\n threshold: ",round(opt_cut$optimal_cutpoint,2)))

Compare to pROC

library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:cutpointr':
## 
##     auc, roc
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
r <- suicide %>% pROC::roc(suicide,dsi)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
coords(r, "best")
## Warning in coords.roc(r, "best"): The 'transpose' argument to FALSE by
## default since pROC 1.16. Set transpose = TRUE explicitly to revert to
## the previous behavior, or transpose = TRUE to silence this warning. Type
## help(coords_transpose) for additional information.
##   threshold specificity sensitivity
## 1       1.5   0.8629032   0.8888889
coords(r, x="best", input="threshold", best.method="youden", transpose = F)
##   threshold specificity sensitivity
## 1       1.5   0.8629032   0.8888889
suicide %>% pROC::roc(suicide,dsi) %>% 
  coords(ret = "all", transpose = FALSE) %>%
  select(threshold,precision, recall,tpr,fpr,youden,closest.topleft)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
##    threshold  precision     recall        tpr         fpr   youden
## 1       -Inf 0.06766917 1.00000000 1.00000000 1.000000000 1.000000
## 2        0.5 0.22077922 0.94444444 0.94444444 0.241935484 1.702509
## 3        1.5 0.32000000 0.88888889 0.88888889 0.137096774 1.751792
## 4        2.5 0.34117647 0.80555556 0.80555556 0.112903226 1.692652
## 5        3.5 0.38888889 0.77777778 0.77777778 0.088709677 1.689068
## 6        4.5 0.55555556 0.55555556 0.55555556 0.032258065 1.523297
## 7        5.5 0.72727273 0.44444444 0.44444444 0.012096774 1.432348
## 8        6.5 0.87500000 0.19444444 0.19444444 0.002016129 1.192428
## 9        7.5 0.80000000 0.11111111 0.11111111 0.002016129 1.109095
## 10       8.5 0.75000000 0.08333333 0.08333333 0.002016129 1.081317
## 11       9.5 0.66666667 0.05555556 0.05555556 0.002016129 1.053539
## 12      10.5 1.00000000 0.02777778 0.02777778 0.000000000 1.027778
## 13       Inf        NaN 0.00000000 0.00000000 0.000000000 1.000000
##    closest.topleft
## 1       1.00000000
## 2       0.06161920
## 3       0.03114120
## 4       0.05055578
## 5       0.05725212
## 6       0.19857145
## 7       0.30878831
## 8       0.64892382
## 9       0.79012752
## 10      0.84028184
## 11      0.89197937
## 12      0.94521605
## 13      1.00000000
gl <- ggroc(r, legacy.axes = TRUE)
gl

gl + xlab("FPR") + ylab("TPR") + 
    geom_segment(aes(x = 0, xend = 1, y = 0, yend = 1), color="darkgrey", linetype="dashed")

plot(r, print.thres="best", print.thres.best.method="youden") 

# You can use coords to plot for instance a sensitivity + specificity vs. cut-off diagram
plot(specificity + sensitivity ~ threshold, 
     coords(r, "all", transpose = FALSE), 
     type = "l", log="x", 
     subset = is.finite(threshold))

# Plot the Precision-Recall curve
plot(precision ~ recall, 
     coords(r, "all", ret = c("recall", "precision"), transpose = FALSE),
     type="l", ylim = c(0, 1))

# Alternatively plot the curve with TPR and FPR instead of SE/SP 
# (identical curve, only the axis change)
plot(tpr ~ fpr, 
     coords(r, "all", ret = c("tpr", "fpr"), transpose = FALSE),
     type="l")

# Print the thresholds on the plot
# Plot some thresholds, add them to the same plot
plot(r, print.thres="best", print.thres.best.method="youden")
plot(r, print.thres="best", print.thres.best.method="closest.topleft",
                                   add = TRUE)

Reference