Finding the optimal cutoff point for a binary classification model is a common task. cutpointr is an R package for tidy calculation of “optimal” cutpoints. It supports several methods for calculating cutpoints and includes several metrics that can be maximized or minimized by selecting a cutpoint. Some of these methods are designed to be more robust than simple empirical optimization of a metric. Additionally, cutpointr can automatically bootstrap the variability of the optimal cutpoints and return out-of-bag estimates of various performance metrics.
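As a quick preview of the interface covered in the rest of this post, a minimal call looks roughly like this (a sketch only; the suicide example data and all arguments are introduced in detail below):
library(cutpointr)
data("suicide")  # example dataset shipped with cutpointr

# Sketch: maximize a metric over all candidate cutpoints and bootstrap the
# cutpoint's variability (boot_runs is optional)
cp <- cutpointr(suicide, dsi, suicide,
                method = maximize_metric, metric = sum_sens_spec,
                boot_runs = 100)
summary(cp)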
The included methods for calculating cutpoints are:
maximize_metric
: Maximize the metric function
minimize_metric
: Minimize the metric function
maximize_loess_metric
: Maximize the metric function after LOESS smoothing
minimize_loess_metric
: Minimize the metric function after LOESS smoothing
maximize_spline_metric
: Maximize the metric function after spline smoothing
minimize_spline_metric
: Minimize the metric function after spline smoothing
maximize_gam_metric
: Maximize the metric function after smoothing via Generalized Additive Models
minimize_gam_metric
: Minimize the metric function after smoothing via Generalized Additive Models
maximize_boot_metric
: Bootstrap the optimal cutpoint when maximizing a metric
minimize_boot_metric
: Bootstrap the optimal cutpoint when minimizing a metric
oc_manual
: Specify the cutoff value manually
oc_mean
: Use the sample mean as the “optimal” cutpoint
oc_median
: Use the sample median as the “optimal” cutpoint
oc_youden_kernel
: Maximize the Youden-Index after kernel smoothing the distributions of the two classes
oc_youden_normal
: Maximize the Youden-Index parametrically assuming normally distributed data in both classes
The included metrics to be used with the minimization and maximization methods are:
accuracy
: Fraction correctly classified
abs_d_sens_spec
: The absolute difference of sensitivity and specificity
abs_d_ppv_npv
: The absolute difference between positive predictive value (PPV) and negative predictive value (NPV)
roc01
: Distance to the point (0,1) on ROC space
cohens_kappa
: Cohen’s Kappa
sum_sens_spec
: sensitivity + specificity
sum_ppv_npv
: The sum of positive predictive value (PPV) and negative predictive value (NPV)
prod_sens_spec
: sensitivity * specificity
prod_ppv_npv
: The product of positive predictive value (PPV) and negative predictive value (NPV)
youden
: Youden- or J-Index = sensitivity + specificity - 1
odds_ratio
: (Diagnostic) odds ratio
risk_ratio
: Risk ratio (relative risk)
p_chisquared
: The p-value of a chi-squared test on the confusion matrix
misclassification_cost
: The sum of the misclassification cost of false positives and false negatives. Additional arguments: cost_fp, cost_fn
total_utility
: The total utility of true / false positives / negatives. Additional arguments: utility_tp, utility_tn, cost_fp, cost_fn (see the sketch after this list)
F1_score
: The F1-score (2 * TP) / (2 * TP + FP + FN)
metric_constrain
: Maximize a selected metric given a minimal value of another selected metric
sens_constrain
: Maximize sensitivity given a minimal value of specificity
spec_constrain
: Maximize specificity given a minimal value of sensitivity
acc_constrain
: Maximize accuracy given a minimal value of sensitivity
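Metric-specific arguments such as cost_fp, cost_fn, utility_tp, and utility_tn are simply passed on by cutpointr() to the metric function. A minimal sketch with arbitrary placeholder utilities and costs, using the suicide example data that ships with cutpointr and is introduced below:
library(cutpointr)
data("suicide")

# Sketch: the utility and cost values are arbitrary placeholders; cutpointr()
# forwards them to the total_utility metric function.
cutpointr(suicide, dsi, suicide, method = maximize_metric,
          metric = total_utility,
          utility_tp = 10, utility_tn = 1, cost_fp = 1, cost_fn = 10)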
Furthermore, the following functions are included which can be used as metric functions but are more useful for plotting purposes, for example in plot_cutpointr, or for defining new metric functions: tp, fp, tn, fn, tpr, fpr, tnr, fnr, false_omission_rate, false_discovery_rate, ppv, npv, precision, recall, sensitivity, and specificity.
The inputs to the arguments method and metric are functions, so that user-defined functions can easily be supplied instead of the built-in ones.
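For example, a user-defined metric can be written as a plain R function. The sketch below assumes the convention used by the built-in metrics: the function receives the vectors tp, fp, tn, fn (one element per candidate cutpoint) plus ..., and returns a numeric vector or a one-column matrix whose column name is used as the metric's name in the output.
library(cutpointr)
data("suicide")

# Sketch of a user-defined metric: balanced accuracy = (sensitivity + specificity) / 2
balanced_accuracy <- function(tp, fp, tn, fn, ...) {
  bacc <- (tp / (tp + fn) + tn / (tn + fp)) / 2
  bacc <- matrix(bacc, ncol = 1)
  colnames(bacc) <- "balanced_accuracy"
  bacc
}

cutpointr(suicide, dsi, suicide, method = maximize_metric,
          metric = balanced_accuracy)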
library(tidyverse)
## ── Attaching packages ──────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.3
## ✓ tibble 3.0.1 ✓ dplyr 1.0.0
## ✓ tidyr 1.0.0 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## Warning: package 'ggplot2' was built under R version 3.6.2
## Warning: package 'tibble' was built under R version 3.6.2
## Warning: package 'dplyr' was built under R version 3.6.2
## ── Conflicts ─────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(cutpointr)
## Warning: package 'cutpointr' was built under R version 3.6.2
data("suicide")
head(suicide)
## age gender dsi suicide
## 1 29 female 1 no
## 2 26 male 0 no
## 3 26 female 0 no
## 4 27 female 0 no
## 5 28 female 0 no
## 6 53 male 2 no
cp <- cutpointr(suicide, dsi, suicide, pos_class = "yes", neg_class = "no", direction = ">=",
method = maximize_metric, metric = sum_sens_spec)
summary(cp)
## Method: maximize_metric
## Predictor: dsi
## Outcome: suicide
## Direction: >=
##
## AUC n n_pos n_neg
## 0.9238 532 36 496
##
## optimal_cutpoint sum_sens_spec acc sensitivity specificity tp fn fp tn
## 2 1.7518 0.8647 0.8889 0.8629 32 4 68 428
##
## Predictor summary:
## Data Min. 5% 1st Qu. Median Mean 3rd Qu. 95% Max. SD NAs
## Overall 0 0.00 0 0 0.9210526 1 5.00 11 1.852714 0
## no 0 0.00 0 0 0.6330645 0 4.00 10 1.412225 0
## yes 0 0.75 4 5 4.8888889 6 9.25 11 2.549821 0
plot(cp)
With the oc_manual function, the optimal cutpoint is not determined by, for example, maximizing a metric, but is instead set manually via the cutpoint argument. This is useful for supplying and evaluating cutpoints that were found in the literature or in other external sources.
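For instance, a cutoff reported in the literature could be evaluated like this (a sketch; the value 4 is a hypothetical placeholder, not a published recommendation):
library(cutpointr)
data("suicide")

# Sketch: evaluate a fixed, externally chosen cutoff (4 is a hypothetical value)
cutpointr(suicide, dsi, suicide, method = oc_manual, cutpoint = 4,
          pos_class = "yes", neg_class = "no", direction = ">=")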
The oc_manual function could also be used to set the cutpoint to the sample mean using cutpoint = mean(data$x). However, this may introduce bias into the bootstrap validation procedure, since the actual mean of the population is not known and thus the mean to be used as the cutpoint should be automatically determined in every resample. To do so, the oc_mean and oc_median functions can be used.
set.seed(100)
opt_cut_manual <- cutpointr(suicide, dsi, suicide, method = oc_manual,
cutpoint = mean(suicide$dsi), boot_runs = 100)
## Assuming the positive class is yes
## Assuming the positive class has higher x values
## Running bootstrap...
set.seed(100)
opt_cut_mean <- cutpointr(suicide, dsi, suicide, method = oc_mean, boot_runs = 100)
## Assuming the positive class is yes
## Assuming the positive class has higher x values
## Running bootstrap...
summary(opt_cut_manual)
## Method: oc_manual
## Predictor: dsi
## Outcome: suicide
## Direction: >=
## Nr. of bootstraps: 100
##
## AUC n n_pos n_neg
## 0.9238 532 36 496
##
## optimal_cutpoint sum_sens_spec acc sensitivity specificity tp fn fp tn
## 0.9211 1.7025 0.7707 0.9444 0.7581 34 2 120 376
##
## Predictor summary:
## Data Min. 5% 1st Qu. Median Mean 3rd Qu. 95% Max. SD NAs
## Overall 0 0.00 0 0 0.9210526 1 5.00 11 1.852714 0
## no 0 0.00 0 0 0.6330645 0 4.00 10 1.412225 0
## yes 0 0.75 4 5 4.8888889 6 9.25 11 2.549821 0
##
## Bootstrap summary:
## Variable Min. 5% 1st Qu. Median Mean 3rd Qu. 95% Max. SD NAs
## optimal_cutpoint 0.92 0.92 0.92 0.92 0.92 0.92 0.92 0.92 0.00 0
## AUC_b 0.85 0.88 0.91 0.92 0.92 0.94 0.96 0.98 0.02 0
## AUC_oob 0.83 0.86 0.90 0.93 0.93 0.95 0.97 0.99 0.03 0
## sum_sens_spec_b 1.55 1.63 1.67 1.70 1.70 1.73 1.76 1.79 0.04 0
## sum_sens_spec_oob 1.52 1.61 1.67 1.71 1.71 1.75 1.78 1.81 0.06 0
## acc_b 0.73 0.74 0.76 0.77 0.77 0.78 0.80 0.81 0.02 0
## acc_oob 0.71 0.73 0.75 0.77 0.77 0.78 0.82 0.84 0.03 0
## sensitivity_b 0.82 0.88 0.92 0.94 0.94 0.97 1.00 1.00 0.04 0
## sensitivity_oob 0.78 0.86 0.92 0.94 0.95 1.00 1.00 1.00 0.05 0
## specificity_b 0.71 0.72 0.74 0.76 0.76 0.77 0.79 0.80 0.02 0
## specificity_oob 0.69 0.71 0.74 0.76 0.76 0.77 0.80 0.84 0.03 0
## cohens_kappa_b 0.14 0.21 0.24 0.27 0.27 0.30 0.35 0.37 0.04 0
## cohens_kappa_oob 0.16 0.19 0.25 0.28 0.28 0.31 0.38 0.47 0.06 0
opt_cut <- cutpointr(suicide, dsi, suicide, direction = ">=", pos_class = "yes",
neg_class = "no", method = maximize_metric, metric = youden)
summary(opt_cut)
## Method: maximize_metric
## Predictor: dsi
## Outcome: suicide
## Direction: >=
##
## AUC n n_pos n_neg
## 0.9238 532 36 496
##
## optimal_cutpoint youden acc sensitivity specificity tp fn fp tn
## 2 0.7518 0.8647 0.8889 0.8629 32 4 68 428
##
## Predictor summary:
## Data Min. 5% 1st Qu. Median Mean 3rd Qu. 95% Max. SD NAs
## Overall 0 0.00 0 0 0.9210526 1 5.00 11 1.852714 0
## no 0 0.00 0 0 0.6330645 0 4.00 10 1.412225 0
## yes 0 0.75 4 5 4.8888889 6 9.25 11 2.549821 0
opt_cut <- cutpointr(suicide, dsi, suicide, direction = ">=", pos_class = "yes",
neg_class = "no", method = minimize_metric, metric = roc01)
summary(opt_cut)
## Method: minimize_metric
## Predictor: dsi
## Outcome: suicide
## Direction: >=
##
## AUC n n_pos n_neg
## 0.9238 532 36 496
##
## optimal_cutpoint roc01 acc sensitivity specificity tp fn fp tn
## 2 0.1765 0.8647 0.8889 0.8629 32 4 68 428
##
## Predictor summary:
## Data Min. 5% 1st Qu. Median Mean 3rd Qu. 95% Max. SD NAs
## Overall 0 0.00 0 0 0.9210526 1 5.00 11 1.852714 0
## no 0 0.00 0 0 0.6330645 0 4.00 10 1.412225 0
## yes 0 0.75 4 5 4.8888889 6 9.25 11 2.549821 0
Since false negatives (a suicide attempt that was not anticipated) can be regarded as much more severe than false positives, we can set the cost of a false negative (cost_fn) to, for example, ten times the cost of a false positive.
opt_cut <- cutpointr(suicide, dsi, suicide, gender, method = minimize_metric,
metric = misclassification_cost, cost_fp = 1, cost_fn = 10)
## Assuming the positive class is yes
## Assuming the positive class has higher x values
summary(opt_cut)
## Method: minimize_metric
## Predictor: dsi
## Outcome: suicide
## Direction: >=
## Subgroups: female, male
##
## Subgroup: female
## --------------------------------------------------------------------------------
## AUC n n_pos n_neg
## 0.9446 392 27 365
##
## optimal_cutpoint misclassification_cost acc sensitivity specificity tp fn
## 2 63 0.8852 0.9259 0.8822 25 2
## fp tn
## 43 322
##
## Predictor summary:
## Data Min. 5% 1st Qu. Median Mean 3rd Qu. 95% Max. SD NAs
## Overall 0 0.0 0 0 0.8392857 1 5 10 1.745198 0
## no 0 0.0 0 0 0.5479452 0 4 10 1.318102 0
## yes 0 1.3 4 5 4.7777778 6 7 9 2.044379 0
##
## Subgroup: male
## --------------------------------------------------------------------------------
## AUC n n_pos n_neg
## 0.8617 140 9 131
##
## optimal_cutpoint misclassification_cost acc sensitivity specificity tp fn
## 3 40 0.8429 0.7778 0.8473 7 2
## fp tn
## 20 111
##
## Predictor summary:
## Data Min. 5% 1st Qu. Median Mean 3rd Qu. 95% Max. SD NAs
## Overall 0 0.0 0 0 1.150000 1 6.0 11 2.115122 0
## no 0 0.0 0 0 0.870229 1 5.0 6 1.628576 0
## yes 0 0.4 3 4 5.222222 8 10.6 11 3.833333 0
plot_metric(opt_cut)
cutpointr includes several convenience functions for plotting data from a cutpointr object. These include:
plot_cutpointr
: General purpose plotting function for cutpointr or roc_cutpointr objects
plot_cut_boot
: Plot the bootstrapped distribution of optimal cutpoints
plot_metric
: If maximize_metric or minimize_metric was used, this function plots all possible cutoffs on the x-axis vs. the respective metric values on the y-axis. If bootstrapping was run, a confidence interval based on the bootstrapped distribution of metric values at each cutpoint can be displayed. To display no confidence interval, set conf_lvl = 0.
plot_metric_boot
: Plot the distribution of out-of-bag metric values
plot_precision_recall
: Plot the precision recall curve
plot_sensitivity_specificity
: Plot all cutpoints vs. sensitivity and specificity
plot_roc
: Plot the ROC curve
plot_x
: Plot the distribution of the predictor variable
plot_precision_recall(opt_cut)
plot_sensitivity_specificity(opt_cut)
plot_roc(opt_cut)
plot_cutpointr(opt_cut, xvar = cutpoints, yvar = youden, conf_lvl = 0.9)
plot_cutpointr(opt_cut, xvar = fpr, yvar = tpr, aspect_ratio = 1, conf_lvl = 0)
By default, the output of cutpointr includes the optimized metric and several other metrics. The add_metric function adds further metrics.
cutpointr(suicide, dsi, suicide, gender, metric = youden, silent = TRUE) %>%
  add_metric(list(tpr, precision, recall)) %>%
  select(subgroup, optimal_cutpoint, youden, tpr, precision, recall)
## # A tibble: 2 x 6
## subgroup optimal_cutpoint youden tpr precision recall
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 female 2 0.808118 0.925926 0.367647 0.925926
## 2 male 3 0.625106 0.777778 0.259259 0.777778
roc(data = suicide, x = dsi, class = suicide, pos_class = "yes",
neg_class = "no", direction = ">=") %>%
  add_metric(list(cohens_kappa, F1_score, roc01, youden, precision, recall)) %>%
  select(x.sorted, tp, fp, tn, fn, cohens_kappa, F1_score, roc01, youden, precision, recall)
## # A tibble: 13 x 11
## x.sorted tp fp tn fn cohens_kappa F1_score roc01 youden precision
## <dbl> <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Inf 0 0 496 36 0 0 1 0 NaN
## 2 11 1 0 496 35 0.0506 0.0541 0.972 0.0278 1
## 3 10 2 1 495 34 0.0931 0.103 0.944 0.0535 0.667
## 4 9 3 1 495 33 0.138 0.15 0.917 0.0813 0.75
## 5 8 4 1 495 32 0.182 0.195 0.889 0.109 0.8
## 6 7 7 1 495 29 0.301 0.318 0.806 0.192 0.875
## 7 6 16 6 490 20 0.527 0.552 0.556 0.432 0.727
## 8 5 20 16 480 16 0.523 0.556 0.446 0.523 0.556
## 9 4 28 44 452 8 0.471 0.519 0.239 0.689 0.389
## 10 3 29 56 440 7 0.425 0.479 0.225 0.693 0.341
## 11 2 32 68 428 4 0.412 0.471 0.176 0.752 0.32
## 12 1 34 120 376 2 0.279 0.358 0.248 0.703 0.221
## 13 0 36 496 0 0 0 0.127 1 0 0.0677
## # … with 1 more variable: recall <dbl>
opt_cut <- cutpointr(suicide, dsi, suicide, metric = youden, silent = TRUE) %>%
  add_metric(list(tpr, fpr, precision, recall))
plot_roc(opt_cut) +
  labs(title = paste0("ROC Curve - Optimal: ", opt_cut$optimal_cutpoint),
       subtitle = "based on Youden's Index") +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  geom_segment(aes(x = opt_cut$fpr, y = opt_cut$fpr,
                   xend = opt_cut$fpr, yend = opt_cut$tpr),
               linetype = "dotted", color = "red")
plot_precision_recall(opt_cut) +
  geom_segment(aes(x = opt_cut$recall, y = 0,
                   xend = opt_cut$recall, yend = opt_cut$precision),
               linetype = "dotted", color = "red") +
  geom_segment(aes(x = 0, y = opt_cut$precision,
                   xend = opt_cut$recall, yend = opt_cut$precision),
               linetype = "dotted", color = "red") +
  annotate(geom = "text",
           x = opt_cut$recall - 0.05,
           y = opt_cut$precision + 0.2,
           label = paste0("precision: ", round(opt_cut$precision, 2),
                          "\nrecall: ", round(opt_cut$recall, 2),
                          "\n threshold: ", round(opt_cut$optimal_cutpoint, 2)))
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:cutpointr':
##
## auc, roc
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
r <- suicide %>% pROC::roc(suicide,dsi)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
coords(r, "best")
## Warning in coords.roc(r, "best"): The 'transpose' argument to FALSE by
## default since pROC 1.16. Set transpose = TRUE explicitly to revert to
## the previous behavior, or transpose = TRUE to silence this warning. Type
## help(coords_transpose) for additional information.
## threshold specificity sensitivity
## 1 1.5 0.8629032 0.8888889
coords(r, x="best", input="threshold", best.method="youden", transpose = F)
## threshold specificity sensitivity
## 1 1.5 0.8629032 0.8888889
suicide %>% pROC::roc(suicide, dsi) %>%
  coords(ret = "all", transpose = FALSE) %>%
  select(threshold, precision, recall, tpr, fpr, youden, closest.topleft)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
## threshold precision recall tpr fpr youden
## 1 -Inf 0.06766917 1.00000000 1.00000000 1.000000000 1.000000
## 2 0.5 0.22077922 0.94444444 0.94444444 0.241935484 1.702509
## 3 1.5 0.32000000 0.88888889 0.88888889 0.137096774 1.751792
## 4 2.5 0.34117647 0.80555556 0.80555556 0.112903226 1.692652
## 5 3.5 0.38888889 0.77777778 0.77777778 0.088709677 1.689068
## 6 4.5 0.55555556 0.55555556 0.55555556 0.032258065 1.523297
## 7 5.5 0.72727273 0.44444444 0.44444444 0.012096774 1.432348
## 8 6.5 0.87500000 0.19444444 0.19444444 0.002016129 1.192428
## 9 7.5 0.80000000 0.11111111 0.11111111 0.002016129 1.109095
## 10 8.5 0.75000000 0.08333333 0.08333333 0.002016129 1.081317
## 11 9.5 0.66666667 0.05555556 0.05555556 0.002016129 1.053539
## 12 10.5 1.00000000 0.02777778 0.02777778 0.000000000 1.027778
## 13 Inf NaN 0.00000000 0.00000000 0.000000000 1.000000
## closest.topleft
## 1 1.00000000
## 2 0.06161920
## 3 0.03114120
## 4 0.05055578
## 5 0.05725212
## 6 0.19857145
## 7 0.30878831
## 8 0.64892382
## 9 0.79012752
## 10 0.84028184
## 11 0.89197937
## 12 0.94521605
## 13 1.00000000
gl <- ggroc(r, legacy.axes = TRUE)
gl
gl + xlab("FPR") + ylab("TPR") +
geom_segment(aes(x = 0, xend = 1, y = 0, yend = 1), color="darkgrey", linetype="dashed")
plot(r, print.thres="best", print.thres.best.method="youden")
# You can use coords to plot for instance a sensitivity + specificity vs. cut-off diagram
plot(specificity + sensitivity ~ threshold,
coords(r, "all", transpose = FALSE),
type = "l", log="x",
subset = is.finite(threshold))
# Plot the Precision-Recall curve
plot(precision ~ recall,
coords(r, "all", ret = c("recall", "precision"), transpose = FALSE),
type="l", ylim = c(0, 1))
# Alternatively plot the curve with TPR and FPR instead of SE/SP
# (identical curve, only the axis change)
plot(tpr ~ fpr,
coords(r, "all", ret = c("tpr", "fpr"), transpose = FALSE),
type="l")
# Print the thresholds on the plot
# Plot some thresholds, add them to the same plot
plot(r, print.thres="best", print.thres.best.method="youden")
plot(r, print.thres="best", print.thres.best.method="closest.topleft",
add = TRUE)