Logistic regression is a regression method where the response variable is a binary outcome. The article uses a fictitious data set to demonstrate:
Choosing the threshold is the most subjective step.
set.seed(123)
numObs <- 0.5e6
age <- rpois(numObs, 40)
nbr_prior_visits <- rpois(numObs, 1) ## nbr of visits in the last 12 months
logodds <- -100 + 2*age + 4*nbr_prior_visits
#install.packages("boot")
readmission <- rbinom(numObs, 1, boot::inv.logit(logodds)) ## if patient is re-admitted in next 3 months
df <- data.frame(age=age, nbr_prior_visits=nbr_prior_visits, readmission=readmission)
prop.table(table(df$readmission))
##
## 0 1
## 0.881846 0.118154
summary(df)
## age nbr_prior_visits readmission
## Min. :14.00 Min. :0.0000 Min. :0.0000
## 1st Qu.:36.00 1st Qu.:0.0000 1st Qu.:0.0000
## Median :40.00 Median :1.0000 Median :0.0000
## Mean :40.01 Mean :0.9986 Mean :0.1182
## 3rd Qu.:44.00 3rd Qu.:2.0000 3rd Qu.:0.0000
## Max. :73.00 Max. :9.0000 Max. :1.0000
#install.packages("caTools")
trainIndex <- caTools::sample.split(df$readmission, SplitRatio = 0.7)
train <- df[trainIndex,]
test <- df[!trainIndex,]
bad.mod.1 <- glm(readmission ~ age, data=train, family=binomial)
bad.mod.1.preds <- predict(bad.mod.1, newdata=test, type="response") ## test data
bad.mod.2 <- glm(readmission ~ nbr_prior_visits, data=train, family=binomial)
bad.mod.2.preds <- predict(bad.mod.2, newdata=test, type="response") ## test data
perfect.mod <- glm(readmission ~ ., data=train, family=binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
perfect.mod.preds <- predict(perfect.mod, newdata=test, type="response") ## test data
allMods <- caTools::colAUC(data.frame(ageModel=bad.mod.1.preds, priorVisitsModel=bad.mod.2.preds,
AgeAndPriorVisitsModel=perfect.mod.preds), test$readmission, plotROC = TRUE)
pred <- ROCR::prediction(perfect.mod.preds, test$readmission)
perf <- ROCR::performance(pred,"tpr","fpr")
cutoffs <- data.frame(cut=perf@alpha.values[[1]], fpr=perf@x.values[[1]],
tpr=perf@y.values[[1]])
cutoffs <- cutoffs[order(cutoffs$tpr, decreasing=TRUE),]
head(cutoffs[cutoffs$tpr <= 0.95,])
## cut fpr tpr
## 106 0.4921926 0.017070239 0.9435761
## 105 0.4936862 0.012957657 0.9143486
## 104 0.4951800 0.007287737 0.8719743
## 103 0.8773607 0.003583389 0.8416182
## 102 0.8780023 0.003583389 0.8414490
## 101 0.8786409 0.003560710 0.8396998
Setting a threshold at 0.4921926 means that the hospital would enroll everyone with a predicted probability that exceeds 0.4921926 into the prevention program. Inside the pool of enrollees are true positive patients and true negative patients. This threshold scheme would capture 94.3576144% of the entire universe of true positives. However, this scheme would also capture 1.7070239% of the entire universe of true negatives. The ROC curve demonstrates that, inevitably, there will be enrollees that do not need the prevention program.