Summary

Logistic regression is a regression method where the response variable is a binary outcome. The article uses a fictitious data set to demonstrate:

  1. Fitting several logistic regression models
  2. Choosing the best model using the Area Under the Curve (AUC) metric that is calculated from a common test data set
  3. Setting a predicted probability threshold that would enroll patients into a “Readmissions Prevention Program”

Choosing the threshold is the most subjective step.

Simulating a Data Set

set.seed(123)
numObs <- 0.5e6

age <- rpois(numObs, 40)
nbr_prior_visits <- rpois(numObs, 1)  ## nbr of visits in the last 12 months

logodds <- -100 + 2*age + 4*nbr_prior_visits

#install.packages("boot")
readmission <- rbinom(numObs, 1, boot::inv.logit(logodds))  ## if patient is re-admitted in next 3 months

df <- data.frame(age=age, nbr_prior_visits=nbr_prior_visits, readmission=readmission)

prop.table(table(df$readmission))
## 
##        0        1 
## 0.881846 0.118154
summary(df)
##       age        nbr_prior_visits  readmission    
##  Min.   :14.00   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:36.00   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :40.00   Median :1.0000   Median :0.0000  
##  Mean   :40.01   Mean   :0.9986   Mean   :0.1182  
##  3rd Qu.:44.00   3rd Qu.:2.0000   3rd Qu.:0.0000  
##  Max.   :73.00   Max.   :9.0000   Max.   :1.0000

Split the data set between Training and Test

#install.packages("caTools")
trainIndex <- caTools::sample.split(df$readmission, SplitRatio = 0.7)

train <- df[trainIndex,]
test <- df[!trainIndex,]

Training several models

Age Model

bad.mod.1 <- glm(readmission ~ age, data=train, family=binomial)

bad.mod.1.preds <- predict(bad.mod.1, newdata=test, type="response") ## test data

PriorVisits Model

bad.mod.2 <- glm(readmission ~ nbr_prior_visits, data=train, family=binomial)

bad.mod.2.preds <- predict(bad.mod.2, newdata=test, type="response")  ## test data

AgeAndPriorVisits Model

perfect.mod <- glm(readmission ~ ., data=train, family=binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
perfect.mod.preds <- predict(perfect.mod, newdata=test, type="response") ## test data

ROC and AUC using the Test Data Set

allMods <- caTools::colAUC(data.frame(ageModel=bad.mod.1.preds, priorVisitsModel=bad.mod.2.preds,
                           AgeAndPriorVisitsModel=perfect.mod.preds), test$readmission, plotROC = TRUE)

Setting Threshold or Cutoff

pred <- ROCR::prediction(perfect.mod.preds, test$readmission)
perf <- ROCR::performance(pred,"tpr","fpr")

cutoffs <- data.frame(cut=perf@alpha.values[[1]], fpr=perf@x.values[[1]], 
                      tpr=perf@y.values[[1]])

cutoffs <- cutoffs[order(cutoffs$tpr, decreasing=TRUE),]

head(cutoffs[cutoffs$tpr <= 0.95,])
##           cut         fpr       tpr
## 106 0.4921926 0.017070239 0.9435761
## 105 0.4936862 0.012957657 0.9143486
## 104 0.4951800 0.007287737 0.8719743
## 103 0.8773607 0.003583389 0.8416182
## 102 0.8780023 0.003583389 0.8414490
## 101 0.8786409 0.003560710 0.8396998

Setting a threshold at 0.4921926 means that the hospital would enroll everyone with a predicted probability that exceeds 0.4921926 into the prevention program. Inside the pool of enrollees are true positive patients and true negative patients. This threshold scheme would capture 94.3576144% of the entire universe of true positives. However, this scheme would also capture 1.7070239% of the entire universe of true negatives. The ROC curve demonstrates that, inevitably, there will be enrollees that do not need the prevention program.