Goals of Care Bag of Words Logistic Regression

Introduction

This is a rule-based patient note analysis to capture patient goals of care (GOC) and advanced care planning (ACP) from text at the note level. We will utilize text phrases generated by clinicians who have annotated notes contained in the MIMIC-III Intensive Care Unit database to generate a Bag of Words (BoW) Model.

The BoW model does not utilize grammar or semantic relatedness, as it classifies text without regard to the spacial relation of words making up the text. The model utilizes only the presence or absence of words known to be associated with GOC or ACP (as per the clinicians who generated these rules during text classification).

This BoW model is employed as a means to compare a rule-based method to a deep-learning method which account for grammar, semantic relatedness, and named-entity recognition NeuroNER.

Note: Aggregated data are representative of actual MIMIC data, but to ensure and maintain patient privacy, all strings of characters here that appear to be patient note text are made up by myself as representative cases.

Strategy

Use only phrase and text data from training set to generate model
Tokenize text:

Remove carriage returns
Remove obfuscated text areas
Remove punctuation
Keep only single white spaces

Remove the top 0.1% represented words in the text from all phrase dictionaries
Create “bag of word” sets of top 10% tokens in each domain phrase dictionary
Create covariates consisting of:

Presence/absence of top tokens in phrase dictionary

Any covariate with 0 variance will be dropped
Run Logistic Regression, optimize parameters for accuracy
Apply models to validation set for performance metrics

Utility Functions

Libraries

library("ggplot2")
library("reshape2")
library("data.table")
library("DescTools")
library("pscl")
library("PresenceAbsence")

load_data

load_data() will accept a label, lab, the set identifier (train/valid) set, and the map from file to row_id, mp. It will return the portion of the map particular to the set input as an argument.

load_data <- function(lab, set, mp){
  #Read in data
  tmp <- read.csv(paste("~/goals_of_care/regex_v2/",lab,"_",set,"_processed.csv", sep = ''),
                  header = T, stringsAsFactors = F, quote = "", row.names = NULL)
  
  #Change "filename" to file for consistency
  colnames(tmp)[which(colnames(tmp) == "filename")] <- "file"
  
  #Subset unique filenames
  tmp <- unique(tmp$file)
  
  #Take substring of names to ensure we can map across both train_/valid_ files
  tmp <- substr(tmp, nchar(tmp) - 9, nchar(tmp))
  
  #Subset files in train set from key map
  tmp <- mp[(mp$file %in% tmp),]

  #Return the mapped data
  return(tmp)
}

strict_regex

strict_regex() will accept all phrases kwds, and all note texts, texts, it will utilize grepl() to find direct matches in the text, and will return a list of booleans.

strict_regex <- function(kwds, texts){
  #Create a list to store results
  tmpList <- list()
  #Loop through all keywords
  for (i in 1:length(kwds)){
    #Store results as a logical vector in its respective list entry position
    tmpList[[i]] <- grepl(kwds[i], texts, ignore.case = TRUE)
  }
  #Return list and control to environment
  return(tmpList)
}

to_df()

to_df() will convert the list to a data.frame and apply the rule tokens as column names to indicate what each variable represents.

to_df <- function(domain, rule){
  #Convert list from grepl to data frame
  domain <- as.data.frame(domain)
  #Show column names as phrases
  colnames(domain) <- rule[rule != '']
  #Multiply by 1 for binary numeric
  domain <- domain*1
  return(domain)
}

clean_text

clean_text() accept a string of text, tokens, as well as a boolean, printout. It will remove carriage returns, remove text obfuscations, and convert the text to lowercase. If printout is TRUE, it will print out example text resulting from the removal of the above.

clean_text <- function(tokens, printout){
    #Create a fake patient note phrase as a representative sample
    ex_token <- "Example note:\nThe patient is a 81yo m who was found down in [** location **] on [** date **] by daughter, [** name **].\n Pt was in usual state of health until four days ago, when began to complain to family of heartburn for which the pt was taking tums in addition to his prescribed PPI, without resolution."
  if (printout){
    print(substr(ex_token, 1, 100))
  }
  
  #Remove carriage returns, convert to lower
  tokens <- tolower(gsub('\r', ' ', tokens))
  tokens <- tolower(gsub('\n', ' ', tokens))
  ex_token <- tolower(gsub('\n', ' ', ex_token))
  if (printout){
    cat("Result after removing carriage returns:\n")
    print(substr(ex_token, 1, 100))
  }
  
  #https://stackoverflow.com/questions/13529360/replace-text-within-parenthesis-in-r
  #Remove obfuscations between '[' and ']'
  tokens <- gsub(" *\\[.*?\\] *", ' ', tokens)
  ex_token <- gsub(" *\\[.*?\\] *", ' ', ex_token)
  if (printout){
    cat("Result after leaving [obfuscation]:\n")
    print(substr(ex_token, 1, 100))
  }
  
  #Keep only words & numeric
  tokens <- gsub("[^[:alnum:][:space:]]", '', tokens)
  ex_token <- gsub("[^[:alnum:][:space:]]", '', ex_token)
  if (printout){
    cat("Result after removing all but alphanumeric and spaces:\n")
    print(substr(ex_token, 1, 100))
  }
  
  #Keep only a single white space
  #https://stackoverflow.com/questions/25707647/merge-multiple-spaces-to-single-space-remove-trailing-leading-spaces
  tokens <- gsub("(?<=[\\s])\\s*|^\\s+|\\s+$", '', tokens, perl=TRUE)
  ex_token <- gsub("(?<=[\\s])\\s*|^\\s+|\\s+$", '', ex_token, perl=TRUE)
  if (printout){
    cat("Result after keeping only single spaces:\n")
    print(substr(ex_token, 1, 100))
  }

  return(tokens)
}

bow

bow() will accept a character vector, phrases, will split them on spaces, clean tokens of empty observations and return a character vector of tokens.

bow <- function(phrases){
  tmp <- unlist(strsplit(phrases, ' '))
  tmp <- tmp[tmp != '']
  return(tmp)
}

plot_bow

plot_bow() will accept a character vector bows containing the bag of words, as well as a number, n, and a label lab,it will tabulate them and plot the top n-most represented words in a barplot, the title will contain the label information.

plot_bow <- function(bows, n, lab){
  tmp_tab <- table(bows)[rev(order(table(bows)))]
  par(mai=c(1,2,1,1))
  barplot(rev(head(tmp_tab, n)), horiz = T, las = 1,
          main = paste("Most Frequent Words in the ",lab ," Dictionary", sep = ''),
          xlab = "Frequency")
}

bow_subset

bow_subset <- function(bows, n){
  #Make an ordered table
  tmp_tab <- table(bows)[rev(order(table(bows)))]
  #print(head(tmp_tab))
  #Percent calculation
  perc <- length(tmp_tab)*(n/100)
  #print(perc)
  #Return top n represented values
  return(attr(head(tmp_tab, perc), "names"))
}

cohort_gen

cohort_gen() will accept the annotation data dat, train_set, and validation_set, and will map cohorts, returning the annotation data frame.

cohort_gen <- function(dat, train_set, valid_set){
  dat$COHORT <- rep('', each = nrow(dat))
  for (i in 1:nrow(dat)){
    if (dat$ROW_ID[i] %in% train_set$row_id){
      dat$COHORT[i] <- "train"
    } else if (dat$ROW_ID[i] %in% valid_set$row_id){
      dat$COHORT[i] <- "validation"
    } else {
      dat$COHORT[i] <- ""
    }
  }
  return(dat)
}

standardize

standardize() will standardize numeric data.

standardize <- function(dat){
  return((dat - mean(dat))/sd(dat))
}

check_variance

check_variance <- function(dat){
  ##Remove columns if they have no variance
  for (name in colnames(dat)){
    if (all(dat[[name]] == 1) | all (dat[[name]] == 0)){
      cat(paste("\"",name, "\" has no variance and will be dropped.\n", sep = ''))
      dat[[name]] <- NULL
    }
  }
  return(dat)
}

model_info

model_info <- function(fit){
  #Summary info
  model_sum <- summary(fit)
  #Odds ratio, confidence interval
  odds_ratio <- cbind(OR = exp(fit$coef), exp(confint(fit)))
  #Create list for return
  my_list <- list(model_sum, odds_ratio)
  #names
  names(my_list) <- c("Model Summary","OR Summary")
  return(my_list)
}

confusion_data

confusion_data <- function(fit){
  
  #Create variables
  accuracy <- vector()
  #Threshold sequence
  threshold <- seq(0.1,0.9, by=.01)
  
  
  for (i in 1:length(threshold)){
    #Accuracy calculation from confusion matrix
    accuracy[i] <- Conf(fit, cutoff = threshold[i])$acc
  }
  
  #Confusion matrix
  cutoff <- threshold[which.max(accuracy)]
  conf_mat <- Conf(fit, cutoff = cutoff)
  cat("Maximum accuracy is acheived at a cutoff of: ", cutoff, '\n', sep = '')
  
  #Plot
  layout(matrix(1:2,ncol = 2))
  plot(threshold, accuracy, type = "l", main = "Cutoff Based on Accuracy")
  abline(h=max(accuracy), v = cutoff, col="red")
  fourfoldplot(conf_mat$table, main = "Confusion Matrix Plot", 
               color = c("red","green"))

  return(conf_mat)
}

roc_plot

roc_plot <- function(observed, fit, lab){
  df <- data.frame(ID=1:length(observed),observed = observed, 
                   predicted = predict(fit, type = "response"))
  
  #Maximize PCC (percent correctly classified)
  auc.roc.plot(df, opt.methods = 5, color = TRUE, main = paste(lab, "AUC ROC Plot", sep = ' '))

}

stat_gen

stat_gen <- function(dat, lab){
    h_lab <- substr(lab, 1, 3)
    #True Positives
    tp <- 0
    #True Negatives
    tn <- 0
    #False Positives
    fp <- 0
    #False Negatives
    fn <- 0
    for (i in 1:nrow(dat)){
      #if both human and model mark positive
      if (dat[[h_lab]][i] == 1 & dat[[lab]][i] == 1){
        tp <- tp + 1
        #if human marks negative and model assigns label
      } else if (dat[[h_lab]][i] == 0 & dat[[lab]][i] == 1){
        fp <- fp + 1
        #if human marks positive and model marks negative
      } else if (dat[[h_lab]][i] == 1 & dat[[lab]][i] == 0){
        #false negative
        fn <- fn + 1
      } else if (dat[[h_lab]][i] == 0 & dat[[lab]][i] == 0){
        tn <- tn + 1
      }
    }
  tmpFrame <- cbind(tp, tn, fp, fn)
  colnames(tmpFrame) <- c("tp", "tn", "fp", "fn")
  return(as.data.frame(tmpFrame))
}

model_stats

model_stats <- function(dat){
  dat <- as.data.frame(dat)
  accuracy <- (dat$tp + dat$tn)/(dat$tp + dat$tn + dat$fp + dat$fn)
  precision <- dat$tp/(dat$tp + dat$fp)
  sensitivity <- dat$tp/(dat$tp + dat$fn)
  specificity <- dat$tn/(dat$tn+dat$fp)
  F1 <- 2*(precision*sensitivity)/(precision + sensitivity)
  
  dat$accuracy <- round(accuracy*100, 2)
  dat$precision <- round(precision*100, 2)
  dat$sensitivity <- round(sensitivity*100, 2)
  dat$specificity <- round(specificity*100, 2)
  dat$F1 <- round(F1*100, 2)
  
  return(dat)
}

Load Annotation Data

#Read csv
dat <- fread("~/goals_of_care/regex_v3/op_annotations_122017.csv", 
                header = T, stringsAsFactors = F)

#Remove note w/ missing data
missing_note <- fread("~/goals_of_care/unmatched_note_result.csv", header = T, stringsAsFactors = F)
nrow(missing_note)

## [1] 1

#Remove note with missing data
dat <- dat[!dat$ROW_ID %in% missing_note$ROW_ID, ]

#Make the names easier to deal with:
colnames(dat)[which(colnames(dat) == "Code Status Limitations")] <- "LIM"
colnames(dat)[which(colnames(dat) == "Full Code Status")] <- "COD"
colnames(dat)[which(colnames(dat) == "Patient and Family Care Preferences")] <- "CAR"
colnames(dat)[which(colnames(dat) == "Palliative Care Team Involvement")] <- "PAL"
colnames(dat)[which(colnames(dat) == "Communication with Family")] <- "FAM"
#Dat CIM is already accounted for

#Check the data
colnames(dat)

##  [1] "V1"                                      
##  [2] "ROW_ID"                                  
##  [3] "SUBJECT_ID"                              
##  [4] "HADM_ID"                                 
##  [5] "CATEGORY"                                
##  [6] "DESCRIPTION"                             
##  [7] "TEXT"                                    
##  [8] "COHORT"                                  
##  [9] "CAR"                                     
## [10] "Patient and Family Care Preferences Text"
## [11] "FAM"                                     
## [12] "Communication with Family Text"          
## [13] "COD"                                     
## [14] "Full Code Status Text"                   
## [15] "LIM"                                     
## [16] "Code Status Limitations Text"            
## [17] "PAL"                                     
## [18] "Palliative Care Team Involvement Text"   
## [19] "Ambiguous"                               
## [20] "Ambiguous Text"                          
## [21] "Ambiguous Comments"                      
## [22] "None"                                    
## [23] "STAMP"                                   
## [24] "operator"                                
## [25] "original_filename"

nrow(dat)

## [1] 1496

Load Map for training/validation sets

row_id_to_file_num.txt contains the key we will use to convert file as output from NeuroNER to row_id from MIMICIII.NOTEEVENTS, which was used as substrate to generate the annotated data set.

#Contains map of files (BRAT format for neuroNER) to row_id (MIMIC NOTEEVENTS)
map <- fread("~/goals_of_care/summary_stats/row_id_to_file_num.txt", header = F, stringsAsFactors = F, sep = '\t')
head(map)

##        V1         V2
## 1: 407635 text_00000
## 2: 345097 text_00001
## 3: 371117 text_00002
## 4: 401545 text_00003
## 5: 393792 text_00004
## 6: 382263 text_00005

#Add column names for ease of merging
colnames(map) <- c("row_id", "file")

Load training/validation sets and generate cohort info

#Code Status Limitations
lim_train_set <- load_data("LIM", "train", map)
lim_valid_set <- load_data("LIM", "valid", map)

#Full Code
cod_train_set <- load_data("COD", "train", map)
cod_valid_set <- load_data("COD", "valid", map)

#Patient/Family Care Preferences
car_train_set <- load_data("CAR", "train", map)
car_valid_set <- load_data("CAR", "valid", map)

#Palliative Care Consult
pal_train_set <- load_data("PAL", "train", map)
pal_valid_set <- load_data("PAL", "valid", map)

#Family meeting held
fam_train_set <- load_data("FAM", "train", map)
fam_valid_set <- load_data("FAM", "valid", map)

#dat <- cohort_gen(dat, train_set, valid_set)
lim_dat <- cohort_gen(dat, lim_train_set, lim_valid_set)
cod_dat <- cohort_gen(dat, cod_train_set, cod_valid_set)
car_dat <- cohort_gen(dat, car_train_set, car_valid_set)
pal_dat <- cohort_gen(dat, pal_train_set, pal_valid_set)
fam_dat <- cohort_gen(dat, fam_train_set, fam_valid_set)

#apply train/valid cohort information to dat
if (identical(dat$ROW_ID, lim_dat$ROW_ID)){
  dat$COHORT <- lim_dat$COHORT
}

levels(factor(dat$COHORT))

## [1] "train"      "validation"

Generate CIM Measure

CIM will be defined as including code status limitations and documentation of care preferences.

#Create CIM dat by using any other data source
cim_dat <- lim_dat
cim <- vector()
for (i in 1:nrow(lim_dat)){
  if (lim_dat$LIM[i] == 1 | lim_dat$CAR[i] == 1){
    cim[length(cim)+1] <- 1
  } else {
    cim[length(cim)+1] <- 0
  }
}

#Populate CIM data
cim_dat$CIM <- cim
#Add to dat
dat$CIM <- cim

Subset Phrases

Note: Only text and phrases from the training set will be used.

##Only use phrases from training set
lim_phrases <- lim_dat$`Code Status Limitations Text`[lim_dat$COHORT == "train"]
cod_phrases <- cod_dat$`Full Code Status Text`[cod_dat$COHORT == "train"]
car_phrases <- car_dat$`Patient and Family Care Preferences Text`[car_dat$COHORT == "train"]
pal_phrases <- pal_dat$`Palliative Care Team Involvement Text`[pal_dat$COHORT == "train"]
fam_phrases <- fam_dat$`Communication with Family Text`[fam_dat$COHORT == "train"]

##remove empty observations ""
lim_phrases <- lim_phrases[lim_phrases != '']
cod_phrases <- cod_phrases[cod_phrases != '']
car_phrases <- car_phrases[car_phrases != '']
pal_phrases <- pal_phrases[pal_phrases != '']
fam_phrases <- fam_phrases[fam_phrases != '']

#CIM is a combination of care preferences/code status limitations
cim_phrases <- c(car_phrases, lim_phrases)

Clean Phrases & Text

##Remove duplicate observations (keep first)
dat <- dat[!duplicated(dat$TEXT), ]

#Subset texts from training set
txts <- clean_text(dat$TEXT[dat$COHORT == "train"], TRUE)

## [1] "Example note:\nThe patient is a 81yo m who was found down in [** location **] on [** date **] by daug"
## Result after removing carriage returns:
## [1] "example note: the patient is a 81yo m who was found down in [** location **] on [** date **] by daug"
## Result after leaving [obfuscation]:
## [1] "example note: the patient is a 81yo m who was found down in  on  by daughter, .  pt was in usual sta"
## Result after removing all but alphanumeric and spaces:
## [1] "example note the patient is a 81yo m who was found down in  on  by daughter   pt was in usual state "
## Result after keeping only single spaces:
## [1] "example note the patient is a 81yo m who was found down in on by daughter pt was in usual state of h"

lim_phrases <- clean_text(lim_phrases, FALSE)
cod_phrases <- clean_text(cod_phrases, FALSE)
car_phrases <- clean_text(car_phrases, FALSE)
pal_phrases <- clean_text(pal_phrases, FALSE)
fam_phrases <- clean_text(fam_phrases, FALSE)

#
cim_phrases <- clean_text(cim_phrases, FALSE)

Create a Bag of Words

#Create bag of words by splitting phrases on spaces and unlisting result
lim_bow <- bow(lim_phrases)
cod_bow <- bow(cod_phrases)
car_bow <- bow(car_phrases)
pal_bow <- bow(pal_phrases)
fam_bow <- bow(fam_phrases)

#cim
cim_bow <- bow(cim_phrases)

#All texts
txt_bow <- bow(txts)

BoW Contents

plot_bow(lim_bow, 20, "LIM Phrase")

plot_bow(cod_bow, 20, "COD Phrase")

plot_bow(car_bow, 20, "CAR Phrase")

plot_bow(pal_bow, 20, "PAL Phrase")

plot_bow(fam_bow, 20, "FAM Phrase")

plot_bow(cim_bow, 20, "CIM Phrase")

plot_bow(txt_bow, 20, "Corpus")

Subset top tokens & remove common tokens from each phrase dictionary

#Top non-specific tokens from all texts
txt_bow <- bow_subset(txt_bow, .1)

#Clean non-specific tokens from phrase dictionaries
lim_bow <- lim_bow[!lim_bow %in% txt_bow]
cod_bow <- cod_bow[!cod_bow %in% txt_bow]
car_bow <- car_bow[!car_bow %in% txt_bow]
pal_bow <- pal_bow[!pal_bow %in% txt_bow]
fam_bow <- fam_bow[!fam_bow %in% txt_bow]
cim_bow <- cim_bow[!cim_bow %in% txt_bow]

#Subset top tokens in each phrase dictionary
(lim_bow <- bow_subset(lim_bow, 10))

##  [1] "code"         "status"       "dnr"          "dni"         
##  [5] "dnrdni"       "full"         "do"           "resuscitate" 
##  [9] "family"       "patient"      "cmo"          "confirmed"   
## [13] "comfort"      "care"         "only"         "but"         
## [17] "measures"     "made"         "be"           "her"         
## [21] "would"        "now"          "that"         "intubate"    
## [25] "hcp"          "compressions" "changed"      "want"        
## [29] "pressors"     "if"           "his"          "a"           
## [33] "will"         "pt"           "per"          "patients"

(cod_bow <- bow_subset(cod_bow, 10))

##  [1] "code"       "full"       "status"     "presumed"   "confirmed" 
##  [6] "discussed"  "patient"    "family"     "will"       "wife"      
## [11] "per"        "hcp"        "discussion"

(car_bow <- bow_subset(car_bow, 10))

##  [1] "family"     "care"       "code"       "patient"    "would"     
##  [6] "be"         "full"       "status"     "comfort"    "that"      
## [11] "cmo"        "want"       "like"       "her"        "goals"     
## [16] "but"        "his"        "will"       "measures"   "dnrdni"    
## [21] "she"        "confirmed"  "meeting"    "discussion" "made"      
## [26] "dnr"        "decision"   "he"         "focus"      "a"         
## [31] "they"       "son"        "or"         "daughter"   "central"   
## [36] "procedures" "patients"   "discussed"  "aggressive" "pt"        
## [41] "pressors"   "per"        "invasive"   "now"        "this"      
## [46] "after"      "including"  "hcp"        "escalation" "wife"      
## [51] "lines"      "home"       "by"         "wishes"     "we"        
## [56] "states"     "prognosis"  "plan"       "only"       "dni"       
## [61] "any"        "morphine"   "avoid"      "are"        "agreed"    
## [66] "yesterday"  "pts"        "intubation" "have"

(pal_bow <- bow_subset(pal_bow, 10))

##  [1] "care"       "palliative" "family"     "goals"      "consult"   
##  [6] "would"      "like"       "patient"    "hospice"    "her"

(fam_bow <- bow_subset(fam_bow, 10))

##  [1] "family"        "meeting"       "code"          "patient"      
##  [5] "care"          "discussed"     "communication" "held"         
##  [9] "full"          "status"        "will"          "wife"         
## [13] "her"           "son"           "goals"         "discussion"   
## [17] "be"            "a"             "daughter"      "confirmed"    
## [21] "hcp"           "would"         "his"           "dnrdni"       
## [25] "this"          "that"          "comfort"       "patients"     
## [29] "husband"       "consent"       "who"           "s"            
## [33] "we"            "spoke"         "have"          "plan"         
## [37] "icu"           "yesterday"     "today"         "they"         
## [41] "met"           "comments"      "after"         "signed"       
## [45] "him"           "decision"      "but"           "team"         
## [49] "prognosis"     "like"          "he"            "are"          
## [53] "w"             "pts"           "pt"            "dnr"          
## [57] "discuss"       "update"        "rounds"        "length"       
## [61] "aware"         "night"

(cim_bow <- bow_subset(cim_bow, 10))

##  [1] "code"         "status"       "dnr"          "family"      
##  [5] "dni"          "full"         "care"         "patient"     
##  [9] "dnrdni"       "would"        "cmo"          "comfort"     
## [13] "be"           "do"           "that"         "confirmed"   
## [17] "but"          "want"         "resuscitate"  "her"         
## [21] "measures"     "made"         "goals"        "like"        
## [25] "his"          "only"         "will"         "meeting"     
## [29] "discussion"   "a"            "she"          "now"         
## [33] "he"           "pressors"     "patients"     "decision"    
## [37] "daughter"     "central"      "son"          "pt"          
## [41] "per"          "hcp"          "focus"        "invasive"    
## [45] "discussed"    "procedures"   "aggressive"   "they"        
## [49] "or"           "escalation"   "after"        "wife"        
## [53] "this"         "by"           "plan"         "intubation"  
## [57] "compressions" "we"           "lines"        "including"   
## [61] "changed"      "wishes"       "prognosis"    "if"          
## [65] "home"         "given"        "chest"        "states"      
## [69] "make"         "have"         "any"          "agreed"      
## [73] "should"       "morphine"

Run Regex

#Strict regex
lim_tmp <- strict_regex(lim_bow, txts)
cod_tmp <- strict_regex(cod_bow, txts)
car_tmp <- strict_regex(car_bow, txts)
pal_tmp <- strict_regex(pal_bow, txts)
fam_tmp <- strict_regex(fam_bow, txts)
#CIM
cim_tmp <- strict_regex(cim_bow, txts)

#Convert list to data.frame
lim_tmp <- to_df(lim_tmp, lim_bow)
cod_tmp <- to_df(cod_tmp, cod_bow)
car_tmp <- to_df(car_tmp, car_bow)
pal_tmp <- to_df(pal_tmp, pal_bow)
fam_tmp <- to_df(fam_tmp, fam_bow)
#CIM
cim_tmp <- to_df(cim_tmp, cim_bow)

Create Metrics for Analysis and Check Variance

#Check variance
print("Check LIM Variances:")

## [1] "Check LIM Variances:"

lim_tmp <- check_variance(lim_tmp)

## "a" has no variance and will be dropped.

print("Check COD Variances:")

## [1] "Check COD Variances:"

cod_tmp <- check_variance(cod_tmp)
print("Check CAR Variances:")

## [1] "Check CAR Variances:"

car_tmp <- check_variance(car_tmp)

## "he" has no variance and will be dropped.
## "a" has no variance and will be dropped.
## "or" has no variance and will be dropped.

print("Check PAL Variances:")

## [1] "Check PAL Variances:"

pal_tmp <- check_variance(pal_tmp)
print("Check FAM Variances:")

## [1] "Check FAM Variances:"

fam_tmp <- check_variance(fam_tmp)

## "a" has no variance and will be dropped.
## "s" has no variance and will be dropped.
## "he" has no variance and will be dropped.
## "w" has no variance and will be dropped.

print("Check CIM Variances:")

## [1] "Check CIM Variances:"

cim_tmp <- check_variance(cim_tmp)

## "a" has no variance and will be dropped.
## "he" has no variance and will be dropped.
## "or" has no variance and will be dropped.

Prepare data frames for logistic regression

#cbind data
LIM <- dat$LIM[dat$COHORT == "train"]
lim_tmp <- cbind(LIM, lim_tmp)
#Note: NA's must be omitted for COD due to missing observations
COD <- dat$COD[dat$COHORT == "train"]
cod_tmp <- na.omit(cbind(COD, cod_tmp))

CAR <- dat$CAR[dat$COHORT == "train"]
car_tmp <- cbind(CAR, car_tmp)
PAL <- dat$PAL[dat$COHORT == "train"]
pal_tmp <- cbind(PAL, pal_tmp)
FAM <- dat$FAM[dat$COHORT == "train"]
fam_tmp <- cbind(FAM, fam_tmp)
CIM <- dat$CIM[dat$COHORT == "train"]
cim_tmp <- cbind(CIM, cim_tmp)

Perform Logistic Regression

#LIM
lim_reg <- glm(LIM ~ ., family = binomial(link = 'logit'), data = lim_tmp)
#COD
cod_reg <- glm(COD ~ ., family = binomial(link = 'logit'), data = cod_tmp)
#CAR
car_reg <- glm(CAR ~ ., family = binomial(link = 'logit'), data = car_tmp)
#PAL
pal_reg <- glm(PAL ~ ., family = binomial(link = 'logit'), data = pal_tmp)
#FAM
fam_reg <- glm(FAM ~ ., family = binomial(link = 'logit'), data = fam_tmp)
#CIM
cim_reg <- glm(CIM ~ ., family = binomial(link = 'logit'), data = cim_tmp)

Generate Model Statistics and Visualizations

LIM

model_info(lim_reg)

## Waiting for profiling to be done...

## $`Model Summary`
## 
## Call:
## glm(formula = LIM ~ ., family = binomial(link = "logit"), data = lim_tmp)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6820  -0.6086  -0.4439   0.4427   2.5766  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.490e+01  8.192e+02  -0.018  0.98549    
## code          1.276e+00  9.185e-01   1.389  0.16478    
## status        1.311e-01  9.426e-01   0.139  0.88939    
## dnr           3.159e+00  5.353e-01   5.901 3.61e-09 ***
## dni           2.778e-01  4.262e-01   0.652  0.51447    
## dnrdni       -6.399e-01  6.631e-01  -0.965  0.33451    
## full         -8.391e-01  4.462e-01  -1.881  0.06004 .  
## do            1.325e+00  1.190e+03   0.001  0.99911    
## resuscitate   3.853e-01  5.464e-01   0.705  0.48073    
## family        5.374e-01  6.171e-01   0.871  0.38385    
## patient       2.529e-01  5.062e-01   0.500  0.61737    
## cmo          -7.769e-01  5.743e-01  -1.353  0.17613    
## confirmed    -6.803e-02  4.528e-01  -0.150  0.88057    
## comfort      -8.872e-03  3.841e-01  -0.023  0.98157    
## care          1.567e+00  1.011e+00   1.550  0.12106    
## only          5.423e-01  3.641e-01   1.489  0.13640    
## but          -1.148e-01  3.865e-01  -0.297  0.76647    
## measures      8.065e-01  7.721e-01   1.045  0.29623    
## made          8.609e-01  5.484e-01   1.570  0.11651    
## be           -3.061e+00  2.626e+00  -1.166  0.24376    
## her           1.462e+01  8.628e+02   0.017  0.98648    
## would         9.032e-02  3.258e-01   0.277  0.78157    
## now          -2.072e-01  4.667e-01  -0.444  0.65710    
## that         -2.491e-02  3.135e-01  -0.079  0.93666    
## intubate     -8.089e-01  2.980e-01  -2.715  0.00663 ** 
## hcp          -3.157e-01  4.256e-01  -0.742  0.45822    
## compressions  1.563e+00  8.943e-01   1.748  0.08044 .  
## changed      -4.105e-02  3.404e-01  -0.121  0.90403    
## want         -1.906e-01  4.718e-01  -0.404  0.68621    
## pressors      1.358e-01  3.313e-01   0.410  0.68187    
## `if`          8.016e-01  5.529e-01   1.450  0.14713    
## his          -5.420e-02  6.612e-01  -0.082  0.93467    
## will          8.687e-02  3.436e-01   0.253  0.80039    
## pt           -1.046e+00  8.427e-01  -1.241  0.21450    
## per          -1.401e+00  1.027e+00  -1.365  0.17241    
## patients     -4.012e-01  4.277e-01  -0.938  0.34815    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 608.78  on 447  degrees of freedom
## Residual deviance: 366.60  on 412  degrees of freedom
## AIC: 438.6
## 
## Number of Fisher Scoring iterations: 14
## 
## 
## $`OR Summary`
##                        OR        2.5 %       97.5 %
## (Intercept)  3.390758e-07           NA 7.744298e+48
## code         3.582003e+00 5.757349e-01 2.234698e+01
## status       1.140072e+00 2.013795e-01 8.935358e+00
## dnr          2.353982e+01 8.684025e+00 7.198795e+01
## dni          1.320245e+00 5.593838e-01 3.003546e+00
## dnrdni       5.273281e-01 1.395719e-01 1.500222e+00
## full         4.320786e-01 1.795068e-01 1.044885e+00
## do           3.761014e+00 3.194168e-06 5.534231e+06
## resuscitate  1.470077e+00 4.935604e-01 4.278536e+00
## family       1.711550e+00 5.307354e-01 6.046266e+00
## patient      1.287712e+00 4.948531e-01 3.644624e+00
## cmo          4.598160e-01 1.465789e-01 1.410505e+00
## confirmed    9.342276e-01 3.758651e-01 2.232875e+00
## comfort      9.911671e-01 4.590274e-01 2.080963e+00
## care         4.793846e+00 7.638770e-01 3.947607e+01
## only         1.719968e+00 8.368912e-01 3.505366e+00
## but          8.915502e-01 4.191380e-01 1.916412e+00
## measures     2.240042e+00 5.272346e-01 1.097907e+01
## made         2.365174e+00 9.871794e-01 6.997559e+00
## be           4.684687e-02 2.832721e-04 5.904301e+00
## her          2.226937e+06 5.824145e-52           NA
## would        1.094526e+00 5.742072e-01 2.067283e+00
## now          8.128751e-01 3.312706e-01 1.714930e+00
## that         9.753968e-01 5.250381e-01 1.801435e+00
## intubate     4.453258e-01 2.468890e-01 7.967078e-01
## hcp          7.293000e-01 3.108680e-01 1.660108e+00
## compressions 4.775233e+00 8.518711e-01 3.098439e+01
## changed      9.597830e-01 1.574832e+00 1.642154e+00
## want         8.264527e-01 3.906924e-01 2.066883e+00
## pressors     1.145453e+00 5.933739e-01 2.185284e+00
## `if`         2.229147e+00 7.914195e-01 7.022790e+00
## his          9.472438e-01 2.641518e-01 3.596028e+00
## will         1.090757e+00 5.581471e-01 1.882032e+00
## pt           3.513316e-01 6.731749e-02 1.899641e+00
## per          2.463941e-01 3.164475e-02 1.938602e+00
## patients     6.694846e-01 3.377275e-01 1.527315e+00

confusion_data(lim_reg)

## Maximum accuracy is acheived at a cutoff of: 0.63

## 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 249  59
##          1  12 128
## 
##                Accuracy : 0.8415
##                  95% CI : (0.8048, 0.8724)
##     No Information Rate : 0.5826
##     P-Value [Acc > NIR] : < 2.2e-16
## 
##                   Kappa : 0.6621
##  Mcnemar's Test P-Value : 4.78e-08
## 
##             Sensitivity : 0.9540
##             Specificity : 0.6845
##          Pos Pred Value : 0.8084
##          Neg Pred Value : 0.9143
##              Prevalence : 0.5826
##          Detection Rate : 0.5558
##    Detection Prevalence : 0.5826
##       Balanced Accuracy : 0.8193
##          F-val Accuracy : 0.8752
## 
##        'Positive' Class : 0

COD

model_info(cod_reg)

## Waiting for profiling to be done...

## $`Model Summary`
## 
## Call:
## glm(formula = COD ~ ., family = binomial(link = "logit"), data = cod_tmp)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6589   0.0000   0.2433   0.4909   1.8371  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -17.9327  3452.6878  -0.005   0.9959    
## code          18.4886  1345.1052   0.014   0.9890    
## full          34.6984  1901.7232   0.018   0.9854    
## status       -15.3803  1345.1044  -0.011   0.9909    
## presumed       1.5582     0.9206   1.693   0.0905 .  
## confirmed     -0.6856     0.5381  -1.274   0.2027    
## discussed      1.0640     0.6260   1.700   0.0892 .  
## patient       -0.5915     0.8222  -0.719   0.4719    
## family        -0.3863     1.0101  -0.382   0.7021    
## will          -0.7086     0.6458  -1.097   0.2725    
## wife          -0.1738     0.5893  -0.295   0.7680    
## per          -16.1324  3500.0097  -0.005   0.9963    
## hcp           -0.4296     0.5915  -0.726   0.4676    
## discussion    -2.4229     0.5433  -4.460 8.21e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 466.19  on 355  degrees of freedom
## Residual deviance: 150.88  on 342  degrees of freedom
## AIC: 178.88
## 
## Number of Fisher Scoring iterations: 19
## 
## 
## $`OR Summary`
##                       OR         2.5 %        97.5 %
## (Intercept) 1.628993e-08  7.287626e-37  1.788560e+21
## code        1.070332e+08  4.634765e-17 5.998036e+203
## full        1.173020e+15           Inf 1.500898e+223
## status      2.091309e-07 4.626247e-177  2.780290e+22
## presumed    4.750226e+00  9.766722e-01  4.002518e+01
## confirmed   5.038019e-01  1.802375e-01  1.520284e+00
## discussed   2.897814e+00  9.365957e-01  1.145960e+01
## patient     5.534749e-01  7.750053e-02  2.289641e+00
## family      6.795678e-01  6.178491e-02  3.903444e+00
## will        4.923112e-01  1.194863e-01  1.579889e+00
## wife        8.404230e-01  2.794852e-01  2.929783e+00
## per         9.858233e-08  1.221274e-40  3.136681e+26
## hcp         6.507705e-01  2.132282e-01  2.223450e+00
## discussion  8.866770e-02  2.942014e-02  2.538987e-01

confusion_data(cod_reg)

## Maximum accuracy is acheived at a cutoff of: 0.59

## 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 109   5
##          1  20 222
## 
##                Accuracy : 0.9298
##                  95% CI : (0.8984, 0.9520)
##     No Information Rate : 0.6376
##     P-Value [Acc > NIR] : < 2.2e-16
## 
##                   Kappa : 0.8441
##  Mcnemar's Test P-Value : 0.00511
## 
##             Sensitivity : 0.8450
##             Specificity : 0.9780
##          Pos Pred Value : 0.9561
##          Neg Pred Value : 0.9174
##              Prevalence : 0.3624
##          Detection Rate : 0.3062
##    Detection Prevalence : 0.3624
##       Balanced Accuracy : 0.9115
##          F-val Accuracy : 0.8971
## 
##        'Positive' Class : 0

CAR

model_info(car_reg)

## Waiting for profiling to be done...

## $`Model Summary`
## 
## Call:
## glm(formula = CAR ~ ., family = binomial(link = "logit"), data = car_tmp)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.94229  -0.49402  -0.22153  -0.00041   2.80976  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  -22.39207 1096.89433  -0.020  0.98371   
## family         0.42734    0.92210   0.463  0.64304   
## care           3.33315    1.94490   1.714  0.08657 . 
## code          -3.22045    1.69927  -1.895  0.05807 . 
## patient        1.31230    0.74600   1.759  0.07856 . 
## would          0.53032    0.40065   1.324  0.18562   
## be            19.86085 1096.88974   0.018  0.98555   
## full          -0.02800    0.55634  -0.050  0.95987   
## status        -5.18044    1.61929  -3.199  0.00138 **
## comfort        0.63346    0.43821   1.446  0.14830   
## that          -0.81566    0.42542  -1.917  0.05520 . 
## cmo            1.77017    0.64738   2.734  0.00625 **
## want           0.07593    0.59741   0.127  0.89887   
## like           0.24036    0.52937   0.454  0.64980   
## her           -3.01157    3.06162  -0.984  0.32529   
## goals          0.64383    0.48670   1.323  0.18588   
## but           -0.62694    0.51756  -1.211  0.22577   
## his            1.40015    1.12360   1.246  0.21272   
## will           0.95920    0.53080   1.807  0.07075 . 
## measures       1.55816    0.88418   1.762  0.07803 . 
## dnrdni         0.44425    0.67879   0.654  0.51281   
## she            5.77221    1.81844   3.174  0.00150 **
## confirmed      0.82167    0.47162   1.742  0.08147 . 
## meeting        0.44702    0.44164   1.012  0.31145   
## discussion     0.87539    0.54751   1.599  0.10985   
## made          -1.58002    0.77212  -2.046  0.04072 * 
## dnr            1.01248    0.64605   1.567  0.11707   
## decision       0.23256    0.69613   0.334  0.73832   
## focus         -0.06517    0.91600  -0.071  0.94328   
## they           1.02305    0.74824   1.367  0.17154   
## son           -0.25389    0.41300  -0.615  0.53873   
## daughter       0.61733    0.47931   1.288  0.19776   
## central        0.28043    0.52106   0.538  0.59045   
## procedures     1.29927    0.88998   1.460  0.14432   
## patients       0.86881    0.50045   1.736  0.08255 . 
## discussed     -0.19037    0.48195  -0.395  0.69285   
## aggressive     1.26548    0.48692   2.599  0.00935 **
## pt            -0.94304    1.00650  -0.937  0.34878   
## pressors      -0.49708    0.45344  -1.096  0.27297   
## per           -1.46686    1.49294  -0.983  0.32584   
## invasive       0.36829    0.43641   0.844  0.39872   
## now           -0.85329    0.65002  -1.313  0.18928   
## this           0.31908    0.46122   0.692  0.48905   
## after          0.76695    0.38993   1.967  0.04920 * 
## including     -0.82800    0.41728  -1.984  0.04723 * 
## hcp            0.29299    0.50172   0.584  0.55923   
## escalation     3.11542    1.33654   2.331  0.01976 * 
## wife           1.31499    0.51572   2.550  0.01078 * 
## lines          1.43020    1.01493   1.409  0.15879   
## home          -0.50978    0.40211  -1.268  0.20489   
## by            -0.78443    0.40366  -1.943  0.05198 . 
## wishes         1.81498    1.20817   1.502  0.13303   
## we             0.18450    0.84326   0.219  0.82681   
## states         0.98044    0.94565   1.037  0.29983   
## prognosis      0.91899    0.55216   1.664  0.09604 . 
## plan           0.83213    2.73841   0.304  0.76122   
## only          -0.54291    0.50388  -1.077  0.28127   
## dni           -0.55042    0.60935  -0.903  0.36637   
## any            0.20009    0.38154   0.524  0.59999   
## morphine       0.94329    0.40767   2.314  0.02067 * 
## avoid         -0.55810    0.53304  -1.047  0.29509   
## are           -0.90538    2.93637  -0.308  0.75783   
## agreed         3.02283    1.57815   1.915  0.05544 . 
## yesterday     -0.47403    0.42862  -1.106  0.26875   
## pts            0.34627    0.58808   0.589  0.55599   
## intubation     0.82780    0.42863   1.931  0.05345 . 
## have          -0.23816    0.38742  -0.615  0.53874   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 501.64  on 447  degrees of freedom
## Residual deviance: 265.89  on 381  degrees of freedom
## AIC: 399.89
## 
## Number of Fisher Scoring iterations: 15
## 
## 
## $`OR Summary`
##                       OR         2.5 %       97.5 %
## (Intercept) 1.884715e-10 2.733766e-182 8.227312e+07
## family      1.533181e+00  2.872708e-01 1.187854e+01
## care        2.802659e+01  1.146395e+00 2.810263e+03
## code        3.993721e-02  6.768861e-04 7.557627e-01
## patient     3.714699e+00  9.338393e-01 1.786655e+01
## would       1.699469e+00  7.717643e-01 3.742223e+00
## be          4.221414e+08  1.584563e-42           NA
## full        9.723928e-01  3.271114e-01 2.929014e+00
## status      5.625505e-03  1.900293e-04 1.103152e-01
## comfort     1.884118e+00  7.894000e-01 4.440380e+00
## that        4.423464e-01  1.884954e-01 1.007339e+00
## cmo         5.871864e+00  1.669936e+00 2.148728e+01
## want        1.078883e+00  3.274490e-01 3.444219e+00
## like        1.271706e+00  4.596239e-01 3.705851e+00
## her         4.921451e-02  3.537771e-05 1.351322e+01
## goals       1.903765e+00  7.236902e-01 4.932499e+00
## but         5.342266e-01  1.923257e-01 1.481087e+00
## his         4.055817e+00  5.043043e-01 4.373709e+01
## will        2.609616e+00  9.534956e-01 7.722475e+00
## measures    4.750082e+00  8.853171e-01 2.925056e+01
## dnrdni      1.559315e+00  4.143922e-01 6.019716e+00
## she         3.212458e+02  1.031149e+01 1.622461e+04
## confirmed   2.274290e+00  8.953247e-01 5.747602e+00
## meeting     1.563645e+00  6.499789e-01 3.711084e+00
## discussion  2.399810e+00  8.120149e-01 7.016203e+00
## made        2.059709e-01  4.222284e-02 8.809848e-01
## dnr         2.752421e+00  7.660604e-01 9.805700e+00
## decision    1.261827e+00  3.087643e-01 4.827918e+00
## focus       9.369052e-01  1.492931e-01 5.645147e+00
## they        2.781676e+00  6.176387e-01 1.193126e+01
## son         7.757793e-01  3.402924e-01 1.730437e+00
## daughter    1.853970e+00  7.197334e-01 4.758819e+00
## central     1.323694e+00  4.660552e-01 3.638201e+00
## procedures  3.666612e+00  6.535049e-01 2.201928e+01
## patients    2.384077e+00  8.910112e-01 6.415325e+00
## discussed   8.266563e-01  3.154346e-01 2.104103e+00
## aggressive  3.544788e+00  1.361483e+00 9.290068e+00
## pt          3.894437e-01  5.809819e-02 3.176486e+00
## pressors    6.083049e-01  2.410216e-01 1.442619e+00
## per         2.306477e-01  1.404212e-02 5.506153e+00
## invasive    1.445260e+00  6.054986e-01 3.384842e+00
## now         4.260110e-01  1.223449e-01 1.613143e+00
## this        1.375863e+00  5.643252e-01 3.477374e+00
## after       2.153191e+00  1.011930e+00 4.704100e+00
## including   4.369241e-01  1.882151e-01 9.740241e-01
## hcp         1.340433e+00  4.971407e-01 3.589305e+00
## escalation  2.254290e+01  1.850248e+00 3.971859e+02
## wife        3.724726e+00  1.357923e+00 1.038859e+01
## lines       4.179534e+00  6.264030e-01 3.376135e+01
## home        6.006275e-01  2.679594e-01 1.306164e+00
## by          4.563815e-01  2.027632e-01 9.949029e-01
## wishes      6.140929e+00  6.436091e-01 7.954753e+01
## we          1.202615e+00  2.441825e-01 6.889857e+00
## states      2.665632e+00  3.841483e-01 1.616689e+01
## prognosis   2.506757e+00  8.431607e-01 7.452076e+00
## plan        2.298213e+00  2.878130e-03 3.446716e+02
## only        5.810555e-01  2.063825e-01 1.507808e+00
## dni         5.767057e-01  1.647850e-01 1.825411e+00
## any         1.221508e+00  5.780078e-01 2.599158e+00
## morphine    2.568410e+00  1.149604e+00 5.730888e+00
## avoid       5.722969e-01  1.924311e-01 1.575519e+00
## are         4.043884e-01  1.338146e-03 1.878436e+02
## agreed      2.054943e+01  1.078347e+00 7.014697e+02
## yesterday   6.224905e-01  2.614204e-01 1.417158e+00
## pts         1.413778e+00  4.336287e-01 4.414384e+00
## intubation  2.288282e+00  9.963440e-01 5.398698e+00
## have        7.880789e-01  3.659929e-01 1.684207e+00

confusion_data(car_reg)

## Maximum accuracy is acheived at a cutoff of: 0.48

## 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 319  39
##          1  18  72
## 
##                Accuracy : 0.8728
##                  95% CI : (0.8387, 0.9005)
##     No Information Rate : 0.7522
##     P-Value [Acc > NIR] : 1.82e-10
## 
##                   Kappa : 0.6356
##  Mcnemar's Test P-Value : 0.008071
## 
##             Sensitivity : 0.9466
##             Specificity : 0.6486
##          Pos Pred Value : 0.8911
##          Neg Pred Value : 0.8000
##              Prevalence : 0.7522
##          Detection Rate : 0.7121
##    Detection Prevalence : 0.7522
##       Balanced Accuracy : 0.7976
##          F-val Accuracy : 0.9180
## 
##        'Positive' Class : 0

PAL

model_info(pal_reg)

## Waiting for profiling to be done...

## $`Model Summary`
## 
## Call:
## glm(formula = PAL ~ ., family = binomial(link = "logit"), data = pal_tmp)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.30799  -0.00002  -0.00001  -0.00001   2.10994  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.150e+01  2.412e+04  -0.001    0.999
## care         8.641e-02  1.132e+04   0.000    1.000
## palliative   2.243e+01  2.203e+03   0.010    0.992
## family       5.378e-01  6.470e+03   0.000    1.000
## goals       -2.038e+00  1.717e+00  -1.187    0.235
## consult     -6.160e-01  9.477e-01  -0.650    0.516
## would        2.167e+00  1.381e+00   1.569    0.117
## like        -9.113e-01  1.538e+00  -0.593    0.553
## patient     -3.842e+00  2.715e+00  -1.415    0.157
## hospice      2.285e+00  1.588e+00   1.439    0.150
## her          7.931e-01  2.042e+04   0.000    1.000
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 88.154  on 447  degrees of freedom
## Residual deviance: 32.719  on 437  degrees of freedom
## AIC: 54.719
## 
## Number of Fisher Scoring iterations: 21
## 
## 
## $`OR Summary`
##                       OR         2.5 %        97.5 %
## (Intercept) 4.610190e-10  0.000000e+00           Inf
## care        1.090249e+00 1.128223e-194 1.053554e+194
## palliative  5.492412e+09  1.447931e-26  1.018147e+44
## family      1.712303e+00 1.581603e-107 5.734711e+108
## goals       1.302341e-01  2.131164e-03  2.265094e+00
## consult     5.400928e-01  7.645694e-02  3.460647e+00
## would       8.734935e+00  8.730233e-01  2.870073e+02
## like        4.019869e-01  1.326125e-02  1.197775e+01
## patient     2.144893e-02  2.991976e-05  2.892339e+00
## hospice     9.821736e+00  6.024142e-01  4.451343e+02
## her         2.210308e+00  0.000000e+00           Inf

confusion_data(pal_reg)

## Maximum accuracy is acheived at a cutoff of: 0.47

## 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 438   6
##          1   1   3
## 
##                Accuracy : 0.9844
##                  95% CI : (0.9681, 0.9924)
##     No Information Rate : 0.9799
##     P-Value [Acc > NIR] : 0.3215
## 
##                   Kappa : 0.4548
##  Mcnemar's Test P-Value : 0.1306
## 
##             Sensitivity : 0.9977
##             Specificity : 0.3333
##          Pos Pred Value : 0.9865
##          Neg Pred Value : 0.7500
##              Prevalence : 0.9799
##          Detection Rate : 0.9777
##    Detection Prevalence : 0.9799
##       Balanced Accuracy : 0.6655
##          F-val Accuracy : 0.9921
## 
##        'Positive' Class : 0

FAM

model_info(fam_reg)

## Waiting for profiling to be done...

## $`Model Summary`
## 
## Call:
## glm(formula = FAM ~ ., family = binomial(link = "logit"), data = fam_tmp)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4114  -0.5834  -0.2959   0.4047   2.7483  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -1.528e+00  2.292e+03  -0.001  0.99947    
## family         2.259e-01  7.286e-01   0.310  0.75650    
## meeting        2.005e+00  3.948e-01   5.079  3.8e-07 ***
## code          -1.313e-01  9.706e-01  -0.135  0.89237    
## patient        6.589e-01  5.783e-01   1.139  0.25451    
## care          -1.147e+00  1.044e+00  -1.098  0.27230    
## discussed      8.388e-01  5.502e-01   1.524  0.12741    
## communication  1.232e-01  8.525e-01   0.145  0.88508    
## held           1.207e+00  4.188e-01   2.882  0.00395 ** 
## full          -4.428e-01  4.480e-01  -0.988  0.32293    
## status        -1.142e+00  1.011e+00  -1.129  0.25899    
## will           2.427e-01  3.780e-01   0.642  0.52081    
## wife           2.006e-01  4.343e-01   0.462  0.64420    
## her            2.642e-01  2.107e+03   0.000  0.99990    
## son            6.628e-02  3.311e-01   0.200  0.84133    
## goals         -3.720e-01  4.427e-01  -0.840  0.40081    
## discussion     8.260e-01  5.590e-01   1.478  0.13951    
## be            -1.208e+00  2.351e+00  -0.514  0.60717    
## daughter      -5.260e-02  3.914e-01  -0.134  0.89310    
## confirmed      1.140e+00  4.193e-01   2.720  0.00653 ** 
## hcp           -2.785e-02  4.216e-01  -0.066  0.94733    
## would         -8.181e-03  3.396e-01  -0.024  0.98078    
## his            4.585e-01  8.347e-01   0.549  0.58284    
## dnrdni         2.004e-01  4.954e-01   0.404  0.68589    
## this           1.716e-01  3.970e-01   0.432  0.66553    
## that          -3.124e-01  3.543e-01  -0.882  0.37791    
## comfort        3.495e-01  3.668e-01   0.953  0.34067    
## patients      -8.660e-01  4.751e-01  -1.823  0.06832 .  
## husband        3.447e-01  6.084e-01   0.567  0.57096    
## consent       -3.717e-01  6.609e-01  -0.562  0.57379    
## who            5.909e-01  3.340e-01   1.769  0.07684 .  
## we             7.503e-01  8.739e-01   0.859  0.39057    
## spoke          8.619e-01  7.332e-01   1.176  0.23976    
## have           1.529e-01  3.270e-01   0.468  0.64004    
## plan           1.644e+01  1.558e+03   0.011  0.99158    
## icu           -1.646e+01  2.696e+03  -0.006  0.99513    
## yesterday      6.468e-01  3.572e-01   1.811  0.07017 .  
## today          5.103e-02  3.249e-01   0.157  0.87521    
## they           4.938e-01  5.611e-01   0.880  0.37883    
## met           -1.732e-01  3.807e-01  -0.455  0.64915    
## comments      -3.321e-01  6.613e-01  -0.502  0.61553    
## after          7.473e-04  3.162e-01   0.002  0.99811    
## signed         7.179e-01  6.911e-01   1.039  0.29894    
## him            1.011e+00  5.391e-01   1.875  0.06076 .  
## decision       9.344e-02  5.350e-01   0.175  0.86135    
## but           -1.587e-01  4.569e-01  -0.347  0.72833    
## team           6.697e-01  4.005e-01   1.672  0.09444 .  
## prognosis      1.845e-01  4.519e-01   0.408  0.68300    
## like          -4.601e-01  4.413e-01  -1.043  0.29713    
## are            1.719e+00  2.147e+00   0.801  0.42326    
## pts           -2.622e-01  5.522e-01  -0.475  0.63486    
## pt            -1.425e+00  8.371e-01  -1.703  0.08862 .  
## dnr            5.035e-01  4.989e-01   1.009  0.31282    
## discuss        5.093e-01  4.924e-01   1.034  0.30101    
## update         2.697e+00  1.058e+00   2.548  0.01083 *  
## rounds        -9.847e-01  5.992e-01  -1.643  0.10032    
## length         4.212e-01  1.129e+00   0.373  0.70900    
## aware          2.201e-01  5.193e-01   0.424  0.67169    
## night          4.073e-01  3.068e-01   1.327  0.18437    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 551.67  on 447  degrees of freedom
## Residual deviance: 339.75  on 389  degrees of freedom
## AIC: 457.75
## 
## Number of Fisher Scoring iterations: 15
## 
## 
## $`OR Summary`
##                         OR         2.5 %        97.5 %
## (Intercept)   2.170565e-01 3.223067e-138 1.200239e+216
## family        1.253477e+00  3.275840e-01  5.994485e+00
## meeting       7.426795e+00  3.466137e+00  1.639470e+01
## code          8.769234e-01  1.342624e-01  6.195737e+00
## patient       1.932680e+00  6.500384e-01  6.380657e+00
## care          3.177319e-01  3.754745e-02  2.410075e+00
## discussed     2.313548e+00  7.986981e-01  6.970199e+00
## communication 1.131126e+00  2.180393e-01  6.378723e+00
## held          3.343269e+00  1.469974e+00  7.644269e+00
## full          6.422098e-01  2.663211e-01  1.551079e+00
## status        3.193316e-01  4.428823e-02  2.446326e+00
## will          1.274727e+00  6.125150e-01  2.713477e+00
## wife          1.222122e+00  5.145326e-01  2.845687e+00
## her           1.302371e+00  8.548016e-11  1.366712e+10
## son           1.068531e+00  5.556994e-01  2.044619e+00
## goals         6.893628e-01  2.857589e-01  1.630057e+00
## discussion    2.284116e+00  7.725436e-01  6.981762e+00
## be            2.986614e-01  1.520719e-03  2.819434e+01
## daughter      9.487572e-01  4.360776e-01  2.034743e+00
## confirmed     3.128135e+00  1.380697e+00  7.189474e+00
## hcp           9.725294e-01  4.213840e-01  2.213604e+00
## would         9.918525e-01  5.063883e-01  1.926032e+00
## his           1.581665e+00  3.149524e-01  8.507757e+00
## dnrdni        1.221857e+00  4.604711e-01  3.232037e+00
## this          1.187227e+00  5.479813e-01  2.615305e+00
## that          7.316602e-01  3.621112e-01  1.459236e+00
## comfort       1.418303e+00  6.870861e-01  2.907420e+00
## patients      4.206356e-01  1.607737e-01  1.043801e+00
## husband       1.411594e+00  4.192239e-01  4.632343e+00
## consent       6.895489e-01  1.813223e-01  2.450391e+00
## who           1.805593e+00  9.379043e-01  3.488615e+00
## we            2.117621e+00  4.161734e-01  1.315305e+01
## spoke         2.367749e+00  5.454513e-01  9.933434e+00
## have          1.165244e+00  6.134287e-01  2.220314e+00
## plan          1.374916e+07  2.546632e-98            NA
## icu           7.074388e-08  1.110138e-20  4.982158e+05
## yesterday     1.909439e+00  9.469676e-01  3.860932e+00
## today         1.052352e+00  5.555244e-01  1.993767e+00
## they          1.638577e+00  5.432932e-01  4.971899e+00
## met           8.409674e-01  3.993832e-01  1.786011e+00
## comments      7.174124e-01  1.995374e-01  2.735483e+00
## after         1.000748e+00  5.351740e-01  1.856378e+00
## signed        2.050047e+00  5.172111e-01  7.901273e+00
## him           2.748255e+00  9.591694e-01  7.995935e+00
## decision      1.097943e+00  3.779591e-01  3.109322e+00
## but           8.532497e-01  3.503299e-01  2.118238e+00
## team          1.953728e+00  8.872755e-01  4.289326e+00
## prognosis     1.202643e+00  4.876817e-01  2.896382e+00
## like          6.312484e-01  2.656635e-01  1.508058e+00
## are           5.579361e+00  1.305334e-01  5.790743e+02
## pts           7.693394e-01  2.551454e-01  2.249317e+00
## pt            2.404234e-01  4.959423e-02  1.381489e+00
## dnr           1.654561e+00  6.234723e-01  4.439366e+00
## discuss       1.664170e+00  6.237237e-01  4.338159e+00
## update        1.483508e+01  2.095409e+00  1.477647e+02
## rounds        3.735567e-01  1.125219e-01  1.191562e+00
## length        1.523778e+00  1.802792e-01  1.727527e+01
## aware         1.246204e+00  4.365623e-01  3.389276e+00
## night         1.502685e+00  8.232716e-01  2.751794e+00

confusion_data(fam_reg)

## Maximum accuracy is acheived at a cutoff of: 0.4

## 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 276  35
##          1  35 102
## 
##                Accuracy : 0.8438
##                  95% CI : (0.8072, 0.8744)
##     No Information Rate : 0.6942
##     P-Value [Acc > NIR] : 2.15e-13
## 
##                   Kappa : 0.6320
##  Mcnemar's Test P-Value : 1
## 
##             Sensitivity : 0.8875
##             Specificity : 0.7445
##          Pos Pred Value : 0.8875
##          Neg Pred Value : 0.7445
##              Prevalence : 0.6942
##          Detection Rate : 0.6161
##    Detection Prevalence : 0.6942
##       Balanced Accuracy : 0.8160
##          F-val Accuracy : 0.8875
## 
##        'Positive' Class : 0

CIM

model_info(cim_reg)

## Waiting for profiling to be done...

## $`Model Summary`
## 
## Call:
## glm(formula = CIM ~ ., family = binomial(link = "logit"), data = cim_tmp)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6501  -0.6198   0.0000   0.3483   2.4737  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -5.190e+01  5.919e+03  -0.009  0.99300    
## code         -2.580e+00  3.252e+00  -0.793  0.42753    
## status       -2.466e+00  1.449e+00  -1.702  0.08883 .  
## dnr           3.715e+00  7.812e-01   4.756 1.98e-06 ***
## family        2.142e+00  7.132e-01   3.003  0.00267 ** 
## dni           5.497e-01  5.356e-01   1.026  0.30475    
## full         -5.707e-01  6.297e-01  -0.906  0.36477    
## care          3.030e+00  1.465e+00   2.068  0.03861 *  
## patient       3.541e-01  5.478e-01   0.646  0.51800    
## dnrdni       -9.071e-01  9.774e-01  -0.928  0.35337    
## would         8.205e-01  3.782e-01   2.169  0.03006 *  
## cmo           5.012e-01  8.049e-01   0.623  0.53354    
## comfort       2.571e-01  4.212e-01   0.610  0.54167    
## be           -2.397e+00  8.383e+00  -0.286  0.77491    
## do            3.704e+01  5.670e+03   0.007  0.99479    
## that         -4.337e-02  3.865e-01  -0.112  0.91064    
## confirmed     1.148e-01  5.103e-01   0.225  0.82204    
## but          -3.306e-01  4.457e-01  -0.742  0.45821    
## want          2.468e-02  6.211e-01   0.040  0.96830    
## resuscitate   4.743e-01  6.600e-01   0.719  0.47234    
## her          -1.672e+01  1.700e+03  -0.010  0.99215    
## measures      3.237e+00  1.945e+00   1.665  0.09601 .  
## made          6.897e-02  7.814e-01   0.088  0.92967    
## goals        -9.306e-02  5.649e-01  -0.165  0.86916    
## like         -3.657e-01  4.421e-01  -0.827  0.40808    
## his          -3.584e-03  7.929e-01  -0.005  0.99639    
## only         -3.616e-02  4.385e-01  -0.082  0.93429    
## will          2.740e-02  4.264e-01   0.064  0.94876    
## meeting      -7.299e-02  4.121e-01  -0.177  0.85940    
## discussion    8.844e-01  6.029e-01   1.467  0.14245    
## she           6.458e+00  3.266e+00   1.977  0.04800 *  
## now           1.718e-01  5.690e-01   0.302  0.76272    
## pressors     -7.352e-01  3.995e-01  -1.840  0.06571 .  
## patients      4.356e-01  4.901e-01   0.889  0.37407    
## decision     -5.868e-01  7.757e-01  -0.756  0.44938    
## daughter      6.044e-01  5.285e-01   1.144  0.25281    
## central       7.007e-01  4.475e-01   1.566  0.11742    
## son          -2.267e-01  3.752e-01  -0.604  0.54575    
## pt           -1.264e-01  1.095e+00  -0.115  0.90810    
## per          -9.911e-01  1.386e+00  -0.715  0.47447    
## hcp           8.898e-02  5.440e-01   0.164  0.87006    
## focus        -4.812e-01  1.110e+00  -0.434  0.66452    
## invasive      5.052e-01  3.804e-01   1.328  0.18410    
## discussed    -5.864e-01  4.285e-01  -1.368  0.17116    
## procedures   -1.576e-01  1.212e+00  -0.130  0.89653    
## aggressive    2.636e-01  4.984e-01   0.529  0.59692    
## they          7.885e-01  7.089e-01   1.112  0.26605    
## escalation    4.450e+01  2.982e+03   0.015  0.98809    
## after         2.331e-01  3.456e-01   0.674  0.50010    
## wife          2.878e-01  5.090e-01   0.565  0.57182    
## this          9.740e-01  4.287e-01   2.272  0.02309 *  
## by           -9.936e-01  3.658e-01  -2.716  0.00660 ** 
## plan          2.798e+01  2.404e+03   0.012  0.99071    
## intubation    1.444e-01  3.569e-01   0.405  0.68570    
## compressions  2.013e+00  1.125e+00   1.790  0.07349 .  
## we            7.862e-01  7.264e-01   1.082  0.27911    
## lines        -5.418e-01  9.956e-01  -0.544  0.58632    
## including    -3.753e-01  3.696e-01  -1.015  0.30987    
## changed      -6.125e-01  4.277e-01  -1.432  0.15211    
## wishes        1.592e+01  2.149e+03   0.007  0.99409    
## prognosis     1.092e+00  6.254e-01   1.747  0.08072 .  
## `if`          4.992e-01  5.910e-01   0.845  0.39821    
## home         -4.553e-01  3.587e-01  -1.269  0.20430    
## given        -8.774e-01  4.299e-01  -2.041  0.04127 *  
## chest        -7.414e-02  3.731e-01  -0.199  0.84247    
## states       -4.128e-01  7.937e-01  -0.520  0.60301    
## make          1.041e-01  5.689e-01   0.183  0.85477    
## have          9.542e-03  3.461e-01   0.028  0.97801    
## any          -4.811e-02  3.218e-01  -0.150  0.88115    
## agreed        1.961e+01  2.494e+03   0.008  0.99373    
## should       -3.508e-01  4.698e-01  -0.747  0.45524    
## morphine      7.178e-01  4.094e-01   1.753  0.07956 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 620.92  on 447  degrees of freedom
## Residual deviance: 320.46  on 376  degrees of freedom
## AIC: 464.46
## 
## Number of Fisher Scoring iterations: 18
## 
## 
## $`OR Summary`
##                        OR         2.5 %       97.5 %
## (Intercept)  2.885650e-23  0.000000e+00 2.646738e+61
## code         7.578228e-02  2.661129e-05 1.069862e+01
## status       8.494883e-02  4.310744e-03 1.330946e+00
## dnr          4.106270e+01  9.858946e+00 2.182962e+02
## family       8.513799e+00  2.196802e+00 2.177634e+01
## dni          1.732679e+00  3.450162e+00 3.255335e+00
## full         5.651378e-01  1.646408e-01 1.992721e+00
## care         2.070067e+01  1.552141e+00 4.413957e+02
## patient      1.424936e+00  4.956526e-01 4.298734e+00
## dnrdni       4.036831e-01  1.263184e-01 1.451456e-01
## would        2.271657e+00  1.087450e+00 4.818322e+00
## cmo          1.650652e+00  3.506301e-01 8.703063e+00
## comfort      1.293139e+00  5.591895e-01 2.938809e+00
## be           9.096011e-02  1.246728e-05 8.403369e+03
## do           1.214421e+16  2.596649e-65          Inf
## that         9.575520e-01  4.481980e-01 2.051531e+00
## confirmed    1.121620e+00  4.051863e-01 3.031724e+00
## but          7.184762e-01  2.988812e-01 1.726774e+00
## want         1.024988e+00  2.995983e-01 2.130950e+00
## resuscitate  1.606940e+00  1.832075e+00 3.504325e+00
## her          5.464928e-08 9.896027e-285 1.474195e+18
## measures     2.545304e+01  9.564055e-01 3.216326e+02
## made         1.071404e+00  2.297233e-01 5.052974e+00
## goals        9.111375e-01  2.985831e-01 2.763026e+00
## like         6.936830e-01  2.898714e-01 1.653204e+00
## his          9.964219e-01  2.092106e-01 4.771258e+00
## only         9.644866e-01  5.535863e-01 2.265932e+00
## will         1.027781e+00  1.076605e+00 1.766525e+00
## meeting      9.296084e-01  5.529164e-01 2.071458e+00
## discussion   2.421420e+00  7.342459e-01 7.949132e+00
## she          6.375368e+02  4.256973e+01 7.300542e+05
## now          1.187417e+00  3.924637e-01 2.426262e+00
## pressors     4.793933e-01  2.133629e-01 1.030478e+00
## patients     1.545934e+00  5.897881e-01 4.067884e+00
## decision     5.561300e-01  1.176322e-01 2.529543e+00
## daughter     1.830116e+00  6.517975e-01 5.221823e+00
## central      2.015112e+00  8.336656e-01 4.861361e+00
## son          7.971873e-01  3.777300e-01 1.231040e+00
## pt           8.812364e-01  1.126270e-01 9.293417e+00
## per          3.711692e-01  2.158453e-02 5.962546e+00
## hcp          1.093062e+00  3.725100e-01 3.175969e+00
## focus        6.180348e-01  6.003227e-02 5.048877e+00
## invasive     1.657388e+00  1.056791e+00 9.875537e-01
## discussed    5.563440e-01  2.359661e-01 1.274601e+00
## procedures   8.541675e-01  6.933518e-02 7.844775e+00
## aggressive   1.301557e+00  4.813022e-01 3.437713e+00
## they         2.200057e+00  5.425214e-01 8.993352e+00
## escalation   2.129323e+19  2.376448e-26          Inf
## after        1.262467e+00  6.407945e-01 2.496608e+00
## wife         1.333482e+00  4.876861e-01 3.626664e+00
## this         2.648543e+00  1.163660e+00 6.288271e+00
## by           3.702260e-01  1.773696e-01 7.479320e-01
## plan         1.420346e+12  8.780789e-24          Inf
## intubation   1.155383e+00  5.740775e-01 2.338198e+00
## compressions 7.488880e+00  8.659329e-01 8.119132e+01
## we           2.194974e+00  5.365602e-01 9.441300e+00
## lines        5.816969e-01  8.566180e-02 4.367200e+00
## including    6.870564e-01  3.300410e-01 1.054915e+00
## changed      5.419887e-01  3.176291e-01 1.246431e+00
## wishes       8.196015e+06  1.421102e-24          Inf
## prognosis    2.980936e+00  8.796571e-01 1.037426e+01
## `if`         1.647480e+00  5.326832e-01 4.237530e+00
## home         6.342540e-01  3.096227e-01 1.269831e+00
## given        4.158655e-01  1.772297e-01 9.628503e-01
## chest        9.285429e-01  4.462324e-01 1.936053e+00
## states       6.618230e-01  1.202202e-01 1.678618e+00
## make         1.109738e+00  3.564363e-01 3.364097e+00
## have         1.009587e+00  6.713153e-01 1.084051e+00
## any          9.530334e-01  8.588421e-01 1.389834e+00
## agreed       3.276214e+08  8.875568e-27          Inf
## should       7.040969e-01  2.744878e-01 1.746755e+00
## morphine     2.050020e+00  9.197753e-01 4.609946e+00

confusion_data(cim_reg)

## Maximum accuracy is acheived at a cutoff of: 0.49

## 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 202  48
##          1  18 180
## 
##                Accuracy : 0.8527
##                  95% CI : (0.8169, 0.8825)
##     No Information Rate : 0.5089
##     P-Value [Acc > NIR] : < 2.2e-16
## 
##                   Kappa : 0.7060
##  Mcnemar's Test P-Value : 3.57e-04
## 
##             Sensitivity : 0.9182
##             Specificity : 0.7895
##          Pos Pred Value : 0.8080
##          Neg Pred Value : 0.9091
##              Prevalence : 0.4911
##          Detection Rate : 0.4509
##    Detection Prevalence : 0.4911
##       Balanced Accuracy : 0.8538
##          F-val Accuracy : 0.8596
## 
##        'Positive' Class : 0

Validation

Clean Data and Generate Covariates

txts <- clean_text(dat$TEXT[dat$COHORT == "validation"], FALSE)

Run Regex

#Strict regex
lim_tmp <- strict_regex(lim_bow, txts)
cod_tmp <- strict_regex(cod_bow, txts)
car_tmp <- strict_regex(car_bow, txts)
pal_tmp <- strict_regex(pal_bow, txts)
fam_tmp <- strict_regex(fam_bow, txts)
#CIM
cim_tmp <- strict_regex(cim_bow, txts)

#Convert list to data.frame
lim_tmp <- to_df(lim_tmp, lim_bow)
cod_tmp <- to_df(cod_tmp, cod_bow)
car_tmp <- to_df(car_tmp, car_bow)
pal_tmp <- to_df(pal_tmp, pal_bow)
fam_tmp <- to_df(fam_tmp, fam_bow)
#CIM
cim_tmp <- to_df(cim_tmp, cim_bow)

Prepare data frames for predictions

#cbind data
LIM <- dat$LIM[dat$COHORT == "validation"]
lim_tmp <- cbind(LIM, lim_tmp)
#Note: NA's must be omitted for COD due to missing observations
COD <- dat$COD[dat$COHORT == "validation"]
cod_tmp <- na.omit(cbind(COD, cod_tmp))

CAR <- dat$CAR[dat$COHORT == "validation"]
car_tmp <- cbind(CAR, car_tmp)
PAL <- dat$PAL[dat$COHORT == "validation"]
pal_tmp <- cbind(PAL, pal_tmp)
FAM <- dat$FAM[dat$COHORT == "validation"]
fam_tmp <- cbind(FAM, fam_tmp)
CIM <- dat$CIM[dat$COHORT == "validation"]
cim_tmp <- cbind(CIM, cim_tmp)

Apply Model

Note: probabilities associated with highest accuracy will be used as cutoffs:

lim_reg: 0.63
cod_reg: 0.59
car_reg: 0.48
pal_reg: 0.47
fam_reg: 0.40
cim_reg: 0.49

#Subset data for validation set
valid <- dat[dat$COHORT == "validation",]
#Subset additional set for COD validation
cod_valid <- na.omit(dat[dat$COHORT == "validation",])

#Generate probability values
LIM_PRED <- predict(lim_reg, lim_tmp, type="response")
COD_PRED <- predict(cod_reg, cod_tmp, type="response")
CAR_PRED <- predict(car_reg, car_tmp, type="response")
PAL_PRED <- predict(pal_reg, pal_tmp, type="response")
FAM_PRED <- predict(fam_reg, fam_tmp, type="response")
CIM_PRED <- predict(cim_reg, cim_tmp, type="response")

#Show distribution of probabilities
boxplot(LIM_PRED, COD_PRED, CAR_PRED, PAL_PRED, FAM_PRED, CIM_PRED, main = "Note Prediction Probability Distributions\n (LIM) (COD) (CAR) (PAL) (FAM) (CIM)")

#Apply predictions given cutoff probability
LIM_PRED <- ifelse(LIM_PRED > 0.63, 1, 0)
COD_PRED <- ifelse(COD_PRED > 0.59, 1, 0)
CAR_PRED <- ifelse(CAR_PRED > 0.48, 1, 0)
PAL_PRED <- ifelse(PAL_PRED > 0.47, 1, 0)
FAM_PRED <- ifelse(FAM_PRED > 0.40, 1, 0)
CIM_PRED <- ifelse(CIM_PRED > 0.49, 1, 0)

valid <- cbind(valid,
             LIM_PRED,
             #COD_PRED,
             CAR_PRED,
             PAL_PRED,
             FAM_PRED,
             CIM_PRED)

cod_valid <- cbind(cod_valid, COD_PRED)


results <- rbind(stat_gen(valid, "LIM_PRED"),
                 stat_gen(cod_valid, "COD_PRED"),
                 stat_gen(valid, "CAR_PRED"),
                 stat_gen(valid, "PAL_PRED"),
                 stat_gen(valid, "FAM_PRED"),
                 stat_gen(valid, "CIM_PRED"))

row.names(results) <- c("LIM", "COD", "CAR", "PAL", "FAM", "CIM")

results <- model_stats(results)

print(results)

##     tp  tn fp fn accuracy precision sensitivity specificity     F1
## LIM 41 114  7 30    80.73     85.42       57.75       94.21  68.91
## COD 88  41  8  5    90.85     91.67       94.62       83.67  93.12
## CAR 13 151 14 14    85.42     48.15       48.15       91.52  48.15
## PAL  1 191  0  0   100.00    100.00      100.00      100.00 100.00
## FAM 20 141 19 12    83.85     51.28       62.50       88.12  56.34
## CIM 49  89 25 29    71.88     66.22       62.82       78.07  64.47

results$domain <- row.names(results)

#Save Results
write.csv(results, file = "/Users/Edward/Desktop/bow_model_results28Jan18.csv", row.names = T)

Graphing

#Remove tp, tn, fp, fn
results$tp <- NULL
results$tn <- NULL
results$fp <- NULL
results$fn <- NULL

#Melt results for long data
results <- melt(results)

## Using domain as id variables

ggplot(results, aes(x = domain, y = value, label = value, group = variable)) + 
  geom_bar(stat = "identity", aes(fill = variable), colour = "black" , position = position_dodge()) +
  xlab("Domain") + ylab("Value") +
  ggtitle("Model Metrics By Domain") +
  geom_text(aes(x = domain, y = value, label = round(value, 1), group = variable),
            vjust = -1,
            position = position_dodge(width = 0.9),
            size = 2, fontface = "bold") +
  scale_fill_brewer(palette = "Set3") + #Set3 is good, empty yields blues 
  theme_minimal()# +

  #geom_text_repel(data=results, aes(label=value))

Goals of Care Bag of Words Logistic Regression

Edward T. Moseley

January 27, 2018

Introduction

Strategy

Utility Functions

Libraries

load_data

strict_regex

to_df()

clean_text

bow

plot_bow

bow_subset

cohort_gen

standardize

check_variance

model_info

confusion_data

roc_plot

stat_gen

model_stats

Load Annotation Data

Load Map for training/validation sets

Load training/validation sets and generate cohort info

Generate CIM Measure

Subset Phrases

Clean Phrases & Text

Create a Bag of Words

BoW Contents

Subset top tokens & remove common tokens from each phrase dictionary

Run Regex

Create Metrics for Analysis and Check Variance

Prepare data frames for logistic regression

Perform Logistic Regression

Generate Model Statistics and Visualizations

LIM

COD

CAR

PAL

FAM

CIM

Validation

Clean Data and Generate Covariates

Run Regex

Prepare data frames for predictions

Apply Model

Graphing