Using a Random Forest algorithm to model diabetes

Small companies are often self-insured. It is critical that they have an accurate understanding of the current and future pharmacy requirements among their employee’s and insured dependents. The onset of diabetes is an ideal condition to leverage machine learning thanks to the often predictable disease progression.

This notebook will demonstrate the concept of how one can predict the probability of an individual developing diabetes within the next calendar year using 100% synthetic data. I must stress that while this data mimics the type of information that would be available to an insurer or pharmacy provider, it is completely synthetic and doesn’t reflect any real people, conditions, or clients.

The steps in the process:

  1. Generate Synthetic Data
    1. To avoid using any proprietary or personal health information (PHI) we will have to generate synthetic data this of the same format as what an insurer might have
  1. Prep the data
    1. Describe: What is the nature of the data that we have access to for this project?
    2. Select: Initial features for model building (use common-sense and subject matter expertise)
    3. Trim: Find the rows that are relevant for current model building
    4. Clean: Replace or remove any row with missing data
    5. Label: Typically Mark each row with a 1/0 depending on the dependent variable being modeled
    6. Collect: Bring aggregated and summarized data into data science environment (RStudio)
  2. Model Building (demo will pick up here)
    1. Train/Test Split
    2. Initial Model build using Ranger
    3. Feature Selection: remove features with low/zero importance
    4. Final Model Build on Train
    5. Evaluate model on Test. Iterate as necessary
  3. Deploy
    1. Generate a slide demonstrating how model was built and results of testing
    2. Send model to production (Shiny App, BI tool, Static Report)

Generate:

  1. The important features (columns) for this prediction is

    1. MEMBERID

    2. THERAPEUTIC_CLASS

    3. DAYS_SUPPLY

  2. Here we generate some synthetic data according to realistic parameters surrounding pharmacy claims:

    1. ~25% of all members have least one claim

    2. members have between 0 and 10 therapeutic classes prescribed in one year

    3. members days supply for a year is ~200 days +/- 100 days for each class n.b.: the purpose here is illustrate methodology, not recreate the nature of pharmacy claims. Each class would in truth have a distinct mean days supply. These parameters are sufficiently realistic to perform the analysis

#generate random numbers
id_numbers <- round(runif(n = 1000,min = 1000,max = 2000),0)
#paste numbers to an identifier
memberIDs <- paste0('memberID_',id_numbers)
#importtherapeutic classes
therapeutic_classes <- read_excel("Untitled spreadsheet.xlsx") %>% filter(!is.na(`Therapeutic Category` ))
  1. We need to put this together in a single dataframe that follows a realistic pattern
    1. This will be carried out using a series of random draws over a sample space of variables mentioned above, as follows:

      a) In the course of two years, each MEMBERID will have been either prescribed 0 or non-zero number of medications. If zero, the number of THERAPEUTIC_CLASS will be determined according to a random Poisson distribution.

      b) For each THERAPEUTIC_CLASS that a member is prescribed, the DAYSSUPPLY will be assigned according to a Normal distribution.

      c) The data will then be combined randomly as it would be acquired by the data scientist from a data warehouse

#create a series of loops to iterate through each variable. 
#combine the results together at the end of each loop.
##assume 25% of all member had at least 1 Rx (this is normal according to my experience)
set.seed(1)
member_rx <- head(sample(x = memberIDs),length(memberIDs)/4)
#ensure we have ~25% 
length(unique(member_rx))/length(memberIDs)
## [1] 0.222

The algorithm is in-effect a time-series. Thus we first will calculate synthetic data for each of the years 2020:2023.

Generate Data

years <- c(2021,2022,2023)
pharm_data <- NULL
for (y in seq(from =1, to =length(years), by= 1)) {
current_year <- years[y]
set.seed(y)
#iterate through all member_rx and find which therapeutic classes they have been prescribed
therapeutic_classes_all <- sample(therapeutic_classes$`Therapeutic Category`)
#R doesn't like spaces and special characters, so they need to be cleaned up here
therapeutic.classes.clean <- gsub(pattern = " ",replacement = "",x = therapeutic_classes_all)

therapeutic.classes.clean <- gsub(pattern = "\\(",replacement = "",x = therapeutic.classes.clean)

therapeutic.classes.clean <- gsub(pattern = "\\)",replacement = "",x = therapeutic.classes.clean)

therapeutic.classes.clean <- gsub(pattern = "\n",replacement = "",x = therapeutic.classes.clean)

therapeutic.classes.clean <- gsub(pattern = "/",replacement = ".",x = therapeutic.classes.clean)

therapeutic.classes.clean <- gsub(pattern = ",",replacement = ".",x = therapeutic.classes.clean)

therapeutic.classes.clean <- gsub(pattern = "-",replacement = ".",x = therapeutic.classes.clean)



for (i in seq(from =1, to = length(member_rx))) {
  set.seed(i)
  current_member <- member_rx[i]
  #how many therapeutic classes have they been prescribed
  therap_classes_assigned <- round(runif(n = 1,min = 1,max = 10),0)
  #randomize classes 
  random_classes <- sample(therapeutic.classes.clean)[i:therap_classes_assigned]
  #for each random_class assigned to this member, assign some days supply
  for (k in seq(from=1, to = length(random_classes),by = 1)) {
    spp_random_class <- random_classes[k]
    tmp_days_supply     <-  round(rnorm(n = 1,mean = 300,sd = 300),0)
    tmp_days_supply <- ifelse(tmp_days_supply < 0,0,tmp_days_supply)
    tmp_pharm_data        <- data.frame(cbind(current_member,spp_random_class,tmp_days_supply,current_year))
    pharm_data            <- rbind(pharm_data,tmp_pharm_data)
    }
}
}
names(pharm_data) <- c("MEMBERID","THERAPEUTIC_CLASS","DAYS_SUPPLY","CALENDAR_YEAR")

saveRDS(pharm_data, "pharm_data.RDS")

Feature Engineering

The goal here predict diabetics in cy2024. To do this we need to be able to look at the prescription utilization of all members in 2023 and determine how that impacts the probability of them developing diabetes in 2024. To generate an unbiased training set, we need to find actual diabetics, and then track backward in time to before they had developed diabetes. We can use them as our positive label in the training set. The negative label will come from patients that do not have diabetes in 2023, and have been patients since at least 2021.

Step1. Find all diabetics, and then find the year of their initial diagnosis

DiabDataOnly <- pharm_data %>% filter(THERAPEUTIC_CLASS == "BloodGlucoseRegulators") 
DiabSummary <- DiabDataOnly %>% group_by(MEMBERID) %>% summarise(FirstYear = min(CALENDAR_YEAR))
table(DiabSummary$FirstYear)
## 
## 2021 2022 2023 
##  184   19   10

This table shows us that there were 187 diabetics diagnosed in 2021, 20 diagnosed in 2022, and 10 diagnosed in 2023.

Step2. Create datasets that comprise of each diabetics medications in the year(s) prior to their diagnosis

tmp_diab_22 <- DiabSummary %>% filter(FirstYear == 2022)
#find memberIDs that were diagnosed in 22
diab_22_members <- unique(tmp_diab_22$MEMBERID)
#find those members claims in 21
pre.diab.claim21 <- pharm_data %>% filter(CALENDAR_YEAR == 2021)  %>% filter(MEMBERID %in% diab_22_members)
pre.diab.claim21$DAYS_SUPPLY <- as.numeric(pre.diab.claim21$DAYS_SUPPLY)
pre.diab.claim21Summary <- pre.diab.claim21 %>% group_by(MEMBERID,THERAPEUTIC_CLASS) %>% summarise(TotalDS = sum(DAYS_SUPPLY,na.rm=T))
#convert to wide format
library(tidyr)
pre_diab_wide <- pre.diab.claim21Summary %>% tidyr::spread(key = THERAPEUTIC_CLASS,value = TotalDS,fill = NA)
pre_diab_wide$label <- 1
#remove the NA column
pre_diab_wide <- pre_diab_wide %>% select(., -c("<NA>"))
pre_diab_wide[1:10,1:10]
## # A tibble: 10 × 10
## # Groups:   MEMBERID [10]
##    MEMBERID      Analgesics Anesthetics Anti.inflammatoryAgents Antibacterials
##    <chr>              <dbl>       <dbl>                   <dbl>          <dbl>
##  1 memberID_1013         NA          NA                     427             NA
##  2 memberID_1047        556         416                     401            153
##  3 memberID_1069        505          NA                     647            392
##  4 memberID_1114        504         494                      NA            333
##  5 memberID_1124         NA          NA                     767             NA
##  6 memberID_1141        594         262                     763            133
##  7 memberID_1245         NA          NA                      NA            706
##  8 memberID_1283        113         746                     262            303
##  9 memberID_1313        223         275                     813            348
## 10 memberID_1322          0          65                     273            466
## # … with 5 more variables: Anticonvulsants <dbl>, AntidementiaAgents <dbl>,
## #   Antidepressants <dbl>, Antidotes.Deterrents.andToxicologicAgents <dbl>,
## #   Antiemetics <dbl>

We now have a matrix with a row for each member, a column for each therapeutic class, and the total Days Supply for each.

This concludes the first part of the model building where we can identify labeled data representing those members that have diabetes. The next step is to perform the same analysis to create a matched data set of members without diabetes, ideally over the same timespan.

  1. Find our diabetic members (all-time)
    1. this is done by first creating the wide dataset and summing the total days supply for all therapeutic classes over all-time
    2. Non-diabetics are any member that have a sum of 0 for “Blood Glucose Regulators”
    3. Filter those members that also received at least one Rx in 2020
    4. Label as non-diabetics (0)
    5. Combine with positive data set
#convert to numeric
pharm_data$DAYS_SUPPLY <- as.numeric(pharm_data$DAYS_SUPPLY)
#group by members and therapeutic classes then sum days supply (to remove dups)
#create summary for all members and all years
TherapClassSummary <- pharm_data %>% group_by(MEMBERID,THERAPEUTIC_CLASS) %>% summarise(TotalDS =sum(DAYS_SUPPLY,na.rm = T))
#remove missing data
TherapClassSummary$TotalDS <- ifelse(is.na(TherapClassSummary$TotalDS),0,TherapClassSummary$TotalDS)
#spread wide
TherapWide <- TherapClassSummary %>% tidyr::spread(key = THERAPEUTIC_CLASS,value = TotalDS)
#find never-diabetics and store their uniqueIDs
NonDiab <- TherapWide %>% filter(BloodGlucoseRegulators == 0)
NoDiabIDs <- unique(NonDiab$MEMBERID)
#filter all the claims from the never-diabetics from 2021 only
AllClaimsNoDiabs <- pharm_data %>% filter(CALENDAR_YEAR == 2021) %>% filter(MEMBERID %in% NoDiabIDs) %>% group_by(MEMBERID,THERAPEUTIC_CLASS) %>% summarise(TotalDS = sum(DAYS_SUPPLY,na.rm=T))
#convert the Rx from never-diabetics in 2021 to the wide format
NoDiabWide <- AllClaimsNoDiabs %>% tidyr::spread(key = THERAPEUTIC_CLASS,value = TotalDS,fill = NA)
#replace N/A with 0 Days  Supply
NoDiabWide[2:ncol(NoDiabWide)] <- lapply(NoDiabWide[2:ncol(NoDiabWide)], function(x) ifelse(is.na(x),0,x))
#add label
NoDiabWide$label <- 0
NoDiabWide <- NoDiabWide %>% select(., -"<NA>")
NoDiabWide[1:7,1:7]
FALSE # A tibble: 7 × 7
FALSE # Groups:   MEMBERID [7]
FALSE   MEMBERID      Analgesics Anesthetics Anti.inflammatoryAgents Antibacterials
FALSE   <chr>              <dbl>       <dbl>                   <dbl>          <dbl>
FALSE 1 memberID_1019        445           0                     613            357
FALSE 2 memberID_1088          0           0                    1036              0
FALSE 3 memberID_1103          0           0                     649              0
FALSE 4 memberID_1486        474         460                       0              0
FALSE 5 memberID_1830          0         346                     518            913
FALSE 6 memberID_1954        690           0                      17            385
FALSE 7 <NA>                  NA          NA                      NA             NA
FALSE # … with 2 more variables: Anticonvulsants <dbl>, AntidementiaAgents <dbl>

Combine the positive and negative labeled data into one training set

This training set is specifically from claims in 2021. We can use 2022 as a dev set to hyper-parameter tune and 2023 as the final test set.

train <- rbind(pre_diab_wide,NoDiabWide)
table(train$label)
## 
##  0  1 
##  6 19

We have 7 members without diabetes and 20 that are pre-diabetic. There are two issues here, one obvious and one less so. This is obviously an extremely small data set. To generate a more authentic data set we would be starting with ~10,000x more data. Less obvious is the ratio of pre-diabetics to non-diabetics. The number of members that would pass these filters as being pass-able as actual training data for becoming diabetic should mimic the fraction of the population that develops diabetes in a given year, roughly 5-10% of the total population. However, even with these low numbers we can still carry out the procedure to illustrate how it can be done with even drastically larger datasets.

Model Building

Finally the fun part! We have accessed the data (in this example created synthetic data), performed feature engineering, created a reasonable training set, and now we can let the algorithm run. I will use Ranger, which is a CART model available in R that works very well.

library(ranger)
#fix missing data
train[2:ncol(train)] <- lapply(train[2:ncol(train)], function(x) ifelse(is.na(x),0,x))
#build model
cart.model.1 <- ranger(formula = label ~., data = train, num.trees = 10,mtry = ncol(train)/3,importance = 'impurity',probability = T)
#Analyze important features
importanceDF <- data.frame(cart.model.1$variable.importance)
importanceDF <- importanceDF %>% arrange(-cart.model.1.variable.importance)
par(mai=c(2,1,1,1))
barplot(importanceDF$cart.model.1.variable.importance,names.arg = row.names(importanceDF),las=2,cex.names=.75)

Feature Selection

We built a CART model using the training set and then plotted the features, ranked and sorted according to their Gini Impurity Index. Its clear by this visualization that most of the therpeutic classes have 0 predictive power, but a small number have alot. In production, I would re-build the model using only those features that have an importance greater than a specific cutoff.

Test Set

The next step is to use this predictive model to make predictions. The fun part. We need to be careful here, because there is a distinct possibility of information from the training set bleeding over into the test. To prevent this we analyze only data from 2022

  1. Keep the train and test sets separated in time (2021 & 2022, respectively)
#filter out 2022 
AllClaimsNoDiabs <- pharm_data %>% filter(CALENDAR_YEAR == 2022) %>% group_by(MEMBERID,THERAPEUTIC_CLASS) %>%
  summarise(TotalDS = sum(DAYS_SUPPLY,na.rm=T))
#convert the Rx from never-diabetics in 2021 to the wide format
test <- AllClaimsNoDiabs %>% tidyr::spread(key = THERAPEUTIC_CLASS,value = TotalDS,fill = NA)
#replace N/A with 0 Days  Supply
test[2:ncol(test)] <- lapply(test[2:ncol(test)], function(x) ifelse(is.na(x),0,x))
test$pDiabetes <- predict(object = cart.model.1,data = test,type = 'response')$predictions[,1]
hist(test$pDiabetes,main = "Histogram of Model Output", xlab = "pDiabetes", col = 'dark blue')

Results of predictions.

Its always a good idea to look at the probability output from the Ranger model to see if the predictions make sense. In this case they do not, but that is because I’m using synthetic data. If this were based on actual insurance claims, we would expect to see a much longer right-tailed skew and a peak somewhere around 20%, something akin to what is shown below:

random_output <- rnbinom(n = 10000,size = 10,.7)
hist(random_output,xlab = "",axes = F, col = 'dark blue',main = "Example Output from Real-World Data")

Model Evaluation

#label the test data so we can compare
test$label <- ifelse(test$BloodGlucoseRegulators > 0 ,1, 0)
test$prediction <- ifelse(test$pDiabetes >= .5, 1, 0)
Actual <- test$label
Predicted <- test$prediction
table(Actual,Predicted)
##       Predicted
## Actual   0   1
##      0  13  59
##      1   5 148

This confusion matrix gives a high level view of how the model performed. Some simple calculations:

Accuracy: \((151+8)/(151+8+64+2) = 70.7%\)

TruePositiveRate: \(151/157 = 98.6\%\)

FalsePositiveRate: \((64/151+64)=29.7%\)

These static numbers are useful in understanding the overall performance of the model. We can also plot the Receiver Operating Characterstic (ROC) curve. This output takes into consideration different decision boundary thresholds to tell a holistic picture of how well the model works. The model was very good at predicting diabetics, but not nearly as good as predicting non-diabetics.

library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
model.eval.roc <- pROC::roc(test$label,test$pDiabetes)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(model.eval.roc)

An AUC of 0.593 is not great. Futher work would be required to understand why the model missed so many predictions when the label was ‘0’. This could be improved through hyperparameter tuning and of course adding more data. Once those attempts have been exhausted we would next try to add a case weight to penalize missing the predictions when the label is ‘0’.