Small companies are often self-insured. It is critical that they have an accurate understanding of the current and future pharmacy requirements among their employee’s and insured dependents. The onset of diabetes is an ideal condition to leverage machine learning thanks to the often predictable disease progression.
This notebook will demonstrate the concept of how one can predict the probability of an individual developing diabetes within the next calendar year using 100% synthetic data. I must stress that while this data mimics the type of information that would be available to an insurer or pharmacy provider, it is completely synthetic and doesn’t reflect any real people, conditions, or clients.
The important features (columns) for this prediction is
MEMBERID
THERAPEUTIC_CLASS
DAYS_SUPPLY
Here we generate some synthetic data according to realistic parameters surrounding pharmacy claims:
~25% of all members have least one claim
members have between 0 and 10 therapeutic classes prescribed in one year
members days supply for a year is ~200 days +/- 100 days for each class n.b.: the purpose here is illustrate methodology, not recreate the nature of pharmacy claims. Each class would in truth have a distinct mean days supply. These parameters are sufficiently realistic to perform the analysis
#generate random numbers
id_numbers <- round(runif(n = 1000,min = 1000,max = 2000),0)
#paste numbers to an identifier
memberIDs <- paste0('memberID_',id_numbers)
#importtherapeutic classes
therapeutic_classes <- read_excel("Untitled spreadsheet.xlsx") %>% filter(!is.na(`Therapeutic Category` ))
This will be carried out using a series of random draws over a sample space of variables mentioned above, as follows:
a) In the course of two years, each MEMBERID will have been either prescribed 0 or non-zero number of medications. If zero, the number of THERAPEUTIC_CLASS will be determined according to a random Poisson distribution.
b) For each THERAPEUTIC_CLASS that a member is prescribed, the DAYSSUPPLY will be assigned according to a Normal distribution.
c) The data will then be combined randomly as it would be acquired by the data scientist from a data warehouse
#create a series of loops to iterate through each variable.
#combine the results together at the end of each loop.
##assume 25% of all member had at least 1 Rx (this is normal according to my experience)
set.seed(1)
member_rx <- head(sample(x = memberIDs),length(memberIDs)/4)
#ensure we have ~25%
length(unique(member_rx))/length(memberIDs)
## [1] 0.222
The algorithm is in-effect a time-series. Thus we first will calculate synthetic data for each of the years 2020:2023.
years <- c(2021,2022,2023)
pharm_data <- NULL
for (y in seq(from =1, to =length(years), by= 1)) {
current_year <- years[y]
set.seed(y)
#iterate through all member_rx and find which therapeutic classes they have been prescribed
therapeutic_classes_all <- sample(therapeutic_classes$`Therapeutic Category`)
#R doesn't like spaces and special characters, so they need to be cleaned up here
therapeutic.classes.clean <- gsub(pattern = " ",replacement = "",x = therapeutic_classes_all)
therapeutic.classes.clean <- gsub(pattern = "\\(",replacement = "",x = therapeutic.classes.clean)
therapeutic.classes.clean <- gsub(pattern = "\\)",replacement = "",x = therapeutic.classes.clean)
therapeutic.classes.clean <- gsub(pattern = "\n",replacement = "",x = therapeutic.classes.clean)
therapeutic.classes.clean <- gsub(pattern = "/",replacement = ".",x = therapeutic.classes.clean)
therapeutic.classes.clean <- gsub(pattern = ",",replacement = ".",x = therapeutic.classes.clean)
therapeutic.classes.clean <- gsub(pattern = "-",replacement = ".",x = therapeutic.classes.clean)
for (i in seq(from =1, to = length(member_rx))) {
set.seed(i)
current_member <- member_rx[i]
#how many therapeutic classes have they been prescribed
therap_classes_assigned <- round(runif(n = 1,min = 1,max = 10),0)
#randomize classes
random_classes <- sample(therapeutic.classes.clean)[i:therap_classes_assigned]
#for each random_class assigned to this member, assign some days supply
for (k in seq(from=1, to = length(random_classes),by = 1)) {
spp_random_class <- random_classes[k]
tmp_days_supply <- round(rnorm(n = 1,mean = 300,sd = 300),0)
tmp_days_supply <- ifelse(tmp_days_supply < 0,0,tmp_days_supply)
tmp_pharm_data <- data.frame(cbind(current_member,spp_random_class,tmp_days_supply,current_year))
pharm_data <- rbind(pharm_data,tmp_pharm_data)
}
}
}
names(pharm_data) <- c("MEMBERID","THERAPEUTIC_CLASS","DAYS_SUPPLY","CALENDAR_YEAR")
saveRDS(pharm_data, "pharm_data.RDS")
The goal here predict diabetics in cy2024. To do this we need to be able to look at the prescription utilization of all members in 2023 and determine how that impacts the probability of them developing diabetes in 2024. To generate an unbiased training set, we need to find actual diabetics, and then track backward in time to before they had developed diabetes. We can use them as our positive label in the training set. The negative label will come from patients that do not have diabetes in 2023, and have been patients since at least 2021.
Step1. Find all diabetics, and then find the year of their initial diagnosis
DiabDataOnly <- pharm_data %>% filter(THERAPEUTIC_CLASS == "BloodGlucoseRegulators")
DiabSummary <- DiabDataOnly %>% group_by(MEMBERID) %>% summarise(FirstYear = min(CALENDAR_YEAR))
table(DiabSummary$FirstYear)
##
## 2021 2022 2023
## 184 19 10
This table shows us that there were 187 diabetics diagnosed in 2021, 20 diagnosed in 2022, and 10 diagnosed in 2023.
Step2. Create datasets that comprise of each diabetics medications in the year(s) prior to their diagnosis
tmp_diab_22 <- DiabSummary %>% filter(FirstYear == 2022)
#find memberIDs that were diagnosed in 22
diab_22_members <- unique(tmp_diab_22$MEMBERID)
#find those members claims in 21
pre.diab.claim21 <- pharm_data %>% filter(CALENDAR_YEAR == 2021) %>% filter(MEMBERID %in% diab_22_members)
pre.diab.claim21$DAYS_SUPPLY <- as.numeric(pre.diab.claim21$DAYS_SUPPLY)
pre.diab.claim21Summary <- pre.diab.claim21 %>% group_by(MEMBERID,THERAPEUTIC_CLASS) %>% summarise(TotalDS = sum(DAYS_SUPPLY,na.rm=T))
#convert to wide format
library(tidyr)
pre_diab_wide <- pre.diab.claim21Summary %>% tidyr::spread(key = THERAPEUTIC_CLASS,value = TotalDS,fill = NA)
pre_diab_wide$label <- 1
#remove the NA column
pre_diab_wide <- pre_diab_wide %>% select(., -c("<NA>"))
pre_diab_wide[1:10,1:10]
## # A tibble: 10 × 10
## # Groups: MEMBERID [10]
## MEMBERID Analgesics Anesthetics Anti.inflammatoryAgents Antibacterials
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 memberID_1013 NA NA 427 NA
## 2 memberID_1047 556 416 401 153
## 3 memberID_1069 505 NA 647 392
## 4 memberID_1114 504 494 NA 333
## 5 memberID_1124 NA NA 767 NA
## 6 memberID_1141 594 262 763 133
## 7 memberID_1245 NA NA NA 706
## 8 memberID_1283 113 746 262 303
## 9 memberID_1313 223 275 813 348
## 10 memberID_1322 0 65 273 466
## # … with 5 more variables: Anticonvulsants <dbl>, AntidementiaAgents <dbl>,
## # Antidepressants <dbl>, Antidotes.Deterrents.andToxicologicAgents <dbl>,
## # Antiemetics <dbl>
We now have a matrix with a row for each member, a column for each therapeutic class, and the total Days Supply for each.
#convert to numeric
pharm_data$DAYS_SUPPLY <- as.numeric(pharm_data$DAYS_SUPPLY)
#group by members and therapeutic classes then sum days supply (to remove dups)
#create summary for all members and all years
TherapClassSummary <- pharm_data %>% group_by(MEMBERID,THERAPEUTIC_CLASS) %>% summarise(TotalDS =sum(DAYS_SUPPLY,na.rm = T))
#remove missing data
TherapClassSummary$TotalDS <- ifelse(is.na(TherapClassSummary$TotalDS),0,TherapClassSummary$TotalDS)
#spread wide
TherapWide <- TherapClassSummary %>% tidyr::spread(key = THERAPEUTIC_CLASS,value = TotalDS)
#find never-diabetics and store their uniqueIDs
NonDiab <- TherapWide %>% filter(BloodGlucoseRegulators == 0)
NoDiabIDs <- unique(NonDiab$MEMBERID)
#filter all the claims from the never-diabetics from 2021 only
AllClaimsNoDiabs <- pharm_data %>% filter(CALENDAR_YEAR == 2021) %>% filter(MEMBERID %in% NoDiabIDs) %>% group_by(MEMBERID,THERAPEUTIC_CLASS) %>% summarise(TotalDS = sum(DAYS_SUPPLY,na.rm=T))
#convert the Rx from never-diabetics in 2021 to the wide format
NoDiabWide <- AllClaimsNoDiabs %>% tidyr::spread(key = THERAPEUTIC_CLASS,value = TotalDS,fill = NA)
#replace N/A with 0 Days Supply
NoDiabWide[2:ncol(NoDiabWide)] <- lapply(NoDiabWide[2:ncol(NoDiabWide)], function(x) ifelse(is.na(x),0,x))
#add label
NoDiabWide$label <- 0
NoDiabWide <- NoDiabWide %>% select(., -"<NA>")
NoDiabWide[1:7,1:7]
FALSE # A tibble: 7 × 7
FALSE # Groups: MEMBERID [7]
FALSE MEMBERID Analgesics Anesthetics Anti.inflammatoryAgents Antibacterials
FALSE <chr> <dbl> <dbl> <dbl> <dbl>
FALSE 1 memberID_1019 445 0 613 357
FALSE 2 memberID_1088 0 0 1036 0
FALSE 3 memberID_1103 0 0 649 0
FALSE 4 memberID_1486 474 460 0 0
FALSE 5 memberID_1830 0 346 518 913
FALSE 6 memberID_1954 690 0 17 385
FALSE 7 <NA> NA NA NA NA
FALSE # … with 2 more variables: Anticonvulsants <dbl>, AntidementiaAgents <dbl>
This training set is specifically from claims in 2021. We can use 2022 as a dev set to hyper-parameter tune and 2023 as the final test set.
train <- rbind(pre_diab_wide,NoDiabWide)
table(train$label)
##
## 0 1
## 6 19
We have 7 members without diabetes and 20 that are pre-diabetic. There are two issues here, one obvious and one less so. This is obviously an extremely small data set. To generate a more authentic data set we would be starting with ~10,000x more data. Less obvious is the ratio of pre-diabetics to non-diabetics. The number of members that would pass these filters as being pass-able as actual training data for becoming diabetic should mimic the fraction of the population that develops diabetes in a given year, roughly 5-10% of the total population. However, even with these low numbers we can still carry out the procedure to illustrate how it can be done with even drastically larger datasets.
Finally the fun part! We have accessed the data (in this example created synthetic data), performed feature engineering, created a reasonable training set, and now we can let the algorithm run. I will use Ranger, which is a CART model available in R that works very well.
library(ranger)
#fix missing data
train[2:ncol(train)] <- lapply(train[2:ncol(train)], function(x) ifelse(is.na(x),0,x))
#build model
cart.model.1 <- ranger(formula = label ~., data = train, num.trees = 10,mtry = ncol(train)/3,importance = 'impurity',probability = T)
#Analyze important features
importanceDF <- data.frame(cart.model.1$variable.importance)
importanceDF <- importanceDF %>% arrange(-cart.model.1.variable.importance)
par(mai=c(2,1,1,1))
barplot(importanceDF$cart.model.1.variable.importance,names.arg = row.names(importanceDF),las=2,cex.names=.75)
Feature Selection
We built a CART model using the training set and then plotted the features, ranked and sorted according to their Gini Impurity Index. Its clear by this visualization that most of the therpeutic classes have 0 predictive power, but a small number have alot. In production, I would re-build the model using only those features that have an importance greater than a specific cutoff.
The next step is to use this predictive model to make predictions. The fun part. We need to be careful here, because there is a distinct possibility of information from the training set bleeding over into the test. To prevent this we analyze only data from 2022
#filter out 2022
AllClaimsNoDiabs <- pharm_data %>% filter(CALENDAR_YEAR == 2022) %>% group_by(MEMBERID,THERAPEUTIC_CLASS) %>%
summarise(TotalDS = sum(DAYS_SUPPLY,na.rm=T))
#convert the Rx from never-diabetics in 2021 to the wide format
test <- AllClaimsNoDiabs %>% tidyr::spread(key = THERAPEUTIC_CLASS,value = TotalDS,fill = NA)
#replace N/A with 0 Days Supply
test[2:ncol(test)] <- lapply(test[2:ncol(test)], function(x) ifelse(is.na(x),0,x))
test$pDiabetes <- predict(object = cart.model.1,data = test,type = 'response')$predictions[,1]
hist(test$pDiabetes,main = "Histogram of Model Output", xlab = "pDiabetes", col = 'dark blue')
Its always a good idea to look at the probability output from the Ranger model to see if the predictions make sense. In this case they do not, but that is because I’m using synthetic data. If this were based on actual insurance claims, we would expect to see a much longer right-tailed skew and a peak somewhere around 20%, something akin to what is shown below:
random_output <- rnbinom(n = 10000,size = 10,.7)
hist(random_output,xlab = "",axes = F, col = 'dark blue',main = "Example Output from Real-World Data")
#label the test data so we can compare
test$label <- ifelse(test$BloodGlucoseRegulators > 0 ,1, 0)
test$prediction <- ifelse(test$pDiabetes >= .5, 1, 0)
Actual <- test$label
Predicted <- test$prediction
table(Actual,Predicted)
## Predicted
## Actual 0 1
## 0 13 59
## 1 5 148
This confusion matrix gives a high level view of how the model performed. Some simple calculations:
Accuracy: \((151+8)/(151+8+64+2) = 70.7%\)
TruePositiveRate: \(151/157 = 98.6\%\)
FalsePositiveRate: \((64/151+64)=29.7%\)
These static numbers are useful in understanding the overall performance of the model. We can also plot the Receiver Operating Characterstic (ROC) curve. This output takes into consideration different decision boundary thresholds to tell a holistic picture of how well the model works. The model was very good at predicting diabetics, but not nearly as good as predicting non-diabetics.
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
model.eval.roc <- pROC::roc(test$label,test$pDiabetes)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(model.eval.roc)
An AUC of 0.593 is not great. Futher work would be required to understand why the model missed so many predictions when the label was ‘0’. This could be improved through hyperparameter tuning and of course adding more data. Once those attempts have been exhausted we would next try to add a case weight to penalize missing the predictions when the label is ‘0’.