This is a story of D2Hawkeye, a medical data mining company located in Waltham, Massachusetts. The company was founded by Chris Kryder, a medical doctor who was an MIT physician in the 1990s. The company was founded in 2001. The company combines expert knowledge and databases with analytics to improve quality and course management in health care. The overall process that D2Hawkeye uses is as follows. It starts with medical claims that consist of diagnoses, procedures, and drugs. These medical claims are then processed via process of aggregation, cleaning, and normalization. This data then enters secure databases on which predictive models are applied. The output of predictive models are specific reports that give insight to the various questions that D2Hawkeye aspires to answer.
The company tries to improve health care case management. Specifically, it tries to identify high-risk patients, work with patients to manage treatment and associated costs, and arrange specialist care.
Medical costs, of course, is a serious matter both for the patient as well as the provider. Being able to predict this cost is an important problem that interests both the patients as well as the providers. The overall goal of D2Hawkeye is to improve the quality of cost predictions. D2Hawkeye had many different types of clients. The most important were third party administrators of medical claims. Third party administrators are companies hired by the employer who manage the claims of the employees. Now the type of clients were case management companies, benefits consultants, and health plans.
To analyze the data, the company used what we call a pre-analytics approach. This was based on the human judgment of physicians who manually analyze patient histories and developed medical rules. Of course, this involved human judgment, utilized a limited set of data, it was often costly, and somewhat inefficient. The key question we analyze in this lecture is “Can we use analytics instead?”
setwd("C:/Users/jzchen/Documents/Courses/Analytics Edge/Unit_4_Trees")
claims <- read.csv("ClaimsData.csv")
str(claims)
## 'data.frame': 458005 obs. of 16 variables:
## $ age : int 85 59 67 52 67 68 75 70 67 67 ...
## $ alzheimers : int 0 0 0 0 0 0 0 0 0 0 ...
## $ arthritis : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cancer : int 0 0 0 0 0 0 0 0 0 0 ...
## $ copd : int 0 0 0 0 0 0 0 0 0 0 ...
## $ depression : int 0 0 0 0 0 0 0 0 0 0 ...
## $ diabetes : int 0 0 0 0 0 0 0 0 0 0 ...
## $ heart.failure : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ihd : int 0 0 0 0 0 0 0 0 0 0 ...
## $ kidney : int 0 0 0 0 0 0 0 0 0 0 ...
## $ osteoporosis : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stroke : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reimbursement2008: int 0 0 0 0 0 0 0 0 0 0 ...
## $ bucket2008 : int 1 1 1 1 1 1 1 1 1 1 ...
## $ reimbursement2009: int 0 0 0 0 0 0 0 0 0 0 ...
## $ bucket2009 : int 1 1 1 1 1 1 1 1 1 1 ...
The observations represent a 1% random sample of Medicare beneficiaries, limited to those still alive at the end of 2008. Our independent variables are from 2008, and we will be predicting cost in 2009.
table(claims$bucket2009)/nrow(claims)
##
## 1 2 3 4 5
## 0.671267781 0.190170413 0.089466272 0.043324855 0.005770679
As we can see, vast majority of the patients have low cost.
Our goal will be to predict the cost bucket the patient fell into in 2009 using a CART model.
library(caTools)
set.seed(88)
spl <- sample.split(claims$bucket2009, SplitRatio = 0.6)
ClaimsTrain <- subset(claims, spl == TRUE)
ClaimsTest <- subset(claims, spl == FALSE)
summary(ClaimsTrain$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 26.00 67.00 73.00 72.64 81.00 100.00
table(ClaimsTrain$diabetes)
##
## 0 1
## 170131 104672
create a classification matrix to see the accuracy of baseline method – predicting 2009 bucket to be the same as 2008 bucket
table(ClaimsTest$bucket2009, ClaimsTest$bucket2008)
##
## 1 2 3 4 5
## 1 110138 7787 3427 1452 174
## 2 16000 10721 4629 2931 559
## 3 7006 4629 2774 1621 360
## 4 2688 1943 1415 1539 352
## 5 293 191 160 309 104
Model accuracy is
(110138+10721+2774+1537+104)/nrow(ClaimsTest)
## [1] 0.6838026
PenaltyMatrix <- matrix(c(0,1,2,3,4,2,0,1,2,3,4,2,0,1,2,6,4,2,0,1,8,6,4,2,0), byrow = TRUE, nrow = 5)
PenaltyMatrix
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 1 2 3 4
## [2,] 2 0 1 2 3
## [3,] 4 2 0 1 2
## [4,] 6 4 2 0 1
## [5,] 8 6 4 2 0
Actual outcome is on the left, prediction on the top
So now to compute the penalty error of the baseline method, we can multiply our classification matrix by the penalty matrix.
as.matrix(table(ClaimsTest$bucket2009, ClaimsTest$bucket2008))*PenaltyMatrix
##
## 1 2 3 4 5
## 1 0 7787 6854 4356 696
## 2 32000 0 4629 5862 1677
## 3 28024 9258 0 1621 720
## 4 16128 7772 2830 0 352
## 5 2344 1146 640 618 0
sum(as.matrix(table(ClaimsTest$bucket2009, ClaimsTest$bucket2008))*PenaltyMatrix)/nrow(ClaimsTest)
## [1] 0.7386055
penalty error for the baseline method is 0.74
In the next section, our goal will be to create a CART model that has an accuracy higher than 68% and a penalty error lower than 0.74.
if we used the baseline method of predicting the most frequent outcome for all observations, the new method would predict cost bucket 1 for everone.
table(ClaimsTest$bucket2009)
##
## 1 2 3 4 5
## 122978 34840 16390 7937 1057
122978/nrow(ClaimsTest)
## [1] 0.67127
compute the penalty error
(122978*0+34840*2+16390*4+7937*6+1057*8)/nrow(ClaimsTest)
## [1] 1.044301
library(rpart)
library(rpart.plot)
Note that even though we have a multi-class classification problem here, we build our tree in the same way as a binary classification problem.
ClaimsTree <- rpart(bucket2009 ~ age + arthritis + alzheimers + cancer + copd + depression + diabetes + heart.failure + ihd + kidney + osteoporosis + stroke + bucket2008 + reimbursement2008, data = ClaimsTrain, method = "class", cp = 0.00005)
The cp value we’re using here was selected through cross-validation on the training set. We won’t perform the cross-validation here, because it takes a significant amount of time on a data set of this size.
prp(ClaimsTree)
## Warning: labs do not fit even at cex 0.15, there may be some overplotting
Make predictions
predictTest <- predict(ClaimsTree, newdata = ClaimsTest, type = "class")
table(ClaimsTest$bucket2009, predictTest)
## predictTest
## 1 2 3 4 5
## 1 114141 8610 124 103 0
## 2 18409 16102 187 142 0
## 3 8027 8146 118 99 0
## 4 3099 4584 53 201 0
## 5 351 657 4 45 0
accuracy is 0.71
(114141+16102+118+201+0)/nrow(ClaimsTest)
## [1] 0.7126669
as.matrix(table(ClaimsTest$bucket2009, predictTest)) * PenaltyMatrix
## predictTest
## 1 2 3 4 5
## 1 0 8610 248 309 0
## 2 36818 0 187 284 0
## 3 32108 16292 0 99 0
## 4 18594 18336 106 0 0
## 5 2808 3942 16 90 0
sum(as.matrix(table(ClaimsTest$bucket2009, predictTest)) * PenaltyMatrix)/nrow(ClaimsTest)
## [1] 0.7578902
We saw that our baseline method had an accuracy of 68% and a penalty error of 0.74. So while we increased the accuracy, the penalty error also went up.Why? By default, rpart will try to maximize the overall accuracy, and every type of error is seen as having a penalty of one. Our CART model predicts 3, 4, and 5 so rarely because there are very few observations in these classes. So we don’t really expect this model to do better on the penalty error than the baseline method.
So how can we fix this? The rpart function allows us to specify a parameter called loss. This is the penalty matrix we want to use when building our model.
ClaimsTree <- rpart(bucket2009 ~ age + arthritis + alzheimers + cancer + copd + depression + diabetes + heart.failure + ihd + kidney + osteoporosis + stroke + bucket2008 + reimbursement2008, data = ClaimsTrain, method = "class", cp = 0.00005, parms = list(loss = PenaltyMatrix))
If the rpart function knows that we’ll be giving a higher penalty to some types of errors over others, it might choose different splits when building the model to minimize the worst types of errors. We’ll probably get a lower overall accuracy with this new model. But hopefully, the penalty error will be much lower too.
predictTest <- predict(ClaimsTree, newdata = ClaimsTest, type = "class")
table(ClaimsTest$bucket2009, predictTest)
## predictTest
## 1 2 3 4 5
## 1 94310 25295 3087 286 0
## 2 7176 18942 8079 643 0
## 3 3590 7706 4692 401 1
## 4 1304 3193 2803 636 1
## 5 135 356 408 156 2
Model accuracy is 0.647
(94310+18942+4692+636+2)/nrow(ClaimsTest)
## [1] 0.6472746
Penalty error is:
sum(as.matrix(table(ClaimsTest$bucket2009, predictTest)) * PenaltyMatrix)/nrow(ClaimsTest)
## [1] 0.6418161
The modified model has lower accuracy but lower penalty error.