Source: Analytics Edge Unit 4 Lecture

Techniques involved: CART, penalty matrix, loss parameter

This is a story of D2Hawkeye, a medical data mining company located in Waltham, Massachusetts. The company was founded by Chris Kryder, a medical doctor who was an MIT physician in the 1990s. The company was founded in 2001. The company combines expert knowledge and databases with analytics to improve quality and course management in health care. The overall process that D2Hawkeye uses is as follows. It starts with medical claims that consist of diagnoses, procedures, and drugs. These medical claims are then processed via process of aggregation, cleaning, and normalization. This data then enters secure databases on which predictive models are applied. The output of predictive models are specific reports that give insight to the various questions that D2Hawkeye aspires to answer.

The company tries to improve health care case management. Specifically, it tries to identify high-risk patients, work with patients to manage treatment and associated costs, and arrange specialist care.

Medical costs, of course, is a serious matter both for the patient as well as the provider. Being able to predict this cost is an important problem that interests both the patients as well as the providers. The overall goal of D2Hawkeye is to improve the quality of cost predictions. D2Hawkeye had many different types of clients. The most important were third party administrators of medical claims. Third party administrators are companies hired by the employer who manage the claims of the employees. Now the type of clients were case management companies, benefits consultants, and health plans.

To analyze the data, the company used what we call a pre-analytics approach. This was based on the human judgment of physicians who manually analyze patient histories and developed medical rules. Of course, this involved human judgment, utilized a limited set of data, it was often costly, and somewhat inefficient. The key question we analyze in this lecture is “Can we use analytics instead?”

Load the data

setwd("C:/Users/jzchen/Documents/Courses/Analytics Edge/Unit_4_Trees")
claims <- read.csv("ClaimsData.csv")
str(claims)
## 'data.frame':    458005 obs. of  16 variables:
##  $ age              : int  85 59 67 52 67 68 75 70 67 67 ...
##  $ alzheimers       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ arthritis        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ cancer           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ copd             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ depression       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ diabetes         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ heart.failure    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ihd              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ kidney           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ osteoporosis     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ stroke           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reimbursement2008: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ bucket2008       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ reimbursement2009: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ bucket2009       : int  1 1 1 1 1 1 1 1 1 1 ...

The observations represent a 1% random sample of Medicare beneficiaries, limited to those still alive at the end of 2008. Our independent variables are from 2008, and we will be predicting cost in 2009.

table(claims$bucket2009)/nrow(claims)
## 
##           1           2           3           4           5 
## 0.671267781 0.190170413 0.089466272 0.043324855 0.005770679

As we can see, vast majority of the patients have low cost.

Our goal will be to predict the cost bucket the patient fell into in 2009 using a CART model.

create training and test set

library(caTools)
set.seed(88)
spl <- sample.split(claims$bucket2009, SplitRatio = 0.6)
ClaimsTrain <- subset(claims, spl == TRUE)
ClaimsTest <- subset(claims, spl == FALSE)

Baseline model

summary(ClaimsTrain$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   26.00   67.00   73.00   72.64   81.00  100.00
table(ClaimsTrain$diabetes)
## 
##      0      1 
## 170131 104672

create a classification matrix to see the accuracy of baseline method – predicting 2009 bucket to be the same as 2008 bucket

table(ClaimsTest$bucket2009, ClaimsTest$bucket2008)
##    
##          1      2      3      4      5
##   1 110138   7787   3427   1452    174
##   2  16000  10721   4629   2931    559
##   3   7006   4629   2774   1621    360
##   4   2688   1943   1415   1539    352
##   5    293    191    160    309    104

Model accuracy is

(110138+10721+2774+1537+104)/nrow(ClaimsTest)
## [1] 0.6838026

create a penalty matrix

PenaltyMatrix <- matrix(c(0,1,2,3,4,2,0,1,2,3,4,2,0,1,2,6,4,2,0,1,8,6,4,2,0), byrow = TRUE, nrow = 5)
PenaltyMatrix
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    1    2    3    4
## [2,]    2    0    1    2    3
## [3,]    4    2    0    1    2
## [4,]    6    4    2    0    1
## [5,]    8    6    4    2    0

Actual outcome is on the left, prediction on the top

So now to compute the penalty error of the baseline method, we can multiply our classification matrix by the penalty matrix.

as.matrix(table(ClaimsTest$bucket2009, ClaimsTest$bucket2008))*PenaltyMatrix
##    
##         1     2     3     4     5
##   1     0  7787  6854  4356   696
##   2 32000     0  4629  5862  1677
##   3 28024  9258     0  1621   720
##   4 16128  7772  2830     0   352
##   5  2344  1146   640   618     0

compute penalty error

sum(as.matrix(table(ClaimsTest$bucket2009, ClaimsTest$bucket2008))*PenaltyMatrix)/nrow(ClaimsTest)
## [1] 0.7386055

penalty error for the baseline method is 0.74

In the next section, our goal will be to create a CART model that has an accuracy higher than 68% and a penalty error lower than 0.74.

baseline model

if we used the baseline method of predicting the most frequent outcome for all observations, the new method would predict cost bucket 1 for everone.

table(ClaimsTest$bucket2009)
## 
##      1      2      3      4      5 
## 122978  34840  16390   7937   1057
122978/nrow(ClaimsTest)
## [1] 0.67127

compute the penalty error

(122978*0+34840*2+16390*4+7937*6+1057*8)/nrow(ClaimsTest)
## [1] 1.044301

Build a CART model for multi-class classification

library(rpart)
library(rpart.plot)

Note that even though we have a multi-class classification problem here, we build our tree in the same way as a binary classification problem.

ClaimsTree <- rpart(bucket2009 ~ age + arthritis + alzheimers + cancer + copd + depression + diabetes + heart.failure + ihd + kidney + osteoporosis + stroke + bucket2008 + reimbursement2008, data = ClaimsTrain, method = "class", cp = 0.00005)

The cp value we’re using here was selected through cross-validation on the training set. We won’t perform the cross-validation here, because it takes a significant amount of time on a data set of this size.

Plot the CART tree

prp(ClaimsTree)
## Warning: labs do not fit even at cex 0.15, there may be some overplotting

Evaluate the model

Make predictions

predictTest <- predict(ClaimsTree, newdata = ClaimsTest, type = "class")
table(ClaimsTest$bucket2009, predictTest)
##    predictTest
##          1      2      3      4      5
##   1 114141   8610    124    103      0
##   2  18409  16102    187    142      0
##   3   8027   8146    118     99      0
##   4   3099   4584     53    201      0
##   5    351    657      4     45      0

accuracy is 0.71

(114141+16102+118+201+0)/nrow(ClaimsTest)
## [1] 0.7126669

Penalty error

as.matrix(table(ClaimsTest$bucket2009, predictTest)) * PenaltyMatrix
##    predictTest
##         1     2     3     4     5
##   1     0  8610   248   309     0
##   2 36818     0   187   284     0
##   3 32108 16292     0    99     0
##   4 18594 18336   106     0     0
##   5  2808  3942    16    90     0
sum(as.matrix(table(ClaimsTest$bucket2009, predictTest)) * PenaltyMatrix)/nrow(ClaimsTest)
## [1] 0.7578902

We saw that our baseline method had an accuracy of 68% and a penalty error of 0.74. So while we increased the accuracy, the penalty error also went up.Why? By default, rpart will try to maximize the overall accuracy, and every type of error is seen as having a penalty of one. Our CART model predicts 3, 4, and 5 so rarely because there are very few observations in these classes. So we don’t really expect this model to do better on the penalty error than the baseline method.

So how can we fix this? The rpart function allows us to specify a parameter called loss. This is the penalty matrix we want to use when building our model.

rebuild the model

ClaimsTree <- rpart(bucket2009 ~ age + arthritis + alzheimers + cancer + copd + depression + diabetes + heart.failure + ihd + kidney + osteoporosis + stroke + bucket2008 + reimbursement2008, data = ClaimsTrain, method = "class", cp = 0.00005, parms = list(loss = PenaltyMatrix))

If the rpart function knows that we’ll be giving a higher penalty to some types of errors over others, it might choose different splits when building the model to minimize the worst types of errors. We’ll probably get a lower overall accuracy with this new model. But hopefully, the penalty error will be much lower too.

Evaluate the new model

predictTest <- predict(ClaimsTree, newdata = ClaimsTest, type = "class")
table(ClaimsTest$bucket2009, predictTest)
##    predictTest
##         1     2     3     4     5
##   1 94310 25295  3087   286     0
##   2  7176 18942  8079   643     0
##   3  3590  7706  4692   401     1
##   4  1304  3193  2803   636     1
##   5   135   356   408   156     2

Model accuracy is 0.647

(94310+18942+4692+636+2)/nrow(ClaimsTest)
## [1] 0.6472746

Penalty error is:

sum(as.matrix(table(ClaimsTest$bucket2009, predictTest)) * PenaltyMatrix)/nrow(ClaimsTest)
## [1] 0.6418161

The modified model has lower accuracy but lower penalty error.