[ source files available on GitHub ]
Libraries needed for data processing and plotting:
library("dplyr")
library("magrittr")
library("ggplot2")
library("caTools")
library("rpart")
library("rpart.plot")
library("ROCR")
library("randomForest")
library("caret")
library("e1071")
The specific exercise we examine in this lecture is an analytics approach to building models, starting with 2.4 million people over the three-year span from 2001 to 2003. We make out-of-sample predictions for the period from 2003 to 2004.
This was in the early years of D2Hawkeye. Out of the 2.4 million people, we included only those with data for at least 10 months in both the observation period and the results period. This decreased the data set to 400,000 people.
To build an analytics model, let us discuss the variables we used.
To work with this massive number of variables, we aggregated them into broader groups of diagnoses, procedures, and drugs.
To illustrate how we infer further information from the data, the graph here shows time on the horizontal axis and cost, in thousands of dollars, on the vertical axis.
Patient one has, on a monthly basis, costs on the order of $10,000 to $15,000, a fairly significant cost, but fairly constant in time.
Patient two has an annual cost of a similar size to patient one's, but in all months except the third the costs are almost $0, whereas in the third month they are about $70,000.
From this contrast we defined an additional variable indicating whether the patient has a chronic (patient one) or an acute (patient two) condition.
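As an illustrative sketch (the exact rule D2Hawkeye used is not stated here), one could flag an acute profile when a single month dominates the annual total; the 0.5 threshold is an assumption:
is_acute <- function(monthly_costs, threshold = 0.5) {
  # Acute profile: the largest single month exceeds `threshold`
  # of the annual total (threshold = 0.5 is an assumption)
  max(monthly_costs) / sum(monthly_costs) > threshold
}
is_acute(rep(12000, 12))             # patient one: FALSE (chronic)
is_acute(c(0, 0, 70000, rep(0, 9)))  # patient two: TRUE (acute)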
In addition to the initial variables, we also defined, in collaboration with medical doctors, 269 medically-defined rules. For example, a rule might capture the interaction between two illnesses, such as obesity and depression, or between a diagnosis and the patient's age.
There is another set of variables involving demographic information, such as gender and age.
An important aspect of the variables is cost. Rather than using costs directly, we bucketed them and considered everyone within a bucket equally. We defined five buckets, partitioned so that each accounted for 20% of the total cost.
The partitions were: below $3,000 (bucket 1), $3,000 to $8,000 (bucket 2), $8,000 to $19,000 (bucket 3), $19,000 to $55,000 (bucket 4), and above $55,000 (bucket 5).
The patients below $3,000, i.e. bucket 1, made up 78% of all patients.
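A minimal sketch of this bucketing with cut(), assuming the thresholds above and an illustrative cost vector:
annual_cost <- c(1200, 4500, 15000, 30000, 80000)  # illustrative values
cut(annual_cost, breaks = c(0, 3000, 8000, 19000, 55000, Inf),
    labels = 1:5, right = FALSE)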
This diagram shows the various levels of risk. From a medical perspective, the buckets correspond to increasing levels of risk, from the low-risk patients in bucket 1 to the very high-risk patients in bucket 5.
Let us introduce the error measures we used in building the analytics models.
Of course \(R^2\) was used, but it was accompanied by other measures.
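Recall that \(R^2 = 1 - \mathrm{SSE}/\mathrm{SST}\), where \(\mathrm{SSE}\) is the sum of squared errors of the model and \(\mathrm{SST}\) that of the baseline that always predicts the mean outcome.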
The so-called “penalty error” is motivated by the fact that classifying a very high-risk patient as low-risk is more costly than the reverse, namely classifying a low-risk patient as very high-risk.
Motivated by this, we developed a penalty error based on asymmetric penalties. The following figure shows the penalty error as a matrix between outcome and (model) forecast.
For example, whenever we classify a bucket 1 (low-risk) patient as bucket 3, we pay a penalty of 2, the difference 3 minus 1. In the reverse case, classifying a bucket 3 patient as bucket 1, the penalty is 4, twice as much.
In general, the penalties for underestimating a patient's risk are double the corresponding penalties for overestimating it.
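A minimal sketch of this measure, assuming integer bucket labels 1 to 5 and the orientation we adopt in the R code below (actual outcomes as rows, forecasts as columns):
penalty_error <- function(actual, predicted, penalty) {
  # Average the looked-up penalties penalty[actual_i, predicted_i];
  # `penalty` has actual outcomes as rows and forecasts as columns
  mean(penalty[cbind(actual, predicted)])
}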
To judge the quality of the analytics models we developed, we compared them with a baseline: simply predict that the cost in the next period will be the same as the cost in the current period.
As far as identification of buckets is concerned, the accuracy of this baseline was 75%.
Its average penalty error was 0.56.
In this case, we use multi-class classification because the outcome to predict is the bucket classification, for which there are five classes.
Classification trees have the major advantage of being interpretable by the physicians who observe and judge them. In other words, people were able to identify these cases as reasonable; human intuition agreed with the output of the analytics model.
To give you an example, let us consider patients that have two types of diagnosis: coronary artery disease (CAD) and diabetes.
In the D2Hawkeye application, the most important factors at the top of the tree were related to cost.
As the tree grows, the secondary factors involve various chronic illnesses and some of the medical rules we discussed earlier. For example, whether or not the patient has asthma and depression: if the patient has both, the model predicts bucket five. Other example leaf descriptions:
- Under 35 years old, between $3,300 and $3,900 in claims, CAD, but no office visits in the last year.
- Claims between $3,900 and $43,000, with at least $8,000 paid in the last 12 months, $4,300 in pharmacy claims, an acute cost profile, and a cancer diagnosis.
- More than $58,000 in claims, at least $55,000 paid in the last 12 months, and not an acute profile.
This data comes from the DE-SynPUF dataset, published by the United States Centers for Medicare and Medicaid Services (CMS).
The observations represent a 1% random sample of Medicare beneficiaries, limited to those still alive at the end of 2008.
Claims <- read.csv("data/ClaimsData.csv.gz")  # read.csv reads gzipped CSVs directly
str(Claims)
## 'data.frame': 458005 obs. of 16 variables:
## $ age : int 85 59 67 52 67 68 75 70 67 67 ...
## $ alzheimers : int 0 0 0 0 0 0 0 0 0 0 ...
## $ arthritis : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cancer : int 0 0 0 0 0 0 0 0 0 0 ...
## $ copd : int 0 0 0 0 0 0 0 0 0 0 ...
## $ depression : int 0 0 0 0 0 0 0 0 0 0 ...
## $ diabetes : int 0 0 0 0 0 0 0 0 0 0 ...
## $ heart.failure : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ihd : int 0 0 0 0 0 0 0 0 0 0 ...
## $ kidney : int 0 0 0 0 0 0 0 0 0 0 ...
## $ osteoporosis : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stroke : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reimbursement2008: int 0 0 0 0 0 0 0 0 0 0 ...
## $ bucket2008 : int 1 1 1 1 1 1 1 1 1 1 ...
## $ reimbursement2009: int 0 0 0 0 0 0 0 0 0 0 ...
## $ bucket2009 : int 1 1 1 1 1 1 1 1 1 1 ...
# summary(Claims)
Our independent variables are from 2008, and we will be predicting cost in 2009. They are:

- age: the patient's age in years at the end of 2008,

and several binary variables indicating whether (1) or not (0) the patient had diagnosis codes for a particular disease or related disorder in 2008:

- alzheimers, arthritis, cancer, copd (chronic obstructive pulmonary disease), depression, diabetes, heart.failure, ihd (ischemic heart disease), kidney, osteoporosis, stroke.

Finally, there are non-medical-diagnostic variables:

- reimbursement2008: the total amount of Medicare reimbursements in 2008,
- bucket2008: the cost bucket the patient fell into in 2008,
- reimbursement2009 and bucket2009: the corresponding reimbursements and cost bucket for 2009.

These cost buckets are defined using the thresholds determined by D2Hawkeye.
We can verify that the distribution of patients across cost buckets has a structure similar to what we saw for D2Hawkeye by computing the percentage of patients in each cost bucket.
table(Claims$bucket2009)/nrow(Claims)
##
## 1 2 3 4 5
## 0.671267781 0.190170413 0.089466272 0.043324855 0.005770679
round(100.0*table(Claims$bucket2009)/nrow(Claims), 2)
##
## 1 2 3 4 5
## 67.13 19.02 8.95 4.33 0.58
The vast majority of patients in this data set have low cost.
hist(log10(1 + Claims$reimbursement2008),  # +1 avoids log10(0) = -Inf for the many $0 patients
     xlab = "log10(1 + 2008 reimbursements)", ylab = "Frequency",
     main = "Distribution of 2008 Reimbursements", col = "orange")
We can check the total cost by bucket:
tot2008 <- sum(Claims$reimbursement2008)
group_by(Claims, bucket2008) %>% summarise(cost2008 = sum(reimbursement2008), frac2008 = cost2008/tot2008)
## Source: local data frame [5 x 3]
##
## bucket2008 cost2008 frac2008
## 1 1 245519070 0.1338494
## 2 2 302633750 0.1649865
## 3 3 382937080 0.2087654
## 4 4 611020730 0.3331096
## 5 5 292182840 0.1592890
The thresholds of the D2Hawkeye study were designed to divide patients into 5 buckets, each accounting for a similar portion of the total cost, i.e. roughly 20%.
The data we use here do not break down as neatly: the fractions range from about 13% to 33%.
Our goal will be to predict the cost bucket the patient fell into in 2009 using a CART model.
First we split our entire data set into training and test sets, with a 60/40 split:
set.seed(88)
spl <- sample.split(Claims$bucket2009, SplitRatio = 0.6)
ClaimsTrain <- subset(Claims, spl == TRUE)
ClaimsTest <- subset(Claims, spl == FALSE)
The baseline method would predict that the cost bucket for a patient in 2009 will be the same as it was in 2008.
So let’s create a classification matrix to compute the accuracy for the baseline method on the test set.
The accuracy is the sum of the diagonal, the observations that were classified correctly, divided by the total number of observations in our test set.
cmat_base <- table(ClaimsTest$bucket2009, ClaimsTest$bucket2008)
cmat_base
##
## 1 2 3 4 5
## 1 110138 7787 3427 1452 174
## 2 16000 10721 4629 2931 559
## 3 7006 4629 2774 1621 360
## 4 2688 1943 1415 1539 352
## 5 293 191 160 309 104
base_accu <- sum(diag(cmat_base))/nrow(ClaimsTest)
The accuracy of the baseline method is 68.38%.
How about the penalty error?
We need to first create a penalty matrix, where we will put the actual outcomes on the left (rows), and the predicted outcomes on the top (columns) (note that this is the transpose of the matrix shown in the figure above).
PenaltyMatrix <- matrix(c(0,1,2,3,4,2,0,1,2,3,4,2,0,1,2,6,4,2,0,1,8,6,4,2,0), byrow = TRUE, nrow = 5)
PenaltyMatrix
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 1 2 3 4
## [2,] 2 0 1 2 3
## [3,] 4 2 0 1 2
## [4,] 6 4 2 0 1
## [5,] 8 6 4 2 0
The penalty error of the baseline method is then:
as.matrix(table(ClaimsTest$bucket2009, ClaimsTest$bucket2008))*PenaltyMatrix
##
## 1 2 3 4 5
## 1 0 7787 6854 4356 696
## 2 32000 0 4629 5862 1677
## 3 28024 9258 0 1621 720
## 4 16128 7772 2830 0 352
## 5 2344 1146 640 618 0
base_penalty_error <- sum(as.matrix(table(ClaimsTest$bucket2009, ClaimsTest$bucket2008))*PenaltyMatrix)/nrow(ClaimsTest)
The penalty error of the baseline method is 0.739.
Our goal will be to create a CART model that has an accuracy higher than 68.38% and a penalty error lower than 0.739.
Suppose that instead of the baseline method just discussed, we used the baseline method of predicting the most frequent outcome for all observations, i.e. predict cost bucket 1 for everyone.
What are the accuracy and penalty error for this model?
The confusion matrix would be the following:
cmat_all1 <- matrix(c(rowSums(cmat_base), rep(0, 20)), byrow = FALSE, nrow = 5)
cmat_all1
## [,1] [,2] [,3] [,4] [,5]
## [1,] 122978 0 0 0 0
## [2,] 34840 0 0 0 0
## [3,] 16390 0 0 0 0
## [4,] 7937 0 0 0 0
## [5,] 1057 0 0 0 0
Accuracy:
sum(cmat_all1[,1])  # equals nrow(ClaimsTest): every prediction is bucket 1
## [1] 183202
accu_base_all_1 <- cmat_all1[1,1]/nrow(ClaimsTest)
… would be 67.1%.
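Equivalently, since this baseline predicts bucket 1 for every patient, its accuracy is simply the share of bucket-1 patients in the test set:
mean(ClaimsTest$bucket2009 == 1)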
Penalty error:
# as.matrix(cmat_all1*PenaltyMatrix)
# sum(as.matrix(cmat_all1*PenaltyMatrix))
altbase_penalty_error <- sum(as.matrix(cmat_all1*PenaltyMatrix))/nrow(ClaimsTest)
… would be 1.044.
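Equivalently, using the penalty_error() sketch from earlier:
penalty_error(ClaimsTest$bucket2009, rep(1, nrow(ClaimsTest)), PenaltyMatrix)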
model_CART <- rpart(bucket2009 ~
age + alzheimers + arthritis + cancer +
copd + depression + diabetes + heart.failure +
ihd + kidney + osteoporosis + stroke +
bucket2008 + reimbursement2008,
data = ClaimsTrain,
method = "class",
cp = 0.00005)
Note that even though we have a multi-class classification problem here, we build our tree in the same way as a binary classification problem.
Note on cp: the cp value we are using was selected through cross-validation on the training set. We are not performing the cross-validation here, because it takes a significant amount of time on a data set of this size. Remember that we have almost 275,000 observations in our training set.
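For reference, a sketch of how such a tuning could be set up with caret; the cp grid is an illustrative assumption, and this takes a long time to run on a data set of this size:
folds <- trainControl(method = "cv", number = 10)               # 10-fold cross-validation
cpGrid <- expand.grid(cp = seq(0.00001, 0.0002, by = 0.00001))  # illustrative grid
train(as.factor(bucket2009) ~ age + alzheimers + arthritis + cancer +
        copd + depression + diabetes + heart.failure +
        ihd + kidney + osteoporosis + stroke +
        bucket2008 + reimbursement2008,
      data = ClaimsTrain, method = "rpart",
      trControl = folds, tuneGrid = cpGrid)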
prp(model_CART)
We have a huge tree here. This makes sense for a few reasons.
One is the large number of observations in our training set. Another is that we have a five-class classification problem, so the classification is more complex than a binary classification case, like the one we saw in the previous lecture.
The trees used by D2Hawkeye were also very large CART trees. While this hurts the interpretability of the model, it is still possible to describe each of the buckets of the tree according to the splits.
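For instance, recent versions of rpart.plot provide rpart.rules(), which writes each leaf out as a rule over the splits (on a tree this large the output is very long):
rpart.rules(model_CART, cover = TRUE)  # `cover` adds the share of observations per leaf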
predict_CART_Test <- predict(model_CART, newdata = ClaimsTest, type = "class")
type = "class" if we want the majority class predictions.cmat_CART <- table(ClaimsTest$bucket2009, predict_CART_Test)
cmat_CART
## predict_CART_Test
## 1 2 3 4 5
## 1 114141 8610 124 103 0
## 2 18409 16102 187 142 0
## 3 8027 8146 118 99 0
## 4 3099 4584 53 201 0
## 5 351 657 4 45 0
CART_accu <- sum(diag(cmat_CART))/nrow(ClaimsTest)
The accuracy of this CART model is 71.27%.
For the penalty error, we can use the penalty matrix we prepared previously.
as.matrix(table(ClaimsTest$bucket2009, predict_CART_Test))*PenaltyMatrix
## predict_CART_Test
## 1 2 3 4 5
## 1 0 8610 248 309 0
## 2 36818 0 187 284 0
## 3 32108 16292 0 99 0
## 4 18594 18336 106 0 0
## 5 2808 3942 16 90 0
CART_penalty_error <- sum(as.matrix(table(ClaimsTest$bucket2009, predict_CART_Test))*PenaltyMatrix)/nrow(ClaimsTest)
The penalty error of this CART model is 0.758.
While we increased the accuracy, the penalty error also went up. Why?
By default, rpart tries to maximize the overall accuracy, and every type of error is treated as having a penalty of one.
Our CART model predicts buckets 3, 4, and 5 so rarely because there are very few observations in these classes.
Therefore we should not really expect this model to do better on the penalty error than the baseline method.
How can we fix this?
The rpart function allows us to specify a parameter called loss.
This is the penalty matrix we want to use when building our model.
If the rpart function knows that some types of errors carry a higher penalty than others, it might choose different splits when building the model in order to minimize the costliest errors. We will probably get a lower overall accuracy with this new model, but hopefully the penalty error will be much lower too.
model_CART_loss <- rpart(bucket2009 ~
age + alzheimers + arthritis + cancer +
copd + depression + diabetes + heart.failure +
ihd + kidney + osteoporosis + stroke +
bucket2008 + reimbursement2008,
data = ClaimsTrain,
method = "class",
cp = 0.00005,
parms = list(loss = PenaltyMatrix))
CART_loss_predict_Test <- predict(model_CART_loss, newdata = ClaimsTest, type = "class")
CART_loss_cmat <- table(ClaimsTest$bucket2009, CART_loss_predict_Test)
CART_loss_cmat
## CART_loss_predict_Test
## 1 2 3 4 5
## 1 94310 25295 3087 286 0
## 2 7176 18942 8079 643 0
## 3 3590 7706 4692 401 1
## 4 1304 3193 2803 636 1
## 5 135 356 408 156 2
CART_loss_accu <- sum(diag(CART_loss_cmat))/nrow(ClaimsTest)
The accuracy of this CART model is 64.73%.
as.matrix(table(ClaimsTest$bucket2009, CART_loss_predict_Test))*PenaltyMatrix
## CART_loss_predict_Test
## 1 2 3 4 5
## 1 0 25295 6174 858 0
## 2 14352 0 8079 1286 0
## 3 14360 15412 0 401 2
## 4 7824 12772 5606 0 1
## 5 1080 2136 1632 312 0
CART_loss_penalty_error <- sum(as.matrix(table(ClaimsTest$bucket2009, CART_loss_predict_Test))*PenaltyMatrix)/nrow(ClaimsTest)
The penalty error of the CART model with loss is 0.642.
Our accuracy is now lower than the baseline method, but our penalty error is also much lower.
Returning to the D2Hawkeye study: the overall accuracy of their model, i.e. the percentage of patients it predicts correctly, was 80%, compared to 75% for the baseline.
However, notice that this is achieved in an interesting way.
The overall penalty improves from 0.56 to 0.52. The improvement for bucket one is small, but there are significant improvements as we move to higher buckets.