[ source files available on GitHub ]

PRELIMINARIES

Libraries needed for data processing, modeling, and plotting:

library("dplyr")
library("magrittr")
library("ggplot2")

library("caTools")
library("rpart")
library("rpart.plot")
library("ROCR")
library("randomForest")
library("caret")
library("e1071")

INTRODUCTION

The exercise in this lecture is an analytics approach to building models, starting with claims data on 2.4 million people over a three-year span, 2001 through 2003.

We make out-of-sample predictions for the 2003–2004 period.

This was in the early years of D2Hawkeye. Out of the 2.4 million people, we kept only those with data for at least 10 months in both the observation period and the results period. This reduced the data set to 400,000 people.

The D2Hawkeye data

To build an analytics model, let us discuss the variables we used.

  • 13,000 diagnosis codes, the codes claims data use to record diagnoses.
  • 22,000 different codes for procedures.
  • 45,000 codes for prescription drugs.

To work with this massive amount of variables, we aggregated the variables as follows.

  • Out of the 13,000 diagnosis codes, we defined 217 diagnosis groups.
  • Out of the 22,000 procedure codes, we aggregated the data to develop 213 procedure groups.
  • Finally, from the 45,000 prescription drug codes, we developed 189 therapeutic groups.

To illustrate how we infer further information from the data, the graph below shows time on the horizontal axis and costs, in thousands of dollars, on the vertical axis.

[Figure: monthly cost profiles of two example patients]

  • Patient one has, on a monthly basis, costs on the order of $10,000 to $15,000; a fairly significant cost, but fairly constant in time.

  • Patient two has an annual cost of similar size to patient one's, but in all but the third month the costs are almost $0, while in the third month they are about $70,000.

From patterns like these we defined an additional variable indicating whether the patient has a chronic condition (patient one) or an acute condition (patient two).
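As an illustration of the idea (not D2Hawkeye's actual definition), a profile could be flagged as acute whenever a single month carries most of the annual cost; the function name and the one-half threshold below are assumptions made for this sketch.

# Sketch: flag an "acute" cost profile when one month accounts for more than
# half of the annual cost (function name and threshold are assumptions).
is_acute <- function(monthly_costs, threshold = 0.5) {
  max(monthly_costs) > threshold * sum(monthly_costs)
}

is_acute(rep(12000, 12))              # patient one: FALSE -> chronic
is_acute(c(0, 0, 70000, rep(0, 9)))   # patient two: TRUE  -> acute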

In addition to the initial variables, we also defined, in collaboration with medical doctors, 269 medically-defined rules.

For example,

  • Interaction between various illnesses (e.g., obesity and depression).
  • Interaction between diagnosis and age (e.g., more than 65 years old and coronary artery disease).
  • Noncompliance with treatment (e.g., non-fulfillment of a particular drug order).
  • Illness severity (e.g., severe depression as opposed to regular depression).

There is another set of variables involving demographic information, such as gender and age.

An important subset of the variables relates to cost. Rather than using costs directly, we bucketed costs and treated everyone in a bucket equally. We defined five buckets, partitioned so that each accounted for 20% of the total cost.

The partitions were:

  • from $0 to $3,000,
  • from $3,000 to $8,000,
  • from $8,000 to $19,000,
  • from $19,000 to $55,000, and
  • above $55,000.

Patients with costs below $3,000 made up 78% of all patients.
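As a minimal sketch of how such buckets can be assigned in R, cut() with the thresholds above does the job; the cost values below are made up for illustration.

# Sketch: assigning cost buckets with cut() and the D2Hawkeye thresholds.
costs <- c(1500, 5000, 12000, 30000, 90000)   # hypothetical annual costs
cut(costs, breaks = c(0, 3000, 8000, 19000, 55000, Inf),
    labels = 1:5, include.lowest = TRUE)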

Interpretation of the cost buckets

This diagram shows the various levels of risk.

[Figure: interpretation of the five cost buckets as risk levels]

  • Bucket one consists of patients with rather low risk.
  • Bucket two has what is called emerging risk.
  • Bucket three, a moderate level of risk.
  • Bucket four, high risk.
  • And bucket five, very high risk.

From a medical perspective,

  • Buckets two and three, the emerging- and moderate-risk patients, are candidates for wellness programs.
  • Bucket four, the high-risk patients, are candidates for disease management programs.
  • Bucket five, the very high-risk patients, are candidates for case management.

Error Measures

Let us introduce the error measures we used in building the analytics models.

Of course \(R^2\) was used, but it was accompanied by other measures.

The so-called “penalty error” is motivated by the fact that if you classify a very high-risk patient as a low-risk patient, this is more costly than the reverse, namely classifying a low-risk patient as a very high-risk patient.

Motivated by this, we developed a penalty error based on asymmetric penalties. The following figure shows the penalty error as a matrix between outcome and (model) forecast.

[Figure: penalty matrix, outcome vs. forecast]

For example, when we classify an actual bucket 1 (low-risk) patient as bucket 3, we pay a penalty of 2, the difference 3 minus 1. In the reverse case, when we classify an actual bucket 3 patient as bucket 1, the penalty is twice as much, namely 4.

In general, the penalties for underpredicting a patient's risk are double the corresponding penalties for overpredicting it.
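In formulas: with penalty matrix \(P\), actual buckets \(y_i\), and forecasts \(\hat{y}_i\), the average penalty error over \(n\) patients is \(\frac{1}{n}\sum_{i=1}^{n} P_{y_i \hat{y}_i}\). The matrix itself follows a simple rule, one penalty point per bucket of overprediction and two per bucket of underprediction, which the following sketch reproduces (rows are actual buckets, columns are forecasts, matching the R matrix built later in this section).

# Sketch: generating the asymmetric penalty matrix from the rule above.
i <- row(diag(5))   # actual bucket (rows)
j <- col(diag(5))   # forecast bucket (columns)
ifelse(j >= i, j - i, 2 * (i - j))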

Baseline reference

To judge the quality of the analytics models we developed, we compare them with a baseline, which simply predicts that the cost in the next period will be the same as the cost in the current period.

As far as identification of buckets is concerned, the accuracy of this baseline was 75%, and its average penalty error was 0.56.

The Method: CART with multi-class classification

In this case, we use multi-class classification because the outcome to predict is the bucket classification, for which there are five classes.

Classification trees have the major advantage of being interpretable by the physicians who observe and judge them. In other words, physicians were able to identify the model's cases as reasonable; human intuition agreed with the output of the analytics model.

To give an example, let us consider patients with two diagnoses: coronary artery disease (CAD) and diabetes; a small code sketch of this tree follows the list below.

[Figure: example classification tree using CAD and diabetes]

  • If a patient does not have coronary artery disease, we classify the patient as bucket one.
  • If the patient has coronary artery disease, we then check whether the person has diabetes.
    • If so, it’s bucket five, very high risk.
    • If not, then given the coronary artery disease, the patient is classified as bucket three.
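Written out as a plain R function, the tree above reads as follows; the 0/1 indicator arguments are hypothetical names used only for this sketch.

# Sketch: the illustrative CAD/diabetes tree as nested rules
# (cad and diabetes are hypothetical 0/1 indicators).
predict_bucket <- function(cad, diabetes) {
  if (cad == 0) return(1)        # no CAD: bucket one, low risk
  if (diabetes == 1) return(5)   # CAD and diabetes: bucket five, very high risk
  3                              # CAD without diabetes: bucket three
}

predict_bucket(cad = 1, diabetes = 0)   # returns 3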

In the D2Hawkeye application, the most important factors at the top of the tree were related to cost.
As the tree grows, the secondary factors involve various chronic illnesses and some of the medical rules we discussed earlier.

For example, whether or not the patient has asthma and depression: if the patient has both, then it’s bucket five.

Examples of bucket 5 patients

  • Under 35 years old, between $3,300 and $3,900 in claims, CAD, but no office visits in the last year.

  • Claims between $3,900 and $43,000 with at least $8,000 paid in the last 12 months, $4,300 in pharmacy claims, an acute cost profile, and a cancer diagnosis.

  • More than $58,000 in claims, at least $55,000 paid in the last 12 months, and not an acute profile.

LOADING THE DATA

This data comes from the DE-SynPUF dataset, published by the United States Centers for Medicare and Medicaid Services (CMS).
The observations represent a 1% random sample of Medicare beneficiaries, limited to those still alive at the end of 2008.

Claims <- read.csv("data/ClaimsData.csv.gz")
str(Claims)
## 'data.frame':    458005 obs. of  16 variables:
##  $ age              : int  85 59 67 52 67 68 75 70 67 67 ...
##  $ alzheimers       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ arthritis        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ cancer           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ copd             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ depression       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ diabetes         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ heart.failure    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ihd              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ kidney           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ osteoporosis     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ stroke           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reimbursement2008: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ bucket2008       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ reimbursement2009: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ bucket2009       : int  1 1 1 1 1 1 1 1 1 1 ...

# summary(Claims)

The Variables

Our independent variables are from 2008, and we will be predicting cost in 2009. They are:

  • age: patient’s age in years at the end of 2008,

and several binary variables indicating whether (1) or not (0) the patient had diagnosis codes for a particular disease or related disorder in 2008:

  • alzheimers,
  • arthritis,
  • cancer,
  • copd, chronic obstructive pulmonary disease,
  • depression,
  • diabetes,
  • heart.failure,
  • ihd, ischemic heart disease,
  • kidney, kidney disease,
  • osteoporosis,
  • stroke.

Finally there are non-medical-diagnostic variables:

  • reimbursement2008: the total amount of Medicare reimbursements for the patient in 2008.
  • reimbursement2009: the total amount of Medicare reimbursements for the patient in 2009.
  • bucket2008: the cost bucket the patient fell into in 2008,
  • bucket2009: the cost bucket the patient fell into in 2009.

These cost buckets are defined using the thresholds determined by D2Hawkeye.

We can check whether the distribution of patients across cost buckets has a similar skewed structure to what we saw for D2Hawkeye by computing the percentage of patients in each bucket.

table(Claims$bucket2009)/nrow(Claims)
## 
##           1           2           3           4           5 
## 0.671267781 0.190170413 0.089466272 0.043324855 0.005770679

round(100.0*table(Claims$bucket2009)/nrow(Claims), 2)
## 
##     1     2     3     4     5 
## 67.13 19.02  8.95  4.33  0.58

The vast majority of patients in this data set have low cost.

# Note: many patients have $0 in reimbursements, and log10(0) is -Inf, which
# hist() cannot plot, so we shift by 1 before taking the logarithm.
hist(log10(1 + Claims$reimbursement2008), xlab = "log10(1 + 2008 reimbursements)", 
     ylab = "Frequency", main = "Distribution of 2008 Reimbursements", col = "orange")

We can check the total cost by bucket:

tot2008 <- sum(Claims$reimbursement2008)

group_by(Claims, bucket2008) %>% summarise(cost2008 = sum(reimbursement2008), frac2008 = cost2008/tot2008)
## Source: local data frame [5 x 3]
## 
##   bucket2008  cost2008  frac2008
## 1          1 245519070 0.1338494
## 2          2 302633750 0.1649865
## 3          3 382937080 0.2087654
## 4          4 611020730 0.3331096
## 5          5 292182840 0.1592890

The thresholds of the D2Hawkeye study were designed to divide patients into 5 buckets, each accounting for a similar portion, roughly 20%, of the total cost.
The data we use here do not break down as neatly, with fractions between 13% and 33%.

A MODEL FOR HEALTHCARE COST PREDICTION

Our goal will be to predict the cost bucket the patient fell into in 2009 using a CART model.

Split the data into training and testing sets

First we split our entire data set into training and test sets with a 60/40 split. Note that sample.split stratifies on the outcome variable, so the two sets have similar proportions of each cost bucket:

set.seed(88)

spl <- sample.split(Claims$bucket2009, SplitRatio = 0.6)
ClaimsTrain <- subset(Claims, spl == TRUE)
ClaimsTest <- subset(Claims, spl == FALSE)

Baseline Method and Penalty Matrix

The baseline method would predict that the cost bucket for a patient in 2009 will be the same as it was in 2008.
So let’s create a classification matrix to compute the accuracy for the baseline method on the test set.

The accuracy is the sum of the diagonal, the observations that were classified correctly, divided by the total number of observations in our test set.

cmat_base <- table(ClaimsTest$bucket2009, ClaimsTest$bucket2008)

cmat_base
##    
##          1      2      3      4      5
##   1 110138   7787   3427   1452    174
##   2  16000  10721   4629   2931    559
##   3   7006   4629   2774   1621    360
##   4   2688   1943   1415   1539    352
##   5    293    191    160    309    104

base_accu <- sum(diag(cmat_base))/nrow(ClaimsTest)

The accuracy of the baseline method is 68.38%.

How about the penalty error?

We need to first create a penalty matrix, where we will put the actual outcomes on the left (rows), and the predicted outcomes on the top (columns) (note that this is the transpose of the matrix shown in the figure above).

PenaltyMatrix <- matrix(c(0,1,2,3,4,2,0,1,2,3,4,2,0,1,2,6,4,2,0,1,8,6,4,2,0), byrow = TRUE, nrow = 5)

PenaltyMatrix
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    1    2    3    4
## [2,]    2    0    1    2    3
## [3,]    4    2    0    1    2
## [4,]    6    4    2    0    1
## [5,]    8    6    4    2    0

Therefore, the penalty error of the baseline method is:

as.matrix(table(ClaimsTest$bucket2009, ClaimsTest$bucket2008))*PenaltyMatrix
##    
##         1     2     3     4     5
##   1     0  7787  6854  4356   696
##   2 32000     0  4629  5862  1677
##   3 28024  9258     0  1621   720
##   4 16128  7772  2830     0   352
##   5  2344  1146   640   618     0

base_penalty_error <- sum(as.matrix(table(ClaimsTest$bucket2009, ClaimsTest$bucket2008))*PenaltyMatrix)/nrow(ClaimsTest)

The penalty error of the baseline method is 0.739.

Our goal will be to create a CART model that has an accuracy higher than 68.38% and a penalty error lower than 0.739.

A different baseline method

Suppose that instead of the baseline method just discussed, we used the baseline method of predicting the most frequent outcome for all observations, i.e. predict cost bucket 1 for everyone.

What are the accuracy and penalty error for this model?

The confusion matrix would be the following:

cmat_all1 <- matrix(c(rowSums(cmat_base), rep(0, 20)), byrow = FALSE, nrow = 5)

cmat_all1
##        [,1] [,2] [,3] [,4] [,5]
## [1,] 122978    0    0    0    0
## [2,]  34840    0    0    0    0
## [3,]  16390    0    0    0    0
## [4,]   7937    0    0    0    0
## [5,]   1057    0    0    0    0

Accuracy:

sum(cmat_all1[,1])
## [1] 183202
accu_base_all_1 <- cmat_all1[1,1]/nrow(ClaimsTest)

… would be 67.1%.

Penalty error:

# as.matrix(cmat_all1*PenaltyMatrix)
# sum(as.matrix(cmat_all1*PenaltyMatrix))
altbase_penalty_error <- sum(as.matrix(cmat_all1*PenaltyMatrix))/nrow(ClaimsTest)

… would be 1.044.

The CART model

model_CART <- rpart(bucket2009 ~ 
                        age + alzheimers + arthritis + cancer + 
                        copd + depression + diabetes + heart.failure + 
                        ihd + kidney + osteoporosis + stroke + 
                        bucket2008 + reimbursement2008, 
                    data = ClaimsTrain, 
                    method = "class", 
                    cp = 0.00005)

Note that even though we have a multi-class classification problem here, we build our tree in the same way as a binary classification problem.

Note on cp: the cp value we are using was selected through cross-validation on the training set. We are not performing the cross-validation here, because it takes a significant amount of time on a data set of this size. Remember that we have almost 275,000 observations in our training set.
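For reference, here is a minimal sketch of how that cross-validation could be run with caret (loaded in the preliminaries); the fold count and cp grid are assumptions, and the train() call is left commented out because it is slow on a data set of this size.

# Sketch: selecting cp by 10-fold cross-validation with caret::train.
set.seed(88)
cv_folds <- trainControl(method = "cv", number = 10)
cp_grid  <- expand.grid(cp = seq(0.00002, 0.0005, by = 0.00002))
# The outcome must be a factor for classification:
# cv_fit <- train(as.factor(bucket2009) ~ age + alzheimers + arthritis +
#                     cancer + copd + depression + diabetes + heart.failure +
#                     ihd + kidney + osteoporosis + stroke +
#                     bucket2008 + reimbursement2008,
#                 data = ClaimsTrain, method = "rpart",
#                 trControl = cv_folds, tuneGrid = cp_grid)
# cv_fit$bestTune   # cp value with the best cross-validated accuracy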

prp(model_CART)

We have a huge tree here. This makes sense for a few reasons.
One is the large number of observations in our training set. Another is that we have a five-class classification problem, so the classification is more complex than a binary classification case, like the one we saw in the previous lecture.

The trees used by D2Hawkeye were also very large CART trees. While this hurts the interpretability of the model, it is still possible to describe each of the buckets of the tree according to the splits.

Out-of-Sample predictions of the CART model

predict_CART_Test <- predict(model_CART, newdata = ClaimsTest, type = "class")

We need to give type = "class" to get the majority-class predictions; this is analogous to using a threshold of 0.5 in the binary case.

cmat_CART <- table(ClaimsTest$bucket2009, predict_CART_Test)
cmat_CART
cmat_CART
##    predict_CART_Test
##          1      2      3      4      5
##   1 114141   8610    124    103      0
##   2  18409  16102    187    142      0
##   3   8027   8146    118     99      0
##   4   3099   4584     53    201      0
##   5    351    657      4     45      0

CART_accu <- sum(diag(cmat_CART))/nrow(ClaimsTest)

The accuracy of this CART model is 71.27%.

For the penalty error, we can use the penalty matrix we prepared previously.

as.matrix(table(ClaimsTest$bucket2009, predict_CART_Test))*PenaltyMatrix
##    predict_CART_Test
##         1     2     3     4     5
##   1     0  8610   248   309     0
##   2 36818     0   187   284     0
##   3 32108 16292     0    99     0
##   4 18594 18336   106     0     0
##   5  2808  3942    16    90     0

CART_penalty_error <- sum(as.matrix(table(ClaimsTest$bucket2009, predict_CART_Test))*PenaltyMatrix)/nrow(ClaimsTest)

The penalty error of this CART model is 0.758.

While we increased the accuracy, the penalty error also went up. Why?
By default, rpart will try to maximize the overall accuracy, and every type of error is seen as having a penalty of one.
Our CART model predicts buckets 3, 4, and 5 so rarely because there are very few observations in these classes.
Therefore we do not really expect this model to do better on the penalty error than the baseline method.

How can we fix this?

New CART model with loss matrix

The rpart function allows us to specify a parameter called loss.
This is the penalty matrix we want to use when building our model.

If the rpart function knows that we will be giving a higher penalty to some types of errors over others, it might choose different splits when building the model to minimize the worst types of errors. We will probably get a lower overall accuracy with this new model, but hopefully, the penalty error will be much lower too.

model_CART_loss <- rpart(bucket2009 ~ 
                             age + alzheimers + arthritis + cancer + 
                             copd + depression + diabetes + heart.failure + 
                             ihd + kidney + osteoporosis + stroke + 
                             bucket2008 + reimbursement2008, 
                         data = ClaimsTrain, 
                         method = "class", 
                         cp = 0.00005, 
                         parms = list(loss = PenaltyMatrix))

Out-of-Sample predictions of the CART model with loss

CART_loss_predict_Test <- predict(model_CART_loss, newdata = ClaimsTest, type = "class")
CART_loss_cmat <- table(ClaimsTest$bucket2009, CART_loss_predict_Test)
CART_loss_cmat
##    CART_loss_predict_Test
##         1     2     3     4     5
##   1 94310 25295  3087   286     0
##   2  7176 18942  8079   643     0
##   3  3590  7706  4692   401     1
##   4  1304  3193  2803   636     1
##   5   135   356   408   156     2

CART_loss_accu <- sum(diag(CART_loss_cmat))/nrow(ClaimsTest)

The accuracy of this CART model is 64.73%.

as.matrix(table(ClaimsTest$bucket2009, CART_loss_predict_Test))*PenaltyMatrix
##    CART_loss_predict_Test
##         1     2     3     4     5
##   1     0 25295  6174   858     0
##   2 14352     0  8079  1286     0
##   3 14360 15412     0   401     2
##   4  7824 12772  5606     0     1
##   5  1080  2136  1632   312     0

CART_loss_penalty_error <- sum(as.matrix(table(ClaimsTest$bucket2009, CART_loss_predict_Test))*PenaltyMatrix)/nrow(ClaimsTest)

The penalty error of the CART model with loss is 0.642.

Our accuracy is now lower than the baseline method, but our penalty error is also much lower.
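To wrap up, we can collect the accuracies and penalty errors computed above into a single comparison table (all objects were defined earlier in this section):

# Side-by-side comparison of the models evaluated on the test set.
data.frame(
  model    = c("baseline: 2008 bucket", "baseline: all bucket 1",
               "CART", "CART with loss matrix"),
  accuracy = c(base_accu, accu_base_all_1, CART_accu, CART_loss_accu),
  penalty  = c(base_penalty_error, altbase_penalty_error,
               CART_penalty_error, CART_loss_penalty_error)
)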

DISCUSSION

[Figure: performance of the D2Hawkeye model vs. the baseline]

First, we observe that in the original D2Hawkeye study the overall accuracy of the analytics method, the percentage of patients whose bucket it predicts correctly, was 80%, compared to 75% for the baseline.
However, notice that this is achieved in an interesting way.

The overall penalty error also improves, going from 0.56 to 0.52. The improvement for bucket one is small, but there are significant improvements as we move to higher buckets.

What is the Analytics Edge?

  • Substantial improvement in D2Hawkeye’s ability to identify patients who need more attention.
  • Because the model was interpretable, physicians were able to improve the model by identifying new variables and refining existing variables.
  • Analytics gave D2Hawkeye an edge over competitors using “last-century” methods.