MITX Analytics Edge exercise

Claims Data

Sample of patients in the Medicare program (USA), which provides health insurance to Americans aged 65 years and older, as well as some younger people with certain medical conditions.

library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.0.0       ✔ purrr   0.3.0  
## ✔ tibble  2.0.0       ✔ dplyr   0.8.0.1
## ✔ tidyr   0.8.2       ✔ stringr 1.4.0  
## ✔ readr   1.3.1       ✔ forcats 0.3.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(caTools)
library(rpart)
library(rpart.plot)

Data

This data comes from the DE-SynPUF dataset, published by the United States Centers for Medicare and Medicaid Services (CMS).

## 'data.frame':    458005 obs. of  16 variables:
##  $ age              : int  85 59 67 52 67 68 75 70 67 67 ...
##  $ alzheimers       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ arthritis        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ cancer           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ copd             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ depression       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ diabetes         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ heart.failure    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ihd              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ kidney           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ osteoporosis     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ stroke           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reimbursement2008: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ bucket2008       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ reimbursement2009: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ bucket2009       : int  1 1 1 1 1 1 1 1 1 1 ...

reimbursement2008 is the total amount of Medicare reimbursements for this patient in 2008.

reimbursement2009 is the total value of all Medicare reimbursements for the patient in 2009.

bucket2008 is the cost bucket the patient fell into in 2008, and bucket2009 is the cost bucket the patient fell into in 2009. These cost buckets are defined using the thresholds determined by D2Hawkeye.

table(claims$bucket2009)/nrow(claims)
## 
##           1           2           3           4           5 
## 0.671267781 0.190170413 0.089466272 0.043324855 0.005770679
barplot(table(claims$bucket2009))

Create training and test set

set.seed(88)
spl = sample.split(claims$bucket2009, SplitRatio = 0.6) # use library caTools
claimstrain <- subset(claims, spl == TRUE)
claimstest <- subset(claims, spl == FALSE)
summary(claimstrain)
##       age           alzheimers       arthritis          cancer       
##  Min.   : 26.00   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.: 67.00   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median : 73.00   Median :0.0000   Median :0.0000   Median :0.00000  
##  Mean   : 72.64   Mean   :0.1928   Mean   :0.1546   Mean   :0.06402  
##  3rd Qu.: 81.00   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.00000  
##  Max.   :100.00   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##       copd          depression        diabetes      heart.failure   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.0000   Median :0.0000  
##  Mean   :0.1369   Mean   :0.2129   Mean   :0.3809   Mean   :0.2852  
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##       ihd             kidney        osteoporosis        stroke       
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.0000   Median :0.0000   Median :0.0000   Median :0.00000  
##  Mean   :0.4205   Mean   :0.1616   Mean   :0.1738   Mean   :0.04497  
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##  reimbursement2008   bucket2008    reimbursement2009   bucket2009   
##  Min.   :     0    Min.   :1.000   Min.   :     0    Min.   :1.000  
##  1st Qu.:     0    1st Qu.:1.000   1st Qu.:   130    1st Qu.:1.000  
##  Median :   960    Median :1.000   Median :  1540    Median :1.000  
##  Mean   :  4016    Mean   :1.438   Mean   :  4280    Mean   :1.522  
##  3rd Qu.:  3110    3rd Qu.:2.000   3rd Qu.:  4220    3rd Qu.:2.000  
##  Max.   :194910    Max.   :5.000   Max.   :158800    Max.   :5.000

Proportion of people in training set with diagnosis code for diabetes

table(claimstrain$diabetes)/nrow(claimstrain)
## 
##         0         1 
## 0.6191017 0.3808983

Baseline model

Baseline model is last year’s claim bucket (2008) predicts this year’s claim bucket (2009).

baseline_table <- table(claimstest$bucket2009, claimstest$bucket2008)

baseline_table
##    
##          1      2      3      4      5
##   1 110138   7787   3427   1452    174
##   2  16000  10721   4629   2931    559
##   3   7006   4629   2774   1621    360
##   4   2688   1943   1415   1539    352
##   5    293    191    160    309    104
# Accuracy is the number of correct predictions.
# The correct predictions are where 'row' matches 'column'
# i.e. the entries in the diagonal
paste('Accuracy:', (sum(diag(baseline_table)))/nrow(claimstest))
## [1] "Accuracy: 0.683813495485857"

Penalty matrix (for Hawkeye model)

PenaltyMatrix = matrix(c(0,1,2,3,4,2,0,1,2,3,4,2,0,1,2,6,4,2,0,1,8,6,4,2,0), byrow = TRUE, nrow = 5)

PenaltyMatrix
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    1    2    3    4
## [2,]    2    0    1    2    3
## [3,]    4    2    0    1    2
## [4,]    6    4    2    0    1
## [5,]    8    6    4    2    0

Actual outcomes on left, predicted outcomes are on top. Biggest penalty when a low cost bucket is predicted, but the actual outcome is a high cost bucket. There is a penalty for predicting a high cost bucket, when the actually is a low cost bucket, but penalty is not so bad.

Penalty error for baseline model

as.matrix(baseline_table)*PenaltyMatrix
##    
##         1     2     3     4     5
##   1     0  7787  6854  4356   696
##   2 32000     0  4629  5862  1677
##   3 28024  9258     0  1621   720
##   4 16128  7772  2830     0   352
##   5  2344  1146   640   618     0
paste('Penalty error:', sum(as.matrix(baseline_table)*PenaltyMatrix)/nrow(claimstest))
## [1] "Penalty error: 0.73860547373937"

Second baseline model

Predict the most frequent outcome for all observations, which is cost bucket ‘1’.

baseline2_table <- table(claimstest$bucket2009, seq(from = 1, to = 1, length.out = nrow(claimstest)))
baseline2_table
##    
##          1
##   1 122978
##   2  34840
##   3  16390
##   4   7937
##   5   1057
baseline2_penalty <- as.matrix(baseline2_table) * PenaltyMatrix[,1]
baseline2_penalty
##    
##         1
##   1     0
##   2 69680
##   3 65560
##   4 47622
##   5  8456
paste('Accuracy :', baseline2_table[1]/nrow(claimstest))
## [1] "Accuracy : 0.6712699643017"
paste('Penalty error:', sum(baseline2_penalty)/nrow(claimstest))
## [1] "Penalty error: 1.04430082641019"

Classification and Regression Tree (CART) prediction

ClaimsTree <- rpart(bucket2009 ~ age + arthritis + alzheimers + cancer + copd + depression + diabetes + heart.failure + ihd + kidney + osteoporosis + stroke + bucket2008 + reimbursement2008, 
                    data = claimstrain, method = "class", cp = 0.00005)

cp value was selected through cross-validation on the training set.

prp(ClaimsTree)
## Warning: labs do not fit even at cex 0.15, there may be some overplotting

PredictTest <- predict(ClaimsTree, newdata = claimstest, type = "class")
predict_table <- table(claimstest$bucket2009, PredictTest)

predict_table
##    PredictTest
##          1      2      3      4      5
##   1 114141   8610    124    103      0
##   2  18409  16102    187    142      0
##   3   8027   8146    118     99      0
##   4   3099   4584     53    201      0
##   5    351    657      4     45      0
paste('Accuracy:', (sum(diag(predict_table)))/nrow(claimstest))
## [1] "Accuracy: 0.712666892282835"
as.matrix(predict_table)*PenaltyMatrix
##    PredictTest
##         1     2     3     4     5
##   1     0  8610   248   309     0
##   2 36818     0   187   284     0
##   3 32108 16292     0    99     0
##   4 18594 18336   106     0     0
##   5  2808  3942    16    90     0
paste('Penalty error:', sum(as.matrix(predict_table)*PenaltyMatrix)/nrow(claimstest))
## [1] "Penalty error: 0.757890197705265"

CART model has better accuracy, but does worse with penalty error, compared to first baseline model. This is because CART assigns a value of ‘1’ to all errors, and did not weight the errors in the same way as the PenaltyMatrix.

CART model with custom loss matrix

ClaimsPenaltyTree <- rpart(bucket2009 ~ age + arthritis + alzheimers + cancer + copd + depression + diabetes + heart.failure + ihd + kidney + osteoporosis + stroke + bucket2008 + reimbursement2008,
                           data = claimstrain, method = "class", cp = 0.00005,
                           parms = list(loss = PenaltyMatrix)) # define penalty matrix
prp(ClaimsPenaltyTree)

PredictPenaltyTest <- predict(ClaimsPenaltyTree, newdata = claimstest, type = "class")
predictpenalty_table <- table(claimstest$bucket2009, PredictPenaltyTest)

predictpenalty_table
##    PredictPenaltyTest
##         1     2     3     4     5
##   1 94310 25295  3087   286     0
##   2  7176 18942  8079   643     0
##   3  3590  7706  4692   401     1
##   4  1304  3193  2803   636     1
##   5   135   356   408   156     2
paste('Accuracy:', (sum(diag(predictpenalty_table)))/nrow(claimstest))
## [1] "Accuracy: 0.647274593072128"
as.matrix(predictpenalty_table)*PenaltyMatrix
##    PredictPenaltyTest
##         1     2     3     4     5
##   1     0 25295  6174   858     0
##   2 14352     0  8079  1286     0
##   3 14360 15412     0   401     2
##   4  7824 12772  5606     0     1
##   5  1080  2136  1632   312     0
paste('Penalty error:', sum(as.matrix(predictpenalty_table)*PenaltyMatrix)/nrow(claimstest))
## [1] "Penalty error: 0.641816137378413"

Lower accuracy than baseline, but better (lower) penalty score as well.