MITx Analytics Edge exercise
This is a sample of patients in the Medicare program (USA), which provides health insurance to Americans aged 65 and older, as well as to some younger people with certain medical conditions.
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.0.0 ✔ purrr 0.3.0
## ✔ tibble 2.0.0 ✔ dplyr 0.8.0.1
## ✔ tidyr 0.8.2 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.3.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(caTools)
library(rpart)
library(rpart.plot)
This data comes from the DE-SynPUF dataset, published by the United States Centers for Medicare and Medicaid Services (CMS).
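The step that reads the data into R is not shown in the output above; a minimal sketch, assuming the extract is saved as a CSV file named ClaimsData.csv (a hypothetical file name):
claims <- read.csv("ClaimsData.csv") # assumed file name; adjust to the actual location of the DE-SynPUF extract
str(claims)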
## 'data.frame': 458005 obs. of 16 variables:
## $ age : int 85 59 67 52 67 68 75 70 67 67 ...
## $ alzheimers : int 0 0 0 0 0 0 0 0 0 0 ...
## $ arthritis : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cancer : int 0 0 0 0 0 0 0 0 0 0 ...
## $ copd : int 0 0 0 0 0 0 0 0 0 0 ...
## $ depression : int 0 0 0 0 0 0 0 0 0 0 ...
## $ diabetes : int 0 0 0 0 0 0 0 0 0 0 ...
## $ heart.failure : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ihd : int 0 0 0 0 0 0 0 0 0 0 ...
## $ kidney : int 0 0 0 0 0 0 0 0 0 0 ...
## $ osteoporosis : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stroke : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reimbursement2008: int 0 0 0 0 0 0 0 0 0 0 ...
## $ bucket2008 : int 1 1 1 1 1 1 1 1 1 1 ...
## $ reimbursement2009: int 0 0 0 0 0 0 0 0 0 0 ...
## $ bucket2009 : int 1 1 1 1 1 1 1 1 1 1 ...
reimbursement2008 is the total amount of Medicare reimbursements for the patient in 2008, and reimbursement2009 is the total amount for 2009.
bucket2008 is the cost bucket the patient fell into in 2008, and bucket2009 is the cost bucket the patient fell into in 2009. These cost buckets are defined using the thresholds determined by D2Hawkeye.
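As an illustration of how such buckets could be built from the reimbursement totals, a sketch using cut() is shown below; the dollar thresholds here are placeholders, not the actual D2Hawkeye values.
example_breaks <- c(-Inf, 3000, 8000, 19000, 55000, Inf) # illustrative break points only
bucket_example <- cut(claims$reimbursement2008, breaks = example_breaks, labels = 1:5)
table(bucket_example)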
table(claims$bucket2009)/nrow(claims)
##
## 1 2 3 4 5
## 0.671267781 0.190170413 0.089466272 0.043324855 0.005770679
barplot(table(claims$bucket2009))
set.seed(88)
spl <- sample.split(claims$bucket2009, SplitRatio = 0.6) # sample.split() comes from the caTools package
claimstrain <- subset(claims, spl == TRUE)
claimstest <- subset(claims, spl == FALSE)
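sample.split() splits on the outcome variable while preserving the relative proportions of its labels, so the bucket2009 distribution should be almost identical in the two subsets. A quick check:
# Outcome distribution should be roughly the same in the training and test sets
table(claimstrain$bucket2009)/nrow(claimstrain)
table(claimstest$bucket2009)/nrow(claimstest)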
summary(claimstrain)
## age alzheimers arthritis cancer
## Min. : 26.00 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.: 67.00 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median : 73.00 Median :0.0000 Median :0.0000 Median :0.00000
## Mean : 72.64 Mean :0.1928 Mean :0.1546 Mean :0.06402
## 3rd Qu.: 81.00 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :100.00 Max. :1.0000 Max. :1.0000 Max. :1.00000
## copd depression diabetes heart.failure
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.1369 Mean :0.2129 Mean :0.3809 Mean :0.2852
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## ihd kidney osteoporosis stroke
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :0.00000
## Mean :0.4205 Mean :0.1616 Mean :0.1738 Mean :0.04497
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## reimbursement2008 bucket2008 reimbursement2009 bucket2009
## Min. : 0 Min. :1.000 Min. : 0 Min. :1.000
## 1st Qu.: 0 1st Qu.:1.000 1st Qu.: 130 1st Qu.:1.000
## Median : 960 Median :1.000 Median : 1540 Median :1.000
## Mean : 4016 Mean :1.438 Mean : 4280 Mean :1.522
## 3rd Qu.: 3110 3rd Qu.:2.000 3rd Qu.: 4220 3rd Qu.:2.000
## Max. :194910 Max. :5.000 Max. :158800 Max. :5.000
Proportion of people in the training set with a diagnosis code for diabetes:
table(claimstrain$diabetes)/nrow(claimstrain)
##
## 0 1
## 0.6191017 0.3808983
The baseline model predicts that this year's cost bucket (2009) is the same as last year's cost bucket (2008).
baseline_table <- table(claimstest$bucket2009, claimstest$bucket2008)
baseline_table
##
## 1 2 3 4 5
## 1 110138 7787 3427 1452 174
## 2 16000 10721 4629 2931 559
## 3 7006 4629 2774 1621 360
## 4 2688 1943 1415 1539 352
## 5 293 191 160 309 104
# Accuracy is the number of correct predictions.
# The correct predictions are where 'row' matches 'column'
# i.e. the entries in the diagonal
paste('Accuracy:', (sum(diag(baseline_table)))/nrow(claimstest))
## [1] "Accuracy: 0.683813495485857"
PenaltyMatrix = matrix(c(0,1,2,3,4,2,0,1,2,3,4,2,0,1,2,6,4,2,0,1,8,6,4,2,0), byrow = TRUE, nrow = 5)
PenaltyMatrix
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 1 2 3 4
## [2,] 2 0 1 2 3
## [3,] 4 2 0 1 2
## [4,] 6 4 2 0 1
## [5,] 8 6 4 2 0
Actual outcomes are on the left, predicted outcomes are on top. The biggest penalties occur when a low cost bucket is predicted but the actual outcome is a high cost bucket. Predicting a high cost bucket when the actual outcome is a low cost bucket is also penalized, but less heavily.
as.matrix(baseline_table)*PenaltyMatrix
##
## 1 2 3 4 5
## 1 0 7787 6854 4356 696
## 2 32000 0 4629 5862 1677
## 3 28024 9258 0 1621 720
## 4 16128 7772 2830 0 352
## 5 2344 1146 640 618 0
paste('Penalty error:', sum(as.matrix(baseline_table)*PenaltyMatrix)/nrow(claimstest))
## [1] "Penalty error: 0.73860547373937"
A second baseline predicts the most frequent outcome, cost bucket 1, for all observations.
baseline2_table <- table(claimstest$bucket2009, rep(1, nrow(claimstest))) # predict bucket 1 for every observation
baseline2_table
##
## 1
## 1 122978
## 2 34840
## 3 16390
## 4 7937
## 5 1057
baseline2_penalty <- as.matrix(baseline2_table) * PenaltyMatrix[,1]
baseline2_penalty
##
## 1
## 1 0
## 2 69680
## 3 65560
## 4 47622
## 5 8456
paste('Accuracy :', baseline2_table[1]/nrow(claimstest))
## [1] "Accuracy : 0.6712699643017"
paste('Penalty error:', sum(baseline2_penalty)/nrow(claimstest))
## [1] "Penalty error: 1.04430082641019"
ClaimsTree <- rpart(bucket2009 ~ age + arthritis + alzheimers + cancer + copd + depression + diabetes + heart.failure + ihd + kidney + osteoporosis + stroke + bucket2008 + reimbursement2008,
data = claimstrain, method = "class", cp = 0.00005)
The cp value was selected through cross-validation on the training set.
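The cross-validation step itself is not shown here; one way to carry it out is with rpart's built-in 10-fold cross-validation, as sketched below (the object names best_cp and pruned are illustrative).
cp_table <- ClaimsTree$cptable # cp table, including cross-validated error (xerror)
best_cp <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
pruned <- prune(ClaimsTree, cp = best_cp) # keep the subtree with the lowest xerror
plotcp(ClaimsTree) # visual check of xerror against cp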
prp(ClaimsTree)
## Warning: labs do not fit even at cex 0.15, there may be some overplotting
PredictTest <- predict(ClaimsTree, newdata = claimstest, type = "class")
predict_table <- table(claimstest$bucket2009, PredictTest)
predict_table
## PredictTest
## 1 2 3 4 5
## 1 114141 8610 124 103 0
## 2 18409 16102 187 142 0
## 3 8027 8146 118 99 0
## 4 3099 4584 53 201 0
## 5 351 657 4 45 0
paste('Accuracy:', (sum(diag(predict_table)))/nrow(claimstest))
## [1] "Accuracy: 0.712666892282835"
as.matrix(predict_table)*PenaltyMatrix
## PredictTest
## 1 2 3 4 5
## 1 0 8610 248 309 0
## 2 36818 0 187 284 0
## 3 32108 16292 0 99 0
## 4 18594 18336 106 0 0
## 5 2808 3942 16 90 0
paste('Penalty error:', sum(as.matrix(predict_table)*PenaltyMatrix)/nrow(claimstest))
## [1] "Penalty error: 0.757890197705265"
The CART model has better accuracy than the first baseline model, but a worse penalty error. This is because rpart, by default, treats every misclassification as an equal loss of 1 rather than weighting errors according to the PenaltyMatrix.
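By default, rpart behaves as if it were given a loss matrix with 0 on the diagonal and 1 everywhere else:
DefaultLoss <- matrix(1, nrow = 5, ncol = 5) - diag(5) # the implicit default loss for classification
DefaultLoss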
ClaimsPenaltyTree <- rpart(bucket2009 ~ age + arthritis + alzheimers + cancer + copd + depression + diabetes + heart.failure + ihd + kidney + osteoporosis + stroke + bucket2008 + reimbursement2008,
data = claimstrain, method = "class", cp = 0.00005,
parms = list(loss = PenaltyMatrix)) # define penalty matrix
prp(ClaimsPenaltyTree)
PredictPenaltyTest <- predict(ClaimsPenaltyTree, newdata = claimstest, type = "class")
predictpenalty_table <- table(claimstest$bucket2009, PredictPenaltyTest)
predictpenalty_table
## PredictPenaltyTest
## 1 2 3 4 5
## 1 94310 25295 3087 286 0
## 2 7176 18942 8079 643 0
## 3 3590 7706 4692 401 1
## 4 1304 3193 2803 636 1
## 5 135 356 408 156 2
paste('Accuracy:', (sum(diag(predictpenalty_table)))/nrow(claimstest))
## [1] "Accuracy: 0.647274593072128"
as.matrix(predictpenalty_table)*PenaltyMatrix
## PredictPenaltyTest
## 1 2 3 4 5
## 1 0 25295 6174 858 0
## 2 14352 0 8079 1286 0
## 3 14360 15412 0 401 2
## 4 7824 12772 5606 0 1
## 5 1080 2136 1632 312 0
paste('Penalty error:', sum(as.matrix(predictpenalty_table)*PenaltyMatrix)/nrow(claimstest))
## [1] "Penalty error: 0.641816137378413"
The penalty-aware CART model has lower accuracy than the baseline, but also a better (lower) penalty error.
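To compare the four models side by side, the metrics computed above can be collected into one table. A sketch, reusing the objects defined earlier:
results <- data.frame(
  model = c("Baseline: bucket2008", "Baseline: all bucket 1",
            "CART", "CART with penalty matrix"),
  accuracy = c(sum(diag(baseline_table)),
               baseline2_table[1],
               sum(diag(predict_table)),
               sum(diag(predictpenalty_table))) / nrow(claimstest),
  penalty.error = c(sum(as.matrix(baseline_table) * PenaltyMatrix),
                    sum(baseline2_penalty),
                    sum(as.matrix(predict_table) * PenaltyMatrix),
                    sum(as.matrix(predictpenalty_table) * PenaltyMatrix)) / nrow(claimstest)
)
results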