Founded by Chris Kryder, MD, MBA in 2001
Combines expert knowledge and databases with analytics to improve quality and cost management in healthcare
Located in Massachusetts, USA; grew very fast and was sold to Verisk Analytics in 2009
Medical costs often relate to severity of health problems, and are an issue for both patient and provider
Goal: improve the quality of cost predictions
Millions of people analyzed monthly through analytic platform in 2009
Thousands of employers processed monthly
Human judgement - MDs manually analyzed patient histories and developed
Limited data sets
Costly and inefficient
Can we use analytics instead?
What is available?
Demographic information
Available: claims data for 2.4 million people over a span of 3 years
Include only people with data for at least 10 months in both periods - 400,000 people
Typically we use R² or accuracy, but other measures can be used
In the case of D2Hawkeye, classifying a high-cost patient as low-cost is worse than classifying a low-cost patient as high-cost
Use a “penalty error” to capture this asymmetry
Baseline is to simply predict that the cost in the next “period” will be the cost in the current period
Penalty Error of 0.56
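As a minimal sketch of the two measures above, the following R snippet scores a baseline prediction on eight made-up patients. The bucket values are hypothetical. The rule that an underestimate costs twice as much per bucket as an overestimate is an assumption borrowed from the PenaltyMatrix in the R code further down, not something stated on these slides.

```r
# Hypothetical cost buckets for eight patients (illustration only)
actual    <- c(1, 1, 2, 3, 5, 1, 2, 4)  # bucket in the next period
predicted <- c(1, 2, 2, 1, 3, 1, 2, 4)  # baseline: bucket in the current period

# Accuracy: share of patients whose bucket is predicted exactly
accuracy <- mean(actual == predicted)

# Asymmetric penalty (assumed rule): an overestimate costs 1 per bucket,
# an underestimate costs 2 per bucket
penalty <- ifelse(predicted >= actual,
                  predicted - actual,
                  2 * (actual - predicted))

# Penalty error: average penalty per patient
penalty_error <- mean(penalty)

print(accuracy)
print(penalty_error)
```

Two predictions here are underestimates by two buckets, and they dominate the penalty error even though most patients are classified correctly.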
Under 35 years old, between $3300 and $3900 in claims, C.A.D., but no office visits in last year.
Claims between $3900 and $43,000 with at least $8000 paid in last 12 months, $4300 in pharmacy claims, acute cost profile and cancer diagnosis
More than $58,000 in claims, at least $55,000 paid in last 12 months, and not an acute profile
Substantial improvement in D2Hawkeye’s ability to identify patients who need more attention
Because the model was interpretable, physicians were able to improve the model by identifying new variables and refining existing variables
Analytics gave D2Hawkeye an edge over competition using “last century” methods
# Read in the data
Claims = read.csv("ClaimsData.csv")
# Output structure
str(Claims)
## 'data.frame': 458005 obs. of 16 variables:
## $ age : int 85 59 67 52 67 68 75 70 67 67 ...
## $ alzheimers : int 0 0 0 0 0 0 0 0 0 0 ...
## $ arthritis : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cancer : int 0 0 0 0 0 0 0 0 0 0 ...
## $ copd : int 0 0 0 0 0 0 0 0 0 0 ...
## $ depression : int 0 0 0 0 0 0 0 0 0 0 ...
## $ diabetes : int 0 0 0 0 0 0 0 0 0 0 ...
## $ heart.failure : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ihd : int 0 0 0 0 0 0 0 0 0 0 ...
## $ kidney : int 0 0 0 0 0 0 0 0 0 0 ...
## $ osteoporosis : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stroke : int 0 0 0 0 0 0 0 0 0 0 ...
## $ reimbursement2008: int 0 0 0 0 0 0 0 0 0 0 ...
## $ bucket2008 : int 1 1 1 1 1 1 1 1 1 1 ...
## $ reimbursement2009: int 0 0 0 0 0 0 0 0 0 0 ...
## $ bucket2009 : int 1 1 1 1 1 1 1 1 1 1 ...
# Percentage of patients in each cost bucket
table(Claims$bucket2009)/nrow(Claims)
##
##           1           2           3           4           5
## 0.671267781 0.190170413 0.089466272 0.043324855 0.005770679
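The bucket shares printed above can be checked and summarized directly. The sketch below copies the five proportions from that output (the variable names are illustrative, not from the original script) and derives what a naive "always predict the most common bucket" rule would score on the full data set. That figure is for the whole data set, not the test split used later.

```r
# Bucket proportions copied from the table(Claims$bucket2009)/nrow(Claims) output
bucket_share <- c("1" = 0.671267781, "2" = 0.190170413, "3" = 0.089466272,
                  "4" = 0.043324855, "5" = 0.005770679)

# Sanity check: the shares should add up to 1 (up to rounding)
total_share <- sum(bucket_share)

# Most common bucket, and the accuracy of always predicting it
majority_bucket <- names(bucket_share)[which.max(bucket_share)]
majority_accuracy <- max(bucket_share)

# Share of patients in the two most expensive buckets
high_cost_share <- sum(bucket_share[c("4", "5")])

print(majority_bucket)
print(majority_accuracy)
print(high_cost_share)
```

The classes are heavily imbalanced: fewer than 5% of patients fall in buckets 4 and 5, which is why plain accuracy says little about how well the costly patients are identified.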
# Split the data
library(caTools)
set.seed(88)
spl = sample.split(Claims$bucket2009, SplitRatio = 0.6)
ClaimsTrain = subset(Claims, spl==TRUE)
ClaimsTest = subset(Claims, spl==FALSE)
# Baseline method
table(ClaimsTest$bucket2009, ClaimsTest$bucket2008)
##
##          1      2      3      4      5
##   1 110138   7787   3427   1452    174
##   2  16000  10721   4629   2931    559
##   3   7006   4629   2774   1621    360
##   4   2688   1943   1415   1539    352
##   5    293    191    160    309    104
(110138 + 10721 + 2774 + 1539 + 104)/nrow(ClaimsTest)
## [1] 0.6838135
# Penalty Matrix
PenaltyMatrix = matrix(c(0,1,2,3,4,2,0,1,2,3,4,2,0,1,2,6,4,2,0,1,8,6,4,2,0), byrow=TRUE, nrow=5)
PenaltyMatrix
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    1    2    3    4
## [2,]    2    0    1    2    3
## [3,]    4    2    0    1    2
## [4,]    6    4    2    0    1
## [5,]    8    6    4    2    0
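The hard-coded matrix follows a simple pattern: with rows as the actual bucket and columns as the predicted bucket, each bucket of overestimation costs 1 and each bucket of underestimation costs 2. The sketch below rebuilds the matrix from that rule with `outer()` and checks it against the literal version. The name `PenaltyFormula` is introduced here for illustration only.

```r
# Literal penalty matrix, as defined above (rows = actual, columns = predicted)
PenaltyMatrix <- matrix(c(0,1,2,3,4,
                          2,0,1,2,3,
                          4,2,0,1,2,
                          6,4,2,0,1,
                          8,6,4,2,0), byrow = TRUE, nrow = 5)

# Same matrix generated from the rule:
#   predicted >= actual -> 1 per bucket of overestimation
#   predicted <  actual -> 2 per bucket of underestimation
buckets <- 1:5
PenaltyFormula <- outer(buckets, buckets, function(actual, predicted) {
  ifelse(predicted >= actual,
         predicted - actual,
         2 * (actual - predicted))
})

# TRUE if every entry agrees
print(all(PenaltyFormula == PenaltyMatrix))
```

Writing the rule out makes the asymmetry explicit and makes it easy to try other weights, for example a factor of 3 instead of 2 for underestimates.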
# Penalty Error of Baseline Method
as.matrix(table(ClaimsTest$bucket2009, ClaimsTest$bucket2008))*PenaltyMatrix
##
##         1     2     3     4     5
##   1     0  7787  6854  4356   696
##   2 32000     0  4629  5862  1677
##   3 28024  9258     0  1621   720
##   4 16128  7772  2830     0   352
##   5  2344  1146   640   618     0
sum(as.matrix(table(ClaimsTest$bucket2009, ClaimsTest$bucket2008))*PenaltyMatrix)/nrow(ClaimsTest)
## [1] 0.7386055
# Load necessary libraries
library(rpart)
library(rpart.plot)
# CART model
ClaimsTree = rpart(bucket2009 ~ age + alzheimers + arthritis + cancer + copd + depression + diabetes + heart.failure + ihd + kidney + osteoporosis + stroke + bucket2008 + reimbursement2008, data=ClaimsTrain, method="class", cp=0.00005)
# Plot CART
prp(ClaimsTree)
# Make predictions
PredictTest = predict(ClaimsTree, newdata = ClaimsTest, type = "class")
table(ClaimsTest$bucket2009, PredictTest)
##    PredictTest
##          1      2      3      4      5
##   1 114141   8610    124    103      0
##   2  18409  16102    187    142      0
##   3   8027   8146    118     99      0
##   4   3099   4584     53    201      0
##   5    351    657      4     45      0
(114141 + 16102 + 118 + 201 + 0)/nrow(ClaimsTest)
## [1] 0.7126669
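The overall accuracy hides how unevenly the tree performs across buckets. As a rough sketch, the confusion matrix printed above can be re-entered by hand (so the snippet runs without ClaimsData.csv) and used to compute, for each actual bucket, the fraction of patients the tree places correctly. The names `CartConfusion`, `overall_accuracy`, and `bucket_recall` are illustrative and not part of the original script.

```r
# Confusion matrix copied from the CART output above
# (rows = actual 2009 bucket, columns = predicted bucket)
CartConfusion <- matrix(c(114141,  8610, 124, 103, 0,
                           18409, 16102, 187, 142, 0,
                            8027,  8146, 118,  99, 0,
                            3099,  4584,  53, 201, 0,
                             351,   657,   4,  45, 0),
                        byrow = TRUE, nrow = 5,
                        dimnames = list(actual = as.character(1:5),
                                        predicted = as.character(1:5)))

# Overall accuracy: correct predictions (the diagonal) over all test patients
overall_accuracy <- sum(diag(CartConfusion)) / sum(CartConfusion)

# Per-bucket recall: correct predictions over the number of patients
# who actually belong to each bucket
bucket_recall <- diag(CartConfusion) / rowSums(CartConfusion)

print(overall_accuracy)
print(round(bucket_recall, 3))
```

The bucket 5 column is all zeros, so the tree never predicts the most expensive bucket and its recall there is 0. This is the behavior the penalty matrix is meant to discourage.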
# Penalty Error
as.matrix(table(ClaimsTest$bucket2009, PredictTest))*PenaltyMatrix
##    PredictTest
##         1     2     3     4     5
##   1     0  8610   248   309     0
##   2 36818     0   187   284     0
##   3 32108 16292     0    99     0
##   4 18594 18336   106     0     0
##   5  2808  3942    16    90     0
sum(as.matrix(table(ClaimsTest$bucket2009, PredictTest))*PenaltyMatrix)/nrow(ClaimsTest)
## [1] 0.7578902
# New CART model with loss matrix
ClaimsTree = rpart(bucket2009 ~ age + alzheimers + arthritis + cancer + copd + depression + diabetes + heart.failure + ihd + kidney + osteoporosis + stroke + bucket2008 + reimbursement2008, data=ClaimsTrain, method="class", cp=0.00005, parms=list(loss=PenaltyMatrix))
# Redo predictions and penalty error
PredictTest = predict(ClaimsTree, newdata = ClaimsTest, type = "class")
table(ClaimsTest$bucket2009, PredictTest)
##    PredictTest
##         1     2     3     4     5
##   1 94310 25295  3087   286     0
##   2  7176 18942  8079   643     0
##   3  3590  7706  4692   401     1
##   4  1304  3193  2803   636     1
##   5   135   356   408   156     2
(94310 + 18942 + 4692 + 636 + 2)/nrow(ClaimsTest)
## [1] 0.6472746
sum(as.matrix(table(ClaimsTest$bucket2009, PredictTest))*PenaltyMatrix)/nrow(ClaimsTest)
## [1] 0.6418161
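The accuracy and penalty-error calculations repeated above can be wrapped in two small helpers. The sketch below is one possible way to do that, not part of the original script: the function names `accuracy_of` and `penalty_error_of` are hypothetical, and the confusion matrices are re-entered by hand from the printed output so the snippet runs without ClaimsData.csv.

```r
# Penalty matrix as defined earlier (rows = actual, columns = predicted)
PenaltyMatrix <- matrix(c(0,1,2,3,4,2,0,1,2,3,4,2,0,1,2,6,4,2,0,1,8,6,4,2,0),
                        byrow = TRUE, nrow = 5)

# Confusion matrices copied from the output above
BaselineConfusion <- matrix(c(110138,  7787, 3427, 1452, 174,
                               16000, 10721, 4629, 2931, 559,
                                7006,  4629, 2774, 1621, 360,
                                2688,  1943, 1415, 1539, 352,
                                 293,   191,  160,  309, 104),
                            byrow = TRUE, nrow = 5)
LossCartConfusion <- matrix(c(94310, 25295, 3087, 286, 0,
                               7176, 18942, 8079, 643, 0,
                               3590,  7706, 4692, 401, 1,
                               1304,  3193, 2803, 636, 1,
                                135,   356,  408, 156, 2),
                            byrow = TRUE, nrow = 5)

# Hypothetical helpers: share of correct predictions, and average penalty
accuracy_of <- function(conf) sum(diag(conf)) / sum(conf)
penalty_error_of <- function(conf, penalty) sum(conf * penalty) / sum(conf)

# Side-by-side comparison of the baseline and the loss-matrix tree
comparison <- data.frame(
  model         = c("baseline", "CART with loss matrix"),
  accuracy      = c(accuracy_of(BaselineConfusion),
                    accuracy_of(LossCartConfusion)),
  penalty_error = c(penalty_error_of(BaselineConfusion, PenaltyMatrix),
                    penalty_error_of(LossCartConfusion, PenaltyMatrix))
)
print(comparison)
```

The comparison reproduces the trade-off in the output above: the loss-matrix tree is less accurate than the baseline (about 0.647 against 0.684) but has a lower penalty error (about 0.642 against 0.739), because it makes fewer of the expensive underestimates.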