The D2Hawkeye Story

D2Hawkeye

  • Founded by Chris Kryder, MD, MBA in 2001

  • Combines expert knowledge and databases with analytics to improve quality and cost management in healthcare

  • Located in Massachusetts, USA; grew very fast and was sold to Verisk Analytics in 2009

Healthcare Case Management

  • D2Hawkeye tries to improve healthcare case management
    • Identify high-risk patients
    • Work with patients to manage treatment and associated costs
    • Arrange specialist care
  • Medical costs often relate to severity of health problems, and are an issue for both patient and provider

  • Goal: improve the quality of cost predictions

Impact

  • Many different types of clients
    • Third party administrators of medical claims
    • Case management companies
    • Benefit consultants
    • Health plans
  • Millions of people analyzed monthly through analytic platform in 2009

  • Thousands of employers processed monthly

Pre-Analytics Approach

  • Human judgement - MDs manually analyzed patient histories and developed

  • Limited data sets

  • Costly and inefficient

  • Can we use analytics instead?

Data Sources

  • Healthcare industry is data-rich, but data may be hard to access
    • Unstructured - doctor’s notes
    • Unavailable - hard to get due to differences in technology
    • Inaccessible - strong privacy laws around healthcare data sharing
  • What is available?

  • Claims data
    • Requests for reimbursement submitted to insurance companies or state-provided insurance from doctors, hospitals, and pharmacies.
  • Eligibility information
  • Demographic information

Claims Data

  • Rich, structured data source
  • Very high dimension
  • Doesn’t capture all aspects of a person’s treatment or health - many things must be inferred
  • Unlike electronic medical records, we do not know the results of a test, only that a test was administered

D2Hawkeye’s Claims Data

  • Available: claims data for 2.4 million people over a span of 3 years

  • Include only people with data for at least 10 months in both periods - 400,000 people

Variables / Cost Profiles

  • Variables
    • Chronic condition cost indicators
    • Gender and age

Cost Variables

Medical Interpretation of Buckets

Error Measures

  • Typically we use R² or accuracy, but other measures can be used

  • In the case of D2Hawkeye, misclassifying a high-cost patient is worse than misclassifying a low-cost patient

  • Use a “penalty error” to capture this asymmetry

Penalty Error

  • Key idea: use asymmetric penalties
  • Define a “penalty matrix” as the cost of being wrong
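
As a minimal sketch of the idea, the penalty error is the average penalty over patients, where each patient's penalty is looked up by the pair (actual bucket, predicted bucket). The matrix is the one defined in the R section of this document, with rows as actual buckets and columns as predicted buckets. The four patients in `actual` and `predicted` are made up for illustration only.

```r
# Penalty matrix from the R section: rows = actual bucket, columns = predicted bucket.
PenaltyMatrix <- matrix(c(0, 1, 2, 3, 4,
                          2, 0, 1, 2, 3,
                          4, 2, 0, 1, 2,
                          6, 4, 2, 0, 1,
                          8, 6, 4, 2, 0), byrow = TRUE, nrow = 5)

# Illustrative (made-up) patients: actual and predicted cost buckets.
actual    <- c(1, 5, 3, 2)
predicted <- c(1, 1, 4, 2)

# Indexing a matrix with a two-column matrix picks one (row, column) cell per patient.
penalties <- PenaltyMatrix[cbind(actual, predicted)]
penalties        # 0 8 1 0
mean(penalties)  # 2.25
```

The second patient (actual bucket 5, predicted bucket 1) carries a penalty of 8, while the mirror-image mistake (actual 1, predicted 5) would cost only 4. That difference is the asymmetry the penalty matrix encodes.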

Baseline

  • Baseline is to simply predict that the cost in the next “period” will be the cost in the current period

  • Accuracy of 75%
  • Penalty Error of 0.56

Multi-class Classification

  • We are predicting a bucket number

Most Important Factors

  • First splits are related to cost

Secondary Factors

  • Risk factors
  • Chronic Illness
  • “Q146”
    • Asthma + depression
  • “Q1”
    • Risk factor indicating hylan injection
    • Possible knee replacement or arthroscopy

Example Groups for Bucket 5

  • Under 35 years old, between $3300 and $3900 in claims, C.A.D., but no office visits in last year.

  • Claims between $3900 and $43,000 with at least $8000 paid in last 12 months, $4300 in pharmacy claims, acute cost profile and cancer diagnosis

  • More than $58,000 in claims, at least $55,000 paid in last 12 months, and not an acute profile

Insights

  • Substantial improvement over the baseline
  • Double accuracy over baseline in some cases
  • Smaller accuracy improvement on bucket 5, but much lower penalty

Analytics Provide an Edge

  • Substantial improvement in D2Hawkeye’s ability to identify patients who need more attention

  • Because the model was interpretable, physicians were able to improve the model by identifying new variables and refining existing variables

  • Analytics gave D2Hawkeye an edge over competition using “last century” methods

The D2Hawkeye Story in R

Read in the data

# Read in the data
Claims = read.csv("ClaimsData.csv")
# Output structure
str(Claims)
## 'data.frame':    458005 obs. of  16 variables:
##  $ age              : int  85 59 67 52 67 68 75 70 67 67 ...
##  $ alzheimers       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ arthritis        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ cancer           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ copd             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ depression       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ diabetes         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ heart.failure    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ihd              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ kidney           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ osteoporosis     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ stroke           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reimbursement2008: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ bucket2008       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ reimbursement2009: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ bucket2009       : int  1 1 1 1 1 1 1 1 1 1 ...

Split the data

# Percentage of patients in each cost bucket
table(Claims$bucket2009)/nrow(Claims)
## 
##           1           2           3           4           5 
## 0.671267781 0.190170413 0.089466272 0.043324855 0.005770679

# Split the data
library(caTools)

set.seed(88)

spl = sample.split(Claims$bucket2009, SplitRatio = 0.6)

ClaimsTrain = subset(Claims, spl==TRUE)

ClaimsTest = subset(Claims, spl==FALSE)
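
The split is made on `bucket2009` so that both sets keep roughly the bucket proportions printed above. The following base-R sketch illustrates that idea on simulated buckets. It is not the caTools implementation; `stratified_split` is a hypothetical helper, and the simulated data only borrows the proportions from the output above.

```r
# Hypothetical helper: mark a fixed fraction of rows as training within each outcome class.
stratified_split <- function(y, ratio) {
  in_train <- logical(length(y))
  for (idx in split(seq_along(y), y)) {
    n_take <- round(ratio * length(idx))
    in_train[idx[sample.int(length(idx), n_take)]] <- TRUE
  }
  in_train
}

set.seed(88)
# Simulated outcomes drawn with the bucket proportions printed above (not the real data).
probs  <- c(0.671267781, 0.190170413, 0.089466272, 0.043324855, 0.005770679)
bucket <- factor(sample(1:5, size = 10000, replace = TRUE, prob = probs), levels = 1:5)

spl <- stratified_split(bucket, ratio = 0.6)

# Bucket proportions in the training and test portions should be nearly identical.
train_props <- prop.table(table(bucket[spl]))
test_props  <- prop.table(table(bucket[!spl]))
round(rbind(train = train_props, test = test_props), 4)
```

Sampling within each bucket matters most for the rare bucket 5, which a purely random split could leave under-represented in one of the two sets.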

Baseline Method

# Baseline method
table(ClaimsTest$bucket2009, ClaimsTest$bucket2008)
##    
##          1      2      3      4      5
##   1 110138   7787   3427   1452    174
##   2  16000  10721   4629   2931    559
##   3   7006   4629   2774   1621    360
##   4   2688   1943   1415   1539    352
##   5    293    191    160    309    104

(110138 + 10721 + 2774 + 1539 + 104)/nrow(ClaimsTest)
## [1] 0.6838135
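
The diagonal sum above is typed by hand. `diag()` gives the same overall accuracy and, divided by the row totals, the accuracy within each actual bucket. The sketch below re-enters the printed baseline table as a matrix so that it runs without `ClaimsData.csv`; the names `conf`, `overall_accuracy` and `per_bucket` are illustrative.

```r
# Baseline confusion table copied from the output above:
# rows = actual bucket2009, columns = bucket2008 (the baseline prediction).
conf <- matrix(c(110138,  7787, 3427, 1452, 174,
                  16000, 10721, 4629, 2931, 559,
                   7006,  4629, 2774, 1621, 360,
                   2688,  1943, 1415, 1539, 352,
                    293,   191,  160,  309, 104),
               byrow = TRUE, nrow = 5,
               dimnames = list(actual = as.character(1:5),
                               predicted = as.character(1:5)))

# Overall accuracy: correctly classified patients over all patients.
overall_accuracy <- sum(diag(conf)) / sum(conf)
overall_accuracy  # about 0.6838, matching the hand-typed calculation

# Accuracy within each actual bucket (correct predictions / patients in that bucket).
per_bucket <- diag(conf) / rowSums(conf)
round(per_bucket, 3)
```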

Create Penalty Matrix

# Penalty Matrix
PenaltyMatrix = matrix(c(0,1,2,3,4,2,0,1,2,3,4,2,0,1,2,6,4,2,0,1,8,6,4,2,0), byrow=TRUE, nrow=5)

PenaltyMatrix
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    1    2    3    4
## [2,]    2    0    1    2    3
## [3,]    4    2    0    1    2
## [4,]    6    4    2    0    1
## [5,]    8    6    4    2    0

# Penalty Error of Baseline Method
as.matrix(table(ClaimsTest$bucket2009, ClaimsTest$bucket2008))*PenaltyMatrix
##    
##         1     2     3     4     5
##   1     0  7787  6854  4356   696
##   2 32000     0  4629  5862  1677
##   3 28024  9258     0  1621   720
##   4 16128  7772  2830     0   352
##   5  2344  1146   640   618     0

sum(as.matrix(table(ClaimsTest$bucket2009, ClaimsTest$bucket2008))*PenaltyMatrix)/nrow(ClaimsTest)
## [1] 0.7386055
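
To see where the baseline's penalty comes from, the weighted table can be split into under-predictions (actual bucket higher than predicted, below the diagonal) and over-predictions (above the diagonal). This is a sketch using the counts printed above, re-entered by hand so that it runs on its own; the variable names are illustrative.

```r
# Penalty matrix and baseline confusion table, copied from the output above
# (rows = actual bucket, columns = predicted bucket).
PenaltyMatrix <- matrix(c(0, 1, 2, 3, 4,
                          2, 0, 1, 2, 3,
                          4, 2, 0, 1, 2,
                          6, 4, 2, 0, 1,
                          8, 6, 4, 2, 0), byrow = TRUE, nrow = 5)
conf <- matrix(c(110138,  7787, 3427, 1452, 174,
                  16000, 10721, 4629, 2931, 559,
                   7006,  4629, 2774, 1621, 360,
                   2688,  1943, 1415, 1539, 352,
                    293,   191,  160,  309, 104), byrow = TRUE, nrow = 5)

# Element-wise product: total penalty contributed by each cell.
weighted <- conf * PenaltyMatrix
penalty_error <- sum(weighted) / sum(conf)
penalty_error  # about 0.7386, matching the value above

# Below the diagonal: actual > predicted (cost was under-predicted).
under <- sum(weighted[lower.tri(weighted)])
# Above the diagonal: actual < predicted (cost was over-predicted).
over  <- sum(weighted[upper.tri(weighted)])
round(c(under = under, over = over) / sum(weighted), 3)
```

On these counts, about 74% of the baseline's total penalty comes from under-predicted patients, which is the kind of error the penalty matrix is designed to punish.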

CART Model

# Load necessary libraries
library(rpart)
library(rpart.plot)

# CART model
ClaimsTree = rpart(bucket2009 ~ age + alzheimers + arthritis + cancer + copd + depression + diabetes + heart.failure + ihd + kidney + osteoporosis + stroke + bucket2008 + reimbursement2008, data=ClaimsTrain, method="class", cp=0.00005)
# Plot CART
prp(ClaimsTree)



# Make predictions
PredictTest = predict(ClaimsTree, newdata = ClaimsTest, type = "class")

table(ClaimsTest$bucket2009, PredictTest)
##    PredictTest
##          1      2      3      4      5
##   1 114141   8610    124    103      0
##   2  18409  16102    187    142      0
##   3   8027   8146    118     99      0
##   4   3099   4584     53    201      0
##   5    351    657      4     45      0

(114141 + 16102 + 118 + 201 + 0)/nrow(ClaimsTest)
## [1] 0.7126669

# Penalty Error
as.matrix(table(ClaimsTest$bucket2009, PredictTest))*PenaltyMatrix
##    PredictTest
##         1     2     3     4     5
##   1     0  8610   248   309     0
##   2 36818     0   187   284     0
##   3 32108 16292     0    99     0
##   4 18594 18336   106     0     0
##   5  2808  3942    16    90     0

sum(as.matrix(table(ClaimsTest$bucket2009, PredictTest))*PenaltyMatrix)/nrow(ClaimsTest)
## [1] 0.7578902

CART model with loss matrix

# New CART model with loss matrix
ClaimsTree = rpart(bucket2009 ~ age + alzheimers + arthritis + cancer + copd + depression + diabetes + heart.failure + ihd + kidney + osteoporosis + stroke + bucket2008 + reimbursement2008, data=ClaimsTrain, method="class", cp=0.00005, parms=list(loss=PenaltyMatrix))

# Redo predictions and penalty error
PredictTest = predict(ClaimsTree, newdata = ClaimsTest, type = "class")

table(ClaimsTest$bucket2009, PredictTest)
##    PredictTest
##         1     2     3     4     5
##   1 94310 25295  3087   286     0
##   2  7176 18942  8079   643     0
##   3  3590  7706  4692   401     1
##   4  1304  3193  2803   636     1
##   5   135   356   408   156     2

(94310 + 18942 + 4692 + 636 + 2)/nrow(ClaimsTest)
## [1] 0.6472746

sum(as.matrix(table(ClaimsTest$bucket2009, PredictTest))*PenaltyMatrix)/nrow(ClaimsTest)
## [1] 0.6418161
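
For a side-by-side view, the two CART confusion tables printed above can be scored with one helper. `evaluate` is a hypothetical function name, and the tables are re-entered by hand so that the sketch runs without refitting the trees.

```r
# Penalty matrix from above: rows = actual bucket, columns = predicted bucket.
PenaltyMatrix <- matrix(c(0, 1, 2, 3, 4,
                          2, 0, 1, 2, 3,
                          4, 2, 0, 1, 2,
                          6, 4, 2, 0, 1,
                          8, 6, 4, 2, 0), byrow = TRUE, nrow = 5)

# Hypothetical helper: accuracy and penalty error from a confusion matrix.
evaluate <- function(conf, penalty) {
  c(accuracy      = sum(diag(conf)) / sum(conf),
    penalty_error = sum(conf * penalty) / sum(conf))
}

# Confusion table of the first CART model (no loss matrix), copied from the output.
cart_plain <- matrix(c(114141,  8610, 124, 103, 0,
                        18409, 16102, 187, 142, 0,
                         8027,  8146, 118,  99, 0,
                         3099,  4584,  53, 201, 0,
                          351,   657,   4,  45, 0), byrow = TRUE, nrow = 5)

# Confusion table of the CART model fitted with the loss matrix, copied from the output.
cart_loss <- matrix(c(94310, 25295, 3087, 286, 0,
                       7176, 18942, 8079, 643, 0,
                       3590,  7706, 4692, 401, 1,
                       1304,  3193, 2803, 636, 1,
                        135,   356,  408, 156, 2), byrow = TRUE, nrow = 5)

results <- rbind(cart           = evaluate(cart_plain, PenaltyMatrix),
                 cart_with_loss = evaluate(cart_loss, PenaltyMatrix))
round(results, 4)
```

The loss-matrix tree gives up some accuracy (about 0.71 down to about 0.65) in exchange for a lower penalty error (about 0.76 down to about 0.64), because it shifts predictions toward the higher buckets where mistakes are more expensive.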