# set working directory
setwd('C:/Users/agrza/Google Drive/Documents/cuny/Data698/SOALargeClaims')For my Data 698 project, I will build models that project health insurance claimants spending, based on claimant attributes and historical spend data.
I hope to predict one or more of the following items:
whether a claimant will have high cost (e.g. $>25,000) in subsequent period based on current period attributures.
projected claimant health spend, in dollars.
First, load all libraries relevant to the analysis.
# common helper functions
# function for pretty formatting number output
comma <- function(x) noquote(format(x, digits = 2, big.mark = ","))The dataset for this project is located on the Society of Actuaries’ website. The data was collected as part of health inurance claims research study. The original study was completed in the mid 2000s and focused on high cost claimants.
# urls to files
clm97 <- 'https://www.soa.org/Files/Research/1997.zip'
clm98 <- 'https://www.soa.org/Files/Research/1998.zip'
clm99 <- 'https://www.soa.org/Files/Research/1999.zip'
all.clms <- c(clm97, clm98, clm99)
# download zip files
zip.files <- NULL
for (i in 1:length(all.clms)){
zip.files <- c(zip.files,str_extract(all.clms[i], '(?<=/)\\d{4}\\.zip$'))
download.file(all.clms[i], zip.files[i])
}The script below unzips the three separate claim files, and combines the information together in a single data frame.
# read claim files into R as one large data frame
clm.all <- data.frame()
for (i in 1:length(zip.files)){
clm.all <- rbind(clm.all,read.delim(unz(zip.files[i],unzip(zip.files[i],list=T)$Name), sep=","))
}
# save df to RData file
save(clm.all, file = "clm.RData")
rm(clm.all)Going forward, we will refer to the clm.all data frame object saved in the clm.RData file.
# read in clm.all data frame from RData file
load("clm.RData")The following links provide useful information concerning data collection and wrangling, as well as a data dictionary for the claim files:
The researchers for the study collected three years of claims data from 11 separate health insurers. Due to data quality issues, the researchers excluded the submissions from four of the insurers.
The researchers also performed a significant amount of data wrangling before publishing the public claims file:
# load variable descriptions
var.desc <- read.csv('var_description.csv')
# print variables and types
class.df <- data.frame(var=names(clm.all), type=sapply(clm.all, class))
class.df <- merge(var.desc, class.df, by.x='Field',by.y='var')
class.df <- data.frame(class.df[order(class.df$type, class.df$Desc),], row.names=NULL)
kable(class.df) %>%
kable_styling(bootstrap_options = c('striped','hover'), full_width=F, font_size=9)| Field | Desc | type |
|---|---|---|
| RELATION | Claimant Relationship to Policyholder (e.g. self, spouse,dependent) | factor |
| DGCAT | Diagnosis Category Associated with Highest Paid Claims | factor |
| PATSEX | Gender of Claimant | factor |
| EXPOSMEM | Member enrollment information available | factor |
| PPO | Plan Type (PPO vs. Other (e.g. HMO, POC, Indemnity, etc.) | factor |
| DIAG1 | Primary Diagnosis | factor |
| DIAG2 | Secondary Diagnosis | factor |
| DIAG3 | Tertiary Diagnosis | factor |
| CLAIMYR | Year associated with Date of Service | integer |
| PATBRTYR | Year of Birth for Claimant | integer |
| HOSLWCHG | Hospital Allowed Amount | numeric |
| HOSCVCHG | Hospital Charge Amount for Covered Services | numeric |
| HOSPDCHG | Hospital Paid Amount | numeric |
| OTHLWCHG | Other Allowed Amount | numeric |
| OTHCVCHG | Other Covered Charges | numeric |
| OTHPDCHG | Other Paid Amount | numeric |
| DIAG1CHG | Paid Amount for Primary Diagnosis | numeric |
| DIAG2CHG | Paid Amount for Secondary Diagnosis | numeric |
| DIAG3CHG | Paid Amount for Tertiary Diagnosis | numeric |
| PHYLWCHG | Physician Allowed Amount for Covered Services | numeric |
| PHYCVCHG | Physician Charge Amount for Covered Services | numeric |
| PHYPDCHG | Physician Paid Amount for Covered Services | numeric |
| DGCATCHG | Primary Diagnosis Category | numeric |
| TOTLWCHG | Total Allowed Amount | numeric |
| TOTCVCHG | Total Charged Amount for Covered Services | numeric |
| TOTPDCHG | Total Paid Amount | numeric |
| CLAIMANT | Unique Claimant Number | numeric |
# summary of variable types
class.df %>%
group_by(type) %>%
summarise(ct=n()) %>%
data.frame() type ct
1 factor 8
2 integer 2
3 numeric 17
Multiple, traditional variables that are relevant to health claim prediction but are excluded in the public release data set. These missing variables include:
The public release dataset also does excludes a member enrollment file that details the dates that members were active on a health insurance plan. Without this information, I will be unable to identify active members that do not have any health insurance claims.
Any model predictions for this project will be limited to members who have claims in at least two successive incurred periods. In other words, the models in this project will likely not generalize well to a full insured population.
Number of Records, all Years
# dimensions
comma(c(records=nrow(clm.all), fields=ncol(clm.all))) records fields
4,294,030 27
Number of Claimants, Each Year
# record count by year
comma(table(clm.all$CLAIMYR))
1997 1998 1999
1,241,438 1,460,854 1,591,738
# unique claimants
comma(length(unique(clm.all$CLAIMANT)))[1] 2,700,164
# unique claimants by year
clmts97 <- data.frame(x=unique(clm.all$CLAIMANT[clm.all$CLAIMYR == 1997]))
clmts98 <- data.frame(x=unique(clm.all$CLAIMANT[clm.all$CLAIMYR == 1998]))
clmts99 <- data.frame(x=unique(clm.all$CLAIMANT[clm.all$CLAIMYR == 1999]))Unique Claimants in both 1997 and 1998
# claimants in 97 and 98
comma(nrow(inner_join(clmts97,clmts98, by="x")))[1] 736,814
Unique Claimants in both 1998 and 1999
# claimants in 98 and 99
comma(nrow(inner_join(clmts98,clmts99, by="x")))[1] 810,943
Unique Claimants in All 3 Years
# claimants in 97, 98, and 99
comma(nrow(inner_join(inner_join(clmts97,clmts98, by="x"),clmts99, by="x")))[1] 436,811
The dataset provides clamaint birth year and the year claims were incurred. Using this information, we can approximate claimant age.
# create AGE variable
clm.all$AGE <- clm.all$CLAIMYR - clm.all$PATBRTYRMany of the numeric variables are highly correlated. For instance, hospital charges are a subset of total charges; so these two variables should be highly correlated.
Also cost data is presented in three different (and highly correlated ways):
# helper function for displaying numeric summaries
num.summary <- function(x,...){
c(mean=mean(x, ...), # mean
sd=sd(x, ...), # standard deviation
median=median(x, ...), # median
min=min(x, ...), # minimum
max=max(x,...), # maximum
rg=max(x, ...) - min(x, ...), # range
n=length(x), # record count
NAs = sum(is.na(x)), # number of missing values
skew = psych::skew(x,...), # skewnesss
kurt.xs = psych::kurtosi(x,...), # excess kurtosis
out.ct = outliers(x)$numOutliers, # outlier ct (box plot definition)
out.per = 100* outliers(x)$numOutliers / # percentage of outliers
(length(x) - sum(is.na(x)))
)
}
# summarize numeric variables
num.desc <- apply(select_if(clm.all,is.numeric),2,num.summary, na.rm=T)
# remove unnecessary columns
num.desc2 <- num.desc[!rownames(num.desc) %in%
c('rg','out.ct', 'out.per','n'),
!colnames(num.desc) %in%
c('PATBRTYR','CLAIMYR', 'CLAIMANT')]
num.desc2 <- round(num.desc2,0)
num.desc2 <- cbind(t(num.desc2),out.per=round(num.desc["out.per",],1))
# print descriptive summary for numeric variables
kable(num.desc2,format.args = list(big.mark=",")) %>%
kable_styling(bootstrap_options = c('striped','hover'), full_width=F,font_size=9) %>%
pack_rows('Hospital Claims', 1,3) %>%
pack_rows('Physician Claims', 4,6) %>%
pack_rows('Other Claims',7,9) %>%
pack_rows('Total Claims',10,12) %>%
pack_rows('Claims Related to Diagnosis',13,16) %>%
pack_rows('Claimant Attributes',17,17)| mean | sd | median | min | max | NAs | skew | kurt.xs | out.per | |
|---|---|---|---|---|---|---|---|---|---|
| Hospital Claims | |||||||||
| HOSCVCHG | 2,524 | 11,072 | 491 | -27,630 | 1,781,017 | 3,160,414 | 32 | 2,432 | 0.0 |
| HOSLWCHG | 2,199 | 9,153 | 354 | -27,630 | 839,633 | 3,714,064 | 20 | 776 | 0.0 |
| HOSPDCHG | 1,944 | 9,803 | 331 | -6,200 | 1,779,990 | 2,961,766 | 38 | 3,273 | 0.1 |
| Physician Claims | |||||||||
| PHYCVCHG | 809 | 2,150 | 274 | -3,402 | 611,751 | 1,788,144 | 35 | 4,961 | 11.8 |
| PHYLWCHG | 732 | 2,013 | 244 | -3,248 | 611,751 | 3,171,662 | 46 | 9,080 | 12.8 |
| PHYPDCHG | 620 | 1,921 | 175 | -3,338 | 610,263 | 1,441,761 | 42 | 6,711 | 12.0 |
| Other Claims | |||||||||
| OTHCVCHG | 517 | 2,575 | 145 | -10,206 | 439,793 | 2,631,444 | 81 | 9,801 | 11.9 |
| OTHLWCHG | 350 | 1,160 | 79 | -6,969 | 118,741 | 3,861,376 | 25 | 1,382 | 11.9 |
| OTHPDCHG | 443 | 2,433 | 100 | -6,969 | 439,137 | 2,301,455 | 83 | 10,443 | 12.7 |
| Total Claims | |||||||||
| TOTCVCHG | 2,186 | 9,090 | 493 | -27,021 | 1,840,372 | 1,664,560 | 34 | 2,935 | 11.0 |
| TOTLWCHG | 1,826 | 8,350 | 353 | -33,226 | 2,568,512 | 1,896,167 | 39 | 5,296 | 12.1 |
| TOTPDCHG | 1,646 | 8,212 | 298 | 0 | 2,568,512 | 0 | 44 | 5,509 | 11.8 |
| Claims Related to Diagnosis | |||||||||
| DIAG1CHG | 1,101 | 6,278 | 187 | 0 | 2,568,512 | 0 | 60 | 11,400 | 12.7 |
| DIAG2CHG | 366 | 1,718 | 87 | -114,835 | 456,940 | 1,158,758 | 44 | 4,668 | 13.1 |
| DIAG3CHG | 203 | 825 | 60 | -9,686 | 228,159 | 1,912,542 | 40 | 4,466 | 13.2 |
| DGCATCHG | 1,249 | 6,823 | 206 | 0 | 2,568,512 | 0 | 54 | 9,054 | 13.3 |
| Claimant Attributes | |||||||||
| AGE | 32 | 19 | 33 | -21 | 159 | 0 | 0 | -1 | 13.0 |
clm.lng <- select_if(clm.all, is.numeric)[c(4:19)]
clm.lng <- cbind(YR=as.character(clm.all$CLAIMYR),clm.lng) %>%
gather(key="type","amt", 2:17) %>%
mutate(logamt = ifelse(amt < 1,NA, log(amt)))
# plot numeric dollar variables
ggplot(clm.lng, aes(x=logamt, fill=YR)) +
geom_density(aes(y=..density..),alpha=0.3) + theme_bw() +
facet_wrap(~ type, ncol=4)
ggsave('dollar_density.png')
rm(clm.lng)Density Plot of Claimant Amounts
# plot of age
ggplot(clm.all, aes(AGE, fill=as.character(CLAIMYR))) +
geom_density(aes(y=..density..),alpha=0.3) + theme_bw() +
scale_fill_discrete(name = "YR")Sex
#Patient Sex percentages
comma(prop.table(table(clm.all$PATSEX, useNA='ifany')))
F M
0.56 0.44
PPO Indicator
# Plan Type PPO indicator percentages
comma(prop.table(table(clm.all$PPO,useNA='ifany')))
N Y
0.52 0.48
Exposure Record Indicator
# Enrollment Records Indicator, percentages?
comma(prop.table(table(clm.all$EXPOSMEM, useNA='ifany')))
N Y
0.76 0.24
Member Relation to Subscriber
# Patient Relation
comma(round(prop.table(table(clm.all$RELATION, useNA='ifany')),3))
D E S X
0.32 0.48 0.20 0.00
# Top High level Diagnosis Category
top.dx.cat <- data.frame(rcd.ct=prop.table(sort(table(clm.all$DGCAT), decreasing=T))*100)
names(top.dx.cat) <- c('dx.cat','% rcds')
comma(top.dx.cat) dx.cat % rcds
1 Unknown 14.50
2 Health Status or Service 13.81
3 Respiratory System 10.10
4 Injury & Poisoning 9.88
5 Symptoms & Ill-Defined Conditions 9.84
6 Skeleton & Muscle System 7.72
7 Genitourinary System 5.66
8 Sense Organs 4.79
9 Circulatory System 3.40
10 Digestive System 3.33
11 Skin Disorders 3.07
12 Mental Disorders, Drug, Alcohol 2.74
13 Malignant Neoplasms 2.65
14 Endocrine & Metabolic Disorders 2.32
15 Pregnancy & Childbirth 2.28
16 Infectious & Parasitic Disease 2.22
17 Nervous System 0.90
18 Congenital & Perinatal 0.54
19 Blood Related Disorders 0.26
Diagnosis Codes
Total unique codes in dataset:
# find unique diagnosis codes
dx1 <- as.character(clm.all$DIAG1)
dx2 <- as.character(clm.all$DIAG2)
dx3 <- as.character(clm.all$DIAG3)
dx.all <- c(dx1,dx2,dx3)
dx.all.unq <- unique(dx.all)
# number of unique dx code
length(dx.all.unq)[1] 1132
# possible unknown dx codes
datatable(data.frame(sort(table(dx.all[!is_defined(dx.all)]), decreasing=T)),colnames = c('dx','ct'), options = list(
columnDefs = list(list(className = 'dt-right', targets = 0:1)) ),rownames=F) %>%
formatCurrency('Freq',currency ='', mark=",", digits=0)# helper function cleanup diagnosis codes
clean_icd9 <- function(x){
# clean up decimal point issues
if (str_detect(x,'^\\d{2}\\.$')){
x <- paste0("0",substr(x,1,2))
# remap expired hiv codes
} else if (x %in% c('044','043')) {
x <- '042'
# standardize no diagnosis coding
} else if (x %in% c('', '.', '000')){
x <- '000'
}
# redefine unmapped codes as "UNK"; standalone if statement
if (!is_defined(x) & !str_detect(x,"\\.|E\\d{2}|Z79|000|^\\s*$")){
x <- 'UNK'
}
x
}
# mapping of original to clean dx mappings; retain changes only
dx.map <- sapply(dx.all.unq, clean_icd9)
dx.map <- data.frame(orig.dx = dx.all.unq, new.dx = dx.map, stringsAsFactors = F)
dx.map <- dx.map[dx.map$orig.dx != dx.map$new.dx,]
dx1.cln <- ifelse(dx1 %in% dx.map$orig.dx, dx.map$new.dx[match(dx1,dx.map$orig.dx)], dx1)
dx2.cln <- ifelse(dx2 %in% dx.map$orig.dx, dx.map$new.dx[match(dx2,dx.map$orig.dx)], dx2)
dx3.cln <- ifelse(dx3 %in% dx.map$orig.dx, dx.map$new.dx[match(dx3,dx.map$orig.dx)], dx3)
dx.all.cln <- c(dx1.cln,dx2.cln,dx3.cln)
dx.all.cln.unq <- unique(dx.all.cln)
# fix original claim file dx
clm.all$DIAG1.CLN <- as.factor(dx1.cln)
clm.all$DIAG2.CLN <- as.factor(dx2.cln)
clm.all$DIAG3.CLN <- as.factor(dx3.cln)Scrubbed Diagnosis Codes
Number of unique, scrubbed diagnosis codes:
# length clean dx
length(dx.all.cln.unq)[1] 1024
Here is a breakdown of record counts by diagnosis type:
# all dx type
dx.type <- ifelse(!(dx.all.cln %in% c('UNK','000')),'valid_cd',dx.all.cln)
# % record counts by dx type
round(prop.table(table(dx.type)),3)dx.type
000 UNK valid_cd
0.029 0.002 0.969
The goal for this project is to make cost projections for members in a subsequent period based on attributes and cost history in the current period.
Therefore, we should look at some bivariate relationship between member attributes and subsequent costs. For the charts below, we use insure paid amounts as our cost amount in the later period.
# claimants in both 97 and 98
clmt.9798 <- inner_join(clmts97,clmts98, by="x")
# claims data for claimants in both 97 and 98
clms.9798 <- inner_join(clm.all,clmt.9798, by = c("CLAIMANT"="x"))
clms.9798 <- clms.9798[clms.9798$CLAIMYR %in% c(1997,1998),]
# claimants in both years, 98 paid totals
pd.98 <- clms.9798 %>%
filter(CLAIMYR == 1998) %>%
select(CLAIMANT,TOTPDCHG)
names(pd.98) <-c('CLMT98','TOTPDCHG98')
# 97 all attributres, 98 paid only
clms.9798 <- clms.9798 %>%
inner_join(pd.98, by=c("CLAIMANT"="CLMT98")) %>%
filter(CLAIMYR == 1997)Future Costs Vs. Categorical Variables
# relationship relationship to pd98
p1 <- ggplot(clms.9798, aes(RELATION, log(TOTPDCHG98))) + geom_boxplot() + theme_bw()
# relationship of sex to pd98
p2 <- ggplot(clms.9798, aes(PATSEX, log(TOTPDCHG98))) + geom_boxplot() + theme_bw()
# relationship of ppo plan to pd98
p3 <- ggplot(clms.9798, aes(PPO, log(TOTPDCHG98))) + geom_boxplot() + theme_bw()
# relationship of ppo plan to pd98
p4 <- ggplot(clms.9798, aes(EXPOSMEM, log(TOTPDCHG98))) + geom_boxplot() + theme_bw()
# major diagnosis category to pd98
p5 <- ggplot(clms.9798, aes(substr(DGCAT,1,2), log(TOTPDCHG98))) + geom_boxplot() + theme_bw()
grid.arrange(p1,p2,p3,p4,p5, ncol=2)Future Costs Vs. Numeric Variables
# compare dollar attributes in 97 to pd 98
clms.9798.num <- select_if(clms.9798, is.numeric)
clms.9798.num <- clms.9798.num[,c(4:19,21)]
clms.9798.num <- clms.9798.num %>%
gather('clmtype','amt',1:16) %>%
mutate(logamt = ifelse(amt < 1,NA, log(amt))) %>%
mutate(logpd98 = ifelse(TOTPDCHG98 < 1,NA, log(TOTPDCHG98)))
ggplot(clms.9798.num, aes(x=logamt,y=logpd98)) + geom_point(alpha=0.3) + theme_bw() + facet_wrap(~ clmtype, ncol=4)
ggsave('num97_num98.png')
rm(clms.9798.num)Scatter Plots of 97 Amts vs. 98 Pd
Future Paid vs. Age
# ages vs 98 paid
ggplot(clms.9798,aes(AGE,log(TOTPDCHG98))) + geom_point(alpha=0.3) + theme_bw()
ggsave('age_pd98.png')Scatter Plot of Age vs. 98 Pd
Avg Future Paid vs. Average Age
#avg age vs 98 paid
clms.9798 %>%
group_by(AGE) %>%
summarise(avg_cost=mean(TOTPDCHG98)) %>%
ggplot(aes(AGE,avg_cost)) + geom_point() + theme_bw()# correlation plots
clms.9798.num <- select_if(clms.9798, is.numeric)
clms.9798.num <- clms.9798.num[,4:21]
# correlation of charges
mycorr <- rcorr(as.matrix(clms.9798.num))
corrplot(mycorr$r,type='upper',order = "hclust",
tl.col = "black", tl.srt = 90, tl.cex=0.5)The R package, icd, allows us to examine claimant mortality risk using what is known as a Charlson Risk Score, which is calculated using claimant diagnoses. This information could provide useful way to effectively reduce the dimensionality of the data set.
# all 3 diagnoses for individuals in 97
clm.dx97 <- clm.all[clm.all$CLAIMYR == 1997,] %>%
select(CLAIMANT, DIAG1.CLN,DIAG2.CLN, DIAG3.CLN) %>%
gather("dxpos","dx",2:4)
# charlson risk scores
c_score <- charlson(x = clm.dx97, visit_name = "CLAIMANT",
icd_name = "dx")
table(c_score)c_score
0 1 2 3 4 5 6 7 8
1149600 67463 20567 1024 42 3 2620 85 33
12
1
The icd package also allows us bucket members into high-level comorbidity categories based on claimant diagnoses:
Here is an example:
# comorbidity groupings
ahrq_comborbid <- comorbid_ahrq(clm.dx97,
visit_name = "CLAIMANT",
icd_name = "dx")
# example output comorbidity groupings
head(ahrq_comborbid[,1:5],20) CHF Valvular PHTN PVD HTN
1 FALSE FALSE FALSE FALSE FALSE
2 FALSE FALSE FALSE FALSE FALSE
3 FALSE FALSE FALSE FALSE FALSE
4 FALSE FALSE FALSE FALSE FALSE
5 FALSE FALSE FALSE FALSE FALSE
6 FALSE FALSE FALSE FALSE FALSE
7 FALSE FALSE FALSE TRUE FALSE
8 FALSE FALSE FALSE TRUE FALSE
9 FALSE FALSE FALSE FALSE FALSE
10 TRUE FALSE FALSE FALSE FALSE
11 FALSE FALSE FALSE FALSE FALSE
12 FALSE FALSE FALSE FALSE FALSE
13 TRUE FALSE FALSE FALSE FALSE
14 FALSE FALSE FALSE FALSE FALSE
15 FALSE FALSE FALSE FALSE FALSE
16 FALSE FALSE FALSE FALSE FALSE
17 FALSE FALSE FALSE FALSE FALSE
18 FALSE FALSE FALSE FALSE FALSE
19 FALSE FALSE FALSE FALSE FALSE
20 FALSE FALSE FALSE FALSE FALSE