# set working directory
setwd('C:/Users/agrza/Google Drive/Documents/cuny/Data698/SOALargeClaims')

Project Overview

For my Data 698 project, I will build models that project health insurance claimants spending, based on claimant attributes and historical spend data.

I hope to predict one or more of the following items:

Initial Setup

First, load all libraries relevant to the analysis.

# common helper functions
#   function for pretty formatting number output
comma <- function(x) noquote(format(x, digits = 2, big.mark = ","))

Data Retrieval

The dataset for this project is located on the Society of Actuaries’ website. The data was collected as part of health inurance claims research study. The original study was completed in the mid 2000s and focused on high cost claimants.

Download Files

# urls to files
clm97 <- 'https://www.soa.org/Files/Research/1997.zip'
clm98 <- 'https://www.soa.org/Files/Research/1998.zip'
clm99 <- 'https://www.soa.org/Files/Research/1999.zip'
all.clms <- c(clm97, clm98, clm99)

# download zip files
zip.files <- NULL
for (i in 1:length(all.clms)){
  zip.files <- c(zip.files,str_extract(all.clms[i], '(?<=/)\\d{4}\\.zip$'))
  download.file(all.clms[i], zip.files[i])
}

Read Files

The script below unzips the three separate claim files, and combines the information together in a single data frame.

# read claim files into R as one large data frame
clm.all <- data.frame() 
for (i in 1:length(zip.files)){
  clm.all <- rbind(clm.all,read.delim(unz(zip.files[i],unzip(zip.files[i],list=T)$Name), sep=","))
}
# save df to RData file 
save(clm.all, file = "clm.RData")
rm(clm.all)

Load Data

Going forward, we will refer to the clm.all data frame object saved in the clm.RData file.

# read in clm.all data frame from RData file
load("clm.RData")

Data Documentation

The following links provide useful information concerning data collection and wrangling, as well as a data dictionary for the claim files:

The researchers for the study collected three years of claims data from 11 separate health insurers. Due to data quality issues, the researchers excluded the submissions from four of the insurers.

The researchers also performed a significant amount of data wrangling before publishing the public claims file:

Data Exploration

Variable Overview

# load variable descriptions
var.desc <- read.csv('var_description.csv')

# print variables and types
class.df <- data.frame(var=names(clm.all), type=sapply(clm.all, class))
class.df <- merge(var.desc, class.df, by.x='Field',by.y='var')
class.df <- data.frame(class.df[order(class.df$type, class.df$Desc),], row.names=NULL)
kable(class.df)  %>%
  kable_styling(bootstrap_options = c('striped','hover'), full_width=F, font_size=9)
Field Desc type
RELATION Claimant Relationship to Policyholder (e.g. self, spouse,dependent) factor
DGCAT Diagnosis Category Associated with Highest Paid Claims factor
PATSEX Gender of Claimant factor
EXPOSMEM Member enrollment information available factor
PPO Plan Type (PPO vs. Other (e.g. HMO, POC, Indemnity, etc.) factor
DIAG1 Primary Diagnosis factor
DIAG2 Secondary Diagnosis factor
DIAG3 Tertiary Diagnosis factor
CLAIMYR Year associated with Date of Service integer
PATBRTYR Year of Birth for Claimant integer
HOSLWCHG Hospital Allowed Amount numeric
HOSCVCHG Hospital Charge Amount for Covered Services numeric
HOSPDCHG Hospital Paid Amount numeric
OTHLWCHG Other Allowed Amount numeric
OTHCVCHG Other Covered Charges numeric
OTHPDCHG Other Paid Amount numeric
DIAG1CHG Paid Amount for Primary Diagnosis numeric
DIAG2CHG Paid Amount for Secondary Diagnosis numeric
DIAG3CHG Paid Amount for Tertiary Diagnosis numeric
PHYLWCHG Physician Allowed Amount for Covered Services numeric
PHYCVCHG Physician Charge Amount for Covered Services numeric
PHYPDCHG Physician Paid Amount for Covered Services numeric
DGCATCHG Primary Diagnosis Category numeric
TOTLWCHG Total Allowed Amount numeric
TOTCVCHG Total Charged Amount for Covered Services numeric
TOTPDCHG Total Paid Amount numeric
CLAIMANT Unique Claimant Number numeric
# summary of variable types
class.df %>% 
  group_by(type) %>% 
  summarise(ct=n()) %>% 
  data.frame()
     type ct
1  factor  8
2 integer  2
3 numeric 17

Multiple, traditional variables that are relevant to health claim prediction but are excluded in the public release data set. These missing variables include:

  • member geography (e.g. zipcode, county, state)
  • claimant plan benefits (e.g. deductible, copay)
  • Subscriber’s industry (eg. manufacturing, financial services)

The public release dataset also does excludes a member enrollment file that details the dates that members were active on a health insurance plan. Without this information, I will be unable to identify active members that do not have any health insurance claims.

Any model predictions for this project will be limited to members who have claims in at least two successive incurred periods. In other words, the models in this project will likely not generalize well to a full insured population.

Record Counts

Number of Records, all Years

# dimensions 
comma(c(records=nrow(clm.all), fields=ncol(clm.all)))
  records    fields 
4,294,030        27 

Number of Claimants, Each Year

# record count by year
comma(table(clm.all$CLAIMYR))

     1997      1998      1999 
1,241,438 1,460,854 1,591,738 

Unique Claimant Counts

# unique claimants
comma(length(unique(clm.all$CLAIMANT)))
[1] 2,700,164
# unique claimants by year
clmts97 <- data.frame(x=unique(clm.all$CLAIMANT[clm.all$CLAIMYR == 1997]))
clmts98 <- data.frame(x=unique(clm.all$CLAIMANT[clm.all$CLAIMYR == 1998]))
clmts99 <- data.frame(x=unique(clm.all$CLAIMANT[clm.all$CLAIMYR == 1999]))

Unique Claimants in both 1997 and 1998

# claimants in 97 and 98
comma(nrow(inner_join(clmts97,clmts98, by="x")))
[1] 736,814

Unique Claimants in both 1998 and 1999

# claimants in 98 and 99
comma(nrow(inner_join(clmts98,clmts99, by="x")))
[1] 810,943

Unique Claimants in All 3 Years

# claimants in 97, 98, and 99
comma(nrow(inner_join(inner_join(clmts97,clmts98, by="x"),clmts99, by="x")))
[1] 436,811

Initial Feature Engineering

The dataset provides clamaint birth year and the year claims were incurred. Using this information, we can approximate claimant age.

# create AGE variable
clm.all$AGE <- clm.all$CLAIMYR - clm.all$PATBRTYR

Numeric Variables

Many of the numeric variables are highly correlated. For instance, hospital charges are a subset of total charges; so these two variables should be highly correlated.

Also cost data is presented in three different (and highly correlated ways):

  • Provider Charge (see “xxxCVCHG” variables)
  • Allowed Charge - this represents the total amount the provider is paid (including insurer payment and member copays) (See “xxxLWCVG” variables).
  • Paid Amount - this represent the amount paid by the insurer (i.e. provider paid - member cost sharing). See the “xxxPDCHG” variables.
# helper function for displaying numeric summaries
num.summary <- function(x,...){
  c(mean=mean(x, ...),                        # mean
    sd=sd(x, ...),                            # standard deviation  
    median=median(x, ...),                    # median 
    min=min(x, ...),                          # minimum
    max=max(x,...),                           # maximum
    rg=max(x, ...) - min(x, ...),             # range
    n=length(x),                              # record count
    NAs = sum(is.na(x)),                      # number of missing values
    skew = psych::skew(x,...),                # skewnesss   
    kurt.xs = psych::kurtosi(x,...),          # excess kurtosis
    out.ct = outliers(x)$numOutliers,         # outlier ct (box plot definition)
    out.per = 100* outliers(x)$numOutliers /  # percentage of outliers
      (length(x) - sum(is.na(x)))
    )
}

# summarize numeric variables 
num.desc <- apply(select_if(clm.all,is.numeric),2,num.summary, na.rm=T)

# remove unnecessary columns
num.desc2 <- num.desc[!rownames(num.desc) %in% 
                        c('rg','out.ct', 'out.per','n'), 
                      !colnames(num.desc) %in% 
                        c('PATBRTYR','CLAIMYR', 'CLAIMANT')]

num.desc2 <- round(num.desc2,0)
num.desc2 <- cbind(t(num.desc2),out.per=round(num.desc["out.per",],1))

# print descriptive summary for numeric variables
kable(num.desc2,format.args = list(big.mark=",")) %>% 
  kable_styling(bootstrap_options = c('striped','hover'), full_width=F,font_size=9) %>% 
  pack_rows('Hospital Claims', 1,3) %>%
  pack_rows('Physician Claims', 4,6) %>%
  pack_rows('Other Claims',7,9) %>% 
  pack_rows('Total Claims',10,12) %>% 
  pack_rows('Claims Related to Diagnosis',13,16) %>% 
  pack_rows('Claimant Attributes',17,17)
mean sd median min max NAs skew kurt.xs out.per
Hospital Claims
HOSCVCHG 2,524 11,072 491 -27,630 1,781,017 3,160,414 32 2,432 0.0
HOSLWCHG 2,199 9,153 354 -27,630 839,633 3,714,064 20 776 0.0
HOSPDCHG 1,944 9,803 331 -6,200 1,779,990 2,961,766 38 3,273 0.1
Physician Claims
PHYCVCHG 809 2,150 274 -3,402 611,751 1,788,144 35 4,961 11.8
PHYLWCHG 732 2,013 244 -3,248 611,751 3,171,662 46 9,080 12.8
PHYPDCHG 620 1,921 175 -3,338 610,263 1,441,761 42 6,711 12.0
Other Claims
OTHCVCHG 517 2,575 145 -10,206 439,793 2,631,444 81 9,801 11.9
OTHLWCHG 350 1,160 79 -6,969 118,741 3,861,376 25 1,382 11.9
OTHPDCHG 443 2,433 100 -6,969 439,137 2,301,455 83 10,443 12.7
Total Claims
TOTCVCHG 2,186 9,090 493 -27,021 1,840,372 1,664,560 34 2,935 11.0
TOTLWCHG 1,826 8,350 353 -33,226 2,568,512 1,896,167 39 5,296 12.1
TOTPDCHG 1,646 8,212 298 0 2,568,512 0 44 5,509 11.8
Claims Related to Diagnosis
DIAG1CHG 1,101 6,278 187 0 2,568,512 0 60 11,400 12.7
DIAG2CHG 366 1,718 87 -114,835 456,940 1,158,758 44 4,668 13.1
DIAG3CHG 203 825 60 -9,686 228,159 1,912,542 40 4,466 13.2
DGCATCHG 1,249 6,823 206 0 2,568,512 0 54 9,054 13.3
Claimant Attributes
AGE 32 19 33 -21 159 0 0 -1 13.0
clm.lng <- select_if(clm.all, is.numeric)[c(4:19)]
clm.lng <- cbind(YR=as.character(clm.all$CLAIMYR),clm.lng) %>% 
  gather(key="type","amt", 2:17) %>% 
  mutate(logamt = ifelse(amt < 1,NA, log(amt)))

# plot numeric dollar variables 
ggplot(clm.lng, aes(x=logamt, fill=YR)) + 
  geom_density(aes(y=..density..),alpha=0.3) + theme_bw() + 
  facet_wrap(~ type, ncol=4)

ggsave('dollar_density.png')

rm(clm.lng)
Density Plot of Claimant Amounts

Density Plot of Claimant Amounts

# plot of age
ggplot(clm.all, aes(AGE, fill=as.character(CLAIMYR))) +
  geom_density(aes(y=..density..),alpha=0.3) + theme_bw() + 
  scale_fill_discrete(name = "YR")

Categorical Variables

Sex

#Patient Sex percentages
comma(prop.table(table(clm.all$PATSEX, useNA='ifany')))

   F    M 
0.56 0.44 

PPO Indicator

# Plan Type PPO indicator percentages
comma(prop.table(table(clm.all$PPO,useNA='ifany')))

   N    Y 
0.52 0.48 

Exposure Record Indicator

# Enrollment Records Indicator, percentages?
comma(prop.table(table(clm.all$EXPOSMEM, useNA='ifany')))

   N    Y 
0.76 0.24 

Member Relation to Subscriber

# Patient Relation
comma(round(prop.table(table(clm.all$RELATION, useNA='ifany')),3))

   D    E    S    X 
0.32 0.48 0.20 0.00 
# Top High level Diagnosis Category
top.dx.cat <- data.frame(rcd.ct=prop.table(sort(table(clm.all$DGCAT), decreasing=T))*100)
names(top.dx.cat) <- c('dx.cat','% rcds')
comma(top.dx.cat)
                              dx.cat % rcds
1                            Unknown  14.50
2           Health Status or Service  13.81
3                 Respiratory System  10.10
4                 Injury & Poisoning   9.88
5  Symptoms & Ill-Defined Conditions   9.84
6           Skeleton & Muscle System   7.72
7               Genitourinary System   5.66
8                       Sense Organs   4.79
9                 Circulatory System   3.40
10                  Digestive System   3.33
11                    Skin Disorders   3.07
12   Mental Disorders, Drug, Alcohol   2.74
13               Malignant Neoplasms   2.65
14   Endocrine & Metabolic Disorders   2.32
15            Pregnancy & Childbirth   2.28
16    Infectious & Parasitic Disease   2.22
17                    Nervous System   0.90
18            Congenital & Perinatal   0.54
19           Blood Related Disorders   0.26

Diagnosis Codes

Total unique codes in dataset:

# find unique diagnosis codes
dx1 <- as.character(clm.all$DIAG1)
dx2 <- as.character(clm.all$DIAG2)
dx3 <- as.character(clm.all$DIAG3)
dx.all <- c(dx1,dx2,dx3)
dx.all.unq <- unique(dx.all)

# number of unique dx code
length(dx.all.unq)
[1] 1132

Diagnosis Code Cleanup

# possible unknown dx codes
datatable(data.frame(sort(table(dx.all[!is_defined(dx.all)]), decreasing=T)),colnames = c('dx','ct'), options = list(
 columnDefs = list(list(className = 'dt-right', targets = 0:1)) ),rownames=F) %>% 
  formatCurrency('Freq',currency ='', mark=",", digits=0)
# helper function cleanup diagnosis codes
clean_icd9 <- function(x){
  # clean up decimal point issues
  if (str_detect(x,'^\\d{2}\\.$')){
    x <- paste0("0",substr(x,1,2))
  # remap expired hiv codes
  } else if (x %in% c('044','043')) {
    x <- '042'
  # standardize no diagnosis coding
  } else if (x %in% c('', '.', '000')){
    x <- '000'
  }
  # redefine unmapped codes as "UNK"; standalone if statement
  if (!is_defined(x) & !str_detect(x,"\\.|E\\d{2}|Z79|000|^\\s*$")){
     x <- 'UNK'
  }
  x
}
# mapping of original to clean dx mappings; retain changes only
dx.map <- sapply(dx.all.unq, clean_icd9)
dx.map <- data.frame(orig.dx = dx.all.unq, new.dx = dx.map, stringsAsFactors = F)
dx.map <- dx.map[dx.map$orig.dx != dx.map$new.dx,]

dx1.cln <- ifelse(dx1 %in% dx.map$orig.dx, dx.map$new.dx[match(dx1,dx.map$orig.dx)], dx1)
dx2.cln <- ifelse(dx2 %in% dx.map$orig.dx, dx.map$new.dx[match(dx2,dx.map$orig.dx)], dx2)
dx3.cln <- ifelse(dx3 %in% dx.map$orig.dx, dx.map$new.dx[match(dx3,dx.map$orig.dx)], dx3)
dx.all.cln <- c(dx1.cln,dx2.cln,dx3.cln)
dx.all.cln.unq <- unique(dx.all.cln)

# fix original claim file dx
clm.all$DIAG1.CLN <- as.factor(dx1.cln)
clm.all$DIAG2.CLN <- as.factor(dx2.cln)
clm.all$DIAG3.CLN <- as.factor(dx3.cln)

Scrubbed Diagnosis Codes

Number of unique, scrubbed diagnosis codes:

# length clean dx
length(dx.all.cln.unq)
[1] 1024

Here is a breakdown of record counts by diagnosis type:

# all dx type
dx.type <- ifelse(!(dx.all.cln %in% c('UNK','000')),'valid_cd',dx.all.cln)

# % record counts by dx type
round(prop.table(table(dx.type)),3)
dx.type
     000      UNK valid_cd 
   0.029    0.002    0.969 

Bivarate Exploration

Prior Year Attributes vs. Claims in Current Year

The goal for this project is to make cost projections for members in a subsequent period based on attributes and cost history in the current period.

Therefore, we should look at some bivariate relationship between member attributes and subsequent costs. For the charts below, we use insure paid amounts as our cost amount in the later period.

# claimants in both 97 and 98
clmt.9798 <- inner_join(clmts97,clmts98, by="x")

# claims data for claimants in both 97 and 98 
clms.9798 <- inner_join(clm.all,clmt.9798, by = c("CLAIMANT"="x"))
clms.9798 <- clms.9798[clms.9798$CLAIMYR %in% c(1997,1998),]

# claimants in both years, 98 paid totals
pd.98 <- clms.9798 %>% 
  filter(CLAIMYR == 1998) %>% 
  select(CLAIMANT,TOTPDCHG)
names(pd.98) <-c('CLMT98','TOTPDCHG98')

# 97 all attributres, 98 paid only
clms.9798 <- clms.9798 %>% 
  inner_join(pd.98, by=c("CLAIMANT"="CLMT98")) %>% 
  filter(CLAIMYR == 1997)

Future Costs Vs. Categorical Variables

# relationship relationship to pd98
p1 <- ggplot(clms.9798, aes(RELATION, log(TOTPDCHG98))) + geom_boxplot() + theme_bw()

# relationship of sex to pd98
p2 <- ggplot(clms.9798, aes(PATSEX, log(TOTPDCHG98))) + geom_boxplot() + theme_bw()

# relationship of ppo plan to pd98
p3 <- ggplot(clms.9798, aes(PPO, log(TOTPDCHG98))) + geom_boxplot() + theme_bw()

# relationship of ppo plan to pd98
p4 <- ggplot(clms.9798, aes(EXPOSMEM, log(TOTPDCHG98))) + geom_boxplot() + theme_bw()

# major diagnosis category to pd98
p5 <- ggplot(clms.9798, aes(substr(DGCAT,1,2), log(TOTPDCHG98))) + geom_boxplot() + theme_bw()

grid.arrange(p1,p2,p3,p4,p5, ncol=2)

Future Costs Vs. Numeric Variables

# compare dollar attributes in 97 to pd 98
clms.9798.num <- select_if(clms.9798, is.numeric)
clms.9798.num <- clms.9798.num[,c(4:19,21)]

clms.9798.num <- clms.9798.num %>% 
  gather('clmtype','amt',1:16) %>%
  mutate(logamt = ifelse(amt < 1,NA, log(amt))) %>% 
  mutate(logpd98 = ifelse(TOTPDCHG98 < 1,NA, log(TOTPDCHG98)))

ggplot(clms.9798.num, aes(x=logamt,y=logpd98)) + geom_point(alpha=0.3) + theme_bw() + facet_wrap(~ clmtype, ncol=4)

ggsave('num97_num98.png')
rm(clms.9798.num)
Scatter Plots of 97 Amts vs. 98 Pd

Scatter Plots of 97 Amts vs. 98 Pd

Future Paid vs. Age

# ages vs 98 paid
ggplot(clms.9798,aes(AGE,log(TOTPDCHG98))) + geom_point(alpha=0.3) + theme_bw() 

ggsave('age_pd98.png')
Scatter Plot of Age vs. 98 Pd

Scatter Plot of Age vs. 98 Pd

Avg Future Paid vs. Average Age

#avg age vs 98 paid
clms.9798 %>% 
  group_by(AGE) %>% 
  summarise(avg_cost=mean(TOTPDCHG98)) %>% 
  ggplot(aes(AGE,avg_cost)) + geom_point() + theme_bw()

Correlation Plots of Numeric Variables

# correlation plots
clms.9798.num <- select_if(clms.9798, is.numeric)
clms.9798.num <- clms.9798.num[,4:21]

# correlation of charges
mycorr <- rcorr(as.matrix(clms.9798.num))
corrplot(mycorr$r,type='upper',order = "hclust", 
         tl.col = "black", tl.srt = 90, tl.cex=0.5)

Feature Engineering of Diagnoses

The R package, icd, allows us to examine claimant mortality risk using what is known as a Charlson Risk Score, which is calculated using claimant diagnoses. This information could provide useful way to effectively reduce the dimensionality of the data set.

# all 3 diagnoses for individuals in 97
clm.dx97 <- clm.all[clm.all$CLAIMYR == 1997,] %>% 
  select(CLAIMANT, DIAG1.CLN,DIAG2.CLN, DIAG3.CLN) %>% 
  gather("dxpos","dx",2:4)

# charlson risk scores
c_score <- charlson(x = clm.dx97, visit_name = "CLAIMANT",
                  icd_name = "dx")
table(c_score)
c_score
      0       1       2       3       4       5       6       7       8 
1149600   67463   20567    1024      42       3    2620      85      33 
     12 
      1 

The icd package also allows us bucket members into high-level comorbidity categories based on claimant diagnoses:

Here is an example:

# comorbidity groupings
ahrq_comborbid <- comorbid_ahrq(clm.dx97,
                                visit_name = "CLAIMANT",
                                icd_name = "dx")
# example output comorbidity groupings
head(ahrq_comborbid[,1:5],20)
     CHF Valvular  PHTN   PVD   HTN
1  FALSE    FALSE FALSE FALSE FALSE
2  FALSE    FALSE FALSE FALSE FALSE
3  FALSE    FALSE FALSE FALSE FALSE
4  FALSE    FALSE FALSE FALSE FALSE
5  FALSE    FALSE FALSE FALSE FALSE
6  FALSE    FALSE FALSE FALSE FALSE
7  FALSE    FALSE FALSE  TRUE FALSE
8  FALSE    FALSE FALSE  TRUE FALSE
9  FALSE    FALSE FALSE FALSE FALSE
10  TRUE    FALSE FALSE FALSE FALSE
11 FALSE    FALSE FALSE FALSE FALSE
12 FALSE    FALSE FALSE FALSE FALSE
13  TRUE    FALSE FALSE FALSE FALSE
14 FALSE    FALSE FALSE FALSE FALSE
15 FALSE    FALSE FALSE FALSE FALSE
16 FALSE    FALSE FALSE FALSE FALSE
17 FALSE    FALSE FALSE FALSE FALSE
18 FALSE    FALSE FALSE FALSE FALSE
19 FALSE    FALSE FALSE FALSE FALSE
20 FALSE    FALSE FALSE FALSE FALSE