SOA Large Claims Study

# set working directory
setwd('C:/Users/agrza/Google Drive/Documents/cuny/Data698/SOALargeClaims')

Project Overview

For my Data 698 project, I will build models that project health insurance claimants spending, based on claimant attributes and historical spend data.

I hope to predict one or more of the following items:

whether a claimant will have high cost (e.g. $>25,000) in subsequent period based on current period attributures.
a quantile of health insurance spend for a claimant relative to his/her peer group.
a prospective risk score that projects health spend relative to average spend.
projected claimant health spend, in dollars.

Initial Setup

First, load all libraries relevant to the analysis.

# common helper functions
#   function for pretty formatting number output
comma <- function(x) noquote(format(x, digits = 2, big.mark = ","))

Data Retrieval

The dataset for this project is located on the Society of Actuaries’ website. The data was collected as part of health inurance claims research study. The original study was completed in the mid 2000s and focused on high cost claimants.

Download Files

# urls to files
clm97 <- 'https://www.soa.org/Files/Research/1997.zip'
clm98 <- 'https://www.soa.org/Files/Research/1998.zip'
clm99 <- 'https://www.soa.org/Files/Research/1999.zip'
all.clms <- c(clm97, clm98, clm99)

# download zip files
zip.files <- NULL
for (i in 1:length(all.clms)){
  zip.files <- c(zip.files,str_extract(all.clms[i], '(?<=/)\\d{4}\\.zip$'))
  download.file(all.clms[i], zip.files[i])
}

Read Files

The script below unzips the three separate claim files, and combines the information together in a single data frame.

# read claim files into R as one large data frame
clm.all <- data.frame() 
for (i in 1:length(zip.files)){
  clm.all <- rbind(clm.all,read.delim(unz(zip.files[i],unzip(zip.files[i],list=T)$Name), sep=","))
}
# save df to RData file 
save(clm.all, file = "clm.RData")
rm(clm.all)

Load Data

Going forward, we will refer to the clm.all data frame object saved in the clm.RData file.

# read in clm.all data frame from RData file
load("clm.RData")

Data Documentation

The following links provide useful information concerning data collection and wrangling, as well as a data dictionary for the claim files:

The researchers for the study collected three years of claims data from 11 separate health insurers. Due to data quality issues, the researchers excluded the submissions from four of the insurers.

The researchers also performed a significant amount of data wrangling before publishing the public claims file:

aggregating wildly varying data sources into one common format
limiting published fields to those deemed to be of sufficient quality
creating a unique member identifer using a variety of different feilds from the insurers’ raw data submissions
imputing missing values where feasible

Data Exploration

Variable Overview

# load variable descriptions
var.desc <- read.csv('var_description.csv')

# print variables and types
class.df <- data.frame(var=names(clm.all), type=sapply(clm.all, class))
class.df <- merge(var.desc, class.df, by.x='Field',by.y='var')
class.df <- data.frame(class.df[order(class.df$type, class.df$Desc),], row.names=NULL)
kable(class.df)  %>%
  kable_styling(bootstrap_options = c('striped','hover'), full_width=F, font_size=9)

Field	Desc	type
RELATION	Claimant Relationship to Policyholder (e.g. self, spouse,dependent)	factor
DGCAT	Diagnosis Category Associated with Highest Paid Claims	factor
PATSEX	Gender of Claimant	factor
EXPOSMEM	Member enrollment information available	factor
PPO	Plan Type (PPO vs. Other (e.g. HMO, POC, Indemnity, etc.)	factor
DIAG1	Primary Diagnosis	factor
DIAG2	Secondary Diagnosis	factor
DIAG3	Tertiary Diagnosis	factor
CLAIMYR	Year associated with Date of Service	integer
PATBRTYR	Year of Birth for Claimant	integer
HOSLWCHG	Hospital Allowed Amount	numeric
HOSCVCHG	Hospital Charge Amount for Covered Services	numeric
HOSPDCHG	Hospital Paid Amount	numeric
OTHLWCHG	Other Allowed Amount	numeric
OTHCVCHG	Other Covered Charges	numeric
OTHPDCHG	Other Paid Amount	numeric
DIAG1CHG	Paid Amount for Primary Diagnosis	numeric
DIAG2CHG	Paid Amount for Secondary Diagnosis	numeric
DIAG3CHG	Paid Amount for Tertiary Diagnosis	numeric
PHYLWCHG	Physician Allowed Amount for Covered Services	numeric
PHYCVCHG	Physician Charge Amount for Covered Services	numeric
PHYPDCHG	Physician Paid Amount for Covered Services	numeric
DGCATCHG	Primary Diagnosis Category	numeric
TOTLWCHG	Total Allowed Amount	numeric
TOTCVCHG	Total Charged Amount for Covered Services	numeric
TOTPDCHG	Total Paid Amount	numeric
CLAIMANT	Unique Claimant Number	numeric

# summary of variable types
class.df %>% 
  group_by(type) %>% 
  summarise(ct=n()) %>% 
  data.frame()

     type ct
1  factor  8
2 integer  2
3 numeric 17

Multiple, traditional variables that are relevant to health claim prediction but are excluded in the public release data set. These missing variables include:

member geography (e.g. zipcode, county, state)
claimant plan benefits (e.g. deductible, copay)
Subscriber’s industry (eg. manufacturing, financial services)

The public release dataset also does excludes a member enrollment file that details the dates that members were active on a health insurance plan. Without this information, I will be unable to identify active members that do not have any health insurance claims.

Any model predictions for this project will be limited to members who have claims in at least two successive incurred periods. In other words, the models in this project will likely not generalize well to a full insured population.

Record Counts

Number of Records, all Years

# dimensions 
comma(c(records=nrow(clm.all), fields=ncol(clm.all)))

  records    fields 
4,294,030        27

Number of Claimants, Each Year

# record count by year
comma(table(clm.all$CLAIMYR))


     1997      1998      1999 
1,241,438 1,460,854 1,591,738

Unique Claimant Counts

# unique claimants
comma(length(unique(clm.all$CLAIMANT)))

[1] 2,700,164

# unique claimants by year
clmts97 <- data.frame(x=unique(clm.all$CLAIMANT[clm.all$CLAIMYR == 1997]))
clmts98 <- data.frame(x=unique(clm.all$CLAIMANT[clm.all$CLAIMYR == 1998]))
clmts99 <- data.frame(x=unique(clm.all$CLAIMANT[clm.all$CLAIMYR == 1999]))

Unique Claimants in both 1997 and 1998

# claimants in 97 and 98
comma(nrow(inner_join(clmts97,clmts98, by="x")))

[1] 736,814

Unique Claimants in both 1998 and 1999

# claimants in 98 and 99
comma(nrow(inner_join(clmts98,clmts99, by="x")))

[1] 810,943

Unique Claimants in All 3 Years

# claimants in 97, 98, and 99
comma(nrow(inner_join(inner_join(clmts97,clmts98, by="x"),clmts99, by="x")))

[1] 436,811

Initial Feature Engineering

The dataset provides clamaint birth year and the year claims were incurred. Using this information, we can approximate claimant age.

# create AGE variable
clm.all$AGE <- clm.all$CLAIMYR - clm.all$PATBRTYR

Numeric Variables

Many of the numeric variables are highly correlated. For instance, hospital charges are a subset of total charges; so these two variables should be highly correlated.

Also cost data is presented in three different (and highly correlated ways):

Provider Charge (see “xxxCVCHG” variables)
Allowed Charge - this represents the total amount the provider is paid (including insurer payment and member copays) (See “xxxLWCVG” variables).
Paid Amount - this represent the amount paid by the insurer (i.e. provider paid - member cost sharing). See the “xxxPDCHG” variables.

# helper function for displaying numeric summaries
num.summary <- function(x,...){
  c(mean=mean(x, ...),                        # mean
    sd=sd(x, ...),                            # standard deviation  
    median=median(x, ...),                    # median 
    min=min(x, ...),                          # minimum
    max=max(x,...),                           # maximum
    rg=max(x, ...) - min(x, ...),             # range
    n=length(x),                              # record count
    NAs = sum(is.na(x)),                      # number of missing values
    skew = psych::skew(x,...),                # skewnesss   
    kurt.xs = psych::kurtosi(x,...),          # excess kurtosis
    out.ct = outliers(x)$numOutliers,         # outlier ct (box plot definition)
    out.per = 100* outliers(x)$numOutliers /  # percentage of outliers
      (length(x) - sum(is.na(x)))
    )
}

# summarize numeric variables 
num.desc <- apply(select_if(clm.all,is.numeric),2,num.summary, na.rm=T)

# remove unnecessary columns
num.desc2 <- num.desc[!rownames(num.desc) %in% 
                        c('rg','out.ct', 'out.per','n'), 
                      !colnames(num.desc) %in% 
                        c('PATBRTYR','CLAIMYR', 'CLAIMANT')]

num.desc2 <- round(num.desc2,0)
num.desc2 <- cbind(t(num.desc2),out.per=round(num.desc["out.per",],1))

# print descriptive summary for numeric variables
kable(num.desc2,format.args = list(big.mark=",")) %>% 
  kable_styling(bootstrap_options = c('striped','hover'), full_width=F,font_size=9) %>% 
  pack_rows('Hospital Claims', 1,3) %>%
  pack_rows('Physician Claims', 4,6) %>%
  pack_rows('Other Claims',7,9) %>% 
  pack_rows('Total Claims',10,12) %>% 
  pack_rows('Claims Related to Diagnosis',13,16) %>% 
  pack_rows('Claimant Attributes',17,17)

	mean	sd	median	min	max	NAs	skew	kurt.xs	out.per
Hospital Claims
HOSCVCHG	2,524	11,072	491	-27,630	1,781,017	3,160,414	32	2,432	0.0
HOSLWCHG	2,199	9,153	354	-27,630	839,633	3,714,064	20	776	0.0
HOSPDCHG	1,944	9,803	331	-6,200	1,779,990	2,961,766	38	3,273	0.1
Physician Claims
PHYCVCHG	809	2,150	274	-3,402	611,751	1,788,144	35	4,961	11.8
PHYLWCHG	732	2,013	244	-3,248	611,751	3,171,662	46	9,080	12.8
PHYPDCHG	620	1,921	175	-3,338	610,263	1,441,761	42	6,711	12.0
Other Claims
OTHCVCHG	517	2,575	145	-10,206	439,793	2,631,444	81	9,801	11.9
OTHLWCHG	350	1,160	79	-6,969	118,741	3,861,376	25	1,382	11.9
OTHPDCHG	443	2,433	100	-6,969	439,137	2,301,455	83	10,443	12.7
Total Claims
TOTCVCHG	2,186	9,090	493	-27,021	1,840,372	1,664,560	34	2,935	11.0
TOTLWCHG	1,826	8,350	353	-33,226	2,568,512	1,896,167	39	5,296	12.1
TOTPDCHG	1,646	8,212	298	0	2,568,512	0	44	5,509	11.8
Claims Related to Diagnosis
DIAG1CHG	1,101	6,278	187	0	2,568,512	0	60	11,400	12.7
DIAG2CHG	366	1,718	87	-114,835	456,940	1,158,758	44	4,668	13.1
DIAG3CHG	203	825	60	-9,686	228,159	1,912,542	40	4,466	13.2
DGCATCHG	1,249	6,823	206	0	2,568,512	0	54	9,054	13.3
Claimant Attributes
AGE	32	19	33	-21	159	0	0	-1	13.0

clm.lng <- select_if(clm.all, is.numeric)[c(4:19)]
clm.lng <- cbind(YR=as.character(clm.all$CLAIMYR),clm.lng) %>% 
  gather(key="type","amt", 2:17) %>% 
  mutate(logamt = ifelse(amt < 1,NA, log(amt)))

# plot numeric dollar variables 
ggplot(clm.lng, aes(x=logamt, fill=YR)) + 
  geom_density(aes(y=..density..),alpha=0.3) + theme_bw() + 
  facet_wrap(~ type, ncol=4)

ggsave('dollar_density.png')

rm(clm.lng)

Density Plot of Claimant Amounts

# plot of age
ggplot(clm.all, aes(AGE, fill=as.character(CLAIMYR))) +
  geom_density(aes(y=..density..),alpha=0.3) + theme_bw() + 
  scale_fill_discrete(name = "YR")

Categorical Variables

Sex

#Patient Sex percentages
comma(prop.table(table(clm.all$PATSEX, useNA='ifany')))


   F    M 
0.56 0.44

PPO Indicator

# Plan Type PPO indicator percentages
comma(prop.table(table(clm.all$PPO,useNA='ifany')))


   N    Y 
0.52 0.48

Exposure Record Indicator

# Enrollment Records Indicator, percentages?
comma(prop.table(table(clm.all$EXPOSMEM, useNA='ifany')))


   N    Y 
0.76 0.24

Member Relation to Subscriber

# Patient Relation
comma(round(prop.table(table(clm.all$RELATION, useNA='ifany')),3))


   D    E    S    X 
0.32 0.48 0.20 0.00

# Top High level Diagnosis Category
top.dx.cat <- data.frame(rcd.ct=prop.table(sort(table(clm.all$DGCAT), decreasing=T))*100)
names(top.dx.cat) <- c('dx.cat','% rcds')
comma(top.dx.cat)

                              dx.cat % rcds
1                            Unknown  14.50
2           Health Status or Service  13.81
3                 Respiratory System  10.10
4                 Injury & Poisoning   9.88
5  Symptoms & Ill-Defined Conditions   9.84
6           Skeleton & Muscle System   7.72
7               Genitourinary System   5.66
8                       Sense Organs   4.79
9                 Circulatory System   3.40
10                  Digestive System   3.33
11                    Skin Disorders   3.07
12   Mental Disorders, Drug, Alcohol   2.74
13               Malignant Neoplasms   2.65
14   Endocrine & Metabolic Disorders   2.32
15            Pregnancy & Childbirth   2.28
16    Infectious & Parasitic Disease   2.22
17                    Nervous System   0.90
18            Congenital & Perinatal   0.54
19           Blood Related Disorders   0.26

Diagnosis Codes

Total unique codes in dataset:

# find unique diagnosis codes
dx1 <- as.character(clm.all$DIAG1)
dx2 <- as.character(clm.all$DIAG2)
dx3 <- as.character(clm.all$DIAG3)
dx.all <- c(dx1,dx2,dx3)
dx.all.unq <- unique(dx.all)

# number of unique dx code
length(dx.all.unq)

[1] 1132

Diagnosis Code Cleanup

# possible unknown dx codes
datatable(data.frame(sort(table(dx.all[!is_defined(dx.all)]), decreasing=T)),colnames = c('dx','ct'), options = list(
 columnDefs = list(list(className = 'dt-right', targets = 0:1)) ),rownames=F) %>% 
  formatCurrency('Freq',currency ='', mark=",", digits=0)

# helper function cleanup diagnosis codes
clean_icd9 <- function(x){
  # clean up decimal point issues
  if (str_detect(x,'^\\d{2}\\.$')){
    x <- paste0("0",substr(x,1,2))
  # remap expired hiv codes
  } else if (x %in% c('044','043')) {
    x <- '042'
  # standardize no diagnosis coding
  } else if (x %in% c('', '.', '000')){
    x <- '000'
  }
  # redefine unmapped codes as "UNK"; standalone if statement
  if (!is_defined(x) & !str_detect(x,"\\.|E\\d{2}|Z79|000|^\\s*$")){
     x <- 'UNK'
  }
  x
}
# mapping of original to clean dx mappings; retain changes only
dx.map <- sapply(dx.all.unq, clean_icd9)
dx.map <- data.frame(orig.dx = dx.all.unq, new.dx = dx.map, stringsAsFactors = F)
dx.map <- dx.map[dx.map$orig.dx != dx.map$new.dx,]

dx1.cln <- ifelse(dx1 %in% dx.map$orig.dx, dx.map$new.dx[match(dx1,dx.map$orig.dx)], dx1)
dx2.cln <- ifelse(dx2 %in% dx.map$orig.dx, dx.map$new.dx[match(dx2,dx.map$orig.dx)], dx2)
dx3.cln <- ifelse(dx3 %in% dx.map$orig.dx, dx.map$new.dx[match(dx3,dx.map$orig.dx)], dx3)
dx.all.cln <- c(dx1.cln,dx2.cln,dx3.cln)
dx.all.cln.unq <- unique(dx.all.cln)

# fix original claim file dx
clm.all$DIAG1.CLN <- as.factor(dx1.cln)
clm.all$DIAG2.CLN <- as.factor(dx2.cln)
clm.all$DIAG3.CLN <- as.factor(dx3.cln)

Scrubbed Diagnosis Codes

Number of unique, scrubbed diagnosis codes:

# length clean dx
length(dx.all.cln.unq)

[1] 1024

Here is a breakdown of record counts by diagnosis type:

No diagnosis listed (i.e. ‘000’)
Code listed but unknown (‘UNK’)
Valid code

# all dx type
dx.type <- ifelse(!(dx.all.cln %in% c('UNK','000')),'valid_cd',dx.all.cln)

# % record counts by dx type
round(prop.table(table(dx.type)),3)

dx.type
     000      UNK valid_cd 
   0.029    0.002    0.969

Bivarate Exploration

Prior Year Attributes vs. Claims in Current Year

The goal for this project is to make cost projections for members in a subsequent period based on attributes and cost history in the current period.

Therefore, we should look at some bivariate relationship between member attributes and subsequent costs. For the charts below, we use insure paid amounts as our cost amount in the later period.

# claimants in both 97 and 98
clmt.9798 <- inner_join(clmts97,clmts98, by="x")

# claims data for claimants in both 97 and 98 
clms.9798 <- inner_join(clm.all,clmt.9798, by = c("CLAIMANT"="x"))
clms.9798 <- clms.9798[clms.9798$CLAIMYR %in% c(1997,1998),]

# claimants in both years, 98 paid totals
pd.98 <- clms.9798 %>% 
  filter(CLAIMYR == 1998) %>% 
  select(CLAIMANT,TOTPDCHG)
names(pd.98) <-c('CLMT98','TOTPDCHG98')

# 97 all attributres, 98 paid only
clms.9798 <- clms.9798 %>% 
  inner_join(pd.98, by=c("CLAIMANT"="CLMT98")) %>% 
  filter(CLAIMYR == 1997)

Future Costs Vs. Categorical Variables

# relationship relationship to pd98
p1 <- ggplot(clms.9798, aes(RELATION, log(TOTPDCHG98))) + geom_boxplot() + theme_bw()

# relationship of sex to pd98
p2 <- ggplot(clms.9798, aes(PATSEX, log(TOTPDCHG98))) + geom_boxplot() + theme_bw()

# relationship of ppo plan to pd98
p3 <- ggplot(clms.9798, aes(PPO, log(TOTPDCHG98))) + geom_boxplot() + theme_bw()

# relationship of ppo plan to pd98
p4 <- ggplot(clms.9798, aes(EXPOSMEM, log(TOTPDCHG98))) + geom_boxplot() + theme_bw()

# major diagnosis category to pd98
p5 <- ggplot(clms.9798, aes(substr(DGCAT,1,2), log(TOTPDCHG98))) + geom_boxplot() + theme_bw()

grid.arrange(p1,p2,p3,p4,p5, ncol=2)

Future Costs Vs. Numeric Variables

# compare dollar attributes in 97 to pd 98
clms.9798.num <- select_if(clms.9798, is.numeric)
clms.9798.num <- clms.9798.num[,c(4:19,21)]

clms.9798.num <- clms.9798.num %>% 
  gather('clmtype','amt',1:16) %>%
  mutate(logamt = ifelse(amt < 1,NA, log(amt))) %>% 
  mutate(logpd98 = ifelse(TOTPDCHG98 < 1,NA, log(TOTPDCHG98)))

ggplot(clms.9798.num, aes(x=logamt,y=logpd98)) + geom_point(alpha=0.3) + theme_bw() + facet_wrap(~ clmtype, ncol=4)

ggsave('num97_num98.png')
rm(clms.9798.num)

Scatter Plots of 97 Amts vs. 98 Pd

Future Paid vs. Age

# ages vs 98 paid
ggplot(clms.9798,aes(AGE,log(TOTPDCHG98))) + geom_point(alpha=0.3) + theme_bw() 

ggsave('age_pd98.png')

Scatter Plot of Age vs. 98 Pd

Avg Future Paid vs. Average Age

#avg age vs 98 paid
clms.9798 %>% 
  group_by(AGE) %>% 
  summarise(avg_cost=mean(TOTPDCHG98)) %>% 
  ggplot(aes(AGE,avg_cost)) + geom_point() + theme_bw()

Correlation Plots of Numeric Variables

# correlation plots
clms.9798.num <- select_if(clms.9798, is.numeric)
clms.9798.num <- clms.9798.num[,4:21]

# correlation of charges
mycorr <- rcorr(as.matrix(clms.9798.num))
corrplot(mycorr$r,type='upper',order = "hclust", 
         tl.col = "black", tl.srt = 90, tl.cex=0.5)

Feature Engineering of Diagnoses

The R package, icd, allows us to examine claimant mortality risk using what is known as a Charlson Risk Score, which is calculated using claimant diagnoses. This information could provide useful way to effectively reduce the dimensionality of the data set.

# all 3 diagnoses for individuals in 97
clm.dx97 <- clm.all[clm.all$CLAIMYR == 1997,] %>% 
  select(CLAIMANT, DIAG1.CLN,DIAG2.CLN, DIAG3.CLN) %>% 
  gather("dxpos","dx",2:4)

# charlson risk scores
c_score <- charlson(x = clm.dx97, visit_name = "CLAIMANT",
                  icd_name = "dx")
table(c_score)

c_score
      0       1       2       3       4       5       6       7       8 
1149600   67463   20567    1024      42       3    2620      85      33 
     12 
      1

The icd package also allows us bucket members into high-level comorbidity categories based on claimant diagnoses:

Here is an example:

# comorbidity groupings
ahrq_comborbid <- comorbid_ahrq(clm.dx97,
                                visit_name = "CLAIMANT",
                                icd_name = "dx")
# example output comorbidity groupings
head(ahrq_comborbid[,1:5],20)

     CHF Valvular  PHTN   PVD   HTN
1  FALSE    FALSE FALSE FALSE FALSE
2  FALSE    FALSE FALSE FALSE FALSE
3  FALSE    FALSE FALSE FALSE FALSE
4  FALSE    FALSE FALSE FALSE FALSE
5  FALSE    FALSE FALSE FALSE FALSE
6  FALSE    FALSE FALSE FALSE FALSE
7  FALSE    FALSE FALSE  TRUE FALSE
8  FALSE    FALSE FALSE  TRUE FALSE
9  FALSE    FALSE FALSE FALSE FALSE
10  TRUE    FALSE FALSE FALSE FALSE
11 FALSE    FALSE FALSE FALSE FALSE
12 FALSE    FALSE FALSE FALSE FALSE
13  TRUE    FALSE FALSE FALSE FALSE
14 FALSE    FALSE FALSE FALSE FALSE
15 FALSE    FALSE FALSE FALSE FALSE
16 FALSE    FALSE FALSE FALSE FALSE
17 FALSE    FALSE FALSE FALSE FALSE
18 FALSE    FALSE FALSE FALSE FALSE
19 FALSE    FALSE FALSE FALSE FALSE
20 FALSE    FALSE FALSE FALSE FALSE

SOA Large Claims Study

Data Exploration

Aaron Grzasko

March 24, 2019

Project Overview

Initial Setup

Data Retrieval

Download Files

Read Files

Load Data

Data Documentation

Data Exploration

Variable Overview

Record Counts

Unique Claimant Counts

Initial Feature Engineering

Numeric Variables

Categorical Variables

Diagnosis Code Cleanup

Bivarate Exploration

Prior Year Attributes vs. Claims in Current Year

Correlation Plots of Numeric Variables

Feature Engineering of Diagnoses