How (a very very small percentage of) Americans Die

Disclaimer: The title of this post was shamelessly “borrowed” from a recent visualization done by Matt Klein of Bloomberg, How Americans Die. I strongly recommend everyone to click through the site.

Goal of the Project

I'm an actuary for a life insurance company, so I thought I would stick close to my root and look at some insurance/mortality data. More specifically, these are the three questions that I want to answer:

Who buys life insurance?
How much life insurance do people with different backgrounds purchase?
How do people die?

Data

One of the reasons that I want to look at insurance data is because I'm surrounded by them. For this project, I obtained a clean text file from a coworker. The dataset contains information about 3.6 million life insurance policies, such as the face amount, the type of insurance purchased, the date of birth, date of death, etc. The dataset was stripped of all personal information, such as name and address (although zip codes were available). The only identifier is a 20 characters string of number. For privacy and practical purposes, I also used the sample function in R to randomly select only 1% of the available data. Eventually, I end up with a 36137 by 38 table.

Exploring the Data

`Who Buys Life Insurance`

The first question that I want to get an answer from the dataset is: what type of people purchases life insurance? To do that, I looked at the age of the policyholders when they purchased the insurance (issue age) grouped by male and female. And this is what I found:

require(ggplot2)

## Loading required package: ggplot2

mimsa.1pct <- read.table("MIMSA_ONEPCT.txt", header = TRUE)
attach(mimsa.1pct)
issue.year <- as.numeric(format(as.Date(REG_DATE, "%Y%m%d"), format = "%Y"))
birth.year <- as.numeric(format(as.Date(as.character(DOB), "%Y%m%d"), format = "%Y"))
issue.age <- issue.year - birth.year
curr.age <- 2014 - birth.year
ggplot(mimsa.1pct, aes(x = issue.age, fill = GENDER)) + geom_histogram(binwidth = 2) + 
    ggtitle("Distribution of Age When Purchased Insurance Policy")

plot of chunk ggplot2ex

The mean and median issue ages are both 35, the purchase rate increases steadily from age 18 to around 40. And the largest age group is 30-40 year old. This is not very surprising, because as people began to settle down and start a family, the need to financially protect their family increased.

Another pattern that jumps off the graph is men purchases as least twice as many life insurance policies as women across most age groups. This is a more interesting finding, I didn't expect the different to be so large. I would think with all the women's right movement in the past decades, the gap would be narrower. I'm interested in correlating the insurance ownership with income, but that's like another project.

The last interesting observation is that many family purchases life insurance for their new born babies, and the rate is about even for boys and girls. I suspect this is due to tax benefits and the ability to put pre-tax dollars into an college fund.

`How Much Life Insurance Do People Buy`

The next question I want to investigate is once people decided to purchase life insurance, how much do they buy? I answered this by plotting the kernel density of all the face amounts:

qplot(log10(FACE_AMT), data = mimsa.1pct, geom = "density", fill = GENDER, alpha = I(0.5), 
    main = "Face Amount by Gender", xlab = "Face Amount (10^x dollars)", ylab = "Density")

plot of chunk ggplot2ex2

Looks like many people purchase $100,000 of life insurance, the left tail is very long due to many older policies that have very low face amount. And within the right tail, where people purchase $1 million or more, we see that not only do men buy more life insurance, they buy higher amounts as well.

I then looked at the median face amount purchased through the years, and unsurprisingly, the face amount of the policies are trending higher from 1980 to 2011. There might be a few explanations for this, such as inflation, introduction of new products, and people generally have more assets to protect against.

fa.issyr <- as.data.frame(cbind(issue.year, mimsa.1pct[, c("FACE_AMT", "GENDER")]))
agg <- aggregate(fa.issyr$FACE_AMT, by = list(fa.issyr$issue.year, fa.issyr$GENDER), 
    FUN = median)
colnames(agg) <- c("issue.year", "gender", "median.face")
ggplot(agg, aes(x = issue.year, y = median.face, color = gender)) + geom_point() + 
    geom_smooth() + xlim(c(1980, 2011))

## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

## Warning: Removed 57 rows containing missing values (stat_smooth).
## Warning: Removed 56 rows containing missing values (stat_smooth).
## Warning: Removed 113 rows containing missing values (geom_point).

plot of chunk ggplot2ex3

`How Do People Die`

The part that interest me the most about the dataset is the mortality experience of the policy holders (ie how did they die). Among the 36137 policies, about 32000 are still alive, I looked at the remaining 4000 policies. After some data manipulation, I created the stack area chart below:

library(reshape2)
term.year <- as.numeric(format(as.Date(TRMN_EFF_DATE, "%Y%m%d"), format = "%Y"))
cy <- as.data.frame(cbind(term.year, as.character(COD)))  ###COD and termination Year 
cy <- as.data.frame(table(cy$V2, cy$term.year))
colnames(cy) <- c("COD", "Year", "Freq")
cy$group <- rep(c("nonmed", "med", "med", "med", "med", "med", "med", "med", 
    "nonmed", "med", "other", "med", "nonmed", "other", "other", "med", "med", 
    "nonmed"), 12)
cy <- subset(cy, COD != "NNN" & COD != "LVG")
b <- with(cy, tapply(Freq, list(COD, Year), sum), na.rm = TRUE)
b <- na.omit(b)
pct <- prop.table(b, 2)
pct <- melt(pct, id = c("Year"))
colnames(pct) <- c("COD", "Year", "pct")
cfp <- merge(cy, pct)  # COD with frequency and pencentages
cfp$pct <- round(cfp$pct, 3)
ggplot(cy, aes(x = Year, y = Freq, group = COD, fill = COD)) + geom_area(position = "fill") + 
    ggtitle("Stacked Area Chart for All Causes of Death") + xlab("Year of Death") + 
    ylab("Percentage")

plot of chunk ggplot2ex4

A stacked area chart, for those who's not familiar, shows the change in proportion over time. A famous example is the Baby Name Voyager. The dataset I used grouped the causes of death (COD) into 16 different buckets with varies combination of ICD 10 codes. Sum of these COD will be constant at 1, but the probably of each COD will change from year to year. The Chart above showed that the three largest COD are Other, Cancer and Cardiovascular diseases. It's hard to define “Other”, since it's composed of many COD beside the major ones. It could also come from data errors, when some doctors might forget to put down a cause of death. To drill a little deeper into each COD, I grouped them into medical related death and non medical related death. And they are shown below:

# medical related death
cfpm <- cfp[cfp$group == "med", ]
ggplot(cfpm, aes(x = Year, y = pct, group = COD)) + geom_area(aes(color = COD, 
    fill = COD), position = "stack") + ylim(c(0, 1)) + ggtitle("Stacked Area Chart for medically related Deaths") + 
    ylab("Percentage")

plot of chunk ggplot2ex5

# non medical related death
cfpnm <- cfp[cfp$group != "med", ]
ggplot(cfpnm, aes(x = Year, y = pct, group = COD)) + geom_area(aes(color = COD, 
    fill = COD), position = "stack") + ylim(c(0, 1)) + ggtitle("Stacked Area Chart for non-medically related Deaths") + 
    ylab("Percentage")

plot of chunk ggplot2ex6

By separating out medical and non medical, we can see that medical related deaths, especially heart diseases and cancer,have been decreasing over the year, while the “Other” category keeps expanding. This is mostly consistent with the Bloomberg visualization. The Bloomberg visualization also provides some hints about what's included in “Other”, as it pointed out that drug related death and suicide have been on the rise.

And finally, I graph the top 5 causes of death during the period (2000 to 2011) below:

library(plyr)
cfp.order <- cfp[with(cfp, order(Year, -pct)), ]
cfp.top5 <- ddply(cfp.order, .(Year), function(x) print(head(x, 5)))

##   COD Year Freq group   pct
## 1 CAR 2000   58   med 0.309
## 2 CAN 2000   34   med 0.181
## 3 OOO 2000   33 other 0.176
## 4 RES 2000   24   med 0.128
## 5 STR 2000   14   med 0.074
##   COD Year Freq group   pct
## 1 CAR 2001   76   med 0.317
## 2 CAN 2001   61   med 0.254
## 3 OOO 2001   41 other 0.171
## 4 RES 2001   24   med 0.100
## 5 STR 2001    9   med 0.038
##   COD Year Freq group   pct
## 1 CAR 2002  121   med 0.328
## 2 CAN 2002   66   med 0.179
## 3 OOO 2002   66 other 0.179
## 4 RES 2002   38   med 0.103
## 5 STR 2002   18   med 0.049
##   COD Year Freq group   pct
## 1 CAR 2003  111   med 0.304
## 2 CAN 2003   67   med 0.184
## 3 OOO 2003   64 other 0.175
## 4 RES 2003   39   med 0.107
## 5 STR 2003   20   med 0.055
##   COD Year Freq group   pct
## 1 CAR 2004   90   med 0.288
## 2 OOO 2004   79 other 0.253
## 3 CAN 2004   62   med 0.199
## 4 RES 2004   28   med 0.090
## 5 STR 2004   15   med 0.048
##   COD Year Freq group   pct
## 1 CAR 2005  107   med 0.327
## 2 OOO 2005   63 other 0.193
## 3 CAN 2005   60   med 0.183
## 4 RES 2005   33   med 0.101
## 5 MEN 2005   22   med 0.067
##   COD Year Freq group   pct
## 1 CAR 2006  117   med 0.328
## 2 CAN 2006   73   med 0.204
## 3 OOO 2006   60 other 0.168
## 4 RES 2006   35   med 0.098
## 5 MEN 2006   28   med 0.078
##   COD Year Freq group   pct
## 1 CAR 2007  114   med 0.322
## 2 OOO 2007   69 other 0.195
## 3 CAN 2007   68   med 0.192
## 4 RES 2007   26   med 0.073
## 5 MEN 2007   18   med 0.051
##   COD Year Freq group   pct
## 1 CAR 2008   97   med 0.282
## 2 OOO 2008   85 other 0.247
## 3 CAN 2008   76   med 0.221
## 4 RES 2008   23   med 0.067
## 5 MEN 2008   14   med 0.041
##   COD Year Freq group   pct
## 1 OOO 2009   96 other 0.309
## 2 CAR 2009   83   med 0.267
## 3 CAN 2009   54   med 0.174
## 4 RES 2009   26   med 0.084
## 5 MEN 2009   14   med 0.045
##   COD Year Freq group   pct
## 1 OOO 2010  119 other 0.395
## 2 CAR 2010   75   med 0.249
## 3 CAN 2010   43   med 0.143
## 4 RES 2010   14   med 0.047
## 5 MEN 2010   12   med 0.040
##   COD Year Freq group   pct
## 1 OOO 2011  118 other 0.392
## 2 CAR 2011   66   med 0.219
## 3 CAN 2011   43   med 0.143
## 4 RES 2011   26   med 0.086
## 5 MEN 2011   14   med 0.047

ggplot(cfp.top5, aes(x = Year, y = pct, fill = COD)) + geom_point(size = 3, 
    aes(shape = COD, color = COD)) + geom_line(size = 1, aes(group = COD, color = COD)) + 
    ylab("Percentage of each COD") + xlab("Year") + ggtitle("Top Five Causes of Death Over The Last Ten Years")

plot of chunk ggplot2ex7

Summary of Analysis

So we did I learn? I learned that a typical life insurance buyer is a 40-year-old man, who will buy $500,000 of life insurance, and will likely die due to “other” cause of death.

More Works to be Done

There are a few items I hope to come back to in the future:

I hope to perform further analysis on the location data(zip code)
I hope to combine these analysis with some open demographic data