Disclaimer: The title of this post was shamelessly “borrowed” from a recent visualization done by Matt Klein of Bloomberg, How Americans Die. I strongly recommend everyone to click through the site.
I'm an actuary for a life insurance company, so I thought I would stick close to my root and look at some insurance/mortality data. More specifically, these are the three questions that I want to answer:
One of the reasons that I want to look at insurance data is because I'm surrounded by them. For this project, I obtained a clean text file from a coworker. The dataset contains information about 3.6 million life insurance policies, such as the face amount, the type of insurance purchased, the date of birth, date of death, etc. The dataset was stripped of all personal information, such as name and address (although zip codes were available). The only identifier is a 20 characters string of number. For privacy and practical purposes, I also used the sample function in R to randomly select only 1% of the available data. Eventually, I end up with a 36137 by 38 table.
Who Buys Life InsuranceThe first question that I want to get an answer from the dataset is: what type of people purchases life insurance? To do that, I looked at the age of the policyholders when they purchased the insurance (issue age) grouped by male and female. And this is what I found:
require(ggplot2)
## Loading required package: ggplot2
mimsa.1pct <- read.table("MIMSA_ONEPCT.txt", header = TRUE)
attach(mimsa.1pct)
issue.year <- as.numeric(format(as.Date(REG_DATE, "%Y%m%d"), format = "%Y"))
birth.year <- as.numeric(format(as.Date(as.character(DOB), "%Y%m%d"), format = "%Y"))
issue.age <- issue.year - birth.year
curr.age <- 2014 - birth.year
ggplot(mimsa.1pct, aes(x = issue.age, fill = GENDER)) + geom_histogram(binwidth = 2) +
ggtitle("Distribution of Age When Purchased Insurance Policy")
The mean and median issue ages are both 35, the purchase rate increases steadily from age 18 to around 40. And the largest age group is 30-40 year old. This is not very surprising, because as people began to settle down and start a family, the need to financially protect their family increased.
Another pattern that jumps off the graph is men purchases as least twice as many life insurance policies as women across most age groups. This is a more interesting finding, I didn't expect the different to be so large. I would think with all the women's right movement in the past decades, the gap would be narrower. I'm interested in correlating the insurance ownership with income, but that's like another project.
The last interesting observation is that many family purchases life insurance for their new born babies, and the rate is about even for boys and girls. I suspect this is due to tax benefits and the ability to put pre-tax dollars into an college fund.
How Much Life Insurance Do People BuyThe next question I want to investigate is once people decided to purchase life insurance, how much do they buy? I answered this by plotting the kernel density of all the face amounts:
qplot(log10(FACE_AMT), data = mimsa.1pct, geom = "density", fill = GENDER, alpha = I(0.5),
main = "Face Amount by Gender", xlab = "Face Amount (10^x dollars)", ylab = "Density")
Looks like many people purchase $100,000 of life insurance, the left tail is very long due to many older policies that have very low face amount. And within the right tail, where people purchase $1 million or more, we see that not only do men buy more life insurance, they buy higher amounts as well.
I then looked at the median face amount purchased through the years, and unsurprisingly, the face amount of the policies are trending higher from 1980 to 2011. There might be a few explanations for this, such as inflation, introduction of new products, and people generally have more assets to protect against.
fa.issyr <- as.data.frame(cbind(issue.year, mimsa.1pct[, c("FACE_AMT", "GENDER")]))
agg <- aggregate(fa.issyr$FACE_AMT, by = list(fa.issyr$issue.year, fa.issyr$GENDER),
FUN = median)
colnames(agg) <- c("issue.year", "gender", "median.face")
ggplot(agg, aes(x = issue.year, y = median.face, color = gender)) + geom_point() +
geom_smooth() + xlim(c(1980, 2011))
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
## Warning: Removed 57 rows containing missing values (stat_smooth).
## Warning: Removed 56 rows containing missing values (stat_smooth).
## Warning: Removed 113 rows containing missing values (geom_point).
How Do People DieThe part that interest me the most about the dataset is the mortality experience of the policy holders (ie how did they die). Among the 36137 policies, about 32000 are still alive, I looked at the remaining 4000 policies. After some data manipulation, I created the stack area chart below:
library(reshape2)
term.year <- as.numeric(format(as.Date(TRMN_EFF_DATE, "%Y%m%d"), format = "%Y"))
cy <- as.data.frame(cbind(term.year, as.character(COD))) ###COD and termination Year
cy <- as.data.frame(table(cy$V2, cy$term.year))
colnames(cy) <- c("COD", "Year", "Freq")
cy$group <- rep(c("nonmed", "med", "med", "med", "med", "med", "med", "med",
"nonmed", "med", "other", "med", "nonmed", "other", "other", "med", "med",
"nonmed"), 12)
cy <- subset(cy, COD != "NNN" & COD != "LVG")
b <- with(cy, tapply(Freq, list(COD, Year), sum), na.rm = TRUE)
b <- na.omit(b)
pct <- prop.table(b, 2)
pct <- melt(pct, id = c("Year"))
colnames(pct) <- c("COD", "Year", "pct")
cfp <- merge(cy, pct) # COD with frequency and pencentages
cfp$pct <- round(cfp$pct, 3)
ggplot(cy, aes(x = Year, y = Freq, group = COD, fill = COD)) + geom_area(position = "fill") +
ggtitle("Stacked Area Chart for All Causes of Death") + xlab("Year of Death") +
ylab("Percentage")
A stacked area chart, for those who's not familiar, shows the change in proportion over time. A famous example is the Baby Name Voyager. The dataset I used grouped the causes of death (COD) into 16 different buckets with varies combination of ICD 10 codes. Sum of these COD will be constant at 1, but the probably of each COD will change from year to year. The Chart above showed that the three largest COD are Other, Cancer and Cardiovascular diseases. It's hard to define “Other”, since it's composed of many COD beside the major ones. It could also come from data errors, when some doctors might forget to put down a cause of death. To drill a little deeper into each COD, I grouped them into medical related death and non medical related death. And they are shown below:
# medical related death
cfpm <- cfp[cfp$group == "med", ]
ggplot(cfpm, aes(x = Year, y = pct, group = COD)) + geom_area(aes(color = COD,
fill = COD), position = "stack") + ylim(c(0, 1)) + ggtitle("Stacked Area Chart for medically related Deaths") +
ylab("Percentage")
# non medical related death
cfpnm <- cfp[cfp$group != "med", ]
ggplot(cfpnm, aes(x = Year, y = pct, group = COD)) + geom_area(aes(color = COD,
fill = COD), position = "stack") + ylim(c(0, 1)) + ggtitle("Stacked Area Chart for non-medically related Deaths") +
ylab("Percentage")
By separating out medical and non medical, we can see that medical related deaths, especially heart diseases and cancer,have been decreasing over the year, while the “Other” category keeps expanding. This is mostly consistent with the Bloomberg visualization. The Bloomberg visualization also provides some hints about what's included in “Other”, as it pointed out that drug related death and suicide have been on the rise.
And finally, I graph the top 5 causes of death during the period (2000 to 2011) below:
library(plyr)
cfp.order <- cfp[with(cfp, order(Year, -pct)), ]
cfp.top5 <- ddply(cfp.order, .(Year), function(x) print(head(x, 5)))
## COD Year Freq group pct
## 1 CAR 2000 58 med 0.309
## 2 CAN 2000 34 med 0.181
## 3 OOO 2000 33 other 0.176
## 4 RES 2000 24 med 0.128
## 5 STR 2000 14 med 0.074
## COD Year Freq group pct
## 1 CAR 2001 76 med 0.317
## 2 CAN 2001 61 med 0.254
## 3 OOO 2001 41 other 0.171
## 4 RES 2001 24 med 0.100
## 5 STR 2001 9 med 0.038
## COD Year Freq group pct
## 1 CAR 2002 121 med 0.328
## 2 CAN 2002 66 med 0.179
## 3 OOO 2002 66 other 0.179
## 4 RES 2002 38 med 0.103
## 5 STR 2002 18 med 0.049
## COD Year Freq group pct
## 1 CAR 2003 111 med 0.304
## 2 CAN 2003 67 med 0.184
## 3 OOO 2003 64 other 0.175
## 4 RES 2003 39 med 0.107
## 5 STR 2003 20 med 0.055
## COD Year Freq group pct
## 1 CAR 2004 90 med 0.288
## 2 OOO 2004 79 other 0.253
## 3 CAN 2004 62 med 0.199
## 4 RES 2004 28 med 0.090
## 5 STR 2004 15 med 0.048
## COD Year Freq group pct
## 1 CAR 2005 107 med 0.327
## 2 OOO 2005 63 other 0.193
## 3 CAN 2005 60 med 0.183
## 4 RES 2005 33 med 0.101
## 5 MEN 2005 22 med 0.067
## COD Year Freq group pct
## 1 CAR 2006 117 med 0.328
## 2 CAN 2006 73 med 0.204
## 3 OOO 2006 60 other 0.168
## 4 RES 2006 35 med 0.098
## 5 MEN 2006 28 med 0.078
## COD Year Freq group pct
## 1 CAR 2007 114 med 0.322
## 2 OOO 2007 69 other 0.195
## 3 CAN 2007 68 med 0.192
## 4 RES 2007 26 med 0.073
## 5 MEN 2007 18 med 0.051
## COD Year Freq group pct
## 1 CAR 2008 97 med 0.282
## 2 OOO 2008 85 other 0.247
## 3 CAN 2008 76 med 0.221
## 4 RES 2008 23 med 0.067
## 5 MEN 2008 14 med 0.041
## COD Year Freq group pct
## 1 OOO 2009 96 other 0.309
## 2 CAR 2009 83 med 0.267
## 3 CAN 2009 54 med 0.174
## 4 RES 2009 26 med 0.084
## 5 MEN 2009 14 med 0.045
## COD Year Freq group pct
## 1 OOO 2010 119 other 0.395
## 2 CAR 2010 75 med 0.249
## 3 CAN 2010 43 med 0.143
## 4 RES 2010 14 med 0.047
## 5 MEN 2010 12 med 0.040
## COD Year Freq group pct
## 1 OOO 2011 118 other 0.392
## 2 CAR 2011 66 med 0.219
## 3 CAN 2011 43 med 0.143
## 4 RES 2011 26 med 0.086
## 5 MEN 2011 14 med 0.047
ggplot(cfp.top5, aes(x = Year, y = pct, fill = COD)) + geom_point(size = 3,
aes(shape = COD, color = COD)) + geom_line(size = 1, aes(group = COD, color = COD)) +
ylab("Percentage of each COD") + xlab("Year") + ggtitle("Top Five Causes of Death Over The Last Ten Years")
So we did I learn? I learned that a typical life insurance buyer is a 40-year-old man, who will buy $500,000 of life insurance, and will likely die due to “other” cause of death.
There are a few items I hope to come back to in the future: