Insurance cost data analysis

Data

The data used for this analysis is an open-source patient level data set. It is simulated data sourced from the textbook Machine Learning with R by Brett Lantz. Given that the data is simulated, we cannot actually apply these findings to real world decisions. However, the data was simulated based on demographic statistics from the U.S. Census Bureau which roughly reflect real-world conditions.

Below is a peek at what the data looks like.

library(data.table)
library(tidyverse)
library(knitr)

dt = fread("C:/Users/Alexei/Desktop/Stuff/emma/data/insurance_cost/insurance.csv")
head(dt) %>% kable()

age	sex	bmi	children	smoker	region	charges
19	female	27.900	0	yes	southwest	16884.924
18	male	33.770	1	no	southeast	1725.552
28	male	33.000	3	no	southeast	4449.462
33	male	22.705	0	no	northwest	21984.471
32	male	28.880	0	no	northwest	3866.855
31	female	25.740	0	no	southeast	3756.622

This data contains yearly medical insurance costs for 1338 patients, along with 6 variables describing patients.

Goal

In this analysis, I am going to use R to determine what factors drive high insurance costs.

Data analysis

Missing data

for (col in names(dt)) {
  print(col)
  print(dt[is.null(get(col)) | is.na(get(col)) | get(col) == "", .N])
}

## [1] "age"
## [1] 0
## [1] "sex"
## [1] 0
## [1] "bmi"
## [1] 0
## [1] "children"
## [1] 0
## [1] "smoker"
## [1] 0
## [1] "region"
## [1] 0
## [1] "charges"
## [1] 0

Great news! We do not have any missing data.

Understanding `charges`

ggplot(dt, aes(x = charges)) +
  geom_density(fill = "#69b3a2", alpha = 0.7)

Charges is bimodal with a peak around $7K, and smaller one around $40K with a significant tail of higher charges that seem to be outliers. To understand what drives higher insurance charges, I will dive deeper into the second peak in the distribution later.

dt[order(-charges)][1:10] %>% kable()

age	sex	bmi	children	smoker	region	charges
54	female	47.410	0	yes	southeast	63770.43
45	male	30.360	0	yes	southeast	62592.87
52	male	34.485	3	yes	northwest	60021.40
31	female	38.095	1	yes	northeast	58571.07
33	female	35.530	0	yes	northwest	55135.40
60	male	32.800	0	yes	southwest	52590.83
28	male	36.400	1	yes	southwest	51194.56
64	male	36.960	2	yes	southeast	49577.66
59	male	41.140	1	yes	southeast	48970.25
44	female	38.060	0	yes	southeast	48885.14

Every single one of the top 10 outliers is obese and a smoker with BMI > 30. Age and children do not seem to be drivers here. 70% are located in the southern region of the USA.

Understanding `age`

ggplot(dt, aes(x = age)) +
  geom_density(fill = "#69b3a2", alpha = 0.7)

summary(dt$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   27.00   39.00   39.21   51.00   64.00

The average age is 39. There is a broad age range that is well represented within the data - only adults aged 18 - 64 are included. There is a swift drop at about age 60 which could indicate retirement and a change of insurance to medicare.

ggplot(dt, aes(x = age, y = charges)) +
  geom_point()

The data suggests that cost increases with age. There appear to be 3 bands in the data. Later I will focus on factors that drive the upper two bands as they clearly drive up healthcare costs.

Understanding `bmi`

ggplot(dt, aes(x = bmi)) +
  geom_density(fill = "#69b3a2", alpha = 0.7)

dt[order(-bmi)][1:10] %>% kable()

age	sex	bmi	children	smoker	region	charges
18	male	53.13	0	no	southeast	1163.463
22	male	52.58	1	yes	southeast	44501.398
23	male	50.38	1	no	southeast	2438.055
58	male	49.06	0	no	southeast	11381.325
46	female	48.07	2	no	northeast	9432.925
52	male	47.74	1	no	southeast	9748.911
37	female	47.60	2	yes	southwest	46113.511
47	male	47.52	1	no	southeast	8083.920
54	female	47.41	0	yes	southeast	63770.428
52	female	46.75	5	no	southeast	12592.534

BMI is fairly normally distributed with the peak around 30, and a tail of high BMI values. A BMI over 30 is considered obese. The top 10 highest BMI patients have high charges, the highest of which come from smokers in the southern regions. The highest BMI patient seems to be an outlier due to being one of the youngest ages in the data and having one of the lowest charges. Low charges could indicate that this individual may not be physically able to interact with the healthcare system or could be due to young age.

ggplot(dt, aes(x = bmi, y = charges)) +
  geom_point()

As BMI increases there is a trend with cost increasing. After a BMI > 30 there is a cluster of high charges. Later, I will focus on what other factors could be contributing to a higher cost for some obese patients.

Understanding `smoker`

dt[ , .(.N, avg_cost = mean(charges)), by = smoker] %>% kable()

smoker	N	avg_cost
yes	274	32050.232
no	1064	8434.268

Majority of the patients are not smokers, but the average cost for smokers is significantly more compared to nonsmokers.

ggplot(dt, aes(x = smoker, y = charges, fill = smoker)) +
  geom_violin()

The plot shows that there is an increase in cost for smokers. Smoking is bimodal with two peaks around $20K and $40K. Clearly smoking is a factor in determining increased insurance cost. The peak around $40K indicates that there may be another factor that could be contributing to smokers having a higher cost.

Understanding `children`

dt[ , children := factor(children)]
dt[ , .(.N, avg_cost = mean(charges)), by = children][order(children)] %>% kable()

children	N	avg_cost
0	574	12365.976
1	324	12731.172
2	240	15073.564
3	157	15355.318
4	25	13850.656
5	18	8786.035

The mode of the patient population do not have children, but the majority have 1-3 children.

dt[sex == "male", .N, by = children][order(children)] %>% kable()

children	N
0	285
1	166
2	121
3	80
4	14
5	10

By filtering for males we can determine that children are counted as dependents.

ggplot(dt, aes(x = children, y = charges, fill = children)) +
  geom_violin()

There is a slight upward trend in cost with an increase in dependents.

ggplot(dt, aes(x = children, y = charges, fill = sex)) +
  geom_violin()

Is childbirth a driver in higher cost for women? Overall there seems to be a negligible additional cost for women for subsequent children. This indicates that childbirth may not be a huge driver to increasing insurance cost for females.

Understanding `sex`

dt[ , .(.N, avg_cost = mean(charges)), by = sex] %>% kable()

sex	N	avg_cost
female	662	12569.58
male	676	13956.75

ggplot(dt, aes(x = sex, y = charges, fill = sex)) +
  geom_violin()

Males have a slight increase in cost but overall sex does not seem to be a driving factor for increase cost.

Understanding `region`

dt[ , .(totalcount =.N, avg_cost = mean(charges)), by = region] %>% kable()

region	totalcount	avg_cost
southwest	325	12346.94
southeast	364	14735.41
northwest	325	12417.58
northeast	324	13406.38

All regions are fairly represented in the data set. On average the Southeast has a slight increase in cost.

ggplot(dt, aes(x = region, y = charges, fill = region)) +
  geom_violin()

There is a slight increase in cost for Southern regions.

Diving deeper

Age, BMI, and smoking seem to be the main factors driving up cost based on the above analysis. Here we will take a deeper look into how these variables together impact cost.

Deeper dive: `age and smoking`

ggplot(dt, aes(x= age, y = charges, color = smoker)) +
  geom_point()

All of the higher costs in the upper band are associated with smokers. The middle band seems to be a mix and the lower band includes only nonsmokers. As age increases cost follows. There may be another variable that is contributing to the higher cost band. Let’s now see how obesity and age impact charges.

Deeper dive: `age and obesity`

dt[ , obese := bmi >= 30]
ggplot(dt, aes(x = age, y = charges, color = obese)) +
  geom_point()

For obesity, as age increases so does cost. The upper band is almost entirely comprised of obese patients. Although obesity is shown to increase cost it is not as definitive of an indicator as smoking because the lowest band is a mix of obese and not obese patients. Smoking seems to be the primary driver to increase cost.

Deeper dive: `smoking and obesity`

ggplot(dt, aes(x = smoker, y = charges, fill = obese)) +
  geom_violin()

Non-smokers have similar costs whether they are obese or not. Smokers have much higher costs than non-smokers in general, with obese smokers having by far the highest costs. Thus, obesity seems to be an important factor only for smokers.

As a sensitivity analysis, let’s be less strict on the BMI group and see how the data changes when we include overweight patients (BMI >= 25.0).

dt[ , overweight := bmi > 25.0]
ggplot(dt, aes(x = smoker, y = charges, fill = overweight)) +
  geom_violin()

The story does not change here - patients who smoke and are overweight drive up cost. However, there is more overlap in cost between overweight and non-overweight patients. Obesity partitions the data with less overlap.

Conclusion

dt[smoker == "no", category := "Not smoker"]
dt[smoker == "yes" & obese == TRUE, category := "Smoker and obese"]
dt[smoker == "yes" & obese == FALSE, category := "Smoker and not obese"]

ggplot(dt, aes(x = age, y = charges, color = category)) +
  geom_point()

The analysis illuminates that the first order effect on cost is smoking and the second order effect is obesity. Smoking and obesity combined correlates to the highest insurance costs. As age increases so does cost across all categories at a fairly similar rate.

Insurance cost data analysis

Emma King

Data

Goal

Data analysis

Missing data

Understanding charges

Understanding age

Understanding bmi

Understanding smoker

Understanding children

Understanding sex

Understanding region