Data

The data used for this analysis is an open-source patient level data set. It is simulated data sourced from the textbook Machine Learning with R by Brett Lantz. Given that the data is simulated, we cannot actually apply these findings to real world decisions. However, the data was simulated based on demographic statistics from the U.S. Census Bureau which roughly reflect real-world conditions.

Below is a peek at what the data looks like.

library(data.table)
library(tidyverse)
library(knitr)
dt = fread("C:/Users/Alexei/Desktop/Stuff/emma/data/insurance_cost/insurance.csv")
head(dt) %>% kable()
age sex bmi children smoker region charges
19 female 27.900 0 yes southwest 16884.924
18 male 33.770 1 no southeast 1725.552
28 male 33.000 3 no southeast 4449.462
33 male 22.705 0 no northwest 21984.471
32 male 28.880 0 no northwest 3866.855
31 female 25.740 0 no southeast 3756.622

This data contains yearly medical insurance costs for 1338 patients, along with 6 variables describing patients.

Goal

In this analysis, I am going to use R to determine what factors drive high insurance costs.

Data analysis

Missing data

for (col in names(dt)) {
  print(col)
  print(dt[is.null(get(col)) | is.na(get(col)) | get(col) == "", .N])
}
## [1] "age"
## [1] 0
## [1] "sex"
## [1] 0
## [1] "bmi"
## [1] 0
## [1] "children"
## [1] 0
## [1] "smoker"
## [1] 0
## [1] "region"
## [1] 0
## [1] "charges"
## [1] 0

Great news! We do not have any missing data.

Understanding charges

ggplot(dt, aes(x = charges)) +
  geom_density(fill = "#69b3a2", alpha = 0.7)

Charges is bimodal with a peak around $7K, and smaller one around $40K with a significant tail of higher charges that seem to be outliers. To understand what drives higher insurance charges, I will dive deeper into the second peak in the distribution later.

dt[order(-charges)][1:10] %>% kable()
age sex bmi children smoker region charges
54 female 47.410 0 yes southeast 63770.43
45 male 30.360 0 yes southeast 62592.87
52 male 34.485 3 yes northwest 60021.40
31 female 38.095 1 yes northeast 58571.07
33 female 35.530 0 yes northwest 55135.40
60 male 32.800 0 yes southwest 52590.83
28 male 36.400 1 yes southwest 51194.56
64 male 36.960 2 yes southeast 49577.66
59 male 41.140 1 yes southeast 48970.25
44 female 38.060 0 yes southeast 48885.14

Every single one of the top 10 outliers is obese and a smoker with BMI > 30. Age and children do not seem to be drivers here. 70% are located in the southern region of the USA.

Understanding age

ggplot(dt, aes(x = age)) +
  geom_density(fill = "#69b3a2", alpha = 0.7)

summary(dt$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   27.00   39.00   39.21   51.00   64.00

The average age is 39. There is a broad age range that is well represented within the data - only adults aged 18 - 64 are included. There is a swift drop at about age 60 which could indicate retirement and a change of insurance to medicare.

ggplot(dt, aes(x = age, y = charges)) +
  geom_point()

The data suggests that cost increases with age. There appear to be 3 bands in the data. Later I will focus on factors that drive the upper two bands as they clearly drive up healthcare costs.

Understanding bmi

ggplot(dt, aes(x = bmi)) +
  geom_density(fill = "#69b3a2", alpha = 0.7) 

dt[order(-bmi)][1:10] %>% kable()
age sex bmi children smoker region charges
18 male 53.13 0 no southeast 1163.463
22 male 52.58 1 yes southeast 44501.398
23 male 50.38 1 no southeast 2438.055
58 male 49.06 0 no southeast 11381.325
46 female 48.07 2 no northeast 9432.925
52 male 47.74 1 no southeast 9748.911
37 female 47.60 2 yes southwest 46113.511
47 male 47.52 1 no southeast 8083.920
54 female 47.41 0 yes southeast 63770.428
52 female 46.75 5 no southeast 12592.534

BMI is fairly normally distributed with the peak around 30, and a tail of high BMI values. A BMI over 30 is considered obese. The top 10 highest BMI patients have high charges, the highest of which come from smokers in the southern regions. The highest BMI patient seems to be an outlier due to being one of the youngest ages in the data and having one of the lowest charges. Low charges could indicate that this individual may not be physically able to interact with the healthcare system or could be due to young age.

ggplot(dt, aes(x = bmi, y = charges)) +
  geom_point() 

As BMI increases there is a trend with cost increasing. After a BMI > 30 there is a cluster of high charges. Later, I will focus on what other factors could be contributing to a higher cost for some obese patients.

Understanding smoker

dt[ , .(.N, avg_cost = mean(charges)), by = smoker] %>% kable()
smoker N avg_cost
yes 274 32050.232
no 1064 8434.268

Majority of the patients are not smokers, but the average cost for smokers is significantly more compared to nonsmokers.

ggplot(dt, aes(x = smoker, y = charges, fill = smoker)) +
  geom_violin()

The plot shows that there is an increase in cost for smokers. Smoking is bimodal with two peaks around $20K and $40K. Clearly smoking is a factor in determining increased insurance cost. The peak around $40K indicates that there may be another factor that could be contributing to smokers having a higher cost.

Understanding children

dt[ , children := factor(children)]
dt[ , .(.N, avg_cost = mean(charges)), by = children][order(children)] %>% kable()
children N avg_cost
0 574 12365.976
1 324 12731.172
2 240 15073.564
3 157 15355.318
4 25 13850.656
5 18 8786.035

The mode of the patient population do not have children, but the majority have 1-3 children.

dt[sex == "male", .N, by = children][order(children)] %>% kable()
children N
0 285
1 166
2 121
3 80
4 14
5 10

By filtering for males we can determine that children are counted as dependents.

ggplot(dt, aes(x = children, y = charges, fill = children)) +
  geom_violin()

There is a slight upward trend in cost with an increase in dependents.

ggplot(dt, aes(x = children, y = charges, fill = sex)) +
  geom_violin()

Is childbirth a driver in higher cost for women? Overall there seems to be a negligible additional cost for women for subsequent children. This indicates that childbirth may not be a huge driver to increasing insurance cost for females.

Understanding sex

dt[ , .(.N, avg_cost = mean(charges)), by = sex] %>% kable()
sex N avg_cost
female 662 12569.58
male 676 13956.75
ggplot(dt, aes(x = sex, y = charges, fill = sex)) +
  geom_violin() 

Males have a slight increase in cost but overall sex does not seem to be a driving factor for increase cost.

Understanding region

dt[ , .(totalcount =.N, avg_cost = mean(charges)), by = region] %>% kable()
region totalcount avg_cost
southwest 325 12346.94
southeast 364 14735.41
northwest 325 12417.58
northeast 324 13406.38

All regions are fairly represented in the data set. On average the Southeast has a slight increase in cost.

ggplot(dt, aes(x = region, y = charges, fill = region)) +
  geom_violin()

There is a slight increase in cost for Southern regions.

Diving deeper

Age, BMI, and smoking seem to be the main factors driving up cost based on the above analysis. Here we will take a deeper look into how these variables together impact cost.

Deeper dive: age and smoking

ggplot(dt, aes(x= age, y = charges, color = smoker)) +
  geom_point()

All of the higher costs in the upper band are associated with smokers. The middle band seems to be a mix and the lower band includes only nonsmokers. As age increases cost follows. There may be another variable that is contributing to the higher cost band. Let’s now see how obesity and age impact charges.

Deeper dive: age and obesity

dt[ , obese := bmi >= 30]
ggplot(dt, aes(x = age, y = charges, color = obese)) +
  geom_point()

For obesity, as age increases so does cost. The upper band is almost entirely comprised of obese patients. Although obesity is shown to increase cost it is not as definitive of an indicator as smoking because the lowest band is a mix of obese and not obese patients. Smoking seems to be the primary driver to increase cost.

Deeper dive: smoking and obesity

ggplot(dt, aes(x = smoker, y = charges, fill = obese)) +
  geom_violin()

Non-smokers have similar costs whether they are obese or not. Smokers have much higher costs than non-smokers in general, with obese smokers having by far the highest costs. Thus, obesity seems to be an important factor only for smokers.

As a sensitivity analysis, let’s be less strict on the BMI group and see how the data changes when we include overweight patients (BMI >= 25.0).

dt[ , overweight := bmi > 25.0]
ggplot(dt, aes(x = smoker, y = charges, fill = overweight)) +
  geom_violin()

The story does not change here - patients who smoke and are overweight drive up cost. However, there is more overlap in cost between overweight and non-overweight patients. Obesity partitions the data with less overlap.

Conclusion

dt[smoker == "no", category := "Not smoker"]
dt[smoker == "yes" & obese == TRUE, category := "Smoker and obese"]
dt[smoker == "yes" & obese == FALSE, category := "Smoker and not obese"]

ggplot(dt, aes(x = age, y = charges, color = category)) +
  geom_point()

The analysis illuminates that the first order effect on cost is smoking and the second order effect is obesity. Smoking and obesity combined correlates to the highest insurance costs. As age increases so does cost across all categories at a fairly similar rate.