The data used for this analysis is an open-source patient level data set. It is simulated data sourced from the textbook Machine Learning with R by Brett Lantz. Given that the data is simulated, we cannot actually apply these findings to real world decisions. However, the data was simulated based on demographic statistics from the U.S. Census Bureau which roughly reflect real-world conditions.
Below is a peek at what the data looks like.
library(data.table)
library(tidyverse)
library(knitr)
dt = fread("C:/Users/Alexei/Desktop/Stuff/emma/data/insurance_cost/insurance.csv")
head(dt) %>% kable()
| age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|
| 19 | female | 27.900 | 0 | yes | southwest | 16884.924 |
| 18 | male | 33.770 | 1 | no | southeast | 1725.552 |
| 28 | male | 33.000 | 3 | no | southeast | 4449.462 |
| 33 | male | 22.705 | 0 | no | northwest | 21984.471 |
| 32 | male | 28.880 | 0 | no | northwest | 3866.855 |
| 31 | female | 25.740 | 0 | no | southeast | 3756.622 |
This data contains yearly medical insurance costs for 1338 patients, along with 6 variables describing patients.
In this analysis, I am going to use R to determine what factors drive high insurance costs.
for (col in names(dt)) {
print(col)
print(dt[is.null(get(col)) | is.na(get(col)) | get(col) == "", .N])
}
## [1] "age"
## [1] 0
## [1] "sex"
## [1] 0
## [1] "bmi"
## [1] 0
## [1] "children"
## [1] 0
## [1] "smoker"
## [1] 0
## [1] "region"
## [1] 0
## [1] "charges"
## [1] 0
Great news! We do not have any missing data.
chargesggplot(dt, aes(x = charges)) +
geom_density(fill = "#69b3a2", alpha = 0.7)
Charges is bimodal with a peak around $7K, and smaller one around $40K with a significant tail of higher charges that seem to be outliers. To understand what drives higher insurance charges, I will dive deeper into the second peak in the distribution later.
dt[order(-charges)][1:10] %>% kable()
| age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|
| 54 | female | 47.410 | 0 | yes | southeast | 63770.43 |
| 45 | male | 30.360 | 0 | yes | southeast | 62592.87 |
| 52 | male | 34.485 | 3 | yes | northwest | 60021.40 |
| 31 | female | 38.095 | 1 | yes | northeast | 58571.07 |
| 33 | female | 35.530 | 0 | yes | northwest | 55135.40 |
| 60 | male | 32.800 | 0 | yes | southwest | 52590.83 |
| 28 | male | 36.400 | 1 | yes | southwest | 51194.56 |
| 64 | male | 36.960 | 2 | yes | southeast | 49577.66 |
| 59 | male | 41.140 | 1 | yes | southeast | 48970.25 |
| 44 | female | 38.060 | 0 | yes | southeast | 48885.14 |
Every single one of the top 10 outliers is obese and a smoker with BMI > 30. Age and children do not seem to be drivers here. 70% are located in the southern region of the USA.
ageggplot(dt, aes(x = age)) +
geom_density(fill = "#69b3a2", alpha = 0.7)
summary(dt$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 27.00 39.00 39.21 51.00 64.00
The average age is 39. There is a broad age range that is well represented within the data - only adults aged 18 - 64 are included. There is a swift drop at about age 60 which could indicate retirement and a change of insurance to medicare.
ggplot(dt, aes(x = age, y = charges)) +
geom_point()
The data suggests that cost increases with age. There appear to be 3 bands in the data. Later I will focus on factors that drive the upper two bands as they clearly drive up healthcare costs.
bmiggplot(dt, aes(x = bmi)) +
geom_density(fill = "#69b3a2", alpha = 0.7)
dt[order(-bmi)][1:10] %>% kable()
| age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|
| 18 | male | 53.13 | 0 | no | southeast | 1163.463 |
| 22 | male | 52.58 | 1 | yes | southeast | 44501.398 |
| 23 | male | 50.38 | 1 | no | southeast | 2438.055 |
| 58 | male | 49.06 | 0 | no | southeast | 11381.325 |
| 46 | female | 48.07 | 2 | no | northeast | 9432.925 |
| 52 | male | 47.74 | 1 | no | southeast | 9748.911 |
| 37 | female | 47.60 | 2 | yes | southwest | 46113.511 |
| 47 | male | 47.52 | 1 | no | southeast | 8083.920 |
| 54 | female | 47.41 | 0 | yes | southeast | 63770.428 |
| 52 | female | 46.75 | 5 | no | southeast | 12592.534 |
BMI is fairly normally distributed with the peak around 30, and a tail of high BMI values. A BMI over 30 is considered obese. The top 10 highest BMI patients have high charges, the highest of which come from smokers in the southern regions. The highest BMI patient seems to be an outlier due to being one of the youngest ages in the data and having one of the lowest charges. Low charges could indicate that this individual may not be physically able to interact with the healthcare system or could be due to young age.
ggplot(dt, aes(x = bmi, y = charges)) +
geom_point()
As BMI increases there is a trend with cost increasing. After a BMI > 30 there is a cluster of high charges. Later, I will focus on what other factors could be contributing to a higher cost for some obese patients.
smokerdt[ , .(.N, avg_cost = mean(charges)), by = smoker] %>% kable()
| smoker | N | avg_cost |
|---|---|---|
| yes | 274 | 32050.232 |
| no | 1064 | 8434.268 |
Majority of the patients are not smokers, but the average cost for smokers is significantly more compared to nonsmokers.
ggplot(dt, aes(x = smoker, y = charges, fill = smoker)) +
geom_violin()
The plot shows that there is an increase in cost for smokers. Smoking is bimodal with two peaks around $20K and $40K. Clearly smoking is a factor in determining increased insurance cost. The peak around $40K indicates that there may be another factor that could be contributing to smokers having a higher cost.
childrendt[ , children := factor(children)]
dt[ , .(.N, avg_cost = mean(charges)), by = children][order(children)] %>% kable()
| children | N | avg_cost |
|---|---|---|
| 0 | 574 | 12365.976 |
| 1 | 324 | 12731.172 |
| 2 | 240 | 15073.564 |
| 3 | 157 | 15355.318 |
| 4 | 25 | 13850.656 |
| 5 | 18 | 8786.035 |
The mode of the patient population do not have children, but the majority have 1-3 children.
dt[sex == "male", .N, by = children][order(children)] %>% kable()
| children | N |
|---|---|
| 0 | 285 |
| 1 | 166 |
| 2 | 121 |
| 3 | 80 |
| 4 | 14 |
| 5 | 10 |
By filtering for males we can determine that children are counted as dependents.
ggplot(dt, aes(x = children, y = charges, fill = children)) +
geom_violin()
There is a slight upward trend in cost with an increase in dependents.
ggplot(dt, aes(x = children, y = charges, fill = sex)) +
geom_violin()
Is childbirth a driver in higher cost for women? Overall there seems to be a negligible additional cost for women for subsequent children. This indicates that childbirth may not be a huge driver to increasing insurance cost for females.
sexdt[ , .(.N, avg_cost = mean(charges)), by = sex] %>% kable()
| sex | N | avg_cost |
|---|---|---|
| female | 662 | 12569.58 |
| male | 676 | 13956.75 |
ggplot(dt, aes(x = sex, y = charges, fill = sex)) +
geom_violin()
Males have a slight increase in cost but overall sex does not seem to be a driving factor for increase cost.
regiondt[ , .(totalcount =.N, avg_cost = mean(charges)), by = region] %>% kable()
| region | totalcount | avg_cost |
|---|---|---|
| southwest | 325 | 12346.94 |
| southeast | 364 | 14735.41 |
| northwest | 325 | 12417.58 |
| northeast | 324 | 13406.38 |
All regions are fairly represented in the data set. On average the Southeast has a slight increase in cost.
ggplot(dt, aes(x = region, y = charges, fill = region)) +
geom_violin()
There is a slight increase in cost for Southern regions.
Age, BMI, and smoking seem to be the main factors driving up cost based on the above analysis. Here we will take a deeper look into how these variables together impact cost.
age and smokingggplot(dt, aes(x= age, y = charges, color = smoker)) +
geom_point()
All of the higher costs in the upper band are associated with smokers. The middle band seems to be a mix and the lower band includes only nonsmokers. As age increases cost follows. There may be another variable that is contributing to the higher cost band. Let’s now see how obesity and age impact charges.
age and obesitydt[ , obese := bmi >= 30]
ggplot(dt, aes(x = age, y = charges, color = obese)) +
geom_point()
For obesity, as age increases so does cost. The upper band is almost entirely comprised of obese patients. Although obesity is shown to increase cost it is not as definitive of an indicator as smoking because the lowest band is a mix of obese and not obese patients. Smoking seems to be the primary driver to increase cost.
smoking and obesityggplot(dt, aes(x = smoker, y = charges, fill = obese)) +
geom_violin()
Non-smokers have similar costs whether they are obese or not. Smokers have much higher costs than non-smokers in general, with obese smokers having by far the highest costs. Thus, obesity seems to be an important factor only for smokers.
As a sensitivity analysis, let’s be less strict on the BMI group and see how the data changes when we include overweight patients (BMI >= 25.0).
dt[ , overweight := bmi > 25.0]
ggplot(dt, aes(x = smoker, y = charges, fill = overweight)) +
geom_violin()
The story does not change here - patients who smoke and are overweight drive up cost. However, there is more overlap in cost between overweight and non-overweight patients. Obesity partitions the data with less overlap.
dt[smoker == "no", category := "Not smoker"]
dt[smoker == "yes" & obese == TRUE, category := "Smoker and obese"]
dt[smoker == "yes" & obese == FALSE, category := "Smoker and not obese"]
ggplot(dt, aes(x = age, y = charges, color = category)) +
geom_point()
The analysis illuminates that the first order effect on cost is smoking and the second order effect is obesity. Smoking and obesity combined correlates to the highest insurance costs. As age increases so does cost across all categories at a fairly similar rate.