knitr::opts_chunk$set(
message = FALSE,
warnings = FALSE
)
library(readr)
library(dplyr)
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.2.2
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.2
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.2.2
library(viridis)
## Warning: package 'viridis' was built under R version 4.2.2
library(caret)
## Warning: package 'caret' was built under R version 4.2.2
library(car)
## Warning: package 'car' was built under R version 4.2.2
## Warning: package 'carData' was built under R version 4.2.2
library(MVN)
## Warning: package 'MVN' was built under R version 4.2.2
library(pROC)
## Warning: package 'pROC' was built under R version 4.2.2
set.seed(3999)
In this project, I use a dataset on medical costs that contains the total medical costs of individuals with a variety of different factors. Here is a breakdown of our variables:
age: age of primary beneficiary
sex: insurance contractor gender, female, male
bmi: Body mass index, providing an understanding of body, weights that are relatively high or low relative to height, objective index of body weight (kg / m ^ 2) using the ratio of height to weight, ideally 18.5 to 24.9
children: Number of children covered by health insurance / Number of dependents
smoker: Smoking
region: the beneficiary’s residential area in the US, northeast, southeast, southwest, northwest.
charges: Individual medical costs billed by health insurance
Now, we’ll proceed with taking a look at and cleaning our data.
For our data investigation, we’ll take a look at our data, the structure of our data and look to see if there are any interesting relationships that exist. We’ll also create some visualizations to further investigate these relationships, with a particlar interest on how variables relate to medical costs.
First, let’s just take a look at our data to see the variables we have, whether they are numeric or categorical, and the levels of any categorical variables we have.
insurance_raw <- read_csv("insurance.csv")
head(insurance_raw)
## # A tibble: 6 × 7
## age sex bmi children smoker region charges
## <dbl> <chr> <dbl> <dbl> <chr> <chr> <dbl>
## 1 19 female 27.9 0 yes southwest 16885.
## 2 18 male 33.8 1 no southeast 1726.
## 3 28 male 33 3 no southeast 4449.
## 4 33 male 22.7 0 no northwest 21984.
## 5 32 male 28.9 0 no northwest 3867.
## 6 31 female 25.7 0 no southeast 3757.
unique(insurance_raw$sex); unique(insurance_raw$smoker); unique(insurance_raw$region)
## [1] "female" "male"
## [1] "yes" "no"
## [1] "southwest" "southeast" "northwest" "northeast"
Here, we can see that we have 3 categorical variables - a few binary (sex and smoker status), and one with multiple levels (region). Something I’m curious about, is looking at BMI a little deeper to see if we have any BMI values that are greater than 30. I’m curious about this, because a BMI larger than 30 indicates an individual is obese, which could play some importance in their medical costs.
hist(insurance_raw$bmi)
Wow.. It appears that almost half of the invidiuals in this dataset could be categorized as having obesity. Seems like it would be wise to create another variable to make this categorization since so many are affected by this in our data.
insurance_raw <- insurance_raw %>% mutate(obesity = ifelse(bmi >= 30, "yes", "no")) %>% relocate(obesity, .before = charges)
Lastly, let’s go ahead and check for any NA values in our data to see if we need to remove any rows, or take a deeper look at missing values.
sum(is.na(insurance_raw))
## [1] 0
Ahh, a dream - nothing is missing! There won’t be much to do in terms of data cleaning (woohoo!), but we’ll probably want to recode some of our categorical varibles to help with a few calculations down the road.
insurance_clean <- insurance_raw %>%
mutate(sex = ifelse(sex == "female", 1, 0)) %>%
mutate(smoker = ifelse(smoker == "yes", 1, 0)) %>%
mutate(region = case_when(region == "southwest" ~ "1",
region == "southeast" ~ "2",
region == "northwest" ~ "3",
region == "northeast" ~ "4")) %>%
mutate(region = as.numeric(region)) %>%
mutate(obesity = ifelse(obesity == "yes", 1, 0)) %>%
relocate(obesity, .before = charges)
head(insurance_clean)
## # A tibble: 6 × 8
## age sex bmi children smoker region obesity charges
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 19 1 27.9 0 1 1 0 16885.
## 2 18 0 33.8 1 0 2 1 1726.
## 3 28 0 33 3 0 2 1 4449.
## 4 33 0 22.7 0 0 3 0 21984.
## 5 32 0 28.9 0 0 3 0 3867.
## 6 31 1 25.7 0 0 2 0 3757.
Alright - recoding completed. We may run into an issue with recoding of our region given they don’t have a logical order to them, but we’ll go ahead and proceed and deal with that at a later part if it seems to be an issue.
Next, I’m curious about how our variables correlate with each other. This will help determine what variables might be most important in our eventual predictions of charges, but also will help us in the general data investigation aspect of this project.
corr_mat <- round(cor(insurance_clean), 4)
get_lower_tri<-function(corr_mat) {
corr_mat[upper.tri(corr_mat)] <- NA
return(corr_mat)
}
lower_tri <- get_lower_tri(corr_mat)
melted_corrmat <- melt(lower_tri, na.rm = TRUE)
corr_plot <- ggplot(data = melted_corrmat, aes(x=Var1, y=Var2, fill=value)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 12, hjust = 1))+
coord_fixed()
corr_plot +
theme(axis.title.x = element_blank(), axis.title.y = element_blank())
So, it doesn’t seem like every variable in this dataset is going to be super important to predicting charges, but there are some key takeaways here:
Some interesting takeaways here. This gives me a few ideas of visualizations to investigate this a bit further.
To start, I simply want to see how our potential predictors are distributed. Given our main goal of predicting, we want normality in these variables, but it will also simply be interesting to see how many individuals fall within each category, or how numerical variables are spread out.
We’ll start out taking a look at our numeric variables to see how they distribute.
We see that medical charges tend to be on the lower end (most below ~$15,000), but there are a decent amount of medical costs on the higher end, ranging up to around $50,000. Ages seem to be fairly, evenly distributed, ranging from 18 years old up to around 65 years old. As we saw before, BMI is pretty much a normal distribution, and we’re seeing around half of those in our dataset having obesity. And finally, we see that the majority of individuals in this dataset don’t have any children, where each additional child has a lower and lower count of individuals.
Interesting. Not much to go on from here, but let’s look at our categorical variables next.
ggplot(data = insurance_raw, aes(x = smoker, fill = smoker)) +
geom_bar() +
theme_minimal()
ggplot(data = insurance_raw, aes(x = sex, fill = sex)) +
geom_bar() +
theme_minimal()
ggplot(data = insurance_raw, aes(x = region, fill = region)) +
geom_bar() +
theme_minimal()
ggplot(data = insurance_raw, aes(x = obesity, fill = obesity)) +
geom_bar() +
theme_minimal()
So, it sppears most of those in this dataset are not smokers. We also have a fairly even distribution of males and females, as well as across the four regions in our data. Lastly, it actually appears that more than HALF of individuals could be categorized as having obesity. Now, let’s take a deeper look at some of the relationships we say with medical charges with visualizations.
First, let’s simply see how much smokers and non-smokers are paying in medical charges.
ggplot(data = insurance_raw, aes(x = smoker, y = charges, fill = smoker)) +
geom_bar(stat = "identity", position = position_dodge()) +
theme_minimal()
Wow… Even though smokers make up much less of our data, they actually are paying MORE than non-smokers. Just from this alone, it seems like being a smoker may lead you to pay much more for medical expenses. Now, let’s see if other categorical variables tend to play a role here.
ggplot(data = insurance_raw, aes(x = smoker, y = charges, fill = sex)) +
geom_bar(stat = "identity", position = position_dodge()) +
theme_minimal()
ggplot(data = insurance_raw, aes(x = smoker, y = charges, fill = region)) +
geom_bar(stat = "identity", position = position_dodge()) +
theme_minimal()
ggplot(data = insurance_raw, aes(x = smoker, y = charges, fill = obesity)) +
geom_bar(stat = "identity", position = position_dodge()) +
theme_minimal()
Very interesting. So smokers of different genders pay about the same and those in different regions pay about the same. This can also be said for non-smokers.
The most interesting take-away here is that smokers and non-smokers who are not classified as obese seem to pay roughly the same (although, there are many more smokers, so they most likely spend more), but those who are obese AND smokers pay waaay more than those who are obese and non-smokers.
So, it appears there is some interaction between smokers and obesity that may lead to to these much larger costs - something we’ll want to take into account when it comes to our prediction model.
Next, let’s look at various ages and their medical charges. We say age was positively correlated with medical charges, so we’d expect to see those who are older tending to pay more for medical expenses, on average.
ggplot(data = insurance_raw, aes(x = age, y = charges)) +
geom_point(color = "blue") +
geom_smooth(method = lm, se = FALSE, fullrange = TRUE, color = "black") +
theme_minimal()
Yep. We see a clear, positive correlation here. This graph does look interesting, however. There seems to be some groupings that are linearlly correlated with each other. We’ll have to add some additional variables here to see if any variable is associated with this grouping.
ggplot(data = insurance_raw, aes(x = age, y = charges, color = smoker)) +
geom_point() +
geom_smooth(method = lm, se = FALSE, fullrange = TRUE) +
theme_minimal()
ggplot(data = insurance_raw, aes(x = age, y = charges, color = sex)) +
geom_point() +
geom_smooth(method = lm, se = FALSE, fullrange = TRUE) +
theme_minimal()
ggplot(data = insurance_raw, aes(x = age, y = charges, color = region)) +
geom_point() +
geom_smooth(method = lm, se = FALSE, fullrange = TRUE) +
theme_minimal()
ggplot(data = insurance_raw, aes(x = age, y = charges, color = obesity)) +
geom_point() +
geom_smooth(method = lm, se = FALSE, fullrange = TRUE) +
theme_minimal()
Looks like there isn’t much separation between regions or sex. But, we do see some solid separation between obesity and a LARGE separation for smokers and non-smokers. This seems to match our barplot that showed smokers tend to spend much more than non-smokers, and that those with obesity tend to spend more than those who are not obese. However, we say a very interesting plot previously, that showed smokers who are obese tend to spend the most on medical. Let’s create a graph that breaks down smokers and obesity to see if we can see any additional separations.
ggplot(data = insurance_raw, aes(x = age, y = charges, color = smoker, shape = obesity)) +
geom_point() +
scale_shape_manual(values=c(3, 17))+
geom_smooth(method = lm, se = FALSE, fullrange = TRUE) +
theme_minimal()
Alright, let’s digest this. It seems like, regardless of your obesity
status, non-smokers simply do not tend to spend much on medical expenses
(with some exceptions). We see that those who are not obese, but are
smokers, cause a significant jump in medical charges, but tend to spend
more as they age at roughly the same rate as those who are not
smokers.
The more interesting take away here - those who are obese and smokers tend to spend the most, at around the same rate as the other groupings.
What does this mean? Well, smoking causes one to spend much more on medical charges, but if you are obese and a smoker, your expenditures are VERY high, much higher than those who are non-smokers and those who are smokers but not obese. It appears that smoking leads to higher medical expenditures, obesity can (but not always) lead to higher medical expenditures, but if you are a smoker and obese, you can expect to have some pretty significant costs compared to others.
Now, let’s look at BMI. We say BMI positively correlated with medical charges from our correlation matrix, but let’s visualize that.
ggplot(data = insurance_raw, aes(x = bmi, y = charges)) +
geom_point(color = "blue") +
geom_smooth(method = lm, se = FALSE, fullrange = TRUE, color = "black") +
theme_minimal()
Looks to be expected, however, we see another separation in our data here. We’ll approach this similarly as we did with age and add some other variable into the mix to see if we can find this separation.
ggplot(data = insurance_raw, aes(x = bmi, y = charges, color = smoker)) +
geom_point() +
geom_smooth(method = lm, se = FALSE, fullrange = TRUE) +
theme_minimal()
ggplot(data = insurance_raw, aes(x = bmi, y = charges, color = sex)) +
geom_point() +
geom_smooth(method = lm, se = FALSE, fullrange = TRUE) +
theme_minimal()
ggplot(data = insurance_raw, aes(x = bmi, y = charges, color = region)) +
geom_point() +
geom_smooth(method = lm, se = FALSE, fullrange = TRUE) +
theme_minimal()
Alright, another interesting set of visuals. Again, we’re seeing that age and region don’t seem to play a row in BMIs relationship with medical charges. However, we again see that smoking status plays an additional role, now with BMI and charges. It appear that, if you are a non-smoker, an increase in BMI leads to only a slight increase in medical charges, but no by much. However, if you are a smoker, having an increase in BMI is leading to a LARGE increase in charges.
We actually see that those with very low BMIs, regardless of smoking or not, have pretty similar medical expenses. However, as your BMI increases when you’re also a smoker, your medical expenses increase significantly.
Just for curiousity sake, let’s see if there is some relationship between age and BMI.
ggplot(data = insurance_raw, aes(x = age, y = bmi)) +
geom_point(color = "blue") +
geom_smooth(method = lm, se = FALSE, fullrange = TRUE, color = "black") +
theme_minimal()
Doesn’t appear like there is much here. Maybe a slight increase in BMI as you age, but BMI tends to stay pretty constant through our age groups.
Let’s take one final look at how medical expenses are distributed across various levels of our data.
ggplot(data = insurance_raw, aes(x = smoker, y = charges, fill = smoker)) +
geom_violin() +
theme_minimal()
ggplot(data = insurance_raw, aes(x = sex, y = charges, fill = sex)) +
geom_violin() +
theme_minimal()
ggplot(data = insurance_raw, aes(x = region, y = charges, fill = region)) +
geom_violin() +
theme_minimal()
ggplot(data = insurance_raw, aes(x = obesity, y = charges, fill = obesity)) +
geom_violin() +
theme_minimal()
This seems to reiterate what we’ve seen, but again shows solid separation between smokers and non-smokers and their medical expenses, as well as a grouping of additional expenses amongst those who are classified as obese.
Alright, time to move onto our model creation and predictions!
For this part, we’ll work on building the best model we can, and then use that model to predict medical charges. First step is working on our model!
Before creating our model, we’ll want to address some needed assumptions: no multicollinearity and multivariate normality.
In essence, we want our predictor variables to be highly correlated with our variable we want to predict (medical charges), but we don’t want those variables to be correlated with each other (known as multicollinearity).
There’s a quick way to check this in R. First, we’ll force a reduced linear model using all of our predictors. We can then find the Variance Inflation Factor (VIF) of each variable to determine if there is a severe level of correlation between them. We want a VIF less than 5 for each variable, and if that’s the case, we’re rockin’ and rollin’.
reduced_model <- lm(charges ~ age + bmi + sex + children + smoker + region + obesity, data = insurance_clean)
car::vif(reduced_model)
## age bmi sex children smoker region obesity
## 1.015402 2.820392 1.009190 1.002482 1.006492 1.026674 2.770875
Woohoo! All VIF values are looking good, so we’ve met the assumption that there is no multicollinearity present in our model. Next, we’ll look into multivariate normality.
We need to check if our residuals are normally distributed, otherwise known as Multivariate Normality. There’s a simple package in R that allows us to make a quick check.
mvn <- mvn(data = insurance_clean, mvnTest = "hz")
print(mvn$multivariateNormality)
## Test HZ p value MVN
## 1 Henze-Zirkler 6.999874 0 NO
Well, looks like (from the last column), that our data does not meet this assumption. We previously saw that the age variable wasn’t really normally distributed, so this may be our issue. Let’s take a look at the qqplots for our numeric data to see if we’re running into this issue for a specific variable.
qqPlot(insurance_clean$age)
## [1] 63 95
qqPlot(insurance_clean$bmi)
## [1] 1318 1048
qqPlot(insurance_clean$children)
## [1] 33 72
As expected, it seems like age is the culprit here. To deal with this, I’ll simply add a squared version of the variable to our data account for the lack of linearity present here. This isn’ the perfect solution, but should give us some additional help when it comes to predicting.
insurance_clean$age2 <- (insurance_clean$age)^2
Alright, assumptions are dealt with, now time to move onto creating our model!
We already created a reduced moodel to check for multicollinearity, and we don’t need to remove any variables to continue with this model based on our assumptions being met. Now, let’s take a look at our reduced model and how well it’s doing.
summary(reduced_model)
##
## Call:
## lm(formula = charges ~ age + bmi + sex + children + smoker +
## region + obesity, data = insurance_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11929.3 -3432.8 -137.3 1545.2 28332.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9063.73 1351.71 -6.705 2.96e-11 ***
## age 257.46 11.77 21.877 < 2e-16 ***
## bmi 143.73 45.19 3.181 0.001503 **
## sex 161.07 329.57 0.489 0.625122
## children 479.65 136.29 3.519 0.000447 ***
## smoker 23831.35 407.78 58.441 < 2e-16 ***
## region 332.86 150.48 2.212 0.027136 *
## obesity 2877.29 546.96 5.261 1.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6000 on 1330 degrees of freedom
## Multiple R-squared: 0.7558, Adjusted R-squared: 0.7545
## F-statistic: 588.1 on 7 and 1330 DF, p-value: < 2.2e-16
This is pretty good! Even though we haven’t made any adjustments, we’re already getting an Adjusted R-Squared value of 0.7545, which indicated we’re capturing a pretty large portion of medical charges with this information. We see that sex is not really a super strong variable for modeling medical charges, and region is only slightly significant. However, age, BMI, number of children smoker status and obesity are very significant, and we’ll definitely want to consider them in a final model. Now, let’s add some interactions and our squared age term to see how the model performs.
Given age was an issue in it’s residual normality, we’ll want to add a squared version of the variable to help with our assumptions. We also saw an interesting interaction between smoking and obesity in our visualizations, where being a smoker and obese led to much higher medical charges. To account for this interaction, we’ll add an additional variable to our model where we multiple smoking status and obesity status together so we can have a slope measure for when an individual is a smoker and obese simultaneously.
full_model <- glm(charges ~ age + age2 + bmi + sex + children + smoker + region + obesity + obesity*smoker, data = insurance_clean)
summary(full_model)
##
## Call:
## glm(formula = charges ~ age + age2 + bmi + sex + children + smoker +
## region + obesity + obesity * smoker, data = insurance_clean)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -17300.1 -1656.4 -1262.0 -731.6 24128.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1952.5862 1398.1047 -1.397 0.162769
## age -32.8368 59.7802 -0.549 0.582897
## age2 3.7341 0.7458 5.007 6.27e-07 ***
## bmi 118.3823 33.4815 3.536 0.000421 ***
## sex 497.1827 244.1990 2.036 0.041951 *
## children 680.1827 105.7319 6.433 1.74e-10 ***
## smoker 13400.4757 438.7671 30.541 < 2e-16 ***
## region 421.1681 111.4352 3.779 0.000164 ***
## obesity -991.5070 421.5131 -2.352 0.018805 *
## smoker:obesity 19807.3646 604.0016 32.794 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 19729354)
##
## Null deviance: 1.9607e+11 on 1337 degrees of freedom
## Residual deviance: 2.6201e+10 on 1328 degrees of freedom
## AIC: 26284
##
## Number of Fisher Scoring iterations: 2
Even though it seems like the full model is performing better, let’s be statistically sound and properlly compare the two models with an Analysis of Covariance (ANOVA) Chi-Square test. This test will help us determine if the added variables in the full model are significantly improve our reduced model.
anova(reduced_model, full_model, test = "Chisq")
## Analysis of Variance Table
##
## Model 1: charges ~ age + bmi + sex + children + smoker + region + obesity
## Model 2: charges ~ age + age2 + bmi + sex + children + smoker + region +
## obesity + obesity * smoker
## Res.Df RSS Df Sum of Sq Pr(>Chi)
## 1 1330 4.7878e+10
## 2 1328 2.6201e+10 2 2.1677e+10 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here we see a veeeery small p-value, which indicates that our full model is performing significantly better than our reduced model! Perfect, we can now move on to train our model to make some predictions.
To predict medical expenses, we’ll want to first split our data into “train” and “test” data. By doing this, we’ll be able to employ our model onto a train dataset, make predictions with that data, then test the accuracy of our predictions of our test dataset. This helps us ensure that the data we’re predicting isn’t coming straight from the data we build our model upon.
We’ll approach this training with a Cross Validation, where we’ll split up our training dataset into four parts, 3 to train with and 1 to test with. This will help make sure there isn’t a specific subset of the data we train with that varies significantly from the test data.
First we’ll split our data, then we’ll use our full model previously obtained to train our model.
train <- trainControl(method = "cv", number = 4)
mod2 <- train(charges ~ age + age^2 + bmi + sex + children + smoker + region + obesity + obesity*smoker, data = insurance_clean, method = "lm", trControl = train)
print(mod2)
## Linear Regression
##
## 1338 samples
## 7 predictor
##
## No pre-processing
## Resampling: Cross-Validated (4 fold)
## Summary of sample sizes: 1006, 1002, 1002, 1004
## Resampling results:
##
## RMSE Rsquared MAE
## 4475.718 0.8631125 2456.319
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Alright, we obtained a pretty large R-Squared here! Looking like we have some pretty solid prediction potential, although not perfect. Now it’s time for predicting!
First, we can use our full model to make predictions of medical charges, then visualize those predictions against our true medical charges. Then, we can investigate the accuracy of our predictions more precisely outside of simple visualizations.
insurance_clean$index <- 1:nrow(insurance_clean)
insurance_train <- insurance_clean %>% dplyr::sample_frac(0.7) #split 70% of our data into a training data to train predictions
insurance_test <- dplyr::anti_join(insurance_clean, insurance_train, by="index") #join the remainder of the dataset into a test data to see how predicitons fair
insurance_train <- insurance_train %>% dplyr::select(-index)
insurance_test <- insurance_test %>% dplyr::select(-index)
insurance_predicted <- insurance_test %>%
mutate(predicted_charges = predict(full_model, newdata = insurance_test)) %>%
rename("true_charges" = "charges")
ggplot(data = insurance_predicted, aes(x = true_charges, y = predicted_charges)) +
geom_point() +
geom_smooth(method = lm, se = FALSE, fullrange = TRUE) +
theme_minimal()
Well, this is looking pretty good! We see that, particularly for lower expenses, we can predict charges fairly well. For middle-of-the-road expenses, it seems like we either very slightly over or largely under predict medical charges, and typically over-predict larger medical expenses. Overall, this seems to be fairly good at predicting specially lower charges.
So what have we learned? Well, we discovered that there are many important factors when it comes to predicting medical charges for an individual. An individual’s sex, age, bmi, smoking status, region, number of children and obesity status all seem to be important predictors of an individuals medical charges. We further found that, those who smoke simply spend much more on medical, on average. Coupling smoking with obesity creates one of the largest risk factors for large medical expenses than any other factor. Even though age and bmi were found to be correlated and important predictors of medical expenses, if you are a smoker, you will be spending much more on medical, especially if you are additionally obese.
Overall, we’ve build a model that help predict expenses fairly well, especially determining expenses of those who have lower medical charges. Potentially some additional information could help with the middle-range charges!