In this project, I use a dataset on medical costs that contains the total medical costs of individuals with a variety of different factors. Here is a breakdown of our variables:

Now, we’ll proceed with taking a look at and cleaning our data.

Data Investigation

For our data investigation, we’ll take a look at our data, the structure of our data and look to see if there are any interesting relationships that exist. We’ll also create some visualizations to further investigate these relationships, with a particlar interest on how variables relate to medical costs.

Data Cleaning

First, let’s just take a look at our data to see the variables we have, whether they are numeric or categorical, and the levels of any categorical variables we have.

## # A tibble: 6 × 7
##     age sex      bmi children smoker region    charges
##   <dbl> <chr>  <dbl>    <dbl> <chr>  <chr>       <dbl>
## 1    19 female  27.9        0 yes    southwest  16885.
## 2    18 male    33.8        1 no     southeast   1726.
## 3    28 male    33          3 no     southeast   4449.
## 4    33 male    22.7        0 no     northwest  21984.
## 5    32 male    28.9        0 no     northwest   3867.
## 6    31 female  25.7        0 no     southeast   3757.
## [1] "female" "male"
## [1] "yes" "no"
## [1] "southwest" "southeast" "northwest" "northeast"

Here, we can see that we have 3 categorical variables - a few binary (sex and smoker status), and one with multiple levels (region). Something I’m curious about, is looking at BMI a little deeper to see if we have any BMI values that are greater than 30. I’m curious about this, because a BMI larger than 30 indicates an individual is obese, which could play some importance in their medical costs.

Wow.. It appears that almost half of the invidiuals in this dataset could be categorized as having obesity. Seems like it would be wise to create another variable to make this categorization since so many are affected by this in our data.

Lastly, let’s go ahead and check for any NA values in our data to see if we need to remove any rows, or take a deeper look at missing values.

## [1] 0

Ahh, a dream - nothing is missing! There won’t be much to do in terms of data cleaning (woohoo!), but we’ll probably want to recode some of our categorical varibles to help with a few calculations down the road.

## # A tibble: 6 × 8
##     age   sex   bmi children smoker region obesity charges
##   <dbl> <dbl> <dbl>    <dbl>  <dbl>  <dbl>   <dbl>   <dbl>
## 1    19     1  27.9        0      1      1       0  16885.
## 2    18     0  33.8        1      0      2       1   1726.
## 3    28     0  33          3      0      2       1   4449.
## 4    33     0  22.7        0      0      3       0  21984.
## 5    32     0  28.9        0      0      3       0   3867.
## 6    31     1  25.7        0      0      2       0   3757.

Alright - recoding completed. We may run into an issue with recoding of our region given they don’t have a logical order to them, but we’ll go ahead and proceed and deal with that at a later part if it seems to be an issue.

Taking a Look at Correlations

Next, I’m curious about how our variables correlate with each other. This will help determine what variables might be most important in our eventual predictions of charges, but also will help us in the general data investigation aspect of this project.

So, it doesn’t seem like every variable in this dataset is going to be super important to predicting charges, but there are some key takeaways here:

  • Smoker status seems to be the most correlated with medical costs, where if an individual is a smoker, they are more likely to spend more on medical costs, on average.
  • Bmi, obesity and age are slightly, positively correlated with charges. So, if an individual has a higher bmi, or is older, we expect them to be spending more on medical costs, on average.
  • Sex is slightly, negatively correlated with medical costs. Given I assigned females to 1 and males to 0, this shows that females tend to spend less on medical costs, on average.
  • The region someone is in doesn’t seem to be correlated much at all with medical costs.

Some interesting takeaways here. This gives me a few ideas of visualizations to investigate this a bit further.

Data Visualizations

To start, I simply want to see how our potential predictors are distributed. Given our main goal of predicting, we want normality in these variables, but it will also simply be interesting to see how many individuals fall within each category, or how numerical variables are spread out.

Histograms & Barcharts of Categorical Data

We’ll start out taking a look at our numeric variables to see how they distribute.

We see that medical charges tend to be on the lower end (most below ~$15,000), but there are a decent amount of medical costs on the higher end, ranging up to around $50,000. Ages seem to be fairly, evenly distributed, ranging from 18 years old up to around 65 years old. As we saw before, BMI is pretty much a normal distribution, and we’re seeing around half of those in our dataset having obesity. And finally, we see that the majority of individuals in this dataset don’t have any children, where each additional child has a lower and lower count of individuals.

Interesting. Not much to go on from here, but let’s look at our categorical variables next.

So, it sppears most of those in this dataset are not smokers. We also have a fairly even distribution of males and females, as well as across the four regions in our data. Lastly, it actually appears that more than HALF of individuals could be categorized as having obesity. Now, let’s take a deeper look at some of the relationships we say with medical charges with visualizations.

Investigating Relationships with Visualizations

Smoking Status and Medical Charges

First, let’s simply see how much smokers and non-smokers are paying in medical charges.

Wow… Even though smokers make up much less of our data, they actually are paying MORE than non-smokers. Just from this alone, it seems like being a smoker may lead you to pay much more for medical expenses. Now, let’s see if other categorical variables tend to play a role here.

Very interesting. So smokers of different genders pay about the same and those in different regions pay about the same. This can also be said for non-smokers.

The most interesting take-away here is that smokers and non-smokers who are not classified as obese seem to pay roughly the same (although, there are many more smokers, so they most likely spend more), but those who are obese AND smokers pay waaay more than those who are obese and non-smokers.

So, it appears there is some interaction between smokers and obesity that may lead to to these much larger costs - something we’ll want to take into account when it comes to our prediction model.

Age and Medical Charges

Next, let’s look at various ages and their medical charges. We say age was positively correlated with medical charges, so we’d expect to see those who are older tending to pay more for medical expenses, on average.

Yep. We see a clear, positive correlation here. This graph does look interesting, however. There seems to be some groupings that are linearlly correlated with each other. We’ll have to add some additional variables here to see if any variable is associated with this grouping.

Looks like there isn’t much separation between regions or sex. But, we do see some solid separation between obesity and a LARGE separation for smokers and non-smokers. This seems to match our barplot that showed smokers tend to spend much more than non-smokers, and that those with obesity tend to spend more than those who are not obese. However, we say a very interesting plot previously, that showed smokers who are obese tend to spend the most on medical. Let’s create a graph that breaks down smokers and obesity to see if we can see any additional separations.

Alright, let’s digest this. It seems like, regardless of your obesity status, non-smokers simply do not tend to spend much on medical expenses (with some exceptions). We see that those who are not obese, but are smokers, cause a significant jump in medical charges, but tend to spend more as they age at roughly the same rate as those who are not smokers.

The more interesting take away here - those who are obese and smokers tend to spend the most, at around the same rate as the other groupings.

What does this mean? Well, smoking causes one to spend much more on medical charges, but if you are obese and a smoker, your expenditures are VERY high, much higher than those who are non-smokers and those who are smokers but not obese. It appears that smoking leads to higher medical expenditures, obesity can (but not always) lead to higher medical expenditures, but if you are a smoker and obese, you can expect to have some pretty significant costs compared to others.

Age and Medical Charges

Now, let’s look at BMI. We say BMI positively correlated with medical charges from our correlation matrix, but let’s visualize that.

Looks to be expected, however, we see another separation in our data here. We’ll approach this similarly as we did with age and add some other variable into the mix to see if we can find this separation.

Alright, another interesting set of visuals. Again, we’re seeing that age and region don’t seem to play a row in BMIs relationship with medical charges. However, we again see that smoking status plays an additional role, now with BMI and charges. It appear that, if you are a non-smoker, an increase in BMI leads to only a slight increase in medical charges, but no by much. However, if you are a smoker, having an increase in BMI is leading to a LARGE increase in charges.

We actually see that those with very low BMIs, regardless of smoking or not, have pretty similar medical expenses. However, as your BMI increases when you’re also a smoker, your medical expenses increase significantly.

Just for curiousity sake, let’s see if there is some relationship between age and BMI.

Age and BMI

Doesn’t appear like there is much here. Maybe a slight increase in BMI as you age, but BMI tends to stay pretty constant through our age groups.

Let’s take one final look at how medical expenses are distributed across various levels of our data.

Violin Plots of Medical Charges Across Smoking Status, Sex, Region and Obesity

This seems to reiterate what we’ve seen, but again shows solid separation between smokers and non-smokers and their medical expenses, as well as a grouping of additional expenses amongst those who are classified as obese.

Alright, time to move onto our model creation and predictions!

Model Creation and Predictions

For this part, we’ll work on building the best model we can, and then use that model to predict medical charges. First step is working on our model!

Model Creation

Before creating our model, we’ll want to address some needed assumptions: no multicollinearity and multivariate normality.

Multicollinearity Check

In essence, we want our predictor variables to be highly correlated with our variable we want to predict (medical charges), but we don’t want those variables to be correlated with each other (known as multicollinearity).

There’s a quick way to check this in R. First, we’ll force a reduced linear model using all of our predictors. We can then find the Variance Inflation Factor (VIF) of each variable to determine if there is a severe level of correlation between them. We want a VIF less than 5 for each variable, and if that’s the case, we’re rockin’ and rollin’.

##      age      bmi      sex children   smoker   region  obesity 
## 1.015402 2.820392 1.009190 1.002482 1.006492 1.026674 2.770875

Woohoo! All VIF values are looking good, so we’ve met the assumption that there is no multicollinearity present in our model. Next, we’ll look into multivariate normality.

Multivariate Normality Check

We need to check if our residuals are normally distributed, otherwise known as Multivariate Normality. There’s a simple package in R that allows us to make a quick check.

##            Test       HZ p value MVN
## 1 Henze-Zirkler 6.999874       0  NO

Well, looks like (from the last column), that our data does not meet this assumption. We previously saw that the age variable wasn’t really normally distributed, so this may be our issue. Let’s take a look at the qqplots for our numeric data to see if we’re running into this issue for a specific variable.

## [1] 63 95

## [1] 1318 1048

## [1] 33 72

As expected, it seems like age is the culprit here. To deal with this, I’ll simply add a squared version of the variable to our data account for the lack of linearity present here. This isn’ the perfect solution, but should give us some additional help when it comes to predicting.

Alright, assumptions are dealt with, now time to move onto creating our model!

Reduced model

We already created a reduced moodel to check for multicollinearity, and we don’t need to remove any variables to continue with this model based on our assumptions being met. Now, let’s take a look at our reduced model and how well it’s doing.

## 
## Call:
## lm(formula = charges ~ age + bmi + sex + children + smoker + 
##     region + obesity, data = insurance_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11929.3  -3432.8   -137.3   1545.2  28332.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -9063.73    1351.71  -6.705 2.96e-11 ***
## age           257.46      11.77  21.877  < 2e-16 ***
## bmi           143.73      45.19   3.181 0.001503 ** 
## sex           161.07     329.57   0.489 0.625122    
## children      479.65     136.29   3.519 0.000447 ***
## smoker      23831.35     407.78  58.441  < 2e-16 ***
## region        332.86     150.48   2.212 0.027136 *  
## obesity      2877.29     546.96   5.261 1.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6000 on 1330 degrees of freedom
## Multiple R-squared:  0.7558, Adjusted R-squared:  0.7545 
## F-statistic: 588.1 on 7 and 1330 DF,  p-value: < 2.2e-16

This is pretty good! Even though we haven’t made any adjustments, we’re already getting an Adjusted R-Squared value of 0.7545, which indicated we’re capturing a pretty large portion of medical charges with this information. We see that sex is not really a super strong variable for modeling medical charges, and region is only slightly significant. However, age, BMI, number of children smoker status and obesity are very significant, and we’ll definitely want to consider them in a final model. Now, let’s add some interactions and our squared age term to see how the model performs.

Full Model

Given age was an issue in it’s residual normality, we’ll want to add a squared version of the variable to help with our assumptions. We also saw an interesting interaction between smoking and obesity in our visualizations, where being a smoker and obese led to much higher medical charges. To account for this interaction, we’ll add an additional variable to our model where we multiple smoking status and obesity status together so we can have a slope measure for when an individual is a smoker and obese simultaneously.

## 
## Call:
## glm(formula = charges ~ age + age2 + bmi + sex + children + smoker + 
##     region + obesity + obesity * smoker, data = insurance_clean)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -17300.1   -1656.4   -1262.0    -731.6   24128.9  
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1952.5862  1398.1047  -1.397 0.162769    
## age              -32.8368    59.7802  -0.549 0.582897    
## age2               3.7341     0.7458   5.007 6.27e-07 ***
## bmi              118.3823    33.4815   3.536 0.000421 ***
## sex              497.1827   244.1990   2.036 0.041951 *  
## children         680.1827   105.7319   6.433 1.74e-10 ***
## smoker         13400.4757   438.7671  30.541  < 2e-16 ***
## region           421.1681   111.4352   3.779 0.000164 ***
## obesity         -991.5070   421.5131  -2.352 0.018805 *  
## smoker:obesity 19807.3646   604.0016  32.794  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 19729354)
## 
##     Null deviance: 1.9607e+11  on 1337  degrees of freedom
## Residual deviance: 2.6201e+10  on 1328  degrees of freedom
## AIC: 26284
## 
## Number of Fisher Scoring iterations: 2

Comparing Models with ANOVA

Even though it seems like the full model is performing better, let’s be statistically sound and properlly compare the two models with an Analysis of Covariance (ANOVA) Chi-Square test. This test will help us determine if the added variables in the full model are significantly improve our reduced model.

## Analysis of Variance Table
## 
## Model 1: charges ~ age + bmi + sex + children + smoker + region + obesity
## Model 2: charges ~ age + age2 + bmi + sex + children + smoker + region + 
##     obesity + obesity * smoker
##   Res.Df        RSS Df  Sum of Sq  Pr(>Chi)    
## 1   1330 4.7878e+10                            
## 2   1328 2.6201e+10  2 2.1677e+10 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here we see a veeeery small p-value, which indicates that our full model is performing significantly better than our reduced model! Perfect, we can now move on to train our model to make some predictions.

Predicting Medical Expenses

To predict medical expenses, we’ll want to first split our data into “train” and “test” data. By doing this, we’ll be able to employ our model onto a train dataset, make predictions with that data, then test the accuracy of our predictions of our test dataset. This helps us ensure that the data we’re predicting isn’t coming straight from the data we build our model upon.

We’ll approach this training with a Cross Validation, where we’ll split up our training dataset into four parts, 3 to train with and 1 to test with. This will help make sure there isn’t a specific subset of the data we train with that varies significantly from the test data.

Training model

First we’ll split our data, then we’ll use our full model previously obtained to train our model.

## Linear Regression 
## 
## 1338 samples
##    7 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (4 fold) 
## Summary of sample sizes: 1006, 1002, 1002, 1004 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   4475.718  0.8631125  2456.319
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Alright, we obtained a pretty large R-Squared here! Looking like we have some pretty solid prediction potential, although not perfect. Now it’s time for predicting!

Predicting Medical Charges

First, we can use our full model to make predictions of medical charges, then visualize those predictions against our true medical charges. Then, we can investigate the accuracy of our predictions more precisely outside of simple visualizations.

Visualizing Predictions

Well, this is looking pretty good! We see that, particularly for lower expenses, we can predict charges fairly well. For middle-of-the-road expenses, it seems like we either very slightly over or largely under predict medical charges, and typically over-predict larger medical expenses. Overall, this seems to be fairly good at predicting specially lower charges.

Summary

So what have we learned? Well, we discovered that there are many important factors when it comes to predicting medical charges for an individual. An individual’s sex, age, bmi, smoking status, region, number of children and obesity status all seem to be important predictors of an individuals medical charges. We further found that, those who smoke simply spend much more on medical, on average. Coupling smoking with obesity creates one of the largest risk factors for large medical expenses than any other factor. Even though age and bmi were found to be correlated and important predictors of medical expenses, if you are a smoker, you will be spending much more on medical, especially if you are additionally obese.

Overall, we’ve build a model that help predict expenses fairly well, especially determining expenses of those who have lower medical charges. Potentially some additional information could help with the middle-range charges!