This report seeks to answer the following question:
Does the provided health insurance data set provide evidence that men are charged significantly more for health insurance than women?
We will be using a data set called insurance, obtained
from https://drive.google.com/file/d/13l0g-9CisnolO8f7tNZuLj1-VJqrEpZw/view?usp=sharing.
It contains per-member data for 1,338 members of the insurance plan, all
of whom are over the age of 18. There are 7 variables for each member, and
all of them are relevant in this report. They are as follows:
sex (the biological sex of the member),
charges (the dollar amount the insurance company charged
each individual member), bmi (stands for “body
mass index”, a measure of the member’s weight relative to height),
smoker (whether the member smokes cigarettes or not),
children (how many children each member has),
region (the region that the member lives in), and
age (how old the member is in years). The full data set can
be viewed below:
Throughout, we will need the functionality of the tidyverse package, mainly to create visualizations. We will also need the modelr package, for linear regression purposes.
library(tidyverse)
library(modelr)
Throughout this project, I will be looking at the effect that sex has on average health insurance charges. To begin, I want to look at the surface level of this topic by simply comparing the average charges for male and female members.
average_charges_gender <- insurance %>%
  group_by(sex) %>%
  summarize(mean_charges = mean(charges, na.rm = TRUE),
            count = n())
average_charges_gender
## # A tibble: 2 × 3
##   sex    mean_charges count
##   <chr>         <dbl> <int>
## 1 female       12570.   662
## 2 male         13957.   676
From this comparison we see that the mean charge for men is nearly $1,400 higher than for women ($13,957 versus $12,570). It is fair to conclude that, on average, the men in this data set are charged more than the women. But it is not fair to conclude that men are charged more than women because they are men. Instead, I must look at potential confounding variables that may explain why the men’s mean charges are higher.
In order to see if the difference between men’s and women’s insurance charges is significant, I will fit a linear model with sex as the predictor (x) and insurance charges as the response (y). I can then check the p-value on sex to see if the difference is significant.
insurance_model <- lm(charges ~ sex, data = insurance)
summary(insurance_model)
##
## Call:
## lm(formula = charges ~ sex, data = insurance)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -12835  -8435  -3980   3476  51201
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  12569.6      470.1  26.740   <2e-16 ***
## sexmale       1387.2      661.3   2.098   0.0361 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12090 on 1336 degrees of freedom
## Multiple R-squared: 0.003282, Adjusted R-squared: 0.002536
## F-statistic: 4.4 on 1 and 1336 DF, p-value: 0.03613
After running the model, we see that the p-value for sex is .0361, which is less than .05. Therefore, we can reject the null hypothesis of no difference, and say that we do have significant evidence of a difference in average insurance charges between men and women.
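With a single two-level predictor, this regression does the same job as a classic two-sample comparison: the sexmale estimate (1387.2) is the difference between the two group means in the earlier table (13957 - 12570), and its p-value should agree with a pooled-variance t-test. The sketch below checks that relationship on made-up data. The `toy` data frame, its group sizes, and its means are assumptions for illustration only, not values taken from the insurance data set, and the code sticks to base R so it runs without extra packages.

```r
# Toy data (hypothetical values, NOT the insurance data set).
set.seed(1)
toy <- data.frame(
  sex = rep(c("female", "male"), times = c(40, 50)),
  charges = c(rnorm(40, mean = 12500, sd = 3000),
              rnorm(50, mean = 14000, sd = 3000))
)

# Simple regression on a two-level predictor.
toy_model <- lm(charges ~ sex, data = toy)
slope <- unname(coef(toy_model)["sexmale"])
slope_p <- summary(toy_model)$coefficients["sexmale", "Pr(>|t|)"]

# Difference in group means, computed directly.
group_means <- tapply(toy$charges, toy$sex, mean)
mean_diff <- unname(group_means["male"] - group_means["female"])

# Two-sample t-test with pooled variance (var.equal = TRUE).
pooled_test <- t.test(charges ~ sex, data = toy, var.equal = TRUE)

# Both comparisons should report TRUE.
all.equal(slope, mean_diff)
all.equal(slope_p, pooled_test$p.value)
```

Seen this way, the first model adds nothing beyond the table of means except a significance test, which is why the later models with more predictors are needed.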
As stated earlier, though, there is a reason to distrust my conclusion from the previous model: it leaves out potential confounding variables. For example, age could be a confounding variable that has a significant effect on charges, and it may explain more of the variation in charges than sex does. It is possible that in our study the men are older than the women, which leads them to have more health complications and therefore higher insurance charges. If this were the case, the high insurance charges would have more to do with age than with the sex of the individual.
Now I’m going to run my model while controlling for age. Because R^2 never decreases when a variable is added, only a clearly higher R^2 (and adjusted R^2) will indicate that our model has improved. Also, we want to look at the p-values to see whether sex is still significant once age is in the model, and how it compares to age.
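One quick way to probe the "men are older" idea before fitting anything is to compare mean age by sex. The sketch below does this in base R on a few hypothetical rows that only borrow the column names `sex` and `age`; the values themselves are made up.

```r
# Hypothetical rows with the same column names as the insurance data.
toy <- data.frame(
  sex = c("female", "female", "female", "male", "male", "male"),
  age = c(23, 41, 58, 19, 37, 64)
)

# Mean age per sex. Similar means would weaken the "men are older" story.
age_by_sex <- aggregate(age ~ sex, data = toy, FUN = mean)
age_by_sex
```

If the two means come out close on the real data, age is unlikely to explain much of the gap between men and women, even though it may still explain a lot of the variation in charges overall.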
insurance_model <- lm(charges ~ sex + age, data = insurance)
summary(insurance_model)
##
## Call:
## lm(formula = charges ~ sex + age, data = insurance)
##
## Residuals:
##   Min    1Q Median    3Q   Max
## -8821 -6947  -5511  5443 48203
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  2343.62     994.35   2.357   0.0186 *
## sexmale      1538.83     631.08   2.438   0.0149 *
## age           258.87      22.47  11.523   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11540 on 1335 degrees of freedom
## Multiple R-squared: 0.09344, Adjusted R-squared: 0.09209
## F-statistic: 68.8 on 2 and 1335 DF, p-value: < 2.2e-16
The R^2 value went from .003 to .09, which is an improvement, and indicates that our model got stronger after controlling for age. Essentially this tells us that about 9% of the variation in charges can be explained by the variables we are using. As for the p-values, we see that sex and age are both less than .05, so we should still include both in our model. But the p-value for age is much smaller, which suggests that the evidence for an age effect is much stronger than the evidence for a sex effect. Answering our initial question, even when controlling for age, there is still a significant difference between men and women, because the p-value of .0149 is still less than .05. In fact, the estimated difference grew slightly, from 1387.2 to 1538.83, so age does not explain the gap.
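To make the reading of R^2 as a share of variation concrete, the sketch below recomputes it by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares, and compares the result with what `summary()` reports. The data are simulated with assumed coefficients and noise, so the number it prints has nothing to do with the insurance results.

```r
# Simulated data (assumed coefficients and noise level, for illustration).
set.seed(2)
n <- 200
toy <- data.frame(age = sample(18:64, n, replace = TRUE))
toy$charges <- 2000 + 250 * toy$age + rnorm(n, sd = 11000)

toy_model <- lm(charges ~ age, data = toy)

# R^2 = 1 - (residual sum of squares) / (total sum of squares).
ss_res <- sum(residuals(toy_model)^2)
ss_tot <- sum((toy$charges - mean(toy$charges))^2)
r2_manual <- 1 - ss_res / ss_tot

# Should be TRUE: the manual value matches the one from summary().
all.equal(r2_manual, summary(toy_model)$r.squared)

# R^2 is a proportion, so multiply by 100 to state it as a percentage.
sprintf("%.1f%% of the variation explained", 100 * r2_manual)
```

Applied to the output above, an R^2 of 0.09344 corresponds to roughly 9.3% of the variation in charges.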
There is still reason to distrust the comparison from my previous
model because there are still variables that I could potentially be
leaving out. Specifically, there are two variables in the data set that
could be used to describe a person’s health. Those two variables are
bmi and smoker. I am now going to add those
two variables to the model and see what my new results are.
insurance_model <- lm(charges ~ sex + age + bmi + smoker, data = insurance)
summary(insurance_model)
##
## Call:
## lm(formula = charges ~ sex + age + bmi + smoker, data = insurance)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -12364.7  -2972.2   -983.2   1475.8  29018.3
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11633.49     947.27 -12.281   <2e-16 ***
## sexmale       -109.04     334.66  -0.326    0.745
## age            259.45      11.94  21.727   <2e-16 ***
## bmi            323.05      27.53  11.735   <2e-16 ***
## smokeryes    23833.87     414.19  57.544   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6094 on 1333 degrees of freedom
## Multiple R-squared: 0.7475, Adjusted R-squared: 0.7467
## F-statistic: 986.5 on 4 and 1333 DF, p-value: < 2.2e-16
Adding those two variables to the model was clearly worthwhile. The R^2 shot up to .7475. The p-values for every variable except sex are below 2e-16, which is far less than .05. And when including these variables, we finally see that the observed difference in charges between men and women is not significant: the estimate for sex is now slightly negative (-109.04), and its p-value is .745, which is far larger than .05. With this finding, we can conclude that the sex variable could be removed from our model.
I would argue that the comparison made in the previous model should be trusted, because it included all of the health variables that would have the biggest impact on average insurance charges. However, there are still two more variables in the data set, children and region, so I am going to make a model including these two variables to see if they should be used or not.
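A sex effect that disappears once another variable enters the model is what confounding looks like. The simulation below is a minimal sketch of one way that can happen, under assumptions I chose for illustration: smoking is more common among men (30% versus 15%), smoking raises charges, and sex has no direct effect at all. None of these rates or dollar amounts come from the insurance data set, and the sketch does not show that smoking is the actual explanation there.

```r
# Simulated population; every number here is an assumption.
set.seed(3)
n <- 5000
sex <- sample(c("female", "male"), n, replace = TRUE)

# Assumption: smoking is more common among men.
smoker <- ifelse(runif(n) < ifelse(sex == "male", 0.30, 0.15), "yes", "no")

# Charges depend on smoking only; sex has no direct effect here.
charges <- 8000 + 24000 * (smoker == "yes") + rnorm(n, sd = 6000)
sim <- data.frame(sex, smoker, charges)

# Sex coefficient without and with the confounder in the model.
unadjusted <- coef(lm(charges ~ sex, data = sim))["sexmale"]
adjusted <- coef(lm(charges ~ sex + smoker, data = sim))["sexmale"]
c(unadjusted = unname(unadjusted), adjusted = unname(adjusted))
```

The unadjusted coefficient picks up the smoking gap between the groups, while the adjusted one should sit near zero. A table such as `table(insurance$sex, insurance$smoker)` would be one way to see whether the real data show a similar imbalance.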
insurance_model <- lm(charges ~ sex + age + bmi + smoker + children + region, data = insurance)
summary(insurance_model)
##
## Call:
## lm(formula = charges ~ sex + age + bmi + smoker + children +
## region, data = insurance)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -11304.9  -2848.1   -982.1   1393.9  29992.8
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)     -11938.5      987.8 -12.086  < 2e-16 ***
## sexmale           -131.3      332.9  -0.394 0.693348
## age                256.9       11.9  21.587  < 2e-16 ***
## bmi                339.2       28.6  11.860  < 2e-16 ***
## smokeryes        23848.5      413.1  57.723  < 2e-16 ***
## children           475.5      137.8   3.451 0.000577 ***
## regionnorthwest   -353.0      476.3  -0.741 0.458769
## regionsoutheast  -1035.0      478.7  -2.162 0.030782 *
## regionsouthwest   -960.0      477.9  -2.009 0.044765 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7494
## F-statistic: 500.8 on 8 and 1329 DF, p-value: < 2.2e-16
After including children and region, it is clear that children should be included, because its p-value (.000577) is much less than .05. I would include region as well, because two of the three region coefficients have a p-value of less than .05; each of these compares one region to the omitted baseline region. Region could reasonably be left out of the model too, though, because it has a very small impact compared to the other variables: R^2 only rose from .7475 to .7509 after adding both children and region. The sex coefficient remains non-significant (p = .693).
Now that I have made all of these different models and have investigated which variables have the most significant impact on average insurance charges, I can answer the main question: do I have evidence that men are charged significantly more for health insurance than women? At first glance it appeared that they were. The p-value on sex stayed below .05 even after controlling for age, but once I added bmi and smoker it rose to .745, and it stayed large (.693) in the full model. After accounting for these variables, we do not have significant evidence that men are charged more than women for health insurance.
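Because region enters the model as several coefficients at once, judging it by counting individual p-values is a little awkward. One common alternative is an F-test that compares the model with and without region as a whole, which `anova()` performs for nested `lm` fits. The sketch below shows the mechanics on simulated data in which region has no effect by construction; the region labels, sample size, and coefficients are all assumptions for illustration.

```r
# Simulated data; region labels and effect sizes are assumptions.
set.seed(4)
n <- 400
toy <- data.frame(
  age = sample(18:64, n, replace = TRUE),
  region = sample(c("northeast", "northwest", "southeast", "southwest"),
                  n, replace = TRUE)
)
# Region has no effect on charges in this simulation.
toy$charges <- 3000 + 250 * toy$age + rnorm(n, sd = 5000)

# Nested models: identical except for the region term.
without_region <- lm(charges ~ age, data = toy)
with_region <- lm(charges ~ age + region, data = toy)

# One F-test for all region coefficients together.
region_test <- anova(without_region, with_region)
region_test

# p-value for "region adds nothing", taken from the second row.
region_p <- region_test[["Pr(>F)"]][2]
region_p
```

On the insurance data, the same comparison could be made between the full model and a version fitted without region, giving a single p-value for region rather than three.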
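A p-value says whether an effect can be distinguished from zero, but not how large it might plausibly be. As a rough supplement, the sketch below builds an approximate 95% confidence interval for the sex coefficient using only the numbers printed in the full model's summary (estimate -131.3, standard error 332.9, 1329 residual degrees of freedom). The variable names are my own, and the interval relies on the usual linear model assumptions.

```r
# Values copied from the full model's summary output.
estimate <- -131.3
std_error <- 332.9
resid_df <- 1329

# 95% interval: estimate +/- t critical value * standard error.
t_crit <- qt(0.975, df = resid_df)
ci <- estimate + c(-1, 1) * t_crit * std_error
round(ci)
```

The interval runs from roughly -$780 to $520, so it comfortably includes zero, which agrees with the conclusion that the data do not show men being charged more once the other variables are held fixed. With the fitted model available, `confint(insurance_model, "sexmale")` should give the same interval directly.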