In this report, I will seek to find an answer to the question: “do men get charged significantly more for health insurance?” This report will utilize various methods of statistical analysis such as averages, simple and multiple linear regression, and looking at p-values and R² values to answer this question in a way that is statistically meaningful.
This data set contains 7 variables and 1,338 entries and comes from
kaggle.com. The variables that will be of import to this report are:
charges, the individual medical costs billed by health
insurance; age, the age of the individual;
smoker, whether or not the individual smokes;
sex, the sex of the individual; bmi, the body
mass index of the individual; region, what geographical
region of the United States the individual lives in; and
children, the number of children covered by health
insurance.
datatable(insurance, options = list(scrollX = TRUE))
To get a general comparison, first I want to break down average insurance cost by gender.
insurance %>%
filter(sex == "male") %>%
summarize(mean(charges))
insurance %>%
filter(sex == "female") %>%
summarize(mean(charges))
From this, we can see that males are charged $13,956.75 on average, while females are charged $12,569.58. However, a difference in sample means alone cannot tell us whether men are charged more than women for health insurance: it comes with no measure of statistical significance, and other factors could be at play.
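As a side note, both group averages can be computed in a single call rather than two separate pipelines. The sketch below uses base R's aggregate() on a small simulated data frame named toy. The name and its values are made up for illustration and are not the Kaggle data.

```r
# Simulated stand-in for the insurance data (illustrative values only,
# NOT the Kaggle data set).
set.seed(1)
toy <- data.frame(
  sex     = rep(c("female", "male"), each = 50),
  charges = c(rnorm(50, mean = 12500, sd = 3000),
              rnorm(50, mean = 14000, sd = 3000))
)

# One grouped summary: mean of charges for each level of sex.
group_means <- aggregate(charges ~ sex, data = toy, FUN = mean)
print(group_means)
```

Applied to the real data, the same call would be aggregate(charges ~ sex, data = insurance, FUN = mean).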
A simple linear regression of charges on sex tests whether this difference in average charges is statistically significant.
insurance_model <- lm(charges ~ sex, data = insurance)
summary(insurance_model)
##
## Call:
## lm(formula = charges ~ sex, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12835 -8435 -3980 3476 51201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12569.6 470.1 26.740 <2e-16 ***
## sexmale 1387.2 661.3 2.098 0.0361 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12090 on 1336 degrees of freedom
## Multiple R-squared: 0.003282, Adjusted R-squared: 0.002536
## F-statistic: 4.4 on 1 and 1336 DF, p-value: 0.03613
Because the p-value for sexmale (0.0361) is below the 0.05 cutoff, we can
say that the difference in average charges between men and women is
statistically significant. However, I would be hesitant to trust this
model because the adjusted R² value is 0.002536, meaning that sex alone
explains only about 0.25% of the variation in charges.
Further testing will need to be done to answer the question.
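It may help to see why this regression reproduces the averages computed earlier. With a single two-level predictor, the intercept is the mean of the baseline group and the slope is the difference between the two group means. That matches the output above: the intercept of 12569.6 is the female average, and 12569.6 + 1387.2 is approximately the male average of $13,956.75. The slope's p-value is also the same as that of a two-sample t-test with equal variances. The sketch below checks this on simulated data. The toy data frame and its values are made up and are not the Kaggle data.

```r
# Simulated stand-in data (illustrative values only).
set.seed(42)
toy <- data.frame(
  sex     = factor(rep(c("female", "male"), times = c(60, 70))),
  charges = c(rnorm(60, mean = 12500, sd = 4000),
              rnorm(70, mean = 14000, sd = 4000))
)

fit   <- lm(charges ~ sex, data = toy)
means <- tapply(toy$charges, toy$sex, mean)

# Intercept = mean of the baseline level ("female");
# slope = male mean minus female mean.
intercept  <- unname(coef(fit)["(Intercept)"])
slope      <- unname(coef(fit)["sexmale"])
diff_means <- unname(means["male"] - means["female"])

# The slope's p-value equals that of a pooled-variance two-sample t-test.
p_lm <- summary(fit)$coefficients["sexmale", "Pr(>|t|)"]
p_t  <- t.test(charges ~ sex, data = toy, var.equal = TRUE)$p.value

print(c(intercept = intercept, slope = slope, diff_means = diff_means))
print(c(p_lm = p_lm, p_t = p_t))
```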
Because men tend to be more at risk for developing a number of health
problems as they age, age may be a confounding variable in this data
set. Let’s factor age into the model.
insurance_model2 <- lm(charges ~ sex + age, data = insurance)
summary(insurance_model2)
##
## Call:
## lm(formula = charges ~ sex + age, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8821 -6947 -5511 5443 48203
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2343.62 994.35 2.357 0.0186 *
## sexmale 1538.83 631.08 2.438 0.0149 *
## age 258.87 22.47 11.523 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11540 on 1335 degrees of freedom
## Multiple R-squared: 0.09344, Adjusted R-squared: 0.09209
## F-statistic: 68.8 on 2 and 1335 DF, p-value: < 2.2e-16
Based on this new model, age is a very significant factor in price,
with a p-value well below the cutoff of 0.05. sex is also a significant
factor in this model, with a p-value of 0.0149. This means that if sex
had no effect on charges, a coefficient this far from zero would turn up
only about 1.5% of the time. So, age and sex can both be said to
be significant predictors of charges. However, I am still
not fully satisfied with this model, as the adjusted R² value is only
0.09209, meaning that only about 9.2% of the variation can be explained by
age and sex together. Something else is causing most of the variation,
so this model still cannot be fully trusted.
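The p-values and R² figures quoted from these summaries can also be pulled out of the summary object directly, which avoids copying numbers from the printout by hand. The sketch below does this on simulated data. The toy data frame and its coefficients are made up for illustration and are not the Kaggle data.

```r
# Simulated stand-in data (illustrative values only).
set.seed(7)
n <- 200
toy <- data.frame(
  sex = factor(sample(c("female", "male"), n, replace = TRUE)),
  age = sample(18:64, n, replace = TRUE)
)
toy$charges <- 2000 + 250 * toy$age + rnorm(n, sd = 5000)

fit <- lm(charges ~ sex + age, data = toy)
s   <- summary(fit)

# Coefficient table column holding the two-sided p-values.
p_values <- s$coefficients[, "Pr(>|t|)"]

# Multiple and adjusted R-squared are stored as separate components.
r2     <- s$r.squared
adj_r2 <- s$adj.r.squared

print(p_values)
print(c(r2 = r2, adj_r2 = adj_r2))
```

Keeping the two R² values in separate, named variables makes it explicit which one is being reported.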
Next, I will add bmi and smoker to the
model to see what roles they play in determining charges. I
predict that they have a very high impact, as both a person’s BMI and
whether they smoke have large implications for overall health.
insurance_model3 <- lm(charges ~ sex + age + bmi + smoker, data = insurance)
summary(insurance_model3)
##
## Call:
## lm(formula = charges ~ sex + age + bmi + smoker, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12364.7 -2972.2 -983.2 1475.8 29018.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11633.49 947.27 -12.281 <2e-16 ***
## sexmale -109.04 334.66 -0.326 0.745
## age 259.45 11.94 21.727 <2e-16 ***
## bmi 323.05 27.53 11.735 <2e-16 ***
## smokeryes 23833.87 414.19 57.544 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6094 on 1333 degrees of freedom
## Multiple R-squared: 0.7475, Adjusted R-squared: 0.7467
## F-statistic: 986.5 on 4 and 1333 DF, p-value: < 2.2e-16
In this model, we can see that being male is no longer a significant
predictor of charges: its p-value is now 0.745, and its estimated
coefficient has even changed sign (-109.04).
age, bmi, and smoker are all
significant predictors of charges. The new adjusted R²
value of 0.7467 indicates that the model is much stronger, as it can now
explain 74.67% of the variation in charges based on the
predictors. Based on these statistics, this new model can be trusted.
However, before I make my decision on the question, I would like to look
at several other variables to see if they play a role.
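The way the sex coefficient collapses once smoker enters the model is the typical signature of confounding. The simulation below is a minimal sketch of that mechanism under invented assumptions: charges depend only on smoking, and the simulated men smoke more often than the simulated women. The smoking rates and dollar amounts are made up and say nothing about the real data. A cross-tabulation such as table(insurance$sex, insurance$smoker) would show whether a similar imbalance exists there.

```r
# Simulated data: sex has NO direct effect on charges (by construction),
# but men are assumed to smoke more often (invented rates).
set.seed(123)
n   <- 2000
sex <- factor(sample(c("female", "male"), n, replace = TRUE))
smoker_flag <- rbinom(n, size = 1, prob = ifelse(sex == "male", 0.35, 0.15))
smoker  <- factor(ifelse(smoker_flag == 1, "yes", "no"))
charges <- 8000 + 24000 * smoker_flag + rnorm(n, sd = 3000)
toy <- data.frame(sex, smoker, charges)

# Without smoker, sex picks up the smoking effect.
unadjusted <- lm(charges ~ sex, data = toy)
# With smoker included, the sex coefficient shrinks toward zero.
adjusted   <- lm(charges ~ sex + smoker, data = toy)

sex_unadj <- unname(coef(unadjusted)["sexmale"])
sex_adj   <- unname(coef(adjusted)["sexmale"])
print(c(unadjusted = sex_unadj, adjusted = sex_adj))
```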
I will now add the remaining variables, region and children, to the model.
insurance_model4 <- lm(charges ~ sex + age + bmi + smoker + region + children,
data = insurance)
summary(insurance_model4)
##
## Call:
## lm(formula = charges ~ sex + age + bmi + smoker + region + children,
## data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11304.9 -2848.1 -982.1 1393.9 29992.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11938.5 987.8 -12.086 < 2e-16 ***
## sexmale -131.3 332.9 -0.394 0.693348
## age 256.9 11.9 21.587 < 2e-16 ***
## bmi 339.2 28.6 11.860 < 2e-16 ***
## smokeryes 23848.5 413.1 57.723 < 2e-16 ***
## regionnorthwest -353.0 476.3 -0.741 0.458769
## regionsoutheast -1035.0 478.7 -2.162 0.030782 *
## regionsouthwest -960.0 477.9 -2.009 0.044765 *
## children 475.5 137.8 3.451 0.000577 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7494
## F-statistic: 500.8 on 8 and 1329 DF, p-value: < 2.2e-16
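Based on this new model, children is a
significant factor with a p-value of 0.000577. The picture for region
is mixed: each region coefficient is a comparison against the omitted
baseline region (northeast in this data set), and only southeast and
southwest differ from it at the 0.05 level, while northwest does not
(p = 0.459). These individual p-values alone do not establish whether
region as a whole is significant.
sex still isn’t significant in this model (p = 0.693).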
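One way to judge a multi-level factor such as region as a whole, rather than one dummy coefficient at a time, is an F-test comparing the model with and without it. The sketch below shows the mechanics with anova() on simulated data in which region has no effect by construction. The toy data frame and its values are made up, so the resulting p-value says nothing about the real data.

```r
# Simulated stand-in data (illustrative values only); region is pure noise here.
set.seed(11)
n <- 400
toy <- data.frame(
  age    = sample(18:64, n, replace = TRUE),
  region = factor(sample(c("northeast", "northwest", "southeast", "southwest"),
                         n, replace = TRUE))
)
toy$charges <- 3000 + 250 * toy$age + rnorm(n, sd = 4000)

reduced <- lm(charges ~ age, data = toy)           # without region
full    <- lm(charges ~ age + region, data = toy)  # with region

# Joint F-test of all three region dummies at once.
cmp       <- anova(reduced, full)
region_df <- cmp[["Df"]][2]
region_p  <- cmp[["Pr(>F)"]][2]
print(cmp)
```

For the report's model, the analogous comparison would fit insurance_model4 once without region and pass both fits to anova().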
With these models, I have been able to show that yes, men are charged
significantly more for health insurance than women on average. However,
the reason that they are charged more is not their sex, as that is
insignificant when the confounding variables are taken into account. It
may be that men are simply more likely to smoke, to have a higher BMI,
to develop certain health conditions as they age, or to differ in any
number of other factors at play in overall health. For this reason, men
are charged significantly more due to factors outside of their sex.