This report will answer the following question:
Are men charged significantly more for health insurance than women?
We will be using a data set titled insurance, obtained
from an imported data table. Within this data set, there are a total of
1338 observations across 7 variables; the relevant variables to this
report are sex (gender of insurance recipient),
charges (individual medical costs billed by health
insurance), age (age of primary beneficiary),
bmi (Body Mass Index of primary beneficiary), and
smoker (does the primary beneficiary smoke?). The full data
set can be viewed below:
Throughout, the tidyverse and modelr
packages will be used to manipulate the data set.
library(tidyverse)
library(modelr)
To determine if men are charged more than women, we can compare the mean insurance charges for men with the mean insurance charges for women. The code chunk below computes the average insurance charges by the sex of the primary beneficiary.
# Average charges for each sex; the summary column is named explicitly
mean_charge <- insurance %>%
  group_by(sex) %>%
  summarise(mean_charge = mean(charges))
mean_charge
Comparing only the average insurance charges, men in this data set are charged more than women: on average, men are charged $1387.17 more than women.
In order to determine if the difference in charges can be considered significant, we will use the F test from a model that relates the insurance charges to the sex of the primary beneficiary.
charge_model <- lm(charges ~ sex, data = insurance)
summary(charge_model)
##
## Call:
## lm(formula = charges ~ sex, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12835 -8435 -3980 3476 51201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12569.6 470.1 26.740 <2e-16 ***
## sexmale 1387.2 661.3 2.098 0.0361 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12090 on 1336 degrees of freedom
## Multiple R-squared: 0.003282, Adjusted R-squared: 0.002536
## F-statistic: 4.4 on 1 and 1336 DF, p-value: 0.03613
Since the p-value of the model (0.03613) is less than 0.05, we can conclude that the difference in average insurance charges between men and women is statistically significant. From this model alone, we would state that men are charged more than women.
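As a cross-check on this test: a regression on a single two-level predictor is equivalent to a pooled-variance two-sample t-test, so the F statistic equals the squared t statistic and the p-values agree (in the output above, 2.098 squared is about 4.40). The sketch below illustrates this on a made-up data frame named toy, which is an assumption for illustration only and not the insurance data.

```r
# Made-up stand-in data (hypothetical values, NOT the insurance data set)
set.seed(1)
toy <- data.frame(
  sex     = rep(c("female", "male"), each = 50),
  charges = c(rnorm(50, mean = 12500, sd = 3000),
              rnorm(50, mean = 14000, sd = 3000))
)

# Regression of charges on sex, as in the report
fit    <- lm(charges ~ sex, data = toy)
f_stat <- summary(fit)$fstatistic[["value"]]
t_stat <- summary(fit)$coefficients["sexmale", "t value"]
p_lm   <- summary(fit)$coefficients["sexmale", "Pr(>|t|)"]

# Pooled-variance two-sample t-test on the same data
tt <- t.test(charges ~ sex, data = toy, var.equal = TRUE)

# The slope is the difference in group means (male minus female)
group_means <- tapply(toy$charges, toy$sex, mean)
mean_diff   <- group_means[["male"]] - group_means[["female"]]

all.equal(f_stat, t_stat^2)               # F equals t squared
all.equal(unname(tt$p.value), p_lm)       # same p-value
all.equal(coef(fit)[["sexmale"]], mean_diff)
```

On the report's data, the same calls would use insurance in place of toy.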
However, the analysis above is only based on the sex and the insurance charges. We cannot fully trust this answer, as we did not account for the possible confounding variables in the data set.
A possible confounding variable is age, since as people get older, they often need more medical care, which causes the insurance charges to increase. If the men in the data set were older on average than the women, age rather than sex could explain part of the difference in charges.
charge_mult_model <- lm(charges ~ sex + age, data = insurance)
summary(charge_mult_model)
##
## Call:
## lm(formula = charges ~ sex + age, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8821 -6947 -5511 5443 48203
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2343.62 994.35 2.357 0.0186 *
## sexmale 1538.83 631.08 2.438 0.0149 *
## age 258.87 22.47 11.523 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11540 on 1335 degrees of freedom
## Multiple R-squared: 0.09344, Adjusted R-squared: 0.09209
## F-statistic: 68.8 on 2 and 1335 DF, p-value: < 2.2e-16
Controlling for age does not remove the gap. The sexmale coefficient is 1538.83 (p = 0.0149), so among beneficiaries of the same age, men are still estimated to be charged about $1,539 more than women. Charges also rise with age, by about $259 per year. Both coefficients are significant at the 0.05 level. Note that the overall p-value assesses the model as a whole, not the difference between men and women specifically.
We cannot fully trust this model either: its R-squared of 0.093 means it explains only about 9% of the variation in charges, and there are still other variables that could be confounding the relationship.
Two other possible confounding variables are the BMI of the beneficiary and whether they are a smoker. We will add these variables to the model to determine if they also affect the insurance charges.
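For a variable to confound the comparison, it has to differ between men and women as well as affect charges. One way to look for this is to summarise each candidate variable by sex. The sketch below does so in base R on a made-up data frame named toy (hypothetical values, not the insurance data); the column names mirror those described in this report.

```r
# Made-up stand-in data (hypothetical values, NOT the insurance data set)
set.seed(42)
n <- 200
toy <- data.frame(
  sex = sample(c("female", "male"), n, replace = TRUE),
  age = sample(18:64, n, replace = TRUE),
  bmi = round(rnorm(n, mean = 30, sd = 6), 1)
)
# In this toy data, men are made slightly more likely to smoke
p_smoke    <- ifelse(toy$sex == "male", 0.25, 0.15)
toy$smoker <- ifelse(runif(n) < p_smoke, "yes", "no")

# Mean age and BMI for each sex
means_by_sex <- aggregate(cbind(age, bmi) ~ sex, data = toy, FUN = mean)
means_by_sex

# Share of smokers and non-smokers within each sex (rows sum to 1)
smoke_by_sex <- prop.table(table(toy$sex, toy$smoker), margin = 1)
smoke_by_sex
```

A clear imbalance in one of these summaries would mark that variable as worth adding to the model.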
charge_mult_model_2 <- lm(charges ~ sex + age + bmi + smoker, data = insurance)
summary(charge_mult_model_2)
##
## Call:
## lm(formula = charges ~ sex + age + bmi + smoker, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12364.7 -2972.2 -983.2 1475.8 29018.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11633.49 947.27 -12.281 <2e-16 ***
## sexmale -109.04 334.66 -0.326 0.745
## age 259.45 11.94 21.727 <2e-16 ***
## bmi 323.05 27.53 11.735 <2e-16 ***
## smokeryes 23833.87 414.19 57.544 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6094 on 1333 degrees of freedom
## Multiple R-squared: 0.7475, Adjusted R-squared: 0.7467
## F-statistic: 986.5 on 4 and 1333 DF, p-value: < 2.2e-16
This model shows that charges increase with BMI (by about $323 per unit) and are far higher for smokers (by about $23,834). Once these variables are controlled for, the sexmale coefficient becomes negative (-109.04) and is no longer significant (p = 0.745), so this model gives no evidence that men are charged more than women. The small overall p-value reflects the model as a whole, not the difference between men and women.
This model is more trustworthy, since it explains about 75% of the variation in charges (R-squared of 0.7475), but some variables have still not been controlled for. We should test the remaining possible confounding variables before coming to a conclusion.
The last two possible confounding variables are the number of children the primary beneficiary has and the region in which they live. The code below adds each of these variables to the model separately.
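To make the coefficients concrete, the sketch below plugs the estimates reported in the output above into the fitted equation. The example beneficiary (age 40, BMI 30) and the helper predict_charge are hypothetical choices made for illustration.

```r
# Coefficient estimates copied from the model output above
coefs <- c(intercept = -11633.49, sexmale = -109.04, age = 259.45,
           bmi = 323.05, smokeryes = 23833.87)

# Hypothetical helper: predicted charge from the fitted linear equation.
# male and smoker are 0/1 indicators.
predict_charge <- function(male, age, bmi, smoker) {
  coefs[["intercept"]] +
    coefs[["sexmale"]]   * male +
    coefs[["age"]]       * age +
    coefs[["bmi"]]       * bmi +
    coefs[["smokeryes"]] * smoker
}

# Hypothetical beneficiary: age 40, BMI 30
female_nonsmoker <- predict_charge(male = 0, age = 40, bmi = 30, smoker = 0)
male_nonsmoker   <- predict_charge(male = 1, age = 40, bmi = 30, smoker = 0)
female_smoker    <- predict_charge(male = 0, age = 40, bmi = 30, smoker = 1)

male_nonsmoker - female_nonsmoker   # effect of sex, other variables held fixed
female_smoker - female_nonsmoker    # effect of smoking, other variables held fixed
```

Holding the other variables fixed, switching sex moves the prediction by the sexmale coefficient only, which is small next to the smoking coefficient.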
charge_children_model <- lm(charges ~ sex + age + bmi + smoker + children, data = insurance)
summary(charge_children_model)
##
## Call:
## lm(formula = charges ~ sex + age + bmi + smoker + children, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11837.2 -2916.7 -994.2 1375.3 29565.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12052.46 951.26 -12.670 < 2e-16 ***
## sexmale -128.64 333.36 -0.386 0.699641
## age 257.73 11.90 21.651 < 2e-16 ***
## bmi 322.36 27.42 11.757 < 2e-16 ***
## smokeryes 23823.39 412.52 57.750 < 2e-16 ***
## children 474.41 137.86 3.441 0.000597 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6070 on 1332 degrees of freedom
## Multiple R-squared: 0.7497, Adjusted R-squared: 0.7488
## F-statistic: 798 on 5 and 1332 DF, p-value: < 2.2e-16
charge_region_model <- lm(charges ~ sex + age + bmi + smoker + region, data = insurance)
summary(charge_region_model)
##
## Call:
## lm(formula = charges ~ sex + age + bmi + smoker + region, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11852.6 -3010.9 -987.8 1515.8 29467.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11556.96 985.63 -11.725 <2e-16 ***
## sexmale -111.57 334.26 -0.334 0.7386
## age 258.54 11.94 21.658 <2e-16 ***
## bmi 340.46 28.71 11.857 <2e-16 ***
## smokeryes 23862.91 414.82 57.526 <2e-16 ***
## regionnorthwest -304.10 478.01 -0.636 0.5248
## regionsoutheast -1039.20 480.65 -2.162 0.0308 *
## regionsouthwest -916.44 479.72 -1.910 0.0563 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6087 on 1330 degrees of freedom
## Multiple R-squared: 0.7487, Adjusted R-squared: 0.7474
## F-statistic: 566 on 7 and 1330 DF, p-value: < 2.2e-16
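The number of children has a significant effect on the charges (p = 0.000597). For region, only the southeast coefficient is significant at the 0.05 level, and the region model has a lower adjusted R-squared (0.7474) than the model involving children (0.7488), so region can be left out. In both models, the sexmale coefficient remains negative and is not significant.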
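Because region enters the model as three separate coefficients, a single test of whether region as a whole improves the fit can be more informative than reading each coefficient on its own. A partial F test with anova() compares two nested models for this purpose. The sketch below is a minimal illustration on a made-up data frame named toy (hypothetical values, not the insurance data) with a reduced set of predictors.

```r
# Made-up stand-in data (hypothetical values, NOT the insurance data set)
set.seed(7)
n <- 300
toy <- data.frame(
  age    = sample(18:64, n, replace = TRUE),
  smoker = sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2)),
  region = sample(c("northeast", "northwest", "southeast", "southwest"),
                  n, replace = TRUE)
)
# Charges here depend on age and smoking only, plus noise
toy$charges <- 250 * toy$age + 24000 * (toy$smoker == "yes") +
  rnorm(n, mean = 0, sd = 6000)

# Nested pair: the second model adds region to the first
base_fit   <- lm(charges ~ age + smoker, data = toy)
region_fit <- lm(charges ~ age + smoker + region, data = toy)

# Partial F test: does region as a whole (3 coefficients) improve the fit?
region_test <- anova(base_fit, region_fit)
region_test

# Adjusted R-squared for each model, side by side
adj_r2 <- c(base   = summary(base_fit)$adj.r.squared,
            region = summary(region_fit)$adj.r.squared)
adj_r2
```

On the report's data, the analogous comparison would be anova(charge_mult_model_2, charge_region_model), since the first model is nested in the second.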
The data set does not provide enough evidence to suggest that men are charged significantly more for health insurance than women. In every model that controls for BMI and smoking status, the p-value for the sexmale coefficient is well above 0.05 (0.745, 0.700, and 0.739), so the difference between men and women is not statistically significant.
From the multiple models created and analysed, we can conclude that men are not charged significantly more for health insurance than women. Although the average charges for men are higher in the raw data, that gap disappears once age, BMI, smoking status, and number of children are included in the model, which suggests that men are not charged more simply because they are male. The higher raw average for men is instead accounted for by the other variables in the model.
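A confidence interval gives the same message as the p-value in a more readable form. The sketch below computes an approximate 95% interval for the sexmale coefficient from the estimate, standard error, and residual degrees of freedom reported for the model involving children; on the fitted model object, confint(charge_children_model, "sexmale") would be the direct route.

```r
# Values copied from the output of the model involving children
est <- -128.64   # sexmale estimate
se  <- 333.36    # its standard error
df  <- 1332      # residual degrees of freedom

# 95% confidence interval: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df)
ci     <- est + c(-1, 1) * t_crit * se
ci

# TRUE when the interval contains zero, i.e. no significant difference at 0.05
contains_zero <- ci[1] < 0 && ci[2] > 0
contains_zero
```

The interval runs from roughly -$780 to $525, so it comfortably includes zero.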