This report seeks to answer the following question:
Does the health insurance data provide statistically significant evidence that men are charged significantly more for health insurance than women?
We will be using a data set called insurance, obtained
from https://www.kaggle.com/datasets/mirichoi0218/insurance.
This data set includes 7 variables and 1,338 entries. The variables
are age (age of the primary beneficiary), sex (insurance contractor
gender), bmi (body mass index), children (number of children covered
by the health insurance), smoker (whether the beneficiary smokes),
region (the beneficiary’s residential area in the U.S.), and charges
(individual medical costs billed by health insurance). The full data
set can be viewed below:
Throughout, we will need the functionality of the tidyverse and modelr packages, mainly to create the models themselves.
library(tidyverse)
library(modelr)
To identify whether men are charged significantly more than women for health insurance, we will first look at the average insurance charges for males and females.
insurance %>%
  group_by(sex) %>%
  summarize(mean(charges))
## # A tibble: 2 × 2
## sex `mean(charges)`
## <chr> <dbl>
## 1 female 12570.
## 2 male 13957.
Based on these averages alone, men appear to be charged more. However, a difference in sample means does not by itself establish statistical significance, so we fit a linear regression model to test it. This model is created as follows:
insurance_model <- lm(charges ~ sex, data = insurance)
summary(insurance_model)
##
## Call:
## lm(formula = charges ~ sex, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12835 -8435 -3980 3476 51201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12569.6 470.1 26.740 <2e-16 ***
## sexmale 1387.2 661.3 2.098 0.0361 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12090 on 1336 degrees of freedom
## Multiple R-squared: 0.003282, Adjusted R-squared: 0.002536
## F-statistic: 4.4 on 1 and 1336 DF, p-value: 0.03613
The coefficients of the regression model also indicate that men are charged more. The sexmale coefficient shows that, on average, men are charged $1,387.20 more than women, which matches the difference between the two group averages computed earlier.
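The link between the sexmale coefficient and the group averages can be checked on a small example. The sketch below uses made-up toy data (not the insurance data; the `toy` data frame and its values are assumptions for illustration only) to show that, with a single two-level predictor, the intercept is the mean of the reference group and the coefficient is the difference between the two group means.

```r
# Made-up toy data (NOT the insurance data): two groups with known means.
toy <- data.frame(
  sex     = c("female", "female", "female", "male", "male", "male"),
  charges = c(100, 120, 140, 150, 170, 190)
)

# Group means, computed directly.
female_mean <- mean(toy$charges[toy$sex == "female"])
male_mean   <- mean(toy$charges[toy$sex == "male"])

# Regression on the two-level predictor; "female" is the reference level
# because it comes first alphabetically.
toy_model <- lm(charges ~ sex, data = toy)
toy_coefs <- coef(toy_model)

# The intercept equals the female mean, and the sexmale coefficient
# equals the male mean minus the female mean.
intercept_matches <- isTRUE(all.equal(unname(toy_coefs["(Intercept)"]), female_mean))
slope_matches     <- isTRUE(all.equal(unname(toy_coefs["sexmale"]), male_mean - female_mean))

intercept_matches
slope_matches
```

The same relationship holds in the report: the intercept of 12569.6 is the female average, and adding 1387.2 gives approximately the male average of 13957.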
This model does show statistical significance: the p-value for the sexmale coefficient is 0.0361 (0.03613 for the model as a whole), which falls below the 5% cutoff. The residual standard error (RSE) is 12090 on 1336 degrees of freedom. The RSE does not bear on significance, but it indicates that the residuals are typically around $12,000 in size, which is large given that average charges are around $13,000. The adjusted R-squared is 0.002536, or 0.2536% as a percentage, meaning that only about 0.25% of the variance in charges is explained by the model.
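The quantities discussed here can be reproduced by hand from the printed summary. The following is a minimal sketch that assumes only the numbers shown in the output above (the variable names are chosen for illustration); it recomputes the t value, the two-sided p-value, and the adjusted R-squared.

```r
# Values copied from the summary() output of insurance_model.
estimate  <- 1387.2    # sexmale coefficient
std_error <- 661.3     # its standard error
resid_df  <- 1336      # residual degrees of freedom
r_squared <- 0.003282  # multiple R-squared
n_obs     <- 1338      # rows in the data set
n_pred    <- 1         # number of predictors in the model

# The t value is the estimate divided by its standard error.
t_value <- estimate / std_error

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(abs(t_value), df = resid_df, lower.tail = FALSE)

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_pred - 1)

round(c(t_value = t_value, p_value = p_value, adj_r_squared = adj_r_squared), 4)
```

Up to rounding, these should agree with the t value of 2.098, the p-value of 0.0361, and the adjusted R-squared of 0.002536 reported by summary().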
Although this model technically shows statistical significance, there are other variables within the data set that may affect charges. The adjusted R-squared is extremely low, so these other variables should be tested to find a more trustworthy model.
Now that we have identified a possible statistically significant association between gender and insurance charges, we can consider confounding variables that may affect these charges. In insurance, a person's age may affect the charges, as an older person may be more prone to health conditions or incur more frequent insurance charges than a younger person.
To begin this process, we will first include age
in the regression model as a predictor variable:
insurance_model2 <- lm(charges ~ sex + age, data = insurance)
summary(insurance_model2)
##
## Call:
## lm(formula = charges ~ sex + age, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8821 -6947 -5511 5443 48203
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2343.62 994.35 2.357 0.0186 *
## sexmale 1538.83 631.08 2.438 0.0149 *
## age 258.87 22.47 11.523 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11540 on 1335 degrees of freedom
## Multiple R-squared: 0.09344, Adjusted R-squared: 0.09209
## F-statistic: 68.8 on 2 and 1335 DF, p-value: < 2.2e-16
This new model shows some change in the coefficients: men are charged $1,538.83 more than women of the same age. In addition, the model shows that for every additional year of age a person is charged $258.87 more, regardless of gender.
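To make the interpretation of these coefficients concrete, the sketch below plugs the printed estimates into the regression equation. The helper `predict_charge` and the example age of 40 are hypothetical, introduced only for illustration; with the fitted model available, `predict()` would give the same kind of result.

```r
# Coefficients copied from the summary() output of insurance_model2.
intercept <- 2343.62
b_male    <- 1538.83
b_age     <- 258.87

# Hypothetical helper: predicted charge for a given sex and age.
# is_male is 1 for men and 0 for women (the reference level).
predict_charge <- function(is_male, age) {
  intercept + b_male * is_male + b_age * age
}

male_40   <- predict_charge(is_male = 1, age = 40)
female_40 <- predict_charge(is_male = 0, age = 40)
male_41   <- predict_charge(is_male = 1, age = 41)

male_40 - female_40  # same age, different sex: the sexmale coefficient
male_41 - male_40    # same sex, one year older: the age coefficient
```

Because the model has no interaction term, the gap between men and women is the same at every age, and the yearly increase is the same for both sexes.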
Both of these effects are significant, as the p-values for the sexmale
and age coefficients are 0.0149 and <2e-16 respectively, both
smaller than the 0.05 cutoff. However, the adjusted R-squared is still
low, 0.09209, which leads us to believe there are more confounding
variables in addition to age.
## Testing for Other Confounding Variables
Considering that age has shown a significant effect on the
charges that men and women receive for health insurance, it is fair to
look into the other variables contained within the data set.
Two variables in the data are directly related to a person's
health and would most likely affect how much they are
charged for health insurance. These variables are bmi and
smoker, which are added to the following model:
insurance_model3 <- lm(charges ~ sex + age + bmi + smoker, data = insurance)
summary(insurance_model3)
##
## Call:
## lm(formula = charges ~ sex + age + bmi + smoker, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12364.7 -2972.2 -983.2 1475.8 29018.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11633.49 947.27 -12.281 <2e-16 ***
## sexmale -109.04 334.66 -0.326 0.745
## age 259.45 11.94 21.727 <2e-16 ***
## bmi 323.05 27.53 11.735 <2e-16 ***
## smokeryes 23833.87 414.19 57.544 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6094 on 1333 degrees of freedom
## Multiple R-squared: 0.7475, Adjusted R-squared: 0.7467
## F-statistic: 986.5 on 4 and 1333 DF, p-value: < 2.2e-16
These additions have drastically changed the estimated difference between men and women; in fact, this model shows that men are charged $109.04 less than women with the same BMI, age, and smoking status. However, this difference is not statistically significant, as the p-value is 0.745, far above the 0.05 cutoff.
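Another way to see that this coefficient is indistinguishable from zero is to build a 95% confidence interval around it. The sketch below is an illustration that assumes only the estimate, standard error, and residual degrees of freedom printed in the output above; with the fitted model available, `confint(insurance_model3)` would be the usual route.

```r
# Values copied from the summary() output of insurance_model3.
estimate  <- -109.04  # sexmale coefficient
std_error <- 334.66   # its standard error
resid_df  <- 1333     # residual degrees of freedom

# Critical value of the t distribution for a two-sided 95% interval.
t_crit <- qt(0.975, df = resid_df)

# Interval: estimate plus or minus critical value times standard error.
ci <- c(lower = estimate - t_crit * std_error,
        upper = estimate + t_crit * std_error)

# TRUE when the interval spans zero, i.e. no significant difference at 5%.
ci_contains_zero <- ci[["lower"]] < 0 && ci[["upper"]] > 0

round(ci, 1)
ci_contains_zero
```

The interval runs from roughly $766 less to roughly $547 more for men, so it comfortably includes zero, consistent with the p-value of 0.745.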
This model also shows that each additional year of age increases the charge by $259.45, each one-unit increase in BMI increases the charge by $323.05, and smokers are charged $23,833.87 more than non-smokers. All three of these coefficients are statistically significant, with p-values <2e-16.
Overall, this model is more trustworthy than the previous models. Its overall p-value is < 2.2e-16, and the RSE is 6094 on 1333 degrees of freedom, roughly half that of the original model (12090). Lastly, the adjusted R-squared is 0.7467, meaning that 74.67% of the variance in charges is explained by the model, far more than in the previous models (0.25% and 9.2%).
Before final conclusions can be drawn it is important to test the last variables within the data set to confirm the strengths of our previous model and identify any other variables that need to be included.
By including children and region, we will
get a glimpse at how a person's household and local environment may affect their health
insurance charges.
insurance_model4 <- lm(charges ~ sex + age + smoker + bmi + children + region, data = insurance)
summary(insurance_model4)
##
## Call:
## lm(formula = charges ~ sex + age + smoker + bmi + children +
## region, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11304.9 -2848.1 -982.1 1393.9 29992.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11938.5 987.8 -12.086 < 2e-16 ***
## sexmale -131.3 332.9 -0.394 0.693348
## age 256.9 11.9 21.587 < 2e-16 ***
## smokeryes 23848.5 413.1 57.723 < 2e-16 ***
## bmi 339.2 28.6 11.860 < 2e-16 ***
## children 475.5 137.8 3.451 0.000577 ***
## regionnorthwest -353.0 476.3 -0.741 0.458769
## regionsoutheast -1035.0 478.7 -2.162 0.030782 *
## regionsouthwest -960.0 477.9 -2.009 0.044765 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7494
## F-statistic: 500.8 on 8 and 1329 DF, p-value: < 2.2e-16
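When these variables are controlled for, it can be seen that
they add little to the model: the adjusted R-squared rises only from
0.7467 to 0.7494, and the sexmale coefficient (-131.3, p-value 0.693)
remains non-significant. The children variable increases the charge by
$475.50 per additional child and is statistically significant. The
region coefficients are measured relative to the reference region,
northeast, which is omitted from the output. Northwest would decrease
the charge by $353.00, but its p-value (0.459) is over the 0.05 cutoff.
Southeast decreases the charge by $1,035.00 and southwest decreases it
by $960.00; both are statistically significant, though only narrowly
(p-values 0.031 and 0.045).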
Testing these last two variables shows that excluding both of them has minimal impact on the model: the adjusted R-squared changes only from 0.7494 to 0.7467, and the sex coefficient remains small and non-significant either way.
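A complementary check is a partial F-test comparing the model with and without children and region. The sketch below approximates that test from the multiple R-squared values printed in the two summaries, so the result is only as precise as those rounded figures; with the fitted models available, `anova(insurance_model3, insurance_model4)` would perform the exact comparison. The variable names are chosen for illustration.

```r
# Values copied from the two summary() outputs.
r2_reduced <- 0.7475  # sex + age + bmi + smoker
r2_full    <- 0.7509  # reduced model plus children and region
n_added    <- 4       # children plus three region indicator terms
df_full    <- 1329    # residual degrees of freedom of the full model

# Partial F statistic: gain in explained variance per added term,
# relative to the unexplained variance per residual degree of freedom.
f_stat <- ((r2_full - r2_reduced) / n_added) / ((1 - r2_full) / df_full)

# Upper-tail probability from the F distribution.
p_value <- pf(f_stat, df1 = n_added, df2 = df_full, lower.tail = FALSE)

c(f_stat = f_stat, p_value = p_value)
```

Whether the added terms are jointly significant is a separate question from how much they change the fit or the sex coefficient, so this test is best read alongside the adjusted R-squared comparison rather than instead of it.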
Now that all variables have been tested and analyzed, the final regression model can be created and the final conclusion can be drawn.
final_model <- lm(charges ~ sex + age + bmi + smoker, data = insurance)
summary(final_model)
##
## Call:
## lm(formula = charges ~ sex + age + bmi + smoker, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12364.7 -2972.2 -983.2 1475.8 29018.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11633.49 947.27 -12.281 <2e-16 ***
## sexmale -109.04 334.66 -0.326 0.745
## age 259.45 11.94 21.727 <2e-16 ***
## bmi 323.05 27.53 11.735 <2e-16 ***
## smokeryes 23833.87 414.19 57.544 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6094 on 1333 degrees of freedom
## Multiple R-squared: 0.7475, Adjusted R-squared: 0.7467
## F-statistic: 986.5 on 4 and 1333 DF, p-value: < 2.2e-16
In summary, there is not enough evidence to support the claim that men are charged more than women for health insurance. In the final model, the coefficient for males is -109.04, meaning men would be charged $109.04 less than women of the same age, BMI, and smoking status. However, the p-value of this coefficient is 0.745, much larger than the 0.05 cutoff, so the difference is not statistically significant. The higher average charges for men seen at the outset disappear once age, BMI, and smoking status are controlled for. The strength of this model is reflected in its RSE of 6094 on 1333 degrees of freedom, about half that of the first model. The adjusted R-squared is 0.7467, meaning that 74.67% of the variance in charges is explained by the model. Finally, the p-value of the model is < 2.2e-16, making the model as a whole statistically significant.