knitr::opts_chunk$set(echo = TRUE, message=FALSE, warning=FALSE)
This report seeks to answer the following question:
Does the insurance data set provide evidence that men are charged significantly more for health insurance than women?
I will be using a data set called insurance, obtained from https://www.kaggle.com/datasets/mirichoi0218/insurance. It includes health insurance information for individual beneficiaries. The data set contains 7 variables and 1,338 entries. All seven variables are relevant in this report: age (age of the primary beneficiary), sex (insurance contractor gender), bmi (body mass index), children (number of children covered by health insurance), smoker (whether the beneficiary smokes), region (the beneficiary’s residential area in the U.S.), and charges (individual medical costs billed by health insurance).
Throughout this report, I will use the tidyverse and readr packages.
library(tidyverse)
library(readr)
Here is the full data table:
insurance <- read_csv("insurance.csv")
insurance
To find out whether men are charged significantly more than women for health insurance, I will first look at the average insurance charges for men and women and fit a regression model of charges on sex.
insurance %>%
  group_by(sex) %>%
  summarize(mean_charges = mean(charges))
insurance_model <- lm(charges ~ sex, data = insurance)
summary(insurance_model)
##
## Call:
## lm(formula = charges ~ sex, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12835 -8435 -3980 3476 51201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12569.6 470.1 26.740 <2e-16 ***
## sexmale 1387.2 661.3 2.098 0.0361 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12090 on 1336 degrees of freedom
## Multiple R-squared: 0.003282, Adjusted R-squared: 0.002536
## F-statistic: 4.4 on 1 and 1336 DF, p-value: 0.03613
coef(insurance_model)
## (Intercept) sexmale
## 12569.579 1387.172
Based on the averages alone, men are charged more than women. The regression coefficients show the same thing: the intercept, 12569.58, is the average charge for women, and the sexmale coefficient says that men are charged 1387.17 more than women on average.
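The link between the group averages and the regression coefficients can be checked directly: in a regression with sex as the only predictor, the intercept is the average for the baseline group and the sexmale coefficient is the gap between the two group averages. The sketch below shows this on a small made-up data frame (toy, toy_model, and the simulated numbers are my own assumptions for illustration, not values from the insurance data).

```r
# Toy data: a hypothetical stand-in for the insurance table, not real values.
set.seed(1)
toy <- data.frame(
  sex = factor(rep(c("female", "male"), each = 50)),
  charges = c(rnorm(50, mean = 12500, sd = 3000),
              rnorm(50, mean = 13900, sd = 3000))
)

# Group averages, computed directly.
group_means <- tapply(toy$charges, toy$sex, mean)

# Simple regression with sex as the only predictor.
toy_model <- lm(charges ~ sex, data = toy)

# The intercept is the female average; the sexmale coefficient is the
# male average minus the female average.
mean_gap <- unname(group_means["male"] - group_means["female"])
slope <- unname(coef(toy_model)["sexmale"])
print(c(mean_gap = mean_gap, slope = slope))
```

The two printed numbers agree up to floating-point error, which is why the table of averages and the one-predictor regression tell the same story.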
The difference in averages is statistically significant at the 5% level: the p-value for the sexmale coefficient is 0.0361, which is below the 5% cutoff. The residual standard error (RSE) is 12090 on 1336 degrees of freedom, which is large compared with the 1387 gap between men and women. The adjusted R-squared value is 0.002536, meaning that sex alone explains well under 1% of the variation in charges.
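Although the difference is significant, there is reason to be cautious about this result. The low R-squared value does not by itself make the p-value wrong, but it does show that almost all of the variation in charges is left unexplained by sex. Other variables could be driving the charges, and if any of them differ between men and women, the gap I see here might be due to them rather than to sex.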
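A small R-squared value and a significant p-value can occur together, because they measure different things: R-squared measures how much of the variation the model explains, while the p-value measures how surprising the estimated gap would be if there were no gap at all. The simulation below is a minimal sketch of this on made-up data (the group labels, the gap of 0.3, and the sample size are assumptions chosen for illustration).

```r
# Made-up data: two large groups with a small true gap and a lot of noise.
set.seed(42)
n <- 2000
group <- factor(rep(c("a", "b"), each = n / 2))

# True gap of 0.3 between the groups, with noise of standard deviation 1.
y <- 0.3 * (group == "b") + rnorm(n)

fit <- lm(y ~ group)
fit_summary <- summary(fit)

# R-squared: share of the variation in y explained by group.
r_squared <- fit_summary$r.squared

# p-value for the gap between the groups.
p_value <- fit_summary$coefficients["groupb", "Pr(>|t|)"]

print(c(r_squared = r_squared, p_value = p_value))
```

With this many observations the gap is detected easily even though group membership explains only a small share of the variation.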
Now that I have an idea of how to answer this question, I want to make sure that I consider confounding variables that may also have an effect on the rates that men and women are charged for health insurance. To do so, I will first make a regression model with age as a possible confounding variable.
age might be a confounding variable if the men and women in the data differ in age, because age also has an effect on the charges for insurance. Older people might be charged more simply because they are prone to more health issues than someone much younger. Here is my regression model that includes age.
insurance_model2 <- lm(charges ~ sex + age, data = insurance)
summary(insurance_model2)
##
## Call:
## lm(formula = charges ~ sex + age, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8821 -6947 -5511 5443 48203
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2343.62 994.35 2.357 0.0186 *
## sexmale 1538.83 631.08 2.438 0.0149 *
## age 258.87 22.47 11.523 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11540 on 1335 degrees of freedom
## Multiple R-squared: 0.09344, Adjusted R-squared: 0.09209
## F-statistic: 68.8 on 2 and 1335 DF, p-value: < 2.2e-16
coef(insurance_model2)
## (Intercept) sexmale age
## 2343.6249 1538.8314 258.8651
This new model says that, holding age fixed, the charges are 1538.83 higher when the person is male, and that, holding sex fixed, the charges increase by 258.87 for each additional year of age. Both effects are significant: the p-value for the sexmale coefficient is 0.0149, which is below the 5% cutoff, and the p-value for age is less than 2e-16, which is far below it.
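To make these coefficients concrete, here is a small worked example that plugs the estimates printed above into the model equation. The helper predict_charges and the example age of 40 are my own choices for illustration; with a fitted model in memory, predict() would do the same job.

```r
# Coefficients copied from the coef(insurance_model2) output above.
coefs <- c(intercept = 2343.6249, sexmale = 1538.8314, age = 258.8651)

# Hypothetical helper: predicted charges for a given age and sex
# (male = 1 for men, male = 0 for women).
predict_charges <- function(age, male) {
  unname(coefs["intercept"] + coefs["sexmale"] * male + coefs["age"] * age)
}

male_40 <- predict_charges(40, male = 1)
female_40 <- predict_charges(40, male = 0)

print(c(male_40 = male_40, female_40 = female_40,
        gap = male_40 - female_40))
```

At any fixed age the two predictions differ by exactly the sexmale coefficient, which is what "holding age fixed" means for this model.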
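There is still reason to be cautious about this model because the adjusted R-squared value is only 0.09209, so over 90% of the variation in charges remains unexplained. The sexmale coefficient also barely changed from the first model (1387.17 to 1538.83), so age does not appear to account for the gap between men and women.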
Since age has proven to have a significant impact on the charges that men and women experience for health insurance, I would like to see if any more of the variables in the data could be confounding as well. This could create a stronger model to base my results on.
The variables that I will check are bmi and smoker.
insurance_model3 <- lm(charges ~ sex + age + bmi + smoker, data = insurance)
summary(insurance_model3)
##
## Call:
## lm(formula = charges ~ sex + age + bmi + smoker, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12364.7 -2972.2 -983.2 1475.8 29018.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11633.49 947.27 -12.281 <2e-16 ***
## sexmale -109.04 334.66 -0.326 0.745
## age 259.45 11.94 21.727 <2e-16 ***
## bmi 323.05 27.53 11.735 <2e-16 ***
## smokeryes 23833.87 414.19 57.544 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6094 on 1333 degrees of freedom
## Multiple R-squared: 0.7475, Adjusted R-squared: 0.7467
## F-statistic: 986.5 on 4 and 1333 DF, p-value: < 2.2e-16
coef(insurance_model3)
## (Intercept) sexmale age bmi smokeryes
## -11633.4946 -109.0411 259.4532 323.0511 23833.8700
This new model is telling me that, holding the other variables fixed, the charges are 109.04 lower when someone is male. For each additional year of age, the charges increase by 259.45. For each additional unit of bmi, the charges increase by 323.05. When someone smokes, the charges increase by 23833.87. The decrease in charges for males is not significant: the coefficient p-value is 0.745, which is far above the 5% cutoff. The age, bmi, and smoker variables are all significant though, with p-values below 2e-16.
I can trust this new comparison more than the earlier ones. The RSE for this model is 6094 on 1333 degrees of freedom, which is about half the RSE of the first two models (12090 and 11540). The adjusted R-squared value is 0.7467, which is a lot closer to 1 than in the other models, so this model explains about 75% of the variation in charges. The overall model p-value is below 2.2e-16, so the predictors taken together are clearly related to charges.
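One way a sex coefficient can shrink toward zero like this is if a strong predictor of charges is unevenly spread between men and women. The simulation below is a sketch of that mechanism on made-up data, under the assumption (which I have not checked in the insurance data) that smoking is more common in one group; all names and numbers in it are hypothetical.

```r
# Made-up data to illustrate confounding; not the insurance data.
set.seed(123)
n <- 2000
sex <- factor(rep(c("female", "male"), each = n / 2))

# Assumption for illustration only: men smoke more often than women.
smoke_prob <- ifelse(sex == "male", 0.4, 0.1)
smoker <- factor(ifelse(runif(n) < smoke_prob, "yes", "no"))

# Charges depend on smoking but NOT on sex.
charges <- 10000 + 20000 * (smoker == "yes") + rnorm(n, sd = 3000)
toy <- data.frame(sex, smoker, charges)

# Without smoker in the model, sex picks up the smoking effect.
unadjusted <- coef(lm(charges ~ sex, data = toy))["sexmale"]

# With smoker in the model, the sex coefficient falls back toward zero.
adjusted <- coef(lm(charges ~ sex + smoker, data = toy))["sexmale"]

print(c(unadjusted = unname(unadjusted), adjusted = unname(adjusted)))
```

In this toy setup sex has no effect of its own, yet the unadjusted model reports a large gap, which is the pattern a confounding variable produces.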
Before I make a final conclusion about whether men are charged more for health insurance than women, I want to test the two remaining variables to see if they are confounding as well.
These last two variables are children and region.
insurance_model4 <- lm(charges ~ sex + age + children + region, data = insurance)
summary(insurance_model4)
##
## Call:
## lm(formula = charges ~ sex + age + children + region, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9701 -6859 -5088 4733 48065
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1945.88 1159.19 1.679 0.0935 .
## sexmale 1479.25 628.64 2.353 0.0188 *
## age 257.58 22.39 11.502 <2e-16 ***
## children 574.91 261.17 2.201 0.0279 *
## regionnorthwest -1017.27 902.49 -1.127 0.2599
## regionsoutheast 1388.07 877.72 1.581 0.1140
## regionsouthwest -1160.05 902.43 -1.285 0.1989
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11490 on 1331 degrees of freedom
## Multiple R-squared: 0.1037, Adjusted R-squared: 0.0997
## F-statistic: 25.68 on 6 and 1331 DF, p-value: < 2.2e-16
coef(insurance_model4)
## (Intercept) sexmale age children regionnorthwest
## 1945.8758 1479.2521 257.5801 574.9104 -1017.2683
## regionsoutheast regionsouthwest
## 1388.0650 -1160.0460
When I control for these variables, my regression model shows that the number of children someone has can have a significant impact on charges, but the region they live in does not. This is shown by the p-values for the coefficients. The children coefficient p-value is 0.0279, which is below the 5% cutoff. The region variable, on the other hand, has three coefficients, each comparing one region with the baseline region, with p-values of 0.2599, 0.1140, and 0.1989, which are all above the 5% cutoff. Note that this model does not include bmi or smoker.
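Because region has more than two levels, each of its p-values only compares one region with the baseline. A common way to test the variable as a whole is an F-test that compares the model with and without it, using anova(). The sketch below shows the pattern on made-up data (toy, its region labels, and the simulated charges are assumptions for illustration); on the real data the same two-model comparison would use the insurance table.

```r
# Made-up data with a four-level region variable; not the insurance data.
set.seed(7)
n <- 400
toy <- data.frame(
  age = sample(18:64, n, replace = TRUE),
  region = factor(sample(c("northeast", "northwest", "southeast", "southwest"),
                         n, replace = TRUE))
)
# In this toy setup, charges depend on age only.
toy$charges <- 2000 + 250 * toy$age + rnorm(n, sd = 5000)

# Fit the model without and with region.
reduced <- lm(charges ~ age, data = toy)
full <- lm(charges ~ age + region, data = toy)

# One F-test for all three region coefficients at once.
region_test <- anova(reduced, full)
print(region_test)

# The joint p-value for region sits in the second row of the table.
region_p <- region_test[2, "Pr(>F)"]
```

This gives a single p-value for region instead of three separate ones, which avoids judging the variable by whichever baseline comparison happens to be shown.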
Now that I have decided which variables are confounding and have an effect on my analysis, I can create a final regression model to answer my question of whether men are charged more than women for health insurance.
insurance_model5 <- lm(charges ~ sex + age + bmi + smoker + children, data = insurance)
summary(insurance_model5)
##
## Call:
## lm(formula = charges ~ sex + age + bmi + smoker + children, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11837.2 -2916.7 -994.2 1375.3 29565.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12052.46 951.26 -12.670 < 2e-16 ***
## sexmale -128.64 333.36 -0.386 0.699641
## age 257.73 11.90 21.651 < 2e-16 ***
## bmi 322.36 27.42 11.757 < 2e-16 ***
## smokeryes 23823.39 412.52 57.750 < 2e-16 ***
## children 474.41 137.86 3.441 0.000597 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6070 on 1332 degrees of freedom
## Multiple R-squared: 0.7497, Adjusted R-squared: 0.7488
## F-statistic: 798 on 5 and 1332 DF, p-value: < 2.2e-16
coef(insurance_model5)
## (Intercept) sexmale age bmi smokeryes children
## -12052.4620 -128.6399 257.7350 322.3642 23823.3925 474.4111
My final answer to this question is no: once the other variables are accounted for, the data set does not provide evidence that men are charged significantly more for health insurance than women are. Men do have higher average charges in the raw data, but when I interpret my most trustworthy regression model, which accounts for age, bmi, smoker, and children, I see that the coefficient p-value for sexmale is 0.699641, which is far above the 5% cutoff. This means that the estimated decrease of 128.64 in charges when the person is male is not a significant difference. This model is the strongest of my models: its RSE is 6070 on 1332 degrees of freedom, which is the lowest of all of my models, and its adjusted R-squared value is 0.7488, which is the closest to 1. Finally, the overall p-value for this model is below 2.2e-16.
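The size of the uncertainty around the sexmale estimate can be seen by turning the estimate and standard error printed above into a 95% confidence interval. The sketch below does the arithmetic by hand from those printed numbers; with the fitted model in memory, confint(insurance_model5) should give a matching interval.

```r
# Values copied from the summary(insurance_model5) output above.
estimate <- -128.64   # sexmale coefficient
std_error <- 333.36   # its standard error
resid_df <- 1332      # residual degrees of freedom

# t statistic and two-sided p-value, recomputed from the printed values.
t_value <- estimate / std_error
p_value <- 2 * pt(-abs(t_value), df = resid_df)

# 95% confidence interval: estimate plus or minus t critical value times SE.
t_crit <- qt(0.975, df = resid_df)
ci <- estimate + c(-1, 1) * t_crit * std_error

print(round(c(t_value = t_value, p_value = p_value), 4))
print(round(c(lower = ci[1], upper = ci[2]), 2))
```

The interval runs from a negative value to a positive one, so it contains zero, which is another way of seeing that the sexmale coefficient is not significant at the 5% level.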