Throughout this analysis, we refer to a data set called
insurance. This data set is simulated based on demographics
from the US census [ref].
We will primarily be demonstrating linear regression using this
synthetic data set. Each record has 7 variables: age is
the age of the primary beneficiary of the insurance policy,
sex is either male or female
depending on the gender of the beneficiary, bmi is the beneficiary’s
body mass index, children is the number of dependent
children, smoker is either yes or
no depending on the beneficiary’s smoking status,
region is the beneficiary’s residential region in the US,
and charges is the individual medical costs billed by
health insurance.
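As a minimal sketch of the layout described above, the snippet below builds a three-row example with the same seven column names. The rows are made up for illustration and are not taken from the real data set; the region labels other than southeast and the file name mentioned in the comment are assumptions as well.

```r
# Made-up rows for illustration only; NOT values from the real data set.
csv_text <- "age,sex,bmi,children,smoker,region,charges
25,female,22.5,0,no,northwest,3000
40,male,30.1,2,yes,southeast,25000
58,female,27.8,1,no,northeast,12000"

# With the real file, a call such as read.csv("insurance.csv") would play
# the same role (the file name is an assumption).
example_rows <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# sex, smoker and region are read as factors; the other columns are numeric.
str(example_rows)
```

Reading the categorical columns as factors matters later on, because lm() turns each factor into indicator variables such as sexmale and smokeryes.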
This report aims to answer the following question:
Does the given data set provide evidence that men are charged significantly more for health insurance than women?
We load the tidyverse package for its grouped summaries (dplyr) and its plotting functions (ggplot2). The rest of the analysis uses the regression functions in base R.
library(tidyverse)
We can directly compare the mean insurance charges (the
charges variable in the data set) between males and females using a
grouped summary.
insurance %>%
  group_by(sex) %>%
  summarise(mean_charges = mean(charges), count = n())
There are a comparable number of males and females in the data set. The mean insurance charge for men is around \(\$14,000\), and the mean for women is about \(\$12,600\). The gap of roughly \(\$1,400\) looks substantial, with men charged more. We can check whether it is statistically significant by fitting a simple linear regression and inspecting the summary statistics.
simple_insurance_model <- lm(charges ~ sex, data = insurance)
summary(simple_insurance_model)
##
## Call:
## lm(formula = charges ~ sex, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12835 -8435 -3980 3476 51201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12569.6 470.1 26.740 <2e-16 ***
## sexmale 1387.2 661.3 2.098 0.0361 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12090 on 1336 degrees of freedom
## Multiple R-squared: 0.003282, Adjusted R-squared: 0.002536
## F-statistic: 4.4 on 1 and 1336 DF, p-value: 0.03613
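With a single two-level predictor, the regression is a restatement of the grouped summary: the intercept is the mean of the reference group and the sexmale coefficient is the difference in group means. The sketch below illustrates this on simulated stand-in data (the group sizes, means, and spread are arbitrary assumptions, not the real data set) and checks that the p-value agrees with a pooled-variance two-sample t-test.

```r
set.seed(1)

# Simulated stand-in data; all numbers here are arbitrary.
toy <- data.frame(
  sex = factor(rep(c("female", "male"), each = 50)),
  charges = c(rnorm(50, mean = 12500, sd = 4000),
              rnorm(50, mean = 14000, sd = 4000))
)

fit <- lm(charges ~ sex, data = toy)
group_means <- tapply(toy$charges, toy$sex, mean)

# Intercept = mean of the reference level (female);
# sexmale   = male mean minus female mean.
coef(fit)
group_means

# The p-value for sexmale matches a pooled-variance two-sample t-test.
p_lm <- summary(fit)$coefficients["sexmale", "Pr(>|t|)"]
p_t  <- t.test(charges ~ sex, data = toy, var.equal = TRUE)$p.value
all.equal(p_lm, p_t)
```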
The p-value on the variable sexmale is 0.036, which is
less than our 0.05 cutoff. This tells us that the mean insurance charges
for males are indeed significantly different from those for females. However,
just because males have higher charges doesn’t mean their gender is the
cause. Let’s introduce another variable: age. Age may be
related to both gender and insurance risk. One possible explanation is
that men are more likely to need medical attention than women at higher
ages. We can test this theory by adding the age variable to
our regression model.
insurance_model_w_age <- lm(charges ~ sex + age, data = insurance)
summary(insurance_model_w_age)
##
## Call:
## lm(formula = charges ~ sex + age, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8821 -6947 -5511 5443 48203
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2343.62 994.35 2.357 0.0186 *
## sexmale 1538.83 631.08 2.438 0.0149 *
## age 258.87 22.47 11.523 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11540 on 1335 degrees of freedom
## Multiple R-squared: 0.09344, Adjusted R-squared: 0.09209
## F-statistic: 68.8 on 2 and 1335 DF, p-value: < 2.2e-16
The p-value on sexmale is now 0.015, still lower than
our 0.05 cutoff. This tells us that for a man and a woman of the same age,
the model estimates that the man will on average be charged about
\(\$1,500\) more for insurance.
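The "same age" reading of the coefficient can be made concrete with predict(). The sketch below uses simulated stand-in data (the age range, effect sizes, and noise level are arbitrary assumptions) and compares the predictions for two hypothetical people who differ only in sex.

```r
set.seed(2)
n <- 200

# Simulated stand-in data; all numbers here are arbitrary.
toy <- data.frame(
  sex = factor(sample(c("female", "male"), n, replace = TRUE)),
  age = sample(18:64, n, replace = TRUE)
)
toy$charges <- 2000 + 1500 * (toy$sex == "male") + 250 * toy$age +
  rnorm(n, sd = 3000)

fit <- lm(charges ~ sex + age, data = toy)

# Two hypothetical people of the same age who differ only in sex.
same_age <- data.frame(
  sex = factor(c("female", "male"), levels = levels(toy$sex)),
  age = c(40, 40)
)
pred <- predict(fit, newdata = same_age)

# The gap between the two predictions equals the sexmale coefficient,
# whichever common age is chosen.
gap <- unname(pred[2] - pred[1])
all.equal(gap, unname(coef(fit)["sexmale"]))
```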
That conclusion still doesn’t help us much. Even though we have
controlled for age, we can hardly say that age is the only factor that
can affect insurance rates. Another factor affecting insurance charges
is health. Our data set has two variables indicative of health:
bmi and smoker. It is possible that men are
more likely to be smokers than women and/or have a higher BMI,
indicating worse health and therefore higher insurance charges. Let’s
add these two variables to our regression model.
insurance_model_w_age_health <- lm(charges ~ sex + age + bmi + smoker, data = insurance)
summary(insurance_model_w_age_health)
##
## Call:
## lm(formula = charges ~ sex + age + bmi + smoker, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12364.7 -2972.2 -983.2 1475.8 29018.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11633.49 947.27 -12.281 <2e-16 ***
## sexmale -109.04 334.66 -0.326 0.745
## age 259.45 11.94 21.727 <2e-16 ***
## bmi 323.05 27.53 11.735 <2e-16 ***
## smokeryes 23833.87 414.19 57.544 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6094 on 1333 degrees of freedom
## Multiple R-squared: 0.7475, Adjusted R-squared: 0.7467
## F-statistic: 986.5 on 4 and 1333 DF, p-value: < 2.2e-16
Now there is no longer a significant difference in charges between men and women when controlling for age and health: the p-value on sexmale has risen to 0.745. There are still two more variables available for us to test: children and region. It is good practice to use all the data available to us before we draw a definite conclusion.
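A coefficient that is significant on its own and vanishes after adjustment is the pattern one would expect when the predictor is correlated with a stronger one. The sketch below simulates that situation under stated assumptions: men smoke more often than women (30% versus 10%, arbitrary figures), and charges depend on smoking only. It does not show that this is what happens in the real data set.

```r
set.seed(3)
n <- 5000

sex <- factor(sample(c("female", "male"), n, replace = TRUE))

# Assumption for the sketch: men smoke more often than women (30% vs 10%).
smoker <- factor(ifelse(runif(n) < ifelse(sex == "male", 0.3, 0.1), "yes", "no"))

# Charges depend on smoking only; sex has no direct effect by construction.
charges <- 8000 + 20000 * (smoker == "yes") + rnorm(n, sd = 3000)
toy <- data.frame(sex, smoker, charges)

unadjusted <- coef(lm(charges ~ sex, data = toy))["sexmale"]
adjusted   <- coef(lm(charges ~ sex + smoker, data = toy))["sexmale"]

# The unadjusted coefficient picks up the smoking gap (roughly 0.2 * 20000);
# the adjusted coefficient should be close to zero.
round(c(unadjusted = unname(unadjusted), adjusted = unname(adjusted)))
```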
One variable we have in the data set is region. Perhaps
due to economic factors, people in a certain region get charged more for
health insurance. Since this is a categorical variable, a convenient first
check is to compare the regions visually using boxplots. This will help us
compare the distribution of charges for each region.
ggplot(insurance) +
  geom_boxplot(aes(x = region, y = charges))
The southeast region appears slightly more right-skewed than the other regions, but overall, none of the regions looks distinctly different from the rest. For example, the medians of all 4 regions are close to one another, at roughly \(\$9,000\) to \(\$10,000\) in charges. We will proceed without including this variable.
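A boxplot is a visual check and not a significance test. One formal alternative is a partial F-test that compares nested models with and without the categorical variable. The sketch below shows the mechanics on simulated stand-in data in which region has no effect by construction (the region labels other than southeast, the age range, and all effect sizes are assumptions).

```r
set.seed(4)
n <- 400

# Simulated stand-in data; region has no effect on charges by construction.
toy <- data.frame(
  age = sample(18:64, n, replace = TRUE),
  region = factor(sample(c("northeast", "northwest", "southeast", "southwest"),
                         n, replace = TRUE))
)
toy$charges <- 3000 + 250 * toy$age + rnorm(n, sd = 4000)

without_region <- lm(charges ~ age, data = toy)
with_region    <- lm(charges ~ age + region, data = toy)

# Partial F-test: do the three region indicators jointly improve the fit?
region_test <- anova(without_region, with_region)
region_test
```

On the report's data, the analogous call would be anova(insurance_model_w_age_health, update(insurance_model_w_age_health, . ~ . + region)). Its result is not shown here.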
The final variable we should test is children. This is
another numerical variable, so we will add it straight to the model
as before and see if its coefficient is significant.
final_insurance_model <- lm(charges ~ sex + age + bmi + smoker + children, data = insurance)
summary(final_insurance_model)
##
## Call:
## lm(formula = charges ~ sex + age + bmi + smoker + children, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11837.2 -2916.7 -994.2 1375.3 29565.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12052.46 951.26 -12.670 < 2e-16 ***
## sexmale -128.64 333.36 -0.386 0.699641
## age 257.73 11.90 21.651 < 2e-16 ***
## bmi 322.36 27.42 11.757 < 2e-16 ***
## smokeryes 23823.39 412.52 57.750 < 2e-16 ***
## children 474.41 137.86 3.441 0.000597 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6070 on 1332 degrees of freedom
## Multiple R-squared: 0.7497, Adjusted R-squared: 0.7488
## F-statistic: 798 on 5 and 1332 DF, p-value: < 2.2e-16
The number of dependent children on the policy is also a significant predictor of health insurance charges. We have now checked all 6 possible explanatory variables and are ready to give a final assessment.
Although a simple comparison does indeed suggest that men get charged
significantly more for health insurance than women, we have determined
that there are other confounding variables that better explain the
variation in insurance charges. In fact, when controlling for age,
health, and the number of children, the estimated charge for men is
slightly lower (by about \(\$129\)), but the p-value on sexmale
is 0.70, far above our 0.05 cutoff. Therefore, for a male and a female of
comparable age, health, and number of children, there is no significant
difference in insurance charges.
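To see how the sex coefficient moves across model specifications in one place, the estimate and p-value can be collected into a small table. The sketch below uses a hypothetical helper, term_summary(), and simulated stand-in data in which sex has no effect. Both are assumptions for illustration and are not part of the original analysis.

```r
# Hypothetical helper: pull the estimate and p-value for one term
# from a fitted lm object.
term_summary <- function(model, term = "sexmale") {
  summary(model)$coefficients[term, c("Estimate", "Pr(>|t|)")]
}

set.seed(5)
n <- 1000

# Simulated stand-in data; all numbers here are arbitrary.
toy <- data.frame(
  sex = factor(sample(c("female", "male"), n, replace = TRUE)),
  age = sample(18:64, n, replace = TRUE)
)
toy$charges <- 3000 + 250 * toy$age + rnorm(n, sd = 4000)

models <- list(
  sex_only = lm(charges ~ sex, data = toy),
  sex_age  = lm(charges ~ sex + age, data = toy)
)

# One row per model, with columns Estimate and Pr(>|t|).
comparison <- t(sapply(models, term_summary))
comparison
```

With the report's fitted objects, the same helper could be applied to a list containing simple_insurance_model, insurance_model_w_age, insurance_model_w_age_health, and final_insurance_model.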