Introduction

This report will answer the following question:

Are men charged significantly more for health insurance than women?

We will be using a data set titled insurance, obtained from an imported data table. The data set contains 1338 observations across 7 variables; the variables relevant to this report are sex (sex of the insurance recipient), charges (individual medical costs billed by health insurance), age (age of the primary beneficiary), bmi (Body Mass Index of the primary beneficiary), and smoker (whether the primary beneficiary smokes). The full data set can be viewed below:

Throughout, the tidyverse and modelr packages will be used to manipulate the data set.

library(tidyverse)
library(modelr)

Insurance Charges Based on Sex

To determine if men are charged more than women, we can compare the mean insurance charges for men with the mean insurance charges for women. The code chunk below calculates the average insurance charges by the sex of the primary beneficiary.

# Average charges per sex; naming the column keeps the summary readable
mean_charge <- insurance %>% 
  group_by(sex) %>% 
  summarise(mean_charges = mean(charges))

mean_charge

Comparing only the average insurance charges, men are charged more than women: on average, men are charged $1387.17 more. This comparison alone does not tell us whether the difference is statistically significant.

In order to determine if the difference in charges can be considered significant, we will do an F test on a model that relates the insurance charges to the sex of the primary beneficiary.

charge_model <- lm(charges ~ sex, data = insurance)

summary(charge_model)
## 
## Call:
## lm(formula = charges ~ sex, data = insurance)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12835  -8435  -3980   3476  51201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  12569.6      470.1  26.740   <2e-16 ***
## sexmale       1387.2      661.3   2.098   0.0361 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12090 on 1336 degrees of freedom
## Multiple R-squared:  0.003282,   Adjusted R-squared:  0.002536 
## F-statistic:   4.4 on 1 and 1336 DF,  p-value: 0.03613

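Since the p-value of the model (0.036) is less than 0.05, we can conclude that the difference in mean insurance charges between men and women is statistically significant at the 5% level. From this, we can state that men are charged more than women on average, although sex alone explains very little of the variation in charges (R² of about 0.003).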
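With a single two-level predictor, the regression is another way of writing the comparison of group means: the intercept is the mean for the reference group (here women), the sexmale coefficient is the difference in means, and the overall F statistic is the square of that coefficient's t value (2.098² ≈ 4.40, matching the output above). The sketch below checks these relationships on a small made-up data frame; toy and its values are hypothetical and are not taken from insurance.

```r
# Hypothetical toy data, not the insurance data set
toy <- data.frame(
  sex     = factor(c("female", "female", "female", "male", "male", "male")),
  charges = c(1000, 2000, 3000, 2000, 4000, 6000)
)

# Group means computed directly
group_means <- tapply(toy$charges, toy$sex, mean)
mean_diff   <- as.numeric(group_means["male"] - group_means["female"])

# The same comparison expressed as a linear model
toy_fit  <- lm(charges ~ sex, data = toy)
toy_sum  <- summary(toy_fit)
sex_coef <- as.numeric(coef(toy_fit)["sexmale"])
t_value  <- toy_sum$coefficients["sexmale", "t value"]
f_value  <- as.numeric(toy_sum$fstatistic["value"])

mean_diff   # difference in group means
sex_coef    # the sexmale coefficient, equal to mean_diff
t_value^2   # squared t value, equal to the overall F statistic
f_value
```

These equalities hold only for a model with one predictor; once more variables are added, the overall F test and the t test for sexmale answer different questions.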

However, the analysis above is only based on the sex and the insurance charges. We cannot fully trust this answer, as we did not account for the possible confounding variables in the data set.

Age as a Confounding Variable

A possible confounding variable is age: as people get older, they often need more medical care, which causes insurance charges to increase. If the men and women in the data set differ in age, this could account for part of the difference in charges, so we add age to the model.

charge_mult_model <- lm(charges ~ sex + age, data = insurance)

summary(charge_mult_model)
## 
## Call:
## lm(formula = charges ~ sex + age, data = insurance)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8821  -6947  -5511   5443  48203 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2343.62     994.35   2.357   0.0186 *  
## sexmale      1538.83     631.08   2.438   0.0149 *  
## age           258.87      22.47  11.523   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11540 on 1335 degrees of freedom
## Multiple R-squared:  0.09344,    Adjusted R-squared:  0.09209 
## F-statistic:  68.8 on 2 and 1335 DF,  p-value: < 2.2e-16

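Controlling for age improves the model, although it still explains only about 9% of the variation in charges. According to the model, men are still charged more than women (about $1539 more at a given age), and charges increase by about $259 for each additional year of age. Since the p-value for the sexmale coefficient (0.0149) is less than 0.05, we can conclude that the difference between men and women remains significant after controlling for age.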
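In a model with more than one predictor, the p-value on the last line of the summary tests whether the predictors jointly explain anything, while the row for sexmale tests the effect of sex specifically. The sketch below shows how the two can be pulled out separately. It uses simulated data in which charges depend on age only; toy, its size and its coefficients are assumptions made for illustration and do not come from insurance.

```r
# Simulated data for illustration only; not the insurance data set
set.seed(42)
n <- 100
toy <- data.frame(
  sex = factor(rep(c("female", "male"), each = n / 2)),
  age = sample(18:64, n, replace = TRUE)
)
# By construction, charges depend on age but not on sex
toy$charges <- 3000 + 250 * toy$age + rnorm(n, sd = 4000)

toy_fit <- lm(charges ~ sex + age, data = toy)
toy_sum <- summary(toy_fit)

# p-value for the sex coefficient alone
sex_p <- toy_sum$coefficients["sexmale", "Pr(>|t|)"]

# p-value for the overall F test (all predictors jointly)
f_stat    <- toy_sum$fstatistic
overall_p <- pf(f_stat["value"], f_stat["numdf"], f_stat["dendf"],
                lower.tail = FALSE)

sex_p
overall_p
```

Here the overall p-value is tiny because age matters, even though sex has no effect in the simulated data, which is why the question about sex should be answered from the sexmale row.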

We cannot fully trust this model either, since the R² value is quite low (about 0.09) and there are still other variables that could be confounding the relationship.

Other Confounding Variables

Two other possible confounding variables are the BMI of the beneficiary and whether they are a smoker. We will add these variables to the model to determine whether they also affect the insurance charges.

charge_mult_model_2 <- lm(charges ~ sex + age + bmi + smoker, data = insurance)

summary(charge_mult_model_2)
## 
## Call:
## lm(formula = charges ~ sex + age + bmi + smoker, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12364.7  -2972.2   -983.2   1475.8  29018.3 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -11633.49     947.27 -12.281   <2e-16 ***
## sexmale       -109.04     334.66  -0.326    0.745    
## age            259.45      11.94  21.727   <2e-16 ***
## bmi            323.05      27.53  11.735   <2e-16 ***
## smokeryes    23833.87     414.19  57.544   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6094 on 1333 degrees of freedom
## Multiple R-squared:  0.7475, Adjusted R-squared:  0.7467 
## F-statistic: 986.5 on 4 and 1333 DF,  p-value: < 2.2e-16

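This model tells us that charges increase with the beneficiary's BMI and are much higher for smokers (about $23,834 more, holding the other variables fixed). Once these variables are controlled for, men are no longer charged more than women: the coefficient for sexmale is negative (-109.04) and not statistically significant (p = 0.745). The overall p-value shows only that the model as a whole is significant; it does not test the effect of sex.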

This model can be trusted more than the previous ones, since it explains about 75% of the variation in charges, but some variables have still not been controlled for. We should test the remaining possible confounding variables before reaching a conclusion.
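One way a sex difference can appear in a simple model and vanish in an adjusted one is if a strong predictor such as smoking is unevenly distributed between the sexes. The sketch below simulates that situation. The smoking rates, effect sizes and the sim data frame are all assumptions chosen for illustration; the report does not show how smoking is distributed by sex in insurance.

```r
# Simulated data for illustration only; not the insurance data set
set.seed(1)
n_per_sex <- 200
sim <- data.frame(
  sex    = factor(rep(c("female", "male"), each = n_per_sex)),
  # Assumption: 10% of women and 30% of men smoke in this made-up sample
  smoker = factor(c(rep(c("yes", "no"), times = c(20, 180)),
                    rep(c("yes", "no"), times = c(60, 140))),
                  levels = c("no", "yes"))
)
# Charges depend on smoking only; sex has no direct effect by construction
sim$charges <- 8000 + 24000 * (sim$smoker == "yes") +
  rnorm(nrow(sim), sd = 3000)

unadjusted <- coef(lm(charges ~ sex, data = sim))["sexmale"]
adjusted   <- coef(lm(charges ~ sex + smoker, data = sim))["sexmale"]

unadjusted  # large and positive: picks up the difference in smoking rates
adjusted    # close to zero once smoker is in the model
```

In this simulation the unadjusted coefficient reflects who smokes rather than any effect of sex, which is the pattern a confounding variable produces.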

The last two possible confounding variables are the primary beneficiary's children (children) and the region in which they live (region). The code below adds each of these variables to the previous model in turn.

charge_children_model <- lm(charges ~ sex + age + bmi + smoker + children, data = insurance)

summary(charge_children_model)
## 
## Call:
## lm(formula = charges ~ sex + age + bmi + smoker + children, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11837.2  -2916.7   -994.2   1375.3  29565.5 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -12052.46     951.26 -12.670  < 2e-16 ***
## sexmale       -128.64     333.36  -0.386 0.699641    
## age            257.73      11.90  21.651  < 2e-16 ***
## bmi            322.36      27.42  11.757  < 2e-16 ***
## smokeryes    23823.39     412.52  57.750  < 2e-16 ***
## children       474.41     137.86   3.441 0.000597 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6070 on 1332 degrees of freedom
## Multiple R-squared:  0.7497, Adjusted R-squared:  0.7488 
## F-statistic:   798 on 5 and 1332 DF,  p-value: < 2.2e-16

charge_region_model <- lm(charges ~ sex + age + bmi + smoker + region, data = insurance)

summary(charge_region_model)
## 
## Call:
## lm(formula = charges ~ sex + age + bmi + smoker + region, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11852.6  -3010.9   -987.8   1515.8  29467.1 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11556.96     985.63 -11.725   <2e-16 ***
## sexmale           -111.57     334.26  -0.334   0.7386    
## age                258.54      11.94  21.658   <2e-16 ***
## bmi                340.46      28.71  11.857   <2e-16 ***
## smokeryes        23862.91     414.82  57.526   <2e-16 ***
## regionnorthwest   -304.10     478.01  -0.636   0.5248    
## regionsoutheast  -1039.20     480.65  -2.162   0.0308 *  
## regionsouthwest   -916.44     479.72  -1.910   0.0563 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6087 on 1330 degrees of freedom
## Multiple R-squared:  0.7487, Adjusted R-squared:  0.7474 
## F-statistic:   566 on 7 and 1330 DF,  p-value: < 2.2e-16

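Both variables show some relationship with the charges: children is clearly significant (p = 0.000597), while for region only the southeast coefficient differs significantly from the reference region at the 5% level (p = 0.0308). Region can be set aside, as the adjusted R² of the region model (0.7474) is lower than that of the model involving children (0.7488).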
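Comparing R² values does not by itself test whether a variable belongs in the model. One common alternative is a partial F test against the four-variable model, which would normally be run as anova(charge_mult_model_2, charge_children_model). The sketch below approximates that test from the rounded R² values printed above, so its results are approximate; partial_f is a hypothetical helper written for this report, not a package function.

```r
# Partial F statistic for adding q terms to a nested linear model,
# computed from the two R-squared values and the larger model's residual df
partial_f <- function(r2_reduced, r2_full, q, df_resid_full) {
  f_stat <- ((r2_full - r2_reduced) / q) / ((1 - r2_full) / df_resid_full)
  p_val  <- pf(f_stat, q, df_resid_full, lower.tail = FALSE)
  c(F = f_stat, p = p_val)
}

# Baseline: charges ~ sex + age + bmi + smoker (R-squared 0.7475)
children_test <- partial_f(0.7475, 0.7497, q = 1, df_resid_full = 1332)
region_test   <- partial_f(0.7475, 0.7487, q = 3, df_resid_full = 1330)

children_test
region_test
```

With these rounded inputs, the F statistic for children comes out near 11.7, close to the square of its t value (3.441² ≈ 11.8), while the three region terms together give an F statistic near 2.1, which is not significant at the 5% level.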

The data set does not provide enough evidence to suggest that men are charged significantly more for health insurance than women. In every model that controls for BMI and smoking, the p-value for the sexmale coefficient is well above 0.05 (about 0.70 to 0.75), so we cannot reject the hypothesis that sex has no effect on charges once the other variables are taken into account.
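A confidence interval makes the same point in dollar terms. The sketch below rebuilds a 95% interval and the p-value for sexmale from the estimate, standard error and residual degrees of freedom printed in the charge_children_model summary; with the fitted model available, confint(charge_children_model, "sexmale") should give the interval directly.

```r
# Values copied from the charge_children_model summary above
estimate  <- -128.64
std_error <- 333.36
df_resid  <- 1332

# 95% confidence interval: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df_resid)
ci     <- c(lower = estimate - t_crit * std_error,
            upper = estimate + t_crit * std_error)

# Two-sided p-value recomputed from the t statistic
t_stat  <- estimate / std_error
p_value <- 2 * pt(-abs(t_stat), df_resid)

ci
p_value
```

The interval runs from roughly -783 to 525 dollars and includes zero, so the data are consistent with men being charged somewhat less, the same, or somewhat more than comparable women.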

Conclusion

From the models created and analysed, we can conclude that men are not charged significantly more for health insurance than women. Although men are charged more on average, the difference disappears once age, BMI, smoking status and children are controlled for: the adjusted coefficient for sex is small, negative and not statistically significant. The data therefore support the claim that the higher average charges for men are accounted for by the other variables in the model rather than by sex itself.