Introduction

In this report, I seek to answer the question: “Do men get charged significantly more for health insurance?” The report uses group averages, simple and multiple linear regression, and the resulting p-values and R² values to answer this question in a statistically meaningful way.

This data set contains 7 variables and 1,338 entries and comes from kaggle.com. The variables that will be of import to this report are: charges, the individual medical costs billed by health insurance; age, the age of the individual; smoker, whether or not the individual smokes; sex, the sex of the individual; bmi, the body mass index of the individual; region, what geographical region of the United States the individual lives in; and children, the number of children covered by health insurance.

library(DT)  # provides datatable()
datatable(insurance, options = list(scrollX = TRUE))

Average Insurance Cost by Gender

To get a general comparison, first I want to break down average insurance cost by gender.

library(dplyr)  # provides %>%, group_by(), and summarize()

# Mean charges for each sex in a single table
insurance %>%
  group_by(sex) %>%
  summarize(mean_charges = mean(charges))

From this, we can see that males are charged $13,956.75 on average, while females are charged $12,569.58 on average, a difference of $1,387.17. However, this comparison alone cannot tell us whether men are charged more than women for health insurance: a raw difference in averages says nothing about statistical significance, and other factors could be at play.

To aid in answering the question, we can test whether this difference in average charges is statistically significant by regressing charges on sex.

insurance_model <- lm(charges ~ sex, data = insurance)
summary(insurance_model)
## 
## Call:
## lm(formula = charges ~ sex, data = insurance)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12835  -8435  -3980   3476  51201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  12569.6      470.1  26.740   <2e-16 ***
## sexmale       1387.2      661.3   2.098   0.0361 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12090 on 1336 degrees of freedom
## Multiple R-squared:  0.003282,   Adjusted R-squared:  0.002536 
## F-statistic:   4.4 on 1 and 1336 DF,  p-value: 0.03613

Because the p-value for sexmale (0.0361) is below the 0.05 cutoff, the association between sex and charges is statistically significant. However, I would be hesitant to trust this model because the adjusted R² is 0.002536, meaning that only 0.2536% of the variation in charges is explained by this model. Further testing is needed to answer the question.
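The coefficients above line up with the averages from the previous section: the intercept (12,569.6) is the female mean, and the intercept plus the sexmale estimate (12,569.6 + 1,387.2 = 13,956.8) is the male mean. The sketch below illustrates that relationship, and the fact that the slope's test matches a pooled-variance two-sample t-test. It runs on simulated stand-in data, not the insurance data; the names sim, fit, slope, and mean_gap are made up for this illustration.

```r
# Simulated stand-in data (NOT the Kaggle insurance data), used only so the
# sketch runs on its own.
set.seed(42)
n <- 200
sim <- data.frame(sex = factor(rep(c("female", "male"), each = n / 2)))
sim$charges <- ifelse(sim$sex == "male", 14000, 12500) + rnorm(n, sd = 3000)

fit <- lm(charges ~ sex, data = sim)
group_means <- tapply(sim$charges, sim$sex, mean)

# The slope on sexmale equals the male mean minus the female mean.
slope <- unname(coef(fit)["sexmale"])
mean_gap <- unname(group_means["male"] - group_means["female"])

# A pooled-variance two-sample t-test gives the same p-value as the
# t-test on the slope.
tt <- t.test(charges ~ sex, data = sim, var.equal = TRUE)
p_lm <- summary(fit)$coefficients["sexmale", "Pr(>|t|)"]

print(c(slope = slope, mean_gap = mean_gap))
print(c(p_lm = p_lm, p_ttest = tt$p.value))
```

With the real data, the same comparison would presumably be t.test(charges ~ sex, data = insurance, var.equal = TRUE).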

How Much of a Factor is Age?

Men tend to be more at risk for developing a number of health problems, and health risks in general grow with age, so age may be a confounding variable in this data set. Let’s factor age into the model.

insurance_model2 <- lm(charges ~ sex + age, data = insurance)
summary(insurance_model2)
## 
## Call:
## lm(formula = charges ~ sex + age, data = insurance)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8821  -6947  -5511   5443  48203 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2343.62     994.35   2.357   0.0186 *  
## sexmale      1538.83     631.08   2.438   0.0149 *  
## age           258.87      22.47  11.523   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11540 on 1335 degrees of freedom
## Multiple R-squared:  0.09344,    Adjusted R-squared:  0.09209 
## F-statistic:  68.8 on 2 and 1335 DF,  p-value: < 2.2e-16

Based on this new model, age is a very significant factor in price, with a p-value well below the cutoff of 0.05. Sex is also significant in this model, with a p-value of 0.0149: if sex truly had no association with charges after accounting for age, an estimate this far from zero would arise only about 1.49% of the time. So, age and sex can both be said to be significant predictors of charges. However, I am still not fully satisfied with this model, as the adjusted R² is only 0.09209, meaning that only 9.209% of the variation in charges is explained by age and sex together. This model still cannot be fully trusted, and something else is driving much of the variation.
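The p-value for sexmale can be reproduced from the estimate and standard error printed in the summary above. The sketch below is only a check of that arithmetic; the variable names are made up here, and the numbers are copied from the output for charges ~ sex + age.

```r
# Values copied from the summary output for charges ~ sex + age
estimate  <- 1538.83   # sexmale estimate
std_error <- 631.08    # sexmale standard error
df_resid  <- 1335      # residual degrees of freedom

# t statistic: how many standard errors the estimate lies from zero
t_value <- estimate / std_error

# Two-sided p-value: probability of a t statistic at least this extreme
# if the true coefficient were zero
p_value <- 2 * pt(-abs(t_value), df = df_resid)

print(round(c(t_value = t_value, p_value = p_value), 4))
```

The p-value is a statement about how surprising the estimate would be under a zero coefficient, not the probability that sex is a good predictor.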

Taking BMI and Smoking Into Consideration

Next, I will add bmi and smoker to the model to see what roles they play in determining charges. I predict that they have a very high impact, as both a person’s BMI and whether they smoke have large implications for overall health.

insurance_model3 <- lm(charges ~ sex + age + bmi + smoker, data = insurance)
summary(insurance_model3)
## 
## Call:
## lm(formula = charges ~ sex + age + bmi + smoker, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12364.7  -2972.2   -983.2   1475.8  29018.3 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -11633.49     947.27 -12.281   <2e-16 ***
## sexmale       -109.04     334.66  -0.326    0.745    
## age            259.45      11.94  21.727   <2e-16 ***
## bmi            323.05      27.53  11.735   <2e-16 ***
## smokeryes    23833.87     414.19  57.544   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6094 on 1333 degrees of freedom
## Multiple R-squared:  0.7475, Adjusted R-squared:  0.7467 
## F-statistic: 986.5 on 4 and 1333 DF,  p-value: < 2.2e-16

In this model, being male is no longer a significant predictor of charges: its new p-value is 0.745. age, bmi, and smoker are all significant predictors of charges. The adjusted R² of 0.7467 indicates a much stronger model, one that now explains 74.67% of the variation in charges. Based on these statistics, this model can be trusted far more than the earlier ones. However, before I make my decision on the question, I would like to look at the remaining variables to see if they play a role.
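One way to see how weak the sex effect is in this model is an approximate 95% confidence interval for the sexmale coefficient, built from the estimate and standard error in the output above. This is a sketch with illustrative variable names; on the fitted model itself, confint(insurance_model3) should give the same interval up to rounding.

```r
# Values copied from the summary output for
# charges ~ sex + age + bmi + smoker
estimate  <- -109.04   # sexmale estimate
std_error <- 334.66    # sexmale standard error
df_resid  <- 1333      # residual degrees of freedom

# Critical t value for a two-sided 95% interval
t_crit <- qt(0.975, df = df_resid)

# Interval: estimate plus or minus t_crit standard errors
ci <- estimate + c(-1, 1) * t_crit * std_error
names(ci) <- c("lower", "upper")

print(round(ci, 1))
```

The interval spans zero, which is consistent with the non-significant p-value for sexmale.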

Examining the Other Variables

I will now add the remaining variables to the model to see if they play a role.

insurance_model4 <- lm(charges ~ sex + age + bmi + smoker + region + children, 
                       data = insurance)
summary(insurance_model4)
## 
## Call:
## lm(formula = charges ~ sex + age + bmi + smoker + region + children, 
##     data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11304.9  -2848.1   -982.1   1393.9  29992.8 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11938.5      987.8 -12.086  < 2e-16 ***
## sexmale           -131.3      332.9  -0.394 0.693348    
## age                256.9       11.9  21.587  < 2e-16 ***
## bmi                339.2       28.6  11.860  < 2e-16 ***
## smokeryes        23848.5      413.1  57.723  < 2e-16 ***
## regionnorthwest   -353.0      476.3  -0.741 0.458769    
## regionsoutheast  -1035.0      478.7  -2.162 0.030782 *  
## regionsouthwest   -960.0      477.9  -2.009 0.044765 *  
## children           475.5      137.8   3.451 0.000577 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7494 
## F-statistic: 500.8 on 8 and 1329 DF,  p-value: < 2.2e-16

Based on this new model, children is a significant factor with a p-value of 0.000577. The picture for region is mixed: relative to the baseline region (northeast), southeast and southwest have significant coefficients (p = 0.030782 and 0.044765) while northwest does not (p = 0.458769), so these individual p-values alone do not establish whether region as a whole is significant. sex still isn’t significant in this model (p = 0.693348).
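Whether region matters as a whole is a question about all three region coefficients at once, which a single F-test comparing nested models can address. The sketch below shows the mechanics on simulated stand-in data, not the insurance data, so its output says nothing about the real result; sim, reduced, full, and region_test are names invented for the illustration.

```r
# Simulated stand-in data (NOT the Kaggle insurance data), used only so the
# sketch runs on its own.
set.seed(1)
n <- 400
sim <- data.frame(
  age = sample(18:64, n, replace = TRUE),
  region = factor(sample(
    c("northeast", "northwest", "southeast", "southwest"),
    n, replace = TRUE
  ))
)
sim$charges <- 3000 + 250 * sim$age + rnorm(n, sd = 5000)

# Fit the model without and with region, then compare them.
reduced <- lm(charges ~ age, data = sim)
full    <- lm(charges ~ age + region, data = sim)

# anova() on nested lm fits reports one F-test for the added terms,
# here the three region dummy variables taken together.
region_test <- anova(reduced, full)
print(region_test)
```

With the real data, the analogous comparison would be between insurance_model4 and the same model refit without region.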

Conclusion

With these models, I have been able to show that yes, men are charged significantly more for health insurance than women on average. However, the reason they are charged more is not their sex, which is insignificant once the confounding variables are taken into account. Instead, the gap appears to come from other factors in the model, such as smoking, BMI, and age, which may differ between the men and women in this data set, along with any number of other factors at play in overall health. For this reason, men are charged significantly more due to factors outside of their sex.
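The path to this conclusion can be summarized by collecting the sexmale estimate, its p-value, and the adjusted R² from each of the four summaries above. The sketch below simply tabulates numbers copied from those outputs; the data frame and column names are made up for the illustration.

```r
# Values copied from the four model summaries in this report
sex_effect <- data.frame(
  model    = c("sex", "sex + age", "sex + age + bmi + smoker",
               "all six predictors"),
  estimate = c(1387.2, 1538.83, -109.04, -131.3),
  p_value  = c(0.0361, 0.0149, 0.745, 0.693348),
  adj_r2   = c(0.002536, 0.09209, 0.7467, 0.7494)
)

# Flag which models find sex significant at the 0.05 cutoff
sex_effect$significant <- sex_effect$p_value < 0.05

print(sex_effect)
```

The sex coefficient is significant only in the two models that leave out bmi and smoker, which are also the two models with the lowest adjusted R².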