knitr::opts_chunk$set(echo = TRUE, message=FALSE, warning=FALSE)

Introduction

This report seeks to answer the following question:

Does the insurance data set provide evidence that men are charged significantly more for health insurance than women?

I will be using a data set called insurance, obtained from https://www.kaggle.com/datasets/mirichoi0218/insurance. It contains health insurance information for 1,338 people, described by 7 variables: age (age of primary beneficiary), sex (insurance contractor gender), bmi (body mass index), children (number of children covered by health insurance), smoker (whether the person smokes), region (the beneficiary’s residential area in the U.S.), and charges (individual medical costs billed by health insurance).

Throughout this report, I will need the tidyverse and readr packages.

library(tidyverse)
library(readr)

Here is the full data table:

insurance <- read_csv("insurance.csv")

insurance

Average Insurance Charges By Gender

To find out whether men are charged significantly more than women for health insurance, I will first look at the average insurance charges for men and women, and then fit a regression model of charges on sex.

insurance %>%
  group_by(sex) %>%
  summarize(mean(charges))
insurance_model <- lm(charges ~ sex, data = insurance)

summary(insurance_model)
## 
## Call:
## lm(formula = charges ~ sex, data = insurance)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12835  -8435  -3980   3476  51201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  12569.6      470.1  26.740   <2e-16 ***
## sexmale       1387.2      661.3   2.098   0.0361 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12090 on 1336 degrees of freedom
## Multiple R-squared:  0.003282,   Adjusted R-squared:  0.002536 
## F-statistic:   4.4 on 1 and 1336 DF,  p-value: 0.03613
coef(insurance_model)
## (Intercept)     sexmale 
##   12569.579    1387.172

Based on the averages alone, men are charged more than women. The coefficients of the regression model show that, on average, men are charged about 1,387 more than women.
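The link between the group averages and the regression coefficient can be illustrated on a tiny example. The sketch below uses a made-up data frame called toy (its values are invented for illustration and are not taken from insurance.csv) to show that, with a single two-level predictor, the sexmale coefficient from lm is the difference between the two group means.

```r
# Toy data, invented purely for illustration (not from insurance.csv)
toy <- data.frame(
  sex     = c("female", "female", "female", "male", "male", "male"),
  charges = c(1000, 2000, 3000, 2000, 4000, 6000)
)

# Group means computed directly
group_means <- tapply(toy$charges, toy$sex, mean)
mean_diff   <- unname(group_means["male"] - group_means["female"])

# Coefficient from a regression on the same two-level predictor
toy_model <- lm(charges ~ sex, data = toy)
slope     <- unname(coef(toy_model)["sexmale"])

mean_diff  # difference in group means
slope      # sexmale coefficient, the same quantity
```

The intercept of such a model is the mean of the reference group (female here, because R orders factor levels alphabetically by default).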

The difference in averages is statistically significant at the 5% level: the p-value for the sexmale coefficient is 0.0361, which is below the 5% cutoff. However, the residual standard error (RSE) is 12090 on 1336 degrees of freedom, which is large compared with the 1,387 difference, and the adjusted R-squared value is 0.002536, so sex on its own explains well under 1% of the variation in charges.
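As a check on where that p-value comes from, it can be recomputed by hand from the summary output. The sketch below takes the estimate, standard error, and residual degrees of freedom printed above (rounded as shown there) and applies the usual two-sided t-test; the variable names are my own.

```r
# Values copied from the summary(insurance_model) output above
estimate  <- 1387.2   # sexmale coefficient
std_error <- 661.3    # its standard error
df_resid  <- 1336     # residual degrees of freedom

# t statistic: how many standard errors the estimate is from zero
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_resid)

t_value
p_value
```

Because the inputs are rounded, the results should agree with the printed t value and Pr(>|t|) only to about three significant figures.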

Although the difference is significant, there is reason to be cautious about this result: the R-squared value shows that this model explains very little of the variation in charges, so variables that are not in the model could be driving the difference. I would trust the model more if this value were closer to 1.

Considering Age’s Effect On Charges

Now that I have an idea of how to answer this question, I want to make sure that I consider confounding variables that may also have an effect on what men and women are charged for health insurance. To do so, I will first fit a regression model with age as a possible confounding variable.

age might be a confounding variable if it is associated with sex in this data and also has an effect on insurance charges. Older people may be charged more simply because they are prone to more health issues than much younger people. Here is my regression model that includes age.
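One rough way to check whether a candidate variable is associated with sex is to compare it between the two groups. The sketch below defines a hypothetical helper, compare_by_sex (not part of any package), and runs it on a small made-up data frame rather than on the insurance data.

```r
# Hypothetical helper: compares a numeric variable between the two sexes
compare_by_sex <- function(data, variable) {
  female <- data[[variable]][data$sex == "female"]
  male   <- data[[variable]][data$sex == "male"]
  test   <- t.test(male, female)  # Welch two-sample t-test (R's default)
  list(
    mean_female = mean(female),
    mean_male   = mean(male),
    p_value     = test$p.value
  )
}

# Toy data, invented for illustration only
toy <- data.frame(
  sex = rep(c("female", "male"), each = 5),
  age = c(20, 30, 40, 50, 60, 21, 31, 41, 51, 61)
)

result <- compare_by_sex(toy, "age")
result
```

On the real data, the same helper could be called as compare_by_sex(insurance, "age"). A large p-value would suggest that age differs little between men and women, in which case adding age should change the sexmale coefficient only slightly.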

insurance_model2 <- lm(charges ~ sex + age, data = insurance)

summary(insurance_model2)
## 
## Call:
## lm(formula = charges ~ sex + age, data = insurance)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8821  -6947  -5511   5443  48203 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2343.62     994.35   2.357   0.0186 *  
## sexmale      1538.83     631.08   2.438   0.0149 *  
## age           258.87      22.47  11.523   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11540 on 1335 degrees of freedom
## Multiple R-squared:  0.09344,    Adjusted R-squared:  0.09209 
## F-statistic:  68.8 on 2 and 1335 DF,  p-value: < 2.2e-16
coef(insurance_model2)
## (Intercept)     sexmale         age 
##   2343.6249   1538.8314    258.8651

This new model says that, holding age constant, charges are about 1538.83 higher when the person is male, and that, holding sex constant, charges increase by about 258.87 for each additional year of age. Both effects are significant: the p-value for the sexmale coefficient is 0.0149, which is below the 5% cutoff, and the p-value for age is less than 2e-16, which is far below the cutoff.
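To make the meaning of these coefficients concrete, the fitted equation can be evaluated by hand. The sketch below copies the three coefficients from the coef(insurance_model2) output above; predict_charges is a hypothetical helper written for this illustration.

```r
# Coefficients copied from the coef(insurance_model2) output above
intercept <- 2343.6249
b_male    <- 1538.8314
b_age     <- 258.8651

# Hypothetical helper: fitted charges for a given sex (1 = male, 0 = female) and age
predict_charges <- function(is_male, age) {
  intercept + b_male * is_male + b_age * age
}

male_40   <- predict_charges(is_male = 1, age = 40)
female_40 <- predict_charges(is_male = 0, age = 40)
female_41 <- predict_charges(is_male = 0, age = 41)

male_40 - female_40    # equals the sexmale coefficient
female_41 - female_40  # equals the age coefficient (one extra year)
```

The two differences show what "holding the other variable constant" means: changing only sex moves the fitted value by the sexmale coefficient, and changing only age by one year moves it by the age coefficient.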

I could still have reason to doubt this model because the adjusted R-squared value is only 0.09209, meaning sex and age together explain about 9% of the variation in charges. I would be more confident if this value were closer to 1.

Finding More Confounding Variables

Since age has a significant effect on charges, I would like to see whether any other variables in the data could be confounders as well. Including them could give me a stronger model to base my results on.

The variables I will check are bmi and smoker.

insurance_model3 <- lm(charges ~ sex + age + bmi + smoker, data = insurance)

summary(insurance_model3)
## 
## Call:
## lm(formula = charges ~ sex + age + bmi + smoker, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12364.7  -2972.2   -983.2   1475.8  29018.3 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -11633.49     947.27 -12.281   <2e-16 ***
## sexmale       -109.04     334.66  -0.326    0.745    
## age            259.45      11.94  21.727   <2e-16 ***
## bmi            323.05      27.53  11.735   <2e-16 ***
## smokeryes    23833.87     414.19  57.544   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6094 on 1333 degrees of freedom
## Multiple R-squared:  0.7475, Adjusted R-squared:  0.7467 
## F-statistic: 986.5 on 4 and 1333 DF,  p-value: < 2.2e-16
coef(insurance_model3)
## (Intercept)     sexmale         age         bmi   smokeryes 
## -11633.4946   -109.0411    259.4532    323.0511  23833.8700

This new model tells me that, holding the other variables constant, charges are about 109.04 lower when someone is male. Charges increase by about 259.45 for each additional year of age and by about 323.05 for each additional unit of bmi, and smokers are charged about 23833.87 more than non-smokers. The decrease associated with being male is not significant: the coefficient p-value is 0.745, which is far above the 5% cutoff. The age, bmi, and smoker coefficients are all significant, with p-values below 2e-16.

I have more confidence in this comparison. The RSE for this model is 6094 on 1333 degrees of freedom, which is about half that of the first regression model. The adjusted R-squared value is 0.7467, which is much closer to 1 than in the earlier models, so this model explains about 75% of the variation in charges. The overall model p-value is below 2.2e-16, so the model as a whole is highly significant.
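The two fit measures quoted here can be pulled out of a fitted model directly, which makes it easier to compare models side by side. The sketch below defines a hypothetical helper, fit_stats, and applies it to two models fitted on a small made-up data frame (not the insurance data).

```r
# Hypothetical helper: residual standard error and adjusted R-squared of an lm fit
fit_stats <- function(model) {
  s <- summary(model)
  c(rse = s$sigma, adj_r2 = s$adj.r.squared)
}

# Toy data, invented for illustration only: y depends strongly on x, weakly on group
toy <- data.frame(
  group = rep(c("a", "b"), times = 4),
  x     = 1:8,
  y     = c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1)
)

small_model <- lm(y ~ group, data = toy)
big_model   <- lm(y ~ group + x, data = toy)

stats_small <- fit_stats(small_model)
stats_big   <- fit_stats(big_model)

rbind(small = stats_small, big = stats_big)
```

On the real models, rbind(fit_stats(insurance_model), fit_stats(insurance_model3)) would be expected to reproduce the RSE and adjusted R-squared values printed in the summaries above.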

Testing The Last Variables

Before I reach a final conclusion about whether men are charged more for health insurance than women, I want to test the two remaining variables to see if they are confounding as well.

These last two variables are children and region.

insurance_model4 <- lm(charges ~ sex + age + children + region, data = insurance)

summary(insurance_model4)
## 
## Call:
## lm(formula = charges ~ sex + age + children + region, data = insurance)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -9701  -6859  -5088   4733  48065 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1945.88    1159.19   1.679   0.0935 .  
## sexmale          1479.25     628.64   2.353   0.0188 *  
## age               257.58      22.39  11.502   <2e-16 ***
## children          574.91     261.17   2.201   0.0279 *  
## regionnorthwest -1017.27     902.49  -1.127   0.2599    
## regionsoutheast  1388.07     877.72   1.581   0.1140    
## regionsouthwest -1160.05     902.43  -1.285   0.1989    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11490 on 1331 degrees of freedom
## Multiple R-squared:  0.1037, Adjusted R-squared:  0.0997 
## F-statistic: 25.68 on 6 and 1331 DF,  p-value: < 2.2e-16
coef(insurance_model4)
##     (Intercept)         sexmale             age        children regionnorthwest 
##       1945.8758       1479.2521        257.5801        574.9104      -1017.2683 
## regionsoutheast regionsouthwest 
##       1388.0650      -1160.0460

When I add these variables to a model with sex and age, the regression shows that the number of children someone has can have a significant impact on charges, but the region they live in does not. This is shown by the p-values for the coefficients. The children coefficient p-value is 0.0279, which is below the 5% cutoff. On the other hand, the region variable has three coefficients (one for each region compared with the reference region) with p-values of 0.2599, 0.1140, and 0.1989, which are all above the 5% cutoff.
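Because region enters the model as several coefficients, a complementary check is a single F-test of the variable as a whole, comparing the model with and without it. The sketch below shows the mechanics with anova() on simulated toy data; the data and the model names are made up for illustration and say nothing about the insurance results.

```r
set.seed(1)  # toy data only: simulated values, not from insurance.csv
n <- 80
toy <- data.frame(
  age    = sample(18:64, n, replace = TRUE),
  region = factor(rep(c("northeast", "northwest", "southeast", "southwest"),
                      length.out = n))
)
# In this simulation, charges depend on age but not on region
toy$charges <- 2000 + 250 * toy$age + rnorm(n, sd = 3000)

without_region <- lm(charges ~ age, data = toy)
with_region    <- lm(charges ~ age + region, data = toy)

# Nested-model F-test: do the three region coefficients jointly improve the fit?
region_test <- anova(without_region, with_region)
region_test
```

With the real data, the same comparison could be made between lm(charges ~ sex + age + children, data = insurance) and insurance_model4. The second row of the table gives one p-value for region as a whole, on 3 degrees of freedom.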

Final Regression Model & Conclusion

Now that I have decided which variables are confounding and have an effect on my analysis, I can create a final regression model to answer my question of whether men are charged more than women for health insurance.

insurance_model5 <- lm(charges ~ sex + age + bmi + smoker + children, data = insurance)

summary(insurance_model5)
## 
## Call:
## lm(formula = charges ~ sex + age + bmi + smoker + children, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11837.2  -2916.7   -994.2   1375.3  29565.5 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -12052.46     951.26 -12.670  < 2e-16 ***
## sexmale       -128.64     333.36  -0.386 0.699641    
## age            257.73      11.90  21.651  < 2e-16 ***
## bmi            322.36      27.42  11.757  < 2e-16 ***
## smokeryes    23823.39     412.52  57.750  < 2e-16 ***
## children       474.41     137.86   3.441 0.000597 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6070 on 1332 degrees of freedom
## Multiple R-squared:  0.7497, Adjusted R-squared:  0.7488 
## F-statistic:   798 on 5 and 1332 DF,  p-value: < 2.2e-16
coef(insurance_model5)
## (Intercept)     sexmale         age         bmi   smokeryes    children 
## -12052.4620   -128.6399    257.7350    322.3642  23823.3925    474.4111

My final answer to this question is no: once the confounding variables are accounted for, the data set does not provide evidence that men are charged significantly more for health insurance than women are. The first model, which ignored every other variable, showed men being charged about 1,387 more, but that gap disappears when I control for age, bmi, smoker, and children. In my most trustworthy regression model, the coefficient p-value for males is 0.699641, which is far above the 5% cutoff. This means that the charged value decreasing by 128.64 when the person is male is not a significant difference. This model is strong because the RSE is 6070 on 1332 degrees of freedom, which is the lowest of all my models. The adjusted R-squared value is 0.7488, which is the closest to 1 of all my models. Finally, the overall p-value for this model is below 2.2e-16, so the model as a whole is highly significant.
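Another way to read the sexmale result is through an approximate 95% confidence interval, built from the estimate and standard error printed in the summary above. The sketch below uses those rounded values and the residual degrees of freedom; the variable names are my own.

```r
# Values copied from the summary(insurance_model5) output above
estimate  <- -128.64   # sexmale coefficient
std_error <- 333.36    # its standard error
df_resid  <- 1332      # residual degrees of freedom

# 95% confidence interval: estimate +/- critical t value * standard error
t_crit <- qt(0.975, df = df_resid)
ci     <- estimate + c(-1, 1) * t_crit * std_error

ci
```

The interval runs from roughly 780 below zero to roughly 525 above zero. Because it contains zero, it is consistent with the p-value of 0.699641: the data do not distinguish the sex effect in this model from no effect at all. On the fitted model itself, confint(insurance_model5, "sexmale") should give the same interval without rounding.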