Introduction

This report seeks to answer the following question:

Does the health insurance data provide statistically significant evidence that men are charged more for health insurance than women?

We will be using a data set called insurance, obtained from https://www.kaggle.com/datasets/mirichoi0218/insurance. The data includes 7 variables and 1,338 entries. The variables are age (age of the primary beneficiary), sex (insurance contractor gender), bmi (body mass index), children (number of children covered by the health insurance), smoker (whether the beneficiary smokes), region (the beneficiary’s residential area in the U.S.), and charges (individual medical costs billed by health insurance).

Throughout, we will need the tidyverse and modelr packages for data manipulation and model helpers; the models themselves are fit with base R's lm().

library(tidyverse)
library(modelr)
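The code in this report assumes that an object named insurance already exists. A minimal sketch of one way to create it, assuming the Kaggle CSV has been downloaded locally (the file name insurance.csv and the helper load_insurance() are assumptions for illustration, not part of the original analysis):

```r
# Hypothetical helper: read the CSV and confirm that the seven columns
# described in the introduction are present before any modelling is done.
load_insurance <- function(path) {
  expected <- c("age", "sex", "bmi", "children", "smoker", "region", "charges")
  data <- read.csv(path, stringsAsFactors = FALSE)
  missing_cols <- setdiff(expected, names(data))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  data
}

# Assumed local file name; adjust to wherever the download was saved.
csv_path <- "insurance.csv"
if (file.exists(csv_path)) {
  insurance <- load_insurance(csv_path)
}
```

Checking the column names up front means a wrong or truncated file fails immediately rather than partway through the modelling.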

Average Insurance Charges by Gender

To identify whether men are charged significantly more than women for health insurance, we will first look at the average insurance charges for males and females.

insurance %>% 
  group_by(sex) %>% 
  summarize(mean(charges))
## # A tibble: 2 × 2
##   sex    `mean(charges)`
##   <chr>            <dbl>
## 1 female          12570.
## 2 male            13957.

Based on these averages alone, men appear to be charged more. However, we should fit a linear regression model to test whether the difference is statistically significant. This model is created as follows:

insurance_model <- lm(charges ~ sex, data = insurance)
summary(insurance_model)
## 
## Call:
## lm(formula = charges ~ sex, data = insurance)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12835  -8435  -3980   3476  51201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  12569.6      470.1  26.740   <2e-16 ***
## sexmale       1387.2      661.3   2.098   0.0361 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12090 on 1336 degrees of freedom
## Multiple R-squared:  0.003282,   Adjusted R-squared:  0.002536 
## F-statistic:   4.4 on 1 and 1336 DF,  p-value: 0.03613

The coefficients of the regression model also indicate that men are charged more. The intercept (12569.6) is the average charge for women, and the sexmale coefficient shows that men are charged $1,387.20 more than women on average.
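With a single two-level predictor, the regression coefficients are just the group means rewritten. A small sketch on toy data (the six values below are made up for illustration and are not taken from the insurance data) shows the relationship:

```r
# Toy data (hypothetical values, NOT from the insurance data set).
toy <- data.frame(
  sex     = c("female", "female", "female", "male", "male", "male"),
  charges = c(100, 200, 300, 250, 350, 450)
)

# Group means, as computed with group_by() and summarize() in the report.
group_means <- tapply(toy$charges, toy$sex, mean)

# Regression on the same data: "female" is the baseline level.
coefs <- coef(lm(charges ~ sex, data = toy))

# The intercept equals the female mean; the sexmale coefficient equals
# the male mean minus the female mean.
mean_diff <- unname(group_means["male"] - group_means["female"])
slope     <- unname(coefs["sexmale"])
intercept <- unname(coefs["(Intercept)"])
```

Here mean_diff and slope agree, which is why the regression adds a significance test to the comparison of averages without changing the estimated difference.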

The difference is statistically significant at the 5% level: the p-value for the sexmale coefficient is 0.0361 (the overall p-value of 0.03613 reflects the same test in a one-predictor model), which falls below the 0.05 cutoff. The residual standard error (RSE) is 12090 on 1336 degrees of freedom. The RSE measures the typical size of the prediction errors rather than significance, and here it is about as large as the average charge itself. Moreover, the adjusted R-squared is 0.002536, meaning that only about 0.25% of the variance in charges is explained by the model.
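The t value and p-value in the coefficient table can be recomputed from the estimate and its standard error. The sketch below uses only the rounded numbers printed in the summary, so it should agree with the reported 2.098 and 0.0361 up to rounding:

```r
# Values copied from the sexmale row of the summary output.
estimate  <- 1387.2
std_error <- 661.3
df_resid  <- 1336   # residual degrees of freedom

# The t statistic is the estimate divided by its standard error.
t_value <- estimate / std_error

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_value), df = df_resid)

round(c(t = t_value, p = p_value), 4)
```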

Although this model technically shows statistical significance, other variables in the data set may also affect charges. The adjusted R-squared is extremely low, so these other variables should be tested to find a more trustworthy model.

Changes when Age is Included

Now that we have identified a possible statistically significant association between gender and insurance charges, we can consider confounding variables that may affect these charges. A person's age may affect their charges, as an older person may be more prone to health conditions or make more frequent insurance claims than a younger person.

To begin this process, we will include age in the regression model as a predictor variable:

insurance_model2 <- lm(charges ~ sex + age, data = insurance)
summary(insurance_model2)
## 
## Call:
## lm(formula = charges ~ sex + age, data = insurance)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8821  -6947  -5511   5443  48203 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2343.62     994.35   2.357   0.0186 *  
## sexmale      1538.83     631.08   2.438   0.0149 *  
## age           258.87      22.47  11.523   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11540 on 1335 degrees of freedom
## Multiple R-squared:  0.09344,    Adjusted R-squared:  0.09209 
## F-statistic:  68.8 on 2 and 1335 DF,  p-value: < 2.2e-16

This new model shows some change in the coefficients: men are charged $1,538.83 more than women of the same age. In addition, each additional year of age is associated with a $258.87 increase in charges, regardless of gender.

Both of these effects are significant, as the p-values for the sexmale and age coefficients are 0.0149 and <2e-16 respectively, both smaller than the 0.05 cutoff. However, the adjusted R-squared is still low, 0.09209, which leads us to believe there are more confounding variables in addition to age.

Testing for Other Confounding Variables

Considering that age has shown a significant effect on health insurance charges, it is fair to look into the other variables contained in the data set.

Two variables in the data are directly related to a person's health and would most likely affect how much they are charged for health insurance. These variables are bmi and smoker, which are added in the following model:

insurance_model3 <- lm(charges ~ sex + age + bmi + smoker, data = insurance)
summary(insurance_model3)
## 
## Call:
## lm(formula = charges ~ sex + age + bmi + smoker, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12364.7  -2972.2   -983.2   1475.8  29018.3 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -11633.49     947.27 -12.281   <2e-16 ***
## sexmale       -109.04     334.66  -0.326    0.745    
## age            259.45      11.94  21.727   <2e-16 ***
## bmi            323.05      27.53  11.735   <2e-16 ***
## smokeryes    23833.87     414.19  57.544   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6094 on 1333 degrees of freedom
## Multiple R-squared:  0.7475, Adjusted R-squared:  0.7467 
## F-statistic: 986.5 on 4 and 1333 DF,  p-value: < 2.2e-16

These additions have drastically changed the estimated difference between men and women: this model estimates that men are charged $109.04 less than women with the same BMI, age, and smoking status. Yet this difference is not statistically significant, as the p-value is 0.745, far over the 0.05 cutoff.
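A coefficient can shrink or change sign like this when a newly added predictor is unevenly distributed between the groups. The sketch below shows the mechanism on toy data in which charges depend only on smoking, but the men happen to smoke more often. The data are entirely made up, and the report does not establish whether smoking rates differ by sex in the insurance data:

```r
# Toy data (hypothetical): 3 of 4 men smoke, but only 1 of 4 women.
toy <- data.frame(
  sex    = rep(c("female", "male"), each = 4),
  smoker = c("no", "no", "no", "yes", "no", "yes", "yes", "yes")
)

# Charges are driven by smoking alone; sex has no direct effect.
toy$charges <- 1000 + 5000 * (toy$smoker == "yes")

# Without smoker in the model, sex picks up the smoking effect.
naive <- unname(coef(lm(charges ~ sex, data = toy))["sexmale"])

# With smoker included, the sex coefficient drops to (numerically) zero.
adjusted <- unname(coef(lm(charges ~ sex + smoker, data = toy))["sexmale"])
```

In this toy case the naive model attributes a large difference to sex even though none exists once smoking is held constant.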

This model also shows that each additional year of age increases the charge by $259.45, each additional unit of BMI increases the charge by $323.05, and smokers are charged $23,833.87 more than non-smokers, in each case holding the other variables constant. All three of these coefficients are statistically significant, with p-values <2e-16.
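To make the coefficients concrete, they can be combined into a prediction. The sketch below hard-codes the estimates printed in the summary; the function name predict_charges() and the example person (age 40, BMI 30) are illustrative assumptions:

```r
# Predicted charges from the printed coefficients of the
# sex + age + bmi + smoker model. Logical arguments count as 0 or 1.
predict_charges <- function(age, bmi, male = FALSE, smoker = FALSE) {
  -11633.49 - 109.04 * male + 259.45 * age + 323.05 * bmi + 23833.87 * smoker
}

# Hypothetical person: 40 years old with a BMI of 30.
female_nonsmoker <- predict_charges(age = 40, bmi = 30)
male_nonsmoker   <- predict_charges(age = 40, bmi = 30, male = TRUE)
female_smoker    <- predict_charges(age = 40, bmi = 30, smoker = TRUE)

# Differences between otherwise identical people equal the coefficients.
sex_gap    <- male_nonsmoker - female_nonsmoker
smoker_gap <- female_smoker - female_nonsmoker
```

With the data loaded, predict(insurance_model3, newdata = ...) would do the same job using the unrounded coefficients.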

Overall, this model is more trustworthy than the previous models. Its overall p-value is < 2.2e-16, and the RSE is 6094 on 1333 degrees of freedom, roughly half that of the original model. Lastly, the adjusted R-squared is 0.7467, meaning that 74.67% of the variance in charges is explained by the model, far higher than in the previous models.
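Each summary reports both a multiple and an adjusted R-squared. The adjusted value penalizes the multiple R-squared for the number of predictors, using the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1). A short sketch that checks the formula against the printed output (inputs are the rounded values from the summaries, so agreement is only up to rounding):

```r
# Adjusted R-squared from multiple R-squared, sample size n,
# and number of predictors p (not counting the intercept).
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 1338  # number of entries in the data set

adj_model1 <- adjusted_r2(0.003282, n, p = 1)  # charges ~ sex
adj_model3 <- adjusted_r2(0.7475,   n, p = 4)  # sex + age + bmi + smoker

round(c(model1 = adj_model1, model3 = adj_model3), 4)
```

With 1,338 observations and only a few predictors the penalty is small, which is why the two R-squared values differ only in the third or fourth decimal place.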

Testing the Last Variables

Before final conclusions can be drawn it is important to test the last variables within the data set to confirm the strengths of our previous model and identify any other variables that need to be included.

By including children and region, we will get a glimpse of how a person's local environment may affect their health insurance charges.

insurance_model4 <- lm(charges ~ sex + age + smoker + bmi + children + region, data = insurance)
summary(insurance_model4)
## 
## Call:
## lm(formula = charges ~ sex + age + smoker + bmi + children + 
##     region, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11304.9  -2848.1   -982.1   1393.9  29992.8 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11938.5      987.8 -12.086  < 2e-16 ***
## sexmale           -131.3      332.9  -0.394 0.693348    
## age                256.9       11.9  21.587  < 2e-16 ***
## smokeryes        23848.5      413.1  57.723  < 2e-16 ***
## bmi                339.2       28.6  11.860  < 2e-16 ***
## children           475.5      137.8   3.451 0.000577 ***
## regionnorthwest   -353.0      476.3  -0.741 0.458769    
## regionsoutheast  -1035.0      478.7  -2.162 0.030782 *  
## regionsouthwest   -960.0      477.9  -2.009 0.044765 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7494 
## F-statistic: 500.8 on 8 and 1329 DF,  p-value: < 2.2e-16

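When these variables are included, the fit of the model barely improves: the adjusted R-squared rises only from 0.7467 to 0.7494, and the RSE falls only from 6094 to 6062. Each additional child increases the charge by $475.50, which is statistically significant (p = 0.000577). The region coefficients are measured relative to the baseline region that has no coefficient of its own (northeast in this data set). Northwest decreases the charge by $353.00 but is not significant (p = 0.459), while southeast and southwest decrease the charge by $1,035.00 and $960.00 and fall just under the 0.05 cutoff (p = 0.031 and p = 0.045). The sexmale coefficient remains small and non-significant (-131.3, p = 0.693).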

Testing these last two variables shows that excluding both of them from the model has minimal impact on its fit and on the estimated effect of sex.
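The two candidate models can be set side by side using the figures printed in their summaries. This is a sketch for comparison only; the data frame below is typed in by hand from the output above rather than computed from the data:

```r
# Summary figures copied from the two model outputs.
models <- data.frame(
  model        = c("four_predictors", "all_predictors"),
  adj_r2       = c(0.7467, 0.7494),
  rse          = c(6094, 6062),
  sex_estimate = c(-109.04, -131.3),
  sex_se       = c(334.66, 332.9),
  sex_p        = c(0.745, 0.693348)
)

# Change from the four-predictor model to the all-predictor model.
changes <- c(
  adj_r2       = diff(models$adj_r2),
  rse          = diff(models$rse),
  sex_estimate = diff(models$sex_estimate)
)

# Shift in the sex coefficient, in units of its standard error.
shift_in_se <- abs(changes[["sex_estimate"]]) / models$sex_se[1]
```

The sex coefficient moves by far less than one standard error, so the answer to the report's question does not depend on whether children and region are included.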

Final Model & Conclusion

Now that all variables have been tested and analyzed, the final regression model can be created and the final conclusion can be drawn.

final_model <- lm(charges ~ sex + age + bmi + smoker, data = insurance)
summary(final_model)
## 
## Call:
## lm(formula = charges ~ sex + age + bmi + smoker, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12364.7  -2972.2   -983.2   1475.8  29018.3 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -11633.49     947.27 -12.281   <2e-16 ***
## sexmale       -109.04     334.66  -0.326    0.745    
## age            259.45      11.94  21.727   <2e-16 ***
## bmi            323.05      27.53  11.735   <2e-16 ***
## smokeryes    23833.87     414.19  57.544   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6094 on 1333 degrees of freedom
## Multiple R-squared:  0.7475, Adjusted R-squared:  0.7467 
## F-statistic: 986.5 on 4 and 1333 DF,  p-value: < 2.2e-16

In summary, there is not enough evidence to support the claim that men are charged more than women for health insurance. In the final model the coefficient for males is -109.04, meaning men would be charged $109.04 less than women of the same age, BMI, and smoking status. However, the p-value of this coefficient is 0.745, much larger than the 0.05 cutoff, so the difference is not statistically significant. The strength of the model can be judged from the RSE of 6094 on 1333 degrees of freedom, about half that of the original sex-only model. The adjusted R-squared is 0.7467, meaning that 74.67% of the variance in charges is explained by the model. Finally, the p-value of the model is < 2.2e-16, making the model as a whole statistically significant.
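The same conclusion can be read from a 95% confidence interval for the sexmale coefficient. The sketch below builds the interval from the printed estimate and standard error; with the fitted model available, confint(final_model) would give the interval directly:

```r
# Values copied from the sexmale row of the final model summary.
estimate  <- -109.04
std_error <- 334.66
df_resid  <- 1333

# Margin of error: t critical value times the standard error.
margin <- qt(0.975, df = df_resid) * std_error

ci_95 <- c(lower = estimate - margin, upper = estimate + margin)

# An interval that spans zero is consistent with no difference.
contains_zero <- ci_95[["lower"]] < 0 && ci_95[["upper"]] > 0
```

Because the interval includes zero, the data are compatible with men being charged somewhat less, the same, or somewhat more than comparable women.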