Introduction

Throughout this analysis, we refer to a data set called insurance, which is simulated based on demographics from the US census [ref]. We use this synthetic data set primarily to demonstrate linear regression. Each entry has 7 variables: age is the age of the primary beneficiary of the insurance policy, sex is either male or female depending on the gender of the holder, bmi is the holder’s body mass index, children is the number of dependent children, smoker is either yes or no depending on the beneficiary’s smoking status, region is the beneficiary’s residential region in the US, and charges is the individual medical cost billed by health insurance. The data can be viewed below:

This report aims to answer the following question:

Does the given data set provide evidence that men are charged significantly more for health insurance than women?

We will need the tidyverse package for its grouped summaries and data visualizations. The rest of the analysis uses regression commands in base R.

library(tidyverse)

Direct Comparison

We can directly compare the mean insurance rate (labeled charges in the data set) between males and females using a grouped summary.

insurance %>% 
  group_by(sex) %>% 
  summarise(mean_charges = mean(charges), count = n())

The data set contains a comparable number of males and females. The mean insurance charge is around \(\$14,000\) for men and about \(\$12,600\) for women. This gap of roughly \(\$1,400\) suggests that men are charged more. We can check whether it is statistically significant by fitting a simple linear regression and inspecting the summary statistics.

simple_insurance_model <- lm(charges ~ sex, data = insurance)
summary(simple_insurance_model)
## 
## Call:
## lm(formula = charges ~ sex, data = insurance)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12835  -8435  -3980   3476  51201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  12569.6      470.1  26.740   <2e-16 ***
## sexmale       1387.2      661.3   2.098   0.0361 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12090 on 1336 degrees of freedom
## Multiple R-squared:  0.003282,   Adjusted R-squared:  0.002536 
## F-statistic:   4.4 on 1 and 1336 DF,  p-value: 0.03613
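
The coefficient table can be read against the grouped summary: with a single two-level predictor, the intercept is the mean of the reference group (female) and the sexmale coefficient is the difference between the two group means. In the output above, 12569.6 + 1387.2 = 13956.8, which matches the male mean of around \(\$14,000\). The following is a minimal sketch of that relationship on a made-up data frame; the name toy and its values are illustrative assumptions, not rows from insurance.

```r
# Toy data (hypothetical values, not taken from the insurance data set)
toy <- data.frame(
  sex     = factor(c("female", "female", "female", "male", "male", "male")),
  charges = c(9000, 12000, 15000, 10000, 14000, 18000)
)

# Group means computed directly
group_means <- tapply(toy$charges, toy$sex, mean)

# Simple regression on the binary factor; "female" is the reference level
# because factor levels are sorted alphabetically by default
toy_model <- lm(charges ~ sex, data = toy)

# The intercept equals the female mean, and the sexmale coefficient
# equals the male mean minus the female mean
coef(toy_model)
group_means["male"] - group_means["female"]
```

The regression therefore adds no new point estimate beyond the grouped summary; what it contributes is the standard error and p-value for the difference.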

The p-value on the variable sexmale is 0.036, which is less than our 0.05 cutoff. This tells us that the mean insurance charges for males are indeed significantly different from those for females. However, just because males have higher charges doesn’t mean their gender is the cause. Let’s introduce another variable: age. Age may be related to both gender and insurance risk. One possible explanation is that men are more likely to need medical attention than women at higher ages. We can test this theory by adding the age variable to our regression model.

Controlling for Age

insurance_model_w_age <- lm(charges ~ sex + age, data = insurance)
summary(insurance_model_w_age)
## 
## Call:
## lm(formula = charges ~ sex + age, data = insurance)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8821  -6947  -5511   5443  48203 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2343.62     994.35   2.357   0.0186 *  
## sexmale      1538.83     631.08   2.438   0.0149 *  
## age           258.87      22.47  11.523   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11540 on 1335 degrees of freedom
## Multiple R-squared:  0.09344,    Adjusted R-squared:  0.09209 
## F-statistic:  68.8 on 2 and 1335 DF,  p-value: < 2.2e-16

The p-value on sexmale is now 0.015, still lower than our 0.05 cutoff. This tells us that for a man and a woman of the same age, the man will on average be charged about \(\$1,500\) more for insurance.
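
To make the "same age" reading explicit, the fitted model can be written out. The \(\beta\) symbols are our own notation for the estimated coefficients and do not appear in the R output:

\[
\widehat{\text{charges}} = \beta_0 + \beta_1 \cdot \text{sexmale} + \beta_2 \cdot \text{age},
\]

where sexmale is 1 for men and 0 for women. For a man and a woman who are both of age \(a\), the difference in predicted charges is

\[
(\beta_0 + \beta_1 + \beta_2 a) - (\beta_0 + \beta_2 a) = \beta_1 \approx 1538.83 .
\]

The age term cancels, so the sexmale coefficient is the estimated gap at any fixed age under this model.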

Controlling for Health

That conclusion still doesn’t help us much. Even though we have controlled for age, we can hardly say that age is the only factor that can affect insurance rates. Another factor affecting insurance charges is health. Our data set has two variables indicative of health: bmi and smoker. It is possible that men are more likely to be smokers than women and/or have a higher BMI, indicating worse health and therefore higher insurance charges. Let’s add these two variables to our regression model.
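
Before fitting that model, the premise itself can be checked with a grouped comparison: the share of smokers and the mean BMI within each sex. The sketch below uses base R on a small made-up data frame; toy and its values are illustrative assumptions, and the same two calls could be pointed at the insurance data instead.

```r
# Toy data (hypothetical values, not taken from the insurance data set)
toy <- data.frame(
  sex    = factor(c("female", "female", "female", "female",
                    "male", "male", "male", "male")),
  smoker = factor(c("no", "no", "no", "yes", "no", "no", "yes", "yes")),
  bmi    = c(24, 27, 30, 31, 26, 29, 32, 33)
)

# Row-wise proportions: within each sex, what share smokes?
smoker_share <- prop.table(table(toy$sex, toy$smoker), margin = 1)
smoker_share[, "yes"]

# Mean BMI within each sex
mean_bmi <- tapply(toy$bmi, toy$sex, mean)
mean_bmi
```

If either quantity differs noticeably between the sexes in the real data, that variable is a plausible confounder of the sex coefficient.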

insurance_model_w_age_health <- lm(charges ~ sex + age + bmi + smoker, data = insurance)
summary(insurance_model_w_age_health)
## 
## Call:
## lm(formula = charges ~ sex + age + bmi + smoker, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12364.7  -2972.2   -983.2   1475.8  29018.3 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -11633.49     947.27 -12.281   <2e-16 ***
## sexmale       -109.04     334.66  -0.326    0.745    
## age            259.45      11.94  21.727   <2e-16 ***
## bmi            323.05      27.53  11.735   <2e-16 ***
## smokeryes    23833.87     414.19  57.544   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6094 on 1333 degrees of freedom
## Multiple R-squared:  0.7475, Adjusted R-squared:  0.7467 
## F-statistic: 986.5 on 4 and 1333 DF,  p-value: < 2.2e-16

Once we control for age and health, there is no longer a significant difference in charges between men and women: the p-value on sexmale is 0.745, and the estimate has dropped to about \(-\$109\). There are still two more variables available for us to test: children and region. It is good practice to use all the data available to us before we draw a definite conclusion.
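
A coefficient that is significant on its own and then vanishes once other variables enter is the typical signature of confounding. The following is a minimal simulated sketch of the mechanism, not a reconstruction of the insurance data: the sample size, smoking probabilities, and dollar amounts are all made-up assumptions, chosen so that charges depend only on smoking while men smoke more often.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
n <- 2000

# Hypothetical setup: men smoke more often, and only smoking drives charges
sex     <- factor(sample(c("female", "male"), n, replace = TRUE))
smoker  <- rbinom(n, size = 1, prob = ifelse(sex == "male", 0.4, 0.1))
charges <- 8000 + 20000 * smoker + rnorm(n, sd = 3000)

sim <- data.frame(sex, smoker, charges)

# Without the confounder, sex picks up part of the smoking effect
naive <- lm(charges ~ sex, data = sim)

# With the confounder included, the sex coefficient shrinks toward zero
adjusted <- lm(charges ~ sex + smoker, data = sim)

coef(naive)["sexmale"]
coef(adjusted)["sexmale"]
```

In this setup the naive sexmale coefficient is large even though sex has no direct effect on charges by construction, and it collapses once smoker is in the model.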

Controlling for Remaining Variables

One variable we have not yet used is region. Perhaps due to economic factors, people in a certain region get charged more for health insurance. Since this is a categorical variable with several levels, we will first compare the distribution of charges across regions visually using boxplots. A plot is not a formal significance test, but it shows whether any region stands out.

ggplot(insurance) +
  geom_boxplot(aes(x = region, y = charges))

The southeast region seems slightly more strongly skewed to the right than the other regions, but overall, none of the regions appear distinctly different from the rest. For example, the medians of all 4 regions are close together, at roughly \(\$9,000\) to \(\$10,000\) in charges. We will proceed without including this variable.
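
Boxplots give a visual impression rather than a formal test. One common way to test a multi-level categorical predictor is to fit the model with and without it and compare the two fits with a partial F-test via anova(). The sketch below shows the mechanics on simulated data; the data frame toy, its region labels, and all numbers are assumptions for illustration, and region has no effect on charges by construction.

```r
set.seed(2)  # arbitrary seed for reproducibility
n <- 400

# Hypothetical data in which region does not influence charges
toy <- data.frame(
  age    = sample(18:64, n, replace = TRUE),
  region = factor(sample(c("northeast", "northwest", "southeast", "southwest"),
                         n, replace = TRUE))
)
toy$charges <- 3000 + 250 * toy$age + rnorm(n, sd = 5000)

# Nested models: the second adds region as a set of dummy variables
without_region <- lm(charges ~ age, data = toy)
with_region    <- lm(charges ~ age + region, data = toy)

# Partial F-test: does adding region reduce the residual variation
# by more than chance alone would explain?
region_test <- anova(without_region, with_region)
region_test
```

A four-level factor adds three dummy coefficients, so the test has 3 numerator degrees of freedom. Applied to the insurance data, a large p-value in the Pr(>F) column would support leaving region out.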

The final variable we should test is children. Unlike region, this is a numerical variable, so we will add it straight to the model as usual and see if its coefficient is significant.

final_insurance_model <- lm(charges ~ sex + age + bmi + smoker + children, data = insurance)
summary(final_insurance_model)
## 
## Call:
## lm(formula = charges ~ sex + age + bmi + smoker + children, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11837.2  -2916.7   -994.2   1375.3  29565.5 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -12052.46     951.26 -12.670  < 2e-16 ***
## sexmale       -128.64     333.36  -0.386 0.699641    
## age            257.73      11.90  21.651  < 2e-16 ***
## bmi            322.36      27.42  11.757  < 2e-16 ***
## smokeryes    23823.39     412.52  57.750  < 2e-16 ***
## children       474.41     137.86   3.441 0.000597 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6070 on 1332 degrees of freedom
## Multiple R-squared:  0.7497, Adjusted R-squared:  0.7488 
## F-statistic:   798 on 5 and 1332 DF,  p-value: < 2.2e-16

The number of dependent children is also a significant predictor of health insurance charges (p-value about 0.0006), while the p-value on sexmale remains high. We have now considered all 6 possible explanatory variables, and are ready to give a final assessment.
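
To compare how one coefficient changes from model to model, its row can be pulled out of each summary instead of reading the full printouts. The following is a minimal sketch, assuming a helper we call coef_row (our own name, not a base R function) and a simulated data frame toy with made-up values; the same helper could be applied to the models fitted in this report.

```r
# Helper (our own name, not part of base R): extract the estimate and
# p-value for one coefficient from a fitted lm
coef_row <- function(model, term = "sexmale") {
  coef(summary(model))[term, c("Estimate", "Pr(>|t|)")]
}

set.seed(3)  # arbitrary seed for reproducibility
n <- 500

# Hypothetical data: charges depend on age only
toy <- data.frame(
  sex = factor(sample(c("female", "male"), n, replace = TRUE)),
  age = sample(18:64, n, replace = TRUE)
)
toy$charges <- 3000 + 250 * toy$age + rnorm(n, sd = 5000)

models <- list(
  sex_only = lm(charges ~ sex, data = toy),
  sex_age  = lm(charges ~ sex + age, data = toy)
)

# One row per model, with columns Estimate and Pr(>|t|)
comparison <- t(sapply(models, coef_row))
comparison
```

Laying the rows side by side makes it easy to see at which step the sexmale estimate changes sign or loses significance.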

Conclusion

Although a direct comparison does suggest that men get charged significantly more for health insurance than women, we have determined that other variables, acting as confounders, better explain the variation in insurance charges. In fact, when controlling for age, health, and the number of children, the estimated coefficient for men was slightly negative (about \(-\$129\)). The p-value on sexmale was 0.70, far above our 0.05 cutoff. Therefore, for a male and female of comparable age, health, and number of children, the data provide no evidence of a significant difference in insurance charges.