Introduction

This report will analyze a data set that provides various details related to people’s health insurance with the goal of determining whether there is evidence that men are charged a significantly higher amount for health insurance than women are.

The name of the data set is “insurance” and it contains data on people’s health insurance charges as well as other relevant information about those people. In total, there are 7 variables and 1338 observations. The data set is available on Kaggle: https://www.kaggle.com/datasets/mirichoi0218/insurance.

The variables are called age, sex, bmi, children, smoker, region, and charges. The information provided by each variable is described below.

age = age of primary beneficiary

sex = insurance contractor gender (male or female)

bmi = body mass index, evaluates weight relative to height

children = number of dependents

smoker = whether the person smokes (yes or no)

region = the beneficiary’s residential area in the US (northeast, southeast, southwest, or northwest)

charges = individual medical costs billed by health insurance

The whole data set is displayed here:

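# readr imports the CSV file; DT renders an interactive table
library(readr)
library(DT)
insurance <- read_csv("insurance.csv")
# scrollX lets the table scroll horizontally so that all 7 columns are reachable
datatable(insurance, options = list(scrollX = TRUE))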

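Three packages are needed for this report: readr, DT, and tidyverse. readr imports the CSV file that contains the data set we are working with, DT allows us to display the data table, and tidyverse provides the pipe (%>%) and the group_by() and summarize() functions that we use to summarize the data. The regression models themselves are built with lm(), which is part of base R.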

library(tidyverse)

The Impact of Gender on Insurance Charges

First, we will look at a simple comparison of the average insurance charges for men and women.

insurance %>% 
  group_by(sex) %>% 
  summarize(`average charges` = mean(charges))
## # A tibble: 2 × 2
##   sex    `average charges`
##   <chr>              <dbl>
## 1 female            12570.
## 2 male              13957.

On average, men appear to be charged more for insurance than women are. However, the difference in mean charges is not guaranteed to be statistically significant. On top of that, there could be confounding variables other than gender that play a role in this difference.

To evaluate the significance of the difference in average charges, we can build a simple regression model using sex as the predictor and charges as the response. The summary of the model contains information which reveals the significance.

insurance_model <- lm(charges~sex, data=insurance)
summary(insurance_model)
## 
## Call:
## lm(formula = charges ~ sex, data = insurance)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12835  -8435  -3980   3476  51201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  12569.6      470.1  26.740   <2e-16 ***
## sexmale       1387.2      661.3   2.098   0.0361 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12090 on 1336 degrees of freedom
## Multiple R-squared:  0.003282,   Adjusted R-squared:  0.002536 
## F-statistic:   4.4 on 1 and 1336 DF,  p-value: 0.03613

According to this model, men are charged significantly more for health insurance than women are, because the p-value of sexmale (0.0361) is under the 0.05 cutoff. However, the model is very weak. The residual standard error (12090) is about as large as the average charges themselves, and the R-squared value (0.003282) means that sex explains only about 0.3% of the variation in charges. While the model tells us that the difference in average charges is significant, it may not be trustworthy. Additionally, we still have not accounted for confounding variables.
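The key figures in the summary above can be reproduced by hand, which makes it clearer where the significance claim comes from. The following is a minimal sketch that uses only the rounded numbers printed in the outputs above, so the results agree with the summary only up to rounding; the variable names are illustrative.

```r
# Rounded values copied from the outputs above
mean_female <- 12570    # average charges for women
mean_male   <- 13957    # average charges for men
estimate    <- 1387.2   # sexmale coefficient
std_error   <- 661.3    # standard error of the sexmale coefficient
df_resid    <- 1336     # residual degrees of freedom

# With a single two-level predictor, the slope is the difference in group means
mean_diff <- mean_male - mean_female

# t statistic and two-sided p-value for the sexmale coefficient
t_value <- estimate / std_error
p_value <- 2 * pt(-abs(t_value), df = df_resid)

print(c(mean_diff = mean_diff, t_value = t_value, p_value = p_value))
```

The difference in means matches the sexmale coefficient up to rounding, and the p-value is the probability of seeing a t statistic at least this extreme if there were no difference between men and women.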

Controlling for Age

One possible confounding variable is age. Health insurance costs increase with age, and it is possible that the men in the data set are older than the women on average.
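Whether the men in the data set really are older on average can be checked directly before fitting a model. The sketch below shows one way to do this in base R; the `toy` data frame is a made-up stand-in with the same column names, not rows from the insurance data, and with the real data one would pass `insurance` instead.

```r
# Toy rows for illustration only; these are NOT taken from the insurance data
toy <- data.frame(
  sex = c("female", "female", "male", "male"),
  age = c(30, 40, 50, 60)
)

# Mean age within each sex
age_by_sex <- aggregate(age ~ sex, data = toy, FUN = mean)
print(age_by_sex)
```

If the two group means turn out to be close, age is unlikely to explain much of the gap in charges on its own.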

We can see how age affects insurance costs by controlling for it in our model. This multiple regression model will have age as a predictor in addition to sex.

insurance_mult_model <- lm(charges ~ sex + age, data=insurance)
summary(insurance_mult_model)
## 
## Call:
## lm(formula = charges ~ sex + age, data = insurance)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8821  -6947  -5511   5443  48203 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2343.62     994.35   2.357   0.0186 *  
## sexmale      1538.83     631.08   2.438   0.0149 *  
## age           258.87      22.47  11.523   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11540 on 1335 degrees of freedom
## Multiple R-squared:  0.09344,    Adjusted R-squared:  0.09209 
## F-statistic:  68.8 on 2 and 1335 DF,  p-value: < 2.2e-16

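While the model still has a small R-squared value (0.09344), meaning it is still weak, it is better than the simple model. The overall p-value of this model is much smaller. The p-values of the two variables show us that the evidence for an effect of age on charges is much stronger than the evidence for an effect of sex. The coefficient of age (258.87) is smaller than that of sexmale (1538.83), but the two are not directly comparable: the age coefficient is the change in charges per additional year of age, while the sexmale coefficient is the difference between a man and a woman of the same age. Note also that sexmale remains significant in this model (p-value 0.0149), so age alone does not explain the difference between men and women.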
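Because the age coefficient is expressed per year and the sexmale coefficient is a single step between two groups, it can help to put them on a common footing. The following is a small worked calculation using the rounded coefficients from the summary above; the 10-year gap is an arbitrary choice made for illustration.

```r
# Rounded coefficients copied from the summary above
coef_sexmale <- 1538.83   # difference between men and women of the same age
coef_age     <- 258.87    # change in charges per additional year of age

# Number of years of age associated with the same change as the sex gap
years_equivalent <- coef_sexmale / coef_age

# Change in charges associated with a 10-year age difference (arbitrary choice)
ten_year_change <- 10 * coef_age

print(c(years_equivalent = years_equivalent, ten_year_change = ten_year_change))
```

Under this model, the estimated gap between men and women corresponds to roughly six years of age, and a 10-year age difference is associated with a larger change in charges than sex is.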

While this model is better, we still have reason to distrust it. The R-squared value is too small to say the model is accurate. To deal with this problem, we can control for more confounding variables.

Testing Every Variable

So far, the only variable we have controlled for is age. The remaining variables in the data set are bmi, smoker, children, and region. By adding all of them to our model, we can see which ones have a significant impact on insurance charges.

insurance_mult_model_3 <- lm(charges ~ sex + age + bmi + smoker + children + region, data=insurance)
summary(insurance_mult_model_3)
## 
## Call:
## lm(formula = charges ~ sex + age + bmi + smoker + children + 
##     region, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11304.9  -2848.1   -982.1   1393.9  29992.8 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11938.5      987.8 -12.086  < 2e-16 ***
## sexmale           -131.3      332.9  -0.394 0.693348    
## age                256.9       11.9  21.587  < 2e-16 ***
## bmi                339.2       28.6  11.860  < 2e-16 ***
## smokeryes        23848.5      413.1  57.723  < 2e-16 ***
## children           475.5      137.8   3.451 0.000577 ***
## regionnorthwest   -353.0      476.3  -0.741 0.458769    
## regionsoutheast  -1035.0      478.7  -2.162 0.030782 *  
## regionsouthwest   -960.0      477.9  -2.009 0.044765 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7494 
## F-statistic: 500.8 on 8 and 1329 DF,  p-value: < 2.2e-16

With every variable in the model, the sex variable has a large p-value (0.693), meaning it does not have a significant impact on charges once the other variables are controlled for. Its coefficient is even slightly negative (-131.3). In contrast, age, bmi, and smoker all have extremely small p-values. While two of the regions have p-values below the 0.05 cutoff, they are not very small, and one region has a p-value above the cutoff. It could be argued that region has a significant impact on charges, but the difference may be due to randomness. The children variable has a fairly small p-value (0.000577), which could mean it has a significant impact on charges. However, this p-value is not nearly as small as the p-values for age, bmi, and smoker, so the evidence for it is weaker. Also, the number of children the people in the data set have may, on average, increase with age. Because of this, we treat a model that uses only age, bmi, and smoker as predictors as our most reliable model.
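Another way to read the sexmale row of the full model is through a confidence interval rather than a p-value. The sketch below computes an approximate 95% interval from the rounded estimate and standard error printed in the summary above; the variable names are illustrative.

```r
# Rounded values copied from the full-model summary above
estimate  <- -131.3   # sexmale coefficient
std_error <- 332.9    # standard error of the sexmale coefficient
df_resid  <- 1329     # residual degrees of freedom

# 95% confidence interval: estimate +/- critical t value * standard error
t_crit <- qt(0.975, df = df_resid)
ci <- estimate + c(-1, 1) * t_crit * std_error
names(ci) <- c("lower", "upper")
print(ci)
```

The interval runs from a negative value to a positive one, so it includes zero, which is consistent with the large p-value for sexmale. With the fitted model object available, `confint(insurance_mult_model_3)` would give the same kind of interval without rounding.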

The Final Model

Our final model, which only uses variables with a significant impact on charges as predictors, looks like this:

insurance_mult_model_4 <- lm(charges ~ age + bmi + smoker, data=insurance)
summary(insurance_mult_model_4)
## 
## Call:
## lm(formula = charges ~ age + bmi + smoker, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12415.4  -2970.9   -980.5   1480.0  28971.8 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -11676.83     937.57  -12.45   <2e-16 ***
## age            259.55      11.93   21.75   <2e-16 ***
## bmi            322.62      27.49   11.74   <2e-16 ***
## smokeryes    23823.68     412.87   57.70   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6092 on 1334 degrees of freedom
## Multiple R-squared:  0.7475, Adjusted R-squared:  0.7469 
## F-statistic:  1316 on 3 and 1334 DF,  p-value: < 2.2e-16
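To illustrate how the coefficients above can be read, the following sketch computes predicted charges by hand from the rounded estimates. The 40-year-old with a BMI of 30 is a hypothetical person chosen for illustration, and `predict_charges` is an illustrative helper, not part of the original analysis.

```r
# Rounded coefficients copied from the summary above
b_intercept <- -11676.83
b_age       <- 259.55
b_bmi       <- 322.62
b_smoker    <- 23823.68

# Predicted charges for a given age, bmi, and smoking status (TRUE or FALSE)
predict_charges <- function(age, bmi, smoker) {
  b_intercept + b_age * age + b_bmi * bmi + b_smoker * as.numeric(smoker)
}

# Hypothetical person: 40 years old with a BMI of 30
pred_nonsmoker <- predict_charges(age = 40, bmi = 30, smoker = FALSE)
pred_smoker    <- predict_charges(age = 40, bmi = 30, smoker = TRUE)
print(c(nonsmoker = pred_nonsmoker, smoker = pred_smoker))
```

The two predictions differ by exactly the smokeryes coefficient, which shows how much larger the estimated effect of smoking is than the effect of a year of age or a unit of BMI.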

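The sex variable is not used as a predictor because we have seen that it does not have a significant impact on charges once the other variables are included. Leaving out sex, children, and region costs very little: the adjusted R-squared falls only from 0.7494 in the full model to 0.7469 here. This means that the data set does not provide evidence that men are charged a significantly higher amount for health insurance than women are.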
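The decision to leave sex out could also be checked with a formal comparison of two nested models, using the base R function anova(). The sketch below demonstrates the mechanics on simulated data, since it is meant to be self-contained; the `toy` data frame and the coefficients used to generate it are assumptions made for illustration, not values from the insurance data, and its output says nothing about the real data set.

```r
set.seed(1)
n <- 200

# Simulated stand-in data with the same column names (NOT the insurance data)
toy <- data.frame(
  age    = sample(18:64, n, replace = TRUE),
  bmi    = rnorm(n, mean = 30, sd = 6),
  smoker = sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2)),
  sex    = sample(c("female", "male"), n, replace = TRUE)
)

# Charges are generated without any effect of sex
toy$charges <- -12000 + 260 * toy$age + 320 * toy$bmi +
  24000 * (toy$smoker == "yes") + rnorm(n, sd = 6000)

# Nested models: the second adds sex to the first
reduced <- lm(charges ~ age + bmi + smoker, data = toy)
full    <- lm(charges ~ age + bmi + smoker + sex, data = toy)

# F-test of whether adding sex significantly reduces the residual sum of squares
comparison <- anova(reduced, full)
print(comparison)
```

With the real data, the same call would be `anova(insurance_mult_model_4, lm(charges ~ age + bmi + smoker + sex, data = insurance))`; a large p-value in the second row would support leaving sex out.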

Conclusion

Through our analysis of various models, we found that three variables have a significant impact on health insurance charges: age, bmi, and smoker. While the other variables in the data set may appear to have some impact on charges, we did not find this impact to be significant, and it can be explained by confounding variables or randomness. So, we can conclude that there is no evidence that men are charged significantly more for health insurance than women.