This report will analyze a data set that provides various details related to people’s health insurance with the goal of determining whether there is evidence that men are charged a significantly higher amount for health insurance than women are.
The data set is called “insurance” and it contains data on people’s health insurance charges along with other relevant information about those people. In total, there are 7 variables and 1338 observations. The data set is available on Kaggle: https://www.kaggle.com/datasets/mirichoi0218/insurance.
The variables are called age, sex, bmi, children, smoker, region, and charges. The information provided by each variable is described below.
age = age of primary beneficiary
sex = insurance contractor gender (male or female)
bmi = body mass index, evaluates weight relative to height
children = number of dependents
smoker = whether the person smokes (yes or no)
region = the beneficiary’s residential area in the US (northeast, southeast, southwest, or northwest)
charges = individual medical costs billed by health insurance
The whole data set is displayed here:
library(readr)
library(DT)
insurance <- read_csv("insurance.csv")
datatable(insurance, options=list(scrollX=TRUE))
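Three packages are needed for this report: readr, DT, and tidyverse. readr imports the CSV file that contains the data set we are working with, DT displays the data as an interactive table, and tidyverse supplies the pipe operator (%>%) and the group_by() and summarize() functions used to summarize the data. The models themselves are fitted with lm(), which is part of base R.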
library(tidyverse)
First, we will look at a simple comparison of the average insurance charges for men and women.
insurance %>%
  group_by(sex) %>%
  summarize(`average charges` = mean(charges))
## # A tibble: 2 × 2
##   sex    `average charges`
##   <chr>              <dbl>
## 1 female            12570.
## 2 male              13957.
On average, men appear to be charged more for insurance than women are. However, the difference in mean charges is not guaranteed to be statistically significant. On top of that, there could be confounding variables other than gender that play a role in this difference.
To evaluate the significance of the difference in average charges, we can build a simple regression model using sex as the predictor and charges as the response. The summary of the model contains information which reveals the significance.
insurance_model <- lm(charges~sex, data=insurance)
summary(insurance_model)
##
## Call:
## lm(formula = charges ~ sex, data = insurance)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -12835  -8435  -3980   3476  51201
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  12569.6      470.1  26.740   <2e-16 ***
## sexmale       1387.2      661.3   2.098   0.0361 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12090 on 1336 degrees of freedom
## Multiple R-squared: 0.003282, Adjusted R-squared: 0.002536
## F-statistic: 4.4 on 1 and 1336 DF, p-value: 0.03613
According to this model, men are charged significantly more for health insurance than women are. This is because the p-value of sexmale (0.0361) is under the 0.05 cutoff. However, the model is very weak. The residual standard error (12090) is nearly as large as the average charges themselves, and the R-squared value (0.003) means that sex explains only about 0.3% of the variation in charges. While the model tells us that the difference in average charges is significant, it may not be trustworthy. Additionally, we still have not accounted for confounding variables.
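As a check on how to read this output, the t value and p-value for sexmale can be reproduced from the numbers printed in the summary. The sketch below uses only the rounded estimate, standard error, and residual degrees of freedom, so its results should match the printed values only approximately. The variable names are chosen here for illustration.

```r
# Values copied from the printed summary of insurance_model (rounded)
intercept <- 12569.6   # estimated mean charges for women (baseline group)
sexmale   <- 1387.2    # estimated difference: men minus women
std_error <- 661.3     # standard error of the sexmale coefficient
df_resid  <- 1336      # residual degrees of freedom

# The t value is the estimate divided by its standard error
t_value <- sexmale / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

# With a single binary predictor, intercept + coefficient is the mean for men
male_mean <- intercept + sexmale

print(round(t_value, 3))
print(round(p_value, 4))
print(male_mean)
```

The sum of the intercept and the coefficient, 13956.8, agrees with the average charges for men in the earlier table. This shows that the simple model is just a comparison of two group means.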
One possible confounding variable is age. Health insurance costs increase with age, and it is possible that the men in the data set are older than the women on average.
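One quick way to check this possibility is to compare the mean age of each group. The sketch below defines a small helper, mean_age_by_sex() (a name introduced here, not part of any package), and runs it on a tiny made-up data frame so that it is self-contained. With the real data, one would call it on insurance instead.

```r
# Hypothetical helper: mean age for each level of sex
mean_age_by_sex <- function(df) {
  tapply(df$age, df$sex, mean)
}

# Made-up toy data for illustration only (NOT taken from the insurance data set)
toy <- data.frame(
  sex = c("female", "female", "male", "male"),
  age = c(30, 50, 20, 40)
)

toy_means <- mean_age_by_sex(toy)
print(toy_means)

# With the real data: mean_age_by_sex(insurance)
```

If the two group means turn out to be close, age alone is unlikely to explain the gap in average charges.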
We can see how age affects insurance costs by controlling for it in our model. This multiple regression model will have age as a predictor in addition to sex.
insurance_mult_model <- lm(charges ~ sex + age, data=insurance)
summary(insurance_mult_model)
##
## Call:
## lm(formula = charges ~ sex + age, data = insurance)
##
## Residuals:
##    Min     1Q Median     3Q    Max
##  -8821  -6947  -5511   5443  48203
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  2343.62     994.35   2.357   0.0186 *
## sexmale      1538.83     631.08   2.438   0.0149 *
## age           258.87      22.47  11.523   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11540 on 1335 degrees of freedom
## Multiple R-squared: 0.09344, Adjusted R-squared: 0.09209
## F-statistic: 68.8 on 2 and 1335 DF, p-value: < 2.2e-16
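While the model still has a small R-squared value (0.093), meaning it is still weak, it is better than the simple model, and its overall p-value is much smaller. The p-value of age is far smaller than that of sexmale, so the evidence that age is associated with charges is much stronger than the evidence for sex. Note that sexmale is still significant at the 0.05 cutoff after controlling for age. The two coefficients should not be compared directly, because they are measured in different units: the coefficient of age is the change in charges for each additional year of age, while the coefficient of sexmale is the difference between a man and a woman of the same age.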
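The coefficients of age and sexmale are measured in different units, which the sketch below makes concrete. It uses the rounded coefficients printed in the summary to predict charges for a man and a woman of the same age, and to express the sexmale coefficient in years of age. The example age of 40 and the helper name predict_charges() are assumptions made for illustration.

```r
# Rounded coefficients copied from the printed summary of insurance_mult_model
b_intercept <- 2343.62
b_sexmale   <- 1538.83  # difference between men and women of the same age
b_age       <- 258.87   # change in charges per additional year of age

# Hypothetical helper: predicted charges from the fitted equation
predict_charges <- function(age, male) {
  b_intercept + b_sexmale * male + b_age * age
}

woman_40 <- predict_charges(age = 40, male = 0)
man_40   <- predict_charges(age = 40, male = 1)

# Number of years of age that shift predicted charges as much as sexmale does
years_equivalent <- b_sexmale / b_age

print(c(woman_40 = woman_40, man_40 = man_40))
print(round(years_equivalent, 2))
```

Under this model, being male shifts the prediction by about as much as six additional years of age, so a small per-year coefficient can still add up to a large difference between people who are far apart in age.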
While this model is better, we still have reason to distrust it. The R-squared value is too small to say the model is accurate. To deal with this problem, we can control for more confounding variables.
So far we have only controlled for age. The remaining variables in the data set are bmi, smoker, children, and region. By adding all of them to our model, we can see which ones have a significant impact on insurance charges.
insurance_mult_model_3 <- lm(charges ~ sex + age + bmi + smoker + children + region, data=insurance)
summary(insurance_mult_model_3)
##
## Call:
## lm(formula = charges ~ sex + age + bmi + smoker + children +
## region, data = insurance)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -11304.9  -2848.1   -982.1   1393.9  29992.8
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)     -11938.5      987.8 -12.086  < 2e-16 ***
## sexmale           -131.3      332.9  -0.394 0.693348
## age                256.9       11.9  21.587  < 2e-16 ***
## bmi                339.2       28.6  11.860  < 2e-16 ***
## smokeryes        23848.5      413.1  57.723  < 2e-16 ***
## children           475.5      137.8   3.451 0.000577 ***
## regionnorthwest   -353.0      476.3  -0.741 0.458769
## regionsoutheast  -1035.0      478.7  -2.162 0.030782 *
## regionsouthwest   -960.0      477.9  -2.009 0.044765 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7494
## F-statistic: 500.8 on 8 and 1329 DF, p-value: < 2.2e-16
The sex variable now has a large p-value (0.693), meaning it does not have a significant impact on charges once the other variables are controlled for. Two of the regions have p-values below the 0.05 cutoff, but only narrowly, and one region has a p-value above the cutoff. Each of these p-values compares a region with the baseline region (northeast), so it could be argued that region has a significant impact on charges, but the evidence is weak. The children variable has a small p-value (0.000577), which indicates a statistically significant association with charges. However, this p-value is not nearly as small as the p-values for age, bmi, and smoker, and the number of children the people in the data set have may, on average, increase with age. Removing sex, children, and region also barely changes the fit: the R-squared value only drops from 0.7509 to 0.7475 in the model below. Because of this, a model that uses only age, bmi, and smoker as predictors is our preferred model.
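The R-squared values of models with different numbers of predictors are easier to compare after adjusting for model size. The sketch below recomputes the adjusted R-squared of the full model (printed above) and of a reduced model with only age, bmi, and smoker (printed in the next summary) from the rounded values in the output. adjusted_r2() is a helper name introduced here.

```r
# Adjusted R-squared: penalizes R-squared for the number of predictors p
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 1338  # observations in the data set

# Full model: sex, age, bmi, smoker, children, and three region dummies (p = 8)
adj_full <- adjusted_r2(r2 = 0.7509, n = n, p = 8)

# Reduced model: age, bmi, smoker (p = 3)
adj_reduced <- adjusted_r2(r2 = 0.7475, n = n, p = 3)

print(round(c(full = adj_full, reduced = adj_reduced), 4))
```

Both values agree with the adjusted R-squared reported by summary(), and the gap between them is small, which is consistent with sex, children, and region adding little explanatory power.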
Our final model, which only uses the three strongest predictors of charges, looks like this:
insurance_mult_model_4 <- lm(charges ~ age + bmi + smoker, data=insurance)
summary(insurance_mult_model_4)
##
## Call:
## lm(formula = charges ~ age + bmi + smoker, data = insurance)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -12415.4  -2970.9   -980.5   1480.0  28971.8
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11676.83     937.57  -12.45   <2e-16 ***
## age            259.55      11.93   21.75   <2e-16 ***
## bmi            322.62      27.49   11.74   <2e-16 ***
## smokeryes    23823.68     412.87   57.70   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6092 on 1334 degrees of freedom
## Multiple R-squared: 0.7475, Adjusted R-squared: 0.7469
## F-statistic: 1316 on 3 and 1334 DF, p-value: < 2.2e-16
The sex variable is not used as a predictor because we have seen that it does not have a significant impact on charges once the other variables are controlled for. This means that the data set does not provide evidence that men are charged a significantly higher amount for health insurance than women are.
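A confidence interval gives another view of the same result. The sketch below builds an approximate 95% interval for the sexmale coefficient of the model that included every variable, using the rounded estimate and standard error printed in its summary. With the fitted model object available, confint(insurance_mult_model_3) would give the interval without the rounding.

```r
# Values copied from the printed summary of insurance_mult_model_3 (rounded)
estimate  <- -131.3
std_error <- 332.9
df_resid  <- 1329

# Critical value of the t distribution for a 95% two-sided interval
t_crit <- qt(0.975, df = df_resid)

ci <- c(lower = estimate - t_crit * std_error,
        upper = estimate + t_crit * std_error)
print(round(ci, 1))
```

The interval runs from roughly -784 to 522 and contains zero, so the data are compatible with no difference between men and women, as well as with modest differences in either direction.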
Through our analysis of various models, we found that three variables have by far the strongest association with health insurance charges: age, bmi, and smoker. The other variables in the data set add little to the model. In particular, the higher average charges for men disappear once the other variables are controlled for, so that difference is better explained by confounding variables than by sex itself. We can conclude that there is no evidence that men are charged significantly more for health insurance than women.