This report seeks to answer the following question:
Does the data set provide evidence that men are charged significantly more for health insurance than women?
We will be using a data set called insurance obtained
from [https://www.kaggle.com/datasets/mirichoi0218/insurance].
This data set includes basic health and demographic
information, as well as health insurance charges, for a
total of 1,338 individuals. There are 7 variables for each individual:
age, sex, and bmi
(some of the basic information provided to your doctor),
children (how many children the individual has, if
any), smoker (whether the individual smokes),
region (where the individual lives in the country), and
charges (how much the individual pays for their health
insurance). The full data set can be viewed below:
datatable(insurance, options = list(scrollX = TRUE))
Throughout, we will need the functionality of the tidyverse package, mainly to create visualizations, as well as the DT package to help display our data table.
library(tidyverse)
library(DT)
As we begin to gather information to answer our question, we can start with the basics. The most basic piece of data we can look at is the average insurance charge for males and for females. This gives us a baseline for which sex generally has the higher insurance cost. To do this we must look at the data set as two separate groups: one for the females and another for the males. Thankfully, instead of filtering the data into two different data tables, we can fit a linear model of charges on sex and read both averages from its summary. The summary looks like this:
insurance_model <- lm(charges ~ sex, data = insurance)
summary(insurance_model)
##
## Call:
## lm(formula = charges ~ sex, data = insurance)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -12835  -8435  -3980   3476  51201
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  12569.6      470.1  26.740   <2e-16 ***
## sexmale       1387.2      661.3   2.098   0.0361 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12090 on 1336 degrees of freedom
## Multiple R-squared: 0.003282, Adjusted R-squared: 0.002536
## F-statistic: 4.4 on 1 and 1336 DF, p-value: 0.03613
While this summary contains a lot of information, we will begin with the “Estimate” column, which gives us what we need to find the average charges for each sex. Because female is the reference level of sex, the “(Intercept)” row tells us that the average insurance charge for females is about 12,569.60 dollars. The “sexmale” row gives the difference between the male and female averages: men pay, on average, 1,387.20 dollars more than women. Adding this difference to the female average gives an average of 13,956.80 dollars in insurance charges for males.
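To see why the intercept and the “sexmale” estimate can be read this way, here is a minimal sketch on a hypothetical toy data set (the six charges below are made up and are not taken from the insurance data). It shows that a model with a single two-level factor reproduces the two group means, then repeats the arithmetic with the estimates reported in the summary.

```r
# Hypothetical toy data (NOT the insurance data set): six made-up charges.
toy <- data.frame(
  sex     = factor(c("female", "female", "female", "male", "male", "male")),
  charges = c(10000, 12000, 14000, 11000, 13000, 18000)
)

toy_model <- lm(charges ~ sex, data = toy)

# Group means computed directly, for comparison with the model.
group_means <- tapply(toy$charges, toy$sex, mean)

# The intercept is the mean of the reference level ("female").
female_mean <- unname(coef(toy_model)["(Intercept)"])

# The intercept plus the sexmale coefficient is the male mean.
male_mean <- female_mean + unname(coef(toy_model)["sexmale"])

# The same arithmetic with the estimates reported in the summary.
reported_female <- 12569.6
reported_male   <- reported_female + 1387.2
```

In the toy data, `female_mean` and `male_mean` match the entries of `group_means`, which is the same relationship used to get 13,956.80 dollars from the reported estimates.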
Next, we must consider whether the difference between the two averages is statistically significant. We can do this by looking at the p-value in the “sexmale” row. Here the p-value is 0.036, which is below the 0.05 cutoff we will use as our threshold. This allows us to conclude that the difference is statistically significant. Therefore, in this simple comparison, we have significant evidence that men are charged more for health insurance than women.
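As a rough check on where this p-value comes from, the sketch below recomputes a two-sided p-value from the reported t value and residual degrees of freedom. The inputs are the rounded figures printed in the summary, so the result should agree with the reported 0.0361 only up to rounding.

```r
# Reported in the summary: t value for sexmale and residual degrees of freedom.
t_value  <- 2.098
df_resid <- 1336

# Two-sided p-value: probability of a t statistic at least this extreme
# in either direction if the true difference were zero.
p_value <- 2 * pt(-abs(t_value), df = df_resid)

# Compare against the 0.05 threshold used in this report.
is_significant <- p_value < 0.05
```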
From here it would be easy to say that men have higher insurance charges than women when we look at the average, but other variables can also affect the resulting charges. These results don’t take into account the other variables in the data set that may be related to both sex and charges, also known as confounding variables. Additionally, the adjusted R-squared value is 0.002536, which means that sex alone explains only about 0.25% of the variation in charges. That is far too little to say there is a strong relationship between sex and charges.
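For reference, the adjusted R-squared can be recovered from the multiple R-squared with the standard adjustment formula. The sketch below applies it to the figures printed in the summary; `adj_r2` is a helper name introduced here for illustration, and the sample size of 1,338 comes from the description of the data set.

```r
# Standard adjustment: penalise R-squared for the number of predictors p,
# given n observations.
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Multiple R-squared reported for charges ~ sex, with one predictor.
adjusted <- adj_r2(r2 = 0.003282, n = 1338, p = 1)

# Expressed as a percentage of the variation in charges explained by sex.
adjusted_percent <- 100 * adjusted
```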
One of the important variables to consider is age. Age could be a confounding variable because health insurance costs generally go up as you get older: health tends to decline with age, so older individuals are more likely to use more of their health insurance package. If the average age of the males in this data set is higher than that of the females, or vice versa, that difference in age could account for the difference in insurance costs.
To determine if age is a confounding variable we can compare the insurance charges for males and females while controlling for the age variable:
insurance_model2 <- lm(charges ~ sex + age, data = insurance)
summary(insurance_model2)
##
## Call:
## lm(formula = charges ~ sex + age, data = insurance)
##
## Residuals:
##    Min     1Q Median     3Q    Max
##  -8821  -6947  -5511   5443  48203
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  2343.62     994.35   2.357   0.0186 *
## sexmale      1538.83     631.08   2.438   0.0149 *
## age           258.87      22.47  11.523   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11540 on 1335 degrees of freedom
## Multiple R-squared: 0.09344, Adjusted R-squared: 0.09209
## F-statistic: 68.8 on 2 and 1335 DF, p-value: < 2.2e-16
This new comparison tells us that age has a positive association with health insurance costs. This can be seen in the “Estimate” column, where the age coefficient is 258.87: each additional year of age is associated with about 258.87 dollars more in charges for individuals of the same sex. When we look at the p-values, we see 0.0186 for the intercept, 0.0149 for sexmale, and <2e-16 for age. All of these are below the 0.05 threshold, which allows us to conclude that both sex and age have a statistically significant association with charges in this model. Note that the intercept is not a female effect; it is the predicted charge for a female at age 0, so it has little meaning on its own. The sexmale and age coefficients are both positive, so even after controlling for age, men are estimated to pay about 1,538.83 dollars more than women.
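To make the coefficients concrete, the sketch below plugs the reported estimates into the fitted equation for two hypothetical individuals who differ only in sex. The function name `predict_charges` and the chosen age of 40 are illustrative assumptions, not part of the original analysis.

```r
# Estimates reported in the summary of charges ~ sex + age.
b_intercept <- 2343.62
b_male      <- 1538.83
b_age       <- 258.87

# Fitted equation: male is 1 for males and 0 for females.
predict_charges <- function(male, age) {
  b_intercept + b_male * male + b_age * age
}

# Two hypothetical 40-year-olds who differ only in sex.
female_40 <- predict_charges(male = 0, age = 40)
male_40   <- predict_charges(male = 1, age = 40)

# The gap between them is the sexmale coefficient, whatever the age.
gap <- male_40 - female_40
```

Because the model has no interaction term, the predicted gap between a man and a woman of the same age equals the sexmale coefficient at every age.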
While this comparison looks better than the previous one, there are still some reasons to distrust it. One reason is that the adjusted R-squared value is 0.09209, meaning the model explains only about 9.21% of the variation in charges. This is still very small and doesn’t give me a lot of confidence to say it is a good comparison. Additionally, there could be other confounding variables in the data set that affect the health insurance charges.
When deciding upon these additional confounding variables, we need to look at which ones best describe an individual’s health. Looking at the data set, the two variables that I hypothesize best fit this description are BMI and smoker. To determine whether or not this hypothesis is correct, we can redo our charges comparison controlling for BMI and smoker in addition to age. That comparison looks like this:
insurance_model3 <- lm(charges ~ sex + age + bmi + smoker, data = insurance)
summary(insurance_model3)
##
## Call:
## lm(formula = charges ~ sex + age + bmi + smoker, data = insurance)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -12364.7  -2972.2   -983.2   1475.8  29018.3
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11633.49     947.27 -12.281   <2e-16 ***
## sexmale       -109.04     334.66  -0.326    0.745
## age            259.45      11.94  21.727   <2e-16 ***
## bmi            323.05      27.53  11.735   <2e-16 ***
## smokeryes    23833.87     414.19  57.544   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6094 on 1333 degrees of freedom
## Multiple R-squared: 0.7475, Adjusted R-squared: 0.7467
## F-statistic: 986.5 on 4 and 1333 DF, p-value: < 2.2e-16
This comparison tells us that both the BMI and smoker variables have significant associations with health insurance charges. Looking at the coefficients, the sexmale estimate is now negative (-109.04), as is the intercept (-11,633.49). The intercept is negative only because it represents a female non-smoker with an age and BMI of zero, so it has no practical interpretation. The largest coefficient belongs to smoker: a smoker is estimated to pay about 23,833.87 dollars more than a non-smoker of the same sex, age, and BMI. Age, BMI, and smoker all have positive coefficients, while the sexmale coefficient has changed sign and shrunk close to zero. This shows us that charges depend less on the sex of the individual and more on their health-related characteristics.
Turning to the p-values, we can see that the intercept, age, BMI, and smoker all have a p-value of <2e-16. Since these p-values are all below the 0.05 cutoff, we have significant evidence that age, BMI, and smoking are associated with charges. This isn’t true for all of the variables, though. The sexmale coefficient has a p-value of 0.745, which is well above the 0.05 cutoff. This means we have insufficient evidence that sex has an effect on charges once age, BMI, and smoking are controlled for. Therefore, the difference between men and women is not statistically significant in this model.
The adjusted R-squared value is 0.7467, meaning this model explains about 74.67% of the variation in charges, so I can say that I am fairly confident that I can trust this comparison.
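The way a sex difference can appear in a simple comparison and then vanish once another variable is controlled for can be illustrated with a small simulation. This is a sketch on made-up data, not the insurance data: it assumes, purely for illustration, that smoking is more common among the simulated males and that charges depend on smoking only.

```r
set.seed(1)
n <- 2000

# Hypothetical population: sex assigned at random.
sex <- factor(sample(c("female", "male"), n, replace = TRUE))

# Assumption for illustration: simulated males smoke more often (30% vs 10%).
smoker <- rbinom(n, size = 1, prob = ifelse(sex == "male", 0.3, 0.1))

# Charges depend on smoking only; sex has no direct effect by construction.
charges <- 8000 + 24000 * smoker + rnorm(n, sd = 3000)
sim <- data.frame(sex, smoker, charges)

# Without the control, sex picks up the effect of smoking.
unadjusted <- unname(coef(lm(charges ~ sex, data = sim))["sexmale"])

# With smoking controlled for, the sex coefficient falls close to zero.
adjusted <- unname(coef(lm(charges ~ sex + smoker, data = sim))["sexmale"])
```

In this simulation `unadjusted` is large and positive while `adjusted` is close to zero, which is the same pattern as the change in the sexmale estimate between the models above. Whether smoking plays this role in the real data would need to be checked against the data set itself.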
Even though we have used many of the variables in the data set, there are still a couple we have not examined for a significant association with charges. These variables account for some home life factors rather than health-related ones. The two remaining variables are region and children.
A model of charges on these two variables looks like this:
insurance_model4 <- lm(charges ~ children + region, data = insurance)
summary(insurance_model4)
##
## Call:
## lm(formula = charges ~ children + region, data = insurance)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -13109  -8591  -4058   3107  49783
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)      12660.8      728.5  17.379  < 2e-16 ***
## children           712.5      273.8   2.603  0.00935 **
## regionnorthwest  -1061.1      947.0  -1.120  0.26272
## regionsoutheast   1326.8      920.9   1.441  0.14990
## regionsouthwest  -1127.3      946.9  -1.190  0.23407
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12060 on 1333 degrees of freedom
## Multiple R-squared: 0.01166, Adjusted R-squared: 0.008691
## F-statistic: 3.931 on 4 and 1333 DF, p-value: 0.003539
When looking at our p-values, we see that the children coefficient has a p-value of 0.00935. This is below the 0.05 cutoff, so we have significant evidence that the number of children is associated with health insurance charges: each additional child is associated with about 712.50 dollars more in charges. This is to be expected, especially if the children are on the parent’s health care plan.
The same, however, cannot be said about the region variable. The northwest, southeast, and southwest coefficients each compare that region with the northeast, which R uses as the reference level, and their p-values (0.263, 0.150, and 0.234) are all above the 0.05 cutoff. The p-value of <2e-16 in the output belongs to the intercept, not to the northeast region; the intercept is the predicted charge for an individual in the northeast with no children. Therefore, we have insufficient evidence that charges in the northwest, southeast, or southwest differ from charges in the northeast, and overall I would say that we have insufficient evidence that region impacts charges.
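The reason only three regions appear in the coefficient table can be seen from how R codes a factor. The sketch below builds a small factor with the four region names (the vector itself is illustrative, not the real column) and shows which level becomes the reference.

```r
# Illustrative factor containing the four region names.
region <- factor(c("southwest", "southeast", "northwest", "northeast"))

# Levels are sorted alphabetically by default; the first one is the reference.
reference_level <- levels(region)[1]

# The design matrix has an intercept plus one dummy column per
# non-reference level, matching the rows of the coefficient table.
dummy_columns <- colnames(model.matrix(~ region))
```

Here `reference_level` is "northeast", so the northeast has no row of its own and is absorbed into the intercept. A different baseline could be chosen with `relevel()` if another comparison were more useful.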
In summary, we can conclude that sex is not a statistically significant predictor of the cost of health insurance once age, BMI, and smoking status are controlled for. In this data set men do pay more than women on average, but that difference is explained by the other factors that were taken into account, not simply by the sex of the individual. Our data shows that the variables most strongly related to the charges an individual faces are whether they are a smoker, their BMI, and their age. Additionally, the number of children an individual has is also associated with the cost of health insurance. Our data shows these results through the p-values as well as the coefficient estimates. However, further research could improve on our best model, which currently explains about 74.67% of the variation in charges.