Health insurance costs are through the roof. Using the dataset “insurance”, we want to answer the question of whether men are charged significantly more for health insurance than women are. For this, we will use the “tidyverse” libraries to analyze the data in R. The library “readr”, which is part of the tidyverse, imports the dataset “insurance” into R, and the library “DT” displays it as an interactive data table, which is included below.
library(tidyverse)
library(readr)
library(DT)
insurance <- read_csv("insurance.csv")
datatable(insurance)
In this dataset, insurance, we have 7 variables:
age - how old the person is, in years
sex - the gender of the individual
bmi - the body mass index of the individual
children - the number of children the individual has
smoker - is this person a smoker? (yes or no)
region - the area on the compass rose the person lives in (Northeast, Northwest, Southeast, Southwest)
charges - how much is this person charged for health insurance?
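Before we can test whether men are charged significantly more for health insurance, we should first see whether men are charged more on average than women are. For this, we will fit a linear model of charges on sex alone. With a single two-level predictor, the intercept is the mean charge for women and the “sexmale” coefficient is the difference between the mean charge for men and the mean charge for women.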
gender_model <- lm(charges ~ sex, data = insurance)
summary(gender_model)
##
## Call:
## lm(formula = charges ~ sex, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12835 -8435 -3980 3476 51201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12569.6 470.1 26.740 <2e-16 ***
## sexmale 1387.2 661.3 2.098 0.0361 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12090 on 1336 degrees of freedom
## Multiple R-squared: 0.003282, Adjusted R-squared: 0.002536
## F-statistic: 4.4 on 1 and 1336 DF, p-value: 0.03613
From this model, we see that on average, men are charged about $1,387 more for health insurance than women (the intercept, $12,569.60, is the average charge for women). But a difference in means alone is not enough to say that there is a significant difference. There could be a dude out there named Mark who is charged a lot more than the others for reasons we don’t know. We can find that it IS significant by looking at the p-value for “sexmale”, which is 0.0361. This is below our threshold of 0.05, so we have evidence that men are charged more than women. HOWEVER, the R-squared is only 0.0033, so sex by itself explains almost none of the variation in charges, and there are other variables that we should check to truly analyze the impacts of the dataset’s variables on charges.
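To see why a regression on sex alone is the same thing as comparing group means, here is a minimal sketch in base R. The data frame `toy` and its values are made up for illustration and are not taken from insurance.csv; with the real data, `toy` would be replaced by `insurance`.

```r
# Toy data (hypothetical values, NOT from insurance.csv)
toy <- data.frame(
  sex     = c("female", "female", "female", "male", "male", "male"),
  charges = c(1000, 2000, 3000, 2000, 4000, 6000)
)

# Mean charge within each sex
group_means <- tapply(toy$charges, toy$sex, mean)

# Regression of charges on sex alone
toy_model <- lm(charges ~ sex, data = toy)

# The "sexmale" coefficient equals the male mean minus the female mean
mean_diff <- unname(group_means["male"] - group_means["female"])
slope     <- unname(coef(toy_model)["sexmale"])

print(group_means)
print(c(mean_diff = mean_diff, slope = slope))
```

In this toy example the female mean is 2000 and the male mean is 4000, so both `mean_diff` and `slope` come out to 2000.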
It is not safe to say solely based on that analysis that men are charged more than women, given the fact there are other variables in the dataset “insurance” that could be confounding, such as age, smoker status, and bmi. Age could be confounding because as a person gets older, they have a higher chance of having an event that costs the insurance company more money, such as an accident, life-threatening event, or death. On average, older people are charged more for insurance simply for being older.
First we will analyze the impact of age on charges.
model_with_age <- lm(charges ~ sex + age, data = insurance)
summary(model_with_age)
##
## Call:
## lm(formula = charges ~ sex + age, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8821 -6947 -5511 5443 48203
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2343.62 994.35 2.357 0.0186 *
## sexmale 1538.83 631.08 2.438 0.0149 *
## age 258.87 22.47 11.523 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11540 on 1335 degrees of freedom
## Multiple R-squared: 0.09344, Adjusted R-squared: 0.09209
## F-statistic: 68.8 on 2 and 1335 DF, p-value: < 2.2e-16
Judging by this model, we see that for every 1-year increase in age, a person is charged $258.87 more for health insurance on average, all else held constant. The p-value for age’s impact on charges is significant, at less than 2 x 10^-16. The sexmale coefficient is still significant as well (p = 0.0149), at about $1,539. So far, both age and gender appear to have a significant impact on charges, but there are still possible confounding variables that we have not accounted for.
On average, a person with a better health record is typically not charged as much for insurance, as they have a lower risk of a major health event. So, we should add two more variables from the dataset, BMI and smoker status, to our model to see what impact they have on “charges.”
model_with_bmi_smoker <- lm(charges ~ sex + bmi + age + smoker, data = insurance)
summary(model_with_bmi_smoker)
##
## Call:
## lm(formula = charges ~ sex + bmi + age + smoker, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12364.7 -2972.2 -983.2 1475.8 29018.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11633.49 947.27 -12.281 <2e-16 ***
## sexmale -109.04 334.66 -0.326 0.745
## bmi 323.05 27.53 11.735 <2e-16 ***
## age 259.45 11.94 21.727 <2e-16 ***
## smokeryes 23833.87 414.19 57.544 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6094 on 1333 degrees of freedom
## Multiple R-squared: 0.7475, Adjusted R-squared: 0.7467
## F-statistic: 986.5 on 4 and 1333 DF, p-value: < 2.2e-16
Suddenly, when we include BMI and smoker status, the p-value of sexmale jumps to 0.745, which is not significant, and the estimate is now -109.04. BMI’s impact on charges is significant, with a p-value of less than 2 x 10^-16, and for each 1-unit increase in BMI, a person is charged $323.05 more on average, all else held constant. Age’s impact on charges is significant as well, with a p-value of less than 2 x 10^-16, and for every 1-year increase in age, a person is charged $259.45 more on average, all else held constant. We also see that being a smoker increases insurance charges by $23,833.87 on average, all else held constant, with a p-value of less than 2 x 10^-16, which is significant. The adjusted R-squared is 0.7467, which means that our model accounts for about 74.67% of the variability in charges, up from 0.0921 with only sex and age. These variables are clearly helping our model, and based on this model, we see that gender does not have a significant impact on charges after all.
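One way a coefficient can vanish like this is if sex is related to another variable that actually drives charges. The sketch below uses a small made-up data frame (the name `toy`, its counts, and its dollar amounts are all assumptions for illustration, not values from insurance.csv) in which charges depend only on smoking, but more of the men smoke. We have not checked whether this is what happens in the real data.

```r
# Toy data (hypothetical): 10 men with 6 smokers, 10 women with 2 smokers
toy <- data.frame(
  sex    = rep(c("male", "female"), each = 10),
  smoker = c(rep(c("yes", "no"), times = c(6, 4)),
             rep(c("yes", "no"), times = c(2, 8)))
)

# Charges depend ONLY on smoking in this toy example, never on sex
toy$charges <- 5000 + 20000 * (toy$smoker == "yes")

# Without smoker in the model, sex picks up the smoking effect
unadjusted <- unname(coef(lm(charges ~ sex, data = toy))["sexmale"])

# With smoker in the model, the sex coefficient is essentially zero
adjusted <- unname(coef(lm(charges ~ sex + smoker, data = toy))["sexmale"])

print(c(unadjusted = unadjusted, adjusted = adjusted))
```

Here the unadjusted coefficient is 8000 (a male mean of 17000 minus a female mean of 9000), while the adjusted one is zero up to rounding error. On the real data, a table such as `table(insurance$sex, insurance$smoker)` would be one way to check whether smoking is unevenly split between men and women.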
By now, there are two variables left in the dataset that could be confounding that we haven’t accounted for. These variables are “region” and “children.” We should include these in a model to see if they have a significant impact on charges.
full_model <- lm(charges ~ sex + bmi + age + smoker + children + region, data = insurance)
summary(full_model)
##
## Call:
## lm(formula = charges ~ sex + bmi + age + smoker + children +
## region, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11304.9 -2848.1 -982.1 1393.9 29992.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11938.5 987.8 -12.086 < 2e-16 ***
## sexmale -131.3 332.9 -0.394 0.693348
## bmi 339.2 28.6 11.860 < 2e-16 ***
## age 256.9 11.9 21.587 < 2e-16 ***
## smokeryes 23848.5 413.1 57.723 < 2e-16 ***
## children 475.5 137.8 3.451 0.000577 ***
## regionnorthwest -353.0 476.3 -0.741 0.458769
## regionsoutheast -1035.0 478.7 -2.162 0.030782 *
## regionsouthwest -960.0 477.9 -2.009 0.044765 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7494
## F-statistic: 500.8 on 8 and 1329 DF, p-value: < 2.2e-16
Region enters the model as three indicator variables, each compared with the baseline region that does not appear in the output, Northeast. Judging by the summary above, Southeast and Southwest differ significantly from Northeast, with p-values below our 0.05 threshold, while Northwest does not (p = 0.459). A person in the Southeast region is charged $1,035 less on average than a person in the Northeast, all else held constant, and a person in the Southwest region is charged $960 less. Having children has a significant impact on insurance charges as well, with a p-value of 0.000577, well below our threshold. This is likely because many parents have their children on their insurance plan, which adds more “risk” associated with their care. The summary also shows that for every additional child, insurance charges are expected to go up $475.50 on average, all else held constant. Compared to the previous model, our adjusted R-squared went up very slightly (from 0.7467 to 0.7494), meaning slightly more of the variability in charges is explained by our model, so these variables should be kept in our final model. In this model, we also see an RSE of 6062, about the same as the previous model’s 6094 and much lower than our first two models, which were at 12090 and 11540. Because of the high p-value associated with Northwest, we can also check whether leaving region out is better for our model, by making one that does not have region.
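Because every region coefficient is a comparison with a baseline level, it can help to see how R picks that baseline and how to change it. This is a minimal sketch on a made-up data frame (`toy` and its values are hypothetical, and the column `region_sw` is a name introduced here for illustration).

```r
# Toy data (hypothetical): two people per region
toy <- data.frame(
  region  = rep(c("northeast", "northwest", "southeast", "southwest"), each = 2),
  charges = c(10, 12, 9, 11, 7, 9, 8, 10)
)
toy$region <- factor(toy$region)

# The first factor level (alphabetical by default) is the baseline
baseline <- levels(toy$region)[1]

# Coefficients are differences from the baseline region's mean
default_coefs <- coef(lm(charges ~ region, data = toy))

# relevel() picks a different baseline; the fit is the same, the comparisons change
toy$region_sw <- relevel(toy$region, ref = "southwest")
releveled_coefs <- coef(lm(charges ~ region_sw, data = toy))

print(baseline)
print(default_coefs)
print(releveled_coefs)
```

With Northeast as the baseline, the intercept is the Northeast mean (11) and the other regions are reported as differences from it. After releveling, the intercept becomes the Southwest mean (9) and Northeast shows up as its own coefficient.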
model_w_children <- lm(charges ~ sex + bmi + age + smoker + children, data = insurance)
summary(model_w_children)
##
## Call:
## lm(formula = charges ~ sex + bmi + age + smoker + children, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11837.2 -2916.7 -994.2 1375.3 29565.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12052.46 951.26 -12.670 < 2e-16 ***
## sexmale -128.64 333.36 -0.386 0.699641
## bmi 322.36 27.42 11.757 < 2e-16 ***
## age 257.73 11.90 21.651 < 2e-16 ***
## smokeryes 23823.39 412.52 57.750 < 2e-16 ***
## children 474.41 137.86 3.441 0.000597 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6070 on 1332 degrees of freedom
## Multiple R-squared: 0.7497, Adjusted R-squared: 0.7488
## F-statistic: 798 on 5 and 1332 DF, p-value: < 2.2e-16
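Judging by this summary, dropping region lowers the adjusted R-squared slightly (from 0.7494 to 0.7488) and raises the residual standard error slightly (from 6062 to 6070). Including region gives a small improvement, so we will include it in our final model.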
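Comparing R-squared values by eye is one option. Another option, which is not what we did above, is a partial F-test between the two nested models using `anova()`. The sketch below runs on simulated data (the data frame `toy`, the seed, and the coefficients used to simulate it are assumptions for illustration); on the real data the call would be `anova(model_w_children, full_model)`.

```r
# Simulated toy data (hypothetical): charges depend on age, not on region
set.seed(1)
n <- 80
toy <- data.frame(
  region = factor(rep(c("northeast", "northwest", "southeast", "southwest"),
                      each = n / 4)),
  age    = sample(18:64, n, replace = TRUE)
)
toy$charges <- 2000 + 250 * toy$age + rnorm(n, sd = 3000)

# Nested models: the larger one adds region to the smaller one
small_model <- lm(charges ~ age, data = toy)
large_model <- lm(charges ~ age + region, data = toy)

# Partial F-test: do the three region indicators jointly improve the fit?
comparison <- anova(small_model, large_model)
print(comparison)
```

The second row of the table tests all three region indicators at once, which avoids judging region by the p-value of a single level such as Northwest.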
After including all key variables, we see that sex does not have a significant impact on insurance charges, with a p-value of 0.6933, which is well above our threshold of 0.05 for significance. In a final model, we would include sex, bmi, age, smoker status, children, and region, and we can conclude that we do not have significant evidence that sex has an impact on insurance charges. BMI, age, smoker status, and children all have a significant impact, with p-values below 0.05. If a person smokes, they are expected to pay $23,848.50 more on average for health insurance, all else held constant. For each one-unit increase in BMI, we expect a person to pay $339.20 more on average, all else held constant. For each 1-year increase in age, we expect a person to pay $256.90 more on average, all else held constant. The $1,387 difference between men and women that we saw in the first model disappears once these variables are accounted for.
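To make the “all else held constant” interpretation concrete, here is a small worked example that plugs a made-up person into the full model by hand. The coefficients are copied from the full model summary above; the person (a 30-year-old with a BMI of 25, no children, living in the Northeast baseline region) is hypothetical.

```r
# Coefficients copied from the full model summary above
coefs <- c(intercept = -11938.5, sexmale = -131.3, bmi = 339.2,
           age = 256.9, smokeryes = 23848.5)

# Hypothetical person: age 30, BMI 25, no children, Northeast (baseline region),
# so the children and region terms contribute nothing
female_nonsmoker <- unname(coefs["intercept"] + coefs["bmi"] * 25 + coefs["age"] * 30)

# Same person, changing one thing at a time
male_nonsmoker <- female_nonsmoker + unname(coefs["sexmale"])
female_smoker  <- female_nonsmoker + unname(coefs["smokeryes"])

print(c(female_nonsmoker = female_nonsmoker,
        male_nonsmoker   = male_nonsmoker,
        female_smoker    = female_smoker))
# female_nonsmoker: 4248.5, male_nonsmoker: 4117.2, female_smoker: 28097
```

Switching sex moves the predicted charge by about $131, while switching smoker status moves it by almost $24,000. With the fitted model object, `predict(full_model, newdata = ...)` would give the same kind of prediction without copying coefficients by hand.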