library(readxl)  # provides read_excel()
library(DT)      # provides datatable(), used further down
insurance <- read_excel("insurance.xlsx")
View(insurance)
The question being investigated today: does this data set provide evidence that men are charged significantly more for health insurance than women? The hypothesis is that insurance charges are higher for men than for women. A few potentially confounding variables need to be described before heading into the analysis.
These variables come from the insurance data set.
datatable(insurance, options = list(scrollX = TRUE))
The model below, sex_insurance_model, regresses insurance charges on sex (male or female) alone.
sex_insurance_model <- lm(charges ~ sex, data = insurance)
summary(sex_insurance_model)
Call:
lm(formula = charges ~ sex, data = insurance)
Residuals:
   Min     1Q Median     3Q    Max
-12835  -8435  -3980   3476  51201
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  12569.6      470.1  26.740   <2e-16 ***
sexmale       1387.2      661.3   2.098   0.0361 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 12090 on 1336 degrees of freedom
Multiple R-squared: 0.003282, Adjusted R-squared: 0.002536
F-statistic: 4.4 on 1 and 1336 DF, p-value: 0.03613
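When insurance charges are compared on the sex variable alone, men are charged about 1,387 more than women on average (the sexmale estimate). The p-value for this difference is below 0.05, meaning the difference is statistically significant.
(Intercept), the mean charge for women: 12569.6 (its p-value of <2e-16 only tests whether that mean differs from zero); sexmale p-value: 0.0361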
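How the two coefficients of a one-factor model relate to group means can be checked on a tiny example. The sketch below uses made-up toy values (not taken from insurance.xlsx); the names toy and toy_model are illustrative only.

```r
# Toy data: hypothetical values, NOT from the insurance data set
toy <- data.frame(
  sex = factor(c("female", "female", "female", "male", "male", "male")),
  charges = c(10, 12, 14, 13, 15, 17)
)

toy_model <- lm(charges ~ sex, data = toy)

# Mean charge within each group
group_means <- tapply(toy$charges, toy$sex, mean)

coef(toy_model)   # (Intercept) = 12, sexmale = 3
group_means       # female = 12, male = 15

# The intercept is the mean of the reference level ("female"),
# and sexmale is the male mean minus the female mean.
male_minus_female <- unname(group_means["male"] - group_means["female"])
```

Read the same way, the reported intercept of 12569.6 is the mean charge for women and 1387.2 is how much higher the mean charge for men is.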
Since sex is the only predictor in this model, this is not a fair conclusion to make overall. The adjusted R-squared is very low at 0.002536, which means sex explains less than 1% of the variation in charges, and there are likely other variables confounding the comparison in these data.
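The fit statistics in the summary are tied together by simple formulas, so they can be cross-checked by hand. The sketch below starts from the reported multiple R-squared and residual degrees of freedom and recomputes the adjusted R-squared and the F-statistic; the variable names are illustrative.

```r
# Values copied from the summary of sex_insurance_model
r2       <- 0.003282  # Multiple R-squared
k        <- 1         # number of predictors (sexmale)
df_resid <- 1336      # residual degrees of freedom

n <- df_resid + k + 1  # number of observations implied by the output

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

# Overall F-statistic expressed through R-squared
f_stat <- (r2 / k) / ((1 - r2) / df_resid)

round(adj_r2, 6)
round(f_stat, 2)
```

Both results should agree with the printed summary (0.002536 and 4.4) up to rounding.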
One example of a possible confounding variable in this data set is age. Older people tend to use health insurance more, which brings their charges up. Many older people may also be taken advantage of when they are charged for insurance, which would push the overall charges up further. Since this variable has not been taken into account yet, men may look as if they are charged more, compared to women, than they really are.
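What confounding can look like in a regression is easiest to see on simulated data. In the sketch below, everything is hypothetical: the data frame sim is generated so that charges depend on age only, and men are older than women by construction. It does not use insurance.xlsx and says nothing about whether age actually differs by sex there.

```r
set.seed(1)
n <- 500  # simulated people per sex (hypothetical)

sim <- data.frame(
  sex = factor(rep(c("female", "male"), each = n)),
  # By construction the simulated men are about 10 years older on average
  age = c(runif(n, 20, 50), runif(n, 30, 60))
)

# Charges depend on age only; sex has no direct effect in this simulation
sim$charges <- 2000 + 250 * sim$age + rnorm(2 * n, sd = 3000)

naive    <- lm(charges ~ sex, data = sim)        # age omitted
adjusted <- lm(charges ~ sex + age, data = sim)  # age controlled for

naive_sex    <- unname(coef(naive)["sexmale"])
adjusted_sex <- unname(coef(adjusted)["sexmale"])

c(naive = naive_sex, adjusted = adjusted_sex)
```

In the naive fit the sexmale coefficient absorbs the age difference (roughly 250 times 10 here), while in the adjusted fit it falls back toward zero. Whether the real data behave this way has to be checked by fitting the models.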
The model below, age_model, regresses insurance charges on sex (male or female) and age.
age_model <- lm(charges ~ sex + age, data = insurance)
summary(age_model)
Call:
lm(formula = charges ~ sex + age, data = insurance)
Residuals:
  Min    1Q Median    3Q   Max
-8821 -6947  -5511  5443 48203
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2343.62     994.35   2.357   0.0186 *
sexmale      1538.83     631.08   2.438   0.0149 *
age           258.87      22.47  11.523   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 11540 on 1335 degrees of freedom
Multiple R-squared: 0.09344, Adjusted R-squared: 0.09209
F-statistic: 68.8 on 2 and 1335 DF, p-value: < 2.2e-16
Adding the age variable changes the intercept from 12569.6 to 2343.62, because the intercept is now the predicted charge for a woman at age 0 rather than the mean charge for all women. It does not have much of an effect on the sexmale estimate, which moves from 1387.2 to 1538.83. Age itself is significant (p-value <2e-16), but with the adjusted R-squared still very low at 0.09209, this model is still not a good one. Although age seemed like a confounding factor in these data, controlling for it does not shrink the difference between men and women, so other significant variables are needed to build a well-fitting model.
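To make the coefficients concrete, the fitted equation can be evaluated by hand. The sketch below copies the three estimates from the age_model summary; the helper predict_charge is a hypothetical name introduced here, and age 40 is just an illustrative input.

```r
# Estimates copied from the age_model summary
b <- c(intercept = 2343.62, sexmale = 1538.83, age = 258.87)

# Hypothetical helper: fitted charge for a given sex (male = 0 or 1) and age
predict_charge <- function(male, age) {
  unname(b["intercept"] + b["sexmale"] * male + b["age"] * age)
}

woman_40 <- predict_charge(male = 0, age = 40)  # 12698.42
man_40   <- predict_charge(male = 1, age = 40)  # 14237.25
woman_0  <- predict_charge(male = 0, age = 0)   # 2343.62, the intercept

man_40 - woman_40  # 1538.83, the sexmale estimate, at any age
```

The intercept only describes a woman at age 0, which is outside any realistic range, so the meaningful quantities are the slope for age and the constant gap between the sexes.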
The model below, health_model, regresses insurance charges on sex (male or female), smoking status, and body mass index (bmi). Since these two additional variables describe a person's health, one might hypothesize higher charges for smokers as well as for those with a higher bmi.
health_model <- lm(charges ~ sex + smoker + bmi, data = insurance)
summary(health_model)
Call:
lm(formula = charges ~ sex + smoker + bmi, data = insurance)
Residuals:
     Min       1Q   Median       3Q      Max
-15870.3  -4558.6   -800.6   3671.4  30798.0
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -3353.53    1008.78  -3.324  0.00091 ***
sexmale      -285.28     389.18  -0.733  0.46366
smokeryes   23620.85     481.66  49.040  < 2e-16 ***
bmi           389.09      31.83  12.225  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7089 on 1334 degrees of freedom
Multiple R-squared: 0.6581, Adjusted R-squared: 0.6573
F-statistic: 855.8 on 3 and 1334 DF, p-value: < 2.2e-16
The results of this model do show much higher charges for smokers (about 23,621 more) and higher charges with a higher bmi (about 389 per unit). That said, with the addition of these two variables the sexmale estimate drops to -285.28, which was not expected. Its p-value of 0.46366 is well above 0.05, meaning that in this comparison you fail to reject the null hypothesis of no difference between men and women. Since the smoker and bmi terms have very low p-values and the adjusted R-squared rises to 0.6573, this model fits far better than the previous ones.
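The t value and p-value for sexmale follow directly from its estimate and standard error, so the printed row can be reproduced by hand. The sketch below uses only numbers from the health_model summary; the variable names are illustrative.

```r
# Values copied from the sexmale row of the health_model summary
estimate  <- -285.28
std_error <- 389.18
df_resid  <- 1334

# t statistic: how many standard errors the estimate is from zero
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_resid)

# 95% confidence interval for the sexmale coefficient
ci <- estimate + c(-1, 1) * qt(0.975, df = df_resid) * std_error

round(t_value, 3)  # -0.733
round(p_value, 3)
round(ci, 1)
```

The interval spans zero, which is another way of seeing that this model gives no evidence of a difference between men and women.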
The model below, children_region_model, regresses insurance charges on sex (male or female), the number of children covered by insurance, and the region of the US which the client comes from. These last two variables have not been used in an earlier model.
children_region_model <- lm(charges ~ sex + children + region, data = insurance)
summary(children_region_model)
Call:
lm(formula = charges ~ sex + children + region, data = insurance)
Residuals:
   Min     1Q Median     3Q    Max
-13736  -8446  -3973   3301  50460
Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)      12004.9      797.5  15.053   <2e-16 ***
sexmale           1324.1      658.7   2.010   0.0446 *
children           702.8      273.5   2.570   0.0103 *
regionnorthwest  -1049.9      945.9  -1.110   0.2672
regionsoutheast   1305.4      919.9   1.419   0.1561
regionsouthwest  -1124.3      945.9  -1.189   0.2348
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 12040 on 1332 degrees of freedom
Multiple R-squared: 0.01465, Adjusted R-squared: 0.01095
F-statistic: 3.96 on 5 and 1332 DF, p-value: 0.001438
In general, this model fits these data poorly. None of the region terms is significant (all p-values > 0.05), so the data give no evidence that charges differ from the baseline region, while sexmale (0.0446) and children (0.0103) stay below 0.05. The adjusted R-squared is 0.01095, so the model explains only about 1% of the variation in charges.
These models do not show that men get charged more for health insurance than women because of their sex. The sexmale estimate is significant only in the models with very low adjusted R-squared values; in health_model, the only one of the four that fits well, its p-value is 0.46366. The null hypothesis says that men and women do not get charged differently because of sex. Although health_model has a much higher R-squared value while still including the sex variable, this is due to the added health variables, chiefly smoker, and not to sex. Given these analyses, the data do not support the hypothesis.
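Putting the sexmale results of the four models side by side makes the pattern easier to see. The sketch below only re-enters figures printed in the summaries above; the data frame name sex_effects and the 0.05 cutoff column are illustrative choices.

```r
# sexmale estimate, its p-value, and adjusted R-squared, copied from each summary
sex_effects <- data.frame(
  model    = c("sex_insurance_model", "age_model",
               "health_model", "children_region_model"),
  estimate = c(1387.2, 1538.83, -285.28, 1324.1),
  p_value  = c(0.0361, 0.0149, 0.46366, 0.0446),
  adj_r2   = c(0.002536, 0.09209, 0.6573, 0.01095)
)

# Flag which sexmale estimates are significant at the 0.05 level
sex_effects$significant <- sex_effects$p_value < 0.05

# Order from best-fitting to worst-fitting model
sex_effects[order(-sex_effects$adj_r2), ]

best_fit <- sex_effects$model[which.max(sex_effects$adj_r2)]
```

The sexmale estimate is flagged as significant in three of the four models, but not in health_model, which has by far the highest adjusted R-squared.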
The model below, better_model, uses the combination of variables that gave the best fit of the models tried. It regresses insurance charges on the age of the client and smoking status.
better_model <- lm(charges ~ age + smoker, data = insurance)
summary(better_model)
Call:
lm(formula = charges ~ age + smoker, data = insurance)
Residuals:
     Min       1Q   Median       3Q      Max
-16088.1  -2046.8  -1336.4   -212.7  28760.0
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -2391.63     528.30  -4.527 6.52e-06 ***
age           274.87      12.46  22.069  < 2e-16 ***
smokeryes   23855.30     433.49  55.031  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6397 on 1335 degrees of freedom
Multiple R-squared: 0.7214, Adjusted R-squared: 0.721
F-statistic: 1728 on 2 and 1335 DF, p-value: < 2.2e-16
One may notice that the “sex” variable is not used in this model. This variable was left out because it explains very little on its own (R-squared of 0.003282) and is not significant once smoker and bmi are in the model (health_model). This model supports the conclusion that men are not charged significantly more than women, and that the differences in charges are largely a matter of age and health.
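One way to check whether leaving a variable out is justified is a nested-model F test with anova(). The sketch below is an assumption-laden illustration on simulated data: the data frame sim and its coefficients are hypothetical, loosely modelled on the better_model estimates, and do not come from insurance.xlsx.

```r
set.seed(42)
n <- 400  # hypothetical sample size

sim <- data.frame(
  age    = round(runif(n, 18, 64)),
  smoker = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2))),
  sex    = factor(sample(c("female", "male"), n, replace = TRUE))
)

# Simulated charges depend on age and smoking only, not on sex
sim$charges <- -2400 + 275 * sim$age +
  23800 * (sim$smoker == "yes") + rnorm(n, sd = 6400)

reduced <- lm(charges ~ age + smoker, data = sim)
full    <- lm(charges ~ age + smoker + sex, data = sim)

# F test of whether adding sex reduces the residual sum of squares
cmp <- anova(reduced, full)
cmp

# With one added term, the F test matches the t test on that term
p_f <- cmp[2, "Pr(>F)"]
p_t <- summary(full)$coefficients["sexmale", "Pr(>|t|)"]
```

On the real data, the same comparison would be anova(better_model, lm(charges ~ age + smoker + sex, data = insurance)); a large p-value there would support dropping sex.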