This report examines health insurance charges to see whether males and females pay the same price. We use linear regression to test for a difference in charges between males and females, with other variables added as controls. The data set is insurance, read from “insurance.csv”. The variables are sex, charges, age, the region where the insurance holder lives, the number of children they have, whether they smoke, and their BMI (body mass index, a measure of weight relative to height).
We make a few adjustments to the variables first. We rename the sex column to female and recode its values as numbers: 0 for males and 1 for females. We then split the region column into four separate indicator columns (southwest, southeast, northwest, northeast). Each of these columns holds a 1 if the insurance holder lives in that region and a 0 if they don’t. The models below use the adjusted data set, insurance_w.
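The recoding steps above could be sketched in base R as follows. The original preparation code is not shown, so this is only one possible approach: the toy rows are made up for illustration, and the raw column names (sex, region) and the values in them are assumptions about what insurance.csv contains.

```r
# Toy stand-in for the raw data; these rows are invented for illustration.
insurance <- data.frame(
  sex     = c("male", "female", "female", "male"),
  region  = c("southwest", "southeast", "northwest", "northeast"),
  charges = c(1000, 2000, 3000, 4000)
)

insurance_w <- insurance

# Rename sex to female and recode: males -> 0, females -> 1.
insurance_w$female <- ifelse(insurance_w$sex == "female", 1, 0)
insurance_w$sex <- NULL

# One 0/1 indicator column per region.
for (r in c("southwest", "southeast", "northwest", "northeast")) {
  insurance_w[[r]] <- as.numeric(insurance_w$region == r)
}
insurance_w$region <- NULL

print(insurance_w)
```

Each row ends up with exactly one region column set to 1, which is worth checking (for example with rowSums) before fitting any model.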
We start by comparing charges for males and females alone. The model output below shows that females are charged $1,405.40 less than males on average, and the difference is statistically significant at the 0.05 level (p = 0.0338). From this comparison alone we could say that men pay more than women, but the model ignores every other variable that could affect charges, so it is hard to trust. One reason is the very small R-squared of 0.003369: sex on its own explains less than 1% of the variation in charges, so almost all of the variation is left unexplained. We need to add other variables to the model to determine whether males and females are charged differently.
MvF <- lm(charges ~ female, data = insurance_w)
summary(MvF)
##
## Call:
## lm(formula = charges ~ female, data = insurance_w)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -12853  -8432  -3973   3500  51201
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  13975.0      465.5  30.020   <2e-16 ***
## female       -1405.4      661.6  -2.124   0.0338 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12090 on 1335 degrees of freedom
## Multiple R-squared: 0.003369, Adjusted R-squared: 0.002623
## F-statistic: 4.513 on 1 and 1335 DF, p-value: 0.03382
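With a single 0/1 explanatory variable, the intercept is the mean of the group coded 0 and the slope is the difference between the two group means. Read that way, the output above puts the average male charge at 13975.0 and the average female charge at 13975.0 - 1405.4 = 12569.6. A minimal sketch with made-up numbers (not the insurance data) shows the same relationship:

```r
# Made-up charges for three males (female = 0) and three females (female = 1).
toy <- data.frame(
  female  = c(0, 0, 0, 1, 1, 1),
  charges = c(100, 200, 300, 150, 160, 170)
)

fit <- lm(charges ~ female, data = toy)

mean_male   <- mean(toy$charges[toy$female == 0])
mean_female <- mean(toy$charges[toy$female == 1])

# The intercept matches the male mean; the slope matches the gap in means.
print(coef(fit))
print(c(mean_male = mean_male, gap = mean_female - mean_male))
```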
For our next model we add age, because age can be an important factor in a person’s health. Adding age improves the model: the R-squared rises from 0.003369 to 0.09308, although that is still very low. The p-value for female falls from 0.0338 to 0.0143, and the coefficient moves from -1405.4 to -1549.14, so at the same age females are estimated to pay about $1,549 less than males. The model is still not good enough, because there are more variables we could add that might affect both the coefficients and the R-squared value.
MvF_cA <- lm(charges ~ female + age, data = insurance_w)
summary(MvF_cA)
##
## Call:
## lm(formula = charges ~ female + age, data = insurance_w)
##
## Residuals:
##    Min     1Q Median     3Q    Max
##  -8831  -6949  -5511   5446  48198
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  3914.19     982.05   3.986 7.09e-05 ***
## female      -1549.14     631.45  -2.453   0.0143 *
## age           258.32      22.49  11.487  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11540 on 1334 degrees of freedom
## Multiple R-squared: 0.09308, Adjusted R-squared: 0.09172
## F-statistic: 68.46 on 2 and 1334 DF, p-value: < 2.2e-16
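To read the coefficients above as dollar amounts, it can help to plug in a specific person. The sketch below copies the three estimates from the output and computes the fitted charge for a 40-year-old male and a 40-year-old female; the age of 40 and the helper name predict_charge are arbitrary choices for illustration.

```r
# Estimates copied from the summary(MvF_cA) output above.
b_intercept <- 3914.19
b_female    <- -1549.14
b_age       <- 258.32

# Fitted charge = intercept + b_female * female + b_age * age
predict_charge <- function(female, age) {
  b_intercept + b_female * female + b_age * age
}

male_40   <- predict_charge(female = 0, age = 40)
female_40 <- predict_charge(female = 1, age = 40)

print(c(male_40 = male_40, female_40 = female_40, gap = female_40 - male_40))
```

The gap between the two fitted values is the female coefficient, -1549.14, whatever age is used, because this model has no interaction between sex and age.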
Of the four remaining variables, two can directly affect a person’s health: their BMI and whether they smoke. Adding these variables improves the model a great deal. The R-squared rises from 0.09308 to 0.7473 and the residual standard error (RSE) falls from 11540 to 6097. The p-value for female, however, rises to 0.75, well above our 0.05 threshold, and its coefficient changes to 106.89. Once age, BMI, and smoking are held constant, the estimated difference between females and males is small enough to be explained by random noise rather than a true effect.
MvF_cA_BMI_S <- lm(charges ~ female + age + bmi + smoker, data = insurance_w)
summary(MvF_cA_BMI_S)
##
## Call:
## lm(formula = charges ~ female + age + bmi + smoker, data = insurance_w)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -12365.5  -2974.8   -982.6   1476.6  29018.1
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11736.29     960.03 -12.225   <2e-16 ***
## female         106.89     334.91   0.319     0.75
## age            259.34      11.96  21.693   <2e-16 ***
## bmi            323.07      27.54  11.731   <2e-16 ***
## smoker       23832.22     414.39  57.511   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6097 on 1332 degrees of freedom
## Multiple R-squared: 0.7473, Adjusted R-squared: 0.7466
## F-statistic: 984.9 on 4 and 1332 DF, p-value: < 2.2e-16
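A coefficient can shrink like this when an added variable is related to both sex and charges. The noiseless toy data below are constructed (not taken from insurance.csv) so that charges depend only on smoking, with more smokers among the males; whether smoking plays this role in the real data is an assumption this sketch does not test.

```r
# Constructed data: 10 males (4 smokers) and 10 females (2 smokers).
toy <- data.frame(
  female = rep(c(0, 1), each = 10),
  smoker = c(rep(1, 4), rep(0, 6), rep(1, 2), rep(0, 8))
)
# Charges depend on smoking only; sex has no effect by construction.
toy$charges <- 5000 + 20000 * toy$smoker

raw      <- lm(charges ~ female, data = toy)
adjusted <- lm(charges ~ female + smoker, data = toy)

# Without smoker, female picks up the smoking gap between the groups.
print(coef(raw)["female"])
# With smoker in the model, the female coefficient is essentially zero.
print(round(coef(adjusted)["female"], 6))
```

In this constructed example the unadjusted female coefficient is -4000 (the difference between group means of 9000 and 13000), even though sex has no effect on charges at all.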
In the next model we add region and the number of children the insurance holder has. These variables add little to the model: the R-squared rises only from 0.7473 to 0.7508 and the RSE falls only from 6097 to 6066. The region where the insurance holder lives does not have a significant effect on charges, because all four region p-values are far greater than 0.05. The region standard errors (about 6085 each) are also very large; a likely cause is that the four indicator columns together nearly duplicate the intercept, so these coefficients should be read with caution. Children, on the other hand, do have some impact on charges: the p-value of 0.000609 is below 0.05, although it is higher than those for age, BMI, and smoker, and each additional child adds an estimated $474.13. Female still has a large p-value (0.70), so sex remains non-significant.
MvF_All <- lm(charges ~ female + age + bmi + smoker + children + southwest + southeast + northwest + northeast, data = insurance_w)
summary(MvF_All)
##
## Call:
## lm(formula = charges ~ female + age + bmi + smoker + children +
## southwest + southeast + northwest + northeast, data = insurance_w)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -11305.3  -2849.0   -980.2   1390.5  29992.8
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13616.63    6129.74  -2.221 0.026492 *
## female         127.63     333.45   0.383 0.701970
## age            256.67      11.93  21.523  < 2e-16 ***
## bmi            339.31      28.62  11.854  < 2e-16 ***
## smoker       23846.11     413.54  57.663  < 2e-16 ***
## children       474.13     137.99   3.436 0.000609 ***
## southwest      594.34    6085.35   0.098 0.922211
## southeast      518.94    6084.41   0.085 0.932043
## northwest     1209.01    6085.76   0.199 0.842557
## northeast     1554.45    6085.43   0.255 0.798424
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6066 on 1327 degrees of freedom
## Multiple R-squared: 0.7508, Adjusted R-squared: 0.7491
## F-statistic: 444.1 on 9 and 1327 DF, p-value: < 2.2e-16
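When every row has exactly one region flagged, the four indicator columns add up to the intercept column, and R cannot estimate all four alongside an intercept. The output above does report all four, which suggests the columns in insurance_w do not sum to 1 on every row; that is an inference from the output, not something stated in the write-up. The toy data below (invented for illustration) show the exact case and the usual remedy of leaving one region out as the baseline.

```r
# Invented data: 8 holders, exactly one region flagged per row.
toy <- data.frame(
  charges   = c(10, 12, 20, 22, 30, 33, 40, 45),
  southwest = c(1, 1, 0, 0, 0, 0, 0, 0),
  southeast = c(0, 0, 1, 1, 0, 0, 0, 0),
  northwest = c(0, 0, 0, 0, 1, 1, 0, 0),
  northeast = c(0, 0, 0, 0, 0, 0, 1, 1)
)

# All four indicators plus an intercept: the last one is aliased (NA).
fit_all <- lm(charges ~ southwest + southeast + northwest + northeast, data = toy)
print(coef(fit_all))

# Dropping one region makes it the baseline the others are compared with.
fit_base <- lm(charges ~ southwest + southeast + northwest, data = toy)
print(coef(fit_base))
```

In fit_base the intercept is the mean charge of the omitted region (northeast), and each remaining coefficient is that region's difference from it.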
For our final step we compare the model containing female, age, BMI, and smoker with a model that leaves female out. We remove female because its p-value (0.75) is high compared with the rest of the variables. A high p-value means we cannot distinguish the variable’s effect on the response from zero; the estimated effect could be random noise rather than a true effect. Removing female gives a simpler model that fits just as well: the R-squared stays at 0.7473, the adjusted R-squared edges up from 0.7466 to 0.7467, the RSE drops slightly from 6097 to 6094, and every remaining variable has a p-value below 2e-16.
MvF_corAll <- lm(charges ~ female + age + bmi + smoker, data = insurance_w)
summary(MvF_corAll)
##
## Call:
## lm(formula = charges ~ female + age + bmi + smoker, data = insurance_w)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -12365.5  -2974.8   -982.6   1476.6  29018.1
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11736.29     960.03 -12.225   <2e-16 ***
## female         106.89     334.91   0.319     0.75
## age            259.34      11.96  21.693   <2e-16 ***
## bmi            323.07      27.54  11.731   <2e-16 ***
## smoker       23832.22     414.39  57.511   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6097 on 1332 degrees of freedom
## Multiple R-squared: 0.7473, Adjusted R-squared: 0.7466
## F-statistic: 984.9 on 4 and 1332 DF, p-value: < 2.2e-16
LM_Final <- lm(charges ~ age + bmi + smoker, data = insurance_w)
summary(LM_Final)
##
## Call:
## lm(formula = charges ~ age + bmi + smoker, data = insurance_w)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -12415.2  -2974.4   -981.3   1490.3  28972.6
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11671.69     938.14  -12.44   <2e-16 ***
## age            259.43      11.95   21.71   <2e-16 ***
## bmi            322.64      27.50   11.73   <2e-16 ***
## smoker       23822.18     413.06   57.67   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6094 on 1333 degrees of freedom
## Multiple R-squared: 0.7473, Adjusted R-squared: 0.7467
## F-statistic: 1314 on 3 and 1333 DF, p-value: < 2.2e-16
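The two models above differ only in female, so they can also be compared directly with a partial F-test via anova(). The sketch below uses simulated data (the seed, sample size, and coefficients are arbitrary assumptions, not estimates from insurance.csv) to show the mechanics; on insurance_w the analogous call would presumably be anova(LM_Final, MvF_corAll).

```r
set.seed(1)  # arbitrary seed so the simulated data are reproducible
n <- 200

# Simulated stand-in data; charges do not depend on female here.
toy <- data.frame(
  female = rbinom(n, 1, 0.5),
  age    = sample(18:64, n, replace = TRUE),
  smoker = rbinom(n, 1, 0.2)
)
toy$charges <- 2000 + 250 * toy$age + 20000 * toy$smoker + rnorm(n, sd = 5000)

full    <- lm(charges ~ female + age + smoker, data = toy)
reduced <- lm(charges ~ age + smoker, data = toy)

# Partial F-test: does adding female reduce the residual sum of squares
# by more than chance would explain?
comparison <- anova(reduced, full)
print(comparison)

# With one dropped variable, the F-test p-value equals the t-test p-value
# for that variable in the full model.
p_f <- comparison[["Pr(>F)"]][2]
p_t <- summary(full)$coefficients["female", "Pr(>|t|)"]
print(c(p_f = p_f, p_t = p_t))
```

Because only one variable is dropped, this test gives the same answer as the t-test on female in the larger model; anova() becomes more useful when several variables, such as the four region columns, are removed at once.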
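In conclusion, we cannot say that males and females are charged differently based on their sex. The $1,405.40 gap in the first model disappears once age, BMI, and smoking are included: the female coefficient becomes 106.89 with a p-value of 0.75. There could still be outside noise or variables missing from this data set that we would need in order to draw a confident conclusion.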