Title: Health Insurance

Author: Jack Lustig


This report examines health insurance charges and asks whether males and females pay the same price. We use linear regression to estimate the difference in charges between males and females, adding other variables as controls. The data set is insurance, read from “insurance.csv”. The variables are sex (male or female), charges, age, the region where the insurance holder lives, the number of children they have, whether they are a smoker, and their BMI (body mass index, a measure of weight relative to height).

Before modeling, we adjust some of the variables. First, we rename the sex column to female and recode its values as numbers: 0 for males and 1 for females. Next, we split the region column into four separate columns (variables), one per region. Each of these columns contains a 1 if the insurance holder lives in that region and a 0 if they do not.
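
The recoding described above could be sketched in R roughly as follows. This is a minimal sketch under stated assumptions, not the code actually used for this report: the small data frame `insurance` is made-up illustration data standing in for “insurance.csv”, and the raw values ("male"/"female", "yes"/"no", and the four region names) are assumed. Recoding `smoker` to 0/1 is also an assumption, suggested only by the coefficient name `smoker` in the model output further down.

```r
# Toy stand-in for insurance.csv (values are made up for illustration only)
insurance <- data.frame(
  sex     = c("male", "female", "female", "male"),
  smoker  = c("yes", "no", "no", "no"),
  region  = c("southwest", "southeast", "northwest", "northeast"),
  charges = c(20000, 5000, 7000, 9000)
)

insurance_w <- insurance

# Rename sex to female and code males as 0, females as 1
names(insurance_w)[names(insurance_w) == "sex"] <- "female"
insurance_w$female <- ifelse(insurance_w$female == "female", 1, 0)

# Assumed: smoker is coded the same way (1 = smoker, 0 = non-smoker)
insurance_w$smoker <- ifelse(insurance_w$smoker == "yes", 1, 0)

# One 0/1 indicator column per region
regions <- c("southwest", "southeast", "northwest", "northeast")
for (r in regions) {
  insurance_w[[r]] <- as.integer(insurance_w$region == r)
}
insurance_w$region <- NULL

print(insurance_w)
```

Each row ends up with exactly one region column equal to 1, which is worth keeping in mind when all four columns are later used in the same model.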


We start by comparing charges for males and females only. The model output below shows that females are charged $1,405.40 less than males on average, and the difference is statistically significant at the 0.05 level (p = 0.0338). From this comparison alone we could say that men pay more than women, but the model ignores other variables that could affect charges, so it is hard to trust. One reason is its very small r^2 value (0.0034). A small r^2 means that there is a lot of noise and that the explanatory variable explains very little of the variation in the dependent variable. We need to add other variables to the model to determine whether males and females are really charged differently.

MvF <- lm(charges ~ female, data = insurance_w)
summary(MvF)
## 
## Call:
## lm(formula = charges ~ female, data = insurance_w)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12853  -8432  -3973   3500  51201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  13975.0      465.5  30.020   <2e-16 ***
## female       -1405.4      661.6  -2.124   0.0338 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12090 on 1335 degrees of freedom
## Multiple R-squared:  0.003369,   Adjusted R-squared:  0.002623 
## F-statistic: 4.513 on 1 and 1335 DF,  p-value: 0.03382
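
When the only predictor is a 0/1 indicator, the intercept is the average of the group coded 0 and the slope is the difference between the two group averages. The short sketch below only redoes that arithmetic with the numbers copied from the output above; the variable names are chosen here for illustration.

```r
# Coefficients copied from the summary(MvF) output above
intercept   <- 13975.0   # average charges when female = 0 (males)
female_coef <- -1405.4   # change in average charges when female = 1
female_se   <- 661.6     # standard error of the female coefficient

male_mean   <- intercept
female_mean <- intercept + female_coef

# The t value is the estimate divided by its standard error
t_value <- female_coef / female_se

cat("Average male charges:  ", male_mean, "\n")
cat("Average female charges:", female_mean, "\n")
cat("t value for female:    ", round(t_value, 3), "\n")
```

So the model is simply reporting average charges of about $13,975 for males and about $12,570 for females.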

For our next model we add age. Age is worth adding because it can be an important factor in a person’s health. Adding age improves the overall model: the r^2 rises from 0.003 to 0.093, although that is still very low. The p-value for female actually decreases, from 0.0338 to 0.0143, and the coefficient becomes more negative (-1,549.14), so at the same age females are estimated to be charged even less than males. The model is still not good, though, because there are more variables we could add that might affect both the charges and the r^2 value.

MvF_cA <- lm(charges ~ female + age, data = insurance_w)
summary(MvF_cA)
## 
## Call:
## lm(formula = charges ~ female + age, data = insurance_w)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8831  -6949  -5511   5446  48198 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3914.19     982.05   3.986 7.09e-05 ***
## female      -1549.14     631.45  -2.453   0.0143 *  
## age           258.32      22.49  11.487  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11540 on 1334 degrees of freedom
## Multiple R-squared:  0.09308,    Adjusted R-squared:  0.09172 
## F-statistic: 68.46 on 2 and 1334 DF,  p-value: < 2.2e-16
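
To see what these coefficients mean in dollars, the fitted equation can be evaluated by hand. The sketch below uses the estimates copied from the output above; the helper `predict_charges` and the example age of 40 are hypothetical and only serve as an illustration.

```r
# Coefficients copied from the summary(MvF_cA) output above
b_intercept <- 3914.19
b_female    <- -1549.14
b_age       <- 258.32

# Hypothetical helper: fitted charges for a given sex indicator and age
predict_charges <- function(female, age) {
  b_intercept + b_female * female + b_age * age
}

male_40   <- predict_charges(female = 0, age = 40)
female_40 <- predict_charges(female = 1, age = 40)

cat("Predicted charges, 40-year-old male:  ", male_40, "\n")
cat("Predicted charges, 40-year-old female:", female_40, "\n")
cat("Difference (female - male):           ", female_40 - male_40, "\n")
```

Whatever age is plugged in, the female minus male difference is always the female coefficient, because this model has no interaction between sex and age.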

Of the four remaining variables, two can have a direct effect on a person’s health: their BMI and whether they are a smoker. Adding both improves the model a great deal. The r^2 rises from 0.093 to 0.747 and the RSE falls from 11,540 to 6,097. However, the p-value for female rises to 0.75, well above our 0.05 threshold. This tells us that the estimated difference between females and males (now only $106.89) could easily be the result of random noise rather than a true effect.

MvF_cA_BMI_S <- lm(charges ~ female + age + bmi + smoker, data = insurance_w)
summary(MvF_cA_BMI_S)
## 
## Call:
## lm(formula = charges ~ female + age + bmi + smoker, data = insurance_w)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12365.5  -2974.8   -982.6   1476.6  29018.1 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -11736.29     960.03 -12.225   <2e-16 ***
## female         106.89     334.91   0.319     0.75    
## age            259.34      11.96  21.693   <2e-16 ***
## bmi            323.07      27.54  11.731   <2e-16 ***
## smoker       23832.22     414.39  57.511   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6097 on 1332 degrees of freedom
## Multiple R-squared:  0.7473, Adjusted R-squared:  0.7466 
## F-statistic: 984.9 on 4 and 1332 DF,  p-value: < 2.2e-16
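
Another way to read the large p-value for female is through a confidence interval. The sketch below rebuilds an approximate 95% interval and the p-value from the estimate, standard error, and residual degrees of freedom printed above. It is a check on the printed numbers, not a refit of the model, and the variable names are chosen here for illustration.

```r
# Values copied from the summary(MvF_cA_BMI_S) output above
female_coef <- 106.89
female_se   <- 334.91
resid_df    <- 1332

# 95% confidence interval: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df = resid_df)
ci     <- female_coef + c(-1, 1) * t_crit * female_se

# Two-sided p-value recomputed from the t statistic
t_value <- female_coef / female_se
p_value <- 2 * pt(-abs(t_value), df = resid_df)

cat("95% CI for female:", round(ci, 2), "\n")
cat("p-value:", round(p_value, 2), "\n")
```

The interval runs from a few hundred dollars below zero to several hundred above it, so the data are compatible with females paying slightly less, the same, or slightly more than comparable males.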

In the next model we add the region and the number of children the insurance holder has. The new variables barely change the model: the r^2 rises only from 0.7473 to 0.7508 and the RSE falls only from 6,097 to 6,066. The region where the insurance holder lives does not have a significant effect on charges, because the p-values for all four region columns are far greater than 0.05. Children, on the other hand, do have some impact: the p-value (0.0006) is less than 0.05, although it is higher than those of age, BMI, and smoker, and each additional child is associated with about $474 more in charges. Female still has a large p-value (0.70), so sex remains non-significant.

MvF_All <- lm(charges ~ female + age + bmi + smoker + children + southwest + southeast + northwest + northeast, data = insurance_w)
summary(MvF_All)
## 
## Call:
## lm(formula = charges ~ female + age + bmi + smoker + children + 
##     southwest + southeast + northwest + northeast, data = insurance_w)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11305.3  -2849.0   -980.2   1390.5  29992.8 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -13616.63    6129.74  -2.221 0.026492 *  
## female         127.63     333.45   0.383 0.701970    
## age            256.67      11.93  21.523  < 2e-16 ***
## bmi            339.31      28.62  11.854  < 2e-16 ***
## smoker       23846.11     413.54  57.663  < 2e-16 ***
## children       474.13     137.99   3.436 0.000609 ***
## southwest      594.34    6085.35   0.098 0.922211    
## southeast      518.94    6084.41   0.085 0.932043    
## northwest     1209.01    6085.76   0.199 0.842557    
## northeast     1554.45    6085.43   0.255 0.798424    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6066 on 1327 degrees of freedom
## Multiple R-squared:  0.7508, Adjusted R-squared:  0.7491 
## F-statistic: 444.1 on 9 and 1327 DF,  p-value: < 2.2e-16
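
One possible reason for the very large standard errors on the region terms and the intercept (all around 6,100) is that four region indicators plus an intercept carry redundant information: if every person lives in exactly one region, the four columns always add up to 1. This is an inference from the output, not something the data confirms. In the output above all four coefficients are still estimated, which suggests the columns in insurance_w are almost, rather than exactly, redundant. The sketch below uses simulated data (not the insurance data) to show what exact redundancy looks like in `lm()` and a common way to avoid it.

```r
set.seed(1)  # simulated data, for illustration only

regions <- c("southwest", "southeast", "northwest", "northeast")
n       <- 200
region  <- sample(regions, n, replace = TRUE)
age     <- sample(18:64, n, replace = TRUE)
charges <- 3000 + 250 * age + rnorm(n, sd = 2000)

toy <- data.frame(charges, age, region)
for (r in regions) toy[[r]] <- as.integer(toy$region == r)

# Every row has exactly one region indicator equal to 1,
# so the four columns always sum to the intercept column of 1s.
stopifnot(all(rowSums(toy[, regions]) == 1))

# With all four indicators plus an intercept, one coefficient is not estimable
fit_all <- lm(charges ~ age + southwest + southeast + northwest + northeast,
              data = toy)
print(coef(fit_all))   # the last region listed comes back as NA

# Treating region as a factor drops one reference level automatically
fit_factor <- lm(charges ~ age + region, data = toy)
print(coef(fit_factor))
```

With a reference level left out, each remaining region coefficient is read as the difference in charges relative to the omitted region.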

Finally, we compare the model with female, age, BMI, and smoker against the same model without female. We take female out because it has a high p-value (0.75) compared to the rest of the variables. A high p-value means we do not have evidence that the variable has a significant effect on the response variable; the estimated effect could be random noise rather than a true effect. Removing female leaves the fit essentially unchanged (the r^2 stays at 0.7473 and the RSE moves from 6,097 to 6,094) and gives a simpler model: every remaining variable has a very small p-value, and the RSE is relatively low.

MvF_corAll <- lm(charges ~ female + age + bmi + smoker , data = insurance_w)
summary(MvF_corAll)
## 
## Call:
## lm(formula = charges ~ female + age + bmi + smoker, data = insurance_w)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12365.5  -2974.8   -982.6   1476.6  29018.1 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -11736.29     960.03 -12.225   <2e-16 ***
## female         106.89     334.91   0.319     0.75    
## age            259.34      11.96  21.693   <2e-16 ***
## bmi            323.07      27.54  11.731   <2e-16 ***
## smoker       23832.22     414.39  57.511   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6097 on 1332 degrees of freedom
## Multiple R-squared:  0.7473, Adjusted R-squared:  0.7466 
## F-statistic: 984.9 on 4 and 1332 DF,  p-value: < 2.2e-16
LM_Final <- lm(charges ~ age + bmi + smoker, data = insurance_w)
summary(LM_Final)
## 
## Call:
## lm(formula = charges ~ age + bmi + smoker, data = insurance_w)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12415.2  -2974.4   -981.3   1490.3  28972.6 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -11671.69     938.14  -12.44   <2e-16 ***
## age            259.43      11.95   21.71   <2e-16 ***
## bmi            322.64      27.50   11.73   <2e-16 ***
## smoker       23822.18     413.06   57.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6094 on 1333 degrees of freedom
## Multiple R-squared:  0.7473, Adjusted R-squared:  0.7467 
## F-statistic:  1314 on 3 and 1333 DF,  p-value: < 2.2e-16
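
The two models have the same r^2 to four decimal places, yet the adjusted r^2 is slightly higher without female. That follows from how adjusted r^2 penalizes each extra predictor. The sketch below applies the standard formula to the printed values; the helper name `adj_r2` is chosen here for illustration, and because the inputs are rounded, the results can differ from the printed output in the last digit.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 1337  # residual df + number of estimated coefficients, e.g. 1333 + 4

with_female    <- adj_r2(r2 = 0.7473, n = n, p = 4)
without_female <- adj_r2(r2 = 0.7473, n = n, p = 3)

cat("Adjusted R^2 with female:   ", round(with_female, 4), "\n")
cat("Adjusted R^2 without female:", round(without_female, 4), "\n")
```

Since female adds almost nothing to the explained variation, the penalty for carrying an extra predictor outweighs its contribution.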

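In conclusion, we do not find evidence that males and females are charged differently based on their sex. The $1,405 gap in the first model disappears once age, BMI, and smoking are accounted for, and the remaining difference is not statistically significant. This does not prove that charges are identical; there could be noise or other variables not in this data set that we would need in order to reach a confident conclusion.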