Introduction

In this analysis, I try to answer the question “Are men charged significantly more for health insurance than women?” using a data set imported from Excel. I fit several linear regression models to determine whether sex has a statistically significant effect on the cost of health insurance.

Packages

The packages that I load are tidyverse, readxl, and dplyr. Tidyverse and dplyr let me sort and clean the data before analysis (dplyr is already attached as part of tidyverse, so loading it separately is redundant but harmless). Readxl lets me read the Excel file that the data is stored in.

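# Load all packages before calling any of their functions
library(tidyverse)  # attaches dplyr and ggplot2, among others
library(readxl)     # provides read_excel()
library(dplyr)      # already attached by tidyverse; kept for explicitness

# Read the Excel file into a data frame (the path is specific to the author's machine)
insurance <- read_excel("C:/Users/maize/Downloads/insurance.xlsx")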

Data Set

The data set used in this report is imported from Excel. It has seven variables and 1,338 observations. The variables are age, sex, bmi, number of children, smoker (yes/no), region (geographical), and charges for health insurance.

Average Insurance Charges by Sex

average_charges_by_sex <- insurance %>%
  group_by(sex) %>%
  summarize(average_charges = mean(charges, na.rm = TRUE))

average_charges_by_sex

Based on this summary, males are charged more for insurance on average than females. It is tempting to conclude that being male means you will be charged more for health insurance, but a difference in group averages alone does not show that the difference is statistically significant, and it does not account for any other variable.
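A difference in group means can be checked formally with a two-sample test. The sketch below is a minimal illustration on simulated data, not the insurance data: the data frame `toy` and the means and standard deviations used to generate it are made-up assumptions. It uses Welch's t-test through base R's `t.test()`, which is only one of several reasonable choices here.

```r
# Minimal sketch on simulated data (NOT the insurance data set).
# `toy`, its columns, and the chosen means/SDs are hypothetical.
set.seed(42)

n <- 200
toy <- data.frame(
  sex = rep(c("female", "male"), each = n),
  charges = c(rnorm(n, mean = 12500, sd = 11000),
              rnorm(n, mean = 14000, sd = 13000))
)

# Group means, analogous to the summary table above
aggregate(charges ~ sex, data = toy, FUN = mean)

# Welch two-sample t-test: the null hypothesis is equal mean charges
welch <- t.test(charges ~ sex, data = toy)
welch$p.value   # compare against the chosen significance level, e.g. 0.05
welch$conf.int  # 95% confidence interval for the difference in means
```

On the real data the same call would be `t.test(charges ~ sex, data = insurance)`. Even a significant result would only describe the raw difference between the groups, not the effect of sex after other variables are taken into account.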

ggplot(insurance, aes(x = sex, y = charges, fill = sex)) +
  geom_boxplot() + 
  labs(title = "Insurance Charges by Sex", x = "Sex", y = "Insurance Charges", fill = "Sex") +
  scale_fill_manual(values = c("female" = "pink", "male" = "lightblue"))

After looking through the data and this box plot, we cannot simply assume that sex has a significant effect on the cost of health insurance. The data also suggest that age may play a part in the cost, not only whether you are male or female.

model_with_age <- lm(charges ~ sex + age, data = insurance) 
summary(model_with_age)
## 
## Call:
## lm(formula = charges ~ sex + age, data = insurance)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8821  -6947  -5511   5443  48203 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2343.62     994.35   2.357   0.0186 *  
## sexmale      1538.83     631.08   2.438   0.0149 *  
## age           258.87      22.47  11.523   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11540 on 1335 degrees of freedom
## Multiple R-squared:  0.09344,    Adjusted R-squared:  0.09209 
## F-statistic:  68.8 on 2 and 1335 DF,  p-value: < 2.2e-16

Does Age Have an Impact?

The model above adds age as a predictor alongside sex. In the summary, both sex (p = 0.0149) and age (p < 2e-16) are statistically significant, although the adjusted R-squared of 0.092 means the model explains only about 9% of the variation in charges.

ggplot(insurance, aes(x = age, y = charges, color = sex)) + 
  geom_point(alpha = 0.5) + 
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Insurance Charges by Age", x = "Age", y = "Insurance Charges", color = "Sex") +
  scale_color_manual(values = c("female" = "pink", "male" = "lightblue"))

Although both predictors in this model are statistically significant, the scatter plot shows three distinct bands of points. This suggests that sex and age alone leave out other factors that affect the cost of health insurance.

Health Impact on Health Insurance

One big factor that this model ignores is whether the person is healthy. The data set also includes bmi and whether the person is a smoker. Both variables are closely tied to a person's health and should help explain the cost of health insurance.

model_with_health <- lm(charges ~ sex + age + bmi + smoker, data = insurance) 
summary(model_with_health)
## 
## Call:
## lm(formula = charges ~ sex + age + bmi + smoker, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12364.7  -2972.2   -983.2   1475.8  29018.3 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -11633.49     947.27 -12.281   <2e-16 ***
## sexmale       -109.04     334.66  -0.326    0.745    
## age            259.45      11.94  21.727   <2e-16 ***
## bmi            323.05      27.53  11.735   <2e-16 ***
## smokeryes    23833.87     414.19  57.544   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6094 on 1333 degrees of freedom
## Multiple R-squared:  0.7475, Adjusted R-squared:  0.7467 
## F-statistic: 986.5 on 4 and 1333 DF,  p-value: < 2.2e-16

The overall p-value (< 2.2e-16) and the adjusted R-squared (0.7467, up from 0.092) indicate that this model fits far better. As with the previous model, though, it is important to look beyond the summary statistics, so next we examine the coefficients.

coef(model_with_health)
## (Intercept)     sexmale         age         bmi   smokeryes 
## -11633.4946   -109.0411    259.4532    323.0511  23833.8700
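To make the coefficients concrete, the sketch below plugs them into the fitted equation by hand. The coefficient values are copied from the output above; the example person (a 40-year-old male with a BMI of 30) and the helper `predict_charge()` are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the coef(model_with_health) output above
b <- c(intercept = -11633.4946, sexmale = -109.0411,
       age = 259.4532, bmi = 323.0511, smokeryes = 23833.8700)

# Hypothetical helper: predicted charge for one person.
# `male` and `smoker` are 0/1 indicators, matching R's dummy coding.
predict_charge <- function(male, age, bmi, smoker) {
  unname(b["intercept"] + b["sexmale"] * male + b["age"] * age +
           b["bmi"] * bmi + b["smokeryes"] * smoker)
}

# Illustrative person (not a row from the data): male, age 40, BMI 30
non_smoker <- predict_charge(male = 1, age = 40, bmi = 30, smoker = 0)
smoker     <- predict_charge(male = 1, age = 40, bmi = 30, smoker = 1)

non_smoker           # about 8327
smoker               # about 32161
smoker - non_smoker  # equals the smokeryes coefficient, 23833.87
```

Changing `male` from 1 to 0 moves the prediction by only about 109, which is small next to the smoker gap. With the fitted model object available, `predict(model_with_health, newdata = ...)` would give the same numbers without the manual arithmetic.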

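Being a smoker has a large, significant effect on the cost of health insurance: holding the other predictors fixed, smokers are charged about 23,834 more than non-smokers. The coefficient for male is now slightly negative (about -109) and not statistically significant (p = 0.745), so once age, bmi, and smoking are accounted for there is no evidence that men are charged more than women. This contradicts the impression given by the first model.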
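One possible explanation for a coefficient changing this much between models is omitted-variable bias: if a left-out predictor is related to both sex and charges, the sex coefficient can absorb part of its effect. Whether that is what happens in the insurance data has not been checked here. The sketch below only demonstrates the mechanism on simulated data, where every name and number (including the assumption that smoking is more common among males) is hypothetical.

```r
# Minimal simulation (NOT the insurance data); all names and numbers are hypothetical.
set.seed(1)
n <- 2000

male <- rbinom(n, size = 1, prob = 0.5)
# Assumption for illustration only: smoking is more common among males
smoker <- rbinom(n, size = 1, prob = ifelse(male == 1, 0.30, 0.15))
# Charges depend on smoking only; the true effect of sex is zero
charges <- 8000 + 24000 * smoker + rnorm(n, sd = 6000)

toy <- data.frame(charges, male, smoker)

naive    <- lm(charges ~ male, data = toy)           # smoker left out
adjusted <- lm(charges ~ male + smoker, data = toy)  # smoker included

coef(naive)["male"]     # picks up part of the smoker effect
coef(adjusted)["male"]  # close to zero once smoker is in the model
```

In this toy setup the naive model credits sex with a difference that really comes from smoking, and the adjusted model removes it.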

Full Data Analysis

Including four predictors from the data set has produced a model that fits well. We can also check whether adding number of children and region improves the model of health insurance cost.

model_full <- lm(charges ~ sex + age + bmi + smoker + children + region, data = insurance)

summary(model_full)
## 
## Call:
## lm(formula = charges ~ sex + age + bmi + smoker + children + 
##     region, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11304.9  -2848.1   -982.1   1393.9  29992.8 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11938.5      987.8 -12.086  < 2e-16 ***
## sexmale           -131.3      332.9  -0.394 0.693348    
## age                256.9       11.9  21.587  < 2e-16 ***
## bmi                339.2       28.6  11.860  < 2e-16 ***
## smokeryes        23848.5      413.1  57.723  < 2e-16 ***
## children           475.5      137.8   3.451 0.000577 ***
## regionnorthwest   -353.0      476.3  -0.741 0.458769    
## regionsoutheast  -1035.0      478.7  -2.162 0.030782 *  
## regionsouthwest   -960.0      477.9  -2.009 0.044765 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7494 
## F-statistic: 500.8 on 8 and 1329 DF,  p-value: < 2.2e-16

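Checking the new predictors in the same way, number of children is significant (p < 0.001), and the southeast and southwest regions differ significantly from the baseline region at the 0.05 level, while the northwest does not (p = 0.459). The adjusted R-squared rises only slightly, from 0.7467 to 0.7494, and sex remains non-significant (p = 0.693).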
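Whether extra predictors are worth adding can also be tested directly by comparing the smaller and larger models. The sketch below shows one way to do that with a partial F-test (`anova()`) and per-term tests (`drop1()`). It runs on simulated data, not the insurance data; the data frame `toy` and all effect sizes are assumptions made up for the example.

```r
# Minimal sketch on simulated data (NOT the insurance data);
# names and effect sizes are hypothetical.
set.seed(7)
n <- 500

toy <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  smoker   = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2))),
  children = sample(0:5, n, replace = TRUE),
  region   = factor(sample(c("northeast", "northwest", "southeast", "southwest"),
                           n, replace = TRUE))
)
# In this toy setup children affects charges and region does not
toy$charges <- 2000 + 250 * toy$age + 24000 * (toy$smoker == "yes") +
  500 * toy$children + rnorm(n, sd = 6000)

reduced <- lm(charges ~ age + smoker, data = toy)
full    <- lm(charges ~ age + smoker + children + region, data = toy)

# Partial F-test: do children and region jointly improve on the reduced model?
cmp <- anova(reduced, full)
cmp

# Adjusted R-squared penalizes extra predictors, so it is the fairer comparison
c(reduced = summary(reduced)$adj.r.squared,
  full    = summary(full)$adj.r.squared)

# drop1() tests each term as a whole, so region gets one F-test
# instead of three separate t-tests against the baseline level
drop1(full, test = "F")
```

On the real models the comparison would be `anova(model_with_health, model_full)`, which is valid because the first model is nested inside the second.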

Conclusion

Based on the full model, sex is not a meaningful predictor of the cost of health insurance. Once the other variables are controlled for, sex has little to no effect on charges. Instead, smoking status has by far the largest effect, followed by age and bmi, with number of children and region having smaller effects.