Introduction

This report looks at health insurance charges to understand whether men are charged more than women. To do this, we’ll analyze data that includes age, sex, BMI, smoking status, number of children, and region of residence. The main goal is to figure out whether being male is associated with higher health insurance charges, even after accounting for other factors.

We will start by comparing the average charges for men and women. Then, we’ll use statistical tests to see if the differences are meaningful. After that, we’ll consider the other factors using similar methods. By the end of the report we should have a clear picture of which variables have the greatest effect on cost. The data is retrieved from an online Google spreadsheet named Insurance.

Throughout the report we will use several statistics to draw conclusions about the data. The first is the coefficient, which estimates how much the average insurance charge changes for a one-unit increase in a variable (or, for a category such as sex, compared with the baseline category) while the other variables in the model are held constant. The next is the p-value, which tells us whether the effect of a variable is statistically significant. If the p-value is less than .05, we will consider the effect statistically significant. Lastly, we will consider the R-squared value, which shows the share of the variability in charges that the model explains. The closer it is to 100%, the more of the variability the model accounts for.
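As a minimal sketch of where these three statistics live in R, the example below fits a model and pulls each one out of the summary. The `toy` data frame and its values are made up for illustration only and are not the Insurance data.

```r
# Toy data: made-up values for illustration only, not the Insurance data.
toy <- data.frame(
  charges = c(2100, 3500, 4200, 5100, 6900, 7400, 8800, 9900),
  age     = c(19, 25, 31, 38, 44, 50, 57, 63)
)

fit <- lm(charges ~ age, data = toy)
fit_summary <- summary(fit)

# Coefficient table: one row per term, with the columns
# "Estimate", "Std. Error", "t value" and "Pr(>|t|)".
coef_table <- coef(fit_summary)

age_estimate <- coef_table["age", "Estimate"]  # change in charges per year of age
age_p_value  <- coef_table["age", "Pr(>|t|)"]  # p-value for the age effect
r_squared    <- fit_summary$r.squared          # share of variability explained

# Apply the report's .05 threshold.
is_significant <- age_p_value < 0.05
```

The same three lookups should work on any of the models fitted later in the report, with the row name swapped for the term of interest (for example "sexmale").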

Below is the code that shows the data set:

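library(DT)  # datatable() comes from the DT package
datatable(insurance, options = list(scrollX = TRUE))  # scrollX (capital X) turns on horizontal scrolling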

The tidyverse library will be used for visualizations throughout the report.

library(tidyverse)

Average Insurance Costs for Men and Women

First, we will simply compare the average insurance charges for men and women.

avg_charges_model <- lm(charges ~ sex, data = insurance)

summary(avg_charges_model)
## 
## Call:
## lm(formula = charges ~ sex, data = insurance)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12835  -8435  -3980   3476  51201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  12569.6      470.1  26.740   <2e-16 ***
## sexmale       1387.2      661.3   2.098   0.0361 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12090 on 1336 degrees of freedom
## Multiple R-squared:  0.003282,   Adjusted R-squared:  0.002536 
## F-statistic:   4.4 on 1 and 1336 DF,  p-value: 0.03613

The model shows that, on average, males are charged $1,387.20 more than females for insurance. This difference is statistically significant, as the p-value is .0361, which is less than our threshold of .05.
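With sex as the only predictor, the intercept is the average charge for the baseline group (females, $12,569.60 in the output above) and the sexmale coefficient is the difference between the two group averages, so the male average works out to 12,569.60 + 1,387.20 = $13,956.80. The sketch below checks that relationship on a small made-up data frame; the `toy` values are assumptions for illustration, not the Insurance data.

```r
# Toy data: made-up values for illustration only, not the Insurance data.
toy <- data.frame(
  charges = c(1000, 2000, 3000, 2500, 3500, 4500),
  sex     = c("female", "female", "female", "male", "male", "male")
)

# Average charges per group, computed directly.
group_means <- tapply(toy$charges, toy$sex, mean)
mean_difference <- unname(group_means["male"] - group_means["female"])

# The same comparison expressed as a regression.
fit <- lm(charges ~ sex, data = toy)
intercept    <- unname(coef(fit)["(Intercept)"])  # average for the baseline group (female)
sexmale_coef <- unname(coef(fit)["sexmale"])      # male average minus female average
```

Here `sexmale_coef` and `mean_difference` agree, which is why the single-variable model is just another way of comparing the two averages.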

Accounting for Age as a Factor

The previous analysis does not account for age, which means there could be problems with it. We also see that the R-squared value from the linear model is very low at 0.33%, indicating that gender explains only a small part of the variability in charges. This means there may be confounding variables like age that we should account for. Insurance charges tend to increase with age, and men and women might have different age distributions in this data set.

avg_charges_age_model <- lm(charges ~ sex + age, data = insurance)

summary(avg_charges_age_model)
## 
## Call:
## lm(formula = charges ~ sex + age, data = insurance)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8821  -6947  -5511   5443  48203 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2343.62     994.35   2.357   0.0186 *  
## sexmale      1538.83     631.08   2.438   0.0149 *  
## age           258.87      22.47  11.523   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11540 on 1335 degrees of freedom
## Multiple R-squared:  0.09344,    Adjusted R-squared:  0.09209 
## F-statistic:  68.8 on 2 and 1335 DF,  p-value: < 2.2e-16

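The model shows that men are charged an average of $1,538.83 more than women, even after accounting for age. This is $151.63 more than the difference in the previous model. The effect of sex is statistically significant, with a p-value of .0149. Each additional year of age is associated with $258.87 more in charges, which is also statistically significant because the p-value for age is below 2e-16, far smaller than .05. The R-squared is 9.3%, which means this model explains more of the variability in charges than the model that does not account for age.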
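To make the coefficients concrete, the sketch below plugs the estimates printed in the summary above into the model equation and compares a 40-year-old woman with a 40-year-old man. The function name `predict_charges` and the chosen age are illustrative assumptions; the three numbers are copied from the output.

```r
# Estimates copied from the summary of avg_charges_age_model above.
predict_charges <- function(age, male) {
  intercept <- 2343.62
  b_sexmale <- 1538.83
  b_age     <- 258.87
  # male is TRUE/FALSE; as.numeric() turns it into the 1/0 dummy variable.
  intercept + b_sexmale * as.numeric(male) + b_age * age
}

woman_40 <- predict_charges(age = 40, male = FALSE)  # 2343.62 + 258.87 * 40 = 12698.42
man_40   <- predict_charges(age = 40, male = TRUE)   # 12698.42 + 1538.83 = 14237.25

# At any fixed age, the gap between the two is the sexmale coefficient.
gap <- man_40 - woman_40
```

With the fitted model object available, `predict(avg_charges_age_model, newdata = ...)` should give the same values up to rounding of the printed estimates.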

Accounting for Health Factors

The R-squared from the previous model is still low, so there are clearly other variables we should consider in our analysis. Specifically, the variables in this data set that affect the health of an individual are smoking and BMI. Both of these variables could have an impact on the cost of insurance, so we will account for them. We expect these factors to have a positive relationship with the cost of insurance.

charges_health_model <- lm(charges ~ sex + age + bmi + smoker, data = insurance)

summary(charges_health_model)
## 
## Call:
## lm(formula = charges ~ sex + age + bmi + smoker, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12364.7  -2972.2   -983.2   1475.8  29018.3 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -11633.49     947.27 -12.281   <2e-16 ***
## sexmale       -109.04     334.66  -0.326    0.745    
## age            259.45      11.94  21.727   <2e-16 ***
## bmi            323.05      27.53  11.735   <2e-16 ***
## smokeryes    23833.87     414.19  57.544   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6094 on 1333 degrees of freedom
## Multiple R-squared:  0.7475, Adjusted R-squared:  0.7467 
## F-statistic: 986.5 on 4 and 1333 DF,  p-value: < 2.2e-16

Accounting for these additional factors gives a completely different outcome than before. For each one-unit increase in BMI, charges increase by $323.05. Smokers are charged $23,833.87 more on average than non-smokers, which is the largest effect. Both effects are statistically significant, with p-values below 2e-16. On the other hand, the effect of sex after accounting for these variables is reduced to $109.04 less for males (that is, $109.04 more for females), and the p-value of .745 shows that this difference is not statistically significant. The R-squared value is 0.7475, indicating that 74.75% of the variability in insurance charges is explained by the model. The results are also consistent with our expectation that these factors have a positive relationship with charges. A likely reason is that the more at risk someone is, the more they are charged for insurance: smoking and a higher BMI both increase the risk of health complications, so they increase the cost of insurance.
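Another way to read the result for sex is through a 95% confidence interval, which is not part of the original analysis but follows directly from the printed estimate and standard error. The sketch below computes it by hand; the variable names are illustrative.

```r
# Values copied from the sexmale row of the summary above.
estimate  <- -109.04
std_error <- 334.66
df_resid  <- 1333  # residual degrees of freedom reported by the model

# Critical t value for a two-sided 95% interval (very close to 1.96 here).
t_crit <- qt(0.975, df = df_resid)

# Estimate plus or minus t_crit standard errors.
ci <- estimate + c(-1, 1) * t_crit * std_error

# An interval that spans zero matches a p-value above .05.
contains_zero <- ci[1] < 0 && ci[2] > 0
```

The interval runs from roughly $766 below zero to roughly $547 above it, so the data is compatible with men being charged somewhat less, somewhat more, or the same as women. With the fitted model in memory, `confint(charges_health_model)` should report the same interval.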

Accounting for Additional Factors

The last model had a high R-squared value, but it is possible that there are still additional factors that we should account for. We will test this by adding the last two variables in the data set, children and region, to the analysis.

charges_full_model <- lm(charges ~ sex + age + bmi + smoker + region + children, data = insurance)

summary(charges_full_model)
## 
## Call:
## lm(formula = charges ~ sex + age + bmi + smoker + region + children, 
##     data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11304.9  -2848.1   -982.1   1393.9  29992.8 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11938.5      987.8 -12.086  < 2e-16 ***
## sexmale           -131.3      332.9  -0.394 0.693348    
## age                256.9       11.9  21.587  < 2e-16 ***
## bmi                339.2       28.6  11.860  < 2e-16 ***
## smokeryes        23848.5      413.1  57.723  < 2e-16 ***
## regionnorthwest   -353.0      476.3  -0.741 0.458769    
## regionsoutheast  -1035.0      478.7  -2.162 0.030782 *  
## regionsouthwest   -960.0      477.9  -2.009 0.044765 *  
## children           475.5      137.8   3.451 0.000577 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7494 
## F-statistic: 500.8 on 8 and 1329 DF,  p-value: < 2.2e-16

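Children is statistically significant, and two of the three region coefficients are as well, so both variables are worth including. The Northeast region does not appear in the table because it is the baseline that the other regions are compared against. The Northwest coefficient is the only region coefficient that is not statistically significant (p-value of .459). Compared with the Northeast, charges in the Southeast are $1,035 lower on average, and charges in the Southwest are $960 lower. One possible explanation is that the cost of living is lower in these regions, although this data set cannot confirm that. Children has a positive effect on charges: each additional child is associated with $475.50 more in insurance charges. This also makes sense, as children often need additional coverage that costs more. The sex coefficient remains small and not statistically significant (p-value of .693).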
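The reason one region is missing from the coefficient table is that R treats the first level of a categorical variable as the baseline, and by default the levels are sorted alphabetically. The sketch below shows this on made-up data and how `relevel()` can pick a different baseline; the `toy` values and the `region_sw` column are assumptions for illustration, not part of the Insurance data.

```r
# Toy data: made-up values for illustration only, not the Insurance data.
toy <- data.frame(
  charges = c(1200, 1500, 900, 1100, 1300, 1000, 1400, 950),
  region  = rep(c("northeast", "northwest", "southeast", "southwest"), times = 2)
)

# Levels are sorted alphabetically, so "northeast" comes first
# and becomes the baseline category.
baseline <- levels(factor(toy$region))[1]

fit <- lm(charges ~ region, data = toy)
default_terms <- names(coef(fit))  # no "regionnortheast" term appears

# Choose a different baseline; the other regions are then
# compared against the Southwest instead.
toy$region_sw <- relevel(factor(toy$region), ref = "southwest")
fit_sw <- lm(charges ~ region_sw, data = toy)
releveled_terms <- names(coef(fit_sw))
```

Changing the baseline does not change the fitted values; it only changes which region each coefficient and p-value is measured against.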

Conclusion

After controlling for key health-related variables, the data shows no significant difference in insurance charges between men and women. This suggests that gender alone does not explain variations in charges. Instead, factors like smoking, BMI, and age are the primary determinants. Smoking has the largest effect, with an increase of nearly $24,000 for a person who smokes. Other factors such as children and region also have an effect.