This report looks at health insurance charges to understand if men are charged more than women. To do this, we’ll analyze data that includes information like age, gender, bmi, smoking status, number of children, and where people live. The main goal is to figure out if being male leads to higher charges for health insurance, even after considering other factors.
We will start by comparing the average charges for men and women. Then, we’ll use statistical tests to see if the differences are meaningful. After that, we’ll consider the other factors using similar methods. By the end of the report we should have a clear picture of which variables have the greatest effect on cost. The data is retrieved from an online Google spreadsheet named Insurance.
Throughout the report we will use three statistics to draw conclusions from the data. The first is the coefficient, which estimates the difference in average insurance charges associated with a variable, holding the other variables in the model constant. The second is the p-value, which tells us whether the effect of a variable is statistically significant. If the p-value is less than .05, we will treat the effect as statistically significant. Lastly, we will consider the R-squared value, which measures the share of the variability in charges that the model explains. The closer it is to 100%, the better the model fits the data.
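As a rough sketch of where these three statistics come from in R, the snippet below fits a small model and pulls each value out of its summary. It uses the built-in `mtcars` data purely as a stand-in, so the variable names (`mpg`, `am`) and the object names are illustrative assumptions and not part of the insurance analysis.

```r
# Illustration only: mtcars ships with base R and stands in for the insurance data.
fit <- lm(mpg ~ am, data = mtcars)
fit_summary <- summary(fit)

# Coefficient table: one row per term, with columns for the estimate,
# standard error, t value, and p-value.
coef_table <- coef(fit_summary)

estimate  <- coef_table["am", "Estimate"]   # difference in average mpg between the two groups
p_value   <- coef_table["am", "Pr(>|t|)"]   # p-value for that difference
r_squared <- fit_summary$r.squared          # share of variability explained (0 to 1)

# Apply the .05 threshold used in this report.
is_significant <- p_value < 0.05
```

The same three lookups apply to any `lm` fit, so they can be reused on each model in the report.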
Below is the code that shows the data set:
library(DT)
datatable(insurance, options = list(scrollX = TRUE))
The tidyverse library will be used for visualizations throughout the report.
library(tidyverse)
First we will simply compare the average cost for men versus women for insurance.
avg_charges_model <- lm(charges ~ sex, data = insurance)
summary(avg_charges_model)
##
## Call:
## lm(formula = charges ~ sex, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12835 -8435 -3980 3476 51201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12569.6 470.1 26.740 <2e-16 ***
## sexmale 1387.2 661.3 2.098 0.0361 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12090 on 1336 degrees of freedom
## Multiple R-squared: 0.003282, Adjusted R-squared: 0.002536
## F-statistic: 4.4 on 1 and 1336 DF, p-value: 0.03613
The model shows that, on average, males are charged $1,387.20 more than females for insurance. The difference is statistically significant: the p-value is .0361, which is less than our threshold of .05.
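With a single two-level predictor such as sex, the intercept is the average for the baseline group and the coefficient is the gap between the two group averages. The sketch below checks this on a tiny made-up data frame; the `toy` values are hypothetical and are not taken from the insurance data.

```r
# Hypothetical toy data, invented for illustration; not the insurance data set.
toy <- data.frame(
  sex     = factor(c("female", "female", "female", "male", "male", "male")),
  charges = c(1000, 1200, 1400, 1500, 1700, 1900)
)

fit <- lm(charges ~ sex, data = toy)

# Average charge within each group, computed directly.
group_means <- tapply(toy$charges, toy$sex, mean)
mean_gap    <- unname(group_means["male"] - group_means["female"])

# The same quantities read off the fitted model.
intercept <- unname(coef(fit)["(Intercept)"])  # average for the baseline group (female)
sex_coef  <- unname(coef(fit)["sexmale"])      # male average minus female average
```

Here `intercept` matches the female average and `sex_coef` matches `mean_gap`, which is how the `(Intercept)` and `sexmale` rows in the output can be read.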
The previous analysis does not account for age, which means there could be problems with it. We also see that the R-squared value from the linear model is very low at 0.33%, indicating that gender explains very little of the variability in charges. This means there may be confounding variables, like age, that we should account for. Insurance charges tend to increase with age, and men and women might have different age distributions in this data set.
avg_charges_age_model <- lm(charges ~ sex + age, data = insurance)
summary(avg_charges_age_model)
##
## Call:
## lm(formula = charges ~ sex + age, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8821 -6947 -5511 5443 48203
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2343.62 994.35 2.357 0.0186 *
## sexmale 1538.83 631.08 2.438 0.0149 *
## age 258.87 22.47 11.523 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11540 on 1335 degrees of freedom
## Multiple R-squared: 0.09344, Adjusted R-squared: 0.09209
## F-statistic: 68.8 on 2 and 1335 DF, p-value: < 2.2e-16
The model shows that men are charged an average of $1,538.83 more than women, even after accounting for age. This is $151.63 more than the $1,387.20 gap in the model without age. The difference is statistically significant because the p-value for sex is .0149, which is less than .05. Age matters as well: each additional year of age is associated with $258.87 more in charges, with a p-value below 2e-16. The R-squared is 9.3%, which means this model explains more of the variability in charges than the model that does not account for age.
The R-squared of the previous model is still low, so there are clearly other variables we should consider in our analysis. Specifically, the variables in this data set that affect the health of an individual are smoking and BMI. Both of these variables could have an impact on the cost of insurance, so we will account for them. These factors will likely have a positive relationship with the cost of insurance.
charges_health_model <- lm(charges ~ sex + age + bmi + smoker, data = insurance)
summary(charges_health_model)
##
## Call:
## lm(formula = charges ~ sex + age + bmi + smoker, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12364.7 -2972.2 -983.2 1475.8 29018.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11633.49 947.27 -12.281 <2e-16 ***
## sexmale -109.04 334.66 -0.326 0.745
## age 259.45 11.94 21.727 <2e-16 ***
## bmi 323.05 27.53 11.735 <2e-16 ***
## smokeryes 23833.87 414.19 57.544 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6094 on 1333 degrees of freedom
## Multiple R-squared: 0.7475, Adjusted R-squared: 0.7467
## F-statistic: 986.5 on 4 and 1333 DF, p-value: < 2.2e-16
Accounting for these additional factors gives a completely different outcome. For each one-unit increase in BMI, charges increase by $323.05. Smokers are charged $23,833.87 more on average than non-smokers, which is the largest effect. Both effects are statistically significant, with p-values below 2e-16. In contrast, the effect of sex after accounting for these variables shrinks to $109.04 more for females, and the p-value of .745 shows that this difference is not statistically significant. The R-squared value is 0.7475, indicating that 74.75% of the variability in insurance charges is explained by the model. Our hypothesis that these factors have a positive relationship with cost is also supported. A likely reason is that the more at risk someone is, the more they are charged for insurance. Smoking and a higher BMI both increase the risk of health complications, so they increase the cost of insurance.
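To make the coefficients concrete, the sketch below plugs them into the fitted equation for a few example people. The coefficient values are copied from the output above; the helper `predict_charge` and the example person (age 40, BMI 30) are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the summary(charges_health_model) output above.
coefs <- c(intercept = -11633.49, sexmale = -109.04, age = 259.45,
           bmi = 323.05, smokeryes = 23833.87)

# Hypothetical helper: predicted charge for one person.
# `male` and `smoker` are 0/1 indicators, matching the dummy coding in the output.
predict_charge <- function(male, age, bmi, smoker) {
  unname(coefs["intercept"] +
           coefs["sexmale"] * male +
           coefs["age"] * age +
           coefs["bmi"] * bmi +
           coefs["smokeryes"] * smoker)
}

smoker_male    <- predict_charge(male = 1, age = 40, bmi = 30, smoker = 1)
smoker_female  <- predict_charge(male = 0, age = 40, bmi = 30, smoker = 1)
nonsmoker_male <- predict_charge(male = 1, age = 40, bmi = 30, smoker = 0)

# Holding age, BMI, and smoking fixed, the male/female gap is the sexmale coefficient.
sex_gap    <- smoker_male - smoker_female
# Holding sex, age, and BMI fixed, the smoker gap is the smokeryes coefficient.
smoker_gap <- smoker_male - nonsmoker_male
```

Whatever age and BMI are chosen, `sex_gap` stays at the `sexmale` coefficient and `smoker_gap` stays at the `smokeryes` coefficient, which is what "after accounting for these variables" means in this model.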
The last model had a high R-squared value, but it is possible that there are still additional factors that we should account for. We will test this by adding the last two variables of the data set, children and region, to the analysis.
charges_full_model <- lm(charges ~ sex + age + bmi + smoker + region + children, data = insurance)
summary(charges_full_model)
##
## Call:
## lm(formula = charges ~ sex + age + bmi + smoker + region + children,
## data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11304.9 -2848.1 -982.1 1393.9 29992.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11938.5 987.8 -12.086 < 2e-16 ***
## sexmale -131.3 332.9 -0.394 0.693348
## age 256.9 11.9 21.587 < 2e-16 ***
## bmi 339.2 28.6 11.860 < 2e-16 ***
## smokeryes 23848.5 413.1 57.723 < 2e-16 ***
## regionnorthwest -353.0 476.3 -0.741 0.458769
## regionsoutheast -1035.0 478.7 -2.162 0.030782 *
## regionsouthwest -960.0 477.9 -2.009 0.044765 *
## children 475.5 137.8 3.451 0.000577 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7494
## F-statistic: 500.8 on 8 and 1329 DF, p-value: < 2.2e-16
Children is statistically significant (p = .000577), and two of the three region coefficients are as well, so both variables should be included. The region coefficients compare each region with the Northeast, which is the baseline and so has no row of its own. The Northwest coefficient is the only one that is not statistically significant (p = .459). Charges in the Southeast are $1,035 lower on average than in the Northeast, and charges in the Southwest are $960 lower. One possible explanation is a lower cost of living in these regions. Children has a positive effect on charges: each additional child increases the cost of insurance by $475.50 on average. This also makes sense, as children often need additional coverage that costs more.
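The output lists only three region rows because R, by default, treats the first level of a factor as the baseline and reports every other level relative to it. The sketch below shows this on a stand-alone factor, assuming the four region labels are northeast, northwest, southeast, and southwest; the object names are hypothetical.

```r
# Region labels as in the coefficient table, plus the one that has no row.
# Assumption: these four labels are the only values of the region variable.
region <- factor(c("southwest", "southeast", "northwest", "northeast"))

# Factor levels are sorted alphabetically, and the first one is the baseline.
baseline <- levels(region)[1]

# The design matrix gets an intercept plus one indicator column per non-baseline level.
design_columns <- colnames(model.matrix(~ region))

# Optional: pick a different baseline, e.g. to compare every region with the Southwest.
region_sw   <- relevel(region, ref = "southwest")
baseline_sw <- levels(region_sw)[1]
```

Changing the baseline with `relevel` changes which comparisons the region rows report, but not the overall fit of the model.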
After controlling for key health-related variables, the data shows no significant difference in insurance charges between men and women. This suggests that gender alone does not explain variations in charges. Instead, factors like smoking, BMI, and age are the primary determinants. Smoking has the largest effect, adding nearly $24,000 to the charges of a person who smokes. Other factors such as children and region also have an effect, though a smaller one.