The goal of this report is to analyze a data set named insurance in order to determine whether males are charged more than females when getting medical care. Below is the data set.
insurance
This data set comes from the website www.kaggle.com/datasets/mirichoi0218/insurance.
This data set contains 7 variables: age, sex, bmi, children, smoker, region, and charges.
Throughout this report, we will explore the relationships between the different variables and assess whether sex does or does not have a direct relationship with charges.
library(tidyverse)  # assumed setup: loads ggplot2 and dplyr (%>%)
ggplot(insurance) +
  geom_boxplot(aes(sex, charges))
Judging from the box plot of the two variables alone, nothing unusual stands out. The box for males is somewhat taller than the box for females, meaning the middle half of male charges is more spread out. It is also notable that the outliers for males are more concentrated at the highest charge values than the outliers for females.
insurance %>%
  group_by(sex) %>%
  summarize(mean(charges))
From the averages, we can see that females are charged roughly 12569.58 dollars and males roughly 13956.75 dollars on average, a difference of about 1387 dollars. In this sample, then, males are charged more on average. The averages alone do not show whether that gap is larger than what chance would produce, so more evidence is needed to back up the claim.
To get more insight into the relationship between these two variables, the next step is to fit a linear model of charges on sex. A linear model gives a predicted charge for each group along with test statistics, which will further our understanding of whether sex has a significant impact on charges.
insurance_model <- lm(charges ~ sex, data = insurance)
coef(insurance_model)
## (Intercept)     sexmale
##   12569.579    1387.172
library(modelr)  # assumed setup: provides add_predictions() and add_residuals()
insurance_w_pred_resids <- insurance %>%
  add_predictions(insurance_model) %>%
  add_residuals(insurance_model)
insurance_w_pred_resids
After fitting the linear model, the coefficients give us an equation relating the two variables. From the numbers above, the equation is:
y = 1387.172(x) + 12569.579
Here y is the predicted charge, and x is one if the participant is male and zero if the participant is female. From this equation alone, the model estimates that males are charged 1387.172 dollars more than females on average.
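Plugging both values of x into the equation above is a quick way to see that this model simply reproduces the two group averages. The short sketch below does that arithmetic; the variable names `b0`, `b1`, and `predicted` are introduced here for illustration only.

```r
b0 <- 12569.579   # intercept: predicted charges when x = 0 (female)
b1 <- 1387.172    # sexmale coefficient: added to the prediction when x = 1 (male)

x <- c(female = 0, male = 1)
predicted <- b0 + b1 * x
predicted
```

The male prediction works out to 12569.579 + 1387.172 = 13956.751, which agrees with the male average reported earlier.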
Looking at the residual column in the table, we can see that many of the residuals are very large, often thousands of dollars. A model that predicted charges well would have residuals close to zero.
summary(insurance_model)
##
## Call:
## lm(formula = charges ~ sex, data = insurance)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -12835  -8435  -3980   3476  51201
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  12569.6      470.1  26.740   <2e-16 ***
## sexmale       1387.2      661.3   2.098   0.0361 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12090 on 1336 degrees of freedom
## Multiple R-squared: 0.003282, Adjusted R-squared: 0.002536
## F-statistic: 4.4 on 1 and 1336 DF, p-value: 0.03613
Looking at the summary of the linear model, we are given statistics such as the p-value, the residual standard error (RSE), and the R^2 value, which will help us come to a conclusion. Starting with the RSE, the value given to us is 12090. The RSE is measured in dollars and describes the typical distance between an observed charge and the model's prediction. A value of 12090 is about as large as the group averages themselves (roughly 12570 and 13957 dollars), so the observed charges are not close to the fitted values. Knowing a person's sex therefore tells us very little about that person's charges, and the conclusion that males are charged more should be treated with caution.
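To make the meaning of the RSE concrete, the sketch below recomputes it by hand as the square root of the residual sum of squares divided by the residual degrees of freedom. It is only an illustration: it uses the built-in `mtcars` data as a stand-in, with the 0/1 variable `am` playing the role of sex, so the numbers have nothing to do with the insurance data.

```r
# Stand-in data: mtcars ships with R; `am` (0/1) plays the role of `sex`.
fit <- lm(mpg ~ am, data = mtcars)

rss <- sum(residuals(fit)^2)   # residual sum of squares
df  <- df.residual(fit)        # rows minus estimated coefficients (32 - 2 = 30)
rse_manual <- sqrt(rss / df)

rse_reported <- sigma(fit)     # the "Residual standard error" shown by summary(fit)
all.equal(rse_manual, rse_reported)
```

Because the RSE is in the same units as the response, it can be compared directly with typical values of the response, which is how the 12090 figure is judged above.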
Next, the adjusted R^2 value is 0.002536 (the multiple R^2 is 0.003282). Since this number is so close to zero, sex explains well under 1% of the variation in charges, and our model is only a little better than the mean model of the data set. This again tells us that sex on its own is a very weak predictor of charges. Finally, the p-value is 0.03613. This number is less than 0.05, meaning we have sufficient evidence at the 0.05 level that our model outperforms the mean model.
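The two R^2 values in the summary are linked by the standard adjustment for the number of predictors. As a rough check, the sketch below applies that formula to the reported multiple R^2; the sample size is inferred from the 1336 residual degrees of freedom plus the two estimated coefficients, and the variable names are introduced here for illustration.

```r
r2 <- 0.003282   # Multiple R-squared from the summary
n  <- 1336 + 2   # residual degrees of freedom plus the two estimated coefficients
p  <- 1          # number of predictors (sex)

adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 6)
```

With only one predictor and more than a thousand observations, the adjustment is tiny, which is why the two values are nearly identical here.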
After gathering all of these results, we cannot conclude with confidence that males get charged more than females. The averages, p-value, and equation lean toward that idea; however, the RSE and R^2 are measures of fit rather than tests, and both show that the model fits the data poorly. Putting all of that together, the conclusion should be treated with some distrust.
While sex may play a part in how much a person is charged, there may be confounding variables that affect the results. Age, for example, could drastically change the charges. Is there a significant impact of sex on charges when controlling for age?
insurance %>%
  group_by(sex, age) %>%
  summarize(mean(charges))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
Looking at the average charges for males and females at different ages, a rough comparison from the youngest to the oldest groups is:
Female, 18, 6522.258
Male, 18, 7603.181
Female, 64, 23493.178
Male, 60, 26262.186
From just these averages, older individuals appear to be charged more. Males also have the higher average in both pairs, although the second pair compares ages 64 and 60 and so is not a like-for-like comparison. With just the averages, it looks as though males are charged more overall.
In order to make a firmer claim, a linear regression model is needed.
insurance2_model <- lm(charges ~ sex + age, data = insurance)
summary(insurance2_model)
##
## Call:
## lm(formula = charges ~ sex + age, data = insurance)
##
## Residuals:
##    Min     1Q Median     3Q    Max
##  -8821  -6947  -5511   5443  48203
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  2343.62     994.35   2.357   0.0186 *
## sexmale      1538.83     631.08   2.438   0.0149 *
## age           258.87      22.47  11.523   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11540 on 1335 degrees of freedom
## Multiple R-squared: 0.09344, Adjusted R-squared: 0.09209
## F-statistic: 68.8 on 2 and 1335 DF, p-value: < 2.2e-16
insurance2_w_pred_resids <- insurance %>%
  add_predictions(insurance2_model) %>%
  add_residuals(insurance2_model)
insurance2_w_pred_resids
Using the same statistics as before, we can see whether there is a significant impact of sex on charges when controlling for age. The RSE of this model is 11540, which is still large; however, it is smaller than the previous RSE of 12090, meaning that this model fits a little better. The adjusted R^2 value is 0.09209. Since this value is bigger than the previous value of 0.002536, we can also agree that this model fits better, although it still explains only about 9% of the variation in charges. The p-value is less than 2.2e-16. Since this number is minuscule, we have significant evidence that our model outperforms the mean model.
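However, with this model we are also given an individual p-value for each coefficient. The p-value for age is less than 2e-16, and the p-value for sex is 0.0149. Both are below 0.05, so both variables are significant at that level, but the evidence for an effect of age is far stronger since its p-value is so much smaller than 0.0149. The sex coefficient estimates that, at the same age, males are charged about 1539 dollars more than females on average.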
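The individual p-value for sex can be reproduced from the estimate and standard error in the coefficient table, and the same two numbers give an approximate 95% confidence interval for the male/female difference. The sketch below is a hand calculation from the rounded values in the summary, so it will agree with R's own output only up to rounding; the variable names are introduced here for illustration.

```r
estimate <- 1538.83   # sexmale estimate from the summary above
std_err  <- 631.08    # its standard error
df       <- 1335      # residual degrees of freedom

t_value <- estimate / std_err
p_value <- 2 * pt(-abs(t_value), df = df)   # two-sided p-value

# Approximate 95% confidence interval for the male/female difference
ci <- estimate + c(-1, 1) * qt(0.975, df = df) * std_err

round(c(t = t_value, p = p_value), 4)
round(ci)
```

The interval excludes zero, consistent with the p-value below 0.05, but it is wide: it runs from roughly 300 to roughly 2,800 dollars, so the size of the difference is estimated quite imprecisely.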
The residuals of this model are also very far from zero, meaning this still is not a very reliable model.
After gathering all of this, I think there should still be a little distrust of the overall claim that men get charged more than women, although not as much distrust as in the previous example. Overall, we can conclude that we have significant evidence that men get charged more than women when controlling for age.
While sex and age could have an impact on how much an individual is charged, there could be other health factors that affect the overall bill. Looking at the data table, both BMI and whether the patient smokes could be indicators of health. Is there a significant impact of sex on charges when controlling for age, BMI, and whether or not the patient is a smoker?
insurance3_model <- lm(charges ~ sex + age + bmi + smoker, data = insurance)
insurance3_w_pred_resids <- insurance %>%
  add_predictions(insurance3_model) %>%
  add_residuals(insurance3_model)
insurance3_w_pred_resids
summary(insurance3_model)
##
## Call:
## lm(formula = charges ~ sex + age + bmi + smoker, data = insurance)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -12364.7  -2972.2   -983.2   1475.8  29018.3
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11633.49     947.27 -12.281   <2e-16 ***
## sexmale       -109.04     334.66  -0.326    0.745
## age            259.45      11.94  21.727   <2e-16 ***
## bmi            323.05      27.53  11.735   <2e-16 ***
## smokeryes    23833.87     414.19  57.544   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6094 on 1333 degrees of freedom
## Multiple R-squared: 0.7475, Adjusted R-squared: 0.7467
## F-statistic: 986.5 on 4 and 1333 DF, p-value: < 2.2e-16
Looking at the statistics of the overall model, the RSE is 6094, roughly half of the previous model's 11540, meaning this model's predictions are much closer to the observed charges. The adjusted R^2 value is 0.7467, which is much larger than before and a lot closer to one, so it also indicates that this model fits better. The overall p-value is again less than 2.2e-16, so we still have significant evidence that our model is better than the mean model.
This model is the most reliable thus far. When we look at the individual p-values of the regression coefficients, age, bmi, and being a smoker all have a significant impact on charges. For sex, however, the p-value is 0.745, and using the 0.05 cutoff, this means sex does not have a significant impact on charges once age, bmi, and smoking are controlled for. The estimated coefficient for sex even changes sign, from 1538.83 in the previous model to -109.04 in this one.
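One way a coefficient can shrink toward zero like this is when an added variable is related to both sex and charges. The sketch below illustrates that mechanism with purely synthetic data: the sample size, smoking rates, and dollar amounts are invented for illustration and are not estimates from the insurance data. In the simulation, sex has no direct effect on charges, but smoking is more common among males.

```r
set.seed(1)
n <- 2000

# Synthetic data (all values hypothetical, not taken from the insurance data)
male    <- rbinom(n, 1, 0.5)
smoker  <- rbinom(n, 1, ifelse(male == 1, 0.30, 0.15))  # smoking depends on sex
charges <- 8000 + 24000 * smoker + rnorm(n, sd = 6000)  # sex has no direct effect

naive    <- lm(charges ~ male)            # sex only
adjusted <- lm(charges ~ male + smoker)   # sex, controlling for smoking

c(naive = coef(naive)[["male"]], adjusted = coef(adjusted)[["male"]])
```

In the sex-only model, the male coefficient picks up the cost of the extra smokers in that group; once smoker is included, the coefficient falls back toward zero. Whether something similar explains the insurance results would need to be checked against the data itself, for example by comparing smoking rates by sex.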
Almost all of the variables have now been included alongside charges. There are two variables left that we can test for significance: children and region.
insurance4_model <- lm(charges ~ sex + age + bmi + smoker + children + region, data = insurance)
insurance4_w_pred_resids <- insurance %>%
  add_predictions(insurance4_model) %>%
  add_residuals(insurance4_model)
insurance4_w_pred_resids
summary(insurance4_model)
##
## Call:
## lm(formula = charges ~ sex + age + bmi + smoker + children +
##     region, data = insurance)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -11304.9  -2848.1   -982.1   1393.9  29992.8
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)     -11938.5      987.8 -12.086  < 2e-16 ***
## sexmale           -131.3      332.9  -0.394 0.693348
## age                256.9       11.9  21.587  < 2e-16 ***
## bmi                339.2       28.6  11.860  < 2e-16 ***
## smokeryes        23848.5      413.1  57.723  < 2e-16 ***
## children           475.5      137.8   3.451 0.000577 ***
## regionnorthwest   -353.0      476.3  -0.741 0.458769
## regionsoutheast  -1035.0      478.7  -2.162 0.030782 *
## regionsouthwest   -960.0      477.9  -2.009 0.044765 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7494
## F-statistic: 500.8 on 8 and 1329 DF, p-value: < 2.2e-16
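From the individual p-values, region is only weakly significant: two of the three region coefficients (southeast and southwest) fall just under 0.05 and northwest does not, which is arguably not enough to make region worth keeping in the model. The number of children, on the other hand, is clearly significant because its p-value of 0.000577 is far below 0.05.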
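The individual p-values judge one coefficient at a time. A rough way to ask whether children and region together improve on the previous model is a partial F-test, which can be approximated from the RSE and degrees of freedom printed in the two summaries. The sketch below is a hand calculation under that assumption, so it inherits the rounding of the reported RSE values; with the fitted models in memory, anova(insurance3_model, insurance4_model) would give the exact comparison.

```r
# Values copied from the two model summaries
rse_small <- 6094; df_small <- 1333   # charges ~ sex + age + bmi + smoker
rse_full  <- 6062; df_full  <- 1329   # same model plus children and region

rss_small <- rse_small^2 * df_small   # RSE^2 = RSS / df, so RSS = RSE^2 * df
rss_full  <- rse_full^2  * df_full

extra_terms <- df_small - df_full     # 4 added coefficients
f_stat  <- ((rss_small - rss_full) / extra_terms) / (rss_full / df_full)
p_value <- pf(f_stat, extra_terms, df_full, lower.tail = FALSE)

c(F = f_stat, p = p_value)
```

The F statistic comes out at roughly 4.5 on 4 and 1329 degrees of freedom, which is significant at the 0.05 level, so the two added variables jointly improve the fit even though the gain in adjusted R^2 (0.7467 to 0.7494) is small.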
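After including all of the variables, it is fair to say the ones that have the greatest impact on charges are age, bmi, children, and being a smoker. Adding children and region barely changes the fit, however: the adjusted R^2 rises only from 0.7467 to 0.7494.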
The leading question has been: does sex have an impact on how much somebody is charged? After looking at the different models and comparing sex to the other variables, I think the answer is no, sex does not have a notable impact on how much somebody is charged. There may be a slight difference; however, it is not statistically significant once the other variables are included, and it is small compared with their impact on charges.
In conclusion, when age, bmi, smoking, number of children, and region are controlled for, sex does not have a statistically significant impact on how much somebody is charged.