This report will utilize linear modeling to investigate the impact of sex on insurance cost. Specifically, it will seek to answer the question: Is the cost of insurance significantly different between males and females?
library(tidyverse)
library(readr)
library(DT)
insurance <- read_csv("insurance.csv")
datatable(insurance)
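tidyverse: provides a collection of data-manipulation functions used throughout this project
readr: allows the data set to be imported from a CSV file
DT: provides datatable(), used to display the data set as an interactive table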
This data set provides the cost of medical insurance along with several variables collected by the insurance agency.
7 variables with 1338 observations
From: https://www.kaggle.com/datasets/mirichoi0218/insurance
age: age of primary beneficiary
sex: sex of primary beneficiary
bmi: Body Mass Index, a measure of body weight relative to height, used to gauge whether an individual’s weight is healthy
children: number of dependent children on insurance plan
smoker: primary beneficiary’s smoking status (yes or no)
region: region of the U.S. where the primary beneficiary lives (northeast, northwest, southeast, or southwest)
charges: medical costs billed to the primary beneficiary by the insurance contractor
The model below compares the average insurance charges for men and women based solely upon their sex.
insurance_sex_lm <- lm(charges ~ sex, data = insurance)
summary(insurance_sex_lm)
##
## Call:
## lm(formula = charges ~ sex, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12835 -8435 -3980 3476 51201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12569.6 470.1 26.740 <2e-16 ***
## sexmale 1387.2 661.3 2.098 0.0361 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12090 on 1336 degrees of freedom
## Multiple R-squared: 0.003282, Adjusted R-squared: 0.002536
## F-statistic: 4.4 on 1 and 1336 DF, p-value: 0.03613
Average charge, female = 12569.6; average charge, male = 12569.6 + 1387.2 = 13956.8
Based on this comparison, males are charged about $1400 more than females.
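With a single two-level predictor, the intercept is the mean of the reference group (female) and the intercept plus the `sexmale` coefficient is the mean of the other group. A minimal sketch of that arithmetic, using the coefficients printed in the summary above; the variable names are illustrative, and the commented-out check assumes the `insurance` data frame read in earlier.

```r
# Coefficients copied from the summary output above
intercept   <- 12569.6  # mean charge for the reference level (female)
sexmale_est <- 1387.2   # estimated male-minus-female difference

female_mean <- intercept
male_mean   <- intercept + sexmale_est

print(c(female = female_mean, male = male_mean))

# With the data loaded, the same group means could be checked directly
# (assumes `insurance` exists as read in earlier):
# aggregate(charges ~ sex, data = insurance, FUN = mean)
```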
This model is significantly different from the mean (intercept-only) model (p = 0.03613); however, the adjusted R2 is extremely low (adjusted R2 = 0.002536), meaning that sex alone explains well under 1% of the variation in charges. The model is therefore not very reliable and leaves considerable room for improvement.
Several variables in the data set may be confounded with sex and may be affecting the apparent significance of sex in the model. For example, age, smoker status, and BMI are all known to have significant negative impacts on an individual’s health and may be linked to an individual’s sex.
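Before adding these variables to the model, it can help to look at how they are distributed across the sexes. The sketch below is one way to do that in base R; `confounders_by_sex()` is a hypothetical helper written for this illustration, and it assumes the column names listed in the variable descriptions above (`sex`, `age`, `bmi`, and `smoker` coded as "yes"/"no").

```r
# Hypothetical helper: summarise the suspected confounders separately by sex.
# Assumes columns named sex, age, bmi, and smoker ("yes"/"no").
confounders_by_sex <- function(data) {
  mean_age    <- tapply(data$age, data$sex, mean)
  mean_bmi    <- tapply(data$bmi, data$sex, mean)
  prop_smoker <- tapply(data$smoker == "yes", data$sex, mean)

  data.frame(
    sex         = names(mean_age),
    mean_age    = as.numeric(mean_age),
    mean_bmi    = as.numeric(mean_bmi),
    prop_smoker = as.numeric(prop_smoker)
  )
}

# Only runs if the data set has been loaded as shown earlier
if (exists("insurance")) {
  print(confounders_by_sex(insurance))
}
```

A clear difference between the rows of this table would suggest that the variable in question could be carrying part of the apparent effect of sex.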
Age has negative effects on overall health. As people get older, they typically run into more health complications than in their younger years. If the men who seek health care are on average older than the women, age could be driving the charge, making it appear that men are charged more than women. To investigate this concern, the model below considers the effects of both sex and age on an individual’s cost of insurance.
insurance_sex_age_lm <- lm(charges ~ sex + age, data = insurance)
summary(insurance_sex_age_lm)
##
## Call:
## lm(formula = charges ~ sex + age, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8821 -6947 -5511 5443 48203
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2343.62 994.35 2.357 0.0186 *
## sexmale 1538.83 631.08 2.438 0.0149 *
## age 258.87 22.47 11.523 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11540 on 1335 degrees of freedom
## Multiple R-squared: 0.09344, Adjusted R-squared: 0.09209
## F-statistic: 68.8 on 2 and 1335 DF, p-value: < 2.2e-16
This model indicates that men are charged about $1500 more for health insurance than women and that each additional year of age adds about $260 to the charge. The effects of both sex (p = 0.0149) and age (p < 2e-16) are significant. Overall, the model is a better fit for the data set than the previous model: the RSE has decreased from 12090 to 11540, and the p-value compared to the mean model has decreased from 0.03613 to < 2.2e-16. That being said, the adjusted R2, while higher than in the previous model (0.09209 compared to 0.002536), is still low and does not suggest a reliable model. This indicates that there are likely other variables affecting the insurance charge that are unaccounted for in this model.
An individual’s BMI and smoker status are known to negatively impact an individual’s health. People who smoke or are overweight are typically in poorer health, meaning they are more likely to receive higher insurance charges. In this data set, males may be more inclined to smoke or may have higher BMIs than females, which may affect the significance of the cost difference seen between the two sexes.
The following model investigates the impact of sex, age, BMI, and smoker status on insurance charges.
insurance_sex_age_health_lm <- lm(charges ~ sex + age + bmi + smoker, data = insurance)
summary(insurance_sex_age_health_lm)
##
## Call:
## lm(formula = charges ~ sex + age + bmi + smoker, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12364.7 -2972.2 -983.2 1475.8 29018.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11633.49 947.27 -12.281 <2e-16 ***
## sexmale -109.04 334.66 -0.326 0.745
## age 259.45 11.94 21.727 <2e-16 ***
## bmi 323.05 27.53 11.735 <2e-16 ***
## smokeryes 23833.87 414.19 57.544 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6094 on 1333 degrees of freedom
## Multiple R-squared: 0.7475, Adjusted R-squared: 0.7467
## F-statistic: 986.5 on 4 and 1333 DF, p-value: < 2.2e-16
This model reveals that sex does not significantly impact insurance charges (p = 0.745). Rather, an individual’s age (p < 2e-16), BMI (p < 2e-16), and smoker status (p < 2e-16) significantly increase the charges they receive. Holding the other variables constant, the model suggests that every one-unit increase in BMI raises a person’s insurance cost by about $320. Additionally, every additional year of age raises the cost by about $260. Lastly, being a smoker increases the cost of insurance by roughly $23830.
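Unfortunately, the value of the intercept (-11633.49) does not appear to have a real-world application. Insurance costs money; it does not give you money, as the value of the intercept suggests. The intercept is the predicted charge for a non-smoking female with an age and BMI of 0, a combination that cannot occur.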
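Plugging realistic values into the fitted equation shows that the negative intercept does not lead to negative predictions for plausible people. The sketch below uses the coefficients printed in the summary above; `predict_charge()` is a hypothetical helper, and the example person (a 40-year-old non-smoking female with a BMI of 30) is made up purely for illustration.

```r
# Coefficients copied from the summary output above
coefs <- c(
  intercept = -11633.49,
  sexmale   = -109.04,
  age       = 259.45,
  bmi       = 323.05,
  smokeryes = 23833.87
)

# Hypothetical helper: `male` and `smoker` are 0/1 indicators
predict_charge <- function(male, age, bmi, smoker) {
  unname(
    coefs["intercept"] +
      coefs["sexmale"]   * male +
      coefs["age"]       * age +
      coefs["bmi"]       * bmi +
      coefs["smokeryes"] * smoker
  )
}

# The intercept is the prediction when every predictor is 0
baseline <- predict_charge(male = 0, age = 0, bmi = 0, smoker = 0)

# Illustrative (made-up) person: 40-year-old non-smoking female, BMI 30
example <- predict_charge(male = 0, age = 40, bmi = 30, smoker = 0)

print(c(baseline = baseline, example = example))
```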
The RSE is nearly halved from the previous model (RSE = 6094 compared to RSE = 11540), indicating that this model is more accurate than the previous one. Furthermore, the adjusted R2 of this model is 0.7467, up from the previous model’s 0.09209. This indicates that the model considering age, BMI, smoker status, and sex is more reliable than the model considering only age and sex.
This model is once again significantly different from the mean model (p < 2.2e-16). Unfortunately, R does not print p-values below 2.2e-16, so the p-values of this model and the previous one cannot be compared directly from the output. Based on the other statistics mentioned earlier for this model, I would assume that the p-value has also been reduced from the previous model, indicating a more significant difference from the mean model, although this cannot be said for certain from the printed output.
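The printed `< 2.2e-16` is a display cutoff rather than a limit on what R can compute. One way to compare the two overall F-tests anyway is to work on the log scale with base R's `pf()`. This is a minimal sketch, assuming the F-statistics and degrees of freedom printed in the two summaries above; the variable names are illustrative.

```r
# Overall F-tests copied from the summaries above:
#   sex + age model:              F = 68.8  on 2 and 1335 DF
#   sex + age + bmi + smoker:     F = 986.5 on 4 and 1333 DF

# log.p = TRUE returns the natural log of the upper-tail p-value,
# which avoids underflow for extremely small p-values
log_p_sex_age <- pf(68.8, df1 = 2, df2 = 1335,
                    lower.tail = FALSE, log.p = TRUE)
log_p_health  <- pf(986.5, df1 = 4, df2 = 1333,
                    lower.tail = FALSE, log.p = TRUE)

# Convert to log10 so the values read as powers of ten
print(c(sex_age = log_p_sex_age, health = log_p_health) / log(10))
```

A more negative value corresponds to a smaller p-value, so the two models can be ranked even though both print as `< 2.2e-16`.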
The model below considers the effects of sex, age, BMI, smoker status, number of children, and the region a person lives in on insurance charges.
insurance_total_model <- lm(charges ~ sex + age + bmi + smoker + children + region, data = insurance)
summary(insurance_total_model)
##
## Call:
## lm(formula = charges ~ sex + age + bmi + smoker + children +
## region, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11304.9 -2848.1 -982.1 1393.9 29992.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11938.5 987.8 -12.086 < 2e-16 ***
## sexmale -131.3 332.9 -0.394 0.693348
## age 256.9 11.9 21.587 < 2e-16 ***
## bmi 339.2 28.6 11.860 < 2e-16 ***
## smokeryes 23848.5 413.1 57.723 < 2e-16 ***
## children 475.5 137.8 3.451 0.000577 ***
## regionnorthwest -353.0 476.3 -0.741 0.458769
## regionsoutheast -1035.0 478.7 -2.162 0.030782 *
## regionsouthwest -960.0 477.9 -2.009 0.044765 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7494
## F-statistic: 500.8 on 8 and 1329 DF, p-value: < 2.2e-16
This model is slightly more reliable than the previous model: the adjusted R2 of 0.7494 is slightly higher, and the RSE of 6062 is slightly lower.
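The RSE and adjusted R2 values compared throughout this report can be collected into one table instead of being read off each summary. The sketch below is one possible way to do that; `fit_table()` is a hypothetical helper, and the usage lines assume the four model objects fitted earlier in this report.

```r
# Hypothetical helper: residual standard error and adjusted R-squared
# for a named list of fitted lm objects
fit_table <- function(models) {
  data.frame(
    model     = names(models),
    rse       = sapply(models, sigma),
    adj_r2    = sapply(models, function(m) summary(m)$adj.r.squared),
    row.names = NULL
  )
}

# Only runs if the four models from this report have been fitted
model_names <- c("insurance_sex_lm", "insurance_sex_age_lm",
                 "insurance_sex_age_health_lm", "insurance_total_model")
if (all(sapply(model_names, exists))) {
  print(fit_table(list(
    sex            = insurance_sex_lm,
    sex_age        = insurance_sex_age_lm,
    sex_age_health = insurance_sex_age_health_lm,
    total          = insurance_total_model
  )))
}
```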
This model further confirms that BMI (p < 2e-16), smoker status (p < 2e-16), and age (p < 2e-16) all significantly increase the insurance charges an individual receives and that sex (p = 0.6933) does not have a significant impact on insurance cost. Furthermore, the model highlights number of children (p = 0.0006) as significantly increasing the cost of insurance. The model suggests that for every year older an individual gets, their insurance cost increases by roughly $260. Similarly, for every one-unit increase in BMI, an individual’s insurance cost increases by roughly $340. For every child a person has, their insurance cost increases by about $475. Lastly, being a smoker increases an individual’s insurance cost by roughly $23850. Once again, the value of the intercept (-11938.5) does not appear to have a real-world application.
It should be noted, though, that the reference level now combines the baseline categories of three categorical variables (sexfemale, smokerno, and regionnortheast), so it is difficult to draw real conclusions from the value of the intercept. It should also be noted that the model does not give a clear answer on whether the region an individual lives in affects their charges. Relative to the northeast, the northwest does not differ significantly (p = 0.459), while the southeast (p = 0.031) and southwest (p = 0.045) are significantly lower at the 0.05 level. Because the p-value varies by region, it is difficult to draw definite conclusions.
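Because `region` enters the model as three separate dummy coefficients, one way to judge it as a single variable is a joint F-test comparing the model with and without `region`. This is a sketch of that approach, not part of the original analysis; `region_f_test()` is a hypothetical helper and assumes a data frame with the column names used in this report.

```r
# Hypothetical helper: joint F-test for region, comparing nested models
# fitted with and without the region term
region_f_test <- function(data) {
  reduced <- lm(charges ~ sex + age + bmi + smoker + children,
                data = data)
  full    <- lm(charges ~ sex + age + bmi + smoker + children + region,
                data = data)
  anova(reduced, full)
}

# Only runs if the data set has been loaded as shown earlier
if (exists("insurance")) {
  print(region_f_test(insurance))
}
```

The second row of the resulting table reports a single F-statistic and p-value for all region levels together, which avoids having to interpret each region's p-value separately.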
Based on the model that considers all variables within the data set, which appears to be the best model based on RSE and R2 values, a person’s sex does not have a significant impact on the cost of health insurance. Males are not charged significantly more than females once BMI, smoker status, age, and number of children are considered; all of these are confounding variables that significantly impact the price of insurance.