library(tidyverse)  # attaches readr, dplyr, and ggplot2
library(modelr)
library(gridExtra)
insurance <- read_csv("insurance.csv")
The purpose of this report is to determine whether men are charged
significantly more for health insurance than women, using a real-world
dataset containing 1,338 observations and seven variables related to
demographics, lifestyle, and health. The data include age, sex, BMI,
number of dependent children, smoking status, region, and annual medical
charges. These variables allow us to compare insurance costs between men
and women and to examine whether any observed differences remain after
accounting for confounding factors such as age, BMI, and smoking. By
building and interpreting a series of linear regression models, this
analysis aims to provide a clear, evidence-based answer to the central
question of whether sex independently predicts higher insurance
charges.
insurance %>%
head(10)
## # A tibble: 10 × 7
## age sex bmi children smoker region charges
## <dbl> <chr> <dbl> <dbl> <chr> <chr> <dbl>
## 1 19 female 27.9 0 yes southwest 16885.
## 2 18 male 33.8 1 no southeast 1726.
## 3 28 male 33 3 no southeast 4449.
## 4 33 male 22.7 0 no northwest 21984.
## 5 32 male 28.9 0 no northwest 3867.
## 6 31 female 25.7 0 no southeast 3757.
## 7 46 female 33.4 1 no southeast 8241.
## 8 37 female 27.7 3 no northwest 7282.
## 9 37 male 29.8 2 no northeast 6406.
## 10 60 female 25.8 0 no northwest 28923.
To begin, I compared the average insurance charges for males and
females using a simple summary of the data:
gender_insurance <- insurance %>%
group_by(sex) %>%
summarize(mean_charges = mean(charges))
gender_insurance
## # A tibble: 2 × 2
## sex mean_charges
## <chr> <dbl>
## 1 female 12570.
## 2 male 13957.
Men have a higher average insurance charge than women, but this
simple comparison alone doesn’t prove that men are charged more because
of their gender. Other factors in the dataset (like age, BMI, or smoking
status) could be influencing the difference.
ggplot(insurance, aes(x = sex, y = charges)) +
geom_boxplot(fill = c("pink", "lightblue")) +
labs(title = "Insurance Charges by Sex",
x = "Sex",
y = "Charges")

The boxplot shows that males tend to have higher insurance charges
on average, but there is a lot of variation within each sex. This
suggests other factors, such as age, BMI, or smoking, may explain the
differences.
To determine whether men are charged more for health insurance than
women, I ran a simple linear regression with charges as the response
variable and sex as the predictor.
gender_insurance_model <- lm(charges ~ sex, data = insurance)
summary(gender_insurance_model)
##
## Call:
## lm(formula = charges ~ sex, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12835 -8435 -3980 3476 51201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12569.6 470.1 26.740 <2e-16 ***
## sexmale 1387.2 661.3 2.098 0.0361 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12090 on 1336 degrees of freedom
## Multiple R-squared: 0.003282, Adjusted R-squared: 0.002536
## F-statistic: 4.4 on 1 and 1336 DF, p-value: 0.03613
The regression model estimates that the predicted average insurance
charge for females is $12,569.6. The model also estimates that males are
charged $1,387.2 more than females on average, giving a predicted
average of $13,956.8 for males. This difference is statistically
significant (p = 0.036). However, the model does a poor job of
explaining variation in charges. The residual standard error is very
large, about 12,090, and the adjusted R-squared is extremely small
(0.0025), meaning sex alone explains almost none of the variation in
charges. In a one-predictor model the F-statistic (4.4) is simply the
square of the t value for sex, so it adds no new information. Overall,
while the difference is statistically significant, this simple model
provides limited evidence that gender alone determines insurance
charges.
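With a single two-level predictor, the coefficients map directly onto
the two group means, and a 95% confidence interval for the male-female
difference follows from the estimate and its standard error. The sketch
below reproduces that arithmetic from the printed summary values only;
the variable names are illustrative and not part of the original
analysis.

```r
# Values copied from the summary(gender_insurance_model) output above.
intercept <- 12569.6   # predicted mean charge for females
b_male    <- 1387.2    # estimated male - female difference
se_male   <- 661.3     # standard error of that difference
df_resid  <- 1336      # residual degrees of freedom

# Predicted mean for males is the intercept plus the sexmale coefficient.
pred_male <- intercept + b_male

# t statistic and two-sided 95% confidence interval for the difference.
t_value <- b_male / se_male
t_crit  <- qt(0.975, df_resid)
ci_male <- b_male + c(-1, 1) * t_crit * se_male

print(pred_male)
print(round(t_value, 3))
print(round(ci_male, 1))
```

On the fitted object itself, `confint(gender_insurance_model)` should
give the same interval up to rounding. The lower bound is positive but
small relative to the estimate, which matches the borderline p-value.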
There are reasons to be cautious about this result. The model only
includes sex as a predictor, but insurance charges are influenced by
other factors such as age, BMI, smoking status, number of children, and
region. The low adjusted R-squared and large residual standard error
indicate that individual charges vary widely around the predicted
values, and the statistical significance may be partly due to the large
sample size rather than a meaningful effect. Therefore, this simple
model may give a misleading impression if other variables are not
considered.
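One way to separate statistical significance from practical importance
is to express the difference relative to the spread of the data. In a
regression with one two-level predictor, the residual standard error is
the pooled within-group standard deviation, so dividing the coefficient
by it gives a standardized difference (Cohen's d). This is a rough
sketch from the printed values, not a calculation taken from the
original analysis.

```r
b_male <- 1387.2   # male - female difference in mean charges
rse    <- 12090    # residual standard error (pooled within-group SD)

# Standardized mean difference between the two sexes.
cohens_d <- b_male / rse
print(round(cohens_d, 3))
```

The result is well below 0.2, the value conventionally labeled a small
effect, which supports reading the significant p-value as a product of
the large sample.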
One important variable to consider is age, which could be a
confounding factor. For age to confound the comparison, it must be
related to both sex and insurance charges. Older individuals tend to
have higher health risks, leading to higher charges, and the age
distribution may differ between men and women in the dataset. If men
are, on average, older than women, part of the observed difference in
charges could be due to age rather than gender. Controlling for age is
therefore necessary to determine whether the difference in insurance
charges is truly associated with sex.
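Whether the age distribution actually differs by sex can be checked
directly before fitting the model. The sketch below uses only the ten
rows printed by `head(10)` earlier, retyped by hand into a data frame
called `preview` (a hypothetical name), and base R rather than dplyr so
that it runs on its own. Ten rows are far too few to be representative;
in the report the same call would be run on `insurance`.

```r
# The ten rows shown by head(10), age and sex columns only.
preview <- data.frame(
  age = c(19, 18, 28, 33, 32, 31, 46, 37, 37, 60),
  sex = c("female", "male", "male", "male", "male",
          "female", "female", "female", "male", "female")
)

# Mean age and number of rows for each sex.
mean_age <- tapply(preview$age, preview$sex, mean)
n_by_sex <- table(preview$sex)

print(mean_age)
print(n_by_sex)
```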
To better understand whether the difference in insurance charges
between males and females is truly due to gender, I re-ran the
regression model while controlling for age. This allows us to account
for the fact that age affects insurance costs and may differ between men
and women, providing a clearer comparison of charges by sex.
gender_insurance_model_1 <- lm(charges ~ sex + age, data = insurance)
summary(gender_insurance_model_1)
##
## Call:
## lm(formula = charges ~ sex + age, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8821 -6947 -5511 5443 48203
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2343.62 994.35 2.357 0.0186 *
## sexmale 1538.83 631.08 2.438 0.0149 *
## age 258.87 22.47 11.523 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11540 on 1335 degrees of freedom
## Multiple R-squared: 0.09344, Adjusted R-squared: 0.09209
## F-statistic: 68.8 on 2 and 1335 DF, p-value: < 2.2e-16
After controlling for age, the regression model estimates that men
are charged $1,538.83 more on average than women, holding age constant.
This difference is statistically significant (p = 0.0149) and slightly
larger than the $1,387.2 estimate from the sex-only model, so age does
not explain away the gap: males still have higher insurance charges even
when accounting for age. The model also shows that insurance charges
increase by about
$258.87 for each additional year of age, confirming that age is an
important predictor of costs. The residual standard error is about
11,540, and the adjusted R-squared is 0.0921, meaning the model explains
roughly 9.2% of the variation in charges.
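Comparing the two models also shows how age differs between the sexes
in this sample. For least-squares fits, the sex coefficient without age
equals the sex coefficient with age plus the age slope times the
male-female difference in mean age. The sketch below rearranges that
identity using the printed coefficients; the result is approximate
because the printed values are rounded, and the variable names are
illustrative.

```r
b_sex_short <- 1387.2    # sexmale in charges ~ sex
b_sex_long  <- 1538.83   # sexmale in charges ~ sex + age
b_age       <- 258.87    # age slope in charges ~ sex + age

# b_sex_short = b_sex_long + b_age * (mean age of men - mean age of women)
age_gap <- (b_sex_short - b_sex_long) / b_age
print(round(age_gap, 2))
```

A negative value means the men in the sample are slightly younger than
the women on average, which is why adjusting for age moves the sex
estimate up rather than down.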
While this model improves on the simple sex-only comparison, the
adjusted R-squared is still relatively low, meaning it explains only a
small portion of the variation in charges. Other factors, such as BMI,
smoking status, or number of children, may still influence insurance
costs, so the comparison could be further strengthened by including
these additional variables. Despite the statistical significance, the
model provides limited evidence that gender alone determines insurance
charges.
To better account for individual health differences that influence
insurance charges, I re-ran the regression model including BMI and
smoking status along with age. These two variables are strong indicators
of a person’s health and are likely to explain a significant portion of
the variation in insurance costs.
gender_insurance_model_2 <- lm(charges ~ sex + age + smoker + bmi, data = insurance)
summary(gender_insurance_model_2)
##
## Call:
## lm(formula = charges ~ sex + age + smoker + bmi, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12364.7 -2972.2 -983.2 1475.8 29018.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11633.49 947.27 -12.281 <2e-16 ***
## sexmale -109.04 334.66 -0.326 0.745
## age 259.45 11.94 21.727 <2e-16 ***
## smokeryes 23833.87 414.19 57.544 <2e-16 ***
## bmi 323.05 27.53 11.735 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6094 on 1333 degrees of freedom
## Multiple R-squared: 0.7475, Adjusted R-squared: 0.7467
## F-statistic: 986.5 on 4 and 1333 DF, p-value: < 2.2e-16
ggplot(insurance, aes(x = bmi, y = charges)) +
geom_point(alpha = 0.5, size = 1.5) +
geom_smooth(method = "lm", color = "red", se = FALSE) +
labs(title = "Insurance Charges vs BMI with Trendline",
x = "BMI",
y = "Charges")
## `geom_smooth()` using formula = 'y ~ x'

As BMI increases, insurance charges generally increase, showing a
positive relationship. Although the scatterplot is busy with many
points, the overall trend is upward, indicating that higher BMI is
associated with higher insurance costs. Health factors like BMI are
therefore important predictors of charges, and including them in the
model helps explain much of the variation in costs, reducing the
apparent effect of gender.
After controlling for age, BMI, and smoking, the regression model
estimates that men are charged about $109.04 less than women on average,
holding all other factors constant. This difference is not statistically
significant (p = 0.745), indicating that once health-related factors are
accounted for, there is little evidence that males pay more than
females. The model shows that charges increase by about $259.45 for each
additional year of age and by $323.05 for each unit increase in BMI, and
that smokers are charged about $23,833.87 more than non-smokers,
highlighting that smoking and BMI are strong predictors of insurance
costs. The residual
standard error is 6,094, and the adjusted R-squared is 0.7467, meaning
the model explains roughly 74.7% of the variation in charges.
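To make the coefficients concrete, the fitted equation can be applied
by hand to a few example profiles. The sketch below copies the printed
coefficients into a vector and defines a small helper, `predict_charge`
(a hypothetical name, not part of the original analysis). The profile of
a 40-year-old with a BMI of 30 is chosen purely for illustration.

```r
# Coefficients copied from summary(gender_insurance_model_2) above.
coefs <- c(intercept = -11633.49, sexmale = -109.04, age = 259.45,
           smokeryes = 23833.87, bmi = 323.05)

# Predicted annual charge for one person (hypothetical helper).
predict_charge <- function(male, age, smoker, bmi) {
  unname(coefs["intercept"] +
           coefs["sexmale"]   * as.numeric(male) +
           coefs["age"]       * age +
           coefs["smokeryes"] * as.numeric(smoker) +
           coefs["bmi"]       * bmi)
}

# Illustrative profiles: 40 years old, BMI 30.
female_nonsmoker <- predict_charge(male = FALSE, age = 40, smoker = FALSE, bmi = 30)
male_nonsmoker   <- predict_charge(male = TRUE,  age = 40, smoker = FALSE, bmi = 30)
female_smoker    <- predict_charge(male = FALSE, age = 40, smoker = TRUE,  bmi = 30)

print(c(female_nonsmoker, male_nonsmoker, female_smoker))
```

Switching sex changes the prediction by about a hundred dollars, while
switching smoking status changes it by almost $24,000. With the fitted
object available, `predict(gender_insurance_model_2, newdata = ...)`
would give the same values without the rounding in the printed
coefficients.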
This comparison is much more reliable than previous models because
it accounts for important confounding variables like age, BMI, and
smoking status. Since the estimated difference between males and females
is very small and not statistically significant, it suggests that gender
alone is not a strong predictor of insurance charges once health factors
are considered. While no model can capture every possible factor, this
analysis provides strong evidence that the previously observed
differences were largely due to differences in health rather than
gender.
To ensure that all relevant factors are considered, I expanded the
regression model to include the remaining variables in the dataset:
region and number of children, along with sex, age, BMI, and smoking
status. This allows us to determine whether these additional variables
have a significant impact on insurance charges.
gender_insurance_model_3 <- lm(charges ~ sex + age + smoker + bmi + region + children, data = insurance)
summary(gender_insurance_model_3)
##
## Call:
## lm(formula = charges ~ sex + age + smoker + bmi + region + children,
## data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11304.9 -2848.1 -982.1 1393.9 29992.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11938.5 987.8 -12.086 < 2e-16 ***
## sexmale -131.3 332.9 -0.394 0.693348
## age 256.9 11.9 21.587 < 2e-16 ***
## smokeryes 23848.5 413.1 57.723 < 2e-16 ***
## bmi 339.2 28.6 11.860 < 2e-16 ***
## regionnorthwest -353.0 476.3 -0.741 0.458769
## regionsoutheast -1035.0 478.7 -2.162 0.030782 *
## regionsouthwest -960.0 477.9 -2.009 0.044765 *
## children 475.5 137.8 3.451 0.000577 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7494
## F-statistic: 500.8 on 8 and 1329 DF, p-value: < 2.2e-16
In the full model, which includes sex, age, BMI, smoking status,
region, and number of children, region has a small effect: estimated
charges are lower in the northwest (-$353.0, p = 0.459), southeast
(-$1,035.0, p = 0.031), and southwest (-$960.0, p = 0.045) than in the
baseline region (northeast), but only the southeast and southwest
differences are statistically significant. The number of children has a
significant positive effect on charges, with each additional child
associated with $475.5 higher charges on average (p < 0.001). The
estimated
difference between males and females remains small and not statistically
significant (-$131.3, p = 0.693). The residual standard error is 6,062,
and the adjusted R-squared is 0.7494, meaning the model explains roughly
74.9% of the variation in charges.
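Whether region and number of children jointly improve on the previous
model can be tested with a nested-model F-test. The sketch below
approximates that test from the two printed R-squared values, so the
result is only as precise as their rounding; the variable names are
illustrative.

```r
r2_reduced <- 0.7475   # charges ~ sex + age + smoker + bmi
r2_full    <- 0.7509   # the same model plus region and children
q          <- 4        # added coefficients: 3 region dummies + children
df_full    <- 1329     # residual degrees of freedom of the full model

# F statistic for the joint contribution of the added terms.
f_stat  <- ((r2_full - r2_reduced) / q) / ((1 - r2_full) / df_full)
p_value <- pf(f_stat, q, df_full, lower.tail = FALSE)

print(round(f_stat, 2))
print(signif(p_value, 2))
```

The added variables are jointly significant, yet R-squared rises by
less than half a percentage point, so they refine the model rather than
change its story. On the fitted objects,
`anova(gender_insurance_model_2, gender_insurance_model_3)` runs the
same test with unrounded values.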
After analyzing the full regression model controlling for age, BMI,
smoking status, number of children, and region, the estimated difference
between males and females is -$131.3 and is not statistically
significant (p = 0.693). This indicates that, once all major health and
demographic factors are accounted for, gender does not have a meaningful
impact on insurance charges.
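The path of the sex estimate across the four models can be summarized
in one table. The sketch below collects the printed `sexmale` estimates
and standard errors and computes a 95% confidence interval for each; the
data frame `sex_effect` and its column names are illustrative.

```r
# sexmale estimate, standard error, and residual df from each summary.
sex_effect <- data.frame(
  model    = c("sex", "sex + age", "sex + age + smoker + bmi",
               "all variables"),
  estimate = c(1387.2, 1538.83, -109.04, -131.3),
  se       = c(661.3, 631.08, 334.66, 332.9),
  df       = c(1336, 1335, 1333, 1329)
)

# Two-sided 95% confidence interval for each estimate.
t_crit <- qt(0.975, sex_effect$df)
sex_effect$lower <- sex_effect$estimate - t_crit * sex_effect$se
sex_effect$upper <- sex_effect$estimate + t_crit * sex_effect$se

# Flag intervals that include zero (no detectable difference).
sex_effect$covers_zero <- sex_effect$lower < 0 & sex_effect$upper > 0

print(sex_effect)
```

The interval excludes zero in the first two models and includes it once
smoking status and BMI enter the model.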
Overall, the data show that factors such as age, BMI, smoking
status, and number of children are much stronger predictors of insurance
costs than sex. Men do have higher average charges in the raw data, but
once these factors are controlled for, there is no evidence in this
dataset that men are charged significantly more for health insurance
than women.