library(tidyverse)  # attaches readr, dplyr, and ggplot2
library(modelr)
library(gridExtra)

insurance <- read_csv("insurance.csv")

To begin, I compared the average insurance charges for males and females using a simple summary of the data:

gender_insurance <- insurance %>%
  group_by(sex) %>%
  summarize(mean_charges = mean(charges))

gender_insurance
## # A tibble: 2 × 2
##   sex    mean_charges
##   <chr>         <dbl>
## 1 female       12570.
## 2 male         13957.

Men have a higher average insurance charge than women, but this simple comparison alone doesn’t prove that men are charged more because of their gender. Other factors in the dataset (like age, BMI, or smoking status) could be influencing the difference.

ggplot(insurance, aes(x = sex, y = charges)) +
  geom_boxplot(fill = c("pink", "lightblue")) +
  labs(title = "Insurance Charges by Sex",
       x = "Sex",
       y = "Charges")

The boxplot shows the distribution of charges for each sex. Because a boxplot marks the median rather than the mean, it cannot by itself confirm the difference in averages, but it does show that charges vary widely within each sex. This suggests other factors, such as age, BMI, or smoking, may explain the differences.

To determine whether men are charged more for health insurance than women, I ran a simple linear regression with charges as the response variable and sex as the predictor.

gender_insurance_model <- lm(charges ~ sex, data = insurance)

summary(gender_insurance_model)
## 
## Call:
## lm(formula = charges ~ sex, data = insurance)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12835  -8435  -3980   3476  51201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  12569.6      470.1  26.740   <2e-16 ***
## sexmale       1387.2      661.3   2.098   0.0361 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12090 on 1336 degrees of freedom
## Multiple R-squared:  0.003282,   Adjusted R-squared:  0.002536 
## F-statistic:   4.4 on 1 and 1336 DF,  p-value: 0.03613

The regression model estimates that the predicted average insurance charge for females is $12,569.6 (the intercept). The model also estimates that males are charged $1,387.2 more than females on average, giving a predicted average of $13,956.8 for males. These values match the group means computed earlier, as expected for a regression on a single two-level predictor. The difference is statistically significant at the 5% level (p = 0.036). However, the model does a poor job of explaining variation in charges. The residual standard error is very large, about 12,090, and the adjusted R-squared is extremely small (0.0025), meaning sex alone explains almost none of the variation in charges. With a single predictor, the F-statistic (4.4) is just the square of the t value for sexmale, so it carries the same information as the p-value above. Overall, while the difference is statistically significant, this simple model provides limited evidence that gender alone determines insurance charges.
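As a quick check on this interpretation, the sketch below uses a small made-up data frame (the values in `toy` are hypothetical and are not taken from insurance.csv) to show that a regression on a single two-level predictor reproduces the group means: the intercept is the mean of the reference group, and the intercept plus the `sexmale` coefficient is the mean of the other group. It uses base R only.

```r
# Toy data with hypothetical values (not from insurance.csv)
toy <- data.frame(
  sex     = c("female", "female", "female", "male", "male", "male"),
  charges = c(1000, 2000, 3000, 2000, 4000, 6000)
)

# Group means computed directly
group_means <- tapply(toy$charges, toy$sex, mean)

# Same quantities recovered from the regression coefficients
toy_model   <- lm(charges ~ sex, data = toy)
female_pred <- unname(coef(toy_model)["(Intercept)"])
male_pred   <- unname(coef(toy_model)["(Intercept)"] + coef(toy_model)["sexmale"])

print(group_means)
print(c(female = female_pred, male = male_pred))
```

The two printed vectors should agree, which is why the regression predictions here line up with the output of the earlier `group_by(sex)` summary.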

There are reasons to be cautious about this result. The model only includes sex as a predictor, but insurance charges are influenced by other factors such as age, BMI, smoking status, number of children, and region. The low adjusted R-squared and large residual standard error indicate that individual charges vary widely around the predicted values, and the statistical significance may be partly due to the large sample size rather than a meaningful effect. Therefore, this simple model may give a misleading impression if other variables are not considered.
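One way to see how imprecise the estimate is: compute a 95% confidence interval for the sex coefficient from the numbers reported in the summary output. The sketch below does this by hand, using the estimate, standard error, and residual degrees of freedom copied from the output above. Calling `confint(gender_insurance_model)` on the fitted model should give the same interval up to rounding.

```r
# Values copied from the summary() output of the sex-only model
estimate  <- 1387.2   # sexmale coefficient
std_error <- 661.3    # its standard error
df_resid  <- 1336     # residual degrees of freedom

# 95% confidence interval: estimate +/- t quantile * standard error
t_crit   <- qt(0.975, df = df_resid)
ci_lower <- estimate - t_crit * std_error
ci_upper <- estimate + t_crit * std_error

round(c(lower = ci_lower, upper = ci_upper), 1)
```

The interval excludes zero, consistent with p = 0.036, but its lower end sits close to zero and it spans a wide range of dollar amounts.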

To better understand whether the difference in insurance charges between males and females is truly due to gender, I re-ran the regression model while controlling for age. This allows us to account for the fact that age affects insurance costs and may differ between men and women, providing a clearer comparison of charges by sex.

gender_insurance_model_1 <- lm(charges ~ sex + age, data = insurance)

summary(gender_insurance_model_1)
## 
## Call:
## lm(formula = charges ~ sex + age, data = insurance)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8821  -6947  -5511   5443  48203 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2343.62     994.35   2.357   0.0186 *  
## sexmale      1538.83     631.08   2.438   0.0149 *  
## age           258.87      22.47  11.523   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11540 on 1335 degrees of freedom
## Multiple R-squared:  0.09344,    Adjusted R-squared:  0.09209 
## F-statistic:  68.8 on 2 and 1335 DF,  p-value: < 2.2e-16

After controlling for age, the regression model estimates that men are charged $1,538.83 more on average than women, holding age constant. This difference is statistically significant (p = 0.0149), indicating that males still have higher insurance charges even when accounting for age. The model also shows that insurance charges increase by about $258.87 for each additional year of age, confirming that age is an important predictor of costs. The residual standard error is about 11,540, and the adjusted R-squared is 0.0921, meaning the model explains roughly 9.2% of the variation in charges.
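For reference, the adjusted R-squared can be reproduced from the multiple R-squared, the number of observations, and the number of predictors. The sketch below uses the values printed in the summary output. The sample size of 1,338 is inferred from the 1,335 residual degrees of freedom plus the three estimated coefficients.

```r
# Values copied from the summary() output of charges ~ sex + age
r_squared <- 0.09344
n_obs     <- 1338   # 1335 residual df + 2 predictors + 1 intercept
n_pred    <- 2      # sex and age

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_pred - 1)
round(adj_r_squared, 4)
```

Because the sample is large relative to the number of predictors, the adjustment is tiny and the adjusted value stays close to the multiple R-squared.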

While this model improves on the simple sex-only comparison, the adjusted R-squared is still relatively low, meaning it explains only a small portion of the variation in charges. Other factors, such as BMI, smoking status, or number of children, may still influence insurance costs, so the comparison could be further strengthened by including these additional variables. Despite the statistical significance, the model provides limited evidence that gender alone determines insurance charges.

To better account for individual health differences that influence insurance charges, I re-ran the regression model including BMI and smoking status along with age. These two variables are strong indicators of a person’s health and are likely to explain a significant portion of the variation in insurance costs.

gender_insurance_model_2 <- lm(charges ~ sex + age + smoker + bmi, data = insurance)

summary(gender_insurance_model_2)
## 
## Call:
## lm(formula = charges ~ sex + age + smoker + bmi, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12364.7  -2972.2   -983.2   1475.8  29018.3 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -11633.49     947.27 -12.281   <2e-16 ***
## sexmale       -109.04     334.66  -0.326    0.745    
## age            259.45      11.94  21.727   <2e-16 ***
## smokeryes    23833.87     414.19  57.544   <2e-16 ***
## bmi            323.05      27.53  11.735   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6094 on 1333 degrees of freedom
## Multiple R-squared:  0.7475, Adjusted R-squared:  0.7467 
## F-statistic: 986.5 on 4 and 1333 DF,  p-value: < 2.2e-16

ggplot(insurance, aes(x = bmi, y = charges)) +
  geom_point(alpha = 0.5, size = 1.5) +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(title = "Insurance Charges vs BMI with Trendline",
       x = "BMI",
       y = "Charges")
## `geom_smooth()` using formula = 'y ~ x'

As BMI increases, insurance charges generally increase. Although the scatterplot is busy with many points, the fitted trendline slopes upward, indicating that higher BMI is associated with higher insurance costs. This is consistent with the regression output, where health factors like BMI and smoking status are strong predictors of charges and explain much of the variation in costs.

This comparison is much more reliable than the previous models because it accounts for important confounding variables: age, BMI, and smoking status. The estimated difference between males and females is now very small and not statistically significant (-$109.04, p = 0.745), while the adjusted R-squared rises to 0.7467. This suggests that gender alone is not a strong predictor of insurance charges once health factors are considered. While no model can capture every possible factor, this analysis provides good evidence that the previously observed difference was largely due to differences in health-related factors between men and women rather than to gender itself.
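The confounding mechanism described here can be illustrated with simulated data. The sketch below is not based on insurance.csv: the group sizes, smoking rates, and dollar amounts are made-up assumptions chosen only to show the effect. Charges are generated so that they depend on smoking but not on sex, yet a sex-only regression still reports a gap because smoking is more common among the simulated men.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Simulated data (hypothetical, not from insurance.csv):
# 100 women with 20 smokers, 100 men with 40 smokers
sim <- data.frame(
  sex    = rep(c("female", "male"), each = 100),
  smoker = c(rep(c("yes", "no"), times = c(20, 80)),
             rep(c("yes", "no"), times = c(40, 60)))
)

# Charges depend on smoking only; sex has no direct effect by construction
sim$charges <- 8000 + 24000 * (sim$smoker == "yes") +
  rnorm(nrow(sim), sd = 1000)

naive_fit    <- lm(charges ~ sex, data = sim)           # omits the confounder
adjusted_fit <- lm(charges ~ sex + smoker, data = sim)  # controls for it

naive_gap    <- unname(coef(naive_fit)["sexmale"])
adjusted_gap <- unname(coef(adjusted_fit)["sexmale"])

round(c(naive = naive_gap, adjusted = adjusted_gap))
```

In this simulation the naive gap is large while the adjusted gap is close to zero. That is the same qualitative pattern as in the insurance models, although the real data need not follow this exact mechanism.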

To ensure that all relevant factors are considered, I expanded the regression model to include the remaining variables in the dataset: region and number of children, along with sex, age, BMI, and smoking status. This allows us to determine whether these additional variables have a significant impact on insurance charges.

gender_insurance_model_3 <- lm(charges ~ sex + age + smoker + bmi + region + children, data = insurance)

summary(gender_insurance_model_3)
## 
## Call:
## lm(formula = charges ~ sex + age + smoker + bmi + region + children, 
##     data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11304.9  -2848.1   -982.1   1393.9  29992.8 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11938.5      987.8 -12.086  < 2e-16 ***
## sexmale           -131.3      332.9  -0.394 0.693348    
## age                256.9       11.9  21.587  < 2e-16 ***
## smokeryes        23848.5      413.1  57.723  < 2e-16 ***
## bmi                339.2       28.6  11.860  < 2e-16 ***
## regionnorthwest   -353.0      476.3  -0.741 0.458769    
## regionsoutheast  -1035.0      478.7  -2.162 0.030782 *  
## regionsouthwest   -960.0      477.9  -2.009 0.044765 *  
## children           475.5      137.8   3.451 0.000577 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7494 
## F-statistic: 500.8 on 8 and 1329 DF,  p-value: < 2.2e-16

In the full model, region has a small effect: holding the other variables constant, charges are estimated to be lower in the northwest (-$353.0, p = 0.459), southeast (-$1,035.0, p = 0.031), and southwest (-$960.0, p = 0.045) than in the baseline region (the one omitted from the coefficient table), but only the southeast and southwest differences are statistically significant at the 5% level. The number of children has a significant positive association with charges: each additional child is associated with $475.5 higher charges on average (p < 0.001). The estimated difference between males and females remains small and not statistically significant (-$131.3, p = 0.693). The residual standard error is 6,062, and the adjusted R-squared is 0.7494, meaning the model explains roughly 74.9% of the variation in charges, only slightly more than the 74.7% explained without region and children.
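To make the coefficients concrete, the sketch below plugs them into the regression equation by hand for a hypothetical person. The coefficient values are copied from the summary output. The helper `predict_charge()` and the example person are illustrative assumptions, and the baseline region is assumed to be "northeast" because it is the level not shown in the output. With the fitted model available, `predict(gender_insurance_model_3, newdata = ...)` would be the usual route.

```r
# Coefficients copied from the summary() output of the full model
coefs <- c(intercept = -11938.5, sexmale = -131.3, age = 256.9,
           smokeryes = 23848.5, bmi = 339.2, children = 475.5,
           regionnorthwest = -353.0, regionsoutheast = -1035.0,
           regionsouthwest = -960.0)

# Hypothetical helper: predicted charge for one person.
# Any region without its own coefficient (assumed here to be "northeast")
# is treated as the baseline and contributes nothing.
predict_charge <- function(age, bmi, children, male = FALSE,
                           smoker = FALSE, region = "northeast") {
  region_name <- paste0("region", region)
  region_term <- if (region_name %in% names(coefs)) coefs[[region_name]] else 0
  coefs[["intercept"]] + coefs[["sexmale"]] * male + coefs[["age"]] * age +
    coefs[["smokeryes"]] * smoker + coefs[["bmi"]] * bmi +
    coefs[["children"]] * children + region_term
}

# Hypothetical person: 40-year-old female non-smoker, BMI 30, two children
baseline_pred <- predict_charge(age = 40, bmi = 30, children = 2)
smoker_pred   <- predict_charge(age = 40, bmi = 30, children = 2, smoker = TRUE)
c(non_smoker = baseline_pred, smoker = smoker_pred)
```

Switching `male` from FALSE to TRUE moves the prediction by only $131.3, while switching `smoker` moves it by $23,848.5, which is the contrast the coefficient table is pointing to.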

In the full regression model, which controls for age, BMI, smoking status, number of children, and region, the estimated difference between males and females is -$131.3 and is not statistically significant (p = 0.693). Once these health and demographic factors are accounted for, sex has no statistically detectable association with insurance charges.

Overall, the data show that factors such as age, BMI, smoking status, and number of children are much stronger predictors of insurance costs than sex. The raw difference of about $1,387 in average charges disappears once these factors are controlled for, so there is no evidence in this dataset that men are charged more for health insurance than women because of their sex.