March 30, 2025

Introduction

This project analyzes a health insurance dataset to explore how various personal and lifestyle factors influence medical insurance charges.

The dataset includes variables such as age, sex, BMI, smoking status, number of children, region, and charges.

Our main goals are to:

  • Understand trends through exploratory data analysis (EDA).
  • Visualize key relationships using ggplot2 and Plotly.
  • Apply statistical modeling (linear regression, t-tests, ANOVA) to identify the most impactful predictors.

This analysis highlights how behaviors and demographics—especially smoking—can significantly affect insurance costs.

Data Description

Dataset: insurance.csv
Source: Kaggle - Medical Cost Personal Dataset

Variables:

  • age: Age of primary beneficiary
  • sex: Gender (male/female)
  • bmi: Body mass index
  • children: Number of dependents covered by insurance
  • smoker: Smoking status (yes/no)
  • region: Residential region in the US
  • charges: Individual medical costs billed by health insurance, in US dollars

Load Dataset & Preview

library(tidyverse)
library(plotly)
library(broom)
library(knitr)
library(kableExtra)

insurance <- read.csv("insurance.csv")
head(insurance)
##   age    sex    bmi children smoker    region   charges
## 1  19 female 27.900        0    yes southwest 16884.924
## 2  18   male 33.770        1     no southeast  1725.552
## 3  28   male 33.000        3     no southeast  4449.462
## 4  33   male 22.705        0     no northwest 21984.471
## 5  32   male 28.880        0     no northwest  3866.855
## 6  31 female 25.740        0     no southeast  3756.622

R Code + Pie Chart: Region Distribution


# Prepare data for pie chart
region_data <- insurance %>%
  count(region) %>%
  mutate(percentage = round(100 * n / sum(n), 1),
         label = paste0(region, ": ", percentage, "%"))

# Plot pie chart
ggplot(region_data, aes(x = "", y = percentage, fill = region)) +
  geom_col(width = 1, color = "white") +
  coord_polar(theta = "y") +
  labs(title = "Distribution of Customers by Region") +
  theme_void(base_size = 13) +
  theme(legend.title = element_blank(),
        plot.title = element_text(hjust = 0.5, face = "bold")) +
  geom_text(aes(label = label), position = position_stack(vjust = 0.5), size = 4)

Pie Chart: Region Distribution

Interpretation:

  1. This pie chart shows the percentage distribution of insurance policyholders across the four U.S. regions.
  2. The southeast has the largest share at 27.2%, while the northeast, northwest, and southwest each make up about 24% of the dataset.
  3. Because the four regions are similar in size, later regional comparisons (e.g., ANOVA) rest on groups with comparable sample sizes rather than being dominated by one region.

ggplot: Smoker Count by Gender

Interpretation:

  1. The bar chart displays the distribution of smokers vs non-smokers across genders.
  2. Males have a higher proportion of smokers compared to females, even though both genders have more non-smokers overall.
  3. This imbalance may contribute to the differences observed in insurance charges, as smoking status is a strong predictor of cost in our regression model.
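
The proportion claim in point 2 can be checked numerically with a cross-tabulation. The sketch below is one way to do that in base R. The helper name `smoker_share_by_sex` is hypothetical and not part of the original analysis, and it assumes a data frame with `sex` and `smoker` columns like the one previewed earlier.

```r
# Hypothetical helper (not from the original analysis): percentage of
# smokers and non-smokers within each sex.
# Assumes `df` has character or factor columns `sex` and `smoker`.
smoker_share_by_sex <- function(df) {
  counts <- table(sex = df$sex, smoker = df$smoker)
  # margin = 1 normalizes within rows, so each sex sums to 100%
  round(100 * prop.table(counts, margin = 1), 1)
}
```

With the data loaded as `insurance`, calling `smoker_share_by_sex(insurance)` would return one row per sex, with the `yes` column giving the share of smokers in that group.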

Boxplot: Charges by Smoker Status

Interpretation: This boxplot shows a dramatic difference in insurance charges between smokers and non-smokers.

  • Non-smokers: Charges are generally clustered below $15,000.
  • Smokers: Much higher median and more extreme outliers, often exceeding $35,000.

The wider spread and higher median for smokers visually support the results of the t-test and suggest that smoking is the single most impactful variable for cost in this dataset.
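
The code for this boxplot is not shown in the slides, so the following is only a sketch of how a comparable figure could be drawn with ggplot2. The helper name `plot_charges_by_smoker`, the title, and the styling choices are assumptions.

```r
library(ggplot2)

# Hypothetical helper: boxplot of charges split by smoking status.
# Assumes `df` has columns `smoker` and `charges`.
plot_charges_by_smoker <- function(df) {
  ggplot(df, aes(x = smoker, y = charges, fill = smoker)) +
    geom_boxplot(outlier.alpha = 0.5) +   # soften outliers so the boxes stay readable
    labs(title = "Insurance Charges by Smoking Status",
         x = "Smoker", y = "Charges (USD)") +
    theme_minimal(base_size = 13) +
    theme(legend.position = "none",       # x-axis already labels the groups
          plot.title = element_text(hjust = 0.5, face = "bold"))
}
```

Calling `plot_charges_by_smoker(insurance)` returns a ggplot object that can be printed or customized further.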

Scatter Plots: Charges vs. BMI and Charges vs. Age

Interpretation:

  1. In the Charges vs. BMI plot, charges remain fairly constant for most individuals but spike sharply for smokers with high BMI, especially those classified as obese (BMI of 30 or above).
  2. In the Charges vs. Age plot, insurance charges generally increase with age. This trend is more pronounced for smokers, who incur significantly higher costs than non-smokers at nearly every age.
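
Both scatter plots follow the same pattern, so one small function can cover them. This is a sketch rather than the original plotting code: the helper name `plot_charges_vs` is hypothetical, and the per-group straight trend lines are an added assumption meant to make the smoker and non-smoker patterns easier to compare.

```r
library(ggplot2)

# Hypothetical helper: charges against a numeric predictor, colored by smoker.
# `x_var` is the column name as a string, e.g. "bmi" or "age".
plot_charges_vs <- function(df, x_var) {
  ggplot(df, aes(x = .data[[x_var]], y = charges, color = smoker)) +
    geom_point(alpha = 0.6) +
    # One linear trend per smoking group (an assumption, not in the slides)
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
    labs(title = paste("Charges vs.", x_var),
         x = x_var, y = "Charges (USD)", color = "Smoker") +
    theme_minimal(base_size = 13)
}
```

`plot_charges_vs(insurance, "bmi")` and `plot_charges_vs(insurance, "age")` would then produce the two panels discussed above.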

Plotly: 3D Plot (Age, BMI, Charges)

Interpretation:

  1. This 3D plot visualizes how Age and BMI together influence Insurance Charges, colored by smoking status.
  2. There is a clear vertical separation between smokers (red) and non-smokers (blue), showing that smokers consistently incur higher charges across all age and BMI levels.
  3. The combination of higher BMI and older age, especially among smokers, leads to the highest charges, which is consistent with including a smoker-by-BMI interaction in the regression model.
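
A 3D scatter like the one described can be built with `plot_ly()` using the `scatter3d` trace type. The sketch below is an approximation, not the original code: the helper name `plot_age_bmi_charges_3d`, the marker size, and the title are assumptions, while the blue and red colors follow the description above.

```r
library(plotly)

# Hypothetical helper: interactive 3D scatter of age, BMI, and charges.
# Colors map to the smoker levels in alphabetical order: "no" blue, "yes" red.
plot_age_bmi_charges_3d <- function(df) {
  plot_ly(df, x = ~age, y = ~bmi, z = ~charges,
          color = ~smoker, colors = c("blue", "red"),
          type = "scatter3d", mode = "markers",
          marker = list(size = 3)) %>%
    layout(title = "Age, BMI, and Charges by Smoking Status",
           scene = list(xaxis = list(title = "Age"),
                        yaxis = list(title = "BMI"),
                        zaxis = list(title = "Charges")))
}
```

`plot_age_bmi_charges_3d(insurance)` returns an htmlwidget that can be rotated and zoomed in a browser or in rendered HTML output.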

Plotly: Charges vs. BMI Colored by Smoker

Interpretation:

  1. This interactive scatter plot visualizes the relationship between BMI and insurance charges, with points colored by smoking status.
  2. Smokers (in red) tend to have significantly higher charges, especially when their BMI is elevated. This supports the strong effect of smoking and BMI seen in the regression model.
  3. Non-smokers (in blue) have lower charges overall and show less variation, indicating that smoking is a key driver of cost variability, particularly when combined with obesity.

Descriptive Statistics

Descriptive Statistics Summary

Variable      Mean   Median       SD      Min       Q1       Q3      Max
Age           39.2     39.0     14.0     18.0     27.0     51.0     64.0
BMI           30.7     30.4      6.1     16.0     26.3     34.7     53.1
Children       1.1      1.0      1.2      0.0      0.0      2.0      5.0
Charges    13270.4   9382.0  12110.0   1121.9   4740.3  16639.9  63770.4

Age: The average age is 39.2, and the middle 50% of individuals are between 27 and 51 years old.

BMI: Mean BMI is 30.7, which falls just inside the obese range (BMI of 30 or above). BMI ranges from 16 to 53, showing substantial variation.

Children: On average, individuals have about 1 child, and at least 75% have 2 or fewer.

Charges: Charges range widely, from $1,121.90 to $63,770.40, with a mean of $13,270.40. The median ($9,382.00) is well below the mean, indicating a right-skewed distribution.
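
The summary table can be reproduced with a few base R calls. This is a minimal sketch, not the original code: the helper name `describe_numeric` is hypothetical, and it uses R's default quantile method, which may differ slightly from other software.

```r
# Hypothetical helper: one row per numeric variable, one column per statistic.
# Assumes the listed columns exist in `df` and are numeric.
describe_numeric <- function(df, vars = c("age", "bmi", "children", "charges")) {
  stats <- sapply(df[vars], function(x) {
    c(Mean   = mean(x),
      Median = median(x),
      SD     = sd(x),
      Min    = min(x),
      Q1     = unname(quantile(x, 0.25)),
      Q3     = unname(quantile(x, 0.75)),
      Max    = max(x))
  })
  # sapply returns statistics in rows, so transpose to match the table layout
  round(t(stats), 1)
}
```

`describe_numeric(insurance)` returns a numeric matrix that can be passed to `knitr::kable()` for display.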

Regression: Predicting Insurance Charges

Linear Regression Coefficients

term              estimate std.error statistic p.value
(Intercept)      -11938.54    987.82    -12.09    0.00
age                 256.86     11.90     21.59    0.00
bmi                 339.19     28.60     11.86    0.00
children            475.50    137.80      3.45    0.00
smokeryes         23848.53    413.15     57.72    0.00
sexmale            -131.31    332.95     -0.39    0.69
regionnorthwest    -352.96    476.28     -0.74    0.46
regionsoutheast   -1035.02    478.69     -2.16    0.03
regionsouthwest    -960.05    477.93     -2.01    0.04

Interpretation: This multiple linear regression model predicts insurance charges using age, BMI, number of children, smoking status, sex, and region.

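Smoker status is the strongest predictor: being a smoker is associated with charges about $23,848 higher on average, all else held constant (p < 0.001).

Other strong predictors include age, BMI, and number of children, all statistically significant.

Gender (p = 0.69) and the northwest region (p = 0.46) are not significant at the 0.05 level. The southeast and southwest coefficients are only marginally significant (p = 0.03 and p = 0.04) and, like the northwest coefficient, are measured relative to the northeast baseline.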
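
A model of this form can be fitted with `lm()`. The sketch below infers the formula from the terms in the coefficient table and uses base R to extract the estimates. The helper names `fit_charges_model` and `coefficient_table` are hypothetical, and the original table may have been produced differently (for example with `broom::tidy()`).

```r
# Hypothetical helper: additive linear model with the six predictors
# listed in the coefficient table. Assumes `df` has these columns.
fit_charges_model <- function(df) {
  lm(charges ~ age + bmi + children + smoker + sex + region, data = df)
}

# Coefficient matrix with columns Estimate, Std. Error, t value, Pr(>|t|),
# rounded to two decimals like the table above.
coefficient_table <- function(model) {
  round(coef(summary(model)), 2)
}
```

With the data loaded, `coefficient_table(fit_charges_model(insurance))` gives one row per term. Categorical predictors appear as dummy variables (`smokeryes`, `sexmale`, `regionnorthwest`, and so on), each compared against the omitted reference level.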

Regression Model Fit

Model Fit Statistics

r.squared  adj.r.squared     sigma
    0.751          0.749  6062.102

Interpretation: The model explains about 74.9% of the variability in insurance charges (Adjusted R² = 0.749).

The residual standard error is 6062.1, which can be read as the typical size of a prediction error, in the same units as charges (dollars).

This indicates a good model fit for a real-world dataset with complex variation.
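
As a quick sanity check, adjusted R² can be recomputed from R² using the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1). The sketch below assumes n = 1338 observations (implied by the ANOVA degrees of freedom, 3 + 1334 + 1) and p = 8 predictor terms (the non-intercept rows of the coefficient table). The function name `adjusted_r2` is hypothetical.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p.
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Using the rounded R-squared from the table; the result lands at about 0.75,
# in line with the reported 0.749 given the rounded input.
adjusted_r2(0.751, n = 1338, p = 8)
```

Because n is large relative to p, the adjustment is tiny here, which is why R² and adjusted R² are nearly identical.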

T-test: Charges by Smoking Status

Welch Two-Sample T-Test Results

Mean (Non-Smokers)    8434.27
Mean (Smokers)        32050.23
Difference            -23615.96
95% CI Lower          -25034.71
95% CI Upper          -22197.21
t-Statistic           -32.75
p-Value               < 0.001 (displayed as 0 after rounding)
Test Method           Welch Two Sample t-test

Interpretation: A Welch two-sample t-test was used to compare the average insurance charges between smokers and non-smokers.

  • Mean charge (smokers): $32,050
  • Mean charge (non-smokers): $8,434
  • Difference (non-smokers minus smokers): -$23,616
  • 95% CI for the difference: [-$25,035, -$22,197]
  • p-value: < 0.001

This result is statistically significant and confirms that smokers incur much higher insurance charges on average. The large difference, combined with a confidence interval that is narrow relative to its size and excludes zero, strongly supports the conclusion that smoking is a major cost driver. This aligns with the findings from our regression analysis and descriptive statistics, providing consistent evidence across methods.
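
The Welch test is the default behavior of `t.test()` in R, which does not assume equal variances unless told to. The sketch below shows one way the comparison could be run. The helper name `welch_smoker_test` is hypothetical, and it assumes `smoker` takes the values "no" and "yes".

```r
# Hypothetical helper: Welch two-sample t-test of charges by smoking status.
# With levels ordered "no", "yes", the estimated difference is
# mean(non-smokers) - mean(smokers), so it is negative when smokers cost more.
welch_smoker_test <- function(df) {
  t.test(charges ~ smoker, data = df, var.equal = FALSE)
}
```

The returned object exposes the pieces reported above: `$estimate` for the two group means, `$conf.int` for the 95% confidence interval, `$statistic` for t, and `$p.value`.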

ANOVA: Charges by Region

ANOVA Table: Charges by Region

term         df         sumsq     meansq  statistic  p.value
region        3    1300759681  433586560       2.97    0.031
Residuals  1334  194773461887  146007093         NA       NA

Interpretation: ANOVA shows a statistically significant difference in mean charges across regions (F = 2.97, p = 0.031). The chart shows that the southeast region has the highest average insurance charges, while the northwest and southwest are lower. However, the practical difference is modest: compared to smoking or age, region has a relatively small influence on charges.
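
The "modest practical difference" can be quantified with eta-squared, the share of total variation attributable to region, computed directly from the sums of squares in the ANOVA table. The sketch below also shows how the one-way ANOVA itself could be run. The helper names `region_anova` and `eta_squared` are hypothetical.

```r
# Hypothetical helper: one-way ANOVA of charges across regions.
region_anova <- function(df) {
  summary(aov(charges ~ region, data = df))
}

# Eta-squared: effect sum of squares divided by total sum of squares.
eta_squared <- function(ss_effect, ss_residual) {
  ss_effect / (ss_effect + ss_residual)
}

# Plugging in the sums of squares reported in the ANOVA table:
# about 0.0066, so region accounts for under 1% of the variation in charges.
region_eta_sq <- eta_squared(1300759681, 194773461887)
```

A result can therefore be statistically significant at this sample size while still explaining very little of the variation in charges.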

Plot: Actual vs Predicted Charges

Interpretation:

  1. The plot shows the predicted insurance charges (x-axis) vs. the actual charges (y-axis). The red dashed line represents perfect predictions (where predicted = actual).
  2. Most predictions cluster around the red line, indicating good model accuracy. However, for higher-cost individuals, the model still slightly underpredicts charges.
  3. The improved model, which adds an interaction term between smoker and BMI, better captures how the effect of BMI differs by smoking status, especially for smokers with high BMI, leading to stronger predictive performance overall.
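
The slides do not show the formula of the improved model, so the sketch below is only one plausible specification: the same predictors as before plus a `smoker * bmi` interaction. The helper names `fit_interaction_model` and `actual_vs_predicted` are hypothetical.

```r
# Hypothetical specification of the improved model (an assumption):
# `smoker * bmi` expands to smoker + bmi + smoker:bmi, letting the BMI slope
# differ between smokers and non-smokers.
fit_interaction_model <- function(df) {
  lm(charges ~ age + children + sex + region + smoker * bmi, data = df)
}

# Pair each observation's predicted charge with its actual charge,
# ready for an actual-vs-predicted scatter plot.
actual_vs_predicted <- function(model, df) {
  data.frame(predicted = unname(predict(model, newdata = df)),
             actual = df$charges)
}
```

The resulting data frame can be plotted with `geom_point()` plus `geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red")` to draw the perfect-prediction line described in point 1.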

Conclusion

  • Smoking status is the most significant predictor of insurance charges, consistently confirmed by regression and t-tests.
  • Other influential variables include age, BMI, and number of children, all positively associated with higher charges.
  • Gender and region showed smaller or mixed effects, indicating lower practical importance.

This analysis used a combination of EDA, visualization, and statistical modeling to draw meaningful conclusions about insurance cost drivers.

Thank You!

Thank you for your time and attention!

Questions and feedback are welcome.

Presented by Kumar Satvik Chaudhary