A. Introduction

Health insurance premiums and medical costs are a major financial concern for most households, and insurers often argue that lifestyle factors — especially smoking — justify charging some customers substantially more than others. This project investigates whether that difference actually shows up in real billing data.

Research question: Do smokers have higher average medical insurance charges than non-smokers?

To answer this, I use the Medical Cost Personal Dataset (insurance.csv). The dataset contains 1,338 observations (cases) and 7 variables, where each case is an individual insurance beneficiary. The seven variables are age, sex, bmi (body mass index), children (number of dependents), smoker (yes/no), region (U.S. residential area), and charges (the individual medical costs billed by health insurance, in USD). For this analysis the two variables of interest are charges, a continuous numeric response variable, and smoker, a categorical variable with two levels (yes/no) that splits the data into the two groups I compare. The remaining variables are used for exploratory context.

The dataset was obtained from Kaggle (“Medical Cost Personal Datasets,” uploaded by Miri Choi), and it originally appears in the textbook Machine Learning with R by Brett Lantz. It can be accessed at: https://www.kaggle.com/datasets/mirichoi0218/insurance.

B. Data Analysis

I begin by importing the data and performing exploratory data analysis (EDA) to understand its structure and check for missing values, using functions such as dim(), str(), summary(), and colSums(is.na()). I then use several dplyr verbs (filter(), select(), mutate(), group_by(), and summarize()) to clean and summarize the data, and I create visualizations — a histogram of charges, a boxplot of charges by smoking status, and a scatterplot of BMI versus charges — to reveal the distribution of costs and the relationship between smoking and charges.

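library(dplyr)    # mutate(), select(), filter(), group_by(), summarize(), pull()
library(ggplot2)  # histogram, boxplot, and scatterplot below

insurance <- read.csv("insurance.csv")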

dim(insurance)          # 1338 rows, 7 columns
## [1] 1338    7
str(insurance)          # variable types
## 'data.frame':    1338 obs. of  7 variables:
##  $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
##  $ sex     : chr  "female" "male" "male" "male" ...
##  $ bmi     : num  27.9 33.8 33 22.7 28.9 ...
##  $ children: int  0 1 3 0 0 0 1 3 2 0 ...
##  $ smoker  : chr  "yes" "no" "no" "no" ...
##  $ region  : chr  "southwest" "southeast" "southeast" "northwest" ...
##  $ charges : num  16885 1726 4449 21984 3867 ...
summary(insurance$charges)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1122    4740    9382   13270   16640   63770
colSums(is.na(insurance))  # check for missing values (none)
##      age      sex      bmi children   smoker   region  charges 
##        0        0        0        0        0        0        0
insurance <- insurance |>
  mutate(smoker = factor(smoker, levels = c("no", "yes")))

group_summary <- insurance |>
  select(smoker, charges) |>
  group_by(smoker) |>
  summarize(
    n           = n(),
    mean_charge = mean(charges),
    median_charge = median(charges),
    sd_charge   = sd(charges)
  )
group_summary
## # A tibble: 2 × 5
##   smoker     n mean_charge median_charge sd_charge
##   <fct>  <int>       <dbl>         <dbl>     <dbl>
## 1 no      1064       8434.         7345.     5994.
## 2 yes      274      32050.        34456.    11542.

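The summary table already hints at a large gap: on average, smokers are billed roughly $32,050 versus about $8,434 for non-smokers, nearly four times as much. The spread is also wider among smokers (standard deviation of about $11,542 versus $5,994).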

ggplot(insurance, aes(x = charges)) +
  geom_histogram(binwidth = 2500, fill = "steelblue", color = "white") +
  labs(title = "Distribution of Medical Charges",
       x = "Charges (USD)", y = "Count")

ggplot(insurance, aes(x = smoker, y = charges, fill = smoker)) +
  geom_boxplot() +
  labs(title = "Medical Charges by Smoking Status",
       x = "Smoker", y = "Charges (USD)") +
  theme(legend.position = "none")

ggplot(insurance, aes(x = bmi, y = charges, color = smoker)) +
  geom_point(alpha = 0.6) +
  labs(title = "BMI vs. Charges, Colored by Smoking Status",
       x = "BMI", y = "Charges (USD)", color = "Smoker")

The histogram shows that charges are strongly right-skewed, which agrees with the summary output, where the mean ($13,270) is well above the median ($9,382). The boxplot makes the group difference visually obvious: the typical charge for smokers sits far above that of non-smokers (median of about $34,456 versus $7,345). The scatterplot reinforces this: smokers form a distinctly higher band of charges, and among smokers, higher BMI is associated with especially high costs.

C. Statistical Analysis

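Because I am comparing the mean of a numeric variable (charges) between two independent groups (smoker = yes vs. no), the appropriate test is an independent two-sample t-test. I use Welch's version, which is the default for t.test() in R and does not assume equal variances; this suits the data, since the standard deviation of charges for smokers (about $11,542) is nearly twice that for non-smokers (about $5,994). Although charges are right-skewed, both groups are large (274 and 1,064 cases), so the sample means should be approximately normally distributed. The research question is directional (I expect smokers to pay more), so I use a one-sided test at a significance level of α = 0.05. Let μ₁ be the mean charge for smokers and μ₂ be the mean charge for non-smokers. The hypotheses are H₀: μ₁ − μ₂ = 0 versus Hₐ: μ₁ − μ₂ > 0.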

smoker_charges    <- insurance |> filter(smoker == "yes") |> pull(charges)
nonsmoker_charges <- insurance |> filter(smoker == "no")  |> pull(charges)

t_result <- t.test(smoker_charges, nonsmoker_charges, alternative = "greater")
t_result
## 
##  Welch Two Sample t-test
## 
## data:  smoker_charges and nonsmoker_charges
## t = 32.752, df = 311.85, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  22426.4     Inf
## sample estimates:
## mean of x mean of y 
## 32050.232  8434.268

The test produces a t-statistic of about 32.8 on roughly 312 degrees of freedom, with a p-value < 2.2 × 10⁻¹⁶, far smaller than α = 0.05. The estimated mean charges are about $32,050 for smokers versus $8,434 for non-smokers, a difference of roughly $23,600, and the one-sided 95% confidence bound places the true difference at no less than about $22,426. I therefore reject the null hypothesis of equal means in favor of the alternative that smokers have higher mean charges.
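As a sanity check, the Welch statistic can be approximately reproduced by hand from the summary statistics reported above. The sketch below is not part of the original analysis: the variable names are my own, and the standard deviations are the rounded values from the group summary table, so the results should land close to, but not exactly on, the t.test() output.

```r
# Summary statistics copied from the group table and t-test output above.
# The standard deviations are rounded, so results match only approximately.
n_smoker    <- 274;  mean_smoker    <- 32050.232; sd_smoker    <- 11542
n_nonsmoker <- 1064; mean_nonsmoker <- 8434.268;  sd_nonsmoker <- 5994

# Variance of each group's sample mean
var_smoker    <- sd_smoker^2 / n_smoker
var_nonsmoker <- sd_nonsmoker^2 / n_nonsmoker

# Difference in means, its standard error, and the t-statistic
diff_means <- mean_smoker - mean_nonsmoker
std_error  <- sqrt(var_smoker + var_nonsmoker)
t_stat     <- diff_means / std_error

# Welch-Satterthwaite approximation to the degrees of freedom
welch_df <- (var_smoker + var_nonsmoker)^2 /
  (var_smoker^2 / (n_smoker - 1) + var_nonsmoker^2 / (n_nonsmoker - 1))

# One-sided (upper-tail) p-value and 95% lower confidence bound
p_value     <- pt(t_stat, df = welch_df, lower.tail = FALSE)
lower_bound <- diff_means - qt(0.95, df = welch_df) * std_error

round(c(diff = diff_means, t = t_stat, df = welch_df, lower = lower_bound), 2)
```

Working from the two group variances separately, rather than pooling them, is what distinguishes Welch's test from the equal-variance version.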

D. Conclusion and Future Directions

The analysis provides overwhelming statistical evidence that smokers are billed higher average medical charges than non-smokers: roughly $23,600 more on average in this dataset. Both the visualizations and the two-sample t-test (p < 0.001) point to the same conclusion, so the result directly and strongly answers the research question. This is consistent with the well-established health risks of smoking and helps explain why insurers treat smoking status as a key rating factor.

Several limitations suggest future work. This dataset is observational, so it shows association, not proof that smoking causes higher charges — smokers may differ from non-smokers in other ways (age, BMI, region) that also affect cost. A natural next step would be a multiple linear regression of charges on smoker, age, bmi, and other variables to estimate the effect of smoking while controlling for these factors, or an interaction analysis of smoker × bmi, since the scatterplot suggested smoking and high BMI together drive the most extreme costs. Analyzing more recent and larger datasets would also help confirm whether this gap has changed over time.
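The regression proposed above could be set up roughly as follows. This is only a sketch of one possible model specification, not a fitted result: fit_adjusted_model() is a hypothetical helper, and the demo data frame is synthetic, generated with arbitrary coefficients so the code runs on its own. It says nothing about the real insurance data.

```r
# Hypothetical helper (not part of the original analysis): regresses charges
# on smoking status, BMI, their interaction, and age.
fit_adjusted_model <- function(data) {
  lm(charges ~ smoker * bmi + age, data = data)
}

# Synthetic stand-in data so the sketch is self-contained. The generating
# coefficients are arbitrary and are NOT estimates from insurance.csv.
set.seed(1)
n <- 200
demo <- data.frame(
  age    = sample(18:64, n, replace = TRUE),
  bmi    = rnorm(n, mean = 30, sd = 6),
  smoker = factor(sample(c("no", "yes"), n, replace = TRUE),
                  levels = c("no", "yes"))
)
demo$charges <- 2000 + 250 * demo$age + 50 * demo$bmi +
  (demo$smoker == "yes") * (1000 + 700 * demo$bmi) +
  rnorm(n, sd = 3000)

demo_fit <- fit_adjusted_model(demo)
coef(demo_fit)  # "smokeryes:bmi" is the smoker-by-BMI interaction term
```

With the real data, the same helper could be called as fit_adjusted_model(insurance) once smoker has been converted to a factor. The smoker * bmi term expands to both main effects plus their interaction, so the coefficient on smokeryes:bmi would measure how much steeper the BMI slope is for smokers.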

E. References