Health insurance premiums and medical costs are a major financial concern for most households, and insurers often argue that lifestyle factors — especially smoking — justify charging some customers substantially more than others. This project investigates whether that difference actually shows up in real billing data.
Research question: Do smokers have higher average medical insurance charges than non-smokers?
To answer this, I use the Medical Cost Personal
Dataset (insurance.csv). The dataset contains
1,338 observations (cases) and 7 variables, where each
case is an individual insurance beneficiary. The seven variables are
age, sex, bmi (body mass index),
children (number of dependents), smoker
(yes/no), region (U.S. residential area), and
charges (the individual medical costs billed by health
insurance, in USD). For this analysis the two variables of interest are
charges, a continuous numeric response
variable, and smoker, a categorical
variable with two levels (yes/no) that splits
the data into the two groups I compare. The remaining variables are used
for exploratory context.
The dataset was obtained from Kaggle (“Medical Cost Personal Datasets,” uploaded by Miri Choi), and it originally appears in the textbook Machine Learning with R by Brett Lantz. It can be accessed at: https://www.kaggle.com/datasets/mirichoi0218/insurance.
I begin by importing the data and performing exploratory data
analysis (EDA) to understand its structure and check for missing values,
using functions such as dim(), str(),
summary(), and colSums(is.na()). I then use
several dplyr verbs (filter(),
select(), mutate(), group_by(),
and summarize()) to clean and summarize the data, and I
create visualizations — a histogram of charges, a boxplot of charges by
smoking status, and a scatterplot of BMI versus charges — to reveal the
distribution of costs and the relationship between smoking and
charges.
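library(dplyr)   # mutate(), select(), group_by(), summarize(), filter(), pull()
library(ggplot2) # histogram, boxplot, and scatterplot
insurance <- read.csv("insurance.csv")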
dim(insurance) # 1338 rows, 7 columns
## [1] 1338 7
str(insurance) # variable types
## 'data.frame': 1338 obs. of 7 variables:
## $ age : int 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : chr "female" "male" "male" "male" ...
## $ bmi : num 27.9 33.8 33 22.7 28.9 ...
## $ children: int 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : chr "yes" "no" "no" "no" ...
## $ region : chr "southwest" "southeast" "southeast" "northwest" ...
## $ charges : num 16885 1726 4449 21984 3867 ...
summary(insurance$charges)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1122 4740 9382 13270 16640 63770
colSums(is.na(insurance)) # check for missing values (none)
## age sex bmi children smoker region charges
## 0 0 0 0 0 0 0
insurance <- insurance |>
  mutate(smoker = factor(smoker, levels = c("no", "yes")))
group_summary <- insurance |>
  select(smoker, charges) |>
  group_by(smoker) |>
  summarize(
    n = n(),
    mean_charge = mean(charges),
    median_charge = median(charges),
    sd_charge = sd(charges)
  )
group_summary
## # A tibble: 2 × 5
## smoker n mean_charge median_charge sd_charge
## <fct> <int> <dbl> <dbl> <dbl>
## 1 no 1064 8434. 7345. 5994.
## 2 yes 274 32050. 34456. 11542.
The summary table already hints at a large gap: on average, smokers are billed roughly $32,050 versus about $8,434 for non-smokers.
ggplot(insurance, aes(x = charges)) +
  geom_histogram(binwidth = 2500, fill = "steelblue", color = "white") +
  labs(title = "Distribution of Medical Charges",
       x = "Charges (USD)", y = "Count")
ggplot(insurance, aes(x = smoker, y = charges, fill = smoker)) +
  geom_boxplot() +
  labs(title = "Medical Charges by Smoking Status",
       x = "Smoker", y = "Charges (USD)") +
  theme(legend.position = "none")
ggplot(insurance, aes(x = bmi, y = charges, color = smoker)) +
  geom_point(alpha = 0.6) +
  labs(title = "BMI vs. Charges, Colored by Smoking Status",
       x = "BMI", y = "Charges (USD)", color = "Smoker")
The histogram shows that charges are strongly right-skewed. The boxplot makes the group difference visually obvious — the entire distribution of charges for smokers sits far above that of non-smokers. The scatterplot reinforces this: smokers form a distinctly higher band of charges, and among smokers, higher BMI is associated with especially high costs.
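The right skew can also be checked numerically, since a mean well above the median points the same way as the histogram. The sketch below is only illustrative: `mean_to_median()` is a hypothetical helper, not part of the original analysis, and the file-based check runs only if insurance.csv is in the working directory.

```r
# Hypothetical helper: a ratio well above 1 is a quick sign of right skew.
mean_to_median <- function(x) mean(x) / median(x)

# Using the rounded mean and median printed by summary(insurance$charges).
13270 / 9382

# Same check on the raw data, assuming the CSV is available locally.
if (file.exists("insurance.csv")) {
  charges <- read.csv("insurance.csv")$charges
  print(mean_to_median(charges))
}
```

A ratio noticeably greater than 1 agrees with the long right tail in the histogram, but it is a rough indicator rather than a formal skewness measure.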
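Because I am comparing the mean of a numeric variable
(charges) between two independent groups
(smoker = yes vs. no), the appropriate test is an
independent two-sample t-test. R's t.test() defaults to the Welch
version, which does not assume equal variances; that suits these data,
since the two group standard deviations differ by nearly a factor of
two. The research question is
directional (I expect smokers to pay more), so I use a
one-sided test at a significance level of α = 0.05. Let
μ₁ be the mean charge for smokers and μ₂ be the mean charge for
non-smokers. The hypotheses are H₀: μ₁ = μ₂ versus Hₐ: μ₁ > μ₂.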
smoker_charges <- insurance |> filter(smoker == "yes") |> pull(charges)
nonsmoker_charges <- insurance |> filter(smoker == "no") |> pull(charges)
t_result <- t.test(smoker_charges, nonsmoker_charges, alternative = "greater")
t_result
##
## Welch Two Sample t-test
##
## data: smoker_charges and nonsmoker_charges
## t = 32.752, df = 311.85, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 22426.4 Inf
## sample estimates:
## mean of x mean of y
## 32050.232 8434.268
The test produces a t-statistic of about 32.8 on roughly 312 degrees of freedom, with a p-value < 2.2 × 10⁻¹⁶, far smaller than α = 0.05. The estimated mean charges are about $32,050 for smokers versus $8,434 for non-smokers, a difference of roughly $23,600, and the one-sided 95% confidence interval places the true difference above about $22,426. I therefore reject the null hypothesis of equal means in favor of the alternative that smokers have higher mean charges.
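To make the reported numbers less of a black box, the Welch statistic can be rebuilt by hand from the group sizes, means, and standard deviations already printed above. This is a minimal sketch, not the method t.test() uses internally line for line; the variable names are my own, and because the standard deviations are taken from the rounded table, the results should land close to the reported values rather than match them exactly.

```r
# Summary statistics copied from the group_summary table and t.test() output.
n_s  <- 274;  mean_s  <- 32050.232; sd_s  <- 11542  # smokers
n_ns <- 1064; mean_ns <- 8434.268;  sd_ns <- 5994   # non-smokers

# Each group's contribution to the squared standard error.
v_s  <- sd_s^2  / n_s
v_ns <- sd_ns^2 / n_ns

# Difference in means, its standard error, and the t statistic.
diff_means <- mean_s - mean_ns
se_diff    <- sqrt(v_s + v_ns)
t_stat     <- diff_means / se_diff

# Welch-Satterthwaite approximation to the degrees of freedom.
df_welch <- (v_s + v_ns)^2 / (v_s^2 / (n_s - 1) + v_ns^2 / (n_ns - 1))

# One-sided p-value and 95% lower confidence bound for mu1 - mu2.
p_value     <- pt(t_stat, df = df_welch, lower.tail = FALSE)
lower_bound <- diff_means - qt(0.95, df = df_welch) * se_diff

round(c(difference = diff_means, t = t_stat,
        df = df_welch, lower_95 = lower_bound), 2)
```

The non-integer degrees of freedom come from the Welch-Satterthwaite formula, which is why the output shows df = 311.85 rather than n₁ + n₂ − 2 = 1336.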
The analysis provides overwhelming statistical evidence that smokers are billed higher average medical charges than non-smokers: roughly $23,600 more on average in this dataset. Both the visualizations and the two-sample t-test (p < 0.001) point to the same conclusion, so the result directly answers the research question. This is consistent with the well-established health risks of smoking and helps explain why insurers treat smoking status as a key rating factor.
Several limitations suggest future work. This dataset is
observational, so it shows association, not proof that smoking
causes higher charges — smokers may differ from non-smokers in
other ways (age, BMI, region) that also affect cost. A natural next step
would be a multiple linear regression of
charges on smoker, age,
bmi, and other variables to estimate the effect of smoking
while controlling for these factors, or an interaction analysis of
smoker × bmi, since the scatterplot suggested smoking and
high BMI together drive the most extreme costs. Analyzing more recent
and larger datasets would also help confirm whether this gap has changed
over time.
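The regression suggested above could be sketched as follows. This is an assumption about how such a follow-up might look, not an analysis that was run for this report: `fit_adjusted_model()` is a hypothetical helper, the choice of covariates is illustrative, and the fit only runs if insurance.csv is present in the working directory.

```r
# Hypothetical helper (not part of the original analysis): regress charges on
# smoking status, BMI, their interaction, and two other dataset covariates.
fit_adjusted_model <- function(df) {
  # Make "no" the reference level so the coefficients describe smokers.
  df$smoker <- factor(df$smoker, levels = c("no", "yes"))
  lm(charges ~ smoker * bmi + age + children, data = df)
}

# Fit on the real data only when the CSV is available locally.
if (file.exists("insurance.csv")) {
  insurance <- read.csv("insurance.csv")
  fit <- fit_adjusted_model(insurance)
  print(summary(fit)$coefficients)
}
```

In this parameterization the `smokeryes:bmi` coefficient estimates how much steeper the BMI slope is for smokers than for non-smokers, which is the pattern the scatterplot hinted at.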