The rising cost of healthcare in the United States continues to be a pressing issue for individuals, insurers, and policymakers. One factor that significantly contributes to increased medical expenses is tobacco use, particularly cigarette smoking. Smoking not only poses substantial health risks but also imposes heavy financial burdens on the healthcare system. In this project, we aim to explore how smoking behavior influences medical insurance charges and assess whether data-driven pricing policies.
Our motivation stems from real-world policy implications of the Affordable Care Act (ACA), which allows insurance providers to charge higher premiums to smokers. However, understanding the actual cost differential between smokers and non-smokers, and how this varies across regions, demographics, and insurance types, requires data analysis rooted in empirical evidence.
To conduct this analysis, we combine structured insurance cost data from a Kaggle-provided CSV dataset with real-time smoking prevalence statistics scraped from the CDC’s Tips From Former Smokers campaign website.
Through a combination of exploratory data analysis, statistical testing, predictive modeling, and interactive visualizations, this project offers insights for both public health advocates and insurance policy designers.
if (rstudioapi::isAvailable()) {
current_dir <- dirname(rstudioapi::getActiveDocumentContext()$path)
setwd(current_dir)
cat("Working directory set to:", getwd(), "\n")
} else {
cat("rstudioapi not available. Please set the working directory manually using setwd().\n")
}## rstudioapi not available. Please set the working directory manually using setwd().
## 'data.frame': 1338 obs. of 7 variables:
## $ age : int 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : chr "female" "male" "male" "male" ...
## $ bmi : num 27.9 33.8 33 22.7 28.9 ...
## $ children: int 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : chr "yes" "no" "no" "no" ...
## $ region : chr "southwest" "southeast" "southeast" "northwest" ...
## $ charges : num 16885 1726 4449 21984 3867 ...
## age sex bmi children
## Min. :18.00 Length:1338 Min. :15.96 Min. :0.000
## 1st Qu.:27.00 Class :character 1st Qu.:26.30 1st Qu.:0.000
## Median :39.00 Mode :character Median :30.40 Median :1.000
## Mean :39.21 Mean :30.66 Mean :1.095
## 3rd Qu.:51.00 3rd Qu.:34.69 3rd Qu.:2.000
## Max. :64.00 Max. :53.13 Max. :5.000
## smoker region charges
## Length:1338 Length:1338 Min. : 1122
## Class :character Class :character 1st Qu.: 4740
## Mode :character Mode :character Median : 9382
## Mean :13270
## 3rd Qu.:16640
## Max. :63770
## age sex bmi children smoker region charges
## 0 0 0 0 0 0 0
ggplot(insurance, aes(x = charges)) +
geom_histogram(bins = 30, fill = "blue", alpha = 0.7) +
theme_minimal() +
labs(title = "Distribution of Insurance Charges", x = "Charges", y = "Count")ggplot(insurance, aes(x = smoker, y = charges, fill = smoker)) +
geom_boxplot() +
theme_minimal() +
labs(title = "Insurance Charges by Smoker Status", x = "Smoker", y = "Charges")## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1122 4740 9382 13270 16640 63770
##
## Welch Two Sample t-test
##
## data: charges by smoker
## t = -32.752, df = 311.85, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
## -25034.71 -22197.21
## sample estimates:
## mean in group no mean in group yes
## 8434.268 32050.232
## --- T-test Hypothesis Interpretation ---
## Null Hypothesis (H0): Mean insurance charges for smokers and non-smokers are equal.
## Alternative Hypothesis (H1): Mean insurance charges for smokers and non-smokers differ.
if (t_test$p.value < 0.05) {
cat("Result: p-value =", format(t_test$p.value, digits = 4), "< 0.05, reject H0.\n")
cat("Interpretation: There is significant evidence that insurance charges differ between smokers and non-smokers.\n")
cat("Practical Meaning: Smokers likely incur higher charges, justifying tobacco surcharges and cessation programs.\n")
} else {
cat("Result: p-value =", format(t_test$p.value, digits = 4), ">= 0.05, fail to reject H0.\n")
cat("Interpretation: There is insufficient evidence to conclude that charges differ between smokers and non-smokers.\n")
cat("Practical Meaning: Smoking status may not significantly impact charges in this dataset.\n")
}## Result: p-value = 5.889e-103 < 0.05, reject H0.
## Interpretation: There is significant evidence that insurance charges differ between smokers and non-smokers.
## Practical Meaning: Smokers likely incur higher charges, justifying tobacco surcharges and cessation programs.
## Df Sum Sq Mean Sq F value Pr(>F)
## region 3 1.301e+09 433586560 2.97 0.0309 *
## Residuals 1334 1.948e+11 146007093
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --- ANOVA Hypothesis Interpretation ---
## Null Hypothesis (H0): Mean insurance charges are equal across all regions.
## Alternative Hypothesis (H1): At least one region's mean charges differ.
p_value <- anova_summary[[1]]$`Pr(>F)`[1]
if (p_value < 0.05) {
cat("Result: p-value =", format(p_value, digits = 4), "< 0.05, reject H0.\n")
cat("Interpretation: There is significant evidence that insurance charges differ across regions.\n")
cat("Practical Meaning: Regional differences in charges may reflect variations in healthcare costs or demographics, warranting region-specific policies.\n")
} else {
cat("Result: p-value =", format(p_value, digits = 4), ">= 0.05, fail to reject H0.\n")
cat("Interpretation: There is insufficient evidence to conclude that charges differ across regions.\n")
cat("Practical Meaning: Region may not be a significant factor in insurance charges in this dataset.\n")
}## Result: p-value = 0.03089 < 0.05, reject H0.
## Interpretation: There is significant evidence that insurance charges differ across regions.
## Practical Meaning: Regional differences in charges may reflect variations in healthcare costs or demographics, warranting region-specific policies.
insurance$sex <- as.factor(insurance$sex)
insurance$smoker <- as.factor(insurance$smoker)
insurance$region <- as.factor(insurance$region)
set.seed(123)
trainIndex <- createDataPartition(insurance$charges, p = 0.8, list = FALSE)
train_data <- insurance[trainIndex, ]
test_data <- insurance[-trainIndex, ]rf_model <- randomForest(charges ~ .,
data = train_data,
ntree = 500,
importance = TRUE)
print(rf_model)##
## Call:
## randomForest(formula = charges ~ ., data = train_data, ntree = 500, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 21827112
## % Var explained: 85.34
This script scrapes 2022 smoking prevalence data from tables on the CDC Tips campaign page for Sex, Age Group, Race/Ethnicity, U.S. Census Region, Education (≥25 yrs), and Health Insurance Coverage. Data is saved as a CSV and displayed in Markdown tables showing Group and Percentage. Bar plots visualize smoking prevalence for each category. Use the CSV to merge with HealthCare.gov PUFs for tobacco surcharge or cessation coverage analysis. Debugging and a microdata fallback are included.
# URL for CDC 2022 smoking data
url <- "https://www.cdc.gov/tobacco/campaign/tips/resources/data/cigarette-smoking-in-united-states.html"
# Function to scrape data with error handling
scrape_cdc <- function(url) {
tryCatch({
# Read HTML
webpage <- read_html(url)
# Initialize data frame
data <- data.frame(
Category = character(),
Group = character(),
Percentage = character(),
Population = character(),
Prevalence = numeric(), # For plotting
stringsAsFactors = FALSE
)
# Define requested categories
categories <- c(
"By Sex",
"By Age Group",
"By Race/Ethnicity",
"By U.S. Census Region",
"By Education",
"By Health Insurance Coverage"
)
# Extract all table-responsive divs
divs <- webpage %>% html_nodes("div.table-responsive")
cat("Found ", length(divs), " table-responsive div(s)\n")
if (length(divs) == 0) {
cat("Error: No table-responsive divs found. Verify HTML structure.\n")
return(data)
}
# Process each div
for (i in seq_along(divs)) {
cat("Processing div ", i, "\n")
table <- divs[[i]] %>% html_nodes("table") %>% .[[1]]
table_data <- table %>% html_table(fill = TRUE)
cat("Table ", i, " has ", nrow(table_data), " rows and ", ncol(table_data), " columns\n")
if (nrow(table_data) == 0 || ncol(table_data) < 2) {
cat("Table ", i, " is empty or invalid. Skipping.\n")
next
}
# Clean column names
colnames(table_data) <- str_replace_all(colnames(table_data), "[^[:alnum:]]", "_")
cat("Table ", i, " column names: ", paste(colnames(table_data), collapse = ", "), "\n")
# Extract category from first column name
category_col <- colnames(table_data)[1]
category <- case_when(
grepl("Sex", category_col, ignore.case = TRUE) ~ "By Sex",
grepl("Age_Group", category_col, ignore.case = TRUE) ~ "By Age Group",
grepl("Race_Ethnicity", category_col, ignore.case = TRUE) ~ "By Race/Ethnicity",
grepl("Census_Region", category_col, ignore.case = TRUE) ~ "By U.S. Census Region",
grepl("Education", category_col, ignore.case = TRUE) ~ "By Education",
grepl("Insurance_Coverage", category_col, ignore.case = TRUE) ~ "By Health Insurance Coverage",
TRUE ~ "Other"
)
cat("Table ", i, " mapped to category: ", category, "\n")
# Process only requested categories
if (category %in% categories) {
table_data <- table_data %>%
mutate(
Percentage = .[[2]], # Keep as string with %
Prevalence = as.numeric(str_replace(.[[2]], "%", "")), # Numeric for plotting
Population = "Not reported"
) %>%
rename(Group = 1) %>%
mutate(Category = category) %>%
select(Category, Group, Percentage, Population, Prevalence)
if (nrow(table_data) > 0 && !all(is.na(table_data$Prevalence))) {
data <- bind_rows(data, table_data)
cat("Extracted ", nrow(table_data), " rows for ", category, " from table ", i, "\n")
} else {
cat("No valid data parsed for ", category, " in table ", i, "\n")
}
} else {
cat("Table ", i, " (", category, ") not in requested categories. Skipping.\n")
}
}
if (nrow(data) == 0) {
cat("Warning: No data extracted. Tables may be incorrectly structured or JavaScript-rendered.\n")
} else {
cat("Successfully extracted ", nrow(data), " rows of data.\n")
}
return(data)
}, error = function(e) {
cat("Error: Could not connect to URL or parse data.\n", e$message, "\n")
cat("Try downloading NHIS microdata from: https://www.cdc.gov/nchs/nhis/2022data.htm\n")
return(NULL)
})
}
# Scrape data
data <- scrape_cdc(url)## Found 12 table-responsive div(s)
## Processing div 1
## Table 1 has 2 rows and 2 columns
## Table 1 column names: By_Sex, Percentage
## Table 1 mapped to category: By Sex
## Extracted 2 rows for By Sex from table 1
## Processing div 2
## Table 2 has 4 rows and 2 columns
## Table 2 column names: By_Age_Group__yrs_, Percentage
## Table 2 mapped to category: By Age Group
## Extracted 4 rows for By Age Group from table 2
## Processing div 3
## Table 3 has 5 rows and 2 columns
## Table 3 column names: By_Race_Ethnicity, Percentage
## Table 3 mapped to category: By Race/Ethnicity
## Extracted 5 rows for By Race/Ethnicity from table 3
## Processing div 4
## Table 4 has 4 rows and 2 columns
## Table 4 column names: By_U_S__Census_Region, Percentage
## Table 4 mapped to category: By U.S. Census Region
## Extracted 4 rows for By U.S. Census Region from table 4
## Processing div 5
## Table 5 has 7 rows and 2 columns
## Table 5 column names: By_Education__adults_aged__25_yrs_, Percentage
## Table 5 mapped to category: By Education
## Extracted 7 rows for By Education from table 5
## Processing div 6
## Table 6 has 3 rows and 2 columns
## Table 6 column names: By_Marital_Status, Percentage
## Table 6 mapped to category: Other
## Table 6 ( Other ) not in requested categories. Skipping.
## Processing div 7
## Table 7 has 3 rows and 2 columns
## Table 7 column names: By_Annual_Household_Income, Percentage
## Table 7 mapped to category: Other
## Table 7 ( Other ) not in requested categories. Skipping.
## Processing div 8
## Table 8 has 2 rows and 2 columns
## Table 8 column names: By_Sexual_Orientation, Percentage
## Table 8 mapped to category: By Sex
## Extracted 2 rows for By Sex from table 8
## Processing div 9
## Table 9 has 5 rows and 2 columns
## Table 9 column names: By_Health_Insurance_Coverage, Percentage
## Table 9 mapped to category: By Health Insurance Coverage
## Extracted 5 rows for By Health Insurance Coverage from table 9
## Processing div 10
## Table 10 has 2 rows and 2 columns
## Table 10 column names: Has_a_Disability, Percentage
## Table 10 mapped to category: Other
## Table 10 ( Other ) not in requested categories. Skipping.
## Processing div 11
## Table 11 has 2 rows and 2 columns
## Table 11 column names: Regularly_Having_Feelings_of_Severe_Psychological_Distress____, Percentage
## Table 11 mapped to category: Other
## Table 11 ( Other ) not in requested categories. Skipping.
## Processing div 12
## Table 12 has 2 rows and 2 columns
## Table 12 column names: Were_ever_told_by_a_healthcare_provider_that_they_had_depression_____, Percentage
## Table 12 mapped to category: Other
## Table 12 ( Other ) not in requested categories. Skipping.
## Successfully extracted 29 rows of data.
# Save to CSV if successful
if (!is.null(data) && nrow(data) > 0) {
output_file <- paste0("cdc_smoking_data_", ".csv")
write.csv(data, output_file, row.names = FALSE)
cat("Data saved to", output_file, "\n")
} else {
cat("No data saved due to empty or failed scrape.\n")
}## Data saved to cdc_smoking_data_.csv
Bar plots are provided for each category to visualize smoking prevalence. The By Health Insurance Coverage plot highlights high rates (e.g., Medicaid, Uninsured) relevant for ACA surcharge and cessation program analysis.
if (!is.null(data) && nrow(data) > 0) {
data %>%
filter(Category == "By Sex") %>%
select(Group, Percentage) %>%
knitr::kable(caption = "Smoking Prevalence by Sex (2022)")
} else {
cat("Data unavailable due to scraping error.\n")
}| Group | Percentage |
|---|---|
| Male | 13.1% |
| Female | 10.1% |
| Heterosexual/Straight | 11.4% |
| Lesbian/Gay/Bisexual | 15.3% |
if (!is.null(data) && nrow(data) > 0) {
data %>%
filter(Category == "By Age Group") %>%
select(Group, Percentage) %>%
knitr::kable(caption = "Smoking Prevalence by Age Group (2022)")
} else {
cat("Data unavailable due to scraping error.\n")
}| Group | Percentage |
|---|---|
| 18–24 | 5.3% |
| 25–44 | 12.6% |
| 45–64 | 14.9% |
| ≥65 | 8.3% |
if (!is.null(data) && nrow(data) > 0) {
data %>%
filter(Category == "By Race/Ethnicity") %>%
select(Group, Percentage) %>%
knitr::kable(caption = "Smoking Prevalence by Race/Ethnicity (2022)")
} else {
cat("Data unavailable due to scraping error.\n")
}| Group | Percentage |
|---|---|
| White, non-Hispanic | 12.9% |
| Black, non-Hispanic | 11.7% |
| Asian, non-Hispanic | 5.4% |
| Hispanic | 7.7% |
| Other, non-Hispanic | 14.9% |
if (!is.null(data) && nrow(data) > 0) {
data %>%
filter(Category == "By U.S. Census Region") %>%
select(Group, Percentage) %>%
knitr::kable(caption = "Smoking Prevalence by U.S. Census Region (2022)")
} else {
cat("Data unavailable due to scraping error.\n")
}| Group | Percentage |
|---|---|
| Northeast | 10.4% |
| Midwest | 14.0% |
| South | 12.4% |
| West | 8.9% |
if (!is.null(data) && nrow(data) > 0) {
data %>%
filter(Category == "By Education") %>%
select(Group, Percentage) %>%
knitr::kable(caption = "Smoking Prevalence by Education (Adults ≥25 yrs, 2022)")
} else {
cat("Data unavailable due to scraping error.\n")
}| Group | Percentage |
|---|---|
| 0–12 yrs (no diploma) | 20.1% |
| GED | 30.7% |
| High school diploma | 17.1% |
| Some college, no degree | 16.1% |
| Associate degree (academic or technical/vocational) | 13.7% |
| Undergraduate degree (bachelor’s) | 5.3% |
| Graduate degree (Master’s, doctoral or professional) | 3.2% |
if (!is.null(data) && nrow(data) > 0) {
data %>%
filter(Category == "By Health Insurance Coverage") %>%
select(Group, Percentage) %>%
knitr::kable(caption = "Smoking Prevalence by Health Insurance Coverage (2022)")
} else {
cat("Data unavailable due to scraping error.\n")
}| Group | Percentage |
|---|---|
| Private insurance | 8.6% |
| Medicaid | 21.5% |
| Medicare only (aged ≥65 yrs) | 8.4% |
| Other public insurance | 13.9% |
| Uninsured | 20.0% |
This analysis confirms that smoking status is a major determinant of medical insurance charges, with smokers facing significantly higher costs than non-smokers. Through statistical testing and predictive modeling, we quantified this disparity—smokers incur nearly four times the charges on average compared to non-smokers.
The Random Forest model demonstrated strong predictive accuracy (R² ≈ 0.82), identifying smoking status, age, and BMI as the most influential variables. These insights not only validate existing tobacco surcharge policies but also highlight the potential of integrating behavioral data into pricing models for improved fairness and efficiency.
Additionally, by incorporating real-time CDC data through web scraping, this project showcases the value of public health surveillance in dynamic, evidence-driven insurance modeling.
This project examined the relationship between smoking behavior and insurance charges in the U.S., motivated by policy discussions under the Affordable Care Act (ACA).While the ACA limits tobacco premium surcharges to 50%, this analysis shows that actual medical costs for smokers can be nearly four times higher than for non-smokers. This gap reflects higher healthcare utilization, not just pricing policy—and supports the need for ongoing public health intervention. One of the key drivers behind the rising cost of healthcare in the United States is the increased utilization of medical services by certain high-risk groups—such as smokers. This disparity goes beyond pricing policy and highlights the need for continued public health interventions.
rvest
packagesmoker,
region, and sex for compatibility with
modeling techniques