Beth Gallatin
2025-05-06
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'ggplot2' was built under R version 4.4.3
## Warning: package 'tidyr' was built under R version 4.4.3
## Warning: package 'readr' was built under R version 4.4.3
## Warning: package 'purrr' was built under R version 4.4.3
## Warning: package 'dplyr' was built under R version 4.4.3
## Warning: package 'forcats' was built under R version 4.4.3
## Warning: package 'lubridate' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Warning: package 'tidymodels' was built under R version 4.4.3
## ── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
## ✔ broom 1.0.8 ✔ rsample 1.3.0
## ✔ dials 1.4.0 ✔ tune 1.3.0
## ✔ infer 1.0.8 ✔ workflows 1.2.0
## ✔ modeldata 1.4.0 ✔ workflowsets 1.1.0
## ✔ parsnip 1.3.1 ✔ yardstick 1.3.2
## ✔ recipes 1.3.0
## Warning: package 'broom' was built under R version 4.4.3
## Warning: package 'dials' was built under R version 4.4.3
## Warning: package 'scales' was built under R version 4.4.3
## Warning: package 'infer' was built under R version 4.4.3
## Warning: package 'modeldata' was built under R version 4.4.3
## Warning: package 'parsnip' was built under R version 4.4.3
## Warning: package 'recipes' was built under R version 4.4.3
## Warning: package 'rsample' was built under R version 4.4.3
## Warning: package 'tune' was built under R version 4.4.3
## Warning: package 'workflows' was built under R version 4.4.3
## Warning: package 'workflowsets' was built under R version 4.4.3
## Warning: package 'yardstick' was built under R version 4.4.3
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
cdc_age_groups <- cdc3 %>%
filter(!is.na(genhlth), !is.na(smoke100), !is.na(exerany), !is.na(age)) %>%
mutate(
group = case_when(
genhlth %in% c("excellent", "very good") & smoke100 == "Yes" & exerany == "Yes" ~ "Healthy Smoker (Active)",
genhlth %in% c("excellent", "very good") & smoke100 == "Yes" & exerany == "No" ~ "Healthy Smoker (Inactive)",
TRUE ~ NA_character_
)
) %>%
filter(!is.na(group))-Self-rated health is widly used in public health surveys.
-But Does it truly reflect physical health behavior?
The Paradox of the Healthy Smoker
A Health Mismatch Index
Age-baseed perception differences
Are there individuals who smoke yet report excellent health and high activity levels?
How often do people rate their health highly despite smoking, inactivity, or high BMI?
Are younger people more likely to overestimate their health compared to older people?
Dataset: cdc3.Rdata ( Based on CDC BRFSS)
Variables used:
-genhlth (General Health) -smoke100 (Ever smoked 100+ cigarettes) -exerany (Any exercise in past month) -height,weight (used to calculate BMI) -age, gender
## Rows: 19,997
## Columns: 16
## $ genhlth <fct> good, good, good, good, very good, very good, very good, v…
## $ exerany <fct> No, No, Yes, Yes, No, Yes, Yes, No, No, Yes, Yes, Yes, Yes…
## $ hlthplan <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes…
## $ smoke100 <fct> No, Yes, Yes, No, No, No, No, No, Yes, No, Yes, Yes, Yes, …
## $ height <int> 70, 64, 60, 66, 61, 64, 71, 67, 65, 70, 69, 69, 66, 70, 69…
## $ weight <int> 175, 125, 105, 132, 150, 114, 194, 170, 150, 180, 186, 168…
## $ wtdesire <int> 175, 115, 105, 124, 130, 114, 185, 160, 130, 170, 175, 148…
## $ age <int> 77, 33, 49, 42, 55, 55, 31, 45, 27, 44, 46, 62, 21, 69, 23…
## $ gender <fct> m, f, f, f, f, f, m, m, f, m, m, m, m, m, m, m, m, m, m, f…
## $ BMI <dbl> 25.10714, 21.45386, 20.50417, 21.30303, 28.33916, 19.56592…
## $ BMIDes <dbl> 25.10714, 19.73755, 20.50417, 20.01194, 24.56060, 19.56592…
## $ DesActRatio <dbl> 1.0000000, 0.9200000, 1.0000000, 0.9393939, 0.8666667, 1.0…
## $ BMICat <fct> Overweight, Normal, Normal, Normal, Overweight, Normal, Ov…
## $ BMIDesCat <fct> Overweight, Normal, Normal, Normal, Normal, Normal, Overwe…
## $ ageCat <fct> 58-99, 32-43, 44-57, 32-43, 44-57, 44-57, 18-31, 44-57, 18…
## $ wtplans <fct> Stay the same, Lose, Stay the same, Lose, Lose, Stay the s…
## Warning: package 'gt' was built under R version 4.4.3
# Create the summary table again, in case it's not in memory
health_levels <- table(cdc3$genhlth)
smoking_status <- table(cdc3$smoke100)
exercise_status <- table(cdc3$exerany)
var_summary <- data.frame(
Variable = c("Self-Reported Health", "", "", "", "",
"Smoked 100+ Cigarettes", "",
"Exercised in Past Month", ""),
Category = c("Excellent", "Very Good", "Good", "Fair", "Poor",
"Yes", "No",
"Yes", "No"),
Count = c(health_levels["excellent"], health_levels["very good"], health_levels["good"],
health_levels["fair"], health_levels["poor"],
smoking_status["Yes"], smoking_status["No"],
exercise_status["Yes"], exercise_status["No"])
)
# Format and display the table using gt
var_summary %>%
gt() %>%
tab_header(title = "Summary of Key Health Variables") %>%
cols_label(Count = "Number of People")| Summary of Key Health Variables | ||
| Variable | Category | Number of People |
|---|---|---|
| Self-Reported Health | Excellent | 4657 |
| Very Good | 6970 | |
| Good | 5674 | |
| Fair | 2019 | |
| Poor | 677 | |
| Smoked 100+ Cigarettes | Yes | 9440 |
| No | 10557 | |
| Exercised in Past Month | Yes | 14912 |
| No | 5085 | |
cdc3 %>%
filter(!is.na(genhlth), !is.na(smoke100), !is.na(exerany)) %>%
mutate(
good_health = genhlth %in% c("excellent", "very good"),
is_smoker = smoke100 == "Yes",
is_active = exerany == "Yes"
) %>%
count(good_health, is_smoker, is_active) %>%
arrange(desc(n))## good_health is_smoker is_active n
## 1 TRUE FALSE TRUE 5490
## 2 TRUE TRUE TRUE 4023
## 3 FALSE TRUE TRUE 2874
## 4 FALSE FALSE TRUE 2525
## 5 FALSE TRUE FALSE 1575
## 6 FALSE FALSE FALSE 1396
## 7 TRUE FALSE FALSE 1146
## 8 TRUE TRUE FALSE 968
This table now proves the “Healthy Smoker Paradox” exists:
There are 4,023 people who smoke, report good health, and are physically active.
And even more intriguing:
There are 968 people who smoke, report good health, and are not active — a deeper paradox!
library(dplyr)
library(ggplot2)
library(forcats)
# Create better-labeled groupings
cdc_combo <- cdc3 %>%
filter(!is.na(genhlth), !is.na(smoke100), !is.na(exerany)) %>%
mutate(
good_health = genhlth %in% c("excellent", "very good"),
is_smoker = smoke100 == "Yes",
is_active = exerany == "Yes",
group = case_when(
good_health & is_smoker & is_active ~ "Healthy Smoker (Active)",
good_health & is_smoker & !is_active ~ "Healthy Smoker (Inactive)",
good_health & !is_smoker & is_active ~ "Healthy Non-Smoker (Active)",
good_health & !is_smoker & !is_active ~ "Healthy Non-Smoker (Inactive)",
!good_health & is_smoker & is_active ~ "Unhealthy Smoker (Active)",
!good_health & is_smoker & !is_active ~ "Unhealthy Smoker (Inactive)",
!good_health & !is_smoker & is_active ~ "Unhealthy Non-Smoker (Active)",
!good_health & !is_smoker & !is_active ~ "Unhealthy Non-Smoker (Inactive)"
)
) %>%
count(group)
# Elegant bar plot
ggplot(cdc_combo, aes(x = fct_reorder(group, n), y = n, fill = group)) +
geom_col(width = 0.7, show.legend = FALSE) +
coord_flip() +
labs(
title = "Self-Reported Data",
x = "",
y = "Number of Individuals"
) +
theme_minimal(base_size = 14) +
scale_fill_brewer(palette = "Paired")Are younger smokers more likely to report healthy and active?
library(ggplot2)
cdc_age_groups %>%
group_by(group) %>%
summarize(
mean_age = mean(age),
sd_age = sd(age)
) %>%
ggplot(aes(x = group, y = mean_age, color = group)) +
geom_point(size = 4) +
geom_errorbar(aes(ymin = mean_age - sd_age, ymax = mean_age + sd_age), width = 0.2) +
labs(
title = "Average Age and Variability: Healthy Smokers",
x = "",
y = "Mean Age (± SD)"
) +
theme_minimal(base_size = 14) +
theme(legend.position = "none")The median age of the Healthy Smoker(Active) is slightly lower than that of the Healthy Smoker(Inactive)
This aligns with the psychological and public health studies suggesting that Younger individual often overestimate their resilience despite risk behaviors like smoking.
Are the differences in age meaningful in terms of public health or patterns.
ggplot(cdc_age_groups, aes(x = group, y = age, color = group)) +
geom_jitter(width = 0.2, alpha = 0.25) +
stat_summary(fun = mean, geom = "point", shape = 23, fill = "black", size = 3) +
labs(
title = "Individual Ages of Healthy Smokers",
y = "Age", x = ""
) +
theme_minimal(base_size = 14) +
theme(legend.position = "none")While both groups span a wide range of ages, Healthy Smokers who are Active tend to be younger on average than those who are Inactive. The difference is subtle but visible — both in the mean marker and in the distribution density.
-Active group mean ≈ 43 -Inactive group mean ≈ 48
A mismatch is when someone reports excellent/good heath, but actually does not meet the criteria.
#step1: dprepare the data
cdc_mismatch <- cdc3 %>%
filter(!is.na(genhlth), !is.na(smoke100), !is.na(exerany),
!is.na(height), !is.na(weight), !is.na(age)) %>%
mutate(
bmi = (weight / (height^2)) * 703,
good_health = genhlth %in% c("excellent", "very good"),
is_smoker = smoke100 == "Yes",
is_active = exerany == "Yes",
obese = bmi >= 30,
risk = is_smoker | !is_active | obese,
mismatch = good_health & risk
)#run the model
model <- glm(mismatch ~ age + is_smoker + obese + is_active,
data = cdc_mismatch,
family = "binomial")
summary(model)##
## Call:
## glm(formula = mismatch ~ age + is_smoker + obese + is_active,
## family = "binomial", data = cdc_mismatch)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.7310549 0.0573546 -12.75 <2e-16 ***
## age -0.0139920 0.0009836 -14.22 <2e-16 ***
## is_smokerTRUE 1.7366041 0.0338451 51.31 <2e-16 ***
## obeseTRUE 0.4610672 0.0403939 11.41 <2e-16 ***
## is_activeTRUE -0.4351293 0.0368582 -11.81 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 25709 on 19996 degrees of freedom
## Residual deviance: 22427 on 19992 degrees of freedom
## AIC: 22437
##
## Number of Fisher Scoring iterations: 3
#Visualizing the Inactive Plot
library(ggplot2)
# Simplified 2-panel setup: obesity groups
new_data2 <- expand.grid(
age = seq(18, 80, by = 1),
is_smoker = c(TRUE, FALSE),
obese = c(TRUE, FALSE),
is_active = FALSE # Hold fixed for now
)
new_data2$predicted <- predict(model, newdata = new_data2, type = "response")
ggplot(new_data2, aes(x = age, y = predicted, color = is_smoker)) +
geom_line(size = 1.2) +
facet_wrap(~ obese, labeller = labeller(obese = c(`TRUE` = "Obese", `FALSE` = "Not Obese"))) +
scale_color_manual(values = c("firebrick", "steelblue"),
labels = c("Smoker", "Non-Smoker")) +
labs(
title = "Pred. Health Mismatch by Age and Smoking",
subtitle = "Separated by Obesity Status (Inactive Individuals)",
x = "Age", y = "Probability of Mismatch", color = "Smoking Status"
) +
theme_minimal(base_size = 14)## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Simplified 2-panel setup: obesity groups for Active individuals
new_data_active <- expand.grid(
age = seq(18, 80, by = 1),
is_smoker = c(TRUE, FALSE),
obese = c(TRUE, FALSE),
is_active = TRUE # Change here
)
new_data_active$predicted <- predict(model, newdata = new_data_active, type = "response")
ggplot(new_data_active, aes(x = age, y = predicted, color = is_smoker)) +
geom_line(size = 1.2) +
facet_wrap(~ obese, labeller = labeller(obese = c(`TRUE` = "Obese", `FALSE` = "Not Obese"))) +
scale_color_manual(values = c("firebrick", "steelblue"),
labels = c("Smoker", "Non-Smoker")) +
labs(
title = "Pred. Health Mismatch by Age and Smoking",
subtitle = "Separated by Obesity Status (Active Individuals)",
x = "Age", y = "Probability of Mismatch", color = "Smoking Status"
) +
theme_minimal(base_size = 14)
# Results
Mismatch is most likely among: -Younger Individuals - Physically Inactive - Obese - Non Smokers (Surprisingly!)
Smokers Consistently have lower prediccted mismatch Physical activity lowers mismatch probabilities across all groups obesity sharply increases predicted mismatch
Predicted mismatch is highest among younger, inactive,obese, non-smokers- suggesting that percieved health doent always align with risk Smokers may be more self-critical while physically active individuals benefit from more accurate perceptions.
Perceived health (self-reported as “excellent” or “very good”) does not always align with objective health risks such as smoking, obesity, or inactivity.
We identified and visualized a distinct group of “healthy smokers” — individuals who smoke yet rate their health highly and often maintain high activity levels.
Through age-based analysis, we found that younger individuals, particularly active smokers, are more likely to report good health despite risk behaviors — a phenomenon we describe as the Healthy Smoker Paradox.
Using logistic regression, we modeled the likelihood of mismatch between health perception and behavior. We found that:
Smokers are actually less likely to mismatch — potentially due to a more cautious self-assessment.
Obesity and inactivity are strong predictors of mismatch.
Younger age increases the risk of overestimating one’s health.
Perceived health is not simply a function of behavior or risk — it is shaped by age, lifestyle, and perhaps optimism bias.
Health promotion campaigns may benefit from targeted education for younger adults who underestimate their risk.
Self-reported health data, while convenient, must be interpreted with caution — especially in behavioral research.
This work highlights the importance of aligning subjective wellness with objective indicators, particularly in early intervention efforts.