Perception vs. Reality in Public Health

Beth Gallatin

2025-05-06

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'ggplot2' was built under R version 4.4.3
## Warning: package 'tidyr' was built under R version 4.4.3
## Warning: package 'readr' was built under R version 4.4.3
## Warning: package 'purrr' was built under R version 4.4.3
## Warning: package 'dplyr' was built under R version 4.4.3
## Warning: package 'forcats' was built under R version 4.4.3
## Warning: package 'lubridate' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## Warning: package 'tidymodels' was built under R version 4.4.3
## ── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
## ✔ broom        1.0.8     ✔ rsample      1.3.0
## ✔ dials        1.4.0     ✔ tune         1.3.0
## ✔ infer        1.0.8     ✔ workflows    1.2.0
## ✔ modeldata    1.4.0     ✔ workflowsets 1.1.0
## ✔ parsnip      1.3.1     ✔ yardstick    1.3.2
## ✔ recipes      1.3.0
## Warning: package 'broom' was built under R version 4.4.3
## Warning: package 'dials' was built under R version 4.4.3
## Warning: package 'scales' was built under R version 4.4.3
## Warning: package 'infer' was built under R version 4.4.3
## Warning: package 'modeldata' was built under R version 4.4.3
## Warning: package 'parsnip' was built under R version 4.4.3
## Warning: package 'recipes' was built under R version 4.4.3
## Warning: package 'rsample' was built under R version 4.4.3
## Warning: package 'tune' was built under R version 4.4.3
## Warning: package 'workflows' was built under R version 4.4.3
## Warning: package 'workflowsets' was built under R version 4.4.3
## Warning: package 'yardstick' was built under R version 4.4.3
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
load("C:/Users/pearl/Downloads/cdc3.Rdata")
cdc_age_groups <- cdc3 %>%
  filter(!is.na(genhlth), !is.na(smoke100), !is.na(exerany), !is.na(age)) %>%
  mutate(
    group = case_when(
      genhlth %in% c("excellent", "very good") & smoke100 == "Yes" & exerany == "Yes" ~ "Healthy Smoker (Active)",
      genhlth %in% c("excellent", "very good") & smoke100 == "Yes" & exerany == "No"  ~ "Healthy Smoker (Inactive)",
      TRUE ~ NA_character_
    )
  ) %>%
  filter(!is.na(group))

Perception vs. Reality in Public Health:

-Self-rated health is widly used in public health surveys.

-But Does it truly reflect physical health behavior?

Research Questions

Data Source

Dataset: cdc3.Rdata ( Based on CDC BRFSS)

Variables used:

-genhlth (General Health) -smoke100 (Ever smoked 100+ cigarettes) -exerany (Any exercise in past month) -height,weight (used to calculate BMI) -age, gender

glimpse(cdc3)
## Rows: 19,997
## Columns: 16
## $ genhlth     <fct> good, good, good, good, very good, very good, very good, v…
## $ exerany     <fct> No, No, Yes, Yes, No, Yes, Yes, No, No, Yes, Yes, Yes, Yes…
## $ hlthplan    <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes…
## $ smoke100    <fct> No, Yes, Yes, No, No, No, No, No, Yes, No, Yes, Yes, Yes, …
## $ height      <int> 70, 64, 60, 66, 61, 64, 71, 67, 65, 70, 69, 69, 66, 70, 69…
## $ weight      <int> 175, 125, 105, 132, 150, 114, 194, 170, 150, 180, 186, 168…
## $ wtdesire    <int> 175, 115, 105, 124, 130, 114, 185, 160, 130, 170, 175, 148…
## $ age         <int> 77, 33, 49, 42, 55, 55, 31, 45, 27, 44, 46, 62, 21, 69, 23…
## $ gender      <fct> m, f, f, f, f, f, m, m, f, m, m, m, m, m, m, m, m, m, m, f…
## $ BMI         <dbl> 25.10714, 21.45386, 20.50417, 21.30303, 28.33916, 19.56592…
## $ BMIDes      <dbl> 25.10714, 19.73755, 20.50417, 20.01194, 24.56060, 19.56592…
## $ DesActRatio <dbl> 1.0000000, 0.9200000, 1.0000000, 0.9393939, 0.8666667, 1.0…
## $ BMICat      <fct> Overweight, Normal, Normal, Normal, Overweight, Normal, Ov…
## $ BMIDesCat   <fct> Overweight, Normal, Normal, Normal, Normal, Normal, Overwe…
## $ ageCat      <fct> 58-99, 32-43, 44-57, 32-43, 44-57, 44-57, 18-31, 44-57, 18…
## $ wtplans     <fct> Stay the same, Lose, Stay the same, Lose, Lose, Stay the s…
#install.packages("gt")

Variable Validations

##Summary Table

library(gt)
## Warning: package 'gt' was built under R version 4.4.3
# Create the summary table again, in case it's not in memory
health_levels <- table(cdc3$genhlth)
smoking_status <- table(cdc3$smoke100)
exercise_status <- table(cdc3$exerany)

var_summary <- data.frame(
  Variable = c("Self-Reported Health", "", "", "", "", 
               "Smoked 100+ Cigarettes", "",
               "Exercised in Past Month", ""),
  Category = c("Excellent", "Very Good", "Good", "Fair", "Poor",
               "Yes", "No",
               "Yes", "No"),
  Count = c(health_levels["excellent"], health_levels["very good"], health_levels["good"], 
            health_levels["fair"], health_levels["poor"],
            smoking_status["Yes"], smoking_status["No"],
            exercise_status["Yes"], exercise_status["No"])
)

# Format and display the table using gt
var_summary %>%
  gt() %>%
  tab_header(title = "Summary of Key Health Variables") %>%
  cols_label(Count = "Number of People")
Summary of Key Health Variables
Variable Category Number of People
Self-Reported Health Excellent 4657
Very Good 6970
Good 5674
Fair 2019
Poor 677
Smoked 100+ Cigarettes Yes 9440
No 10557
Exercised in Past Month Yes 14912
No 5085

Exploring Intersectionality: The Healthy Smoker Paradox

cdc3 %>%
  filter(!is.na(genhlth), !is.na(smoke100), !is.na(exerany)) %>%
  mutate(
    good_health = genhlth %in% c("excellent", "very good"),
    is_smoker = smoke100 == "Yes",
    is_active = exerany == "Yes"
  ) %>%
  count(good_health, is_smoker, is_active) %>%
  arrange(desc(n))
##   good_health is_smoker is_active    n
## 1        TRUE     FALSE      TRUE 5490
## 2        TRUE      TRUE      TRUE 4023
## 3       FALSE      TRUE      TRUE 2874
## 4       FALSE     FALSE      TRUE 2525
## 5       FALSE      TRUE     FALSE 1575
## 6       FALSE     FALSE     FALSE 1396
## 7        TRUE     FALSE     FALSE 1146
## 8        TRUE      TRUE     FALSE  968

This table now proves the “Healthy Smoker Paradox” exists:

There are 4,023 people who smoke, report good health, and are physically active.

And even more intriguing:

There are 968 people who smoke, report good health, and are not active — a deeper paradox!

Visualizing The Data

library(dplyr)
library(ggplot2)
library(forcats)

# Create better-labeled groupings
cdc_combo <- cdc3 %>%
  filter(!is.na(genhlth), !is.na(smoke100), !is.na(exerany)) %>%
  mutate(
    good_health = genhlth %in% c("excellent", "very good"),
    is_smoker = smoke100 == "Yes",
    is_active = exerany == "Yes",
    group = case_when(
      good_health & is_smoker & is_active ~ "Healthy Smoker (Active)",
      good_health & is_smoker & !is_active ~ "Healthy Smoker (Inactive)",
      good_health & !is_smoker & is_active ~ "Healthy Non-Smoker (Active)",
      good_health & !is_smoker & !is_active ~ "Healthy Non-Smoker (Inactive)",
      !good_health & is_smoker & is_active ~ "Unhealthy Smoker (Active)",
      !good_health & is_smoker & !is_active ~ "Unhealthy Smoker (Inactive)",
      !good_health & !is_smoker & is_active ~ "Unhealthy Non-Smoker (Active)",
      !good_health & !is_smoker & !is_active ~ "Unhealthy Non-Smoker (Inactive)"
    )
  ) %>%
  count(group)

# Elegant bar plot
ggplot(cdc_combo, aes(x = fct_reorder(group, n), y = n, fill = group)) +
  geom_col(width = 0.7, show.legend = FALSE) +
  coord_flip() +
  labs(
    title = "Self-Reported Data",
    x = "",
    y = "Number of Individuals"
  ) +
  theme_minimal(base_size = 14) +
  scale_fill_brewer(palette = "Paired")

Comparison of Age Between Groups

Are younger smokers more likely to report healthy and active?

library(ggplot2)

cdc_age_groups %>%
  group_by(group) %>%
  summarize(
    mean_age = mean(age),
    sd_age = sd(age)
  ) %>%
  ggplot(aes(x = group, y = mean_age, color = group)) +
  geom_point(size = 4) +
  geom_errorbar(aes(ymin = mean_age - sd_age, ymax = mean_age + sd_age), width = 0.2) +
  labs(
    title = "Average Age and Variability: Healthy Smokers",
    x = "",
    y = "Mean Age (± SD)"
  ) +
  theme_minimal(base_size = 14) +
  theme(legend.position = "none")

The median age of the Healthy Smoker(Active) is slightly lower than that of the Healthy Smoker(Inactive)

This aligns with the psychological and public health studies suggesting that Younger individual often overestimate their resilience despite risk behaviors like smoking.

Demographic Patterns

Are the differences in age meaningful in terms of public health or patterns.

ggplot(cdc_age_groups, aes(x = group, y = age, color = group)) +
  geom_jitter(width = 0.2, alpha = 0.25) +
  stat_summary(fun = mean, geom = "point", shape = 23, fill = "black", size = 3) +
  labs(
    title = "Individual Ages of Healthy Smokers",
    y = "Age", x = ""
  ) +
  theme_minimal(base_size = 14) +
  theme(legend.position = "none")

While both groups span a wide range of ages, Healthy Smokers who are Active tend to be younger on average than those who are Inactive. The difference is subtle but visible — both in the mean marker and in the distribution density.

-Active group mean ≈ 43 -Inactive group mean ≈ 48

Modeling the Likelihood of “Health Mismatch”

A mismatch is when someone reports excellent/good heath, but actually does not meet the criteria.

#step1: dprepare the data
cdc_mismatch <- cdc3 %>%
  filter(!is.na(genhlth), !is.na(smoke100), !is.na(exerany),
         !is.na(height), !is.na(weight), !is.na(age)) %>%
  mutate(
    bmi = (weight / (height^2)) * 703,
    good_health = genhlth %in% c("excellent", "very good"),
    is_smoker = smoke100 == "Yes",
    is_active = exerany == "Yes",
    obese = bmi >= 30,
    risk = is_smoker | !is_active | obese,
    mismatch = good_health & risk
  )
#run the model
model <- glm(mismatch ~ age + is_smoker + obese + is_active,
             data = cdc_mismatch,
             family = "binomial")
summary(model)
## 
## Call:
## glm(formula = mismatch ~ age + is_smoker + obese + is_active, 
##     family = "binomial", data = cdc_mismatch)
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -0.7310549  0.0573546  -12.75   <2e-16 ***
## age           -0.0139920  0.0009836  -14.22   <2e-16 ***
## is_smokerTRUE  1.7366041  0.0338451   51.31   <2e-16 ***
## obeseTRUE      0.4610672  0.0403939   11.41   <2e-16 ***
## is_activeTRUE -0.4351293  0.0368582  -11.81   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 25709  on 19996  degrees of freedom
## Residual deviance: 22427  on 19992  degrees of freedom
## AIC: 22437
## 
## Number of Fisher Scoring iterations: 3

Visualizing the Data: Inactive Individuals

#Visualizing the Inactive Plot
library(ggplot2)

# Simplified 2-panel setup: obesity groups
new_data2 <- expand.grid(
  age = seq(18, 80, by = 1),
  is_smoker = c(TRUE, FALSE),
  obese = c(TRUE, FALSE),
  is_active = FALSE   # Hold fixed for now
)

new_data2$predicted <- predict(model, newdata = new_data2, type = "response")

ggplot(new_data2, aes(x = age, y = predicted, color = is_smoker)) +
  geom_line(size = 1.2) +
  facet_wrap(~ obese, labeller = labeller(obese = c(`TRUE` = "Obese", `FALSE` = "Not Obese"))) +
  scale_color_manual(values = c("firebrick", "steelblue"),
                     labels = c("Smoker", "Non-Smoker")) +
  labs(
    title = "Pred. Health Mismatch by Age and Smoking",
    subtitle = "Separated by Obesity Status (Inactive Individuals)",
    x = "Age", y = "Probability of Mismatch", color = "Smoking Status"
  ) +
  theme_minimal(base_size = 14)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Visualizing the Data: Active Individuals

# Simplified 2-panel setup: obesity groups for Active individuals
new_data_active <- expand.grid(
  age = seq(18, 80, by = 1),
  is_smoker = c(TRUE, FALSE),
  obese = c(TRUE, FALSE),
  is_active = TRUE   # Change here
)

new_data_active$predicted <- predict(model, newdata = new_data_active, type = "response")

ggplot(new_data_active, aes(x = age, y = predicted, color = is_smoker)) +
  geom_line(size = 1.2) +
  facet_wrap(~ obese, labeller = labeller(obese = c(`TRUE` = "Obese", `FALSE` = "Not Obese"))) +
  scale_color_manual(values = c("firebrick", "steelblue"),
                     labels = c("Smoker", "Non-Smoker")) +
  labs(
    title = "Pred. Health Mismatch by Age and Smoking",
    subtitle = "Separated by Obesity Status (Active Individuals)",
    x = "Age", y = "Probability of Mismatch", color = "Smoking Status"
  ) +
  theme_minimal(base_size = 14)

# Results

Mismatch is most likely among: -Younger Individuals - Physically Inactive - Obese - Non Smokers (Surprisingly!)

Smokers Consistently have lower prediccted mismatch Physical activity lowers mismatch probabilities across all groups obesity sharply increases predicted mismatch

Predicted mismatch is highest among younger, inactive,obese, non-smokers- suggesting that percieved health doent always align with risk Smokers may be more self-critical while physically active individuals benefit from more accurate perceptions.

Summary of Findings

Perceived health (self-reported as “excellent” or “very good”) does not always align with objective health risks such as smoking, obesity, or inactivity.

We identified and visualized a distinct group of “healthy smokers” — individuals who smoke yet rate their health highly and often maintain high activity levels.

Through age-based analysis, we found that younger individuals, particularly active smokers, are more likely to report good health despite risk behaviors — a phenomenon we describe as the Healthy Smoker Paradox.

Using logistic regression, we modeled the likelihood of mismatch between health perception and behavior. We found that:

Smokers are actually less likely to mismatch — potentially due to a more cautious self-assessment.

Obesity and inactivity are strong predictors of mismatch.

Younger age increases the risk of overestimating one’s health.

Key Insights and Broader Implications

Perceived health is not simply a function of behavior or risk — it is shaped by age, lifestyle, and perhaps optimism bias.

Health promotion campaigns may benefit from targeted education for younger adults who underestimate their risk.

Self-reported health data, while convenient, must be interpreted with caution — especially in behavioral research.

This work highlights the importance of aligning subjective wellness with objective indicators, particularly in early intervention efforts.