County Wellness Indicators — Data Cleaning and Analysis

Author

WILSON

Published

June 17, 2026

Task 1: Research Question and Dataset Orientation

Research Question

Is frequent mental distress associated with higher smoking rates, and how does this pattern vary across states?

Unit of Observation

Each row in county_wellness_indicators_long.csv is one county–measure–year estimate. For example, a single row records that a county in Alaska had an estimated frequent mental distress prevalence of 16.2% in 2023, along with 95% confidence interval bounds.

Dataset Overview

The dataset covers 520 counties across all 50 U.S. states for the year 2023. Each county has exactly 7 health/wellness estimates — one per measure — yielding 3,640 rows total.

Main Variables Supporting the Research Question

Variable Type Role
measure_key character Used to filter rows to current_smoking and frequent_mental_distress
estimate numeric County-level prevalence (%) for the selected measure
state / state_name character State identifier; used to compare patterns across states
county_fips character Unique county identifier; used when pivoting to wide format
ci_lower / ci_upper numeric 95% confidence interval bounds; support uncertainty assessment

The two focal measures are:

  • Frequent mental distress (MHLTH / frequent_mental_distress): estimated percentage of adults reporting 14 or more mentally unhealthy days in the past 30 days.
  • Current smoking (CSMOKING / current_smoking): estimated percentage of adults who currently smoke cigarettes.

Both are county-level prevalence percentages for 2023.


Task 2: Data Cleaning

Load Raw Data

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.1     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.3     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
raw <- read_csv("data/county_wellness_indicators_long.csv", show_col_types = FALSE)

dim(raw)
[1] 3640   14
glimpse(raw)
Rows: 3,640
Columns: 14
$ year                   <dbl> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023,…
$ state                  <chr> "AK", "AK", "AK", "AK", "AK", "AK", "AK", "AK",…
$ state_name             <chr> "Alaska", "Alaska", "Alaska", "Alaska", "Alaska…
$ county_fips            <chr> "02172", "02172", "02172", "02172", "02172", "0…
$ county_name            <chr> "County 172", "County 172", "County 172", "Coun…
$ tract_count            <dbl> 46, 46, 46, 46, 46, 46, 46, 30, 30, 30, 30, 30,…
$ total_population       <dbl> 34547, 34547, 34547, 34547, 34547, 34547, 34547…
$ total_adult_population <dbl> 26582, 26582, 26582, 26582, 26582, 26582, 26582…
$ measure_id             <chr> "CSMOKING", "DIABETES", "MHLTH", "BPHIGH", "ACC…
$ measure_key            <chr> "current_smoking", "diabetes", "frequent_mental…
$ measure                <chr> "Current smoking", "Diabetes", "Frequent mental…
$ estimate               <dbl> 19.5, 12.5, 16.2, 30.6, 14.7, 35.2, 25.5, 13.3,…
$ ci_lower               <dbl> 13.1, 7.5, 10.5, 27.5, 11.4, 29.5, 22.0, 10.1, …
$ ci_upper               <dbl> 25.9, 17.5, 22.0, 33.6, 18.0, 40.9, 29.0, 16.4,…

Before: Raw Data Snapshot

head(raw, 5)
# A tibble: 5 × 14
   year state state_name county_fips county_name tract_count total_population
  <dbl> <chr> <chr>      <chr>       <chr>             <dbl>            <dbl>
1  2023 AK    Alaska     02172       County 172           46            34547
2  2023 AK    Alaska     02172       County 172           46            34547
3  2023 AK    Alaska     02172       County 172           46            34547
4  2023 AK    Alaska     02172       County 172           46            34547
5  2023 AK    Alaska     02172       County 172           46            34547
# ℹ 7 more variables: total_adult_population <dbl>, measure_id <chr>,
#   measure_key <chr>, measure <chr>, estimate <dbl>, ci_lower <dbl>,
#   ci_upper <dbl>

Audit

Missing values

raw |>
  summarise(across(everything(), ~sum(is.na(.)))) |>
  pivot_longer(everything(), names_to = "column", values_to = "n_missing") |>
  filter(n_missing > 0)
# A tibble: 0 × 2
# ℹ 2 variables: column <chr>, n_missing <int>

No missing values found.

Duplicate rows

sum(duplicated(raw))
[1] 0

No duplicates.

Measure completeness

Every county should have exactly 7 measures.

raw |>
  count(county_fips, year, name = "n_measures") |>
  count(n_measures, name = "n_counties")
# A tibble: 1 × 2
  n_measures n_counties
       <int>      <int>
1          7        520

All 520 counties have all 7 measures.

Numeric ranges

Estimates and confidence intervals are percentages; all values should be in [0, 100].

raw |>
  summarise(
    across(c(estimate, ci_lower, ci_upper),
           list(min = min, max = max),
           .names = "{.col}_{.fn}")
  ) |>
  pivot_longer(everything(), names_to = "stat", values_to = "value")
# A tibble: 6 × 2
  stat         value
  <chr>        <dbl>
1 estimate_min   1  
2 estimate_max  45.7
3 ci_lower_min   0  
4 ci_lower_max  41.7
5 ci_upper_min   3.1
6 ci_upper_max  52.9

Confidence interval ordering

ci_lowerestimateci_upper for all rows.

raw |>
  filter(ci_lower > estimate | ci_upper < estimate) |>
  nrow()
[1] 0

No violations.

Column types

year is read as dbl; it should be integer.

class(raw$year)
[1] "numeric"

Cleaning Step

The raw data is structurally sound. The only transformation needed is converting year to integer.

clean <- raw |>
  mutate(year = as.integer(year))

glimpse(clean)
Rows: 3,640
Columns: 14
$ year                   <int> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023,…
$ state                  <chr> "AK", "AK", "AK", "AK", "AK", "AK", "AK", "AK",…
$ state_name             <chr> "Alaska", "Alaska", "Alaska", "Alaska", "Alaska…
$ county_fips            <chr> "02172", "02172", "02172", "02172", "02172", "0…
$ county_name            <chr> "County 172", "County 172", "County 172", "Coun…
$ tract_count            <dbl> 46, 46, 46, 46, 46, 46, 46, 30, 30, 30, 30, 30,…
$ total_population       <dbl> 34547, 34547, 34547, 34547, 34547, 34547, 34547…
$ total_adult_population <dbl> 26582, 26582, 26582, 26582, 26582, 26582, 26582…
$ measure_id             <chr> "CSMOKING", "DIABETES", "MHLTH", "BPHIGH", "ACC…
$ measure_key            <chr> "current_smoking", "diabetes", "frequent_mental…
$ measure                <chr> "Current smoking", "Diabetes", "Frequent mental…
$ estimate               <dbl> 19.5, 12.5, 16.2, 30.6, 14.7, 35.2, 25.5, 13.3,…
$ ci_lower               <dbl> 13.1, 7.5, 10.5, 27.5, 11.4, 29.5, 22.0, 10.1, …
$ ci_upper               <dbl> 25.9, 17.5, 22.0, 33.6, 18.0, 40.9, 29.0, 16.4,…

After: Verification Checks

Year is now integer

class(clean$year)
[1] "integer"

Row count is unchanged (before vs. after)

cat("Before:", nrow(raw), "rows\n")
Before: 3640 rows
cat("After: ", nrow(clean), "rows\n")
After:  3640 rows
cat("Match: ", nrow(raw) == nrow(clean), "\n")
Match:  TRUE 

No new missing values introduced

clean |>
  summarise(across(everything(), ~sum(is.na(.)))) |>
  pivot_longer(everything(), names_to = "column", values_to = "n_missing") |>
  filter(n_missing > 0)
# A tibble: 0 × 2
# ℹ 2 variables: column <chr>, n_missing <int>

Summary of Changes

Issue Found Cleaning Action
year stored as dbl Converted to integer with as.integer()

No rows were dropped; no values were imputed or altered.

Save Clean Data

write_csv(clean, "data/clean_data.WILSON.PS1.csv")
cat("Saved: data/clean_data.WILSON.PS1.csv —", nrow(clean), "rows x", ncol(clean), "cols\n")
Saved: data/clean_data.WILSON.PS1.csv — 3640 rows x 14 cols

Task 3: Visualization Set

The three visualizations below all address the research question: Is frequent mental distress associated with higher smoking rates, and how does this pattern vary across states?

# Pivot to wide format for county-level comparisons
wide <- clean |>
  select(county_fips, county_name, state, state_name,
         total_population, measure_key, estimate) |>
  pivot_wider(names_from = measure_key, values_from = estimate)

Visualization 1 — Distribution: Smoking Rates by Mental Distress Quartile (Grouped Comparison)

Counties are split into four equal groups by their frequent mental distress rate. This shows how the distribution of smoking rates shifts across those groups.

wide |>
  mutate(
    mh_quartile = ntile(frequent_mental_distress, 4),
    mh_group = case_when(
      mh_quartile == 1 ~ "Q1 (Lowest Distress)",
      mh_quartile == 2 ~ "Q2",
      mh_quartile == 3 ~ "Q3",
      mh_quartile == 4 ~ "Q4 (Highest Distress)"
    ),
    mh_group = factor(mh_group,
      levels = c("Q1 (Lowest Distress)", "Q2", "Q3", "Q4 (Highest Distress)"))
  ) |>
  ggplot(aes(x = current_smoking, fill = mh_group, color = mh_group)) +
  geom_density(alpha = 0.25, linewidth = 0.8) +
  labs(
    title = "County Smoking Rates by Mental Distress Quartile",
    x = "Current Smoking Prevalence (%)",
    y = "Density",
    fill = "Mental Distress\nQuartile",
    color = "Mental Distress\nQuartile",
    caption = "Source: County Wellness Indicators, 2023. n = 520 counties."
  ) +
  theme_minimal()

Distribution of county-level smoking prevalence grouped by mental distress quartile. Counties with higher mental distress consistently show higher smoking rates. Each group contains 130 counties. Source: County Wellness Indicators, 2023.

Takeaway: The Q4 distribution (highest mental distress) is shifted markedly to the right compared to Q1, indicating that counties with more frequent mental distress also have higher smoking rates.


Visualization 2 — Trend: Frequent Mental Distress vs. Current Smoking (Scatter)

Each point is one county. This shows the overall association between the two measures.

ggplot(wide, aes(x = frequent_mental_distress, y = current_smoking)) +
  geom_point(alpha = 0.35, size = 1.3) +
  geom_smooth(method = "lm", se = TRUE, color = "purple4", linewidth = 0.9) +
  labs(
    title = "Frequent Mental Distress vs. Current Smoking Across U.S. Counties",
    x = "Frequent Mental Distress Prevalence (%)",
    y = "Current Smoking Prevalence (%)",
    caption = "Source: County Wellness Indicators, 2023. n = 520 counties."
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

Each point represents one county. Counties with higher frequent mental distress rates tend to have higher current smoking rates. The regression line summarizes the overall positive trend. Source: County Wellness Indicators, 2023.

Takeaway: There is a consistent positive linear trend (r ≈ 0.66) across the full range of both measures. No obvious curvature or clusters break the pattern.


Visualization 3 — Ranking: States by Median Mental Distress and Smoking (Grouped Comparison)

State medians for both measures are computed and displayed together. This shows whether the geographic pattern is consistent across the two measures.

state_rq3 <- clean |>
  filter(measure_key %in% c("frequent_mental_distress", "current_smoking")) |>
  group_by(state_name, measure) |>
  summarise(median_pct = median(estimate), .groups = "drop")

# Order states by mental distress median
state_order <- state_rq3 |>
  filter(measure == "Frequent mental distress") |>
  arrange(median_pct) |>
  pull(state_name)

state_rq3 |>
  mutate(state_name = factor(state_name, levels = state_order)) |>
  ggplot(aes(x = median_pct, y = state_name, color = measure, shape = measure)) +
  geom_point(size = 2.2) +
  labs(
    title = "State Medians: Frequent Mental Distress and Current Smoking",
    subtitle = "States ranked by mental distress rate (low to high)",
    x = "Median County Prevalence (%)",
    y = NULL,
    color = "Measure",
    shape = "Measure",
    caption = "Source: County Wellness Indicators, 2023. One point per state per measure."
  ) +
  theme_minimal() +
  theme(legend.position = "top")

State median county-level prevalence for frequent mental distress and current smoking, ranked by mental distress. States near the top of the mental distress ranking tend to also appear near the top for smoking. Source: County Wellness Indicators, 2023.

Takeaway: States that rank high on mental distress (e.g., Appalachian and southern states) largely also rank high on smoking, while states with low mental distress rates tend to have lower smoking rates. The geographic pattern is consistent across both measures.