Division of Work

Member Student ID Main Assigned Section
Chen Miaojun 25072895 Dataset Description; Data Structure and Quality; Data Cleaning and Question Selection; Dashboard / App Design
Wang Peiyuan 25089856 Exploratory Data Analysis; Location and Demographic Comparison; Aligned Relationship and Robustness Checks
Huang Shengqi 24080731 Regression Model; Regression Diagnostics / Interpretation; Adjustment Ladder; statistical limitation review
Wang Hexiao 25089268 Classification Model: High-Obesity Subgroup Profile Screening; Risk Ranking Validation; Public Health Intervention Matrix; Threshold Sensitivity and Robust Core
All members N/A Discussion / Limitations / Conclusion; final review; presentation preparation

Abstract

Obesity and physical inactivity are important public health concerns because they are closely related to long-term population health, healthcare burden, and preventive health planning. This project analyses the Nutrition, Physical Activity, and Obesity public health surveillance dataset in the United States. The dataset covers 2011 to 2021 and contains 88,629 records with 33 variables, giving a total dataset dimension of more than 2.9 million cells.

The analysis focuses on two selected indicators: adult obesity and no leisure-time physical activity. Since the dataset is aggregated, each row represents a public health profile rather than an individual person. The project therefore studies obesity and physical inactivity patterns across years, locations, and demographic groups instead of predicting individual medical outcomes. A distinction is made between broad descriptive views, which may use all selected surveillance records, and modelling-related subgroup analysis, which excludes Total or Overall aggregate rows to keep the unit of analysis consistent.

The overall aim is to build an interpretable public health decision-support workflow. The report first prepares clean indicator-specific datasets, then supports exploratory analysis, aligned obesity-inactivity comparison, regression modelling, classification modelling, and public health intervention priority analysis. The workflow deliberately separates total-stratification summaries from demographic subgroup profiles so that exploratory patterns, regression explanation, classification screening, and intervention priorities are interpreted at the correct level. All findings should be interpreted as group-level public health insights rather than individual medical predictions or causal effects.

The main findings show that adult obesity increased by about six percentage points over the study period. Obesity and physical inactivity have a positive subgroup-level aligned correlation of about 0.46, but the within-profile demeaned correlation is only about 0.07, suggesting that much of the relationship is driven by stable differences between locations and demographic profiles. The regression model explains about three-quarters of test-set obesity variation, while the classification model captures about two-thirds of true high-obesity subgroup profiles within the top 20% highest-ranked profiles.

Executive Decision Summary

  • Adult obesity increased clearly from 2011 to 2021, and the rise remains visible when the analysis focuses on total-stratification records.
  • Obesity is unevenly distributed across locations and demographic subgroup profiles, so public health monitoring should not rely only on broad averages.
  • Physical inactivity is positively associated with adult obesity at the subgroup profile level, but the relationship weakens strongly after within-profile adjustment. This supports a cautious interpretation: inactivity is useful for profiling and prioritisation, but the analysis does not prove a causal effect.
  • The regression model provides strong profile-level explanation within the observed surveillance panel, mainly because year, location, and demographic group capture a large share of obesity variation.
  • The classification and intervention matrix translate the analysis into practical priority groups. The most defensible starting point is the robust core of profiles that remain high-priority under Q70, Q75, and Q80 threshold choices.
  • The Shiny dashboard extends the report by allowing users to explore trends, location and demographic rankings, and intervention priority segments interactively.

1 Introduction

1.1 Background

Obesity and physical inactivity are major public health concerns because they are closely associated with chronic disease burden, healthcare cost, and preventive health planning. However, obesity is not evenly distributed across the population. Its prevalence can vary across locations, years, age groups, education levels, income levels, gender groups, and racial or ethnic groups.

A single national average may hide important differences between communities. For public health planning, it is more useful to identify which combinations of year, location, and demographic group have higher obesity prevalence. These combinations can be understood as public health profiles.

Physical inactivity is also an important surveillance indicator. A lack of leisure-time physical activity may appear together with higher obesity prevalence in some public health profiles. However, this project treats the relationship as an observed association. It does not claim that physical inactivity directly causes obesity.

1.2 Project Focus

This project focuses on three connected tasks.

First, it describes adult obesity and physical inactivity trends across years, locations, and demographic groups. Second, it examines whether public health profile characteristics can explain differences in adult obesity prevalence. Third, it screens high-obesity demographic subgroup profiles and translates obesity-inactivity patterns into interpretable public health priority groups.

The focus is therefore not only on building models, but also on producing results that can be understood and used for public health monitoring and prioritisation.

1.3 Objectives and Research Questions

The objective of this project is to transform large-scale public health surveillance data into interpretable insights and practical public health priorities.

The project is guided by the following research questions:

  1. How did adult obesity and no leisure-time physical activity vary across U.S. locations from 2011 to 2021? (Exploratory analysis)
  2. How do obesity patterns differ across demographic groups such as age, education, income, gender, and race or ethnicity? (Demographic comparison)
  3. Can year, location, and demographic stratification variables explain variation in adult obesity prevalence? (Regression)
  4. Can high-obesity demographic subgroup public health profiles be screened using physical inactivity levels together with year, location, and demographic characteristics? (Classification)
  5. How can obesity and physical inactivity patterns be translated into interpretable public health intervention priorities? (Decision support)

Therefore, all findings should be interpreted as group-level public health insights rather than individual medical predictions.

1.4 Analytical Scope

This project is framed as a public health profile analysis rather than an individual diagnosis or personal risk prediction task.

  1. The unit of analysis is a public health profile, such as a specific year, location, and demographic group.
  2. The project focuses on two selected indicators: adult obesity and no leisure-time physical activity.
  3. The variable data_value has different meanings under different public health questions, so the analysis must select specific questionid values before modelling.
  4. Adult obesity and physical inactivity should only be compared after matching records by year, location, and stratification group.
  5. Broad EDA may show all selected records or total-stratification records, but modelling-related EDA, regression, classification, and priority analysis use demographic subgroup profiles with Total or Overall rows excluded.
  6. The regression and classification stages are designed for explanation and screening, not for causal claims.
  7. The final intervention priority matrix is intended to support public health planning by identifying subgroup profiles where obesity and physical inactivity are both relatively high.

2 Dataset Description

2.1 Source and Dimension

The dataset used in this project is the Nutrition, Physical Activity, and Obesity dataset. The accessible CSV file was obtained from Kaggle, while the original public health context comes from the CDC Division of Nutrition, Physical Activity, and Obesity surveillance programme. The underlying data source is the Behavioral Risk Factor Surveillance System, a large state-based health survey system in the United States.

  • Dataset URL: https://www.kaggle.com/datasets/mattop/nutrition-physical-activity-and-obesity/data
  • Original public health context: CDC Division of Nutrition, Physical Activity, and Obesity
  • Underlying surveillance source: Behavioral Risk Factor Surveillance System
  • File used in this project: Nutrition__Physical_Activity__and_Obesity.csv
  • Data type: Aggregated public health surveillance data
  • Geographic context: United States, including states, territories, and national-level records
  • Years covered: 2011-2021

The dataset dimension is calculated after loading the raw CSV file.

csv_file <- "Nutrition__Physical_Activity__and_Obesity.csv"

if (!file.exists(csv_file)) {
  csv_files <- list.files(pattern = "\\.csv$", ignore.case = TRUE)
  csv_files <- csv_files[!str_detect(csv_files, "^project_output_")]

  if (length(csv_files) == 1) {
    csv_file <- csv_files[1]
    message("Using detected source CSV file: ", csv_file)
  } else {
    stop(
      "Please place Nutrition__Physical_Activity__and_Obesity.csv in the same folder as this Rmd file."
    )
  }
}

raw_data <- readr::read_csv(csv_file, show_col_types = FALSE)
clean_colnames <- function(x) {
  x %>%
    str_to_lower() %>%
    str_replace_all("[^a-z0-9]+", "_") %>%
    str_replace_all("^_|_$", "")
}

original_names <- names(raw_data)
names(raw_data) <- clean_colnames(names(raw_data))

column_mapping <- tibble(
  original_name = original_names,
  cleaned_name = names(raw_data)
)

n_rows <- nrow(raw_data)
n_cols <- ncol(raw_data)
n_dim <- n_rows * n_cols

dataset_dimension <- tibble(
  rows = n_rows,
  columns = n_cols,
  dimension_cells = n_dim
)

dataset_dimension %>%
  kable(caption = "Raw Dataset Dimension")
Raw Dataset Dimension
rows columns dimension_cells
88629 33 2924757
readr::write_csv(column_mapping, "project_output_column_mapping.csv")

The dataset contains 88,629 rows x 33 columns = 2,924,757 cells. This satisfies the course dataset-dimension requirement because the total dimension is greater than 100,000.

2.2 Variables Used

The dataset contains multiple public health questions, so many columns are metadata, identifiers, confidence limits, sample sizes, and stratification labels. The main variables used in this project are listed below.

Variable Description
yearstart Starting year of the record
yearend Ending year of the record
locationdesc U.S. state, territory, or national location description
class Broad public health category
topic Specific topic under each class
question Public health indicator question
data_value Main numeric value for the selected indicator
low_confidence_limit Lower confidence limit for the estimate
high_confidence_limit Upper confidence limit for the estimate
sample_size Sample size used for the estimate
stratificationcategory1 Type of demographic stratification
stratification1 Specific group within the stratification category
questionid Stable identifier for the public health question

2.3 Important Dataset Interpretation Issue

A key property of this dataset is that data_value is question-dependent. This means that the same numeric column can represent different measurements depending on the value of questionid.

For example:

  • Under an obesity question, data_value represents adult obesity prevalence.
  • Under a physical activity question, data_value represents a physical activity measurement.
  • Under a fruit or vegetable consumption question, data_value represents a nutrition-related measurement.

Therefore, this project does not pool all data_value records together. Instead, it selects specific public health questions before analysis. This avoids mixing values that have different meanings.

A second interpretation issue is the difference between Total or Overall rows and demographic subgroup rows. Total rows are useful for broad state-level trends because they summarise the overall adult population for a location and year. However, regression, classification, risk ranking, and intervention-priority analysis are based on demographic subgroup profiles. Therefore, Total or Overall rows are excluded from modelling-related sections to avoid mixing aggregate summaries with subgroup observations.

2.4 Selected Indicators

This project focuses on two public health indicators.

Indicator Question ID Question text
Adult obesity Q036 Percent of adults aged 18 years and older who have obesity
No leisure-time physical activity Q047 Percent of adults who engage in no leisure-time physical activity

These two indicators are selected because they support the main project objective: analysing obesity patterns and examining how they relate to physical inactivity at the public health profile level.

3 Data Structure and Quality

3.1 Overall Data Quality Summary

Before any modelling, the structure and quality of the raw data are profiled. This includes variable classes, missing values, distinct-value counts, duplicate rows, and complete-row counts.

structure_summary <- tibble(
  variable = names(raw_data),
  class = map_chr(raw_data, ~ class(.x)[1]),
  missing_n = map_int(raw_data, ~ sum(is.na(.x))),
  missing_rate = map_dbl(raw_data, ~ mean(is.na(.x))),
  unique_n = map_int(raw_data, ~ n_distinct(.x, na.rm = TRUE)),
  example_values = map_chr(
    raw_data,
    ~ paste(head(unique(na.omit(as.character(.x))), 5), collapse = " | ")
  )
)

quality_summary <- tibble(
  rows = nrow(raw_data),
  columns = ncol(raw_data),
  dimension_cells = nrow(raw_data) * ncol(raw_data),
  duplicate_rows = sum(duplicated(raw_data)),
  duplicate_rate = mean(duplicated(raw_data)),
  complete_rows = sum(complete.cases(raw_data)),
  complete_row_rate = mean(complete.cases(raw_data))
)

quality_summary %>%
  kable(digits = 4, caption = "Row-Level Data Quality Summary")
Row-Level Data Quality Summary
rows columns dimension_cells duplicate_rows duplicate_rate complete_rows complete_row_rate
88629 33 2924757 0 0 0 0
readr::write_csv(structure_summary, "project_output_variable_structure.csv")
readr::write_csv(quality_summary, "project_output_quality_summary.csv")

The raw dataset contains 88,629 rows and 33 columns. The total dataset dimension is 2,924,757 cells. The dataset contains 0 exact duplicate rows.

3.2 Key Variable Structure

key_variables <- c(
  "yearstart",
  "yearend",
  "locationdesc",
  "class",
  "topic",
  "question",
  "data_value",
  "low_confidence_limit",
  "high_confidence_limit",
  "sample_size",
  "stratificationcategory1",
  "stratification1",
  "questionid"
)

key_variable_summary <- structure_summary %>%
  filter(variable %in% key_variables)

key_variable_summary %>%
  kable(digits = 4, caption = "Key Variable Structure Summary")
Key Variable Structure Summary
variable class missing_n missing_rate unique_n example_values
yearstart numeric 0 0.0000 11 2020 | 2014 | 2013 | 2015 | 2012
yearend numeric 0 0.0000 11 2020 | 2014 | 2013 | 2015 | 2012
locationdesc character 0 0.0000 55 National | Guam | Wyoming | District of Columbia | Puerto Rico
class character 0 0.0000 3 Physical Activity | Obesity / Weight Status | Fruits and Vegetables
topic character 0 0.0000 3 Physical Activity - Behavior | Obesity / Weight Status | Fruits and Vegetables - Behavior
question character 0 0.0000 9 Percent of adults who engage in no leisure-time physical activity | Percent of adults aged 18 years and older who have obesity | Percent of adults aged 18 years and older who have an overweight classification | Percent of adults who achieve at least 300 minutes a week of moderate-intensity aerobic physical activity or 150 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination) | Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic physical activity and engage in muscle-strengthening activities on 2 or more days a week
data_value numeric 8778 0.0990 695 30.6 | 29.3 | 28.8 | 32.7 | 26.6
low_confidence_limit numeric 8778 0.0990 666 29.4 | 25.7 | 28.1 | 31.9 | 25.6
high_confidence_limit numeric 8778 0.0990 760 31.8 | 33.3 | 29.5 | 33.5 | 27.6
sample_size numeric 8778 0.0990 9580 31255 | 842 | 62562 | 60069 | 30904
questionid character 0 0.0000 9 Q047 | Q036 | Q037 | Q045 | Q044
stratificationcategory1 character 9 0.0001 6 Race/Ethnicity | Education | Income | Age (years) | Gender
stratification1 character 9 0.0001 28 Hispanic | High school graduate | $50,000 - $74,999 | Data not reported | Less than $15,000

3.3 Missing Value Review

high_missing_summary <- structure_summary %>%
  arrange(desc(missing_rate), desc(unique_n)) %>%
  slice_head(n = 10)

high_missing_summary %>%
  kable(digits = 4, caption = "Variables with Highest Missing Rates")
Variables with Highest Missing Rates
variable class missing_n missing_rate unique_n example_values
data_value_unit logical 88629 1.0000 0
total character 85464 0.9643 1 Total
gender character 82299 0.9286 2 Female | Male
data_value_footnote_symbol character 79851 0.9010 1 ~
data_value_footnote character 79851 0.9010 1 Data not available because sample size is insufficient.
education character 75969 0.8572 4 High school graduate | Less than high school | Some college or technical school | College graduate
age_years character 69639 0.7857 6 25 - 34 | 55 - 64 | 18 - 24 | 45 - 54 | 35 - 44
income character 66474 0.7500 7 $50,000 - $74,999 | Data not reported | Less than $15,000 | $25,000 - $34,999 | $15,000 - $24,999
race_ethnicity character 63309 0.7143 8 Hispanic | American Indian/Alaska Native | Asian | Non-Hispanic White | Other
sample_size numeric 8778 0.0990 9580 31255 | 842 | 62562 | 60069 | 30904

The number of complete rows is 0. This does not mean that the dataset is unusable. Much of the missingness is structural.

For example, demographic columns such as age_years, education, gender, income, race_ethnicity, and total are mutually exclusive stratification fields. A record normally belongs to one stratification category, so the other demographic columns are expected to be empty. The same information is carried more consistently by stratificationcategory1 and stratification1.

The dataset also contains missing values in variables such as data_value, low_confidence_limit, high_confidence_limit, and sample_size. These missing values are often linked to unavailable estimates, such as cases where the sample size is insufficient. Therefore, the modelling dataset is prepared by selecting the relevant indicators and removing rows where the selected indicator value is missing.

4 Data Cleaning and Question Selection

4.1 Cleaning Strategy

Because the data are aggregated and question-specific, cleaning here mainly means selecting and validating comparable records rather than repairing individual-level fields.

The cleaning workflow includes:

  1. Standardising all column names.
  2. Checking dataset dimensions, missing values, and duplicate rows.
  3. Identifying available public health questions.
  4. Selecting adult obesity and no leisure-time physical activity using questionid.
  5. Removing rows with missing data_value for the selected indicators.
  6. Validating that selected indicator values fall within the valid percentage range from 0 to 100.
  7. Exporting cleaned obesity and inactivity datasets as processed files.

The duplicate check found 0 exact duplicate rows, so no de-duplication was required.

4.2 Available Public Health Questions

has_questionid <- "questionid" %in% names(raw_data)
has_question <- "question" %in% names(raw_data)

if (!has_question) {
  stop("The dataset must contain a question column after column-name cleaning.")
}

question_table <- raw_data %>%
  distinct(
    questionid = if (has_questionid) questionid else NA_character_,
    question = question
  ) %>%
  arrange(questionid, question)

question_table %>%
  kable(caption = "Available Public Health Questions")
Available Public Health Questions
questionid question
Q018 Percent of adults who report consuming fruit less than one time daily
Q019 Percent of adults who report consuming vegetables less than one time daily
Q036 Percent of adults aged 18 years and older who have obesity
Q037 Percent of adults aged 18 years and older who have an overweight classification
Q043 Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)
Q044 Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic physical activity and engage in muscle-strengthening activities on 2 or more days a week
Q045 Percent of adults who achieve at least 300 minutes a week of moderate-intensity aerobic physical activity or 150 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)
Q046 Percent of adults who engage in muscle-strengthening activities on 2 or more days a week
Q047 Percent of adults who engage in no leisure-time physical activity
readr::write_csv(question_table, "project_output_question_table.csv")

The table above confirms that the dataset contains multiple public health questions. Since data_value has a different meaning under different questions, the analysis selects only the two questions directly related to this project.

4.3 Selected Questions

if (has_questionid) {
  selected_questions <- question_table %>%
    filter(questionid %in% c("Q036", "Q047"))
} else {
  selected_questions <- question_table %>%
    filter(
      str_detect(str_to_lower(question), "obesity") |
        str_detect(
          str_to_lower(question),
          "no leisure-time physical activity|no leisure time physical activity"
        )
    )
}

selected_questions %>%
  kable(caption = "Selected Public Health Questions")
Selected Public Health Questions
questionid question
Q036 Percent of adults aged 18 years and older who have obesity
Q047 Percent of adults who engage in no leisure-time physical activity
readr::write_csv(selected_questions, "project_output_selected_questions.csv")

The selected questions are:

  • Q036: Percent of adults aged 18 years and older who have obesity.
  • Q047: Percent of adults who engage in no leisure-time physical activity.

4.4 Build Indicator Filters

The preferred selection method is questionid, because it is more stable than text matching. Text matching is used only as a fallback if the dataset does not contain a questionid column.

if (has_questionid) {
  obesity_filter <- raw_data$questionid == "Q036"
  inactivity_filter <- raw_data$questionid == "Q047"
} else {
  obesity_filter <- str_detect(
    str_to_lower(raw_data$question),
    "obesity"
  )

  inactivity_filter <- str_detect(
    str_to_lower(raw_data$question),
    "no leisure-time physical activity|no leisure time physical activity"
  )
}

indicator_match_summary <- tibble(
  indicator = c(
    "Adult obesity",
    "No leisure-time physical activity"
  ),
  matched_records_before_missing_value_removal = c(
    sum(obesity_filter, na.rm = TRUE),
    sum(inactivity_filter, na.rm = TRUE)
  )
)

indicator_match_summary %>%
  kable(caption = "Records Matched to Each Selected Indicator")
Records Matched to Each Selected Indicator
indicator matched_records_before_missing_value_removal
Adult obesity 16577
No leisure-time physical activity 16549

4.5 Build Cleaned Indicator Datasets

required_cols <- c(
  "yearstart",
  "locationdesc",
  "question",
  "data_value",
  "stratificationcategory1",
  "stratification1"
)

missing_required <- setdiff(required_cols, names(raw_data))

if (length(missing_required) > 0) {
  stop(
    paste(
      "Missing required columns:",
      paste(missing_required, collapse = ", ")
    )
  )
}

obesity_data <- raw_data %>%
  filter(obesity_filter, !is.na(data_value)) %>%
  mutate(
    indicator = "Adult obesity",
    stratificationcategory1 = coalesce(as.character(stratificationcategory1), "Unknown"),
    stratification1 = coalesce(as.character(stratification1), "Unknown")
  )

inactivity_data <- raw_data %>%
  filter(inactivity_filter, !is.na(data_value)) %>%
  mutate(
    indicator = "No leisure-time physical activity",
    stratificationcategory1 = coalesce(as.character(stratificationcategory1), "Unknown"),
    stratification1 = coalesce(as.character(stratification1), "Unknown")
  )

4.6 Cleaned Dataset Summary

obesity_summary <- obesity_data %>%
  summarise(
    rows = n(),
    years = n_distinct(yearstart),
    min_year = min(yearstart, na.rm = TRUE),
    max_year = max(yearstart, na.rm = TRUE),
    locations = n_distinct(locationdesc),
    mean_value = mean(data_value, na.rm = TRUE),
    sd_value = sd(data_value, na.rm = TRUE),
    min_value = min(data_value, na.rm = TRUE),
    max_value = max(data_value, na.rm = TRUE)
  )

inactivity_summary <- inactivity_data %>%
  summarise(
    rows = n(),
    years = n_distinct(yearstart),
    min_year = min(yearstart, na.rm = TRUE),
    max_year = max(yearstart, na.rm = TRUE),
    locations = n_distinct(locationdesc),
    mean_value = mean(data_value, na.rm = TRUE),
    sd_value = sd(data_value, na.rm = TRUE),
    min_value = min(data_value, na.rm = TRUE),
    max_value = max(data_value, na.rm = TRUE)
  )

indicator_summary <- bind_rows(
  obesity_summary %>% mutate(indicator = "Adult obesity"),
  inactivity_summary %>% mutate(indicator = "No leisure-time physical activity")
) %>%
  select(indicator, everything())

indicator_summary %>%
  kable(digits = 4, caption = "Cleaned Indicator Dataset Summary")
Cleaned Indicator Dataset Summary
indicator rows years min_year max_year locations mean_value sd_value min_value max_value
Adult obesity 14956 11 2011 2021 55 30.4590 7.2741 0.9 60.4
No leisure-time physical activity 14984 11 2011 2021 55 25.9161 8.3074 2.5 66.8
readr::write_csv(indicator_summary, "project_output_indicator_summary.csv")

After removing records with missing data_value, the adult obesity dataset contains 14,956 records, and the no leisure-time physical activity dataset contains 14,984 records.

4.7 Value Range Validation

Both selected indicators are percentages, so valid values should fall between 0 and 100.

value_range_check <- tibble(
  indicator = c(
    "Adult obesity",
    "No leisure-time physical activity"
  ),
  questionid = c("Q036", "Q047"),
  min_value = c(
    min(obesity_data$data_value, na.rm = TRUE),
    min(inactivity_data$data_value, na.rm = TRUE)
  ),
  max_value = c(
    max(obesity_data$data_value, na.rm = TRUE),
    max(inactivity_data$data_value, na.rm = TRUE)
  ),
  out_of_range_n = c(
    sum(obesity_data$data_value < 0 | obesity_data$data_value > 100, na.rm = TRUE),
    sum(inactivity_data$data_value < 0 | inactivity_data$data_value > 100, na.rm = TRUE)
  )
)

value_range_check %>%
  kable(digits = 2, caption = "Indicator Value Range Validation")
Indicator Value Range Validation
indicator questionid min_value max_value out_of_range_n
Adult obesity Q036 0.9 60.4 0
No leisure-time physical activity Q047 2.5 66.8 0
readr::write_csv(value_range_check, "project_output_value_range_check.csv")

The range validation confirms that all retained adult obesity and physical inactivity values are within the valid percentage range from 0 to 100.

4.8 Before and After Cleaning Summary

cleaning_summary <- tibble(
  stage = c(
    "Raw data",
    "Cleaned adult obesity dataset",
    "Cleaned no leisure-time physical activity dataset"
  ),
  questionid = c(
    NA_character_,
    "Q036",
    "Q047"
  ),
  rows = c(
    nrow(raw_data),
    nrow(obesity_data),
    nrow(inactivity_data)
  ),
  columns = c(
    ncol(raw_data),
    ncol(obesity_data),
    ncol(inactivity_data)
  )
)

cleaning_summary %>%
  kable(caption = "Before and After Cleaning Summary")
Before and After Cleaning Summary
stage questionid rows columns
Raw data NA 88629 33
Cleaned adult obesity dataset Q036 14956 34
Cleaned no leisure-time physical activity dataset Q047 14984 34
readr::write_csv(cleaning_summary, "project_output_cleaning_summary.csv")
readr::write_csv(obesity_data, "project_output_cleaned_obesity_dataset.csv")
readr::write_csv(inactivity_data, "project_output_cleaned_inactivity_dataset.csv")

The cleaned datasets provide the foundation for the next stages of the project, including exploratory trend analysis, demographic comparison, aligned obesity-inactivity analysis, regression modelling, classification modelling, and public health intervention priority analysis.

5 Exploratory Data Analysis

5.1 EDA Strategy

The EDA has three goals. First, it shows how adult obesity and physical inactivity changed from 2011 to 2021. Second, it compares obesity across locations and demographic groups. Third, it builds an aligned obesity-inactivity dataset. This makes sure the two indicators are compared within the same public health profile. This part is descriptive: it does not test causality and does not predict future obesity. It helps us understand the main patterns before the regression and classification models.

The EDA uses two related levels of analysis. Broad descriptive views may use all selected records or Total records to show general surveillance patterns. Modelling-related EDA uses demographic subgroup profiles only, with Total or Overall rows removed. This keeps the EDA, regression, classification, risk ranking, intervention matrix, and threshold sensitivity analysis on a consistent subgroup-level basis.

Three rules guide this EDA:

  1. data_value is interpreted only within the selected indicator.
  2. Obesity and physical inactivity are compared only after matching by year, location, and stratification group.
  3. Results are labelled by analysis level: broad all-record trends, total-stratification trends, or subgroup-level modelling profiles.

The figures in this section use a consistent visual style so that trends, rankings, and relationships can be compared more easily.

5.2 Overall Trend

The first trend plot uses all selected records for the two indicators. It gives a broad descriptive overview of the surveillance data before separating total-level and subgroup-level analysis. The next subsection then uses total-stratification records to check whether the main trend remains visible when repeated demographic subgroup records are removed.

This pooled view is useful as a starting point, but it should not be over-read. The data include both Total records and demographic subgroup records, so the yearly average can be affected by the mix of records in each year. For that reason, the total-stratification robustness check is used as a more comparable state-level follow-up, while subgroup-only aligned data are used later for modelling-related analysis.

combined_core <- bind_rows(
  obesity_data %>%
    mutate(indicator = "Adult obesity"),
  inactivity_data %>%
    mutate(indicator = "No leisure-time physical activity")
)

yearly_trend <- combined_core %>%
  group_by(indicator, yearstart) %>%
  summarise(
    records = n(),
    average_value = mean(data_value, na.rm = TRUE),
    median_value = median(data_value, na.rm = TRUE),
    sd_value = sd(data_value, na.rm = TRUE),
    .groups = "drop"
  )

trend_change <- yearly_trend %>%
  arrange(indicator, yearstart) %>%
  group_by(indicator) %>%
  summarise(
    first_year = first(yearstart),
    first_average = first(average_value),
    last_year = last(yearstart),
    last_average = last(average_value),
    absolute_change = last_average - first_average,
    relative_change_pct = absolute_change / first_average * 100,
    .groups = "drop"
  )

yearly_trend %>%
  kable(digits = 3, caption = "Yearly Average of Selected Public Health Indicators")
Yearly Average of Selected Public Health Indicators
indicator yearstart records average_value median_value sd_value
Adult obesity 2011 1332 27.736 28.10 6.756
Adult obesity 2012 1332 27.997 28.40 6.734
Adult obesity 2013 1344 28.816 29.35 6.782
Adult obesity 2014 1369 29.336 29.90 6.998
Adult obesity 2015 1361 29.533 30.00 6.963
Adult obesity 2016 1389 30.086 30.70 7.036
Adult obesity 2017 1372 30.956 31.60 7.047
Adult obesity 2018 1369 31.633 32.20 7.004
Adult obesity 2019 1343 32.356 32.90 7.178
Adult obesity 2020 1368 32.413 32.85 7.269
Adult obesity 2021 1377 34.033 34.60 7.417
No leisure-time physical activity 2011 1332 26.492 26.20 7.416
No leisure-time physical activity 2012 1336 23.898 22.95 7.956
No leisure-time physical activity 2013 1318 26.855 26.30 7.588
No leisure-time physical activity 2014 1376 24.236 23.50 8.035
No leisure-time physical activity 2015 1358 27.054 26.50 8.025
No leisure-time physical activity 2016 1399 24.895 24.10 8.614
No leisure-time physical activity 2017 1372 27.972 27.60 8.101
No leisure-time physical activity 2018 1377 25.632 24.60 8.720
No leisure-time physical activity 2019 1346 27.769 26.85 8.427
No leisure-time physical activity 2020 1382 24.933 23.85 8.531
No leisure-time physical activity 2021 1388 25.427 24.20 8.644
trend_change %>%
  kable(digits = 3, caption = "Change from First to Last Available Year")
Change from First to Last Available Year
indicator first_year first_average last_year last_average absolute_change relative_change_pct
Adult obesity 2011 27.736 2021 34.033 6.296 22.70
No leisure-time physical activity 2011 26.492 2021 25.427 -1.065 -4.02
readr::write_csv(yearly_trend, "project_output_yearly_trend.csv")
readr::write_csv(trend_change, "project_output_yearly_trend_change.csv")
yearly_trend %>%
  ggplot(aes(x = yearstart, y = average_value, colour = indicator, group = indicator)) +
  geom_line(linewidth = 1.1) +
  geom_point(size = 2.5) +
  scale_colour_manual(values = indicator_colours) +
  scale_x_continuous(breaks = sort(unique(yearly_trend$yearstart))) +
  scale_y_continuous(labels = scales::label_number(suffix = "%")) +
  labs(
    title = "Overall Trends in Adult Obesity and Physical Inactivity",
    subtitle = "Broad descriptive view across all selected non-missing public health records",
    x = "Year",
    y = "Average value"
  ) +
  theme_report()

The two indicators show different shapes. Adult obesity shows an overall upward pattern across the decade. It rises from 27.74% in 2011 to 34.03% in 2021, an increase of 6.3 percentage points. Physical inactivity shows a recurring jagged pattern rather than a clear upward or downward trend, which is why it is interpreted more cautiously than the obesity trend.

5.3 Total-Stratification Robustness Check

The overall trend above mixes many demographic subgroup records for the same location and year. As a robustness check, this subsection keeps only the Total records. These are the overall adult rate for each location, with no demographic breakdown, so they avoid repeated demographic weighting and provide a more comparable year-location summary.

find_total_records <- function(df) {
  total_flag <- rep(FALSE, nrow(df))

  if ("stratificationcategory1" %in% names(df)) {
    total_flag <- total_flag |
      str_to_lower(as.character(df$stratificationcategory1)) %in% c("total", "overall")
  }

  if ("stratification1" %in% names(df)) {
    total_flag <- total_flag |
      str_to_lower(as.character(df$stratification1)) %in% c("total", "overall")
  }

  if ("total" %in% names(df)) {
    total_flag <- total_flag | !is.na(df$total)
  }

  total_flag
}

obesity_total <- obesity_data %>%
  filter(find_total_records(.))

inactivity_total <- inactivity_data %>%
  filter(find_total_records(.))

total_check <- tibble(
  indicator = c("Adult obesity", "No leisure-time physical activity"),
  records = c(nrow(obesity_total), nrow(inactivity_total)),
  years = c(n_distinct(obesity_total$yearstart), n_distinct(inactivity_total$yearstart)),
  locations = c(n_distinct(obesity_total$locationdesc), n_distinct(inactivity_total$locationdesc))
)

total_indicator_trend <- bind_rows(
  obesity_total %>%
    transmute(indicator = "Adult obesity", yearstart, locationdesc, value = data_value),
  inactivity_total %>%
    transmute(indicator = "No leisure-time physical activity", yearstart, locationdesc, value = data_value)
) %>%
  group_by(indicator, yearstart) %>%
  summarise(
    records = n(),
    average_value = mean(value, na.rm = TRUE),
    median_value = median(value, na.rm = TRUE),
    sd_value = sd(value, na.rm = TRUE),
    .groups = "drop"
  )

total_trend_change <- total_indicator_trend %>%
  arrange(indicator, yearstart) %>%
  group_by(indicator) %>%
  summarise(
    first_year = first(yearstart),
    first_average = first(average_value),
    last_year = last(yearstart),
    last_average = last(average_value),
    absolute_change = last_average - first_average,
    relative_change_pct = absolute_change / first_average * 100,
    .groups = "drop"
  )

trend_change_comparison <- bind_rows(
  trend_change %>%
    mutate(analysis_basis = "All selected records"),
  total_trend_change %>%
    mutate(analysis_basis = "Total records only")
) %>%
  select(
    analysis_basis,
    indicator,
    first_year,
    first_average,
    last_year,
    last_average,
    absolute_change
  )

total_check %>%
  kable(caption = "Coverage of Total-Stratification Records")
Coverage of Total-Stratification Records
indicator records years locations
Adult obesity 589 11 55
No leisure-time physical activity 588 11 55
trend_change_comparison %>%
  kable(digits = 3, caption = "First-to-Last-Year Change Under Two Analysis Bases")
First-to-Last-Year Change Under Two Analysis Bases
analysis_basis indicator first_year first_average last_year last_average absolute_change
All selected records Adult obesity 2011 27.736 2021 34.033 6.296
All selected records No leisure-time physical activity 2011 26.492 2021 25.427 -1.065
Total records only Adult obesity 2011 27.588 2021 33.537 5.949
Total records only No leisure-time physical activity 2011 25.737 2021 24.187 -1.550
readr::write_csv(total_check, "project_output_total_stratification_check.csv")
readr::write_csv(total_indicator_trend, "project_output_total_indicator_trend.csv")
readr::write_csv(total_trend_change, "project_output_total_trend_change.csv")
readr::write_csv(trend_change_comparison, "project_output_trend_change_comparison.csv")
trend_change_comparison %>%
  mutate(
    analysis_basis = factor(
      analysis_basis,
      levels = c("All selected records", "Total records only")
    )
  ) %>%
  ggplot(aes(x = indicator, y = absolute_change, fill = analysis_basis)) +
  geom_col(width = 0.64, position = position_dodge(width = 0.72)) +
  geom_text(
    aes(label = round(absolute_change, 2)),
    position = position_dodge(width = 0.72),
    vjust = if_else(trend_change_comparison$absolute_change >= 0, -0.35, 1.25),
    colour = "#243447",
    size = 3.6
  ) +
  geom_hline(yintercept = 0, colour = "#5D6975", linewidth = 0.5) +
  scale_fill_manual(values = c(
    "All selected records" = "#A7B1B7",
    "Total records only" = "#496A81"
  )) +
  scale_y_continuous(
    labels = scales::label_number(suffix = " pts"),
    expand = expansion(mult = c(0.16, 0.16))
  ) +
  labs(
    title = "Trend Change Robustness Check",
    subtitle = "First-to-last-year change remains similar after using Total-only records",
    x = NULL,
    y = "Change from first to last year"
  ) +
  theme_report()

The total records cover the indicators well for state-level analysis. Adult obesity total records span 11 years and 55 locations. On this Total-only basis, adult obesity still rises from 27.59% to 33.54%, while physical inactivity shows no clear upward or downward trend and oscillates around a broadly similar level. The robustness plot does not repeat the first trend chart; instead, it compares the first-to-last-year change under the all-record and Total-only bases. The obesity increase remains similar under both bases, so the upward obesity pattern is not simply an artefact of repeated demographic subgroup records.

Two limits apply to this trend. First, it is an unweighted average of available location-level total records. Each location has the same weight, so a large state and a small state are treated the same. It should be read as a state-level surveillance trend, not a population-weighted national prevalence estimate. Second, 2020 and 2021 overlap with the COVID-19 pandemic period, so changes in those years should be read carefully: they may reflect both real public health change and data-collection context.

6 Location and Demographic Comparison

Trends describe the when; location and demographic comparisons describe the where and who. This section first compares locations on a total-stratification basis, then uses demographic stratification records to identify group-level differences in obesity prevalence.

6.1 Location Ranking

The location ranking uses Total obesity records only, so each location is compared at the same level and locations with more subgroup records do not get extra weight. To reduce instability from very short series, the ranking keeps only locations with at least eight years of total-stratification obesity records.

min_years_for_location <- 8

location_obesity <- obesity_total %>%
  filter(locationdesc != "National") %>%
  group_by(locationdesc) %>%
  summarise(
    records = n(),
    years = n_distinct(yearstart),
    average_obesity = mean(data_value, na.rm = TRUE),
    median_obesity = median(data_value, na.rm = TRUE),
    min_obesity = min(data_value, na.rm = TRUE),
    max_obesity = max(data_value, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  filter(years >= min_years_for_location) %>%
  arrange(desc(average_obesity))

top_obesity_locations <- location_obesity %>%
  slice_max(average_obesity, n = 10, with_ties = FALSE)

low_obesity_locations <- location_obesity %>%
  slice_min(average_obesity, n = 10, with_ties = FALSE)

location_obesity_extremes <- bind_rows(
  top_obesity_locations %>% mutate(group = "Highest average obesity"),
  low_obesity_locations %>% mutate(group = "Lowest average obesity")
)

top_obesity_locations %>%
  kable(digits = 3, caption = "Top 10 Locations by Average Adult Obesity")
Top 10 Locations by Average Adult Obesity
locationdesc records years average_obesity median_obesity min_obesity max_obesity
Mississippi 11 11 37.218 37.3 34.6 40.8
West Virginia 11 11 37.027 37.7 32.4 40.6
Louisiana 11 11 35.764 35.9 33.1 38.6
Arkansas 11 11 35.518 35.7 30.9 38.7
Alabama 11 11 35.427 35.7 32.0 39.9
Kentucky 11 11 34.509 34.3 30.4 40.3
Oklahoma 11 11 34.491 33.9 31.1 39.4
Tennessee 11 11 33.464 33.8 29.2 36.5
Indiana 11 11 33.327 32.7 30.8 36.8
South Carolina 11 11 33.300 32.3 30.8 36.2
low_obesity_locations %>%
  kable(digits = 3, caption = "Lowest 10 Locations by Average Adult Obesity")
Lowest 10 Locations by Average Adult Obesity
locationdesc records years average_obesity median_obesity min_obesity max_obesity
Colorado 11 11 22.273 22.3 20.2 25.1
District of Columbia 11 11 23.218 23.0 21.7 24.7
Hawaii 11 11 23.545 23.8 21.8 25.0
Massachusetts 11 11 24.455 24.3 22.7 27.4
California 11 11 25.618 25.0 23.8 30.3
New York 11 11 26.073 25.7 23.6 29.1
Vermont 11 11 26.164 26.3 23.7 29.0
New Jersey 10 10 26.340 26.6 23.7 28.2
Montana 11 11 26.345 25.5 23.6 31.8
Utah 11 11 26.382 25.4 24.1 30.9
readr::write_csv(location_obesity, "project_output_location_obesity_ranking.csv")
readr::write_csv(location_obesity_extremes, "project_output_location_obesity_extremes.csv")
location_obesity_extremes %>%
  mutate(
    location_label = reorder(locationdesc, average_obesity),
    group = factor(group, levels = c("Highest average obesity", "Lowest average obesity"))
  ) %>%
  ggplot(aes(x = location_label, y = average_obesity, fill = group)) +
  geom_col(width = 0.72) +
  geom_text(
    aes(label = round(average_obesity, 1)),
    hjust = -0.12,
    size = 3.2,
    colour = "#243447"
  ) +
  coord_flip() +
  facet_wrap(~group, scales = "free") +
  scale_fill_manual(values = c(
    "Highest average obesity" = bar_fill_high,
    "Lowest average obesity" = bar_fill_low
  )) +
  scale_y_continuous(
    labels = scales::label_number(suffix = "%"),
    expand = expansion(mult = c(0, 0.14))
  ) +
  labs(
    title = "Location Differences in Adult Obesity",
    subtitle = paste0("Total-stratification records; locations with at least ", min_years_for_location, " years of data"),
    x = NULL,
    y = "Average adult obesity"
  ) +
  theme_report()

The ranking shows clear differences between locations, so obesity is not evenly spread across places. The highest average obesity locations are led by Mississippi, West Virginia, and Louisiana, which form a clear concentration in Southern and Appalachian areas. The lowest are led by Colorado, District of Columbia, and Hawaii. This is still descriptive: it shows where higher obesity values appear, not why. The gap between the highest and lowest groups is large, which is why locationdesc is later included in the regression model.

6.2 Demographic Stratification

The demographic comparison uses subgroup records only. Total records are removed so that overall rates are not mixed with subgroup rates. The aim is to see which age, education, income, gender, and race or ethnicity groups have higher average obesity values. These results describe public health profiles, not individual biological risk.

obesity_stratification <- obesity_data %>%
  filter(
    !is.na(stratificationcategory1),
    !is.na(stratification1),
    !str_to_lower(as.character(stratificationcategory1)) %in% c("total", "overall"),
    !str_to_lower(as.character(stratification1)) %in% c("total", "overall")
  ) %>%
  group_by(stratificationcategory1, stratification1) %>%
  summarise(
    records = n(),
    years = n_distinct(yearstart),
    locations = n_distinct(locationdesc),
    average_obesity = mean(data_value, na.rm = TRUE),
    median_obesity = median(data_value, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(stratificationcategory1, desc(average_obesity))

stratification_gap <- obesity_stratification %>%
  group_by(stratificationcategory1) %>%
  arrange(desc(average_obesity), .by_group = TRUE) %>%
  summarise(
    highest_group = first(stratification1),
    highest_average = first(average_obesity),
    lowest_group = last(stratification1),
    lowest_average = last(average_obesity),
    within_category_gap = highest_average - lowest_average,
    groups_compared = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(within_category_gap))

obesity_stratification %>%
  kable(digits = 3, caption = "Average Adult Obesity by Demographic Stratification Group")
Average Adult Obesity by Demographic Stratification Group
stratificationcategory1 stratification1 records years locations average_obesity median_obesity
Age (years) 45 - 54 589 11 55 35.913 35.50
Age (years) 55 - 64 589 11 55 34.831 34.90
Age (years) 35 - 44 589 11 55 33.902 33.60
Age (years) 25 - 34 589 11 55 28.883 28.80
Age (years) 65 or older 589 11 55 27.877 28.00
Age (years) 18 - 24 589 11 55 17.751 17.30
Education Less than high school 589 11 55 34.207 34.30
Education High school graduate 589 11 55 32.424 32.40
Education Some college or technical school 589 11 55 31.849 31.80
Education College graduate 589 11 55 24.414 24.30
Gender Male 589 11 55 30.218 30.30
Gender Female 589 11 55 30.173 29.80
Income Less than $15,000 589 11 55 35.324 35.40
Income $15,000 - $24,999 589 11 55 33.840 33.90
Income $25,000 - $34,999 589 11 55 32.313 32.10
Income $35,000 - $49,999 589 11 55 31.980 32.00
Income $50,000 - $74,999 588 11 55 31.338 31.15
Income $75,000 or greater 589 11 55 27.418 27.00
Income Data not reported 589 11 55 25.834 25.70
Race/Ethnicity Hawaiian/Pacific Islander 41 11 8 39.715 41.00
Race/Ethnicity Non-Hispanic Black 473 11 47 38.303 38.70
Race/Ethnicity American Indian/Alaska Native 388 11 46 36.869 37.50
Race/Ethnicity Hispanic 573 11 55 32.056 32.10
Race/Ethnicity 2 or more races 540 11 53 32.047 31.85
Race/Ethnicity Non-Hispanic White 580 11 54 28.566 29.10
Race/Ethnicity Other 196 11 42 27.281 26.75
Race/Ethnicity Asian 386 11 40 10.967 10.20
stratification_gap %>%
  kable(digits = 3, caption = "Within-Category Obesity Gaps")
Within-Category Obesity Gaps
stratificationcategory1 highest_group highest_average lowest_group lowest_average within_category_gap groups_compared
Race/Ethnicity Hawaiian/Pacific Islander 39.715 Asian 10.967 28.748 8
Age (years) 45 - 54 35.913 18 - 24 17.751 18.162 6
Education Less than high school 34.207 College graduate 24.414 9.794 4
Income Less than $15,000 35.324 Data not reported 25.834 9.489 7
Gender Male 30.218 Female 30.173 0.045 2
readr::write_csv(obesity_stratification, "project_output_obesity_stratification_profile.csv")
readr::write_csv(stratification_gap, "project_output_obesity_stratification_gap.csv")
obesity_stratification %>%
  mutate(
    stratification_label = reorder(stratification1, average_obesity)
  ) %>%
  ggplot(aes(x = stratification_label, y = average_obesity)) +
  geom_col(fill = bar_fill_neutral, width = 0.72) +
  coord_flip() +
  facet_wrap(~stratificationcategory1, scales = "free") +
  scale_y_continuous(labels = scales::label_number(suffix = "%")) +
  labs(
    title = "Adult Obesity by Demographic Stratification Group",
    subtitle = "Average values across available years and locations",
    x = NULL,
    y = "Average adult obesity"
  ) +
  theme_report(base_size = 11)

The chart shows that obesity differs across reported demographic groups, not only across places. The largest within-category gap appears in Race/Ethnicity, where Hawaiian/Pacific Islander has the highest average obesity value and Asian has the lowest. The results suggest that obesity is higher in some middle-aged, lower education, and lower income profiles. Race or ethnicity shows a wide spread, while gender differences are small. This is why stratification1 is later included in the regression model and also used as profile-level information in the classification screening model.

This does not mean any group causes obesity. It only shows that obesity values differ across reported profiles. Two limits apply. These averages are not population-weighted demographic prevalence; they are averages across available subgroup records. They are also averaged across years and locations, so a group whose records come mostly from higher-obesity locations can have its average pulled up by that coverage mix rather than by the demographic factor itself. In addition, some race or ethnicity groups have fewer available records and locations than others, so extreme subgroup averages should be read with coverage caution. Overall, these results should be read as profile-level patterns.

7 Aligned Relationship and Robustness Checks

7.1 Aligned Obesity-Inactivity Relationship

Obesity and physical inactivity come from separate rows, so their relationship should not be measured by comparing all obesity records with all inactivity records. A fair comparison matches the two indicators by the same year, location, stratification category, and stratification group. This gives one row per public health profile, with both values attached. After this alignment, the report keeps two versions: a broad aligned dataset for descriptive comparison and a subgroup-only aligned dataset for regression, classification, risk ranking, intervention matrix, and threshold sensitivity analysis.

alignment_keys <- c("yearstart", "locationdesc")

if ("stratificationcategory1" %in% names(obesity_data) &&
    "stratificationcategory1" %in% names(inactivity_data)) {
  alignment_keys <- c(alignment_keys, "stratificationcategory1")
}

if ("stratification1" %in% names(obesity_data) &&
    "stratification1" %in% names(inactivity_data)) {
  alignment_keys <- c(alignment_keys, "stratification1")
}

obesity_for_join <- obesity_data %>%
  select(all_of(alignment_keys), obesity_value = data_value) %>%
  group_by(across(all_of(alignment_keys))) %>%
  summarise(obesity_value = mean(obesity_value, na.rm = TRUE), .groups = "drop")

inactivity_for_join <- inactivity_data %>%
  select(all_of(alignment_keys), inactivity_value = data_value) %>%
  group_by(across(all_of(alignment_keys))) %>%
  summarise(inactivity_value = mean(inactivity_value, na.rm = TRUE), .groups = "drop")

aligned_health <- obesity_for_join %>%
  inner_join(inactivity_for_join, by = alignment_keys)

modelling_aligned_health <- aligned_health %>%
  filter(!find_total_records(.))

aligned_scope_summary <- tibble(
  analysis_scope = c(
    "All aligned profiles",
    "Subgroup aligned profiles used for modelling"
  ),
  profiles = c(nrow(aligned_health), nrow(modelling_aligned_health)),
  total_or_overall_profiles_excluded = c(
    0,
    nrow(aligned_health) - nrow(modelling_aligned_health)
  )
)

aligned_summary <- aligned_health %>%
  summarise(
    aligned_profiles = n(),
    obesity_mean = mean(obesity_value, na.rm = TRUE),
    inactivity_mean = mean(inactivity_value, na.rm = TRUE),
    obesity_q75 = quantile(obesity_value, 0.75, na.rm = TRUE),
    inactivity_q75 = quantile(inactivity_value, 0.75, na.rm = TRUE),
    correlation = cor(obesity_value, inactivity_value, use = "complete.obs")
  )

modelling_aligned_summary <- modelling_aligned_health %>%
  summarise(
    aligned_profiles = n(),
    obesity_mean = mean(obesity_value, na.rm = TRUE),
    inactivity_mean = mean(inactivity_value, na.rm = TRUE),
    obesity_q75 = quantile(obesity_value, 0.75, na.rm = TRUE),
    inactivity_q75 = quantile(inactivity_value, 0.75, na.rm = TRUE),
    correlation = cor(obesity_value, inactivity_value, use = "complete.obs")
  )

aligned_summary_compare <- bind_rows(
  aligned_summary %>% mutate(analysis_scope = "All aligned profiles"),
  modelling_aligned_summary %>% mutate(analysis_scope = "Subgroup aligned modelling profiles")
) %>%
  select(analysis_scope, everything())

aligned_quadrant_preview <- modelling_aligned_health %>%
  mutate(
    obesity_level = if_else(
      obesity_value >= modelling_aligned_summary$obesity_q75,
      "High obesity",
      "Lower obesity"
    ),
    inactivity_level = if_else(
      inactivity_value >= modelling_aligned_summary$inactivity_q75,
      "High inactivity",
      "Lower inactivity"
    ),
    profile_group = case_when(
      obesity_level == "High obesity" & inactivity_level == "High inactivity" ~
        "High obesity + high inactivity",
      obesity_level == "High obesity" & inactivity_level == "Lower inactivity" ~
        "High obesity only",
      obesity_level == "Lower obesity" & inactivity_level == "High inactivity" ~
        "High inactivity only",
      TRUE ~ "Lower obesity + lower inactivity"
    )
  ) %>%
  count(profile_group, name = "profiles") %>%
  mutate(share = profiles / sum(profiles)) %>%
  arrange(desc(profiles))

aligned_scope_summary %>%
  kable(digits = 3, caption = "Aligned Dataset Scope for Descriptive and Modelling Analysis")
Aligned Dataset Scope for Descriptive and Modelling Analysis
analysis_scope profiles total_or_overall_profiles_excluded
All aligned profiles 14922 0
Subgroup aligned profiles used for modelling 14334 588
aligned_summary_compare %>%
  kable(digits = 3, caption = "Aligned Obesity-Inactivity Relationship Summary by Analysis Scope")
Aligned Obesity-Inactivity Relationship Summary by Analysis Scope
analysis_scope aligned_profiles obesity_mean inactivity_mean obesity_q75 inactivity_q75 correlation
All aligned profiles 14922 30.465 25.924 35.1 31.1 0.458
Subgroup aligned modelling profiles 14334 30.475 25.964 35.2 31.3 0.457
aligned_quadrant_preview %>%
  kable(digits = 3, caption = "Exploratory Subgroup Profile Groups Based on 75th Percentile Cut-Offs")
Exploratory Subgroup Profile Groups Based on 75th Percentile Cut-Offs
profile_group profiles share
Lower obesity + lower inactivity 8779 0.612
High obesity only 1970 0.137
High inactivity only 1960 0.137
High obesity + high inactivity 1625 0.113
readr::write_csv(aligned_health, "project_output_aligned_obesity_inactivity_all_profiles.csv")
readr::write_csv(modelling_aligned_health, "project_output_aligned_obesity_inactivity_subgroup_profiles.csv")
readr::write_csv(aligned_scope_summary, "project_output_aligned_scope_summary.csv")
readr::write_csv(aligned_summary_compare, "project_output_aligned_relationship_summary.csv")
readr::write_csv(aligned_quadrant_preview, "project_output_aligned_quadrant_preview.csv")
set.seed(7004)

modelling_aligned_health %>%
  slice_sample(n = min(5000, nrow(modelling_aligned_health))) %>%
  ggplot(aes(x = inactivity_value, y = obesity_value)) +
  geom_point(alpha = 0.28, colour = "#385A6B", size = 1.6) +
  geom_smooth(method = "lm", se = TRUE, colour = "#B4443F", fill = "#E6B8B4") +
  geom_vline(
    xintercept = modelling_aligned_summary$inactivity_q75,
    linetype = "dashed",
    colour = "#5D6975"
  ) +
  geom_hline(
    yintercept = modelling_aligned_summary$obesity_q75,
    linetype = "dashed",
    colour = "#5D6975"
  ) +
  scale_x_continuous(labels = scales::label_number(suffix = "%")) +
  scale_y_continuous(labels = scales::label_number(suffix = "%")) +
  labs(
    title = "Aligned Relationship Between Physical Inactivity and Adult Obesity",
    subtitle = "Subgroup aligned profiles; dashed lines mark 75th percentile cut-offs",
    x = "No leisure-time physical activity",
    y = "Adult obesity"
  ) +
  theme_report()

After alignment, the broad aligned dataset has 14,922 comparable public health profiles. For modelling-related analysis, Total or Overall rows are excluded, leaving 14,334 subgroup aligned profiles. The all-profile aligned correlation is 0.458, while the subgroup modelling-scope correlation is 0.457. Both are positive, but they should be read as descriptive profile-level associations rather than causal effects. The subgroup aligned version is used for the within-profile check, lagged association check, regression adjustment ladder, classification model, risk ranking, intervention matrix, and threshold sensitivity analysis.

The alignment uses an inner join, so it keeps only profiles that have a record for both obesity and inactivity in the same year, location, and stratification group. The surveillance panel is mildly unbalanced (some locations or subgroups are not screened for every indicator in every year), so this forces the comparison onto a fully overlapping set of profiles. That keeps the correlation basis exactly consistent across the two indicators, at the cost of dropping profiles seen for only one of them. Removing aggregate Total or Overall rows for the modelling scope also avoids treating statewide totals as if they were independent demographic subgroup observations.

7.2 Within-Profile Robustness Check

The headline all-profile correlation still mixes two different things: stable differences between profiles (some locations and groups are simply higher on both indicators) and year-to-year changes within the same profile. To separate these, this check uses the subgroup aligned modelling dataset, removes the average value of each profile, and then re-correlates. It therefore focuses on whether obesity and inactivity still move together within the same location x stratification category x stratification group profile over time. This is a stricter check than the raw aligned correlation. Profiles with fewer than three years are dropped, because a one- or two-year profile has almost no within variation and would pile up near zero and distort the result.

# Within-profile check uses the subgroup aligned modelling dataset,
# so it is on the same basis as the regression and classification sections.
within_base <- modelling_aligned_health

within_corr_df <- within_base %>%
  group_by(locationdesc, stratificationcategory1, stratification1) %>%
  filter(n() >= 3) %>%
  mutate(
    obesity_within = obesity_value - mean(obesity_value, na.rm = TRUE),
    inactivity_within = inactivity_value - mean(inactivity_value, na.rm = TRUE)
  ) %>%
  ungroup()

raw_aligned_corr <- cor(
  aligned_health$obesity_value,
  aligned_health$inactivity_value,
  use = "complete.obs"
)

subgroup_aligned_corr <- cor(
  modelling_aligned_health$obesity_value,
  modelling_aligned_health$inactivity_value,
  use = "complete.obs"
)

within_profile_corr <- cor(
  within_corr_df$obesity_within,
  within_corr_df$inactivity_within,
  use = "complete.obs"
)

within_corr_compare <- tibble(
  correlation_type = c(
    "Raw aligned correlation (all profiles)",
    "Raw aligned correlation (subgroup modelling profiles)",
    "Within-profile demeaned correlation (subgroup profiles, >= 3 years)"
  ),
  profiles_used = c(
    sum(complete.cases(aligned_health[, c("obesity_value", "inactivity_value")])),
    sum(complete.cases(modelling_aligned_health[, c("obesity_value", "inactivity_value")])),
    nrow(within_corr_df)
  ),
  correlation = c(raw_aligned_corr, subgroup_aligned_corr, within_profile_corr)
)

within_corr_compare %>%
  kable(digits = 3, caption = "Raw Versus Within-Profile Demeaned Correlation")
Raw Versus Within-Profile Demeaned Correlation
correlation_type profiles_used correlation
Raw aligned correlation (all profiles) 14922 0.458
Raw aligned correlation (subgroup modelling profiles) 14334 0.457
Within-profile demeaned correlation (subgroup profiles, >= 3 years) 14259 0.073
readr::write_csv(within_corr_compare, "project_output_within_profile_correlation.csv")

The raw aligned correlation is 0.458 across all aligned profiles and 0.457 within the subgroup modelling scope. The within-profile demeaned correlation is 0.073. Because the within-profile figure uses demographic subgroup profiles only, it is on the same basis as the regression adjustment ladder and classification model later in the report. The within-profile correlation is much smaller, which suggests that much of the subgroup obesity-inactivity relationship comes from stable differences between locations and demographic profiles, not from year-to-year movement within the same profile. In plain terms, some profiles are high on both indicators because of their location and demographic structure; the relationship is weaker once we only look at changes within the same profile over time. One more caveat applies even to the within number: because both indicators drift upward over the decade within a profile, a positive within-profile correlation can partly reflect a shared time trend, so it is not removed here and the result is treated as descriptive only.

7.3 Exploratory Lagged Association Check

The aligned relationship above uses same-year subgroup profiles. The next question is whether earlier physical inactivity is related to later obesity in the same location and stratification group. This check uses the subgroup aligned modelling dataset and compares inactivity with obesity at lags of 0, 1, and 2 years. It is only a descriptive check, not a forecasting model and not a causal test.

lag_keys <- setdiff(alignment_keys, "yearstart")

lag_correlations <- map_dfr(0:2, function(k) {
  inactivity_part <- modelling_aligned_health %>%
    transmute(
      across(all_of(lag_keys)),
      match_year = yearstart + k,
      inactivity_value
    )

  obesity_part <- modelling_aligned_health %>%
    transmute(
      across(all_of(lag_keys)),
      match_year = yearstart,
      obesity_value
    )

  joined <- inner_join(
    inactivity_part,
    obesity_part,
    by = c(lag_keys, "match_year")
  )

  tibble(
    lag_years = k,
    matched_pairs = nrow(joined),
    correlation = cor(
      joined$inactivity_value,
      joined$obesity_value,
      use = "complete.obs"
    )
  )
})

lag_correlations %>%
  kable(digits = 3, caption = "Lagged Correlation Between Inactivity and Later Obesity")
Lagged Correlation Between Inactivity and Later Obesity
lag_years matched_pairs correlation
0 14334 0.457
1 12795 0.461
2 11473 0.478
readr::write_csv(lag_correlations, "project_output_lag_correlations.csv")
lag_correlations %>%
  ggplot(aes(x = factor(lag_years), y = correlation, group = 1)) +
  geom_col(fill = bar_fill_neutral, width = 0.62) +
  geom_text(
    aes(label = paste0("r = ", round(correlation, 3), "\n", "n = ", scales::comma(matched_pairs))),
    vjust = -0.35,
    colour = "#243447",
    size = 3.5,
    lineheight = 0.9
  ) +
  scale_y_continuous(limits = c(0, max(lag_correlations$correlation) + 0.12)) +
  labs(
    title = "Lagged Association Between Physical Inactivity and Later Obesity",
    subtitle = "Subgroup aligned profiles; lag 1 and lag 2 compare earlier inactivity with later obesity",
    x = "Lag in years",
    y = "Correlation (r)"
  ) +
  theme_report()

The lagged correlations are 0.457, 0.461, 0.478 for lag 0, lag 1, and lag 2. They stay positive and do not weaken much as the lag grows, although the number of matched pairs decreases at longer lags. This pattern may be linked to stable differences between locations and demographic subgroup profiles. Therefore, it should be read as a stable subgroup-level association, not as strong early-warning evidence. No out-of-sample forecasting is done here.

8 Regression Model

8.1 Modelling Objective and Design

The regression section addresses the third research question:

Can year, location, and demographic stratification variables explain variation in adult obesity prevalence?

The goal is not to predict obesity for one person. The target is the adult obesity percentage for an aggregated public health profile. In simple terms, the model checks why obesity values differ across years, locations, and demographic groups in the surveillance data.

This section has two parts. First, it fits a structural additive model with year, location, and demographic group. This model summarises the main patterns already seen in the EDA. Second, a later subsection uses an adjustment ladder to test whether physical inactivity still explains obesity after year, location, and demographic group are considered. The second part is more important for the obesity-inactivity question, because it answers something the EDA cannot: how much of the association survives once geography and demographics are controlled.

The regression target is:

Adult obesity data_value

The main predictors are:

yearstart + locationdesc + stratification1

stratificationcategory1 is not in the main formula because stratification1 already gives the specific demographic group. In this dataset the category and group fields are nested, so using both adds repeated dummy variables and makes the model harder to read. Variables that are directly derived from the reported estimate, such as the confidence limits and alternative value fields, are not used, because they would leak the indicator value itself. Sample size is different: it is not a leakage variable, but it is also not used as a predictor in this simple model. It records how precise each estimate is, and future work could use it for weighting (for example inverse-variance weighting), so small-sample estimates have less influence.

This regression uses demographic subgroup records only. The Total rows are overall state-year rates, that is, aggregates of the subgroup rows in the same state and year. They sit at a different level and are not independent of the subgroups. If both are used together, the same population may be represented more than once, which can make the regression look more stable than it really is. Total rows are kept for the EDA trend but removed here, which also matches the research question’s focus on demographic stratification.

This setup is simple and easy to interpret. It uses the same profile dimensions as the EDA: time, place, and demographic group. But subgroup rows are not fully independent: in the same state and year, the age, gender, income, education, and race or ethnicity rows are different views of the same survey sample. Because of this, the coefficients are used as descriptive, model-adjusted differences, not as formal statistical inference. The standard errors and p-values from lm() are not used for the main interpretation, because they would understate the uncertainty under this dependence.

Two models are fitted:

  1. Linear regression, used as the main interpretable model.
  2. Regression tree, used as a simple non-linear benchmark.

8.2 Train-Test Split and Model Fitting

regression_predictors <- c(
  "yearstart",
  "locationdesc",
  "stratification1"
)

regression_predictors <- regression_predictors[
  regression_predictors %in% names(obesity_data)
]

regression_df <- obesity_data %>%
  filter(
    !str_to_lower(as.character(stratificationcategory1)) %in% c("total", "overall"),
    !str_to_lower(as.character(stratification1)) %in% c("total", "overall")
  ) %>%
  select(data_value, all_of(regression_predictors)) %>%
  drop_na(data_value) %>%
  mutate(across(where(is.character), as.factor))

set.seed(7004)
regression_train_index <- sample.int(
  nrow(regression_df),
  floor(0.8 * nrow(regression_df))
)

reg_train <- regression_df[regression_train_index, ]
reg_test <- regression_df[-regression_train_index, ]

drop_unseen_levels <- function(train, test) {
  keep <- rep(TRUE, nrow(test))

  for (column_name in names(test)) {
    if (is.factor(test[[column_name]])) {
      seen_levels <- unique(as.character(train[[column_name]]))
      keep <- keep & as.character(test[[column_name]]) %in% seen_levels
    }
  }

  list(
    test = test[keep, ],
    dropped = sum(!keep)
  )
}

reg_guard <- drop_unseen_levels(reg_train, reg_test)
reg_test <- reg_guard$test
reg_train <- droplevels(reg_train)

regression_design_summary <- tibble(
  target = "Adult obesity data_value",
  predictors = paste(regression_predictors, collapse = " + "),
  modelling_rows = nrow(regression_df),
  train_rows = nrow(reg_train),
  test_rows_after_guard = nrow(reg_test),
  test_rows_dropped_unseen_levels = reg_guard$dropped
)

reg_formula <- as.formula(
  paste("data_value ~", paste(regression_predictors, collapse = " + "))
)

reg_lm <- lm(reg_formula, data = reg_train)
reg_tree <- rpart(reg_formula, data = reg_train, method = "anova")

reg_pred_lm <- predict(reg_lm, newdata = reg_test)
reg_pred_tree <- predict(reg_tree, newdata = reg_test)

reg_baseline_pred <- rep(mean(reg_train$data_value, na.rm = TRUE), nrow(reg_test))
reg_baseline_rmse <- sqrt(mean((reg_test$data_value - reg_baseline_pred)^2))

regression_metric_row <- function(model_name, actual, predicted, n_train, baseline_rmse) {
  rmse_value <- sqrt(mean((actual - predicted)^2))

  tibble(
    model = model_name,
    target = "Adult obesity data_value",
    n_train = n_train,
    n_test = length(actual),
    rmse = rmse_value,
    mae = mean(abs(actual - predicted)),
    baseline_rmse = baseline_rmse,
    rmse_improvement_pct = (baseline_rmse - rmse_value) / baseline_rmse * 100,
    test_r2 = 1 - sum((actual - predicted)^2) /
      sum((actual - mean(actual))^2)
  )
}

regression_metrics <- bind_rows(
  regression_metric_row(
    "Linear regression",
    reg_test$data_value,
    reg_pred_lm,
    nrow(reg_train),
    reg_baseline_rmse
  ),
  regression_metric_row(
    "Regression tree",
    reg_test$data_value,
    reg_pred_tree,
    nrow(reg_train),
    reg_baseline_rmse
  )
)

regression_design_summary %>%
  kable(caption = "Regression Modelling Design Summary")
Regression Modelling Design Summary
target predictors modelling_rows train_rows test_rows_after_guard test_rows_dropped_unseen_levels
Adult obesity data_value yearstart + locationdesc + stratification1 14367 11493 2874 0
regression_metrics %>%
  kable(digits = 4, caption = "Regression Model Performance on Test Data")
Regression Model Performance on Test Data
model target n_train n_test rmse mae baseline_rmse rmse_improvement_pct test_r2
Linear regression Adult obesity data_value 11493 2874 3.7328 2.5359 7.4214 49.7016 0.7469
Regression tree Adult obesity data_value 11493 2874 4.4940 3.3592 7.4214 39.4460 0.6332
readr::write_csv(regression_design_summary, "project_output_regression_design_summary.csv")
readr::write_csv(regression_metrics, "project_output_regression_metrics.csv")

After the train-test split, the unseen-factor guard removed 0 test records, so the test set has no unseen location or stratification levels. Under this row-level random split, the linear regression explains 74.7% of the test-set variation in adult obesity, with an RMSE of 3.73 percentage points. This should be read as within-panel explanation, not forecasting skill. The same locations and demographic groups appear in both the training and test sets, so the model is mainly recovering already-observed profile means. The mean-only baseline is also weak, so beating it by 49.7% mainly shows that location and group structure is useful, not that the model can predict new locations or groups.

The regression tree gives a simple non-linear comparison. In this run it does not improve on the linear model. So the linear regression is kept as the main model, because it is simpler, clearer, and fits the observed profiles at least as well under the same split.

regression_metrics %>%
  select(model, rmse, mae, test_r2) %>%
  pivot_longer(
    cols = c(rmse, mae, test_r2),
    names_to = "metric",
    values_to = "value"
  ) %>%
  mutate(
    metric = recode(
      metric,
      rmse = "RMSE",
      mae = "MAE",
      test_r2 = "Test R-squared"
    )
  ) %>%
  ggplot(aes(x = model, y = value, fill = model)) +
  geom_col(width = 0.62) +
  facet_wrap(~metric, scales = "free_y") +
  scale_fill_manual(values = c(
    "Linear regression" = "#B4443F",
    "Regression tree" = "#496A81"
  )) +
  labs(
    title = "Regression Model Performance Comparison",
    subtitle = "Lower RMSE/MAE is better; higher test R-squared is better",
    x = NULL,
    y = NULL
  ) +
  theme_report(base_size = 11) +
  theme(axis.text.x = element_text(angle = 15, hjust = 1))

The comparison points the same way. Under the row-level split, the linear regression has lower error than the simple regression tree. This suggests that additive location and demographic differences capture much of the observed profile-level variation. Again, this is within-panel explanation, not future forecasting ability, and the linear model is preferred mainly because it is simpler and easier to explain in a public health report.

8.3 Observed Versus Predicted Values

regression_predictions <- reg_test %>%
  mutate(
    predicted_obesity = reg_pred_lm,
    residual = data_value - predicted_obesity,
    absolute_residual = abs(residual)
  )

regression_error_summary <- regression_predictions %>%
  summarise(
    mean_observed = mean(data_value, na.rm = TRUE),
    mean_predicted = mean(predicted_obesity, na.rm = TRUE),
    mean_residual = mean(residual, na.rm = TRUE),
    median_absolute_residual = median(absolute_residual, na.rm = TRUE),
    pct_within_3_points = mean(absolute_residual <= 3) * 100,
    pct_within_5_points = mean(absolute_residual <= 5) * 100
  )

regression_error_summary %>%
  kable(digits = 3, caption = "Linear Regression Prediction Error Summary")
Linear Regression Prediction Error Summary
mean_observed mean_predicted mean_residual median_absolute_residual pct_within_3_points pct_within_5_points
30.58 30.554 0.026 1.743 72.199 87.752
readr::write_csv(regression_predictions, "project_output_regression_predictions.csv")
readr::write_csv(regression_error_summary, "project_output_regression_error_summary.csv")
set.seed(7004)

regression_predictions %>%
  slice_sample(n = min(5000, nrow(regression_predictions))) %>%
  ggplot(aes(x = predicted_obesity, y = data_value)) +
  geom_point(alpha = 0.3, colour = "#385A6B", size = 1.6) +
  geom_abline(intercept = 0, slope = 1, colour = "#B4443F", linewidth = 1) +
  scale_x_continuous(labels = scales::label_number(suffix = "%")) +
  scale_y_continuous(labels = scales::label_number(suffix = "%")) +
  labs(
    title = "Observed Versus Predicted Adult Obesity",
    subtitle = "Linear regression predictions on the held-out test set",
    x = "Predicted adult obesity",
    y = "Observed adult obesity"
  ) +
  theme_report()

Most points lie close to the 45-degree line, so the model captures the main structure of obesity differences across public health profiles. The median absolute residual is 1.74 percentage points, and 87.8% of test records are predicted within five percentage points of the observed value.

9 Regression Diagnostics / Interpretation

9.1 Cross-Validation Stability Check

A single train-test split can be lucky or unlucky. To check that the result is not driven by one split, a simple 5-fold cross-validation is run for the linear regression. No hyperparameters are tuned; the goal is only to check whether performance is stable across different row-level splits.

rng_state <- .Random.seed

set.seed(2026)
cv_folds <- sample(rep(1:5, length.out = nrow(regression_df)))

cv_results <- map_dfr(1:5, function(fold_id) {
  cv_train <- regression_df[cv_folds != fold_id, ]
  cv_test <- regression_df[cv_folds == fold_id, ]

  cv_guard <- drop_unseen_levels(cv_train, cv_test)
  cv_test <- cv_guard$test
  cv_train <- droplevels(cv_train)

  cv_fit <- lm(reg_formula, data = cv_train)
  cv_pred <- predict(cv_fit, newdata = cv_test)

  tibble(
    fold = fold_id,
    n_test = nrow(cv_test),
    rows_dropped = cv_guard$dropped,
    rmse = sqrt(mean((cv_test$data_value - cv_pred)^2)),
    test_r2 = 1 - sum((cv_test$data_value - cv_pred)^2) /
      sum((cv_test$data_value - mean(cv_test$data_value))^2)
  )
})

cv_summary <- cv_results %>%
  summarise(
    mean_rmse = mean(rmse),
    sd_rmse = sd(rmse),
    mean_test_r2 = mean(test_r2),
    sd_test_r2 = sd(test_r2),
    total_rows_dropped = sum(rows_dropped)
  )

cv_results %>%
  kable(digits = 4, caption = "Five-Fold Cross-Validation Results")
Five-Fold Cross-Validation Results
fold n_test rows_dropped rmse test_r2
1 2874 0 3.5271 0.7697
2 2874 0 3.6339 0.7591
3 2873 0 3.7346 0.7518
4 2873 0 3.5833 0.7552
5 2873 0 3.5892 0.7637
cv_summary %>%
  kable(digits = 4, caption = "Cross-Validation Stability Summary")
Cross-Validation Stability Summary
mean_rmse sd_rmse mean_test_r2 sd_test_r2 total_rows_dropped
3.6136 0.0776 0.7599 0.007 0
readr::write_csv(cv_results, "project_output_regression_cv.csv")
readr::write_csv(cv_summary, "project_output_regression_cv_summary.csv")

.Random.seed <- rng_state

Across the five folds, the average test R-squared is 0.76, with a standard deviation of 0.007. The low variation shows the result is not caused by one special split. But this check is still limited: every fold shares most of the same locations and demographic groups, so it only checks interpolation stability within the observed panel. It does not test generalisation to new locations, future years, or unseen demographic groups, which would need leave-one-state-out or forward-in-time validation (noted as future work).

9.2 Residual Diagnostics

The diagnostics check whether the linear model is a reasonable working description of the data. The main questions are simple: are the residuals centred near zero, do the errors show strong patterns, and do a few hard-to-fit profiles dominate the results?

residual_diagnostics <- regression_predictions %>%
  transmute(
    observed = data_value,
    fitted = predicted_obesity,
    residual = residual,
    absolute_residual = absolute_residual,
    standardised_residual = as.numeric(scale(residual))
  )

residual_summary <- residual_diagnostics %>%
  summarise(
    mean_residual = mean(residual, na.rm = TRUE),
    sd_residual = sd(residual, na.rm = TRUE),
    median_residual = median(residual, na.rm = TRUE),
    p95_absolute_residual = quantile(abs(residual), 0.95, na.rm = TRUE),
    max_absolute_residual = max(abs(residual), na.rm = TRUE)
  )

large_residual_profiles <- regression_predictions %>%
  arrange(desc(absolute_residual)) %>%
  slice_head(n = 10)

residual_summary %>%
  kable(digits = 3, caption = "Residual Summary for Linear Regression")
Residual Summary for Linear Regression
mean_residual sd_residual median_residual p95_absolute_residual max_absolute_residual
0.026 3.733 0.078 7.72 23.691
large_residual_profiles %>%
  kable(digits = 3, caption = "Largest Absolute Residuals in the Test Set")
Largest Absolute Residuals in the Test Set
data_value yearstart locationdesc stratification1 predicted_obesity residual absolute_residual
16.7 2018 Ohio American Indian/Alaska Native 40.391 -23.691 23.691
54.3 2014 South Dakota 2 or more races 32.298 22.002 22.002
56.2 2013 Hawaii Hawaiian/Pacific Islander 34.471 21.729 21.729
12.6 2021 Virgin Islands Non-Hispanic White 34.279 -21.679 21.679
17.5 2018 Georgia American Indian/Alaska Native 38.500 -21.000 21.000
15.0 2011 Maine Non-Hispanic Black 34.260 -19.260 19.260
19.3 2011 Arkansas American Indian/Alaska Native 37.986 -18.686 18.686
53.6 2014 Arkansas 2 or more races 35.166 18.434 18.434
8.6 2011 Georgia Other 26.639 -18.039 18.039
17.4 2015 Kentucky 2 or more races 35.025 -17.625 17.625
readr::write_csv(residual_diagnostics, "project_output_regression_residual_diagnostics.csv")
readr::write_csv(residual_summary, "project_output_regression_residual_summary.csv")
readr::write_csv(large_residual_profiles, "project_output_regression_largest_residuals.csv")
residual_diagnostics %>%
  ggplot(aes(x = fitted, y = residual)) +
  geom_hline(yintercept = 0, colour = "#5D6975", linetype = "dashed") +
  geom_point(alpha = 0.3, colour = "#385A6B", size = 1.4) +
  geom_smooth(se = FALSE, colour = "#B4443F", linewidth = 0.9) +
  scale_x_continuous(labels = scales::label_number(suffix = "%")) +
  scale_y_continuous(labels = scales::label_number(suffix = " pts")) +
  labs(
    title = "Residuals Versus Fitted Values",
    x = "Fitted adult obesity",
    y = "Residual"
  ) +
  theme_report(base_size = 11)

residual_diagnostics %>%
  ggplot(aes(x = residual)) +
  geom_histogram(bins = 40, fill = "#496A81", colour = "white") +
  geom_vline(xintercept = 0, colour = "#B4443F", linewidth = 1) +
  scale_x_continuous(labels = scales::label_number(suffix = " pts")) +
  labs(
    title = "Residual Distribution",
    x = "Residual",
    y = "Profiles"
  ) +
  theme_report(base_size = 11)

qq_values <- qqnorm(residual_diagnostics$standardised_residual, plot.it = FALSE)
qq_data <- tibble(
  theoretical = qq_values$x,
  sample = qq_values$y
)

qq_data %>%
  ggplot(aes(x = theoretical, y = sample)) +
  geom_point(alpha = 0.35, colour = "#385A6B", size = 1.4) +
  geom_abline(intercept = 0, slope = 1, colour = "#B4443F", linewidth = 1) +
  labs(
    title = "Normal Q-Q Check",
    x = "Theoretical quantiles",
    y = "Sample quantiles"
  ) +
  theme_report(base_size = 11)

residual_diagnostics %>%
  mutate(sqrt_abs_standardised = sqrt(abs(standardised_residual))) %>%
  ggplot(aes(x = fitted, y = sqrt_abs_standardised)) +
  geom_point(alpha = 0.3, colour = "#385A6B", size = 1.4) +
  geom_smooth(se = FALSE, colour = "#B4443F", linewidth = 0.9) +
  scale_x_continuous(labels = scales::label_number(suffix = "%")) +
  labs(
    title = "Scale-Location Check",
    x = "Fitted adult obesity",
    y = "Sqrt(abs(standardised residual))"
  ) +
  theme_report(base_size = 11)

The residuals are centred close to zero, so the model does not strongly overpredict or underpredict obesity overall. The histogram and Q-Q plot show most residuals near zero, but the tails are heavier than a perfect normal distribution. So most profiles are fitted reasonably well, while a few profiles are still hard to explain using only year, location, and stratification group. The scale-location plot does not show an extreme funnel shape, but some heterogeneity is still expected: the rows are aggregated survey estimates, and their precision varies with sample size, which is not modelled here (no weighting is applied). So these plots are used as practical checks, not as a formal test that the constant-variance assumption fully holds.

9.3 Factor-Effect Interpretation

The metrics show the model captures much of the observed profile-level structure, but it is more useful once the coefficients are read. Because locationdesc and stratification1 are categorical, the linear regression uses dummy variables. Each location or stratification coefficient is a difference from a reference level, holding year and the other profile variables constant.

lm_coef_tbl <- tibble(
  term = names(coef(reg_lm)),
  estimate = as.numeric(coef(reg_lm))
) %>%
  filter(!is.na(estimate))

year_effect <- lm_coef_tbl %>%
  filter(term == "yearstart") %>%
  pull(estimate)

ref_location <- levels(reg_train$locationdesc)[1]
ref_stratification <- levels(reg_train$stratification1)[1]

location_effects <- lm_coef_tbl %>%
  filter(str_detect(term, "^locationdesc")) %>%
  transmute(
    location = str_remove(term, "^locationdesc"),
    effect_vs_reference = estimate
  ) %>%
  arrange(desc(effect_vs_reference))

stratification_effects <- lm_coef_tbl %>%
  filter(str_detect(term, "^stratification1")) %>%
  transmute(
    stratification_group = str_remove(term, "^stratification1"),
    effect_vs_reference = estimate
  ) %>%
  arrange(desc(effect_vs_reference))

location_effect_extremes <- bind_rows(
  location_effects %>%
    slice_head(n = 8) %>%
    mutate(direction = "Higher than reference"),
  location_effects %>%
    slice_tail(n = 8) %>%
    mutate(direction = "Lower than reference")
) %>%
  mutate(direction = factor(direction, levels = c("Higher than reference", "Lower than reference")))

stratification_effect_extremes <- bind_rows(
  stratification_effects %>%
    slice_head(n = 8) %>%
    mutate(direction = "Higher than reference"),
  stratification_effects %>%
    slice_tail(n = 8) %>%
    mutate(direction = "Lower than reference")
) %>%
  mutate(direction = factor(direction, levels = c("Higher than reference", "Lower than reference")))

tibble(
  coefficient = "yearstart",
  interpretation = "Estimated yearly change in adult obesity, holding location and stratification group constant",
  estimate_percentage_points = year_effect
) %>%
  kable(digits = 3, caption = "Year Effect from Linear Regression")
Year Effect from Linear Regression
coefficient interpretation estimate_percentage_points
yearstart Estimated yearly change in adult obesity, holding location and stratification group constant 0.6
location_effect_extremes %>%
  kable(digits = 3, caption = paste0(
    "Largest Location Effects Relative to Reference Location: ",
    ref_location
  ))
Largest Location Effects Relative to Reference Location: Alabama
location effect_vs_reference direction
West Virginia 2.047 Higher than reference
Mississippi 1.920 Higher than reference
Louisiana 0.340 Higher than reference
Arkansas -0.074 Higher than reference
Oklahoma -0.574 Higher than reference
Kentucky -0.816 Higher than reference
Indiana -1.464 Higher than reference
Iowa -1.487 Higher than reference
Vermont -7.976 Lower than reference
Utah -8.066 Lower than reference
California -8.284 Lower than reference
New York -8.443 Lower than reference
Massachusetts -9.096 Lower than reference
Hawaii -9.665 Lower than reference
District of Columbia -9.794 Lower than reference
Colorado -11.933 Lower than reference
stratification_effect_extremes %>%
  kable(digits = 3, caption = paste0(
    "Largest Stratification Group Effects Relative to Reference Group: ",
    ref_stratification
  ))
Largest Stratification Group Effects Relative to Reference Group: $15,000 - $24,999
stratification_group effect_vs_reference direction
Hawaiian/Pacific Islander 7.801 Higher than reference
Non-Hispanic Black 4.236 Higher than reference
American Indian/Alaska Native 2.927 Higher than reference
45 - 54 2.201 Higher than reference
Less than $15,000 1.530 Higher than reference
55 - 64 1.047 Higher than reference
Less than high school 0.373 Higher than reference
35 - 44 0.121 Higher than reference
25 - 34 -4.880 Lower than reference
Non-Hispanic White -5.133 Lower than reference
65 or older -6.010 Lower than reference
$75,000 or greater -6.376 Lower than reference
Data not reported -7.972 Lower than reference
College graduate -9.425 Lower than reference
18 - 24 -16.029 Lower than reference
Asian -22.070 Lower than reference
readr::write_csv(lm_coef_tbl, "project_output_regression_coefficients.csv")
readr::write_csv(location_effects, "project_output_regression_location_effects.csv")
readr::write_csv(stratification_effects, "project_output_regression_stratification_effects.csv")
location_effect_extremes %>%
  mutate(location = reorder(location, effect_vs_reference)) %>%
  ggplot(aes(x = location, y = effect_vs_reference, fill = direction)) +
  geom_col(width = 0.72) +
  coord_flip() +
  facet_wrap(~direction, scales = "free_y") +
  scale_fill_manual(values = c(
    "Higher than reference" = "#B4443F",
    "Lower than reference" = "#4E7D63"
  )) +
  scale_y_continuous(labels = scales::label_number(suffix = " pts")) +
  labs(
    title = "Largest Location Effects in the Linear Regression",
    subtitle = paste0("Effects are percentage-point differences relative to ", ref_location),
    x = NULL,
    y = "Effect versus reference"
  ) +
  theme_report(base_size = 11)

stratification_effect_extremes %>%
  mutate(stratification_group = reorder(stratification_group, effect_vs_reference)) %>%
  ggplot(aes(x = stratification_group, y = effect_vs_reference, fill = direction)) +
  geom_col(width = 0.72) +
  coord_flip() +
  facet_wrap(~direction, scales = "free_y") +
  scale_fill_manual(values = c(
    "Higher than reference" = "#B4443F",
    "Lower than reference" = "#4E7D63"
  )) +
  scale_y_continuous(labels = scales::label_number(suffix = " pts")) +
  labs(
    title = "Largest Demographic Group Effects in the Linear Regression",
    subtitle = paste0("Effects are percentage-point differences relative to ", ref_stratification),
    x = NULL,
    y = "Effect versus reference"
  ) +
  theme_report(base_size = 11)

All coefficients below are descriptive, model-adjusted differences within this additive profile model. They are not formal inferential or causal estimates, and the standard errors are not interpreted (see the design note on non-independent subgroup rows).

In this additive model, the yearstart coefficient is 0.6. This suggests an average upward time pattern of about 0.6 percentage points per year, or roughly 6 percentage points over ten years, which matches the upward trend in the EDA. The model has no location-specific time terms, so this is an average pattern across profiles. It does not mean every location or every group rises at the same rate each year.

The location effects also match the location ranking. Locations with the largest positive coefficients include West Virginia, Mississippi, Louisiana, and those with the most negative include Colorado, District of Columbia, Hawaii. These differences stay visible after adjusting for year and demographic stratification group.

The stratification effects show demographic group is also linked with obesity variation. Groups with the largest positive coefficients include Hawaiian/Pacific Islander, Non-Hispanic Black, American Indian/Alaska Native, and those with the most negative include Asian, 18 - 24, College graduate. This connects with the demographic EDA: obesity is not evenly distributed across reported profiles. These subgroup coefficients compare reported stratification labels with the reference label. Because stratification1 mixes age, gender, income, education, and race or ethnicity into one label variable, they should be read as descriptive profile differences from the reference group, not as separate effects of age, income, education, or race.

9.4 Does Physical Inactivity Add Explanatory Power? An Adjustment Ladder

The model above mainly restates, in one equation, what the EDA already showed: obesity differs by year, place, and demographic group. By itself, this does not add much new information. A more useful question for this project is whether physical inactivity explains obesity beyond those structural differences. The EDA found a moderate subgroup aligned correlation (0.457) but a much smaller within-profile correlation (0.073), which already hints that much of the obesity-inactivity link is stable between-profile structure. The regression can make this clear by adding controls one block at a time and watching the inactivity coefficient.

This uses the same subgroup aligned modelling dataset created in the EDA section, so each row has both an obesity and an inactivity value for the same demographic subgroup profile and excludes Total or Overall rows. Four models are fitted on the full sample. This is not a train-test prediction task; the goal is to watch how the inactivity coefficient changes. The ladder keeps the same outcome and adds controls step by step: M1 inactivity only; M2 adds year; M3 adds location; M4 adds demographic group.

ladder_df <- modelling_aligned_health %>%
  select(obesity_value, inactivity_value, yearstart, locationdesc, stratification1) %>%
  drop_na() %>%
  mutate(across(where(is.character), as.factor))

ladder_models <- list(
  "M1: inactivity only"                    = lm(obesity_value ~ inactivity_value, data = ladder_df),
  "M2: + year"                             = lm(obesity_value ~ inactivity_value + yearstart, data = ladder_df),
  "M3: + location"                         = lm(obesity_value ~ inactivity_value + yearstart + locationdesc, data = ladder_df),
  "M4: + location + demographic group"     = lm(obesity_value ~ inactivity_value + yearstart + locationdesc + stratification1, data = ladder_df)
)

ladder_summary <- imap_dfr(ladder_models, function(fit, label) {
  tibble(
    model = label,
    inactivity_coefficient = unname(coef(fit)["inactivity_value"]),
    adj_r2 = summary(fit)$adj.r.squared
  )
})

ladder_summary <- ladder_summary %>%
  mutate(
    coef_retained_vs_m1 = inactivity_coefficient / inactivity_coefficient[model == "M1: inactivity only"]
  )

ladder_summary %>%
  kable(digits = 3, caption = "Inactivity Coefficient and Fit as Controls Are Added (Adjustment Ladder)")
Inactivity Coefficient and Fit as Controls Are Added (Adjustment Ladder)
model inactivity_coefficient adj_r2 coef_retained_vs_m1
M1: inactivity only 0.401 0.209 1.000
M2: + year 0.397 0.270 0.990
M3: + location 0.357 0.384 0.892
M4: + location + demographic group 0.139 0.769 0.347
readr::write_csv(ladder_summary, "project_output_inactivity_adjustment_ladder.csv")
ladder_summary %>%
  mutate(model = factor(model, levels = ladder_summary$model)) %>%
  ggplot(aes(x = model, y = inactivity_coefficient, group = 1)) +
  geom_col(fill = bar_fill_neutral, width = 0.62) +
  geom_text(aes(label = round(inactivity_coefficient, 3)), vjust = -0.35, colour = "#243447", size = 3.6) +
  geom_hline(yintercept = 0, colour = "#5D6975", linetype = "dashed") +
  labs(
    title = "Inactivity Coefficient Shrinks as Geography and Demographics Are Controlled",
    subtitle = "Each step adds a control block; a coefficient that collapses signals between-profile confounding",
    x = NULL,
    y = "Coefficient on inactivity (pts obesity per pt inactivity)"
  ) +
  theme_report(base_size = 11) +
  theme(axis.text.x = element_text(angle = 15, hjust = 1))

The inactivity coefficient falls from 0.401 in M1 to 0.139 in M4. After location and demographic group are added, it keeps only about 35% of its unadjusted size. Most of the drop happens when demographic group enters. This suggests that the raw association is strongly linked to demographic profile differences. For example, some income, education, and race or ethnicity profiles may be high on both indicators. This is the most useful result, and one the EDA cannot give: the coefficient pattern suggests that a large part of the apparent obesity-inactivity association is tied to between-profile structure, while a modest part survives adjustment.

These numbers also fit the earlier within-profile check on a sensible gradient. M4 still leaves some association because it only adjusts for average location and group differences. The within-profile demeaned correlation (0.073) is stricter, because it removes each full profile’s mean and keeps only year-to-year movement, and it is close to zero. The more completely profile identity is absorbed, the less remains – all pointing the same way: the obesity-inactivity link is mainly stable between-profile differences, not an independent within-profile relationship. The remaining M4 coefficient should not be over-read as a clean effect.

There are three cautions. First, this is an ecological regression: obesity and inactivity are both survey-estimated outcomes for aggregated groups, so the coefficient is an association between two outcomes, not a behavioural or causal effect, and it has no direction. Second, inactivity_value is also an estimate with measurement error, so the coefficient is pulled toward zero (regression dilution); the true link could be a bit stronger than M4 shows. Third, no precision weighting is used because the aligned data does not carry sample size; inverse-variance weighting is left as future work. So the ladder answers the variable-value question descriptively: most of the obesity-inactivity association is structural, and after geography and demographics are accounted for, physical inactivity adds only modest additional descriptive explanatory power for obesity in this aggregated surveillance panel.

One more point on the two regressions. The structural model at the start of this section is not the main evidence about the obesity-inactivity relationship; it just summarises how obesity itself varies by year, location, and demographic group. The adjustment ladder is the main regression evidence on variable value, and its message is that inactivity is mostly a marker of between-profile structure, not an independent explainer.

9.5 Interpretation of Model Results

The regression shows that adult obesity is structured by year, location, and demographic group. Its performance should be read as profile-level explanation within the observed surveillance panel. Because locationdesc and stratification1 are predictors, the model is mainly learning differences among already-observed locations and groups. This fits the project goal of profiling existing records, but it should not be described as strong future forecasting ability for new locations or new demographic categories.

The practical interpretation is:

  1. Year matters: obesity values increase over the study period, consistent with the EDA trend.
  2. Location matters: geographic differences remain large, consistent with the location ranking.
  3. Demographic stratification matters: profile-level differences by age, income, education, and race or ethnicity help explain observed obesity variation.
  4. Most of the inactivity link is structural: the coefficient pattern suggests that much of the obesity-inactivity association reflects stable between-profile differences. Only a modest and hard-to-interpret residual association remains.
  5. Model performance is explanatory, not causal: the regression identifies structured differences in observed public health profiles, but it does not estimate causal effects.

Overall, the regression has two roles. The structural model confirms the EDA findings and puts them in one additive equation: obesity varies by time, place, and demographic group. The adjustment ladder then goes one step further, showing that most of the obesity-inactivity association is carried by that same structure, with only a modest residual after adjustment. Together they set up the classification stage, where inactivity is still useful as one profile feature for ranking and screening, not as a clean causal driver.

Future work on inference. This report treats the coefficients as descriptive and does not interpret p-values. A more formal analysis could use cluster-robust standard errors clustered by locationdesc, or a linear mixed-effects model. These methods would give more careful uncertainty estimates. However, they would still need caution because demographic rows can be different slices of the same survey sample.

10 Classification Model: High-Obesity Subgroup Profile Screening

10.1 Classification Objective

The classification section addresses the fourth research question:

Can high-obesity demographic subgroup public health profiles be screened using physical inactivity levels together with year, location, and demographic characteristics?

This is a different task from the regression model. The regression model explains the continuous adult obesity percentage across public health profiles, while the classification model converts the same profile-level information into a screening task: identifying which profiles are more likely to fall into the high-obesity group.

To keep the modelling level consistent with the revised EDA and regression sections, the classification analysis uses the same subgroup aligned modelling dataset created after the obesity-inactivity alignment. Total or Overall rows are excluded because they are aggregate summaries rather than independent subgroup observations. This avoids mixing statewide total records with demographic subgroup records in the same screening model.

The target is defined as:

high_obesity = 1 if obesity_value >= 75th percentile of TRAINING obesity values
high_obesity = 0 otherwise

The 75th percentile threshold is calculated from the training split only. This avoids test-set leakage and keeps the classification task honest. The threshold is not a medical standard; it is a relative rule for public health screening and prioritisation.

10.2 Train-Test Split and Classification Models

classification_health <- modelling_aligned_health

classification_predictors <- c(
  "inactivity_value",
  "yearstart",
  "locationdesc",
  "stratification1"
)

classification_predictors <- classification_predictors[
  classification_predictors %in% names(classification_health)
]

classification_df <- classification_health %>%
  select(obesity_value, all_of(classification_predictors)) %>%
  drop_na() %>%
  mutate(across(where(is.character), as.factor))

set.seed(7004)
classification_train_index <- sample.int(
  nrow(classification_df),
  floor(0.8 * nrow(classification_df))
)

cls_train <- classification_df[classification_train_index, ]
cls_test <- classification_df[-classification_train_index, ]

cls_guard <- drop_unseen_levels(cls_train, cls_test)
cls_test <- cls_guard$test
cls_train <- droplevels(cls_train)

obesity_threshold <- unname(quantile(cls_train$obesity_value, 0.75, na.rm = TRUE))

cls_train <- cls_train %>%
  mutate(high_obesity = if_else(obesity_value >= obesity_threshold, 1, 0))

cls_test <- cls_test %>%
  mutate(high_obesity = if_else(obesity_value >= obesity_threshold, 1, 0))

cls_formula <- as.formula(
  paste("high_obesity ~", paste(classification_predictors, collapse = " + "))
)

cls_glm <- glm(cls_formula, data = cls_train, family = binomial())

cls_tree <- rpart(
  as.formula(paste(
    "factor(high_obesity) ~",
    paste(classification_predictors, collapse = " + ")
  )),
  data = cls_train,
  method = "class"
)

cls_prob_glm <- predict(cls_glm, newdata = cls_test, type = "response")
cls_prob_tree <- predict(cls_tree, newdata = cls_test, type = "prob")[, "1"]

classification_design_summary <- tibble(
  target = "High-obesity subgroup public health profile",
  analysis_scope = "Demographic subgroup profiles only; Total/Overall rows excluded",
  threshold_source = "75th percentile of training-set obesity values",
  obesity_threshold = obesity_threshold,
  predictors = paste(classification_predictors, collapse = " + "),
  original_aligned_profiles = nrow(aligned_health),
  subgroup_profiles_used = nrow(classification_health),
  excluded_total_or_overall_profiles = nrow(aligned_health) - nrow(classification_health),
  modelling_rows = nrow(classification_df),
  train_rows = nrow(cls_train),
  test_rows_after_guard = nrow(cls_test),
  test_rows_dropped_unseen_levels = cls_guard$dropped,
  positive_rate_test = mean(cls_test$high_obesity == 1)
)

classification_design_summary %>%
  kable(digits = 4, caption = "Classification Modelling Design Summary")
Classification Modelling Design Summary
target analysis_scope threshold_source obesity_threshold predictors original_aligned_profiles subgroup_profiles_used excluded_total_or_overall_profiles modelling_rows train_rows test_rows_after_guard test_rows_dropped_unseen_levels positive_rate_test
High-obesity subgroup public health profile Demographic subgroup profiles only; Total/Overall rows excluded 75th percentile of training-set obesity values 35.2 inactivity_value + yearstart + locationdesc + stratification1 14922 14334 588 14334 11467 2867 0 0.2483
readr::write_csv(classification_design_summary, "project_output_classification_design_summary.csv")

Logistic regression is used as the main classification model because it is interpretable and suitable for a binary screening task. A classification tree is included as a benchmark model. The purpose is not to diagnose individuals or predict future obesity status, but to rank observed subgroup profiles for public health attention.

10.3 Classification Performance Metrics

For public health prioritisation, the model should not only classify profiles at one threshold. It should also rank profiles so that the highest-risk subgroup profiles can be reviewed first. Therefore, the evaluation reports both threshold-based metrics and ranking-based metrics.

calc_auc_pr <- function(actual, prob, positive = 1) {
  y <- if_else(actual == positive, 1, 0)
  positive_n <- sum(y == 1)

  if (positive_n == 0) {
    return(NA_real_)
  }

  ordered_index <- order(prob, decreasing = TRUE)
  y_sorted <- y[ordered_index]

  tp <- cumsum(y_sorted == 1)
  fp <- cumsum(y_sorted == 0)

  recall <- tp / positive_n
  precision <- tp / pmax(tp + fp, 1)

  recall <- c(0, recall)
  precision <- c(1, precision)

  sum(diff(recall) * (precision[-1] + precision[-length(precision)]) / 2)
}

calc_auc_roc <- function(actual, prob, positive = 1) {
  y <- if_else(actual == positive, 1, 0)
  positive_n <- sum(y == 1)
  negative_n <- sum(y == 0)

  if (positive_n == 0 || negative_n == 0) {
    return(NA_real_)
  }

  rank_sum_positive <- sum(rank(prob)[y == 1])
  (rank_sum_positive - positive_n * (positive_n + 1) / 2) /
    (positive_n * negative_n)
}

top_fraction_capture <- function(actual, prob, positive = 1, fraction = 0.20) {
  ranking_df <- tibble(actual = actual, prob = prob) %>%
    arrange(desc(prob))

  n_top <- max(1, floor(nrow(ranking_df) * fraction))
  top_df <- ranking_df %>% slice_head(n = n_top)

  overall_rate <- mean(ranking_df$actual == positive)
  top_rate <- mean(top_df$actual == positive)

  tibble(
    top_fraction = fraction,
    overall_rate = overall_rate,
    top_rate = top_rate,
    lift = top_rate / overall_rate,
    capture = sum(top_df$actual == positive) / sum(ranking_df$actual == positive)
  )
}

classification_metric_row <- function(model_name, actual, prob, threshold = 0.5) {
  predicted <- if_else(prob >= threshold, 1, 0)

  tp <- sum(predicted == 1 & actual == 1)
  fp <- sum(predicted == 1 & actual == 0)
  tn <- sum(predicted == 0 & actual == 0)
  fn <- sum(predicted == 0 & actual == 1)

  precision <- tp / max(tp + fp, 1)
  recall <- tp / max(tp + fn, 1)
  specificity <- tn / max(tn + fp, 1)
  accuracy <- (tp + tn) / max(tp + tn + fp + fn, 1)
  f1 <- if ((precision + recall) > 0) {
    2 * precision * recall / (precision + recall)
  } else {
    0
  }

  top20 <- top_fraction_capture(actual, prob, fraction = 0.20)
  auc_pr <- calc_auc_pr(actual, prob)
  auc_roc <- calc_auc_roc(actual, prob)

  tibble(
    model = model_name,
    target = "High-obesity subgroup profile",
    obesity_threshold_q75 = obesity_threshold,
    n_train = nrow(cls_train),
    n_test = length(actual),
    positive_rate = mean(actual == 1),
    roc_auc = auc_roc,
    auc_pr = auc_pr,
    auc_pr_lift = auc_pr / mean(actual == 1),
    top20_lift = top20$lift,
    top20_capture = top20$capture,
    accuracy_threshold_0_5 = accuracy,
    precision_threshold_0_5 = precision,
    recall_threshold_0_5 = recall,
    specificity_threshold_0_5 = specificity,
    f1_threshold_0_5 = f1
  )
}

classification_metrics <- bind_rows(
  classification_metric_row("Logistic regression", cls_test$high_obesity, cls_prob_glm),
  classification_metric_row("Classification tree", cls_test$high_obesity, cls_prob_tree)
)

classification_metrics %>%
  kable(digits = 4, caption = "Classification Model Performance")
Classification Model Performance
model target obesity_threshold_q75 n_train n_test positive_rate roc_auc auc_pr auc_pr_lift top20_lift top20_capture accuracy_threshold_0_5 precision_threshold_0_5 recall_threshold_0_5 specificity_threshold_0_5 f1_threshold_0_5
Logistic regression High-obesity subgroup profile 35.2 11467 2867 0.2483 0.9281 0.8152 3.2824 3.2818 0.6559 0.8804 0.7896 0.7065 0.9378 0.7457
Classification tree High-obesity subgroup profile 35.2 11467 2867 0.2483 0.8478 0.6792 2.7349 3.0358 0.6067 0.8556 0.7435 0.6390 0.9271 0.6873
readr::write_csv(classification_metrics, "project_output_classification_metrics.csv")
classification_metrics %>%
  select(model, roc_auc, auc_pr, top20_capture, f1_threshold_0_5) %>%
  pivot_longer(
    cols = c(roc_auc, auc_pr, top20_capture, f1_threshold_0_5),
    names_to = "metric",
    values_to = "value"
  ) %>%
  mutate(
    metric = recode(
      metric,
      roc_auc = "ROC-AUC",
      auc_pr = "AUC-PR",
      top20_capture = "Top-20% capture",
      f1_threshold_0_5 = "F1 at 0.5"
    )
  ) %>%
  ggplot(aes(x = model, y = value, fill = model)) +
  geom_col(width = 0.62) +
  facet_wrap(~metric, scales = "free_y") +
  scale_fill_manual(values = c(
    "Logistic regression" = "#B4443F",
    "Classification tree" = "#496A81"
  )) +
  labs(
    title = "Classification Model Performance Comparison",
    subtitle = "Ranking metrics are especially important for public health prioritisation",
    x = NULL,
    y = NULL
  ) +
  theme_report(base_size = 11) +
  theme(axis.text.x = element_text(angle = 15, hjust = 1))

The logistic regression is carried forward as the main classification model because it provides strong ranking performance while remaining interpretable. Its AUC-PR is 0.815, which is 3.28 times the test-set positive rate. The top 20% highest-ranked subgroup profiles capture 65.6% of actual high-obesity subgroup profiles, which is useful when public health attention must be prioritised.

10.4 Confusion Matrix at 0.5 Reference Threshold

The confusion matrix uses 0.5 as a simple reference threshold for interpretation. This is not treated as the final public health decision rule, because prioritisation is better supported by ranking metrics such as AUC-PR, top-20% capture, and risk decile validation.

cls_pred_glm <- if_else(cls_prob_glm >= 0.5, 1, 0)

confusion_table <- table(
  Predicted = factor(
    cls_pred_glm,
    levels = c(0, 1),
    labels = c("Lower risk", "High risk")
  ),
  Actual = factor(
    cls_test$high_obesity,
    levels = c(0, 1),
    labels = c("Not high obesity", "High obesity")
  )
)

tn_cm <- confusion_table["Lower risk", "Not high obesity"]
fp_cm <- confusion_table["High risk", "Not high obesity"]
fn_cm <- confusion_table["Lower risk", "High obesity"]
tp_cm <- confusion_table["High risk", "High obesity"]

confusion_summary <- tibble(
  metric = c(
    "Sensitivity / Recall",
    "Specificity",
    "Positive Predictive Value / Precision",
    "Negative Predictive Value",
    "Accuracy"
  ),
  value = c(
    tp_cm / max(tp_cm + fn_cm, 1),
    tn_cm / max(tn_cm + fp_cm, 1),
    tp_cm / max(tp_cm + fp_cm, 1),
    tn_cm / max(tn_cm + fn_cm, 1),
    (tp_cm + tn_cm) / max(tp_cm + tn_cm + fp_cm + fn_cm, 1)
  )
)

confusion_table %>%
  kable(caption = "Confusion Matrix for Logistic Regression at 0.5 Threshold")
Confusion Matrix for Logistic Regression at 0.5 Threshold
Not high obesity High obesity
Lower risk 2021 209
High risk 134 503
confusion_summary %>%
  kable(digits = 4, caption = "Confusion Matrix Summary")
Confusion Matrix Summary
metric value
Sensitivity / Recall 0.7065
Specificity 0.9378
Positive Predictive Value / Precision 0.7896
Negative Predictive Value 0.9063
Accuracy 0.8804
readr::write_csv(
  as.data.frame.matrix(confusion_table) %>% rownames_to_column("Predicted"),
  "project_output_confusion_matrix.csv"
)
readr::write_csv(confusion_summary, "project_output_confusion_summary.csv")
classification_predictions <- cls_test %>%
  mutate(
    high_obesity_score = cls_prob_glm,
    predicted_high_obesity = cls_pred_glm
  )

classification_predictions %>%
  ggplot(aes(x = high_obesity_score, fill = factor(high_obesity))) +
  geom_histogram(bins = 40, alpha = 0.65, position = "identity") +
  scale_fill_manual(values = c("0" = "#496A81", "1" = "#B4443F")) +
  labs(
    title = "Predicted High-Obesity Score Distribution",
    subtitle = "Higher scores indicate higher predicted probability of belonging to the high-obesity group",
    x = "Predicted high-obesity score",
    y = "Profiles",
    fill = "Actual high obesity"
  ) +
  theme_report()

readr::write_csv(classification_predictions, "project_output_classification_predictions.csv")

At the 0.5 reference threshold, the logistic regression achieves an accuracy of 0.88 and a recall of 0.706. For public health screening, recall and ranking performance are especially important because the goal is to avoid missing many high-obesity subgroup profiles when prioritising follow-up attention.

11 Risk Ranking Validation

11.1 Risk Decile Validation

The classification model is most useful if its risk scores rank subgroup profiles meaningfully. To validate the ranking, test profiles are divided into ten deciles based on predicted high-obesity score. Decile 10 contains the highest predicted-risk subgroup profiles.

risk_decile_summary <- classification_predictions %>%
  mutate(
    risk_decile = ntile(high_obesity_score, 10)
  ) %>%
  group_by(risk_decile) %>%
  summarise(
    profiles = n(),
    average_score = mean(high_obesity_score, na.rm = TRUE),
    observed_high_obesity_rate = mean(high_obesity == 1),
    average_obesity_value = mean(obesity_value, na.rm = TRUE),
    average_inactivity_value = mean(inactivity_value, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(risk_decile)

top_decile <- risk_decile_summary %>%
  filter(risk_decile == max(risk_decile))

bottom_decile <- risk_decile_summary %>%
  filter(risk_decile == min(risk_decile))

top_risk_profiles <- classification_predictions %>%
  arrange(desc(high_obesity_score)) %>%
  slice_head(n = 10) %>%
  transmute(
    Year = as.integer(yearstart),
    Location = locationdesc,
    Group = stratification1,
    Inactivity = round(inactivity_value, 1),
    Obesity = round(obesity_value, 1),
    Score = round(high_obesity_score, 3),
    ActualHighObesity = high_obesity
  )

risk_decile_summary %>%
  kable(digits = 4, caption = "Risk Decile Validation for Logistic Regression")
Risk Decile Validation for Logistic Regression
risk_decile profiles average_score observed_high_obesity_rate average_obesity_value average_inactivity_value
1 287 0.0000 0.0070 17.7571 19.2976
2 287 0.0018 0.0035 24.3613 21.9387
3 287 0.0086 0.0035 26.9509 22.6714
4 287 0.0255 0.0418 29.1000 23.6540
5 287 0.0624 0.0592 30.4547 24.9732
6 287 0.1322 0.0976 31.6300 26.9049
7 287 0.2498 0.2195 33.0031 27.4808
8 286 0.4361 0.4266 34.7374 28.7986
9 286 0.6780 0.7308 36.9612 30.4559
10 286 0.9104 0.8986 40.4563 33.3958
top_risk_profiles %>%
  kable(caption = "Top 10 Highest-Risk Subgroup Profiles by Predicted Score")
Top 10 Highest-Risk Subgroup Profiles by Predicted Score
Year Location Group Inactivity Obesity Score ActualHighObesity
2018 Mississippi Non-Hispanic Black 34.6 45.7 0.998 1
2017 Mississippi Non-Hispanic Black 36.2 45.9 0.997 1
2019 Mississippi Less than $15,000 54.3 42.8 0.997 1
2021 Kentucky Non-Hispanic Black 37.4 46.8 0.996 1
2021 Mississippi 45 - 54 30.0 51.6 0.996 1
2019 Mississippi 45 - 54 40.0 50.1 0.996 1
2020 Kentucky Non-Hispanic Black 39.3 45.4 0.996 1
2020 Mississippi 55 - 64 37.5 43.9 0.995 1
2019 West Virginia 45 - 54 32.5 50.2 0.994 1
2017 Louisiana Non-Hispanic Black 36.9 42.2 0.993 1
readr::write_csv(risk_decile_summary, "project_output_high_obesity_risk_decile.csv")
readr::write_csv(top_risk_profiles, "project_output_top_risk_profiles.csv")
risk_decile_summary %>%
  ggplot(aes(x = factor(risk_decile), y = observed_high_obesity_rate)) +
  geom_col(fill = "#B4443F", width = 0.68) +
  geom_line(aes(group = 1), colour = "#243447", linewidth = 0.8) +
  geom_point(colour = "#243447", size = 2) +
  scale_y_continuous(labels = scales::label_percent(accuracy = 1)) +
  labs(
    title = "Observed High-Obesity Rate by Predicted Risk Decile",
    subtitle = "A useful screening model should show higher observed risk in higher score deciles",
    x = "Predicted risk decile",
    y = "Observed high-obesity rate"
  ) +
  theme_report()

The risk decile validation shows a clear ranking pattern. The lowest decile has an observed high-obesity rate of 0.7%, while the highest decile has an observed high-obesity rate of 89.9%. In decision terms, if a public health team works down the ranked list, it reaches genuinely high-risk subgroup profiles much earlier than random selection. This means the model is not only classifying profiles at a fixed threshold; it is also ordering subgroup profiles in a way that is meaningful for prioritisation within the observed surveillance panel.

12 Public Health Intervention Matrix

12.1 Translating Indicators into Intervention Segments

The classification model provides a screening score, but public health planning also needs interpretable action groups. The intervention matrix translates the two aligned indicators into four public health segments using the 75th percentile cut-offs for obesity and physical inactivity. It uses the same subgroup-level base as the classification model, so the intervention groups describe demographic subgroup profiles rather than a mixture of subgroup records and aggregate summaries.

The four segments are:

Segment Profile Pattern Suggested Public Health Focus
Priority Intervention High obesity and high inactivity Combined obesity prevention and physical activity promotion
Weight Management Focus High obesity and lower inactivity Nutrition, weight management, and obesity prevention support
Preventive Physical Activity Focus Lower obesity and high inactivity Early physical activity promotion and prevention
Standard Monitoring Lower obesity and lower inactivity Routine surveillance and monitoring
priority_base <- classification_health

obesity_q75 <- unname(quantile(priority_base$obesity_value, 0.75, na.rm = TRUE))
inactivity_q75 <- unname(quantile(priority_base$inactivity_value, 0.75, na.rm = TRUE))

action_matrix_data <- priority_base %>%
  mutate(
    obesity_level = if_else(obesity_value >= obesity_q75, "High obesity", "Lower obesity"),
    inactivity_level = if_else(inactivity_value >= inactivity_q75, "High inactivity", "Lower inactivity"),
    public_health_segment = case_when(
      obesity_level == "High obesity" & inactivity_level == "High inactivity" ~
        "Priority Intervention",
      obesity_level == "High obesity" & inactivity_level == "Lower inactivity" ~
        "Weight Management Focus",
      obesity_level == "Lower obesity" & inactivity_level == "High inactivity" ~
        "Preventive Physical Activity Focus",
      TRUE ~ "Standard Monitoring"
    ),
    public_health_segment = factor(
      public_health_segment,
      levels = c(
        "Priority Intervention",
        "Weight Management Focus",
        "Preventive Physical Activity Focus",
        "Standard Monitoring"
      )
    )
  )

action_matrix_summary <- action_matrix_data %>%
  group_by(public_health_segment) %>%
  summarise(
    profiles = n(),
    share = profiles / nrow(action_matrix_data),
    average_obesity = mean(obesity_value, na.rm = TRUE),
    average_inactivity = mean(inactivity_value, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(public_health_segment)

action_matrix_summary %>%
  kable(digits = 4, caption = "Public Health Intervention Matrix Summary")
Public Health Intervention Matrix Summary
public_health_segment profiles share average_obesity average_inactivity
Priority Intervention 1625 0.1134 39.3788 37.8977
Weight Management Focus 1970 0.1374 38.9139 25.1503
Preventive Physical Activity Focus 1960 0.1367 30.4933 36.5277
Standard Monitoring 8779 0.6125 26.9292 21.5793
readr::write_csv(action_matrix_data, "project_output_public_health_action_matrix_data.csv")
readr::write_csv(action_matrix_summary, "project_output_public_health_action_matrix_summary.csv")
action_matrix_data %>%
  ggplot(aes(x = inactivity_value, y = obesity_value, colour = public_health_segment)) +
  geom_point(alpha = 0.35, size = 1.5) +
  geom_vline(xintercept = inactivity_q75, linetype = "dashed", colour = "#5D6975") +
  geom_hline(yintercept = obesity_q75, linetype = "dashed", colour = "#5D6975") +
  scale_colour_manual(values = c(
    "Priority Intervention" = "#B4443F",
    "Weight Management Focus" = "#D18F3F",
    "Preventive Physical Activity Focus" = "#2F6F8F",
    "Standard Monitoring" = "#7A8793"
  )) +
  scale_x_continuous(labels = scales::label_number(suffix = "%")) +
  scale_y_continuous(labels = scales::label_number(suffix = "%")) +
  labs(
    title = "Public Health Intervention Matrix",
    subtitle = "Segments are based on 75th percentile cut-offs for subgroup aligned obesity and inactivity values",
    x = "No leisure-time physical activity",
    y = "Adult obesity",
    colour = "Segment"
  ) +
  theme_report()

The Priority Intervention segment contains subgroup profiles where both obesity and inactivity are relatively high. This segment represents 11.3% of subgroup aligned profiles, with an average obesity value of 39.38% and an average inactivity value of 37.9%. These profiles are the most natural starting point for combined obesity prevention and physical activity promotion, but the segment should be read as decision support rather than an automatic policy list.

13 Threshold Sensitivity and Robust Core

13.1 Sensitivity to Q70, Q75, and Q80 Cut-Offs

The 75th percentile cut-off is a design choice, not a clinical threshold. To check whether the intervention priority result depends too heavily on one cut-off, the matrix is recalculated using 70th, 75th, and 80th percentile thresholds on the same subgroup-level aligned data.

priority_flag_at <- function(df, q) {
  obesity_cutoff <- unname(quantile(df$obesity_value, q, na.rm = TRUE))
  inactivity_cutoff <- unname(quantile(df$inactivity_value, q, na.rm = TRUE))

  df$obesity_value >= obesity_cutoff &
    df$inactivity_value >= inactivity_cutoff
}

priority_base <- classification_health
sensitivity_quantiles <- c(0.70, 0.75, 0.80)

sensitivity_summary <- map_dfr(sensitivity_quantiles, function(q) {
  priority_flag <- priority_flag_at(priority_base, q)

  tibble(
    threshold = paste0("Q", q * 100),
    obesity_cutoff = unname(quantile(priority_base$obesity_value, q, na.rm = TRUE)),
    inactivity_cutoff = unname(quantile(priority_base$inactivity_value, q, na.rm = TRUE)),
    priority_profiles = sum(priority_flag),
    priority_share = mean(priority_flag),
    average_obesity_priority = mean(priority_base$obesity_value[priority_flag], na.rm = TRUE),
    average_inactivity_priority = mean(priority_base$inactivity_value[priority_flag], na.rm = TRUE)
  )
})

robust_priority_data <- priority_base %>%
  mutate(
    priority_q70 = priority_flag_at(priority_base, 0.70),
    priority_q75 = priority_flag_at(priority_base, 0.75),
    priority_q80 = priority_flag_at(priority_base, 0.80),
    robust_priority = priority_q70 & priority_q75 & priority_q80
  )

robust_priority_profiles <- robust_priority_data %>%
  filter(robust_priority) %>%
  arrange(desc(obesity_value + inactivity_value))

robust_core_top10 <- robust_priority_profiles %>%
  slice_head(n = 10) %>%
  transmute(
    Year = as.integer(yearstart),
    Location = locationdesc,
    Group = stratification1,
    Obesity = round(obesity_value, 1),
    Inactivity = round(inactivity_value, 1),
    CombinedBurden = round(obesity_value + inactivity_value, 1)
  )

robust_priority_summary <- tibble(
  n_robust_priority_profiles = nrow(robust_priority_profiles),
  robust_priority_share = nrow(robust_priority_profiles) / nrow(priority_base),
  average_obesity = mean(robust_priority_profiles$obesity_value, na.rm = TRUE),
  average_inactivity = mean(robust_priority_profiles$inactivity_value, na.rm = TRUE)
)

priority_q70_n <- sensitivity_summary$priority_profiles[sensitivity_summary$threshold == "Q70"]
priority_q75_n <- sensitivity_summary$priority_profiles[sensitivity_summary$threshold == "Q75"]
priority_q80_n <- sensitivity_summary$priority_profiles[sensitivity_summary$threshold == "Q80"]
robust_priority_n <- nrow(robust_priority_profiles)
robust_priority_share <- robust_priority_n / nrow(priority_base)

sensitivity_summary %>%
  kable(digits = 4, caption = "Threshold Sensitivity for Priority Intervention Profiles")
Threshold Sensitivity for Priority Intervention Profiles
threshold obesity_cutoff inactivity_cutoff priority_profiles priority_share average_obesity_priority average_inactivity_priority
Q70 34.2 29.9 2178 0.1519 38.6547 36.8449
Q75 35.2 31.3 1625 0.1134 39.3788 37.8977
Q80 36.2 32.8 1137 0.0793 40.1659 38.9839
robust_priority_summary %>%
  kable(digits = 4, caption = "Robust Priority Core Summary")
Robust Priority Core Summary
n_robust_priority_profiles robust_priority_share average_obesity average_inactivity
1137 0.0793 40.1659 38.9839
robust_priority_profiles %>%
  slice_head(n = 20) %>%
  kable(digits = 3, caption = "Top 20 Robust Priority Profiles")
Top 20 Robust Priority Profiles
yearstart locationdesc stratificationcategory1 stratification1 obesity_value inactivity_value priority_q70 priority_q75 priority_q80 robust_priority
2019 North Dakota Race/Ethnicity American Indian/Alaska Native 57.8 49.5 TRUE TRUE TRUE TRUE
2017 Puerto Rico Education Less than high school 39.0 66.8 TRUE TRUE TRUE TRUE
2015 Ohio Race/Ethnicity American Indian/Alaska Native 51.6 51.4 TRUE TRUE TRUE TRUE
2017 Puerto Rico Age (years) 45 - 54 41.8 59.6 TRUE TRUE TRUE TRUE
2021 Puerto Rico Education Less than high school 39.0 61.4 TRUE TRUE TRUE TRUE
2013 Tennessee Education Less than high school 41.3 58.9 TRUE TRUE TRUE TRUE
2017 Puerto Rico Age (years) 55 - 64 40.8 58.8 TRUE TRUE TRUE TRUE
2019 Mississippi Income Less than $15,000 42.8 54.3 TRUE TRUE TRUE TRUE
2016 New Jersey Education Less than high school 41.5 54.5 TRUE TRUE TRUE TRUE
2017 Puerto Rico Income Less than $15,000 37.5 57.9 TRUE TRUE TRUE TRUE
2019 Mississippi Education Less than high school 43.1 52.3 TRUE TRUE TRUE TRUE
2017 Arkansas Race/Ethnicity American Indian/Alaska Native 50.3 44.8 TRUE TRUE TRUE TRUE
2019 Kentucky Education Less than high school 41.2 53.9 TRUE TRUE TRUE TRUE
2018 Kentucky Income Less than $15,000 39.8 55.2 TRUE TRUE TRUE TRUE
2019 Puerto Rico Age (years) 55 - 64 38.8 56.2 TRUE TRUE TRUE TRUE
2018 Arkansas Income Less than $15,000 50.5 44.4 TRUE TRUE TRUE TRUE
2012 Arkansas Race/Ethnicity American Indian/Alaska Native 46.0 48.8 TRUE TRUE TRUE TRUE
2021 Puerto Rico Income Less than $15,000 39.4 55.4 TRUE TRUE TRUE TRUE
2018 Mississippi Income Less than $15,000 46.1 48.4 TRUE TRUE TRUE TRUE
2020 Puerto Rico Age (years) 55 - 64 41.8 52.6 TRUE TRUE TRUE TRUE
robust_core_top10 %>%
  kable(caption = "Top 10 Robust-Core Priority Profiles by Combined Burden")
Top 10 Robust-Core Priority Profiles by Combined Burden
Year Location Group Obesity Inactivity CombinedBurden
2019 North Dakota American Indian/Alaska Native 57.8 49.5 107.3
2017 Puerto Rico Less than high school 39.0 66.8 105.8
2015 Ohio American Indian/Alaska Native 51.6 51.4 103.0
2017 Puerto Rico 45 - 54 41.8 59.6 101.4
2021 Puerto Rico Less than high school 39.0 61.4 100.4
2013 Tennessee Less than high school 41.3 58.9 100.2
2017 Puerto Rico 55 - 64 40.8 58.8 99.6
2019 Mississippi Less than $15,000 42.8 54.3 97.1
2016 New Jersey Less than high school 41.5 54.5 96.0
2017 Puerto Rico Less than $15,000 37.5 57.9 95.4
readr::write_csv(sensitivity_summary, "project_output_threshold_sensitivity.csv")
readr::write_csv(robust_priority_data, "project_output_robust_priority_data.csv")
readr::write_csv(robust_priority_profiles, "project_output_robust_priority_profiles.csv")
readr::write_csv(robust_core_top10, "project_output_robust_core_top10_profiles.csv")
readr::write_csv(robust_priority_summary, "project_output_robust_priority_summary.csv")
sensitivity_summary %>%
  ggplot(aes(x = threshold, y = priority_profiles, fill = threshold)) +
  geom_col(width = 0.62) +
  geom_text(
    aes(label = priority_profiles),
    vjust = -0.35,
    colour = "#243447",
    size = 3.8
  ) +
  scale_fill_manual(values = c(
    "Q70" = "#D18F3F",
    "Q75" = "#B4443F",
    "Q80" = "#7D2E2A"
  )) +
  labs(
    title = "Priority Intervention Profiles Under Alternative Thresholds",
    subtitle = "Higher thresholds define a smaller but stricter intervention core",
    x = "Quantile threshold",
    y = "Priority profiles"
  ) +
  theme_report()

The number of priority profiles changes from 2178 at Q70 to 1625 at Q75 and 1137 at Q80. Because both obesity and inactivity cut-offs rise together under the same quantile rule, the three priority sets are nested: every Q80 priority profile is also a priority profile at Q75 and Q70. The robust core contains 1137 subgroup profiles, or 7.93% of subgroup aligned profiles. These profiles remain high-priority under all three reasonable cut-off choices, so they form a defensible planning starting point rather than a final automatic policy decision.

14 Dashboard / App Design

14.1 Purpose of the Dashboard

A Shiny dashboard, app.R, accompanies this report. The dashboard is designed as an interactive public health profile explorer. Its purpose is to help non-technical users explore the main findings of the project without reading every table or code block in the full report.

The dashboard does not provide individual-level medical diagnosis, personal obesity prediction, or causal recommendations. It works with the same aggregated surveillance logic used throughout the report: each record represents a public health profile defined by year, location, indicator, and demographic stratification group.

The dashboard supports three practical user questions:

  1. How do adult obesity and physical inactivity values change across years?
  2. Which locations and demographic groups show higher average values?
  3. Which aligned subgroup profiles fall into higher-priority intervention segments?

14.2 App Structure

The app is organised into three pages. Each page corresponds to a major analytical stage in the report.

Dashboard Page Main Purpose Related Report Section
Overview and Trend Explore yearly trends by indicator, year range, location, and stratification category Exploratory Data Analysis
Location and Demographic Explorer Identify locations and demographic groups with higher average obesity or inactivity values Location and Demographic Comparison
Priority Matrix Explorer Classify aligned subgroup profiles into intervention segments under Q70, Q75, or Q80 thresholds Public Health Intervention Matrix and Threshold Sensitivity

This structure makes the dashboard a practical extension of the analysis. The first page explains the when, the second explains the where and who, and the third translates the results into public health action groups.

14.3 Page 1: Overview and Trend

The Overview and Trend page allows users to select an indicator, year range, location, and stratification category. It then displays a yearly trend plot and a small summary table containing the number of records, number of years, number of locations, average value, minimum value, and maximum value.

This page is useful for quickly checking whether a selected indicator is rising, falling, or fluctuating across time. For example, users can compare the adult obesity trend with the no leisure-time physical activity trend, or focus on a specific location or stratification category.

14.4 Page 2: Location and Demographic Explorer

The Location and Demographic Explorer page helps users identify where higher values are concentrated. Users select an indicator and year option, and the app displays:

  1. The top 10 locations by average indicator value.
  2. The top 10 stratification groups by average indicator value.

This page supports the location and demographic comparison logic from the EDA. It helps users move beyond national averages and see which public health profiles deserve closer attention.

14.5 Page 3: Priority Matrix Explorer

The Priority Matrix Explorer page is the most decision-oriented part of the app. It uses aligned subgroup obesity-inactivity profiles and classifies them into four intervention segments:

Segment Interpretation
Priority Intervention High obesity and high physical inactivity
Weight Management Focus High obesity and lower physical inactivity
Preventive Physical Activity Focus Lower obesity and high physical inactivity
Standard Monitoring Lower obesity and lower physical inactivity

Users can select Q70, Q75, or Q80 as the high-risk quantile threshold. The app then shows the number of subgroup profiles in each segment, segment averages, and a filtered table of the highest combined-burden subgroup profiles. This directly connects the dashboard to the threshold sensitivity analysis in the report.

14.6 Design Rationale

The app design follows the same principles as the analysis:

  1. Indicator-specific interpretation: obesity and physical inactivity values are not mixed unless the records are aligned.
  2. Profile-level framing: results are shown as subgroup-level public health profiles, not individual risks.
  3. Interpretable controls: users filter by year, location, indicator, and stratification category rather than by technical model parameters.
  4. Decision support: the priority matrix converts analytical results into clear public health segments.
  5. Threshold transparency: Q70, Q75, and Q80 options show how intervention lists expand or contract under different cut-off choices.

14.7 How to Run the App

To run the dashboard:

  1. Place Nutrition__Physical_Activity__and_Obesity.csv in the same folder as app.R.
  2. Open app.R in RStudio.
  3. Click Run App.

The dashboard requires the following R packages:

shiny, dplyr, tidyr, ggplot2, readr, stringr

14.8 Dashboard Limitations

The dashboard is intended for exploration and communication, not for clinical or causal decision-making. Several limitations should be kept in mind:

  1. It uses aggregated public health surveillance data, not individual-level records.
  2. Segment labels are based on relative quantile thresholds, not medical thresholds.
  3. The priority matrix highlights subgroup profiles for public health attention but does not prove that physical inactivity causes obesity.
  4. The dashboard separates broad trend exploration from subgroup-level priority analysis; users should not compare total-level and subgroup-level outputs as if they had the same unit of analysis.
  5. Filtered dashboard views may contain fewer records, so small filtered groups should be interpreted carefully.
  6. The app is designed as a compact project dashboard rather than a full production surveillance system.

Overall, the dashboard strengthens the project by turning the statistical analysis into an interactive tool. It allows users to explore trends, compare locations and demographic groups, and inspect intervention priority profiles in a way that is more accessible than static tables alone.

15 Discussion / Limitations / Conclusion

15.1 Discussion

This project shows how a large public health surveillance dataset can be transformed into an interpretable workflow for obesity and physical inactivity profiling. The analysis begins with careful question selection, because data_value has different meanings under different public health questions. By focusing on adult obesity (Q036) and no leisure-time physical activity (Q047), the project avoids mixing incompatible indicators and keeps the interpretation consistent.

The exploratory analysis shows that adult obesity increased over the study period. In the all-record trend, adult obesity rose by 6.3 percentage points from 2011 to 2021. The total-stratification trend confirms that the increase remains visible after reducing demographic composition effects. This supports the conclusion that the rising obesity pattern is not only a result of how stratified records are pooled.

Location and demographic comparison further show that obesity is unevenly distributed. The highest average obesity locations include Mississippi, West Virginia, Louisiana, while lower-obesity locations include Colorado, District of Columbia, Hawaii. Demographic profiles also differ strongly, especially across race or ethnicity, age, education, and income groups. These findings justify the modelling strategy: year, location, and demographic stratification are not arbitrary predictors, but variables that the EDA already shows to be important.

The aligned obesity-inactivity analysis is a central design choice. Because obesity and physical inactivity are reported in separate rows, the two indicators are compared only after matching by year, location, stratification category, and stratification group. The report then separates the broad all-profile alignment from the subgroup aligned modelling scope. Within the subgroup modelling scope, the raw correlation between adult obesity and no leisure-time physical activity is 0.457, indicating a positive descriptive association. The within-profile demeaned correlation (0.073) shows how much of this survives once stable between-profile differences are removed, and the subgroup-level lagged association check stays positive across lag 0, lag 1, and lag 2. Taken together, these are stable subgroup profile-level associations rather than validated predictive or causal relationships, so they motivate examining inactivity alongside obesity while still requiring a cautious non-causal interpretation.

The same pattern reappears in the regression adjustment ladder, which forms a clean gradient with the EDA: the inactivity coefficient falls from about 0.4 when unadjusted to about 0.14 once location and demographic group are controlled (roughly 35% retained), and the stricter within-profile correlation (0.073) sits closer to zero still. The more completely profile identity is absorbed, the less of the association survives, so the EDA and the regression tell one consistent story: the obesity-inactivity link is carried mostly by stable differences between locations and demographic subgroup profiles, with only a modest residual remaining after full adjustment.

The regression model confirms that adult obesity values are strongly structured by public health profile characteristics. The linear regression explains 74.7% of test-set variation in adult obesity values and performs better than the regression tree benchmark. The coefficient interpretation strengthens the result: the yearstart coefficient is approximately 0.6, meaning that obesity increases by about 0.6 percentage points per year after controlling for location and stratification group. This connects the model directly back to the EDA trend.

The classification model provides a practical subgroup-level screening layer. After excluding Total or Overall rows to match the regression scope, the logistic regression achieves a ROC-AUC of 0.928 and an AUC-PR of 0.815. More importantly for prioritisation, the top 20% highest-ranked subgroup profiles capture 65.6% of true high-obesity profiles. The risk-decile validation shows that observed high-obesity rates rise sharply from the lowest to the highest predicted-risk deciles, confirming that the model ranks subgroup profiles in a meaningful order within the observed surveillance panel.

Finally, the public health intervention matrix translates statistical results into action categories. The Priority Intervention segment contains subgroup profiles with both high obesity and high inactivity, representing 11.3% of subgroup aligned profiles. Threshold sensitivity analysis shows that the priority list contracts as the threshold moves from Q70 to Q80, and the robust core contains 1137 subgroup profiles that remain high-priority under all three cut-off choices. This robust core is a defensible starting point for intervention planning, not an automatic policy decision.

15.2 Limitations

Several limitations should be considered when interpreting the results.

  1. The dataset is aggregated, not individual-level.
    Each row represents a public health profile such as a year, location, indicator, and stratification group. The results cannot be used to predict whether a specific person is obese.

  2. The analysis is associational, not causal.
    The aligned analysis shows that obesity and physical inactivity tend to be higher in the same public health profiles, but this does not prove that physical inactivity causes obesity. Other factors such as diet, income, local environment, healthcare access, and policy context may also contribute.

  3. Model performance reflects interpolation within the observed surveillance panel.
    Because location and stratification group are predictors, the regression and classification models mainly learn patterns among already-observed locations and demographic groups. The results should not be treated as strong evidence of performance for entirely new locations, future policy contexts, or unseen demographic categories.

  4. High-risk thresholds are relative.
    The project uses quantile thresholds such as Q75 to define high-obesity and high-inactivity profiles. These thresholds are useful for prioritisation, but they are not medical, clinical, or policy-mandated thresholds.

  5. Total-level and subgroup-level records have different meanings.
    Total or Overall rows summarise a location-year, while subgroup rows describe demographic slices. The report keeps these levels separate, but users should not interpret subgroup models as population-weighted national estimates.

  6. Some strata have fewer records than others.
    Certain demographic groups and locations have smaller coverage or fewer years of data. This can make some subgroup averages and extreme rankings less stable.

  7. The models are intentionally simple.
    The regression and classification models use interpretable predictors and additive structures. They do not fully model complex interactions between location, income, education, age, race or ethnicity, and physical inactivity.

  8. The intervention matrix supports prioritisation, not final policy decisions.
    The matrix identifies profiles that may deserve attention, but real interventions should also consider local context, feasibility, resources, community needs, and external validation.

  9. The dashboard is exploratory.
    The Shiny app helps users inspect trends, rankings, and priority segments, but it is not a production surveillance system or clinical decision tool.

15.3 Conclusion

This project analysed adult obesity and no leisure-time physical activity using the Nutrition, Physical Activity, and Obesity public health surveillance dataset. The dataset satisfies the course dimension requirement and supports a complete data science workflow: data cleaning, exploratory analysis, aligned relationship analysis, regression modelling, classification modelling, intervention prioritisation, threshold sensitivity analysis, and dashboard design.

The main findings are clear. Adult obesity increased from 2011 to 2021, and the increase remains visible in total-stratification records. Obesity also differs substantially across locations and demographic groups, confirming that public health profiles are not evenly distributed. After fair alignment, obesity and physical inactivity show a positive subgroup-level association with a correlation of 0.457 within the modelling scope.

The regression model explains 74.7% of test-set obesity variation, showing that year, location, and demographic group account for a large share of observed obesity differences. The classification model adds subgroup-level screening value: the logistic regression captures 65.6% of true high-obesity subgroup profiles within the top 20% highest-ranked subgroup profiles.

The final public health intervention matrix turns these results into practical segments. The Priority Intervention group combines high obesity with high physical inactivity and represents the strongest candidate group for combined obesity prevention and physical activity promotion. The robust core of 1137 subgroup profiles remains high-priority under Q70, Q75, and Q80 threshold choices, making it a defensible starting point for planning.

Overall, the project demonstrates how programming and data science can convert large-scale public health surveillance data into interpretable insights, screening tools, intervention priorities, and an interactive dashboard. All conclusions should be understood as group-level public health insights rather than individual medical predictions or causal claims.