Division of Work

Member	Student ID	Main Assigned Section
Chen Miaojun	25072895	Dataset Description; Data Structure and Quality; Data Cleaning and Question Selection; Dashboard / App Design
Wang Peiyuan	25089856	Exploratory Data Analysis; Location and Demographic Comparison; Aligned Relationship and Robustness Checks
Huang Shengqi	24080731	Regression Model; Regression Diagnostics / Interpretation; Adjustment Ladder; statistical limitation review
Wang Hexiao	25089268	Classification Model: High-Obesity Subgroup Profile Screening; Risk Ranking Validation; Public Health Intervention Matrix; Threshold Sensitivity and Robust Core
All members	N/A	Discussion / Limitations / Conclusion; final review; presentation preparation

Abstract

Obesity and physical inactivity are important public health concerns because they are closely related to long-term population health, healthcare burden, and preventive health planning. This project analyses the Nutrition, Physical Activity, and Obesity public health surveillance dataset in the United States. The dataset covers 2011 to 2021 and contains 88,629 records with 33 variables, giving a total dataset dimension of more than 2.9 million cells.

The analysis focuses on two selected indicators: adult obesity and no leisure-time physical activity. Since the dataset is aggregated, each row represents a public health profile rather than an individual person. The project therefore studies obesity and physical inactivity patterns across years, locations, and demographic groups instead of predicting individual medical outcomes. A distinction is made between broad descriptive views, which may use all selected surveillance records, and modelling-related subgroup analysis, which excludes Total or Overall aggregate rows to keep the unit of analysis consistent.

The overall aim is to build an interpretable public health decision-support workflow. The report first prepares clean indicator-specific datasets, then supports exploratory analysis, aligned obesity-inactivity comparison, regression modelling, classification modelling, and public health intervention priority analysis. The workflow deliberately separates total-stratification summaries from demographic subgroup profiles so that exploratory patterns, regression explanation, classification screening, and intervention priorities are interpreted at the correct level. All findings should be interpreted as group-level public health insights rather than individual medical predictions or causal effects.

The main findings show that adult obesity increased by about six percentage points over the study period. Obesity and physical inactivity have a positive subgroup-level aligned correlation of about 0.46, but the within-profile demeaned correlation is only about 0.07, suggesting that much of the relationship is driven by stable differences between locations and demographic profiles. The regression model explains about three-quarters of test-set obesity variation, while the classification model captures about two-thirds of true high-obesity subgroup profiles within the top 20% highest-ranked profiles.

Executive Decision Summary

Adult obesity increased clearly from 2011 to 2021, and the rise remains visible when the analysis focuses on total-stratification records.
Obesity is unevenly distributed across locations and demographic subgroup profiles, so public health monitoring should not rely only on broad averages.
Physical inactivity is positively associated with adult obesity at the subgroup profile level, but the relationship weakens strongly after within-profile adjustment. This supports a cautious interpretation: inactivity is useful for profiling and prioritisation, but the analysis does not prove a causal effect.
The regression model provides strong profile-level explanation within the observed surveillance panel, mainly because year, location, and demographic group capture a large share of obesity variation.
The classification and intervention matrix translate the analysis into practical priority groups. The most defensible starting point is the robust core of profiles that remain high-priority under Q70, Q75, and Q80 threshold choices.
The Shiny dashboard extends the report by allowing users to explore trends, location and demographic rankings, and intervention priority segments interactively.

1 Introduction

1.1 Background

Obesity and physical inactivity are major public health concerns because they are closely associated with chronic disease burden, healthcare cost, and preventive health planning. However, obesity is not evenly distributed across the population. Its prevalence can vary across locations, years, age groups, education levels, income levels, gender groups, and racial or ethnic groups.

A single national average may hide important differences between communities. For public health planning, it is more useful to identify which combinations of year, location, and demographic group have higher obesity prevalence. These combinations can be understood as public health profiles.

Physical inactivity is also an important surveillance indicator. A lack of leisure-time physical activity may appear together with higher obesity prevalence in some public health profiles. However, this project treats the relationship as an observed association. It does not claim that physical inactivity directly causes obesity.

1.2 Project Focus

This project focuses on three connected tasks.

First, it describes adult obesity and physical inactivity trends across years, locations, and demographic groups. Second, it examines whether public health profile characteristics can explain differences in adult obesity prevalence. Third, it screens high-obesity demographic subgroup profiles and translates obesity-inactivity patterns into interpretable public health priority groups.

The focus is therefore not only on building models, but also on producing results that can be understood and used for public health monitoring and prioritisation.

1.3 Objectives and Research Questions

The objective of this project is to transform large-scale public health surveillance data into interpretable insights and practical public health priorities.

The project is guided by the following research questions:

How did adult obesity and no leisure-time physical activity vary across U.S. locations from 2011 to 2021? (Exploratory analysis)
How do obesity patterns differ across demographic groups such as age, education, income, gender, and race or ethnicity? (Demographic comparison)
Can year, location, and demographic stratification variables explain variation in adult obesity prevalence? (Regression)
Can high-obesity demographic subgroup public health profiles be screened using physical inactivity levels together with year, location, and demographic characteristics? (Classification)
How can obesity and physical inactivity patterns be translated into interpretable public health intervention priorities? (Decision support)

Therefore, all findings should be interpreted as group-level public health insights rather than individual medical predictions.

1.4 Analytical Scope

This project is framed as a public health profile analysis rather than an individual diagnosis or personal risk prediction task.

The unit of analysis is a public health profile, such as a specific year, location, and demographic group.
The project focuses on two selected indicators: adult obesity and no leisure-time physical activity.
The variable data_value has different meanings under different public health questions, so the analysis must select specific questionid values before modelling.
Adult obesity and physical inactivity should only be compared after matching records by year, location, and stratification group.
Broad EDA may show all selected records or total-stratification records, but modelling-related EDA, regression, classification, and priority analysis use demographic subgroup profiles with Total or Overall rows excluded.
The regression and classification stages are designed for explanation and screening, not for causal claims.
The final intervention priority matrix is intended to support public health planning by identifying subgroup profiles where obesity and physical inactivity are both relatively high.

2 Dataset Description

2.1 Source and Dimension

The dataset used in this project is the Nutrition, Physical Activity, and Obesity dataset. The accessible CSV file was obtained from Kaggle, while the original public health context comes from the CDC Division of Nutrition, Physical Activity, and Obesity surveillance programme. The underlying data source is the Behavioral Risk Factor Surveillance System, a large state-based health survey system in the United States.

Dataset URL: https://www.kaggle.com/datasets/mattop/nutrition-physical-activity-and-obesity/data
Original public health context: CDC Division of Nutrition, Physical Activity, and Obesity
Underlying surveillance source: Behavioral Risk Factor Surveillance System
File used in this project: Nutrition__Physical_Activity__and_Obesity.csv
Data type: Aggregated public health surveillance data
Geographic context: United States, including states, territories, and national-level records
Years covered: 2011-2021

The dataset dimension is calculated after loading the raw CSV file.

csv_file <- "Nutrition__Physical_Activity__and_Obesity.csv"

if (!file.exists(csv_file)) {
  csv_files <- list.files(pattern = "\\.csv$", ignore.case = TRUE)
  csv_files <- csv_files[!str_detect(csv_files, "^project_output_")]

  if (length(csv_files) == 1) {
    csv_file <- csv_files[1]
    message("Using detected source CSV file: ", csv_file)
  } else {
    stop(
      "Please place Nutrition__Physical_Activity__and_Obesity.csv in the same folder as this Rmd file."
    )
  }
}

raw_data <- readr::read_csv(csv_file, show_col_types = FALSE)

clean_colnames <- function(x) {
  x %>%
    str_to_lower() %>%
    str_replace_all("[^a-z0-9]+", "_") %>%
    str_replace_all("^_|_$", "")
}

original_names <- names(raw_data)
names(raw_data) <- clean_colnames(names(raw_data))

column_mapping <- tibble(
  original_name = original_names,
  cleaned_name = names(raw_data)
)

n_rows <- nrow(raw_data)
n_cols <- ncol(raw_data)
n_dim <- n_rows * n_cols

dataset_dimension <- tibble(
  rows = n_rows,
  columns = n_cols,
  dimension_cells = n_dim
)

dataset_dimension %>%
  kable(caption = "Raw Dataset Dimension")

Raw Dataset Dimension
rows	columns	dimension_cells
88629	33	2924757

readr::write_csv(column_mapping, "project_output_column_mapping.csv")

The dataset contains 88,629 rows x 33 columns = 2,924,757 cells. This satisfies the course dataset-dimension requirement because the total dimension is greater than 100,000.

2.2 Variables Used

The dataset contains multiple public health questions, so many columns are metadata, identifiers, confidence limits, sample sizes, and stratification labels. The main variables used in this project are listed below.

Variable	Description
`yearstart`	Starting year of the record
`yearend`	Ending year of the record
`locationdesc`	U.S. state, territory, or national location description
`class`	Broad public health category
`topic`	Specific topic under each class
`question`	Public health indicator question
`data_value`	Main numeric value for the selected indicator
`low_confidence_limit`	Lower confidence limit for the estimate
`high_confidence_limit`	Upper confidence limit for the estimate
`sample_size`	Sample size used for the estimate
`stratificationcategory1`	Type of demographic stratification
`stratification1`	Specific group within the stratification category
`questionid`	Stable identifier for the public health question

2.3 Important Dataset Interpretation Issue

A key property of this dataset is that data_value is question-dependent. This means that the same numeric column can represent different measurements depending on the value of questionid.

For example:

Under an obesity question, data_value represents adult obesity prevalence.
Under a physical activity question, data_value represents a physical activity measurement.
Under a fruit or vegetable consumption question, data_value represents a nutrition-related measurement.

Therefore, this project does not pool all data_value records together. Instead, it selects specific public health questions before analysis. This avoids mixing values that have different meanings.

A second interpretation issue is the difference between Total or Overall rows and demographic subgroup rows. Total rows are useful for broad state-level trends because they summarise the overall adult population for a location and year. However, regression, classification, risk ranking, and intervention-priority analysis are based on demographic subgroup profiles. Therefore, Total or Overall rows are excluded from modelling-related sections to avoid mixing aggregate summaries with subgroup observations.

2.4 Selected Indicators

This project focuses on two public health indicators.

Indicator	Question ID	Question text
Adult obesity	`Q036`	Percent of adults aged 18 years and older who have obesity
No leisure-time physical activity	`Q047`	Percent of adults who engage in no leisure-time physical activity

These two indicators are selected because they support the main project objective: analysing obesity patterns and examining how they relate to physical inactivity at the public health profile level.

3 Data Structure and Quality

3.1 Overall Data Quality Summary

Before any modelling, the structure and quality of the raw data are profiled. This includes variable classes, missing values, distinct-value counts, duplicate rows, and complete-row counts.

structure_summary <- tibble(
  variable = names(raw_data),
  class = map_chr(raw_data, ~ class(.x)[1]),
  missing_n = map_int(raw_data, ~ sum(is.na(.x))),
  missing_rate = map_dbl(raw_data, ~ mean(is.na(.x))),
  unique_n = map_int(raw_data, ~ n_distinct(.x, na.rm = TRUE)),
  example_values = map_chr(
    raw_data,
    ~ paste(head(unique(na.omit(as.character(.x))), 5), collapse = " | ")
  )
)

quality_summary <- tibble(
  rows = nrow(raw_data),
  columns = ncol(raw_data),
  dimension_cells = nrow(raw_data) * ncol(raw_data),
  duplicate_rows = sum(duplicated(raw_data)),
  duplicate_rate = mean(duplicated(raw_data)),
  complete_rows = sum(complete.cases(raw_data)),
  complete_row_rate = mean(complete.cases(raw_data))
)

quality_summary %>%
  kable(digits = 4, caption = "Row-Level Data Quality Summary")

Row-Level Data Quality Summary
rows	columns	dimension_cells	duplicate_rows	duplicate_rate	complete_rows	complete_row_rate
88629	33	2924757	0	0	0	0

readr::write_csv(structure_summary, "project_output_variable_structure.csv")
readr::write_csv(quality_summary, "project_output_quality_summary.csv")

The raw dataset contains 88,629 rows and 33 columns. The total dataset dimension is 2,924,757 cells. The dataset contains 0 exact duplicate rows.

3.2 Key Variable Structure

key_variables <- c(
  "yearstart",
  "yearend",
  "locationdesc",
  "class",
  "topic",
  "question",
  "data_value",
  "low_confidence_limit",
  "high_confidence_limit",
  "sample_size",
  "stratificationcategory1",
  "stratification1",
  "questionid"
)

key_variable_summary <- structure_summary %>%
  filter(variable %in% key_variables)

key_variable_summary %>%
  kable(digits = 4, caption = "Key Variable Structure Summary")

Key Variable Structure Summary
variable	class	missing_n	missing_rate	unique_n	example_values
yearstart	numeric	0	0.0000	11	2020 \| 2014 \| 2013 \| 2015 \| 2012
yearend	numeric	0	0.0000	11	2020 \| 2014 \| 2013 \| 2015 \| 2012
locationdesc	character	0	0.0000	55	National \| Guam \| Wyoming \| District of Columbia \| Puerto Rico
class	character	0	0.0000	3	Physical Activity \| Obesity / Weight Status \| Fruits and Vegetables
topic	character	0	0.0000	3	Physical Activity - Behavior \| Obesity / Weight Status \| Fruits and Vegetables - Behavior
question	character	0	0.0000	9	Percent of adults who engage in no leisure-time physical activity \| Percent of adults aged 18 years and older who have obesity \| Percent of adults aged 18 years and older who have an overweight classification \| Percent of adults who achieve at least 300 minutes a week of moderate-intensity aerobic physical activity or 150 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination) \| Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic physical activity and engage in muscle-strengthening activities on 2 or more days a week
data_value	numeric	8778	0.0990	695	30.6 \| 29.3 \| 28.8 \| 32.7 \| 26.6
low_confidence_limit	numeric	8778	0.0990	666	29.4 \| 25.7 \| 28.1 \| 31.9 \| 25.6
high_confidence_limit	numeric	8778	0.0990	760	31.8 \| 33.3 \| 29.5 \| 33.5 \| 27.6
sample_size	numeric	8778	0.0990	9580	31255 \| 842 \| 62562 \| 60069 \| 30904
questionid	character	0	0.0000	9	Q047 \| Q036 \| Q037 \| Q045 \| Q044
stratificationcategory1	character	9	0.0001	6	Race/Ethnicity \| Education \| Income \| Age (years) \| Gender
stratification1	character	9	0.0001	28	Hispanic \| High school graduate \| $50,000 - $74,999 \| Data not reported \| Less than $15,000

3.3 Missing Value Review

high_missing_summary <- structure_summary %>%
  arrange(desc(missing_rate), desc(unique_n)) %>%
  slice_head(n = 10)

high_missing_summary %>%
  kable(digits = 4, caption = "Variables with Highest Missing Rates")

Variables with Highest Missing Rates
variable	class	missing_n	missing_rate	unique_n	example_values
data_value_unit	logical	88629	1.0000	0
total	character	85464	0.9643	1	Total
gender	character	82299	0.9286	2	Female \| Male
data_value_footnote_symbol	character	79851	0.9010	1	~
data_value_footnote	character	79851	0.9010	1	Data not available because sample size is insufficient.
education	character	75969	0.8572	4	High school graduate \| Less than high school \| Some college or technical school \| College graduate
age_years	character	69639	0.7857	6	25 - 34 \| 55 - 64 \| 18 - 24 \| 45 - 54 \| 35 - 44
income	character	66474	0.7500	7	$50,000 - $74,999 \| Data not reported \| Less than $15,000 \| $25,000 - $34,999 \| $15,000 - $24,999
race_ethnicity	character	63309	0.7143	8	Hispanic \| American Indian/Alaska Native \| Asian \| Non-Hispanic White \| Other
sample_size	numeric	8778	0.0990	9580	31255 \| 842 \| 62562 \| 60069 \| 30904

The number of complete rows is 0. This does not mean that the dataset is unusable. Much of the missingness is structural.

For example, demographic columns such as age_years, education, gender, income, race_ethnicity, and total are mutually exclusive stratification fields. A record normally belongs to one stratification category, so the other demographic columns are expected to be empty. The same information is carried more consistently by stratificationcategory1 and stratification1.

The dataset also contains missing values in variables such as data_value, low_confidence_limit, high_confidence_limit, and sample_size. These missing values are often linked to unavailable estimates, such as cases where the sample size is insufficient. Therefore, the modelling dataset is prepared by selecting the relevant indicators and removing rows where the selected indicator value is missing.

4 Data Cleaning and Question Selection

4.1 Cleaning Strategy

Because the data are aggregated and question-specific, cleaning here mainly means selecting and validating comparable records rather than repairing individual-level fields.

The cleaning workflow includes:

Standardising all column names.
Checking dataset dimensions, missing values, and duplicate rows.
Identifying available public health questions.
Selecting adult obesity and no leisure-time physical activity using questionid.
Removing rows with missing data_value for the selected indicators.
Validating that selected indicator values fall within the valid percentage range from 0 to 100.
Exporting cleaned obesity and inactivity datasets as processed files.

The duplicate check found 0 exact duplicate rows, so no de-duplication was required.

4.2 Available Public Health Questions

has_questionid <- "questionid" %in% names(raw_data)
has_question <- "question" %in% names(raw_data)

if (!has_question) {
  stop("The dataset must contain a question column after column-name cleaning.")
}

question_table <- raw_data %>%
  distinct(
    questionid = if (has_questionid) questionid else NA_character_,
    question = question
  ) %>%
  arrange(questionid, question)

question_table %>%
  kable(caption = "Available Public Health Questions")

Available Public Health Questions
questionid	question
Q018	Percent of adults who report consuming fruit less than one time daily
Q019	Percent of adults who report consuming vegetables less than one time daily
Q036	Percent of adults aged 18 years and older who have obesity
Q037	Percent of adults aged 18 years and older who have an overweight classification
Q043	Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)
Q044	Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic physical activity and engage in muscle-strengthening activities on 2 or more days a week
Q045	Percent of adults who achieve at least 300 minutes a week of moderate-intensity aerobic physical activity or 150 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)
Q046	Percent of adults who engage in muscle-strengthening activities on 2 or more days a week
Q047	Percent of adults who engage in no leisure-time physical activity

readr::write_csv(question_table, "project_output_question_table.csv")

The table above confirms that the dataset contains multiple public health questions. Since data_value has a different meaning under different questions, the analysis selects only the two questions directly related to this project.

4.3 Selected Questions

if (has_questionid) {
  selected_questions <- question_table %>%
    filter(questionid %in% c("Q036", "Q047"))
} else {
  selected_questions <- question_table %>%
    filter(
      str_detect(str_to_lower(question), "obesity") |
        str_detect(
          str_to_lower(question),
          "no leisure-time physical activity|no leisure time physical activity"
        )
    )
}

selected_questions %>%
  kable(caption = "Selected Public Health Questions")

Selected Public Health Questions
questionid	question
Q036	Percent of adults aged 18 years and older who have obesity
Q047	Percent of adults who engage in no leisure-time physical activity

readr::write_csv(selected_questions, "project_output_selected_questions.csv")

The selected questions are:

Q036: Percent of adults aged 18 years and older who have obesity.
Q047: Percent of adults who engage in no leisure-time physical activity.

4.4 Build Indicator Filters

The preferred selection method is questionid, because it is more stable than text matching. Text matching is used only as a fallback if the dataset does not contain a questionid column.

if (has_questionid) {
  obesity_filter <- raw_data$questionid == "Q036"
  inactivity_filter <- raw_data$questionid == "Q047"
} else {
  obesity_filter <- str_detect(
    str_to_lower(raw_data$question),
    "obesity"
  )

  inactivity_filter <- str_detect(
    str_to_lower(raw_data$question),
    "no leisure-time physical activity|no leisure time physical activity"
  )
}

indicator_match_summary <- tibble(
  indicator = c(
    "Adult obesity",
    "No leisure-time physical activity"
  ),
  matched_records_before_missing_value_removal = c(
    sum(obesity_filter, na.rm = TRUE),
    sum(inactivity_filter, na.rm = TRUE)
  )
)

indicator_match_summary %>%
  kable(caption = "Records Matched to Each Selected Indicator")

Records Matched to Each Selected Indicator
indicator	matched_records_before_missing_value_removal
Adult obesity	16577
No leisure-time physical activity	16549

4.5 Build Cleaned Indicator Datasets

required_cols <- c(
  "yearstart",
  "locationdesc",
  "question",
  "data_value",
  "stratificationcategory1",
  "stratification1"
)

missing_required <- setdiff(required_cols, names(raw_data))

if (length(missing_required) > 0) {
  stop(
    paste(
      "Missing required columns:",
      paste(missing_required, collapse = ", ")
    )
  )
}

obesity_data <- raw_data %>%
  filter(obesity_filter, !is.na(data_value)) %>%
  mutate(
    indicator = "Adult obesity",
    stratificationcategory1 = coalesce(as.character(stratificationcategory1), "Unknown"),
    stratification1 = coalesce(as.character(stratification1), "Unknown")
  )

inactivity_data <- raw_data %>%
  filter(inactivity_filter, !is.na(data_value)) %>%
  mutate(
    indicator = "No leisure-time physical activity",
    stratificationcategory1 = coalesce(as.character(stratificationcategory1), "Unknown"),
    stratification1 = coalesce(as.character(stratification1), "Unknown")
  )

4.6 Cleaned Dataset Summary

obesity_summary <- obesity_data %>%
  summarise(
    rows = n(),
    years = n_distinct(yearstart),
    min_year = min(yearstart, na.rm = TRUE),
    max_year = max(yearstart, na.rm = TRUE),
    locations = n_distinct(locationdesc),
    mean_value = mean(data_value, na.rm = TRUE),
    sd_value = sd(data_value, na.rm = TRUE),
    min_value = min(data_value, na.rm = TRUE),
    max_value = max(data_value, na.rm = TRUE)
  )

inactivity_summary <- inactivity_data %>%
  summarise(
    rows = n(),
    years = n_distinct(yearstart),
    min_year = min(yearstart, na.rm = TRUE),
    max_year = max(yearstart, na.rm = TRUE),
    locations = n_distinct(locationdesc),
    mean_value = mean(data_value, na.rm = TRUE),
    sd_value = sd(data_value, na.rm = TRUE),
    min_value = min(data_value, na.rm = TRUE),
    max_value = max(data_value, na.rm = TRUE)
  )

indicator_summary <- bind_rows(
  obesity_summary %>% mutate(indicator = "Adult obesity"),
  inactivity_summary %>% mutate(indicator = "No leisure-time physical activity")
) %>%
  select(indicator, everything())

indicator_summary %>%
  kable(digits = 4, caption = "Cleaned Indicator Dataset Summary")

Cleaned Indicator Dataset Summary
indicator	rows	years	min_year	max_year	locations	mean_value	sd_value	min_value	max_value
Adult obesity	14956	11	2011	2021	55	30.4590	7.2741	0.9	60.4
No leisure-time physical activity	14984	11	2011	2021	55	25.9161	8.3074	2.5	66.8

readr::write_csv(indicator_summary, "project_output_indicator_summary.csv")

After removing records with missing data_value, the adult obesity dataset contains 14,956 records, and the no leisure-time physical activity dataset contains 14,984 records.

4.7 Value Range Validation

Both selected indicators are percentages, so valid values should fall between 0 and 100.

value_range_check <- tibble(
  indicator = c(
    "Adult obesity",
    "No leisure-time physical activity"
  ),
  questionid = c("Q036", "Q047"),
  min_value = c(
    min(obesity_data$data_value, na.rm = TRUE),
    min(inactivity_data$data_value, na.rm = TRUE)
  ),
  max_value = c(
    max(obesity_data$data_value, na.rm = TRUE),
    max(inactivity_data$data_value, na.rm = TRUE)
  ),
  out_of_range_n = c(
    sum(obesity_data$data_value < 0 | obesity_data$data_value > 100, na.rm = TRUE),
    sum(inactivity_data$data_value < 0 | inactivity_data$data_value > 100, na.rm = TRUE)
  )
)

value_range_check %>%
  kable(digits = 2, caption = "Indicator Value Range Validation")

Indicator Value Range Validation
indicator	questionid	min_value	max_value	out_of_range_n
Adult obesity	Q036	0.9	60.4	0
No leisure-time physical activity	Q047	2.5	66.8	0

readr::write_csv(value_range_check, "project_output_value_range_check.csv")

The range validation confirms that all retained adult obesity and physical inactivity values are within the valid percentage range from 0 to 100.

4.8 Before and After Cleaning Summary

cleaning_summary <- tibble(
  stage = c(
    "Raw data",
    "Cleaned adult obesity dataset",
    "Cleaned no leisure-time physical activity dataset"
  ),
  questionid = c(
    NA_character_,
    "Q036",
    "Q047"
  ),
  rows = c(
    nrow(raw_data),
    nrow(obesity_data),
    nrow(inactivity_data)
  ),
  columns = c(
    ncol(raw_data),
    ncol(obesity_data),
    ncol(inactivity_data)
  )
)

cleaning_summary %>%
  kable(caption = "Before and After Cleaning Summary")

Before and After Cleaning Summary
stage	questionid	rows	columns
Raw data	NA	88629	33
Cleaned adult obesity dataset	Q036	14956	34
Cleaned no leisure-time physical activity dataset	Q047	14984	34

readr::write_csv(cleaning_summary, "project_output_cleaning_summary.csv")
readr::write_csv(obesity_data, "project_output_cleaned_obesity_dataset.csv")
readr::write_csv(inactivity_data, "project_output_cleaned_inactivity_dataset.csv")

The cleaned datasets provide the foundation for the next stages of the project, including exploratory trend analysis, demographic comparison, aligned obesity-inactivity analysis, regression modelling, classification modelling, and public health intervention priority analysis.

5 Exploratory Data Analysis

5.1 EDA Strategy

The EDA has three goals. First, it shows how adult obesity and physical inactivity changed from 2011 to 2021. Second, it compares obesity across locations and demographic groups. Third, it builds an aligned obesity-inactivity dataset. This makes sure the two indicators are compared within the same public health profile. This part is descriptive: it does not test causality and does not predict future obesity. It helps us understand the main patterns before the regression and classification models.

The EDA uses two related levels of analysis. Broad descriptive views may use all selected records or Total records to show general surveillance patterns. Modelling-related EDA uses demographic subgroup profiles only, with Total or Overall rows removed. This keeps the EDA, regression, classification, risk ranking, intervention matrix, and threshold sensitivity analysis on a consistent subgroup-level basis.

Three rules guide this EDA:

data_value is interpreted only within the selected indicator.
Obesity and physical inactivity are compared only after matching by year, location, and stratification group.
Results are labelled by analysis level: broad all-record trends, total-stratification trends, or subgroup-level modelling profiles.

The figures in this section use a consistent visual style so that trends, rankings, and relationships can be compared more easily.

5.2 Overall Trend

The first trend plot uses all selected records for the two indicators. It gives a broad descriptive overview of the surveillance data before separating total-level and subgroup-level analysis. The next subsection then uses total-stratification records to check whether the main trend remains visible when repeated demographic subgroup records are removed.

This pooled view is useful as a starting point, but it should not be over-read. The data include both Total records and demographic subgroup records, so the yearly average can be affected by the mix of records in each year. For that reason, the total-stratification robustness check is used as a more comparable state-level follow-up, while subgroup-only aligned data are used later for modelling-related analysis.

combined_core <- bind_rows(
  obesity_data %>%
    mutate(indicator = "Adult obesity"),
  inactivity_data %>%
    mutate(indicator = "No leisure-time physical activity")
)

yearly_trend <- combined_core %>%
  group_by(indicator, yearstart) %>%
  summarise(
    records = n(),
    average_value = mean(data_value, na.rm = TRUE),
    median_value = median(data_value, na.rm = TRUE),
    sd_value = sd(data_value, na.rm = TRUE),
    .groups = "drop"
  )

trend_change <- yearly_trend %>%
  arrange(indicator, yearstart) %>%
  group_by(indicator) %>%
  summarise(
    first_year = first(yearstart),
    first_average = first(average_value),
    last_year = last(yearstart),
    last_average = last(average_value),
    absolute_change = last_average - first_average,
    relative_change_pct = absolute_change / first_average * 100,
    .groups = "drop"
  )

yearly_trend %>%
  kable(digits = 3, caption = "Yearly Average of Selected Public Health Indicators")

Yearly Average of Selected Public Health Indicators
indicator	yearstart	records	average_value	median_value	sd_value
Adult obesity	2011	1332	27.736	28.10	6.756
Adult obesity	2012	1332	27.997	28.40	6.734
Adult obesity	2013	1344	28.816	29.35	6.782
Adult obesity	2014	1369	29.336	29.90	6.998
Adult obesity	2015	1361	29.533	30.00	6.963
Adult obesity	2016	1389	30.086	30.70	7.036
Adult obesity	2017	1372	30.956	31.60	7.047
Adult obesity	2018	1369	31.633	32.20	7.004
Adult obesity	2019	1343	32.356	32.90	7.178
Adult obesity	2020	1368	32.413	32.85	7.269
Adult obesity	2021	1377	34.033	34.60	7.417
No leisure-time physical activity	2011	1332	26.492	26.20	7.416
No leisure-time physical activity	2012	1336	23.898	22.95	7.956
No leisure-time physical activity	2013	1318	26.855	26.30	7.588
No leisure-time physical activity	2014	1376	24.236	23.50	8.035
No leisure-time physical activity	2015	1358	27.054	26.50	8.025
No leisure-time physical activity	2016	1399	24.895	24.10	8.614
No leisure-time physical activity	2017	1372	27.972	27.60	8.101
No leisure-time physical activity	2018	1377	25.632	24.60	8.720
No leisure-time physical activity	2019	1346	27.769	26.85	8.427
No leisure-time physical activity	2020	1382	24.933	23.85	8.531
No leisure-time physical activity	2021	1388	25.427	24.20	8.644

trend_change %>%
  kable(digits = 3, caption = "Change from First to Last Available Year")

Change from First to Last Available Year
indicator	first_year	first_average	last_year	last_average	absolute_change	relative_change_pct
Adult obesity	2011	27.736	2021	34.033	6.296	22.70
No leisure-time physical activity	2011	26.492	2021	25.427	-1.065	-4.02

readr::write_csv(yearly_trend, "project_output_yearly_trend.csv")
readr::write_csv(trend_change, "project_output_yearly_trend_change.csv")

yearly_trend %>%
  ggplot(aes(x = yearstart, y = average_value, colour = indicator, group = indicator)) +
  geom_line(linewidth = 1.1) +
  geom_point(size = 2.5) +
  scale_colour_manual(values = indicator_colours) +
  scale_x_continuous(breaks = sort(unique(yearly_trend$yearstart))) +
  scale_y_continuous(labels = scales::label_number(suffix = "%")) +
  labs(
    title = "Overall Trends in Adult Obesity and Physical Inactivity",
    subtitle = "Broad descriptive view across all selected non-missing public health records",
    x = "Year",
    y = "Average value"
  ) +
  theme_report()

The two indicators show different shapes. Adult obesity shows an overall upward pattern across the decade. It rises from 27.74% in 2011 to 34.03% in 2021, an increase of 6.3 percentage points. Physical inactivity shows a recurring jagged pattern rather than a clear upward or downward trend, which is why it is interpreted more cautiously than the obesity trend.

5.3 Total-Stratification Robustness Check

The overall trend above mixes many demographic subgroup records for the same location and year. As a robustness check, this subsection keeps only the Total records. These are the overall adult rate for each location, with no demographic breakdown, so they avoid repeated demographic weighting and provide a more comparable year-location summary.

find_total_records <- function(df) {
  total_flag <- rep(FALSE, nrow(df))

  if ("stratificationcategory1" %in% names(df)) {
    total_flag <- total_flag |
      str_to_lower(as.character(df$stratificationcategory1)) %in% c("total", "overall")
  }

  if ("stratification1" %in% names(df)) {
    total_flag <- total_flag |
      str_to_lower(as.character(df$stratification1)) %in% c("total", "overall")
  }

  if ("total" %in% names(df)) {
    total_flag <- total_flag | !is.na(df$total)
  }

  total_flag
}

obesity_total <- obesity_data %>%
  filter(find_total_records(.))

inactivity_total <- inactivity_data %>%
  filter(find_total_records(.))

total_check <- tibble(
  indicator = c("Adult obesity", "No leisure-time physical activity"),
  records = c(nrow(obesity_total), nrow(inactivity_total)),
  years = c(n_distinct(obesity_total$yearstart), n_distinct(inactivity_total$yearstart)),
  locations = c(n_distinct(obesity_total$locationdesc), n_distinct(inactivity_total$locationdesc))
)

total_indicator_trend <- bind_rows(
  obesity_total %>%
    transmute(indicator = "Adult obesity", yearstart, locationdesc, value = data_value),
  inactivity_total %>%
    transmute(indicator = "No leisure-time physical activity", yearstart, locationdesc, value = data_value)
) %>%
  group_by(indicator, yearstart) %>%
  summarise(
    records = n(),
    average_value = mean(value, na.rm = TRUE),
    median_value = median(value, na.rm = TRUE),
    sd_value = sd(value, na.rm = TRUE),
    .groups = "drop"
  )

total_trend_change <- total_indicator_trend %>%
  arrange(indicator, yearstart) %>%
  group_by(indicator) %>%
  summarise(
    first_year = first(yearstart),
    first_average = first(average_value),
    last_year = last(yearstart),
    last_average = last(average_value),
    absolute_change = last_average - first_average,
    relative_change_pct = absolute_change / first_average * 100,
    .groups = "drop"
  )

trend_change_comparison <- bind_rows(
  trend_change %>%
    mutate(analysis_basis = "All selected records"),
  total_trend_change %>%
    mutate(analysis_basis = "Total records only")
) %>%
  select(
    analysis_basis,
    indicator,
    first_year,
    first_average,
    last_year,
    last_average,
    absolute_change
  )

total_check %>%
  kable(caption = "Coverage of Total-Stratification Records")

Coverage of Total-Stratification Records
indicator	records	years	locations
Adult obesity	589	11	55
No leisure-time physical activity	588	11	55

trend_change_comparison %>%
  kable(digits = 3, caption = "First-to-Last-Year Change Under Two Analysis Bases")

First-to-Last-Year Change Under Two Analysis Bases
analysis_basis	indicator	first_year	first_average	last_year	last_average	absolute_change
All selected records	Adult obesity	2011	27.736	2021	34.033	6.296
All selected records	No leisure-time physical activity	2011	26.492	2021	25.427	-1.065
Total records only	Adult obesity	2011	27.588	2021	33.537	5.949
Total records only	No leisure-time physical activity	2011	25.737	2021	24.187	-1.550

readr::write_csv(total_check, "project_output_total_stratification_check.csv")
readr::write_csv(total_indicator_trend, "project_output_total_indicator_trend.csv")
readr::write_csv(total_trend_change, "project_output_total_trend_change.csv")
readr::write_csv(trend_change_comparison, "project_output_trend_change_comparison.csv")

trend_change_comparison %>%
  mutate(
    analysis_basis = factor(
      analysis_basis,
      levels = c("All selected records", "Total records only")
    )
  ) %>%
  ggplot(aes(x = indicator, y = absolute_change, fill = analysis_basis)) +
  geom_col(width = 0.64, position = position_dodge(width = 0.72)) +
  geom_text(
    aes(label = round(absolute_change, 2)),
    position = position_dodge(width = 0.72),
    vjust = if_else(trend_change_comparison$absolute_change >= 0, -0.35, 1.25),
    colour = "#243447",
    size = 3.6
  ) +
  geom_hline(yintercept = 0, colour = "#5D6975", linewidth = 0.5) +
  scale_fill_manual(values = c(
    "All selected records" = "#A7B1B7",
    "Total records only" = "#496A81"
  )) +
  scale_y_continuous(
    labels = scales::label_number(suffix = " pts"),
    expand = expansion(mult = c(0.16, 0.16))
  ) +
  labs(
    title = "Trend Change Robustness Check",
    subtitle = "First-to-last-year change remains similar after using Total-only records",
    x = NULL,
    y = "Change from first to last year"
  ) +
  theme_report()

The total records cover the indicators well for state-level analysis. Adult obesity total records span 11 years and 55 locations. On this Total-only basis, adult obesity still rises from 27.59% to 33.54%, while physical inactivity shows no clear upward or downward trend and oscillates around a broadly similar level. The robustness plot does not repeat the first trend chart; instead, it compares the first-to-last-year change under the all-record and Total-only bases. The obesity increase remains similar under both bases, so the upward obesity pattern is not simply an artefact of repeated demographic subgroup records.

Two limits apply to this trend. First, it is an unweighted average of available location-level total records. Each location has the same weight, so a large state and a small state are treated the same. It should be read as a state-level surveillance trend, not a population-weighted national prevalence estimate. Second, 2020 and 2021 overlap with the COVID-19 pandemic period, so changes in those years should be read carefully: they may reflect both real public health change and data-collection context.

6 Location and Demographic Comparison

Trends describe the when; location and demographic comparisons describe the where and who. This section first compares locations on a total-stratification basis, then uses demographic stratification records to identify group-level differences in obesity prevalence.

6.1 Location Ranking

The location ranking uses Total obesity records only, so each location is compared at the same level and locations with more subgroup records do not get extra weight. To reduce instability from very short series, the ranking keeps only locations with at least eight years of total-stratification obesity records.

min_years_for_location <- 8

location_obesity <- obesity_total %>%
  filter(locationdesc != "National") %>%
  group_by(locationdesc) %>%
  summarise(
    records = n(),
    years = n_distinct(yearstart),
    average_obesity = mean(data_value, na.rm = TRUE),
    median_obesity = median(data_value, na.rm = TRUE),
    min_obesity = min(data_value, na.rm = TRUE),
    max_obesity = max(data_value, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  filter(years >= min_years_for_location) %>%
  arrange(desc(average_obesity))

top_obesity_locations <- location_obesity %>%
  slice_max(average_obesity, n = 10, with_ties = FALSE)

low_obesity_locations <- location_obesity %>%
  slice_min(average_obesity, n = 10, with_ties = FALSE)

location_obesity_extremes <- bind_rows(
  top_obesity_locations %>% mutate(group = "Highest average obesity"),
  low_obesity_locations %>% mutate(group = "Lowest average obesity")
)

top_obesity_locations %>%
  kable(digits = 3, caption = "Top 10 Locations by Average Adult Obesity")

Top 10 Locations by Average Adult Obesity
locationdesc	records	years	average_obesity	median_obesity	min_obesity	max_obesity
Mississippi	11	11	37.218	37.3	34.6	40.8
West Virginia	11	11	37.027	37.7	32.4	40.6
Louisiana	11	11	35.764	35.9	33.1	38.6
Arkansas	11	11	35.518	35.7	30.9	38.7
Alabama	11	11	35.427	35.7	32.0	39.9
Kentucky	11	11	34.509	34.3	30.4	40.3
Oklahoma	11	11	34.491	33.9	31.1	39.4
Tennessee	11	11	33.464	33.8	29.2	36.5
Indiana	11	11	33.327	32.7	30.8	36.8
South Carolina	11	11	33.300	32.3	30.8	36.2

low_obesity_locations %>%
  kable(digits = 3, caption = "Lowest 10 Locations by Average Adult Obesity")

Lowest 10 Locations by Average Adult Obesity
locationdesc	records	years	average_obesity	median_obesity	min_obesity	max_obesity
Colorado	11	11	22.273	22.3	20.2	25.1
District of Columbia	11	11	23.218	23.0	21.7	24.7
Hawaii	11	11	23.545	23.8	21.8	25.0
Massachusetts	11	11	24.455	24.3	22.7	27.4
California	11	11	25.618	25.0	23.8	30.3
New York	11	11	26.073	25.7	23.6	29.1
Vermont	11	11	26.164	26.3	23.7	29.0
New Jersey	10	10	26.340	26.6	23.7	28.2
Montana	11	11	26.345	25.5	23.6	31.8
Utah	11	11	26.382	25.4	24.1	30.9

readr::write_csv(location_obesity, "project_output_location_obesity_ranking.csv")
readr::write_csv(location_obesity_extremes, "project_output_location_obesity_extremes.csv")

location_obesity_extremes %>%
  mutate(
    location_label = reorder(locationdesc, average_obesity),
    group = factor(group, levels = c("Highest average obesity", "Lowest average obesity"))
  ) %>%
  ggplot(aes(x = location_label, y = average_obesity, fill = group)) +
  geom_col(width = 0.72) +
  geom_text(
    aes(label = round(average_obesity, 1)),
    hjust = -0.12,
    size = 3.2,
    colour = "#243447"
  ) +
  coord_flip() +
  facet_wrap(~group, scales = "free") +
  scale_fill_manual(values = c(
    "Highest average obesity" = bar_fill_high,
    "Lowest average obesity" = bar_fill_low
  )) +
  scale_y_continuous(
    labels = scales::label_number(suffix = "%"),
    expand = expansion(mult = c(0, 0.14))
  ) +
  labs(
    title = "Location Differences in Adult Obesity",
    subtitle = paste0("Total-stratification records; locations with at least ", min_years_for_location, " years of data"),
    x = NULL,
    y = "Average adult obesity"
  ) +
  theme_report()

The ranking shows clear differences between locations, so obesity is not evenly spread across places. The highest average obesity locations are led by Mississippi, West Virginia, and Louisiana, which form a clear concentration in Southern and Appalachian areas. The lowest are led by Colorado, District of Columbia, and Hawaii. This is still descriptive: it shows where higher obesity values appear, not why. The gap between the highest and lowest groups is large, which is why locationdesc is later included in the regression model.

6.2 Demographic Stratification

The demographic comparison uses subgroup records only. Total records are removed so that overall rates are not mixed with subgroup rates. The aim is to see which age, education, income, gender, and race or ethnicity groups have higher average obesity values. These results describe public health profiles, not individual biological risk.

obesity_stratification <- obesity_data %>%
  filter(
    !is.na(stratificationcategory1),
    !is.na(stratification1),
    !str_to_lower(as.character(stratificationcategory1)) %in% c("total", "overall"),
    !str_to_lower(as.character(stratification1)) %in% c("total", "overall")
  ) %>%
  group_by(stratificationcategory1, stratification1) %>%
  summarise(
    records = n(),
    years = n_distinct(yearstart),
    locations = n_distinct(locationdesc),
    average_obesity = mean(data_value, na.rm = TRUE),
    median_obesity = median(data_value, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(stratificationcategory1, desc(average_obesity))

stratification_gap <- obesity_stratification %>%
  group_by(stratificationcategory1) %>%
  arrange(desc(average_obesity), .by_group = TRUE) %>%
  summarise(
    highest_group = first(stratification1),
    highest_average = first(average_obesity),
    lowest_group = last(stratification1),
    lowest_average = last(average_obesity),
    within_category_gap = highest_average - lowest_average,
    groups_compared = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(within_category_gap))

obesity_stratification %>%
  kable(digits = 3, caption = "Average Adult Obesity by Demographic Stratification Group")

Average Adult Obesity by Demographic Stratification Group
stratificationcategory1	stratification1	records	years	locations	average_obesity	median_obesity
Age (years)	45 - 54	589	11	55	35.913	35.50
Age (years)	55 - 64	589	11	55	34.831	34.90
Age (years)	35 - 44	589	11	55	33.902	33.60
Age (years)	25 - 34	589	11	55	28.883	28.80
Age (years)	65 or older	589	11	55	27.877	28.00
Age (years)	18 - 24	589	11	55	17.751	17.30
Education	Less than high school	589	11	55	34.207	34.30
Education	High school graduate	589	11	55	32.424	32.40
Education	Some college or technical school	589	11	55	31.849	31.80
Education	College graduate	589	11	55	24.414	24.30
Gender	Male	589	11	55	30.218	30.30
Gender	Female	589	11	55	30.173	29.80
Income	Less than $15,000	589	11	55	35.324	35.40
Income	$15,000 - $24,999	589	11	55	33.840	33.90
Income	$25,000 - $34,999	589	11	55	32.313	32.10
Income	$35,000 - $49,999	589	11	55	31.980	32.00
Income	$50,000 - $74,999	588	11	55	31.338	31.15
Income	$75,000 or greater	589	11	55	27.418	27.00
Income	Data not reported	589	11	55	25.834	25.70
Race/Ethnicity	Hawaiian/Pacific Islander	41	11	8	39.715	41.00
Race/Ethnicity	Non-Hispanic Black	473	11	47	38.303	38.70
Race/Ethnicity	American Indian/Alaska Native	388	11	46	36.869	37.50
Race/Ethnicity	Hispanic	573	11	55	32.056	32.10
Race/Ethnicity	2 or more races	540	11	53	32.047	31.85
Race/Ethnicity	Non-Hispanic White	580	11	54	28.566	29.10
Race/Ethnicity	Other	196	11	42	27.281	26.75
Race/Ethnicity	Asian	386	11	40	10.967	10.20

stratification_gap %>%
  kable(digits = 3, caption = "Within-Category Obesity Gaps")

Within-Category Obesity Gaps
stratificationcategory1	highest_group	highest_average	lowest_group	lowest_average	within_category_gap	groups_compared
Race/Ethnicity	Hawaiian/Pacific Islander	39.715	Asian	10.967	28.748	8
Age (years)	45 - 54	35.913	18 - 24	17.751	18.162	6
Education	Less than high school	34.207	College graduate	24.414	9.794	4
Income	Less than $15,000	35.324	Data not reported	25.834	9.489	7
Gender	Male	30.218	Female	30.173	0.045	2

readr::write_csv(obesity_stratification, "project_output_obesity_stratification_profile.csv")
readr::write_csv(stratification_gap, "project_output_obesity_stratification_gap.csv")

obesity_stratification %>%
  mutate(
    stratification_label = reorder(stratification1, average_obesity)
  ) %>%
  ggplot(aes(x = stratification_label, y = average_obesity)) +
  geom_col(fill = bar_fill_neutral, width = 0.72) +
  coord_flip() +
  facet_wrap(~stratificationcategory1, scales = "free") +
  scale_y_continuous(labels = scales::label_number(suffix = "%")) +
  labs(
    title = "Adult Obesity by Demographic Stratification Group",
    subtitle = "Average values across available years and locations",
    x = NULL,
    y = "Average adult obesity"
  ) +
  theme_report(base_size = 11)

The chart shows that obesity differs across reported demographic groups, not only across places. The largest within-category gap appears in Race/Ethnicity, where Hawaiian/Pacific Islander has the highest average obesity value and Asian has the lowest. The results suggest that obesity is higher in some middle-aged, lower education, and lower income profiles. Race or ethnicity shows a wide spread, while gender differences are small. This is why stratification1 is later included in the regression model and also used as profile-level information in the classification screening model.

This does not mean any group causes obesity. It only shows that obesity values differ across reported profiles. Two limits apply. These averages are not population-weighted demographic prevalence; they are averages across available subgroup records. They are also averaged across years and locations, so a group whose records come mostly from higher-obesity locations can have its average pulled up by that coverage mix rather than by the demographic factor itself. In addition, some race or ethnicity groups have fewer available records and locations than others, so extreme subgroup averages should be read with coverage caution. Overall, these results should be read as profile-level patterns.

7 Aligned Relationship and Robustness Checks

7.1 Aligned Obesity-Inactivity Relationship

Obesity and physical inactivity come from separate rows, so their relationship should not be measured by comparing all obesity records with all inactivity records. A fair comparison matches the two indicators by the same year, location, stratification category, and stratification group. This gives one row per public health profile, with both values attached. After this alignment, the report keeps two versions: a broad aligned dataset for descriptive comparison and a subgroup-only aligned dataset for regression, classification, risk ranking, intervention matrix, and threshold sensitivity analysis.

alignment_keys <- c("yearstart", "locationdesc")

if ("stratificationcategory1" %in% names(obesity_data) &&
    "stratificationcategory1" %in% names(inactivity_data)) {
  alignment_keys <- c(alignment_keys, "stratificationcategory1")
}

if ("stratification1" %in% names(obesity_data) &&
    "stratification1" %in% names(inactivity_data)) {
  alignment_keys <- c(alignment_keys, "stratification1")
}

obesity_for_join <- obesity_data %>%
  select(all_of(alignment_keys), obesity_value = data_value) %>%
  group_by(across(all_of(alignment_keys))) %>%
  summarise(obesity_value = mean(obesity_value, na.rm = TRUE), .groups = "drop")

inactivity_for_join <- inactivity_data %>%
  select(all_of(alignment_keys), inactivity_value = data_value) %>%
  group_by(across(all_of(alignment_keys))) %>%
  summarise(inactivity_value = mean(inactivity_value, na.rm = TRUE), .groups = "drop")

aligned_health <- obesity_for_join %>%
  inner_join(inactivity_for_join, by = alignment_keys)

modelling_aligned_health <- aligned_health %>%
  filter(!find_total_records(.))

aligned_scope_summary <- tibble(
  analysis_scope = c(
    "All aligned profiles",
    "Subgroup aligned profiles used for modelling"
  ),
  profiles = c(nrow(aligned_health), nrow(modelling_aligned_health)),
  total_or_overall_profiles_excluded = c(
    0,
    nrow(aligned_health) - nrow(modelling_aligned_health)
  )
)

aligned_summary <- aligned_health %>%
  summarise(
    aligned_profiles = n(),
    obesity_mean = mean(obesity_value, na.rm = TRUE),
    inactivity_mean = mean(inactivity_value, na.rm = TRUE),
    obesity_q75 = quantile(obesity_value, 0.75, na.rm = TRUE),
    inactivity_q75 = quantile(inactivity_value, 0.75, na.rm = TRUE),
    correlation = cor(obesity_value, inactivity_value, use = "complete.obs")
  )

modelling_aligned_summary <- modelling_aligned_health %>%
  summarise(
    aligned_profiles = n(),
    obesity_mean = mean(obesity_value, na.rm = TRUE),
    inactivity_mean = mean(inactivity_value, na.rm = TRUE),
    obesity_q75 = quantile(obesity_value, 0.75, na.rm = TRUE),
    inactivity_q75 = quantile(inactivity_value, 0.75, na.rm = TRUE),
    correlation = cor(obesity_value, inactivity_value, use = "complete.obs")
  )

aligned_summary_compare <- bind_rows(
  aligned_summary %>% mutate(analysis_scope = "All aligned profiles"),
  modelling_aligned_summary %>% mutate(analysis_scope = "Subgroup aligned modelling profiles")
) %>%
  select(analysis_scope, everything())

aligned_quadrant_preview <- modelling_aligned_health %>%
  mutate(
    obesity_level = if_else(
      obesity_value >= modelling_aligned_summary$obesity_q75,
      "High obesity",
      "Lower obesity"
    ),
    inactivity_level = if_else(
      inactivity_value >= modelling_aligned_summary$inactivity_q75,
      "High inactivity",
      "Lower inactivity"
    ),
    profile_group = case_when(
      obesity_level == "High obesity" & inactivity_level == "High inactivity" ~
        "High obesity + high inactivity",
      obesity_level == "High obesity" & inactivity_level == "Lower inactivity" ~
        "High obesity only",
      obesity_level == "Lower obesity" & inactivity_level == "High inactivity" ~
        "High inactivity only",
      TRUE ~ "Lower obesity + lower inactivity"
    )
  ) %>%
  count(profile_group, name = "profiles") %>%
  mutate(share = profiles / sum(profiles)) %>%
  arrange(desc(profiles))

aligned_scope_summary %>%
  kable(digits = 3, caption = "Aligned Dataset Scope for Descriptive and Modelling Analysis")

Aligned Dataset Scope for Descriptive and Modelling Analysis
analysis_scope	profiles	total_or_overall_profiles_excluded
All aligned profiles	14922	0
Subgroup aligned profiles used for modelling	14334	588

aligned_summary_compare %>%
  kable(digits = 3, caption = "Aligned Obesity-Inactivity Relationship Summary by Analysis Scope")

Aligned Obesity-Inactivity Relationship Summary by Analysis Scope
analysis_scope	aligned_profiles	obesity_mean	inactivity_mean	obesity_q75	inactivity_q75	correlation
All aligned profiles	14922	30.465	25.924	35.1	31.1	0.458
Subgroup aligned modelling profiles	14334	30.475	25.964	35.2	31.3	0.457

aligned_quadrant_preview %>%
  kable(digits = 3, caption = "Exploratory Subgroup Profile Groups Based on 75th Percentile Cut-Offs")

Exploratory Subgroup Profile Groups Based on 75th Percentile Cut-Offs
profile_group	profiles	share
Lower obesity + lower inactivity	8779	0.612
High obesity only	1970	0.137
High inactivity only	1960	0.137
High obesity + high inactivity	1625	0.113

readr::write_csv(aligned_health, "project_output_aligned_obesity_inactivity_all_profiles.csv")
readr::write_csv(modelling_aligned_health, "project_output_aligned_obesity_inactivity_subgroup_profiles.csv")
readr::write_csv(aligned_scope_summary, "project_output_aligned_scope_summary.csv")
readr::write_csv(aligned_summary_compare, "project_output_aligned_relationship_summary.csv")
readr::write_csv(aligned_quadrant_preview, "project_output_aligned_quadrant_preview.csv")

set.seed(7004)

modelling_aligned_health %>%
  slice_sample(n = min(5000, nrow(modelling_aligned_health))) %>%
  ggplot(aes(x = inactivity_value, y = obesity_value)) +
  geom_point(alpha = 0.28, colour = "#385A6B", size = 1.6) +
  geom_smooth(method = "lm", se = TRUE, colour = "#B4443F", fill = "#E6B8B4") +
  geom_vline(
    xintercept = modelling_aligned_summary$inactivity_q75,
    linetype = "dashed",
    colour = "#5D6975"
  ) +
  geom_hline(
    yintercept = modelling_aligned_summary$obesity_q75,
    linetype = "dashed",
    colour = "#5D6975"
  ) +
  scale_x_continuous(labels = scales::label_number(suffix = "%")) +
  scale_y_continuous(labels = scales::label_number(suffix = "%")) +
  labs(
    title = "Aligned Relationship Between Physical Inactivity and Adult Obesity",
    subtitle = "Subgroup aligned profiles; dashed lines mark 75th percentile cut-offs",
    x = "No leisure-time physical activity",
    y = "Adult obesity"
  ) +
  theme_report()

After alignment, the broad aligned dataset has 14,922 comparable public health profiles. For modelling-related analysis, Total or Overall rows are excluded, leaving 14,334 subgroup aligned profiles. The all-profile aligned correlation is 0.458, while the subgroup modelling-scope correlation is 0.457. Both are positive, but they should be read as descriptive profile-level associations rather than causal effects. The subgroup aligned version is used for the within-profile check, lagged association check, regression adjustment ladder, classification model, risk ranking, intervention matrix, and threshold sensitivity analysis.

The alignment uses an inner join, so it keeps only profiles that have a record for both obesity and inactivity in the same year, location, and stratification group. The surveillance panel is mildly unbalanced (some locations or subgroups are not screened for every indicator in every year), so this forces the comparison onto a fully overlapping set of profiles. That keeps the correlation basis exactly consistent across the two indicators, at the cost of dropping profiles seen for only one of them. Removing aggregate Total or Overall rows for the modelling scope also avoids treating statewide totals as if they were independent demographic subgroup observations.

7.2 Within-Profile Robustness Check

The headline all-profile correlation still mixes two different things: stable differences between profiles (some locations and groups are simply higher on both indicators) and year-to-year changes within the same profile. To separate these, this check uses the subgroup aligned modelling dataset, removes the average value of each profile, and then re-correlates. It therefore focuses on whether obesity and inactivity still move together within the same location x stratification category x stratification group profile over time. This is a stricter check than the raw aligned correlation. Profiles with fewer than three years are dropped, because a one- or two-year profile has almost no within variation and would pile up near zero and distort the result.

# Within-profile check uses the subgroup aligned modelling dataset,
# so it is on the same basis as the regression and classification sections.
within_base <- modelling_aligned_health

within_corr_df <- within_base %>%
  group_by(locationdesc, stratificationcategory1, stratification1) %>%
  filter(n() >= 3) %>%
  mutate(
    obesity_within = obesity_value - mean(obesity_value, na.rm = TRUE),
    inactivity_within = inactivity_value - mean(inactivity_value, na.rm = TRUE)
  ) %>%
  ungroup()

raw_aligned_corr <- cor(
  aligned_health$obesity_value,
  aligned_health$inactivity_value,
  use = "complete.obs"
)

subgroup_aligned_corr <- cor(
  modelling_aligned_health$obesity_value,
  modelling_aligned_health$inactivity_value,
  use = "complete.obs"
)

within_profile_corr <- cor(
  within_corr_df$obesity_within,
  within_corr_df$inactivity_within,
  use = "complete.obs"
)

within_corr_compare <- tibble(
  correlation_type = c(
    "Raw aligned correlation (all profiles)",
    "Raw aligned correlation (subgroup modelling profiles)",
    "Within-profile demeaned correlation (subgroup profiles, >= 3 years)"
  ),
  profiles_used = c(
    sum(complete.cases(aligned_health[, c("obesity_value", "inactivity_value")])),
    sum(complete.cases(modelling_aligned_health[, c("obesity_value", "inactivity_value")])),
    nrow(within_corr_df)
  ),
  correlation = c(raw_aligned_corr, subgroup_aligned_corr, within_profile_corr)
)

within_corr_compare %>%
  kable(digits = 3, caption = "Raw Versus Within-Profile Demeaned Correlation")

Raw Versus Within-Profile Demeaned Correlation
correlation_type	profiles_used	correlation
Raw aligned correlation (all profiles)	14922	0.458
Raw aligned correlation (subgroup modelling profiles)	14334	0.457
Within-profile demeaned correlation (subgroup profiles, >= 3 years)	14259	0.073

readr::write_csv(within_corr_compare, "project_output_within_profile_correlation.csv")

The raw aligned correlation is 0.458 across all aligned profiles and 0.457 within the subgroup modelling scope. The within-profile demeaned correlation is 0.073. Because the within-profile figure uses demographic subgroup profiles only, it is on the same basis as the regression adjustment ladder and classification model later in the report. The within-profile correlation is much smaller, which suggests that much of the subgroup obesity-inactivity relationship comes from stable differences between locations and demographic profiles, not from year-to-year movement within the same profile. In plain terms, some profiles are high on both indicators because of their location and demographic structure; the relationship is weaker once we only look at changes within the same profile over time. One more caveat applies even to the within number: because both indicators drift upward over the decade within a profile, a positive within-profile correlation can partly reflect a shared time trend, so it is not removed here and the result is treated as descriptive only.

7.3 Exploratory Lagged Association Check

The aligned relationship above uses same-year subgroup profiles. The next question is whether earlier physical inactivity is related to later obesity in the same location and stratification group. This check uses the subgroup aligned modelling dataset and compares inactivity with obesity at lags of 0, 1, and 2 years. It is only a descriptive check, not a forecasting model and not a causal test.

lag_keys <- setdiff(alignment_keys, "yearstart")

lag_correlations <- map_dfr(0:2, function(k) {
  inactivity_part <- modelling_aligned_health %>%
    transmute(
      across(all_of(lag_keys)),
      match_year = yearstart + k,
      inactivity_value
    )

  obesity_part <- modelling_aligned_health %>%
    transmute(
      across(all_of(lag_keys)),
      match_year = yearstart,
      obesity_value
    )

  joined <- inner_join(
    inactivity_part,
    obesity_part,
    by = c(lag_keys, "match_year")
  )

  tibble(
    lag_years = k,
    matched_pairs = nrow(joined),
    correlation = cor(
      joined$inactivity_value,
      joined$obesity_value,
      use = "complete.obs"
    )
  )
})

lag_correlations %>%
  kable(digits = 3, caption = "Lagged Correlation Between Inactivity and Later Obesity")

Lagged Correlation Between Inactivity and Later Obesity
lag_years	matched_pairs	correlation
0	14334	0.457
1	12795	0.461
2	11473	0.478

readr::write_csv(lag_correlations, "project_output_lag_correlations.csv")

lag_correlations %>%
  ggplot(aes(x = factor(lag_years), y = correlation, group = 1)) +
  geom_col(fill = bar_fill_neutral, width = 0.62) +
  geom_text(
    aes(label = paste0("r = ", round(correlation, 3), "\n", "n = ", scales::comma(matched_pairs))),
    vjust = -0.35,
    colour = "#243447",
    size = 3.5,
    lineheight = 0.9
  ) +
  scale_y_continuous(limits = c(0, max(lag_correlations$correlation) + 0.12)) +
  labs(
    title = "Lagged Association Between Physical Inactivity and Later Obesity",
    subtitle = "Subgroup aligned profiles; lag 1 and lag 2 compare earlier inactivity with later obesity",
    x = "Lag in years",
    y = "Correlation (r)"
  ) +
  theme_report()

The lagged correlations are 0.457, 0.461, 0.478 for lag 0, lag 1, and lag 2. They stay positive and do not weaken much as the lag grows, although the number of matched pairs decreases at longer lags. This pattern may be linked to stable differences between locations and demographic subgroup profiles. Therefore, it should be read as a stable subgroup-level association, not as strong early-warning evidence. No out-of-sample forecasting is done here.

8 Regression Model

8.1 Modelling Objective and Design

The regression section addresses the third research question:

Can year, location, and demographic stratification variables explain variation in adult obesity prevalence?

The goal is not to predict obesity for one person. The target is the adult obesity percentage for an aggregated public health profile. In simple terms, the model checks why obesity values differ across years, locations, and demographic groups in the surveillance data.

This section has two parts. First, it fits a structural additive model with year, location, and demographic group. This model summarises the main patterns already seen in the EDA. Second, a later subsection uses an adjustment ladder to test whether physical inactivity still explains obesity after year, location, and demographic group are considered. The second part is more important for the obesity-inactivity question, because it answers something the EDA cannot: how much of the association survives once geography and demographics are controlled.

The regression target is:

Adult obesity data_value

The main predictors are:

yearstart + locationdesc + stratification1

stratificationcategory1 is not in the main formula because stratification1 already gives the specific demographic group. In this dataset the category and group fields are nested, so using both adds repeated dummy variables and makes the model harder to read. Variables that are directly derived from the reported estimate, such as the confidence limits and alternative value fields, are not used, because they would leak the indicator value itself. Sample size is different: it is not a leakage variable, but it is also not used as a predictor in this simple model. It records how precise each estimate is, and future work could use it for weighting (for example inverse-variance weighting), so small-sample estimates have less influence.

This regression uses demographic subgroup records only. The Total rows are overall state-year rates, that is, aggregates of the subgroup rows in the same state and year. They sit at a different level and are not independent of the subgroups. If both are used together, the same population may be represented more than once, which can make the regression look more stable than it really is. Total rows are kept for the EDA trend but removed here, which also matches the research question’s focus on demographic stratification.

This setup is simple and easy to interpret. It uses the same profile dimensions as the EDA: time, place, and demographic group. But subgroup rows are not fully independent: in the same state and year, the age, gender, income, education, and race or ethnicity rows are different views of the same survey sample. Because of this, the coefficients are used as descriptive, model-adjusted differences, not as formal statistical inference. The standard errors and p-values from lm() are not used for the main interpretation, because they would understate the uncertainty under this dependence.

Two models are fitted:

Linear regression, used as the main interpretable model.
Regression tree, used as a simple non-linear benchmark.

8.2 Train-Test Split and Model Fitting

regression_predictors <- c(
  "yearstart",
  "locationdesc",
  "stratification1"
)

regression_predictors <- regression_predictors[
  regression_predictors %in% names(obesity_data)
]

regression_df <- obesity_data %>%
  filter(
    !str_to_lower(as.character(stratificationcategory1)) %in% c("total", "overall"),
    !str_to_lower(as.character(stratification1)) %in% c("total", "overall")
  ) %>%
  select(data_value, all_of(regression_predictors)) %>%
  drop_na(data_value) %>%
  mutate(across(where(is.character), as.factor))

set.seed(7004)
regression_train_index <- sample.int(
  nrow(regression_df),
  floor(0.8 * nrow(regression_df))
)

reg_train <- regression_df[regression_train_index, ]
reg_test <- regression_df[-regression_train_index, ]

drop_unseen_levels <- function(train, test) {
  keep <- rep(TRUE, nrow(test))

  for (column_name in names(test)) {
    if (is.factor(test[[column_name]])) {
      seen_levels <- unique(as.character(train[[column_name]]))
      keep <- keep & as.character(test[[column_name]]) %in% seen_levels
    }
  }

  list(
    test = test[keep, ],
    dropped = sum(!keep)
  )
}

reg_guard <- drop_unseen_levels(reg_train, reg_test)
reg_test <- reg_guard$test
reg_train <- droplevels(reg_train)

regression_design_summary <- tibble(
  target = "Adult obesity data_value",
  predictors = paste(regression_predictors, collapse = " + "),
  modelling_rows = nrow(regression_df),
  train_rows = nrow(reg_train),
  test_rows_after_guard = nrow(reg_test),
  test_rows_dropped_unseen_levels = reg_guard$dropped
)

reg_formula <- as.formula(
  paste("data_value ~", paste(regression_predictors, collapse = " + "))
)

reg_lm <- lm(reg_formula, data = reg_train)
reg_tree <- rpart(reg_formula, data = reg_train, method = "anova")

reg_pred_lm <- predict(reg_lm, newdata = reg_test)
reg_pred_tree <- predict(reg_tree, newdata = reg_test)

reg_baseline_pred <- rep(mean(reg_train$data_value, na.rm = TRUE), nrow(reg_test))
reg_baseline_rmse <- sqrt(mean((reg_test$data_value - reg_baseline_pred)^2))

regression_metric_row <- function(model_name, actual, predicted, n_train, baseline_rmse) {
  rmse_value <- sqrt(mean((actual - predicted)^2))

  tibble(
    model = model_name,
    target = "Adult obesity data_value",
    n_train = n_train,
    n_test = length(actual),
    rmse = rmse_value,
    mae = mean(abs(actual - predicted)),
    baseline_rmse = baseline_rmse,
    rmse_improvement_pct = (baseline_rmse - rmse_value) / baseline_rmse * 100,
    test_r2 = 1 - sum((actual - predicted)^2) /
      sum((actual - mean(actual))^2)
  )
}

regression_metrics <- bind_rows(
  regression_metric_row(
    "Linear regression",
    reg_test$data_value,
    reg_pred_lm,
    nrow(reg_train),
    reg_baseline_rmse
  ),
  regression_metric_row(
    "Regression tree",
    reg_test$data_value,
    reg_pred_tree,
    nrow(reg_train),
    reg_baseline_rmse
  )
)

regression_design_summary %>%
  kable(caption = "Regression Modelling Design Summary")

Regression Modelling Design Summary
target	predictors	modelling_rows	train_rows	test_rows_after_guard	test_rows_dropped_unseen_levels
Adult obesity data_value	yearstart + locationdesc + stratification1	14367	11493	2874	0

regression_metrics %>%
  kable(digits = 4, caption = "Regression Model Performance on Test Data")

Regression Model Performance on Test Data
model	target	n_train	n_test	rmse	mae	baseline_rmse	rmse_improvement_pct	test_r2
Linear regression	Adult obesity data_value	11493	2874	3.7328	2.5359	7.4214	49.7016	0.7469
Regression tree	Adult obesity data_value	11493	2874	4.4940	3.3592	7.4214	39.4460	0.6332

readr::write_csv(regression_design_summary, "project_output_regression_design_summary.csv")
readr::write_csv(regression_metrics, "project_output_regression_metrics.csv")

After the train-test split, the unseen-factor guard removed 0 test records, so the test set has no unseen location or stratification levels. Under this row-level random split, the linear regression explains 74.7% of the test-set variation in adult obesity, with an RMSE of 3.73 percentage points. This should be read as within-panel explanation, not forecasting skill. The same locations and demographic groups appear in both the training and test sets, so the model is mainly recovering already-observed profile means. The mean-only baseline is also weak, so beating it by 49.7% mainly shows that location and group structure is useful, not that the model can predict new locations or groups.

The regression tree gives a simple non-linear comparison. In this run it does not improve on the linear model. So the linear regression is kept as the main model, because it is simpler, clearer, and fits the observed profiles at least as well under the same split.

regression_metrics %>%
  select(model, rmse, mae, test_r2) %>%
  pivot_longer(
    cols = c(rmse, mae, test_r2),
    names_to = "metric",
    values_to = "value"
  ) %>%
  mutate(
    metric = recode(
      metric,
      rmse = "RMSE",
      mae = "MAE",
      test_r2 = "Test R-squared"
    )
  ) %>%
  ggplot(aes(x = model, y = value, fill = model)) +
  geom_col(width = 0.62) +
  facet_wrap(~metric, scales = "free_y") +
  scale_fill_manual(values = c(
    "Linear regression" = "#B4443F",
    "Regression tree" = "#496A81"
  )) +
  labs(
    title = "Regression Model Performance Comparison",
    subtitle = "Lower RMSE/MAE is better; higher test R-squared is better",
    x = NULL,
    y = NULL
  ) +
  theme_report(base_size = 11) +
  theme(axis.text.x = element_text(angle = 15, hjust = 1))

The comparison points the same way. Under the row-level split, the linear regression has lower error than the simple regression tree. This suggests that additive location and demographic differences capture much of the observed profile-level variation. Again, this is within-panel explanation, not future forecasting ability, and the linear model is preferred mainly because it is simpler and easier to explain in a public health report.

8.3 Observed Versus Predicted Values

regression_predictions <- reg_test %>%
  mutate(
    predicted_obesity = reg_pred_lm,
    residual = data_value - predicted_obesity,
    absolute_residual = abs(residual)
  )

regression_error_summary <- regression_predictions %>%
  summarise(
    mean_observed = mean(data_value, na.rm = TRUE),
    mean_predicted = mean(predicted_obesity, na.rm = TRUE),
    mean_residual = mean(residual, na.rm = TRUE),
    median_absolute_residual = median(absolute_residual, na.rm = TRUE),
    pct_within_3_points = mean(absolute_residual <= 3) * 100,
    pct_within_5_points = mean(absolute_residual <= 5) * 100
  )

regression_error_summary %>%
  kable(digits = 3, caption = "Linear Regression Prediction Error Summary")

Linear Regression Prediction Error Summary
mean_observed	mean_predicted	mean_residual	median_absolute_residual	pct_within_3_points	pct_within_5_points
30.58	30.554	0.026	1.743	72.199	87.752

readr::write_csv(regression_predictions, "project_output_regression_predictions.csv")
readr::write_csv(regression_error_summary, "project_output_regression_error_summary.csv")

set.seed(7004)

regression_predictions %>%
  slice_sample(n = min(5000, nrow(regression_predictions))) %>%
  ggplot(aes(x = predicted_obesity, y = data_value)) +
  geom_point(alpha = 0.3, colour = "#385A6B", size = 1.6) +
  geom_abline(intercept = 0, slope = 1, colour = "#B4443F", linewidth = 1) +
  scale_x_continuous(labels = scales::label_number(suffix = "%")) +
  scale_y_continuous(labels = scales::label_number(suffix = "%")) +
  labs(
    title = "Observed Versus Predicted Adult Obesity",
    subtitle = "Linear regression predictions on the held-out test set",
    x = "Predicted adult obesity",
    y = "Observed adult obesity"
  ) +
  theme_report()

Most points lie close to the 45-degree line, so the model captures the main structure of obesity differences across public health profiles. The median absolute residual is 1.74 percentage points, and 87.8% of test records are predicted within five percentage points of the observed value.

9 Regression Diagnostics / Interpretation

9.1 Cross-Validation Stability Check

A single train-test split can be lucky or unlucky. To check that the result is not driven by one split, a simple 5-fold cross-validation is run for the linear regression. No hyperparameters are tuned; the goal is only to check whether performance is stable across different row-level splits.

rng_state <- .Random.seed

set.seed(2026)
cv_folds <- sample(rep(1:5, length.out = nrow(regression_df)))

cv_results <- map_dfr(1:5, function(fold_id) {
  cv_train <- regression_df[cv_folds != fold_id, ]
  cv_test <- regression_df[cv_folds == fold_id, ]

  cv_guard <- drop_unseen_levels(cv_train, cv_test)
  cv_test <- cv_guard$test
  cv_train <- droplevels(cv_train)

  cv_fit <- lm(reg_formula, data = cv_train)
  cv_pred <- predict(cv_fit, newdata = cv_test)

  tibble(
    fold = fold_id,
    n_test = nrow(cv_test),
    rows_dropped = cv_guard$dropped,
    rmse = sqrt(mean((cv_test$data_value - cv_pred)^2)),
    test_r2 = 1 - sum((cv_test$data_value - cv_pred)^2) /
      sum((cv_test$data_value - mean(cv_test$data_value))^2)
  )
})

cv_summary <- cv_results %>%
  summarise(
    mean_rmse = mean(rmse),
    sd_rmse = sd(rmse),
    mean_test_r2 = mean(test_r2),
    sd_test_r2 = sd(test_r2),
    total_rows_dropped = sum(rows_dropped)
  )

cv_results %>%
  kable(digits = 4, caption = "Five-Fold Cross-Validation Results")

Five-Fold Cross-Validation Results
fold	n_test	rmse	test_r2
1	2874	3.5271	0.7697
2	2874	3.6339	0.7591
3	2873	3.7346	0.7518
4	2873	3.5833	0.7552
5	2873	3.5892	0.7637

cv_summary %>%
  kable(digits = 4, caption = "Cross-Validation Stability Summary")

Cross-Validation Stability Summary
mean_rmse	sd_rmse	mean_test_r2	sd_test_r2	total_rows_dropped
3.6136	0.0776	0.7599	0.007	0

readr::write_csv(cv_results, "project_output_regression_cv.csv")
readr::write_csv(cv_summary, "project_output_regression_cv_summary.csv")

.Random.seed <- rng_state

Across the five folds, the average test R-squared is 0.76, with a standard deviation of 0.007. The low variation shows the result is not caused by one special split. But this check is still limited: every fold shares most of the same locations and demographic groups, so it only checks interpolation stability within the observed panel. It does not test generalisation to new locations, future years, or unseen demographic groups, which would need leave-one-state-out or forward-in-time validation (noted as future work).

9.2 Residual Diagnostics

The diagnostics check whether the linear model is a reasonable working description of the data. The main questions are simple: are the residuals centred near zero, do the errors show strong patterns, and do a few hard-to-fit profiles dominate the results?

residual_diagnostics <- regression_predictions %>%
  transmute(
    observed = data_value,
    fitted = predicted_obesity,
    residual = residual,
    absolute_residual = absolute_residual,
    standardised_residual = as.numeric(scale(residual))
  )

residual_summary <- residual_diagnostics %>%
  summarise(
    mean_residual = mean(residual, na.rm = TRUE),
    sd_residual = sd(residual, na.rm = TRUE),
    median_residual = median(residual, na.rm = TRUE),
    p95_absolute_residual = quantile(abs(residual), 0.95, na.rm = TRUE),
    max_absolute_residual = max(abs(residual), na.rm = TRUE)
  )

large_residual_profiles <- regression_predictions %>%
  arrange(desc(absolute_residual)) %>%
  slice_head(n = 10)

residual_summary %>%
  kable(digits = 3, caption = "Residual Summary for Linear Regression")

Residual Summary for Linear Regression
mean_residual	sd_residual	median_residual	p95_absolute_residual	max_absolute_residual
0.026	3.733	0.078	7.72	23.691

large_residual_profiles %>%
  kable(digits = 3, caption = "Largest Absolute Residuals in the Test Set")

Largest Absolute Residuals in the Test Set
data_value	yearstart	locationdesc	stratification1	predicted_obesity	residual	absolute_residual
16.7	2018	Ohio	American Indian/Alaska Native	40.391	-23.691	23.691
54.3	2014	South Dakota	2 or more races	32.298	22.002	22.002
56.2	2013	Hawaii	Hawaiian/Pacific Islander	34.471	21.729	21.729
12.6	2021	Virgin Islands	Non-Hispanic White	34.279	-21.679	21.679
17.5	2018	Georgia	American Indian/Alaska Native	38.500	-21.000	21.000
15.0	2011	Maine	Non-Hispanic Black	34.260	-19.260	19.260
19.3	2011	Arkansas	American Indian/Alaska Native	37.986	-18.686	18.686
53.6	2014	Arkansas	2 or more races	35.166	18.434	18.434
8.6	2011	Georgia	Other	26.639	-18.039	18.039
17.4	2015	Kentucky	2 or more races	35.025	-17.625	17.625

readr::write_csv(residual_diagnostics, "project_output_regression_residual_diagnostics.csv")
readr::write_csv(residual_summary, "project_output_regression_residual_summary.csv")
readr::write_csv(large_residual_profiles, "project_output_regression_largest_residuals.csv")

residual_diagnostics %>%
  ggplot(aes(x = fitted, y = residual)) +
  geom_hline(yintercept = 0, colour = "#5D6975", linetype = "dashed") +
  geom_point(alpha = 0.3, colour = "#385A6B", size = 1.4) +
  geom_smooth(se = FALSE, colour = "#B4443F", linewidth = 0.9) +
  scale_x_continuous(labels = scales::label_number(suffix = "%")) +
  scale_y_continuous(labels = scales::label_number(suffix = " pts")) +
  labs(
    title = "Residuals Versus Fitted Values",
    x = "Fitted adult obesity",
    y = "Residual"
  ) +
  theme_report(base_size = 11)

residual_diagnostics %>%
  ggplot(aes(x = residual)) +
  geom_histogram(bins = 40, fill = "#496A81", colour = "white") +
  geom_vline(xintercept = 0, colour = "#B4443F", linewidth = 1) +
  scale_x_continuous(labels = scales::label_number(suffix = " pts")) +
  labs(
    title = "Residual Distribution",
    x = "Residual",
    y = "Profiles"
  ) +
  theme_report(base_size = 11)

qq_values <- qqnorm(residual_diagnostics$standardised_residual, plot.it = FALSE)
qq_data <- tibble(
  theoretical = qq_values$x,
  sample = qq_values$y
)

qq_data %>%
  ggplot(aes(x = theoretical, y = sample)) +
  geom_point(alpha = 0.35, colour = "#385A6B", size = 1.4) +
  geom_abline(intercept = 0, slope = 1, colour = "#B4443F", linewidth = 1) +
  labs(
    title = "Normal Q-Q Check",
    x = "Theoretical quantiles",
    y = "Sample quantiles"
  ) +
  theme_report(base_size = 11)

residual_diagnostics %>%
  mutate(sqrt_abs_standardised = sqrt(abs(standardised_residual))) %>%
  ggplot(aes(x = fitted, y = sqrt_abs_standardised)) +
  geom_point(alpha = 0.3, colour = "#385A6B", size = 1.4) +
  geom_smooth(se = FALSE, colour = "#B4443F", linewidth = 0.9) +
  scale_x_continuous(labels = scales::label_number(suffix = "%")) +
  labs(
    title = "Scale-Location Check",
    x = "Fitted adult obesity",
    y = "Sqrt(abs(standardised residual))"
  ) +
  theme_report(base_size = 11)

The residuals are centred close to zero, so the model does not strongly overpredict or underpredict obesity overall. The histogram and Q-Q plot show most residuals near zero, but the tails are heavier than a perfect normal distribution. So most profiles are fitted reasonably well, while a few profiles are still hard to explain using only year, location, and stratification group. The scale-location plot does not show an extreme funnel shape, but some heterogeneity is still expected: the rows are aggregated survey estimates, and their precision varies with sample size, which is not modelled here (no weighting is applied). So these plots are used as practical checks, not as a formal test that the constant-variance assumption fully holds.

9.3 Factor-Effect Interpretation

The metrics show the model captures much of the observed profile-level structure, but it is more useful once the coefficients are read. Because locationdesc and stratification1 are categorical, the linear regression uses dummy variables. Each location or stratification coefficient is a difference from a reference level, holding year and the other profile variables constant.

lm_coef_tbl <- tibble(
  term = names(coef(reg_lm)),
  estimate = as.numeric(coef(reg_lm))
) %>%
  filter(!is.na(estimate))

year_effect <- lm_coef_tbl %>%
  filter(term == "yearstart") %>%
  pull(estimate)

ref_location <- levels(reg_train$locationdesc)[1]
ref_stratification <- levels(reg_train$stratification1)[1]

location_effects <- lm_coef_tbl %>%
  filter(str_detect(term, "^locationdesc")) %>%
  transmute(
    location = str_remove(term, "^locationdesc"),
    effect_vs_reference = estimate
  ) %>%
  arrange(desc(effect_vs_reference))

stratification_effects <- lm_coef_tbl %>%
  filter(str_detect(term, "^stratification1")) %>%
  transmute(
    stratification_group = str_remove(term, "^stratification1"),
    effect_vs_reference = estimate
  ) %>%
  arrange(desc(effect_vs_reference))

location_effect_extremes <- bind_rows(
  location_effects %>%
    slice_head(n = 8) %>%
    mutate(direction = "Higher than reference"),
  location_effects %>%
    slice_tail(n = 8) %>%
    mutate(direction = "Lower than reference")
) %>%
  mutate(direction = factor(direction, levels = c("Higher than reference", "Lower than reference")))

stratification_effect_extremes <- bind_rows(
  stratification_effects %>%
    slice_head(n = 8) %>%
    mutate(direction = "Higher than reference"),
  stratification_effects %>%
    slice_tail(n = 8) %>%
    mutate(direction = "Lower than reference")
) %>%
  mutate(direction = factor(direction, levels = c("Higher than reference", "Lower than reference")))

tibble(
  coefficient = "yearstart",
  interpretation = "Estimated yearly change in adult obesity, holding location and stratification group constant",
  estimate_percentage_points = year_effect
) %>%
  kable(digits = 3, caption = "Year Effect from Linear Regression")

Year Effect from Linear Regression
coefficient	interpretation	estimate_percentage_points
yearstart	Estimated yearly change in adult obesity, holding location and stratification group constant	0.6

location_effect_extremes %>%
  kable(digits = 3, caption = paste0(
    "Largest Location Effects Relative to Reference Location: ",
    ref_location
  ))

Largest Location Effects Relative to Reference Location: Alabama
location	effect_vs_reference	direction
West Virginia	2.047	Higher than reference
Mississippi	1.920	Higher than reference
Louisiana	0.340	Higher than reference
Arkansas	-0.074	Higher than reference
Oklahoma	-0.574	Higher than reference
Kentucky	-0.816	Higher than reference
Indiana	-1.464	Higher than reference
Iowa	-1.487	Higher than reference
Vermont	-7.976	Lower than reference
Utah	-8.066	Lower than reference
California	-8.284	Lower than reference
New York	-8.443	Lower than reference
Massachusetts	-9.096	Lower than reference
Hawaii	-9.665	Lower than reference
District of Columbia	-9.794	Lower than reference
Colorado	-11.933	Lower than reference

stratification_effect_extremes %>%
  kable(digits = 3, caption = paste0(
    "Largest Stratification Group Effects Relative to Reference Group: ",
    ref_stratification
  ))

Largest Stratification Group Effects Relative to Reference Group: $15,000 - $24,999
stratification_group	effect_vs_reference	direction
Hawaiian/Pacific Islander	7.801	Higher than reference
Non-Hispanic Black	4.236	Higher than reference
American Indian/Alaska Native	2.927	Higher than reference
45 - 54	2.201	Higher than reference
Less than $15,000	1.530	Higher than reference
55 - 64	1.047	Higher than reference
Less than high school	0.373	Higher than reference
35 - 44	0.121	Higher than reference
25 - 34	-4.880	Lower than reference
Non-Hispanic White	-5.133	Lower than reference
65 or older	-6.010	Lower than reference
$75,000 or greater	-6.376	Lower than reference
Data not reported	-7.972	Lower than reference
College graduate	-9.425	Lower than reference
18 - 24	-16.029	Lower than reference
Asian	-22.070	Lower than reference

readr::write_csv(lm_coef_tbl, "project_output_regression_coefficients.csv")
readr::write_csv(location_effects, "project_output_regression_location_effects.csv")
readr::write_csv(stratification_effects, "project_output_regression_stratification_effects.csv")

location_effect_extremes %>%
  mutate(location = reorder(location, effect_vs_reference)) %>%
  ggplot(aes(x = location, y = effect_vs_reference, fill = direction)) +
  geom_col(width = 0.72) +
  coord_flip() +
  facet_wrap(~direction, scales = "free_y") +
  scale_fill_manual(values = c(
    "Higher than reference" = "#B4443F",
    "Lower than reference" = "#4E7D63"
  )) +
  scale_y_continuous(labels = scales::label_number(suffix = " pts")) +
  labs(
    title = "Largest Location Effects in the Linear Regression",
    subtitle = paste0("Effects are percentage-point differences relative to ", ref_location),
    x = NULL,
    y = "Effect versus reference"
  ) +
  theme_report(base_size = 11)

stratification_effect_extremes %>%
  mutate(stratification_group = reorder(stratification_group, effect_vs_reference)) %>%
  ggplot(aes(x = stratification_group, y = effect_vs_reference, fill = direction)) +
  geom_col(width = 0.72) +
  coord_flip() +
  facet_wrap(~direction, scales = "free_y") +
  scale_fill_manual(values = c(
    "Higher than reference" = "#B4443F",
    "Lower than reference" = "#4E7D63"
  )) +
  scale_y_continuous(labels = scales::label_number(suffix = " pts")) +
  labs(
    title = "Largest Demographic Group Effects in the Linear Regression",
    subtitle = paste0("Effects are percentage-point differences relative to ", ref_stratification),
    x = NULL,
    y = "Effect versus reference"
  ) +
  theme_report(base_size = 11)

All coefficients below are descriptive, model-adjusted differences within this additive profile model. They are not formal inferential or causal estimates, and the standard errors are not interpreted (see the design note on non-independent subgroup rows).

In this additive model, the yearstart coefficient is 0.6. This suggests an average upward time pattern of about 0.6 percentage points per year, or roughly 6 percentage points over ten years, which matches the upward trend in the EDA. The model has no location-specific time terms, so this is an average pattern across profiles. It does not mean every location or every group rises at the same rate each year.

The location effects also match the location ranking. Locations with the largest positive coefficients include West Virginia, Mississippi, Louisiana, and those with the most negative include Colorado, District of Columbia, Hawaii. These differences stay visible after adjusting for year and demographic stratification group.

The stratification effects show demographic group is also linked with obesity variation. Groups with the largest positive coefficients include Hawaiian/Pacific Islander, Non-Hispanic Black, American Indian/Alaska Native, and those with the most negative include Asian, 18 - 24, College graduate. This connects with the demographic EDA: obesity is not evenly distributed across reported profiles. These subgroup coefficients compare reported stratification labels with the reference label. Because stratification1 mixes age, gender, income, education, and race or ethnicity into one label variable, they should be read as descriptive profile differences from the reference group, not as separate effects of age, income, education, or race.

9.4 Does Physical Inactivity Add Explanatory Power? An Adjustment Ladder

The model above mainly restates, in one equation, what the EDA already showed: obesity differs by year, place, and demographic group. By itself, this does not add much new information. A more useful question for this project is whether physical inactivity explains obesity beyond those structural differences. The EDA found a moderate subgroup aligned correlation (0.457) but a much smaller within-profile correlation (0.073), which already hints that much of the obesity-inactivity link is stable between-profile structure. The regression can make this clear by adding controls one block at a time and watching the inactivity coefficient.

This uses the same subgroup aligned modelling dataset created in the EDA section, so each row has both an obesity and an inactivity value for the same demographic subgroup profile and excludes Total or Overall rows. Four models are fitted on the full sample. This is not a train-test prediction task; the goal is to watch how the inactivity coefficient changes. The ladder keeps the same outcome and adds controls step by step: M1 inactivity only; M2 adds year; M3 adds location; M4 adds demographic group.

ladder_df <- modelling_aligned_health %>%
  select(obesity_value, inactivity_value, yearstart, locationdesc, stratification1) %>%
  drop_na() %>%
  mutate(across(where(is.character), as.factor))

ladder_models <- list(
  "M1: inactivity only"                    = lm(obesity_value ~ inactivity_value, data = ladder_df),
  "M2: + year"                             = lm(obesity_value ~ inactivity_value + yearstart, data = ladder_df),
  "M3: + location"                         = lm(obesity_value ~ inactivity_value + yearstart + locationdesc, data = ladder_df),
  "M4: + location + demographic group"     = lm(obesity_value ~ inactivity_value + yearstart + locationdesc + stratification1, data = ladder_df)
)

ladder_summary <- imap_dfr(ladder_models, function(fit, label) {
  tibble(
    model = label,
    inactivity_coefficient = unname(coef(fit)["inactivity_value"]),
    adj_r2 = summary(fit)$adj.r.squared
  )
})

ladder_summary <- ladder_summary %>%
  mutate(
    coef_retained_vs_m1 = inactivity_coefficient / inactivity_coefficient[model == "M1: inactivity only"]
  )

ladder_summary %>%
  kable(digits = 3, caption = "Inactivity Coefficient and Fit as Controls Are Added (Adjustment Ladder)")

Inactivity Coefficient and Fit as Controls Are Added (Adjustment Ladder)
model	inactivity_coefficient	adj_r2	coef_retained_vs_m1
M1: inactivity only	0.401	0.209	1.000
M2: + year	0.397	0.270	0.990
M3: + location	0.357	0.384	0.892
M4: + location + demographic group	0.139	0.769	0.347

readr::write_csv(ladder_summary, "project_output_inactivity_adjustment_ladder.csv")

ladder_summary %>%
  mutate(model = factor(model, levels = ladder_summary$model)) %>%
  ggplot(aes(x = model, y = inactivity_coefficient, group = 1)) +
  geom_col(fill = bar_fill_neutral, width = 0.62) +
  geom_text(aes(label = round(inactivity_coefficient, 3)), vjust = -0.35, colour = "#243447", size = 3.6) +
  geom_hline(yintercept = 0, colour = "#5D6975", linetype = "dashed") +
  labs(
    title = "Inactivity Coefficient Shrinks as Geography and Demographics Are Controlled",
    subtitle = "Each step adds a control block; a coefficient that collapses signals between-profile confounding",
    x = NULL,
    y = "Coefficient on inactivity (pts obesity per pt inactivity)"
  ) +
  theme_report(base_size = 11) +
  theme(axis.text.x = element_text(angle = 15, hjust = 1))

The inactivity coefficient falls from 0.401 in M1 to 0.139 in M4. After location and demographic group are added, it keeps only about 35% of its unadjusted size. Most of the drop happens when demographic group enters. This suggests that the raw association is strongly linked to demographic profile differences. For example, some income, education, and race or ethnicity profiles may be high on both indicators. This is the most useful result, and one the EDA cannot give: the coefficient pattern suggests that a large part of the apparent obesity-inactivity association is tied to between-profile structure, while a modest part survives adjustment.

These numbers also fit the earlier within-profile check on a sensible gradient. M4 still leaves some association because it only adjusts for average location and group differences. The within-profile demeaned correlation (0.073) is stricter, because it removes each full profile’s mean and keeps only year-to-year movement, and it is close to zero. The more completely profile identity is absorbed, the less remains – all pointing the same way: the obesity-inactivity link is mainly stable between-profile differences, not an independent within-profile relationship. The remaining M4 coefficient should not be over-read as a clean effect.

There are three cautions. First, this is an ecological regression: obesity and inactivity are both survey-estimated outcomes for aggregated groups, so the coefficient is an association between two outcomes, not a behavioural or causal effect, and it has no direction. Second, inactivity_value is also an estimate with measurement error, so the coefficient is pulled toward zero (regression dilution); the true link could be a bit stronger than M4 shows. Third, no precision weighting is used because the aligned data does not carry sample size; inverse-variance weighting is left as future work. So the ladder answers the variable-value question descriptively: most of the obesity-inactivity association is structural, and after geography and demographics are accounted for, physical inactivity adds only modest additional descriptive explanatory power for obesity in this aggregated surveillance panel.

One more point on the two regressions. The structural model at the start of this section is not the main evidence about the obesity-inactivity relationship; it just summarises how obesity itself varies by year, location, and demographic group. The adjustment ladder is the main regression evidence on variable value, and its message is that inactivity is mostly a marker of between-profile structure, not an independent explainer.

9.5 Interpretation of Model Results

The regression shows that adult obesity is structured by year, location, and demographic group. Its performance should be read as profile-level explanation within the observed surveillance panel. Because locationdesc and stratification1 are predictors, the model is mainly learning differences among already-observed locations and groups. This fits the project goal of profiling existing records, but it should not be described as strong future forecasting ability for new locations or new demographic categories.

The practical interpretation is:

Year matters: obesity values increase over the study period, consistent with the EDA trend.
Location matters: geographic differences remain large, consistent with the location ranking.
Demographic stratification matters: profile-level differences by age, income, education, and race or ethnicity help explain observed obesity variation.
Most of the inactivity link is structural: the coefficient pattern suggests that much of the obesity-inactivity association reflects stable between-profile differences. Only a modest and hard-to-interpret residual association remains.
Model performance is explanatory, not causal: the regression identifies structured differences in observed public health profiles, but it does not estimate causal effects.

Overall, the regression has two roles. The structural model confirms the EDA findings and puts them in one additive equation: obesity varies by time, place, and demographic group. The adjustment ladder then goes one step further, showing that most of the obesity-inactivity association is carried by that same structure, with only a modest residual after adjustment. Together they set up the classification stage, where inactivity is still useful as one profile feature for ranking and screening, not as a clean causal driver.

Future work on inference. This report treats the coefficients as descriptive and does not interpret p-values. A more formal analysis could use cluster-robust standard errors clustered by locationdesc, or a linear mixed-effects model. These methods would give more careful uncertainty estimates. However, they would still need caution because demographic rows can be different slices of the same survey sample.

10 Classification Model: High-Obesity Subgroup Profile Screening

10.1 Classification Objective

The classification section addresses the fourth research question:

Can high-obesity demographic subgroup public health profiles be screened using physical inactivity levels together with year, location, and demographic characteristics?

This is a different task from the regression model. The regression model explains the continuous adult obesity percentage across public health profiles, while the classification model converts the same profile-level information into a screening task: identifying which profiles are more likely to fall into the high-obesity group.

To keep the modelling level consistent with the revised EDA and regression sections, the classification analysis uses the same subgroup aligned modelling dataset created after the obesity-inactivity alignment. Total or Overall rows are excluded because they are aggregate summaries rather than independent subgroup observations. This avoids mixing statewide total records with demographic subgroup records in the same screening model.

The target is defined as:

high_obesity = 1 if obesity_value >= 75th percentile of TRAINING obesity values
high_obesity = 0 otherwise

The 75th percentile threshold is calculated from the training split only. This avoids test-set leakage and keeps the classification task honest. The threshold is not a medical standard; it is a relative rule for public health screening and prioritisation.

10.2 Train-Test Split and Classification Models

classification_health <- modelling_aligned_health

classification_predictors <- c(
  "inactivity_value",
  "yearstart",
  "locationdesc",
  "stratification1"
)

classification_predictors <- classification_predictors[
  classification_predictors %in% names(classification_health)
]

classification_df <- classification_health %>%
  select(obesity_value, all_of(classification_predictors)) %>%
  drop_na() %>%
  mutate(across(where(is.character), as.factor))

set.seed(7004)
classification_train_index <- sample.int(
  nrow(classification_df),
  floor(0.8 * nrow(classification_df))
)

cls_train <- classification_df[classification_train_index, ]
cls_test <- classification_df[-classification_train_index, ]

cls_guard <- drop_unseen_levels(cls_train, cls_test)
cls_test <- cls_guard$test
cls_train <- droplevels(cls_train)

obesity_threshold <- unname(quantile(cls_train$obesity_value, 0.75, na.rm = TRUE))

cls_train <- cls_train %>%
  mutate(high_obesity = if_else(obesity_value >= obesity_threshold, 1, 0))

cls_test <- cls_test %>%
  mutate(high_obesity = if_else(obesity_value >= obesity_threshold, 1, 0))

cls_formula <- as.formula(
  paste("high_obesity ~", paste(classification_predictors, collapse = " + "))
)

cls_glm <- glm(cls_formula, data = cls_train, family = binomial())

cls_tree <- rpart(
  as.formula(paste(
    "factor(high_obesity) ~",
    paste(classification_predictors, collapse = " + ")
  )),
  data = cls_train,
  method = "class"
)

cls_prob_glm <- predict(cls_glm, newdata = cls_test, type = "response")
cls_prob_tree <- predict(cls_tree, newdata = cls_test, type = "prob")[, "1"]

classification_design_summary <- tibble(
  target = "High-obesity subgroup public health profile",
  analysis_scope = "Demographic subgroup profiles only; Total/Overall rows excluded",
  threshold_source = "75th percentile of training-set obesity values",
  obesity_threshold = obesity_threshold,
  predictors = paste(classification_predictors, collapse = " + "),
  original_aligned_profiles = nrow(aligned_health),
  subgroup_profiles_used = nrow(classification_health),
  excluded_total_or_overall_profiles = nrow(aligned_health) - nrow(classification_health),
  modelling_rows = nrow(classification_df),
  train_rows = nrow(cls_train),
  test_rows_after_guard = nrow(cls_test),
  test_rows_dropped_unseen_levels = cls_guard$dropped,
  positive_rate_test = mean(cls_test$high_obesity == 1)
)

classification_design_summary %>%
  kable(digits = 4, caption = "Classification Modelling Design Summary")

Classification Modelling Design Summary
target	analysis_scope	threshold_source	obesity_threshold	predictors	original_aligned_profiles	subgroup_profiles_used	excluded_total_or_overall_profiles	modelling_rows	train_rows	test_rows_after_guard	test_rows_dropped_unseen_levels	positive_rate_test
High-obesity subgroup public health profile	Demographic subgroup profiles only; Total/Overall rows excluded	75th percentile of training-set obesity values	35.2	inactivity_value + yearstart + locationdesc + stratification1	14922	14334	588	14334	11467	2867	0	0.2483

readr::write_csv(classification_design_summary, "project_output_classification_design_summary.csv")

Logistic regression is used as the main classification model because it is interpretable and suitable for a binary screening task. A classification tree is included as a benchmark model. The purpose is not to diagnose individuals or predict future obesity status, but to rank observed subgroup profiles for public health attention.

10.3 Classification Performance Metrics

For public health prioritisation, the model should not only classify profiles at one threshold. It should also rank profiles so that the highest-risk subgroup profiles can be reviewed first. Therefore, the evaluation reports both threshold-based metrics and ranking-based metrics.

calc_auc_pr <- function(actual, prob, positive = 1) {
  y <- if_else(actual == positive, 1, 0)
  positive_n <- sum(y == 1)

  if (positive_n == 0) {
    return(NA_real_)
  }

  ordered_index <- order(prob, decreasing = TRUE)
  y_sorted <- y[ordered_index]

  tp <- cumsum(y_sorted == 1)
  fp <- cumsum(y_sorted == 0)

  recall <- tp / positive_n
  precision <- tp / pmax(tp + fp, 1)

  recall <- c(0, recall)
  precision <- c(1, precision)

  sum(diff(recall) * (precision[-1] + precision[-length(precision)]) / 2)
}

calc_auc_roc <- function(actual, prob, positive = 1) {
  y <- if_else(actual == positive, 1, 0)
  positive_n <- sum(y == 1)
  negative_n <- sum(y == 0)

  if (positive_n == 0 || negative_n == 0) {
    return(NA_real_)
  }

  rank_sum_positive <- sum(rank(prob)[y == 1])
  (rank_sum_positive - positive_n * (positive_n + 1) / 2) /
    (positive_n * negative_n)
}

top_fraction_capture <- function(actual, prob, positive = 1, fraction = 0.20) {
  ranking_df <- tibble(actual = actual, prob = prob) %>%
    arrange(desc(prob))

  n_top <- max(1, floor(nrow(ranking_df) * fraction))
  top_df <- ranking_df %>% slice_head(n = n_top)

  overall_rate <- mean(ranking_df$actual == positive)
  top_rate <- mean(top_df$actual == positive)

  tibble(
    top_fraction = fraction,
    overall_rate = overall_rate,
    top_rate = top_rate,
    lift = top_rate / overall_rate,
    capture = sum(top_df$actual == positive) / sum(ranking_df$actual == positive)
  )
}

classification_metric_row <- function(model_name, actual, prob, threshold = 0.5) {
  predicted <- if_else(prob >= threshold, 1, 0)

  tp <- sum(predicted == 1 & actual == 1)
  fp <- sum(predicted == 1 & actual == 0)
  tn <- sum(predicted == 0 & actual == 0)
  fn <- sum(predicted == 0 & actual == 1)

  precision <- tp / max(tp + fp, 1)
  recall <- tp / max(tp + fn, 1)
  specificity <- tn / max(tn + fp, 1)
  accuracy <- (tp + tn) / max(tp + tn + fp + fn, 1)
  f1 <- if ((precision + recall) > 0) {
    2 * precision * recall / (precision + recall)
  } else {
    0
  }

  top20 <- top_fraction_capture(actual, prob, fraction = 0.20)
  auc_pr <- calc_auc_pr(actual, prob)
  auc_roc <- calc_auc_roc(actual, prob)

  tibble(
    model = model_name,
    target = "High-obesity subgroup profile",
    obesity_threshold_q75 = obesity_threshold,
    n_train = nrow(cls_train),
    n_test = length(actual),
    positive_rate = mean(actual == 1),
    roc_auc = auc_roc,
    auc_pr = auc_pr,
    auc_pr_lift = auc_pr / mean(actual == 1),
    top20_lift = top20$lift,
    top20_capture = top20$capture,
    accuracy_threshold_0_5 = accuracy,
    precision_threshold_0_5 = precision,
    recall_threshold_0_5 = recall,
    specificity_threshold_0_5 = specificity,
    f1_threshold_0_5 = f1
  )
}

classification_metrics <- bind_rows(
  classification_metric_row("Logistic regression", cls_test$high_obesity, cls_prob_glm),
  classification_metric_row("Classification tree", cls_test$high_obesity, cls_prob_tree)
)

classification_metrics %>%
  kable(digits = 4, caption = "Classification Model Performance")

Classification Model Performance
model	target	obesity_threshold_q75	n_train	n_test	positive_rate	roc_auc	auc_pr	auc_pr_lift	top20_lift	top20_capture	accuracy_threshold_0_5	precision_threshold_0_5	recall_threshold_0_5	specificity_threshold_0_5	f1_threshold_0_5
Logistic regression	High-obesity subgroup profile	35.2	11467	2867	0.2483	0.9281	0.8152	3.2824	3.2818	0.6559	0.8804	0.7896	0.7065	0.9378	0.7457
Classification tree	High-obesity subgroup profile	35.2	11467	2867	0.2483	0.8478	0.6792	2.7349	3.0358	0.6067	0.8556	0.7435	0.6390	0.9271	0.6873

readr::write_csv(classification_metrics, "project_output_classification_metrics.csv")

classification_metrics %>%
  select(model, roc_auc, auc_pr, top20_capture, f1_threshold_0_5) %>%
  pivot_longer(
    cols = c(roc_auc, auc_pr, top20_capture, f1_threshold_0_5),
    names_to = "metric",
    values_to = "value"
  ) %>%
  mutate(
    metric = recode(
      metric,
      roc_auc = "ROC-AUC",
      auc_pr = "AUC-PR",
      top20_capture = "Top-20% capture",
      f1_threshold_0_5 = "F1 at 0.5"
    )
  ) %>%
  ggplot(aes(x = model, y = value, fill = model)) +
  geom_col(width = 0.62) +
  facet_wrap(~metric, scales = "free_y") +
  scale_fill_manual(values = c(
    "Logistic regression" = "#B4443F",
    "Classification tree" = "#496A81"
  )) +
  labs(
    title = "Classification Model Performance Comparison",
    subtitle = "Ranking metrics are especially important for public health prioritisation",
    x = NULL,
    y = NULL
  ) +
  theme_report(base_size = 11) +
  theme(axis.text.x = element_text(angle = 15, hjust = 1))

The logistic regression is carried forward as the main classification model because it provides strong ranking performance while remaining interpretable. Its AUC-PR is 0.815, which is 3.28 times the test-set positive rate. The top 20% highest-ranked subgroup profiles capture 65.6% of actual high-obesity subgroup profiles, which is useful when public health attention must be prioritised.

10.4 Confusion Matrix at 0.5 Reference Threshold

The confusion matrix uses 0.5 as a simple reference threshold for interpretation. This is not treated as the final public health decision rule, because prioritisation is better supported by ranking metrics such as AUC-PR, top-20% capture, and risk decile validation.

cls_pred_glm <- if_else(cls_prob_glm >= 0.5, 1, 0)

confusion_table <- table(
  Predicted = factor(
    cls_pred_glm,
    levels = c(0, 1),
    labels = c("Lower risk", "High risk")
  ),
  Actual = factor(
    cls_test$high_obesity,
    levels = c(0, 1),
    labels = c("Not high obesity", "High obesity")
  )
)

tn_cm <- confusion_table["Lower risk", "Not high obesity"]
fp_cm <- confusion_table["High risk", "Not high obesity"]
fn_cm <- confusion_table["Lower risk", "High obesity"]
tp_cm <- confusion_table["High risk", "High obesity"]

confusion_summary <- tibble(
  metric = c(
    "Sensitivity / Recall",
    "Specificity",
    "Positive Predictive Value / Precision",
    "Negative Predictive Value",
    "Accuracy"
  ),
  value = c(
    tp_cm / max(tp_cm + fn_cm, 1),
    tn_cm / max(tn_cm + fp_cm, 1),
    tp_cm / max(tp_cm + fp_cm, 1),
    tn_cm / max(tn_cm + fn_cm, 1),
    (tp_cm + tn_cm) / max(tp_cm + tn_cm + fp_cm + fn_cm, 1)
  )
)

confusion_table %>%
  kable(caption = "Confusion Matrix for Logistic Regression at 0.5 Threshold")

Confusion Matrix for Logistic Regression at 0.5 Threshold
	Not high obesity	High obesity
Lower risk	2021	209
High risk	134	503

confusion_summary %>%
  kable(digits = 4, caption = "Confusion Matrix Summary")

Confusion Matrix Summary
metric	value
Sensitivity / Recall	0.7065
Specificity	0.9378
Positive Predictive Value / Precision	0.7896
Negative Predictive Value	0.9063
Accuracy	0.8804

readr::write_csv(
  as.data.frame.matrix(confusion_table) %>% rownames_to_column("Predicted"),
  "project_output_confusion_matrix.csv"
)
readr::write_csv(confusion_summary, "project_output_confusion_summary.csv")

classification_predictions <- cls_test %>%
  mutate(
    high_obesity_score = cls_prob_glm,
    predicted_high_obesity = cls_pred_glm
  )

classification_predictions %>%
  ggplot(aes(x = high_obesity_score, fill = factor(high_obesity))) +
  geom_histogram(bins = 40, alpha = 0.65, position = "identity") +
  scale_fill_manual(values = c("0" = "#496A81", "1" = "#B4443F")) +
  labs(
    title = "Predicted High-Obesity Score Distribution",
    subtitle = "Higher scores indicate higher predicted probability of belonging to the high-obesity group",
    x = "Predicted high-obesity score",
    y = "Profiles",
    fill = "Actual high obesity"
  ) +
  theme_report()

readr::write_csv(classification_predictions, "project_output_classification_predictions.csv")

At the 0.5 reference threshold, the logistic regression achieves an accuracy of 0.88 and a recall of 0.706. For public health screening, recall and ranking performance are especially important because the goal is to avoid missing many high-obesity subgroup profiles when prioritising follow-up attention.

11 Risk Ranking Validation

11.1 Risk Decile Validation

The classification model is most useful if its risk scores rank subgroup profiles meaningfully. To validate the ranking, test profiles are divided into ten deciles based on predicted high-obesity score. Decile 10 contains the highest predicted-risk subgroup profiles.

risk_decile_summary <- classification_predictions %>%
  mutate(
    risk_decile = ntile(high_obesity_score, 10)
  ) %>%
  group_by(risk_decile) %>%
  summarise(
    profiles = n(),
    average_score = mean(high_obesity_score, na.rm = TRUE),
    observed_high_obesity_rate = mean(high_obesity == 1),
    average_obesity_value = mean(obesity_value, na.rm = TRUE),
    average_inactivity_value = mean(inactivity_value, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(risk_decile)

top_decile <- risk_decile_summary %>%
  filter(risk_decile == max(risk_decile))

bottom_decile <- risk_decile_summary %>%
  filter(risk_decile == min(risk_decile))

top_risk_profiles <- classification_predictions %>%
  arrange(desc(high_obesity_score)) %>%
  slice_head(n = 10) %>%
  transmute(
    Year = as.integer(yearstart),
    Location = locationdesc,
    Group = stratification1,
    Inactivity = round(inactivity_value, 1),
    Obesity = round(obesity_value, 1),
    Score = round(high_obesity_score, 3),
    ActualHighObesity = high_obesity
  )

risk_decile_summary %>%
  kable(digits = 4, caption = "Risk Decile Validation for Logistic Regression")

Risk Decile Validation for Logistic Regression
risk_decile	profiles	average_score	observed_high_obesity_rate	average_obesity_value	average_inactivity_value
1	287	0.0000	0.0070	17.7571	19.2976
2	287	0.0018	0.0035	24.3613	21.9387
3	287	0.0086	0.0035	26.9509	22.6714
4	287	0.0255	0.0418	29.1000	23.6540
5	287	0.0624	0.0592	30.4547	24.9732
6	287	0.1322	0.0976	31.6300	26.9049
7	287	0.2498	0.2195	33.0031	27.4808
8	286	0.4361	0.4266	34.7374	28.7986
9	286	0.6780	0.7308	36.9612	30.4559
10	286	0.9104	0.8986	40.4563	33.3958

top_risk_profiles %>%
  kable(caption = "Top 10 Highest-Risk Subgroup Profiles by Predicted Score")

Top 10 Highest-Risk Subgroup Profiles by Predicted Score
Year	Location	Group	Inactivity	Obesity	Score	ActualHighObesity
2018	Mississippi	Non-Hispanic Black	34.6	45.7	0.998	1
2017	Mississippi	Non-Hispanic Black	36.2	45.9	0.997	1
2019	Mississippi	Less than $15,000	54.3	42.8	0.997	1
2021	Kentucky	Non-Hispanic Black	37.4	46.8	0.996	1
2021	Mississippi	45 - 54	30.0	51.6	0.996	1
2019	Mississippi	45 - 54	40.0	50.1	0.996	1
2020	Kentucky	Non-Hispanic Black	39.3	45.4	0.996	1
2020	Mississippi	55 - 64	37.5	43.9	0.995	1
2019	West Virginia	45 - 54	32.5	50.2	0.994	1
2017	Louisiana	Non-Hispanic Black	36.9	42.2	0.993	1

readr::write_csv(risk_decile_summary, "project_output_high_obesity_risk_decile.csv")
readr::write_csv(top_risk_profiles, "project_output_top_risk_profiles.csv")

risk_decile_summary %>%
  ggplot(aes(x = factor(risk_decile), y = observed_high_obesity_rate)) +
  geom_col(fill = "#B4443F", width = 0.68) +
  geom_line(aes(group = 1), colour = "#243447", linewidth = 0.8) +
  geom_point(colour = "#243447", size = 2) +
  scale_y_continuous(labels = scales::label_percent(accuracy = 1)) +
  labs(
    title = "Observed High-Obesity Rate by Predicted Risk Decile",
    subtitle = "A useful screening model should show higher observed risk in higher score deciles",
    x = "Predicted risk decile",
    y = "Observed high-obesity rate"
  ) +
  theme_report()

The risk decile validation shows a clear ranking pattern. The lowest decile has an observed high-obesity rate of 0.7%, while the highest decile has an observed high-obesity rate of 89.9%. In decision terms, if a public health team works down the ranked list, it reaches genuinely high-risk subgroup profiles much earlier than random selection. This means the model is not only classifying profiles at a fixed threshold; it is also ordering subgroup profiles in a way that is meaningful for prioritisation within the observed surveillance panel.

12 Public Health Intervention Matrix

12.1 Translating Indicators into Intervention Segments

The classification model provides a screening score, but public health planning also needs interpretable action groups. The intervention matrix translates the two aligned indicators into four public health segments using the 75th percentile cut-offs for obesity and physical inactivity. It uses the same subgroup-level base as the classification model, so the intervention groups describe demographic subgroup profiles rather than a mixture of subgroup records and aggregate summaries.

The four segments are:

Segment	Profile Pattern	Suggested Public Health Focus
Priority Intervention	High obesity and high inactivity	Combined obesity prevention and physical activity promotion
Weight Management Focus	High obesity and lower inactivity	Nutrition, weight management, and obesity prevention support
Preventive Physical Activity Focus	Lower obesity and high inactivity	Early physical activity promotion and prevention
Standard Monitoring	Lower obesity and lower inactivity	Routine surveillance and monitoring

priority_base <- classification_health

obesity_q75 <- unname(quantile(priority_base$obesity_value, 0.75, na.rm = TRUE))
inactivity_q75 <- unname(quantile(priority_base$inactivity_value, 0.75, na.rm = TRUE))

action_matrix_data <- priority_base %>%
  mutate(
    obesity_level = if_else(obesity_value >= obesity_q75, "High obesity", "Lower obesity"),
    inactivity_level = if_else(inactivity_value >= inactivity_q75, "High inactivity", "Lower inactivity"),
    public_health_segment = case_when(
      obesity_level == "High obesity" & inactivity_level == "High inactivity" ~
        "Priority Intervention",
      obesity_level == "High obesity" & inactivity_level == "Lower inactivity" ~
        "Weight Management Focus",
      obesity_level == "Lower obesity" & inactivity_level == "High inactivity" ~
        "Preventive Physical Activity Focus",
      TRUE ~ "Standard Monitoring"
    ),
    public_health_segment = factor(
      public_health_segment,
      levels = c(
        "Priority Intervention",
        "Weight Management Focus",
        "Preventive Physical Activity Focus",
        "Standard Monitoring"
      )
    )
  )

action_matrix_summary <- action_matrix_data %>%
  group_by(public_health_segment) %>%
  summarise(
    profiles = n(),
    share = profiles / nrow(action_matrix_data),
    average_obesity = mean(obesity_value, na.rm = TRUE),
    average_inactivity = mean(inactivity_value, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(public_health_segment)

action_matrix_summary %>%
  kable(digits = 4, caption = "Public Health Intervention Matrix Summary")

Public Health Intervention Matrix Summary
public_health_segment	profiles	share	average_obesity	average_inactivity
Priority Intervention	1625	0.1134	39.3788	37.8977
Weight Management Focus	1970	0.1374	38.9139	25.1503
Preventive Physical Activity Focus	1960	0.1367	30.4933	36.5277
Standard Monitoring	8779	0.6125	26.9292	21.5793

readr::write_csv(action_matrix_data, "project_output_public_health_action_matrix_data.csv")
readr::write_csv(action_matrix_summary, "project_output_public_health_action_matrix_summary.csv")

action_matrix_data %>%
  ggplot(aes(x = inactivity_value, y = obesity_value, colour = public_health_segment)) +
  geom_point(alpha = 0.35, size = 1.5) +
  geom_vline(xintercept = inactivity_q75, linetype = "dashed", colour = "#5D6975") +
  geom_hline(yintercept = obesity_q75, linetype = "dashed", colour = "#5D6975") +
  scale_colour_manual(values = c(
    "Priority Intervention" = "#B4443F",
    "Weight Management Focus" = "#D18F3F",
    "Preventive Physical Activity Focus" = "#2F6F8F",
    "Standard Monitoring" = "#7A8793"
  )) +
  scale_x_continuous(labels = scales::label_number(suffix = "%")) +
  scale_y_continuous(labels = scales::label_number(suffix = "%")) +
  labs(
    title = "Public Health Intervention Matrix",
    subtitle = "Segments are based on 75th percentile cut-offs for subgroup aligned obesity and inactivity values",
    x = "No leisure-time physical activity",
    y = "Adult obesity",
    colour = "Segment"
  ) +
  theme_report()

The Priority Intervention segment contains subgroup profiles where both obesity and inactivity are relatively high. This segment represents 11.3% of subgroup aligned profiles, with an average obesity value of 39.38% and an average inactivity value of 37.9%. These profiles are the most natural starting point for combined obesity prevention and physical activity promotion, but the segment should be read as decision support rather than an automatic policy list.

13 Threshold Sensitivity and Robust Core

13.1 Sensitivity to Q70, Q75, and Q80 Cut-Offs

The 75th percentile cut-off is a design choice, not a clinical threshold. To check whether the intervention priority result depends too heavily on one cut-off, the matrix is recalculated using 70th, 75th, and 80th percentile thresholds on the same subgroup-level aligned data.

priority_flag_at <- function(df, q) {
  obesity_cutoff <- unname(quantile(df$obesity_value, q, na.rm = TRUE))
  inactivity_cutoff <- unname(quantile(df$inactivity_value, q, na.rm = TRUE))

  df$obesity_value >= obesity_cutoff &
    df$inactivity_value >= inactivity_cutoff
}

priority_base <- classification_health
sensitivity_quantiles <- c(0.70, 0.75, 0.80)

sensitivity_summary <- map_dfr(sensitivity_quantiles, function(q) {
  priority_flag <- priority_flag_at(priority_base, q)

  tibble(
    threshold = paste0("Q", q * 100),
    obesity_cutoff = unname(quantile(priority_base$obesity_value, q, na.rm = TRUE)),
    inactivity_cutoff = unname(quantile(priority_base$inactivity_value, q, na.rm = TRUE)),
    priority_profiles = sum(priority_flag),
    priority_share = mean(priority_flag),
    average_obesity_priority = mean(priority_base$obesity_value[priority_flag], na.rm = TRUE),
    average_inactivity_priority = mean(priority_base$inactivity_value[priority_flag], na.rm = TRUE)
  )
})

robust_priority_data <- priority_base %>%
  mutate(
    priority_q70 = priority_flag_at(priority_base, 0.70),
    priority_q75 = priority_flag_at(priority_base, 0.75),
    priority_q80 = priority_flag_at(priority_base, 0.80),
    robust_priority = priority_q70 & priority_q75 & priority_q80
  )

robust_priority_profiles <- robust_priority_data %>%
  filter(robust_priority) %>%
  arrange(desc(obesity_value + inactivity_value))

robust_core_top10 <- robust_priority_profiles %>%
  slice_head(n = 10) %>%
  transmute(
    Year = as.integer(yearstart),
    Location = locationdesc,
    Group = stratification1,
    Obesity = round(obesity_value, 1),
    Inactivity = round(inactivity_value, 1),
    CombinedBurden = round(obesity_value + inactivity_value, 1)
  )

robust_priority_summary <- tibble(
  n_robust_priority_profiles = nrow(robust_priority_profiles),
  robust_priority_share = nrow(robust_priority_profiles) / nrow(priority_base),
  average_obesity = mean(robust_priority_profiles$obesity_value, na.rm = TRUE),
  average_inactivity = mean(robust_priority_profiles$inactivity_value, na.rm = TRUE)
)

priority_q70_n <- sensitivity_summary$priority_profiles[sensitivity_summary$threshold == "Q70"]
priority_q75_n <- sensitivity_summary$priority_profiles[sensitivity_summary$threshold == "Q75"]
priority_q80_n <- sensitivity_summary$priority_profiles[sensitivity_summary$threshold == "Q80"]
robust_priority_n <- nrow(robust_priority_profiles)
robust_priority_share <- robust_priority_n / nrow(priority_base)

sensitivity_summary %>%
  kable(digits = 4, caption = "Threshold Sensitivity for Priority Intervention Profiles")

Threshold Sensitivity for Priority Intervention Profiles
threshold	obesity_cutoff	inactivity_cutoff	priority_profiles	priority_share	average_obesity_priority	average_inactivity_priority
Q70	34.2	29.9	2178	0.1519	38.6547	36.8449
Q75	35.2	31.3	1625	0.1134	39.3788	37.8977
Q80	36.2	32.8	1137	0.0793	40.1659	38.9839

robust_priority_summary %>%
  kable(digits = 4, caption = "Robust Priority Core Summary")

Robust Priority Core Summary
n_robust_priority_profiles	robust_priority_share	average_obesity	average_inactivity
1137	0.0793	40.1659	38.9839

robust_priority_profiles %>%
  slice_head(n = 20) %>%
  kable(digits = 3, caption = "Top 20 Robust Priority Profiles")

Top 20 Robust Priority Profiles
yearstart	locationdesc	stratificationcategory1	stratification1	obesity_value	inactivity_value	priority_q70	priority_q75	priority_q80	robust_priority
2019	North Dakota	Race/Ethnicity	American Indian/Alaska Native	57.8	49.5	TRUE	TRUE	TRUE	TRUE
2017	Puerto Rico	Education	Less than high school	39.0	66.8	TRUE	TRUE	TRUE	TRUE
2015	Ohio	Race/Ethnicity	American Indian/Alaska Native	51.6	51.4	TRUE	TRUE	TRUE	TRUE
2017	Puerto Rico	Age (years)	45 - 54	41.8	59.6	TRUE	TRUE	TRUE	TRUE
2021	Puerto Rico	Education	Less than high school	39.0	61.4	TRUE	TRUE	TRUE	TRUE
2013	Tennessee	Education	Less than high school	41.3	58.9	TRUE	TRUE	TRUE	TRUE
2017	Puerto Rico	Age (years)	55 - 64	40.8	58.8	TRUE	TRUE	TRUE	TRUE
2019	Mississippi	Income	Less than $15,000	42.8	54.3	TRUE	TRUE	TRUE	TRUE
2016	New Jersey	Education	Less than high school	41.5	54.5	TRUE	TRUE	TRUE	TRUE
2017	Puerto Rico	Income	Less than $15,000	37.5	57.9	TRUE	TRUE	TRUE	TRUE
2019	Mississippi	Education	Less than high school	43.1	52.3	TRUE	TRUE	TRUE	TRUE
2017	Arkansas	Race/Ethnicity	American Indian/Alaska Native	50.3	44.8	TRUE	TRUE	TRUE	TRUE
2019	Kentucky	Education	Less than high school	41.2	53.9	TRUE	TRUE	TRUE	TRUE
2018	Kentucky	Income	Less than $15,000	39.8	55.2	TRUE	TRUE	TRUE	TRUE
2019	Puerto Rico	Age (years)	55 - 64	38.8	56.2	TRUE	TRUE	TRUE	TRUE
2018	Arkansas	Income	Less than $15,000	50.5	44.4	TRUE	TRUE	TRUE	TRUE
2012	Arkansas	Race/Ethnicity	American Indian/Alaska Native	46.0	48.8	TRUE	TRUE	TRUE	TRUE
2021	Puerto Rico	Income	Less than $15,000	39.4	55.4	TRUE	TRUE	TRUE	TRUE
2018	Mississippi	Income	Less than $15,000	46.1	48.4	TRUE	TRUE	TRUE	TRUE
2020	Puerto Rico	Age (years)	55 - 64	41.8	52.6	TRUE	TRUE	TRUE	TRUE

robust_core_top10 %>%
  kable(caption = "Top 10 Robust-Core Priority Profiles by Combined Burden")

Top 10 Robust-Core Priority Profiles by Combined Burden
Year	Location	Group	Obesity	Inactivity	CombinedBurden
2019	North Dakota	American Indian/Alaska Native	57.8	49.5	107.3
2017	Puerto Rico	Less than high school	39.0	66.8	105.8
2015	Ohio	American Indian/Alaska Native	51.6	51.4	103.0
2017	Puerto Rico	45 - 54	41.8	59.6	101.4
2021	Puerto Rico	Less than high school	39.0	61.4	100.4
2013	Tennessee	Less than high school	41.3	58.9	100.2
2017	Puerto Rico	55 - 64	40.8	58.8	99.6
2019	Mississippi	Less than $15,000	42.8	54.3	97.1
2016	New Jersey	Less than high school	41.5	54.5	96.0
2017	Puerto Rico	Less than $15,000	37.5	57.9	95.4

readr::write_csv(sensitivity_summary, "project_output_threshold_sensitivity.csv")
readr::write_csv(robust_priority_data, "project_output_robust_priority_data.csv")
readr::write_csv(robust_priority_profiles, "project_output_robust_priority_profiles.csv")
readr::write_csv(robust_core_top10, "project_output_robust_core_top10_profiles.csv")
readr::write_csv(robust_priority_summary, "project_output_robust_priority_summary.csv")

sensitivity_summary %>%
  ggplot(aes(x = threshold, y = priority_profiles, fill = threshold)) +
  geom_col(width = 0.62) +
  geom_text(
    aes(label = priority_profiles),
    vjust = -0.35,
    colour = "#243447",
    size = 3.8
  ) +
  scale_fill_manual(values = c(
    "Q70" = "#D18F3F",
    "Q75" = "#B4443F",
    "Q80" = "#7D2E2A"
  )) +
  labs(
    title = "Priority Intervention Profiles Under Alternative Thresholds",
    subtitle = "Higher thresholds define a smaller but stricter intervention core",
    x = "Quantile threshold",
    y = "Priority profiles"
  ) +
  theme_report()

The number of priority profiles changes from 2178 at Q70 to 1625 at Q75 and 1137 at Q80. Because both obesity and inactivity cut-offs rise together under the same quantile rule, the three priority sets are nested: every Q80 priority profile is also a priority profile at Q75 and Q70. The robust core contains 1137 subgroup profiles, or 7.93% of subgroup aligned profiles. These profiles remain high-priority under all three reasonable cut-off choices, so they form a defensible planning starting point rather than a final automatic policy decision.

14 Dashboard / App Design

14.1 Purpose of the Dashboard

A Shiny dashboard, app.R, accompanies this report. The dashboard is designed as an interactive public health profile explorer. Its purpose is to help non-technical users explore the main findings of the project without reading every table or code block in the full report.

The dashboard does not provide individual-level medical diagnosis, personal obesity prediction, or causal recommendations. It works with the same aggregated surveillance logic used throughout the report: each record represents a public health profile defined by year, location, indicator, and demographic stratification group.

The dashboard supports three practical user questions:

How do adult obesity and physical inactivity values change across years?
Which locations and demographic groups show higher average values?
Which aligned subgroup profiles fall into higher-priority intervention segments?

14.2 App Structure

The app is organised into three pages. Each page corresponds to a major analytical stage in the report.

Dashboard Page	Main Purpose	Related Report Section
Overview and Trend	Explore yearly trends by indicator, year range, location, and stratification category	Exploratory Data Analysis
Location and Demographic Explorer	Identify locations and demographic groups with higher average obesity or inactivity values	Location and Demographic Comparison
Priority Matrix Explorer	Classify aligned subgroup profiles into intervention segments under Q70, Q75, or Q80 thresholds	Public Health Intervention Matrix and Threshold Sensitivity

This structure makes the dashboard a practical extension of the analysis. The first page explains the when, the second explains the where and who, and the third translates the results into public health action groups.

14.3 Page 1: Overview and Trend

The Overview and Trend page allows users to select an indicator, year range, location, and stratification category. It then displays a yearly trend plot and a small summary table containing the number of records, number of years, number of locations, average value, minimum value, and maximum value.

This page is useful for quickly checking whether a selected indicator is rising, falling, or fluctuating across time. For example, users can compare the adult obesity trend with the no leisure-time physical activity trend, or focus on a specific location or stratification category.

14.4 Page 2: Location and Demographic Explorer

The Location and Demographic Explorer page helps users identify where higher values are concentrated. Users select an indicator and year option, and the app displays:

The top 10 locations by average indicator value.
The top 10 stratification groups by average indicator value.

This page supports the location and demographic comparison logic from the EDA. It helps users move beyond national averages and see which public health profiles deserve closer attention.

14.5 Page 3: Priority Matrix Explorer

The Priority Matrix Explorer page is the most decision-oriented part of the app. It uses aligned subgroup obesity-inactivity profiles and classifies them into four intervention segments:

Segment	Interpretation
Priority Intervention	High obesity and high physical inactivity
Weight Management Focus	High obesity and lower physical inactivity
Preventive Physical Activity Focus	Lower obesity and high physical inactivity
Standard Monitoring	Lower obesity and lower physical inactivity

Users can select Q70, Q75, or Q80 as the high-risk quantile threshold. The app then shows the number of subgroup profiles in each segment, segment averages, and a filtered table of the highest combined-burden subgroup profiles. This directly connects the dashboard to the threshold sensitivity analysis in the report.

14.6 Design Rationale

The app design follows the same principles as the analysis:

Indicator-specific interpretation: obesity and physical inactivity values are not mixed unless the records are aligned.
Profile-level framing: results are shown as subgroup-level public health profiles, not individual risks.
Interpretable controls: users filter by year, location, indicator, and stratification category rather than by technical model parameters.
Decision support: the priority matrix converts analytical results into clear public health segments.
Threshold transparency: Q70, Q75, and Q80 options show how intervention lists expand or contract under different cut-off choices.

14.7 How to Run the App

To run the dashboard:

Place Nutrition__Physical_Activity__and_Obesity.csv in the same folder as app.R.
Open app.R in RStudio.
Click Run App.

The dashboard requires the following R packages:

shiny, dplyr, tidyr, ggplot2, readr, stringr

14.8 Dashboard Limitations

The dashboard is intended for exploration and communication, not for clinical or causal decision-making. Several limitations should be kept in mind:

It uses aggregated public health surveillance data, not individual-level records.
Segment labels are based on relative quantile thresholds, not medical thresholds.
The priority matrix highlights subgroup profiles for public health attention but does not prove that physical inactivity causes obesity.
The dashboard separates broad trend exploration from subgroup-level priority analysis; users should not compare total-level and subgroup-level outputs as if they had the same unit of analysis.
Filtered dashboard views may contain fewer records, so small filtered groups should be interpreted carefully.
The app is designed as a compact project dashboard rather than a full production surveillance system.

Overall, the dashboard strengthens the project by turning the statistical analysis into an interactive tool. It allows users to explore trends, compare locations and demographic groups, and inspect intervention priority profiles in a way that is more accessible than static tables alone.

15 Discussion / Limitations / Conclusion

15.1 Discussion

This project shows how a large public health surveillance dataset can be transformed into an interpretable workflow for obesity and physical inactivity profiling. The analysis begins with careful question selection, because data_value has different meanings under different public health questions. By focusing on adult obesity (Q036) and no leisure-time physical activity (Q047), the project avoids mixing incompatible indicators and keeps the interpretation consistent.

The exploratory analysis shows that adult obesity increased over the study period. In the all-record trend, adult obesity rose by 6.3 percentage points from 2011 to 2021. The total-stratification trend confirms that the increase remains visible after reducing demographic composition effects. This supports the conclusion that the rising obesity pattern is not only a result of how stratified records are pooled.

Location and demographic comparison further show that obesity is unevenly distributed. The highest average obesity locations include Mississippi, West Virginia, Louisiana, while lower-obesity locations include Colorado, District of Columbia, Hawaii. Demographic profiles also differ strongly, especially across race or ethnicity, age, education, and income groups. These findings justify the modelling strategy: year, location, and demographic stratification are not arbitrary predictors, but variables that the EDA already shows to be important.

The aligned obesity-inactivity analysis is a central design choice. Because obesity and physical inactivity are reported in separate rows, the two indicators are compared only after matching by year, location, stratification category, and stratification group. The report then separates the broad all-profile alignment from the subgroup aligned modelling scope. Within the subgroup modelling scope, the raw correlation between adult obesity and no leisure-time physical activity is 0.457, indicating a positive descriptive association. The within-profile demeaned correlation (0.073) shows how much of this survives once stable between-profile differences are removed, and the subgroup-level lagged association check stays positive across lag 0, lag 1, and lag 2. Taken together, these are stable subgroup profile-level associations rather than validated predictive or causal relationships, so they motivate examining inactivity alongside obesity while still requiring a cautious non-causal interpretation.

The same pattern reappears in the regression adjustment ladder, which forms a clean gradient with the EDA: the inactivity coefficient falls from about 0.4 when unadjusted to about 0.14 once location and demographic group are controlled (roughly 35% retained), and the stricter within-profile correlation (0.073) sits closer to zero still. The more completely profile identity is absorbed, the less of the association survives, so the EDA and the regression tell one consistent story: the obesity-inactivity link is carried mostly by stable differences between locations and demographic subgroup profiles, with only a modest residual remaining after full adjustment.

The regression model confirms that adult obesity values are strongly structured by public health profile characteristics. The linear regression explains 74.7% of test-set variation in adult obesity values and performs better than the regression tree benchmark. The coefficient interpretation strengthens the result: the yearstart coefficient is approximately 0.6, meaning that obesity increases by about 0.6 percentage points per year after controlling for location and stratification group. This connects the model directly back to the EDA trend.

The classification model provides a practical subgroup-level screening layer. After excluding Total or Overall rows to match the regression scope, the logistic regression achieves a ROC-AUC of 0.928 and an AUC-PR of 0.815. More importantly for prioritisation, the top 20% highest-ranked subgroup profiles capture 65.6% of true high-obesity profiles. The risk-decile validation shows that observed high-obesity rates rise sharply from the lowest to the highest predicted-risk deciles, confirming that the model ranks subgroup profiles in a meaningful order within the observed surveillance panel.

Finally, the public health intervention matrix translates statistical results into action categories. The Priority Intervention segment contains subgroup profiles with both high obesity and high inactivity, representing 11.3% of subgroup aligned profiles. Threshold sensitivity analysis shows that the priority list contracts as the threshold moves from Q70 to Q80, and the robust core contains 1137 subgroup profiles that remain high-priority under all three cut-off choices. This robust core is a defensible starting point for intervention planning, not an automatic policy decision.

15.2 Limitations

Several limitations should be considered when interpreting the results.

The dataset is aggregated, not individual-level.
Each row represents a public health profile such as a year, location, indicator, and stratification group. The results cannot be used to predict whether a specific person is obese.
The analysis is associational, not causal.
The aligned analysis shows that obesity and physical inactivity tend to be higher in the same public health profiles, but this does not prove that physical inactivity causes obesity. Other factors such as diet, income, local environment, healthcare access, and policy context may also contribute.
Model performance reflects interpolation within the observed surveillance panel.
Because location and stratification group are predictors, the regression and classification models mainly learn patterns among already-observed locations and demographic groups. The results should not be treated as strong evidence of performance for entirely new locations, future policy contexts, or unseen demographic categories.
High-risk thresholds are relative.
The project uses quantile thresholds such as Q75 to define high-obesity and high-inactivity profiles. These thresholds are useful for prioritisation, but they are not medical, clinical, or policy-mandated thresholds.
Total-level and subgroup-level records have different meanings.
Total or Overall rows summarise a location-year, while subgroup rows describe demographic slices. The report keeps these levels separate, but users should not interpret subgroup models as population-weighted national estimates.
Some strata have fewer records than others.
Certain demographic groups and locations have smaller coverage or fewer years of data. This can make some subgroup averages and extreme rankings less stable.
The models are intentionally simple.
The regression and classification models use interpretable predictors and additive structures. They do not fully model complex interactions between location, income, education, age, race or ethnicity, and physical inactivity.
The intervention matrix supports prioritisation, not final policy decisions.
The matrix identifies profiles that may deserve attention, but real interventions should also consider local context, feasibility, resources, community needs, and external validation.
The dashboard is exploratory.
The Shiny app helps users inspect trends, rankings, and priority segments, but it is not a production surveillance system or clinical decision tool.

15.3 Conclusion

This project analysed adult obesity and no leisure-time physical activity using the Nutrition, Physical Activity, and Obesity public health surveillance dataset. The dataset satisfies the course dimension requirement and supports a complete data science workflow: data cleaning, exploratory analysis, aligned relationship analysis, regression modelling, classification modelling, intervention prioritisation, threshold sensitivity analysis, and dashboard design.

The main findings are clear. Adult obesity increased from 2011 to 2021, and the increase remains visible in total-stratification records. Obesity also differs substantially across locations and demographic groups, confirming that public health profiles are not evenly distributed. After fair alignment, obesity and physical inactivity show a positive subgroup-level association with a correlation of 0.457 within the modelling scope.

The regression model explains 74.7% of test-set obesity variation, showing that year, location, and demographic group account for a large share of observed obesity differences. The classification model adds subgroup-level screening value: the logistic regression captures 65.6% of true high-obesity subgroup profiles within the top 20% highest-ranked subgroup profiles.

The final public health intervention matrix turns these results into practical segments. The Priority Intervention group combines high obesity with high physical inactivity and represents the strongest candidate group for combined obesity prevention and physical activity promotion. The robust core of 1137 subgroup profiles remains high-priority under Q70, Q75, and Q80 threshold choices, making it a defensible starting point for planning.

Overall, the project demonstrates how programming and data science can convert large-scale public health surveillance data into interpretable insights, screening tools, intervention priorities, and an interactive dashboard. All conclusions should be understood as group-level public health insights rather than individual medical predictions or causal claims.

Adult Obesity and Physical Inactivity Trends: Data-Driven Public Health Prioritisation

WQD7004 Programming for Data Science - Group Project

Group Members: CHEN MIAOJUN, HUANG SHENGQI, WANG HEXIAO, WANG PEIYUAN

2026-06-16