| Member | Student ID | Main Assigned Section |
|---|---|---|
| Chen Miaojun | 25072895 | Dataset Description; Data Structure and Quality; Data Cleaning and Question Selection; Dashboard / App Design |
| Wang Peiyuan | 25089856 | Exploratory Data Analysis; Location and Demographic Comparison; Aligned Relationship and Robustness Checks |
| Huang Shengqi | 24080731 | Regression Model; Regression Diagnostics / Interpretation; Adjustment Ladder; statistical limitation review |
| Wang Hexiao | 25089268 | Classification Model: High-Obesity Subgroup Profile Screening; Risk Ranking Validation; Public Health Intervention Matrix; Threshold Sensitivity and Robust Core |
| All members | N/A | Discussion / Limitations / Conclusion; final review; presentation preparation |
Obesity and physical inactivity are important public health concerns because they are closely related to long-term population health, healthcare burden, and preventive health planning. This project analyses the Nutrition, Physical Activity, and Obesity public health surveillance dataset in the United States. The dataset covers 2011 to 2021 and contains 88,629 records with 33 variables, giving a total dataset dimension of more than 2.9 million cells.
The analysis focuses on two selected indicators: adult obesity and no
leisure-time physical activity. Since the dataset is aggregated, each
row represents a public health profile rather than an individual person.
The project therefore studies obesity and physical inactivity patterns
across years, locations, and demographic groups instead of predicting
individual medical outcomes. A distinction is made between broad
descriptive views, which may use all selected surveillance records, and
modelling-related subgroup analysis, which excludes Total
or Overall aggregate rows to keep the unit of analysis
consistent.
The overall aim is to build an interpretable public health decision-support workflow. The report first prepares clean indicator-specific datasets, then supports exploratory analysis, aligned obesity-inactivity comparison, regression modelling, classification modelling, and public health intervention priority analysis. The workflow deliberately separates total-stratification summaries from demographic subgroup profiles so that exploratory patterns, regression explanation, classification screening, and intervention priorities are interpreted at the correct level. All findings should be interpreted as group-level public health insights rather than individual medical predictions or causal effects.
The main findings show that adult obesity increased by about six percentage points over the study period. Obesity and physical inactivity have a positive subgroup-level aligned correlation of about 0.46, but the within-profile demeaned correlation is only about 0.07, suggesting that much of the relationship is driven by stable differences between locations and demographic profiles. The regression model explains about three-quarters of test-set obesity variation, while the classification model captures about two-thirds of true high-obesity subgroup profiles within the top 20% highest-ranked profiles.
Obesity and physical inactivity are major public health concerns because they are closely associated with chronic disease burden, healthcare cost, and preventive health planning. However, obesity is not evenly distributed across the population. Its prevalence can vary across locations, years, age groups, education levels, income levels, gender groups, and racial or ethnic groups.
A single national average may hide important differences between communities. For public health planning, it is more useful to identify which combinations of year, location, and demographic group have higher obesity prevalence. These combinations can be understood as public health profiles.
Physical inactivity is also an important surveillance indicator. A lack of leisure-time physical activity may appear together with higher obesity prevalence in some public health profiles. However, this project treats the relationship as an observed association. It does not claim that physical inactivity directly causes obesity.
This project focuses on three connected tasks.
First, it describes adult obesity and physical inactivity trends across years, locations, and demographic groups. Second, it examines whether public health profile characteristics can explain differences in adult obesity prevalence. Third, it screens high-obesity demographic subgroup profiles and translates obesity-inactivity patterns into interpretable public health priority groups.
The focus is therefore not only on building models, but also on producing results that can be understood and used for public health monitoring and prioritisation.
The objective of this project is to transform large-scale public health surveillance data into interpretable insights and practical public health priorities.
The project is guided by the following research questions:
Therefore, all findings should be interpreted as group-level public health insights rather than individual medical predictions.
This project is framed as a public health profile analysis rather than an individual diagnosis or personal risk prediction task.
data_value has different meanings under
different public health questions, so the analysis must select specific
questionid values before modelling.Total or Overall rows excluded.The dataset used in this project is the Nutrition, Physical Activity, and Obesity dataset. The accessible CSV file was obtained from Kaggle, while the original public health context comes from the CDC Division of Nutrition, Physical Activity, and Obesity surveillance programme. The underlying data source is the Behavioral Risk Factor Surveillance System, a large state-based health survey system in the United States.
Nutrition__Physical_Activity__and_Obesity.csvThe dataset dimension is calculated after loading the raw CSV file.
csv_file <- "Nutrition__Physical_Activity__and_Obesity.csv"
if (!file.exists(csv_file)) {
csv_files <- list.files(pattern = "\\.csv$", ignore.case = TRUE)
csv_files <- csv_files[!str_detect(csv_files, "^project_output_")]
if (length(csv_files) == 1) {
csv_file <- csv_files[1]
message("Using detected source CSV file: ", csv_file)
} else {
stop(
"Please place Nutrition__Physical_Activity__and_Obesity.csv in the same folder as this Rmd file."
)
}
}
raw_data <- readr::read_csv(csv_file, show_col_types = FALSE)clean_colnames <- function(x) {
x %>%
str_to_lower() %>%
str_replace_all("[^a-z0-9]+", "_") %>%
str_replace_all("^_|_$", "")
}
original_names <- names(raw_data)
names(raw_data) <- clean_colnames(names(raw_data))
column_mapping <- tibble(
original_name = original_names,
cleaned_name = names(raw_data)
)
n_rows <- nrow(raw_data)
n_cols <- ncol(raw_data)
n_dim <- n_rows * n_cols
dataset_dimension <- tibble(
rows = n_rows,
columns = n_cols,
dimension_cells = n_dim
)
dataset_dimension %>%
kable(caption = "Raw Dataset Dimension")| rows | columns | dimension_cells |
|---|---|---|
| 88629 | 33 | 2924757 |
The dataset contains 88,629 rows x 33 columns = 2,924,757 cells. This satisfies the course dataset-dimension requirement because the total dimension is greater than 100,000.
The dataset contains multiple public health questions, so many columns are metadata, identifiers, confidence limits, sample sizes, and stratification labels. The main variables used in this project are listed below.
| Variable | Description |
|---|---|
yearstart |
Starting year of the record |
yearend |
Ending year of the record |
locationdesc |
U.S. state, territory, or national location description |
class |
Broad public health category |
topic |
Specific topic under each class |
question |
Public health indicator question |
data_value |
Main numeric value for the selected indicator |
low_confidence_limit |
Lower confidence limit for the estimate |
high_confidence_limit |
Upper confidence limit for the estimate |
sample_size |
Sample size used for the estimate |
stratificationcategory1 |
Type of demographic stratification |
stratification1 |
Specific group within the stratification category |
questionid |
Stable identifier for the public health question |
A key property of this dataset is that data_value is
question-dependent. This means that the same numeric column can
represent different measurements depending on the value of
questionid.
For example:
data_value represents adult
obesity prevalence.data_value
represents a physical activity measurement.data_value represents a nutrition-related measurement.Therefore, this project does not pool all data_value
records together. Instead, it selects specific public health questions
before analysis. This avoids mixing values that have different
meanings.
A second interpretation issue is the difference between
Total or Overall rows and demographic subgroup
rows. Total rows are useful for broad state-level trends
because they summarise the overall adult population for a location and
year. However, regression, classification, risk ranking, and
intervention-priority analysis are based on demographic subgroup
profiles. Therefore, Total or Overall rows are
excluded from modelling-related sections to avoid mixing aggregate
summaries with subgroup observations.
This project focuses on two public health indicators.
| Indicator | Question ID | Question text |
|---|---|---|
| Adult obesity | Q036 |
Percent of adults aged 18 years and older who have obesity |
| No leisure-time physical activity | Q047 |
Percent of adults who engage in no leisure-time physical activity |
These two indicators are selected because they support the main project objective: analysing obesity patterns and examining how they relate to physical inactivity at the public health profile level.
Before any modelling, the structure and quality of the raw data are profiled. This includes variable classes, missing values, distinct-value counts, duplicate rows, and complete-row counts.
structure_summary <- tibble(
variable = names(raw_data),
class = map_chr(raw_data, ~ class(.x)[1]),
missing_n = map_int(raw_data, ~ sum(is.na(.x))),
missing_rate = map_dbl(raw_data, ~ mean(is.na(.x))),
unique_n = map_int(raw_data, ~ n_distinct(.x, na.rm = TRUE)),
example_values = map_chr(
raw_data,
~ paste(head(unique(na.omit(as.character(.x))), 5), collapse = " | ")
)
)
quality_summary <- tibble(
rows = nrow(raw_data),
columns = ncol(raw_data),
dimension_cells = nrow(raw_data) * ncol(raw_data),
duplicate_rows = sum(duplicated(raw_data)),
duplicate_rate = mean(duplicated(raw_data)),
complete_rows = sum(complete.cases(raw_data)),
complete_row_rate = mean(complete.cases(raw_data))
)
quality_summary %>%
kable(digits = 4, caption = "Row-Level Data Quality Summary")| rows | columns | dimension_cells | duplicate_rows | duplicate_rate | complete_rows | complete_row_rate |
|---|---|---|---|---|---|---|
| 88629 | 33 | 2924757 | 0 | 0 | 0 | 0 |
readr::write_csv(structure_summary, "project_output_variable_structure.csv")
readr::write_csv(quality_summary, "project_output_quality_summary.csv")The raw dataset contains 88,629 rows and 33 columns. The total dataset dimension is 2,924,757 cells. The dataset contains 0 exact duplicate rows.
key_variables <- c(
"yearstart",
"yearend",
"locationdesc",
"class",
"topic",
"question",
"data_value",
"low_confidence_limit",
"high_confidence_limit",
"sample_size",
"stratificationcategory1",
"stratification1",
"questionid"
)
key_variable_summary <- structure_summary %>%
filter(variable %in% key_variables)
key_variable_summary %>%
kable(digits = 4, caption = "Key Variable Structure Summary")| variable | class | missing_n | missing_rate | unique_n | example_values |
|---|---|---|---|---|---|
| yearstart | numeric | 0 | 0.0000 | 11 | 2020 | 2014 | 2013 | 2015 | 2012 |
| yearend | numeric | 0 | 0.0000 | 11 | 2020 | 2014 | 2013 | 2015 | 2012 |
| locationdesc | character | 0 | 0.0000 | 55 | National | Guam | Wyoming | District of Columbia | Puerto Rico |
| class | character | 0 | 0.0000 | 3 | Physical Activity | Obesity / Weight Status | Fruits and Vegetables |
| topic | character | 0 | 0.0000 | 3 | Physical Activity - Behavior | Obesity / Weight Status | Fruits and Vegetables - Behavior |
| question | character | 0 | 0.0000 | 9 | Percent of adults who engage in no leisure-time physical activity | Percent of adults aged 18 years and older who have obesity | Percent of adults aged 18 years and older who have an overweight classification | Percent of adults who achieve at least 300 minutes a week of moderate-intensity aerobic physical activity or 150 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination) | Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic physical activity and engage in muscle-strengthening activities on 2 or more days a week |
| data_value | numeric | 8778 | 0.0990 | 695 | 30.6 | 29.3 | 28.8 | 32.7 | 26.6 |
| low_confidence_limit | numeric | 8778 | 0.0990 | 666 | 29.4 | 25.7 | 28.1 | 31.9 | 25.6 |
| high_confidence_limit | numeric | 8778 | 0.0990 | 760 | 31.8 | 33.3 | 29.5 | 33.5 | 27.6 |
| sample_size | numeric | 8778 | 0.0990 | 9580 | 31255 | 842 | 62562 | 60069 | 30904 |
| questionid | character | 0 | 0.0000 | 9 | Q047 | Q036 | Q037 | Q045 | Q044 |
| stratificationcategory1 | character | 9 | 0.0001 | 6 | Race/Ethnicity | Education | Income | Age (years) | Gender |
| stratification1 | character | 9 | 0.0001 | 28 | Hispanic | High school graduate | $50,000 - $74,999 | Data not reported | Less than $15,000 |
high_missing_summary <- structure_summary %>%
arrange(desc(missing_rate), desc(unique_n)) %>%
slice_head(n = 10)
high_missing_summary %>%
kable(digits = 4, caption = "Variables with Highest Missing Rates")| variable | class | missing_n | missing_rate | unique_n | example_values |
|---|---|---|---|---|---|
| data_value_unit | logical | 88629 | 1.0000 | 0 | |
| total | character | 85464 | 0.9643 | 1 | Total |
| gender | character | 82299 | 0.9286 | 2 | Female | Male |
| data_value_footnote_symbol | character | 79851 | 0.9010 | 1 | ~ |
| data_value_footnote | character | 79851 | 0.9010 | 1 | Data not available because sample size is insufficient. |
| education | character | 75969 | 0.8572 | 4 | High school graduate | Less than high school | Some college or technical school | College graduate |
| age_years | character | 69639 | 0.7857 | 6 | 25 - 34 | 55 - 64 | 18 - 24 | 45 - 54 | 35 - 44 |
| income | character | 66474 | 0.7500 | 7 | $50,000 - $74,999 | Data not reported | Less than $15,000 | $25,000 - $34,999 | $15,000 - $24,999 |
| race_ethnicity | character | 63309 | 0.7143 | 8 | Hispanic | American Indian/Alaska Native | Asian | Non-Hispanic White | Other |
| sample_size | numeric | 8778 | 0.0990 | 9580 | 31255 | 842 | 62562 | 60069 | 30904 |
The number of complete rows is 0. This does not mean that the dataset is unusable. Much of the missingness is structural.
For example, demographic columns such as age_years,
education, gender, income,
race_ethnicity, and total are mutually
exclusive stratification fields. A record normally belongs to one
stratification category, so the other demographic columns are expected
to be empty. The same information is carried more consistently by
stratificationcategory1 and
stratification1.
The dataset also contains missing values in variables such as
data_value, low_confidence_limit,
high_confidence_limit, and sample_size. These
missing values are often linked to unavailable estimates, such as cases
where the sample size is insufficient. Therefore, the modelling dataset
is prepared by selecting the relevant indicators and removing rows where
the selected indicator value is missing.
Because the data are aggregated and question-specific, cleaning here mainly means selecting and validating comparable records rather than repairing individual-level fields.
The cleaning workflow includes:
questionid.data_value for the selected
indicators.The duplicate check found 0 exact duplicate rows, so no de-duplication was required.
has_questionid <- "questionid" %in% names(raw_data)
has_question <- "question" %in% names(raw_data)
if (!has_question) {
stop("The dataset must contain a question column after column-name cleaning.")
}
question_table <- raw_data %>%
distinct(
questionid = if (has_questionid) questionid else NA_character_,
question = question
) %>%
arrange(questionid, question)
question_table %>%
kable(caption = "Available Public Health Questions")| questionid | question |
|---|---|
| Q018 | Percent of adults who report consuming fruit less than one time daily |
| Q019 | Percent of adults who report consuming vegetables less than one time daily |
| Q036 | Percent of adults aged 18 years and older who have obesity |
| Q037 | Percent of adults aged 18 years and older who have an overweight classification |
| Q043 | Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination) |
| Q044 | Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic physical activity and engage in muscle-strengthening activities on 2 or more days a week |
| Q045 | Percent of adults who achieve at least 300 minutes a week of moderate-intensity aerobic physical activity or 150 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination) |
| Q046 | Percent of adults who engage in muscle-strengthening activities on 2 or more days a week |
| Q047 | Percent of adults who engage in no leisure-time physical activity |
The table above confirms that the dataset contains multiple public
health questions. Since data_value has a different meaning
under different questions, the analysis selects only the two questions
directly related to this project.
if (has_questionid) {
selected_questions <- question_table %>%
filter(questionid %in% c("Q036", "Q047"))
} else {
selected_questions <- question_table %>%
filter(
str_detect(str_to_lower(question), "obesity") |
str_detect(
str_to_lower(question),
"no leisure-time physical activity|no leisure time physical activity"
)
)
}
selected_questions %>%
kable(caption = "Selected Public Health Questions")| questionid | question |
|---|---|
| Q036 | Percent of adults aged 18 years and older who have obesity |
| Q047 | Percent of adults who engage in no leisure-time physical activity |
The selected questions are:
Q036: Percent of adults aged 18 years and older who
have obesity.Q047: Percent of adults who engage in no leisure-time
physical activity.The preferred selection method is questionid, because it
is more stable than text matching. Text matching is used only as a
fallback if the dataset does not contain a questionid
column.
if (has_questionid) {
obesity_filter <- raw_data$questionid == "Q036"
inactivity_filter <- raw_data$questionid == "Q047"
} else {
obesity_filter <- str_detect(
str_to_lower(raw_data$question),
"obesity"
)
inactivity_filter <- str_detect(
str_to_lower(raw_data$question),
"no leisure-time physical activity|no leisure time physical activity"
)
}
indicator_match_summary <- tibble(
indicator = c(
"Adult obesity",
"No leisure-time physical activity"
),
matched_records_before_missing_value_removal = c(
sum(obesity_filter, na.rm = TRUE),
sum(inactivity_filter, na.rm = TRUE)
)
)
indicator_match_summary %>%
kable(caption = "Records Matched to Each Selected Indicator")| indicator | matched_records_before_missing_value_removal |
|---|---|
| Adult obesity | 16577 |
| No leisure-time physical activity | 16549 |
required_cols <- c(
"yearstart",
"locationdesc",
"question",
"data_value",
"stratificationcategory1",
"stratification1"
)
missing_required <- setdiff(required_cols, names(raw_data))
if (length(missing_required) > 0) {
stop(
paste(
"Missing required columns:",
paste(missing_required, collapse = ", ")
)
)
}
obesity_data <- raw_data %>%
filter(obesity_filter, !is.na(data_value)) %>%
mutate(
indicator = "Adult obesity",
stratificationcategory1 = coalesce(as.character(stratificationcategory1), "Unknown"),
stratification1 = coalesce(as.character(stratification1), "Unknown")
)
inactivity_data <- raw_data %>%
filter(inactivity_filter, !is.na(data_value)) %>%
mutate(
indicator = "No leisure-time physical activity",
stratificationcategory1 = coalesce(as.character(stratificationcategory1), "Unknown"),
stratification1 = coalesce(as.character(stratification1), "Unknown")
)obesity_summary <- obesity_data %>%
summarise(
rows = n(),
years = n_distinct(yearstart),
min_year = min(yearstart, na.rm = TRUE),
max_year = max(yearstart, na.rm = TRUE),
locations = n_distinct(locationdesc),
mean_value = mean(data_value, na.rm = TRUE),
sd_value = sd(data_value, na.rm = TRUE),
min_value = min(data_value, na.rm = TRUE),
max_value = max(data_value, na.rm = TRUE)
)
inactivity_summary <- inactivity_data %>%
summarise(
rows = n(),
years = n_distinct(yearstart),
min_year = min(yearstart, na.rm = TRUE),
max_year = max(yearstart, na.rm = TRUE),
locations = n_distinct(locationdesc),
mean_value = mean(data_value, na.rm = TRUE),
sd_value = sd(data_value, na.rm = TRUE),
min_value = min(data_value, na.rm = TRUE),
max_value = max(data_value, na.rm = TRUE)
)
indicator_summary <- bind_rows(
obesity_summary %>% mutate(indicator = "Adult obesity"),
inactivity_summary %>% mutate(indicator = "No leisure-time physical activity")
) %>%
select(indicator, everything())
indicator_summary %>%
kable(digits = 4, caption = "Cleaned Indicator Dataset Summary")| indicator | rows | years | min_year | max_year | locations | mean_value | sd_value | min_value | max_value |
|---|---|---|---|---|---|---|---|---|---|
| Adult obesity | 14956 | 11 | 2011 | 2021 | 55 | 30.4590 | 7.2741 | 0.9 | 60.4 |
| No leisure-time physical activity | 14984 | 11 | 2011 | 2021 | 55 | 25.9161 | 8.3074 | 2.5 | 66.8 |
After removing records with missing data_value, the
adult obesity dataset contains 14,956 records, and the no leisure-time
physical activity dataset contains 14,984 records.
Both selected indicators are percentages, so valid values should fall between 0 and 100.
value_range_check <- tibble(
indicator = c(
"Adult obesity",
"No leisure-time physical activity"
),
questionid = c("Q036", "Q047"),
min_value = c(
min(obesity_data$data_value, na.rm = TRUE),
min(inactivity_data$data_value, na.rm = TRUE)
),
max_value = c(
max(obesity_data$data_value, na.rm = TRUE),
max(inactivity_data$data_value, na.rm = TRUE)
),
out_of_range_n = c(
sum(obesity_data$data_value < 0 | obesity_data$data_value > 100, na.rm = TRUE),
sum(inactivity_data$data_value < 0 | inactivity_data$data_value > 100, na.rm = TRUE)
)
)
value_range_check %>%
kable(digits = 2, caption = "Indicator Value Range Validation")| indicator | questionid | min_value | max_value | out_of_range_n |
|---|---|---|---|---|
| Adult obesity | Q036 | 0.9 | 60.4 | 0 |
| No leisure-time physical activity | Q047 | 2.5 | 66.8 | 0 |
The range validation confirms that all retained adult obesity and physical inactivity values are within the valid percentage range from 0 to 100.
cleaning_summary <- tibble(
stage = c(
"Raw data",
"Cleaned adult obesity dataset",
"Cleaned no leisure-time physical activity dataset"
),
questionid = c(
NA_character_,
"Q036",
"Q047"
),
rows = c(
nrow(raw_data),
nrow(obesity_data),
nrow(inactivity_data)
),
columns = c(
ncol(raw_data),
ncol(obesity_data),
ncol(inactivity_data)
)
)
cleaning_summary %>%
kable(caption = "Before and After Cleaning Summary")| stage | questionid | rows | columns |
|---|---|---|---|
| Raw data | NA | 88629 | 33 |
| Cleaned adult obesity dataset | Q036 | 14956 | 34 |
| Cleaned no leisure-time physical activity dataset | Q047 | 14984 | 34 |
readr::write_csv(cleaning_summary, "project_output_cleaning_summary.csv")
readr::write_csv(obesity_data, "project_output_cleaned_obesity_dataset.csv")
readr::write_csv(inactivity_data, "project_output_cleaned_inactivity_dataset.csv")The cleaned datasets provide the foundation for the next stages of the project, including exploratory trend analysis, demographic comparison, aligned obesity-inactivity analysis, regression modelling, classification modelling, and public health intervention priority analysis.
The EDA has three goals. First, it shows how adult obesity and physical inactivity changed from 2011 to 2021. Second, it compares obesity across locations and demographic groups. Third, it builds an aligned obesity-inactivity dataset. This makes sure the two indicators are compared within the same public health profile. This part is descriptive: it does not test causality and does not predict future obesity. It helps us understand the main patterns before the regression and classification models.
The EDA uses two related levels of analysis. Broad descriptive views
may use all selected records or Total records to show
general surveillance patterns. Modelling-related EDA uses demographic
subgroup profiles only, with Total or Overall
rows removed. This keeps the EDA, regression, classification, risk
ranking, intervention matrix, and threshold sensitivity analysis on a
consistent subgroup-level basis.
Three rules guide this EDA:
data_value is interpreted only within the selected
indicator.The figures in this section use a consistent visual style so that trends, rankings, and relationships can be compared more easily.
The first trend plot uses all selected records for the two indicators. It gives a broad descriptive overview of the surveillance data before separating total-level and subgroup-level analysis. The next subsection then uses total-stratification records to check whether the main trend remains visible when repeated demographic subgroup records are removed.
This pooled view is useful as a starting point, but it should not be
over-read. The data include both Total records and
demographic subgroup records, so the yearly average can be affected by
the mix of records in each year. For that reason, the
total-stratification robustness check is used as a more comparable
state-level follow-up, while subgroup-only aligned data are used later
for modelling-related analysis.
combined_core <- bind_rows(
obesity_data %>%
mutate(indicator = "Adult obesity"),
inactivity_data %>%
mutate(indicator = "No leisure-time physical activity")
)
yearly_trend <- combined_core %>%
group_by(indicator, yearstart) %>%
summarise(
records = n(),
average_value = mean(data_value, na.rm = TRUE),
median_value = median(data_value, na.rm = TRUE),
sd_value = sd(data_value, na.rm = TRUE),
.groups = "drop"
)
trend_change <- yearly_trend %>%
arrange(indicator, yearstart) %>%
group_by(indicator) %>%
summarise(
first_year = first(yearstart),
first_average = first(average_value),
last_year = last(yearstart),
last_average = last(average_value),
absolute_change = last_average - first_average,
relative_change_pct = absolute_change / first_average * 100,
.groups = "drop"
)
yearly_trend %>%
kable(digits = 3, caption = "Yearly Average of Selected Public Health Indicators")| indicator | yearstart | records | average_value | median_value | sd_value |
|---|---|---|---|---|---|
| Adult obesity | 2011 | 1332 | 27.736 | 28.10 | 6.756 |
| Adult obesity | 2012 | 1332 | 27.997 | 28.40 | 6.734 |
| Adult obesity | 2013 | 1344 | 28.816 | 29.35 | 6.782 |
| Adult obesity | 2014 | 1369 | 29.336 | 29.90 | 6.998 |
| Adult obesity | 2015 | 1361 | 29.533 | 30.00 | 6.963 |
| Adult obesity | 2016 | 1389 | 30.086 | 30.70 | 7.036 |
| Adult obesity | 2017 | 1372 | 30.956 | 31.60 | 7.047 |
| Adult obesity | 2018 | 1369 | 31.633 | 32.20 | 7.004 |
| Adult obesity | 2019 | 1343 | 32.356 | 32.90 | 7.178 |
| Adult obesity | 2020 | 1368 | 32.413 | 32.85 | 7.269 |
| Adult obesity | 2021 | 1377 | 34.033 | 34.60 | 7.417 |
| No leisure-time physical activity | 2011 | 1332 | 26.492 | 26.20 | 7.416 |
| No leisure-time physical activity | 2012 | 1336 | 23.898 | 22.95 | 7.956 |
| No leisure-time physical activity | 2013 | 1318 | 26.855 | 26.30 | 7.588 |
| No leisure-time physical activity | 2014 | 1376 | 24.236 | 23.50 | 8.035 |
| No leisure-time physical activity | 2015 | 1358 | 27.054 | 26.50 | 8.025 |
| No leisure-time physical activity | 2016 | 1399 | 24.895 | 24.10 | 8.614 |
| No leisure-time physical activity | 2017 | 1372 | 27.972 | 27.60 | 8.101 |
| No leisure-time physical activity | 2018 | 1377 | 25.632 | 24.60 | 8.720 |
| No leisure-time physical activity | 2019 | 1346 | 27.769 | 26.85 | 8.427 |
| No leisure-time physical activity | 2020 | 1382 | 24.933 | 23.85 | 8.531 |
| No leisure-time physical activity | 2021 | 1388 | 25.427 | 24.20 | 8.644 |
| indicator | first_year | first_average | last_year | last_average | absolute_change | relative_change_pct |
|---|---|---|---|---|---|---|
| Adult obesity | 2011 | 27.736 | 2021 | 34.033 | 6.296 | 22.70 |
| No leisure-time physical activity | 2011 | 26.492 | 2021 | 25.427 | -1.065 | -4.02 |
readr::write_csv(yearly_trend, "project_output_yearly_trend.csv")
readr::write_csv(trend_change, "project_output_yearly_trend_change.csv")yearly_trend %>%
ggplot(aes(x = yearstart, y = average_value, colour = indicator, group = indicator)) +
geom_line(linewidth = 1.1) +
geom_point(size = 2.5) +
scale_colour_manual(values = indicator_colours) +
scale_x_continuous(breaks = sort(unique(yearly_trend$yearstart))) +
scale_y_continuous(labels = scales::label_number(suffix = "%")) +
labs(
title = "Overall Trends in Adult Obesity and Physical Inactivity",
subtitle = "Broad descriptive view across all selected non-missing public health records",
x = "Year",
y = "Average value"
) +
theme_report()The two indicators show different shapes. Adult obesity shows an overall upward pattern across the decade. It rises from 27.74% in 2011 to 34.03% in 2021, an increase of 6.3 percentage points. Physical inactivity shows a recurring jagged pattern rather than a clear upward or downward trend, which is why it is interpreted more cautiously than the obesity trend.
The overall trend above mixes many demographic subgroup records for
the same location and year. As a robustness check, this subsection keeps
only the Total records. These are the overall adult rate
for each location, with no demographic breakdown, so they avoid repeated
demographic weighting and provide a more comparable year-location
summary.
find_total_records <- function(df) {
total_flag <- rep(FALSE, nrow(df))
if ("stratificationcategory1" %in% names(df)) {
total_flag <- total_flag |
str_to_lower(as.character(df$stratificationcategory1)) %in% c("total", "overall")
}
if ("stratification1" %in% names(df)) {
total_flag <- total_flag |
str_to_lower(as.character(df$stratification1)) %in% c("total", "overall")
}
if ("total" %in% names(df)) {
total_flag <- total_flag | !is.na(df$total)
}
total_flag
}
obesity_total <- obesity_data %>%
filter(find_total_records(.))
inactivity_total <- inactivity_data %>%
filter(find_total_records(.))
total_check <- tibble(
indicator = c("Adult obesity", "No leisure-time physical activity"),
records = c(nrow(obesity_total), nrow(inactivity_total)),
years = c(n_distinct(obesity_total$yearstart), n_distinct(inactivity_total$yearstart)),
locations = c(n_distinct(obesity_total$locationdesc), n_distinct(inactivity_total$locationdesc))
)
total_indicator_trend <- bind_rows(
obesity_total %>%
transmute(indicator = "Adult obesity", yearstart, locationdesc, value = data_value),
inactivity_total %>%
transmute(indicator = "No leisure-time physical activity", yearstart, locationdesc, value = data_value)
) %>%
group_by(indicator, yearstart) %>%
summarise(
records = n(),
average_value = mean(value, na.rm = TRUE),
median_value = median(value, na.rm = TRUE),
sd_value = sd(value, na.rm = TRUE),
.groups = "drop"
)
total_trend_change <- total_indicator_trend %>%
arrange(indicator, yearstart) %>%
group_by(indicator) %>%
summarise(
first_year = first(yearstart),
first_average = first(average_value),
last_year = last(yearstart),
last_average = last(average_value),
absolute_change = last_average - first_average,
relative_change_pct = absolute_change / first_average * 100,
.groups = "drop"
)
trend_change_comparison <- bind_rows(
trend_change %>%
mutate(analysis_basis = "All selected records"),
total_trend_change %>%
mutate(analysis_basis = "Total records only")
) %>%
select(
analysis_basis,
indicator,
first_year,
first_average,
last_year,
last_average,
absolute_change
)
total_check %>%
kable(caption = "Coverage of Total-Stratification Records")| indicator | records | years | locations |
|---|---|---|---|
| Adult obesity | 589 | 11 | 55 |
| No leisure-time physical activity | 588 | 11 | 55 |
trend_change_comparison %>%
kable(digits = 3, caption = "First-to-Last-Year Change Under Two Analysis Bases")| analysis_basis | indicator | first_year | first_average | last_year | last_average | absolute_change |
|---|---|---|---|---|---|---|
| All selected records | Adult obesity | 2011 | 27.736 | 2021 | 34.033 | 6.296 |
| All selected records | No leisure-time physical activity | 2011 | 26.492 | 2021 | 25.427 | -1.065 |
| Total records only | Adult obesity | 2011 | 27.588 | 2021 | 33.537 | 5.949 |
| Total records only | No leisure-time physical activity | 2011 | 25.737 | 2021 | 24.187 | -1.550 |
readr::write_csv(total_check, "project_output_total_stratification_check.csv")
readr::write_csv(total_indicator_trend, "project_output_total_indicator_trend.csv")
readr::write_csv(total_trend_change, "project_output_total_trend_change.csv")
readr::write_csv(trend_change_comparison, "project_output_trend_change_comparison.csv")trend_change_comparison %>%
mutate(
analysis_basis = factor(
analysis_basis,
levels = c("All selected records", "Total records only")
)
) %>%
ggplot(aes(x = indicator, y = absolute_change, fill = analysis_basis)) +
geom_col(width = 0.64, position = position_dodge(width = 0.72)) +
geom_text(
aes(label = round(absolute_change, 2)),
position = position_dodge(width = 0.72),
vjust = if_else(trend_change_comparison$absolute_change >= 0, -0.35, 1.25),
colour = "#243447",
size = 3.6
) +
geom_hline(yintercept = 0, colour = "#5D6975", linewidth = 0.5) +
scale_fill_manual(values = c(
"All selected records" = "#A7B1B7",
"Total records only" = "#496A81"
)) +
scale_y_continuous(
labels = scales::label_number(suffix = " pts"),
expand = expansion(mult = c(0.16, 0.16))
) +
labs(
title = "Trend Change Robustness Check",
subtitle = "First-to-last-year change remains similar after using Total-only records",
x = NULL,
y = "Change from first to last year"
) +
theme_report()The total records cover the indicators well for state-level analysis. Adult obesity total records span 11 years and 55 locations. On this Total-only basis, adult obesity still rises from 27.59% to 33.54%, while physical inactivity shows no clear upward or downward trend and oscillates around a broadly similar level. The robustness plot does not repeat the first trend chart; instead, it compares the first-to-last-year change under the all-record and Total-only bases. The obesity increase remains similar under both bases, so the upward obesity pattern is not simply an artefact of repeated demographic subgroup records.
Two limits apply to this trend. First, it is an unweighted average of available location-level total records. Each location has the same weight, so a large state and a small state are treated the same. It should be read as a state-level surveillance trend, not a population-weighted national prevalence estimate. Second, 2020 and 2021 overlap with the COVID-19 pandemic period, so changes in those years should be read carefully: they may reflect both real public health change and data-collection context.
Trends describe the when; location and demographic comparisons describe the where and who. This section first compares locations on a total-stratification basis, then uses demographic stratification records to identify group-level differences in obesity prevalence.
The location ranking uses Total obesity records only, so
each location is compared at the same level and locations with more
subgroup records do not get extra weight. To reduce instability from
very short series, the ranking keeps only locations with at least eight
years of total-stratification obesity records.
min_years_for_location <- 8
location_obesity <- obesity_total %>%
filter(locationdesc != "National") %>%
group_by(locationdesc) %>%
summarise(
records = n(),
years = n_distinct(yearstart),
average_obesity = mean(data_value, na.rm = TRUE),
median_obesity = median(data_value, na.rm = TRUE),
min_obesity = min(data_value, na.rm = TRUE),
max_obesity = max(data_value, na.rm = TRUE),
.groups = "drop"
) %>%
filter(years >= min_years_for_location) %>%
arrange(desc(average_obesity))
top_obesity_locations <- location_obesity %>%
slice_max(average_obesity, n = 10, with_ties = FALSE)
low_obesity_locations <- location_obesity %>%
slice_min(average_obesity, n = 10, with_ties = FALSE)
location_obesity_extremes <- bind_rows(
top_obesity_locations %>% mutate(group = "Highest average obesity"),
low_obesity_locations %>% mutate(group = "Lowest average obesity")
)
top_obesity_locations %>%
kable(digits = 3, caption = "Top 10 Locations by Average Adult Obesity")| locationdesc | records | years | average_obesity | median_obesity | min_obesity | max_obesity |
|---|---|---|---|---|---|---|
| Mississippi | 11 | 11 | 37.218 | 37.3 | 34.6 | 40.8 |
| West Virginia | 11 | 11 | 37.027 | 37.7 | 32.4 | 40.6 |
| Louisiana | 11 | 11 | 35.764 | 35.9 | 33.1 | 38.6 |
| Arkansas | 11 | 11 | 35.518 | 35.7 | 30.9 | 38.7 |
| Alabama | 11 | 11 | 35.427 | 35.7 | 32.0 | 39.9 |
| Kentucky | 11 | 11 | 34.509 | 34.3 | 30.4 | 40.3 |
| Oklahoma | 11 | 11 | 34.491 | 33.9 | 31.1 | 39.4 |
| Tennessee | 11 | 11 | 33.464 | 33.8 | 29.2 | 36.5 |
| Indiana | 11 | 11 | 33.327 | 32.7 | 30.8 | 36.8 |
| South Carolina | 11 | 11 | 33.300 | 32.3 | 30.8 | 36.2 |
low_obesity_locations %>%
kable(digits = 3, caption = "Lowest 10 Locations by Average Adult Obesity")| locationdesc | records | years | average_obesity | median_obesity | min_obesity | max_obesity |
|---|---|---|---|---|---|---|
| Colorado | 11 | 11 | 22.273 | 22.3 | 20.2 | 25.1 |
| District of Columbia | 11 | 11 | 23.218 | 23.0 | 21.7 | 24.7 |
| Hawaii | 11 | 11 | 23.545 | 23.8 | 21.8 | 25.0 |
| Massachusetts | 11 | 11 | 24.455 | 24.3 | 22.7 | 27.4 |
| California | 11 | 11 | 25.618 | 25.0 | 23.8 | 30.3 |
| New York | 11 | 11 | 26.073 | 25.7 | 23.6 | 29.1 |
| Vermont | 11 | 11 | 26.164 | 26.3 | 23.7 | 29.0 |
| New Jersey | 10 | 10 | 26.340 | 26.6 | 23.7 | 28.2 |
| Montana | 11 | 11 | 26.345 | 25.5 | 23.6 | 31.8 |
| Utah | 11 | 11 | 26.382 | 25.4 | 24.1 | 30.9 |
readr::write_csv(location_obesity, "project_output_location_obesity_ranking.csv")
readr::write_csv(location_obesity_extremes, "project_output_location_obesity_extremes.csv")location_obesity_extremes %>%
mutate(
location_label = reorder(locationdesc, average_obesity),
group = factor(group, levels = c("Highest average obesity", "Lowest average obesity"))
) %>%
ggplot(aes(x = location_label, y = average_obesity, fill = group)) +
geom_col(width = 0.72) +
geom_text(
aes(label = round(average_obesity, 1)),
hjust = -0.12,
size = 3.2,
colour = "#243447"
) +
coord_flip() +
facet_wrap(~group, scales = "free") +
scale_fill_manual(values = c(
"Highest average obesity" = bar_fill_high,
"Lowest average obesity" = bar_fill_low
)) +
scale_y_continuous(
labels = scales::label_number(suffix = "%"),
expand = expansion(mult = c(0, 0.14))
) +
labs(
title = "Location Differences in Adult Obesity",
subtitle = paste0("Total-stratification records; locations with at least ", min_years_for_location, " years of data"),
x = NULL,
y = "Average adult obesity"
) +
theme_report()The ranking shows clear differences between locations, so obesity is
not evenly spread across places. The highest average obesity locations
are led by Mississippi, West Virginia, and Louisiana, which form a clear
concentration in Southern and Appalachian areas. The lowest are led by
Colorado, District of Columbia, and Hawaii. This is still descriptive:
it shows where higher obesity values appear, not why. The gap between
the highest and lowest groups is large, which is why
locationdesc is later included in the regression model.
The demographic comparison uses subgroup records only.
Total records are removed so that overall rates are not
mixed with subgroup rates. The aim is to see which age, education,
income, gender, and race or ethnicity groups have higher average obesity
values. These results describe public health profiles, not individual
biological risk.
obesity_stratification <- obesity_data %>%
filter(
!is.na(stratificationcategory1),
!is.na(stratification1),
!str_to_lower(as.character(stratificationcategory1)) %in% c("total", "overall"),
!str_to_lower(as.character(stratification1)) %in% c("total", "overall")
) %>%
group_by(stratificationcategory1, stratification1) %>%
summarise(
records = n(),
years = n_distinct(yearstart),
locations = n_distinct(locationdesc),
average_obesity = mean(data_value, na.rm = TRUE),
median_obesity = median(data_value, na.rm = TRUE),
.groups = "drop"
) %>%
arrange(stratificationcategory1, desc(average_obesity))
stratification_gap <- obesity_stratification %>%
group_by(stratificationcategory1) %>%
arrange(desc(average_obesity), .by_group = TRUE) %>%
summarise(
highest_group = first(stratification1),
highest_average = first(average_obesity),
lowest_group = last(stratification1),
lowest_average = last(average_obesity),
within_category_gap = highest_average - lowest_average,
groups_compared = n(),
.groups = "drop"
) %>%
arrange(desc(within_category_gap))
obesity_stratification %>%
kable(digits = 3, caption = "Average Adult Obesity by Demographic Stratification Group")| stratificationcategory1 | stratification1 | records | years | locations | average_obesity | median_obesity |
|---|---|---|---|---|---|---|
| Age (years) | 45 - 54 | 589 | 11 | 55 | 35.913 | 35.50 |
| Age (years) | 55 - 64 | 589 | 11 | 55 | 34.831 | 34.90 |
| Age (years) | 35 - 44 | 589 | 11 | 55 | 33.902 | 33.60 |
| Age (years) | 25 - 34 | 589 | 11 | 55 | 28.883 | 28.80 |
| Age (years) | 65 or older | 589 | 11 | 55 | 27.877 | 28.00 |
| Age (years) | 18 - 24 | 589 | 11 | 55 | 17.751 | 17.30 |
| Education | Less than high school | 589 | 11 | 55 | 34.207 | 34.30 |
| Education | High school graduate | 589 | 11 | 55 | 32.424 | 32.40 |
| Education | Some college or technical school | 589 | 11 | 55 | 31.849 | 31.80 |
| Education | College graduate | 589 | 11 | 55 | 24.414 | 24.30 |
| Gender | Male | 589 | 11 | 55 | 30.218 | 30.30 |
| Gender | Female | 589 | 11 | 55 | 30.173 | 29.80 |
| Income | Less than $15,000 | 589 | 11 | 55 | 35.324 | 35.40 |
| Income | $15,000 - $24,999 | 589 | 11 | 55 | 33.840 | 33.90 |
| Income | $25,000 - $34,999 | 589 | 11 | 55 | 32.313 | 32.10 |
| Income | $35,000 - $49,999 | 589 | 11 | 55 | 31.980 | 32.00 |
| Income | $50,000 - $74,999 | 588 | 11 | 55 | 31.338 | 31.15 |
| Income | $75,000 or greater | 589 | 11 | 55 | 27.418 | 27.00 |
| Income | Data not reported | 589 | 11 | 55 | 25.834 | 25.70 |
| Race/Ethnicity | Hawaiian/Pacific Islander | 41 | 11 | 8 | 39.715 | 41.00 |
| Race/Ethnicity | Non-Hispanic Black | 473 | 11 | 47 | 38.303 | 38.70 |
| Race/Ethnicity | American Indian/Alaska Native | 388 | 11 | 46 | 36.869 | 37.50 |
| Race/Ethnicity | Hispanic | 573 | 11 | 55 | 32.056 | 32.10 |
| Race/Ethnicity | 2 or more races | 540 | 11 | 53 | 32.047 | 31.85 |
| Race/Ethnicity | Non-Hispanic White | 580 | 11 | 54 | 28.566 | 29.10 |
| Race/Ethnicity | Other | 196 | 11 | 42 | 27.281 | 26.75 |
| Race/Ethnicity | Asian | 386 | 11 | 40 | 10.967 | 10.20 |
| stratificationcategory1 | highest_group | highest_average | lowest_group | lowest_average | within_category_gap | groups_compared |
|---|---|---|---|---|---|---|
| Race/Ethnicity | Hawaiian/Pacific Islander | 39.715 | Asian | 10.967 | 28.748 | 8 |
| Age (years) | 45 - 54 | 35.913 | 18 - 24 | 17.751 | 18.162 | 6 |
| Education | Less than high school | 34.207 | College graduate | 24.414 | 9.794 | 4 |
| Income | Less than $15,000 | 35.324 | Data not reported | 25.834 | 9.489 | 7 |
| Gender | Male | 30.218 | Female | 30.173 | 0.045 | 2 |
readr::write_csv(obesity_stratification, "project_output_obesity_stratification_profile.csv")
readr::write_csv(stratification_gap, "project_output_obesity_stratification_gap.csv")obesity_stratification %>%
mutate(
stratification_label = reorder(stratification1, average_obesity)
) %>%
ggplot(aes(x = stratification_label, y = average_obesity)) +
geom_col(fill = bar_fill_neutral, width = 0.72) +
coord_flip() +
facet_wrap(~stratificationcategory1, scales = "free") +
scale_y_continuous(labels = scales::label_number(suffix = "%")) +
labs(
title = "Adult Obesity by Demographic Stratification Group",
subtitle = "Average values across available years and locations",
x = NULL,
y = "Average adult obesity"
) +
theme_report(base_size = 11)The chart shows that obesity differs across reported demographic
groups, not only across places. The largest within-category gap appears
in Race/Ethnicity, where Hawaiian/Pacific Islander has the highest
average obesity value and Asian has the lowest. The results suggest that
obesity is higher in some middle-aged, lower education, and lower income
profiles. Race or ethnicity shows a wide spread, while gender
differences are small. This is why stratification1 is later
included in the regression model and also used as profile-level
information in the classification screening model.
This does not mean any group causes obesity. It only shows that obesity values differ across reported profiles. Two limits apply. These averages are not population-weighted demographic prevalence; they are averages across available subgroup records. They are also averaged across years and locations, so a group whose records come mostly from higher-obesity locations can have its average pulled up by that coverage mix rather than by the demographic factor itself. In addition, some race or ethnicity groups have fewer available records and locations than others, so extreme subgroup averages should be read with coverage caution. Overall, these results should be read as profile-level patterns.
Obesity and physical inactivity come from separate rows, so their relationship should not be measured by comparing all obesity records with all inactivity records. A fair comparison matches the two indicators by the same year, location, stratification category, and stratification group. This gives one row per public health profile, with both values attached. After this alignment, the report keeps two versions: a broad aligned dataset for descriptive comparison and a subgroup-only aligned dataset for regression, classification, risk ranking, intervention matrix, and threshold sensitivity analysis.
alignment_keys <- c("yearstart", "locationdesc")
if ("stratificationcategory1" %in% names(obesity_data) &&
"stratificationcategory1" %in% names(inactivity_data)) {
alignment_keys <- c(alignment_keys, "stratificationcategory1")
}
if ("stratification1" %in% names(obesity_data) &&
"stratification1" %in% names(inactivity_data)) {
alignment_keys <- c(alignment_keys, "stratification1")
}
obesity_for_join <- obesity_data %>%
select(all_of(alignment_keys), obesity_value = data_value) %>%
group_by(across(all_of(alignment_keys))) %>%
summarise(obesity_value = mean(obesity_value, na.rm = TRUE), .groups = "drop")
inactivity_for_join <- inactivity_data %>%
select(all_of(alignment_keys), inactivity_value = data_value) %>%
group_by(across(all_of(alignment_keys))) %>%
summarise(inactivity_value = mean(inactivity_value, na.rm = TRUE), .groups = "drop")
aligned_health <- obesity_for_join %>%
inner_join(inactivity_for_join, by = alignment_keys)
modelling_aligned_health <- aligned_health %>%
filter(!find_total_records(.))
aligned_scope_summary <- tibble(
analysis_scope = c(
"All aligned profiles",
"Subgroup aligned profiles used for modelling"
),
profiles = c(nrow(aligned_health), nrow(modelling_aligned_health)),
total_or_overall_profiles_excluded = c(
0,
nrow(aligned_health) - nrow(modelling_aligned_health)
)
)
aligned_summary <- aligned_health %>%
summarise(
aligned_profiles = n(),
obesity_mean = mean(obesity_value, na.rm = TRUE),
inactivity_mean = mean(inactivity_value, na.rm = TRUE),
obesity_q75 = quantile(obesity_value, 0.75, na.rm = TRUE),
inactivity_q75 = quantile(inactivity_value, 0.75, na.rm = TRUE),
correlation = cor(obesity_value, inactivity_value, use = "complete.obs")
)
modelling_aligned_summary <- modelling_aligned_health %>%
summarise(
aligned_profiles = n(),
obesity_mean = mean(obesity_value, na.rm = TRUE),
inactivity_mean = mean(inactivity_value, na.rm = TRUE),
obesity_q75 = quantile(obesity_value, 0.75, na.rm = TRUE),
inactivity_q75 = quantile(inactivity_value, 0.75, na.rm = TRUE),
correlation = cor(obesity_value, inactivity_value, use = "complete.obs")
)
aligned_summary_compare <- bind_rows(
aligned_summary %>% mutate(analysis_scope = "All aligned profiles"),
modelling_aligned_summary %>% mutate(analysis_scope = "Subgroup aligned modelling profiles")
) %>%
select(analysis_scope, everything())
aligned_quadrant_preview <- modelling_aligned_health %>%
mutate(
obesity_level = if_else(
obesity_value >= modelling_aligned_summary$obesity_q75,
"High obesity",
"Lower obesity"
),
inactivity_level = if_else(
inactivity_value >= modelling_aligned_summary$inactivity_q75,
"High inactivity",
"Lower inactivity"
),
profile_group = case_when(
obesity_level == "High obesity" & inactivity_level == "High inactivity" ~
"High obesity + high inactivity",
obesity_level == "High obesity" & inactivity_level == "Lower inactivity" ~
"High obesity only",
obesity_level == "Lower obesity" & inactivity_level == "High inactivity" ~
"High inactivity only",
TRUE ~ "Lower obesity + lower inactivity"
)
) %>%
count(profile_group, name = "profiles") %>%
mutate(share = profiles / sum(profiles)) %>%
arrange(desc(profiles))
aligned_scope_summary %>%
kable(digits = 3, caption = "Aligned Dataset Scope for Descriptive and Modelling Analysis")| analysis_scope | profiles | total_or_overall_profiles_excluded |
|---|---|---|
| All aligned profiles | 14922 | 0 |
| Subgroup aligned profiles used for modelling | 14334 | 588 |
aligned_summary_compare %>%
kable(digits = 3, caption = "Aligned Obesity-Inactivity Relationship Summary by Analysis Scope")| analysis_scope | aligned_profiles | obesity_mean | inactivity_mean | obesity_q75 | inactivity_q75 | correlation |
|---|---|---|---|---|---|---|
| All aligned profiles | 14922 | 30.465 | 25.924 | 35.1 | 31.1 | 0.458 |
| Subgroup aligned modelling profiles | 14334 | 30.475 | 25.964 | 35.2 | 31.3 | 0.457 |
aligned_quadrant_preview %>%
kable(digits = 3, caption = "Exploratory Subgroup Profile Groups Based on 75th Percentile Cut-Offs")| profile_group | profiles | share |
|---|---|---|
| Lower obesity + lower inactivity | 8779 | 0.612 |
| High obesity only | 1970 | 0.137 |
| High inactivity only | 1960 | 0.137 |
| High obesity + high inactivity | 1625 | 0.113 |
readr::write_csv(aligned_health, "project_output_aligned_obesity_inactivity_all_profiles.csv")
readr::write_csv(modelling_aligned_health, "project_output_aligned_obesity_inactivity_subgroup_profiles.csv")
readr::write_csv(aligned_scope_summary, "project_output_aligned_scope_summary.csv")
readr::write_csv(aligned_summary_compare, "project_output_aligned_relationship_summary.csv")
readr::write_csv(aligned_quadrant_preview, "project_output_aligned_quadrant_preview.csv")set.seed(7004)
modelling_aligned_health %>%
slice_sample(n = min(5000, nrow(modelling_aligned_health))) %>%
ggplot(aes(x = inactivity_value, y = obesity_value)) +
geom_point(alpha = 0.28, colour = "#385A6B", size = 1.6) +
geom_smooth(method = "lm", se = TRUE, colour = "#B4443F", fill = "#E6B8B4") +
geom_vline(
xintercept = modelling_aligned_summary$inactivity_q75,
linetype = "dashed",
colour = "#5D6975"
) +
geom_hline(
yintercept = modelling_aligned_summary$obesity_q75,
linetype = "dashed",
colour = "#5D6975"
) +
scale_x_continuous(labels = scales::label_number(suffix = "%")) +
scale_y_continuous(labels = scales::label_number(suffix = "%")) +
labs(
title = "Aligned Relationship Between Physical Inactivity and Adult Obesity",
subtitle = "Subgroup aligned profiles; dashed lines mark 75th percentile cut-offs",
x = "No leisure-time physical activity",
y = "Adult obesity"
) +
theme_report()After alignment, the broad aligned dataset has 14,922 comparable
public health profiles. For modelling-related analysis,
Total or Overall rows are excluded, leaving
14,334 subgroup aligned profiles. The all-profile aligned correlation is
0.458, while the subgroup modelling-scope correlation is 0.457. Both are
positive, but they should be read as descriptive profile-level
associations rather than causal effects. The subgroup aligned version is
used for the within-profile check, lagged association check, regression
adjustment ladder, classification model, risk ranking, intervention
matrix, and threshold sensitivity analysis.
The alignment uses an inner join, so it keeps only profiles that have
a record for both obesity and inactivity in the same year,
location, and stratification group. The surveillance panel is mildly
unbalanced (some locations or subgroups are not screened for every
indicator in every year), so this forces the comparison onto a fully
overlapping set of profiles. That keeps the correlation basis exactly
consistent across the two indicators, at the cost of dropping profiles
seen for only one of them. Removing aggregate Total or
Overall rows for the modelling scope also avoids treating
statewide totals as if they were independent demographic subgroup
observations.
The headline all-profile correlation still mixes two different
things: stable differences between profiles (some locations and
groups are simply higher on both indicators) and year-to-year changes
within the same profile. To separate these, this check uses the
subgroup aligned modelling dataset, removes the average value of each
profile, and then re-correlates. It therefore focuses on whether obesity
and inactivity still move together within the same
location x stratification category x stratification group
profile over time. This is a stricter check than the raw aligned
correlation. Profiles with fewer than three years are dropped, because a
one- or two-year profile has almost no within variation and would pile
up near zero and distort the result.
# Within-profile check uses the subgroup aligned modelling dataset,
# so it is on the same basis as the regression and classification sections.
within_base <- modelling_aligned_health
within_corr_df <- within_base %>%
group_by(locationdesc, stratificationcategory1, stratification1) %>%
filter(n() >= 3) %>%
mutate(
obesity_within = obesity_value - mean(obesity_value, na.rm = TRUE),
inactivity_within = inactivity_value - mean(inactivity_value, na.rm = TRUE)
) %>%
ungroup()
raw_aligned_corr <- cor(
aligned_health$obesity_value,
aligned_health$inactivity_value,
use = "complete.obs"
)
subgroup_aligned_corr <- cor(
modelling_aligned_health$obesity_value,
modelling_aligned_health$inactivity_value,
use = "complete.obs"
)
within_profile_corr <- cor(
within_corr_df$obesity_within,
within_corr_df$inactivity_within,
use = "complete.obs"
)
within_corr_compare <- tibble(
correlation_type = c(
"Raw aligned correlation (all profiles)",
"Raw aligned correlation (subgroup modelling profiles)",
"Within-profile demeaned correlation (subgroup profiles, >= 3 years)"
),
profiles_used = c(
sum(complete.cases(aligned_health[, c("obesity_value", "inactivity_value")])),
sum(complete.cases(modelling_aligned_health[, c("obesity_value", "inactivity_value")])),
nrow(within_corr_df)
),
correlation = c(raw_aligned_corr, subgroup_aligned_corr, within_profile_corr)
)
within_corr_compare %>%
kable(digits = 3, caption = "Raw Versus Within-Profile Demeaned Correlation")| correlation_type | profiles_used | correlation |
|---|---|---|
| Raw aligned correlation (all profiles) | 14922 | 0.458 |
| Raw aligned correlation (subgroup modelling profiles) | 14334 | 0.457 |
| Within-profile demeaned correlation (subgroup profiles, >= 3 years) | 14259 | 0.073 |
The raw aligned correlation is 0.458 across all aligned profiles and 0.457 within the subgroup modelling scope. The within-profile demeaned correlation is 0.073. Because the within-profile figure uses demographic subgroup profiles only, it is on the same basis as the regression adjustment ladder and classification model later in the report. The within-profile correlation is much smaller, which suggests that much of the subgroup obesity-inactivity relationship comes from stable differences between locations and demographic profiles, not from year-to-year movement within the same profile. In plain terms, some profiles are high on both indicators because of their location and demographic structure; the relationship is weaker once we only look at changes within the same profile over time. One more caveat applies even to the within number: because both indicators drift upward over the decade within a profile, a positive within-profile correlation can partly reflect a shared time trend, so it is not removed here and the result is treated as descriptive only.
The aligned relationship above uses same-year subgroup profiles. The next question is whether earlier physical inactivity is related to later obesity in the same location and stratification group. This check uses the subgroup aligned modelling dataset and compares inactivity with obesity at lags of 0, 1, and 2 years. It is only a descriptive check, not a forecasting model and not a causal test.
lag_keys <- setdiff(alignment_keys, "yearstart")
lag_correlations <- map_dfr(0:2, function(k) {
inactivity_part <- modelling_aligned_health %>%
transmute(
across(all_of(lag_keys)),
match_year = yearstart + k,
inactivity_value
)
obesity_part <- modelling_aligned_health %>%
transmute(
across(all_of(lag_keys)),
match_year = yearstart,
obesity_value
)
joined <- inner_join(
inactivity_part,
obesity_part,
by = c(lag_keys, "match_year")
)
tibble(
lag_years = k,
matched_pairs = nrow(joined),
correlation = cor(
joined$inactivity_value,
joined$obesity_value,
use = "complete.obs"
)
)
})
lag_correlations %>%
kable(digits = 3, caption = "Lagged Correlation Between Inactivity and Later Obesity")| lag_years | matched_pairs | correlation |
|---|---|---|
| 0 | 14334 | 0.457 |
| 1 | 12795 | 0.461 |
| 2 | 11473 | 0.478 |
lag_correlations %>%
ggplot(aes(x = factor(lag_years), y = correlation, group = 1)) +
geom_col(fill = bar_fill_neutral, width = 0.62) +
geom_text(
aes(label = paste0("r = ", round(correlation, 3), "\n", "n = ", scales::comma(matched_pairs))),
vjust = -0.35,
colour = "#243447",
size = 3.5,
lineheight = 0.9
) +
scale_y_continuous(limits = c(0, max(lag_correlations$correlation) + 0.12)) +
labs(
title = "Lagged Association Between Physical Inactivity and Later Obesity",
subtitle = "Subgroup aligned profiles; lag 1 and lag 2 compare earlier inactivity with later obesity",
x = "Lag in years",
y = "Correlation (r)"
) +
theme_report()The lagged correlations are 0.457, 0.461, 0.478 for lag 0, lag 1, and lag 2. They stay positive and do not weaken much as the lag grows, although the number of matched pairs decreases at longer lags. This pattern may be linked to stable differences between locations and demographic subgroup profiles. Therefore, it should be read as a stable subgroup-level association, not as strong early-warning evidence. No out-of-sample forecasting is done here.
The regression section addresses the third research question:
Can year, location, and demographic stratification variables explain variation in adult obesity prevalence?
The goal is not to predict obesity for one person. The target is the adult obesity percentage for an aggregated public health profile. In simple terms, the model checks why obesity values differ across years, locations, and demographic groups in the surveillance data.
This section has two parts. First, it fits a structural additive model with year, location, and demographic group. This model summarises the main patterns already seen in the EDA. Second, a later subsection uses an adjustment ladder to test whether physical inactivity still explains obesity after year, location, and demographic group are considered. The second part is more important for the obesity-inactivity question, because it answers something the EDA cannot: how much of the association survives once geography and demographics are controlled.
The regression target is:
Adult obesity data_value
The main predictors are:
yearstart + locationdesc + stratification1
stratificationcategory1 is not in the main formula
because stratification1 already gives the specific
demographic group. In this dataset the category and group fields are
nested, so using both adds repeated dummy variables and makes the model
harder to read. Variables that are directly derived from the reported
estimate, such as the confidence limits and alternative value fields,
are not used, because they would leak the indicator value itself. Sample
size is different: it is not a leakage variable, but it is also
not used as a predictor in this simple model. It records how precise
each estimate is, and future work could use it for weighting (for
example inverse-variance weighting), so small-sample estimates have less
influence.
This regression uses demographic subgroup records
only. The Total rows are overall state-year rates,
that is, aggregates of the subgroup rows in the same state and year.
They sit at a different level and are not independent of the subgroups.
If both are used together, the same population may be represented more
than once, which can make the regression look more stable than it really
is. Total rows are kept for the EDA trend but removed here,
which also matches the research question’s focus on demographic
stratification.
This setup is simple and easy to interpret. It uses the same profile
dimensions as the EDA: time, place, and demographic group. But subgroup
rows are not fully independent: in the same state and year, the age,
gender, income, education, and race or ethnicity rows are different
views of the same survey sample. Because of this, the
coefficients are used as descriptive, model-adjusted differences, not as
formal statistical inference. The standard errors and p-values from
lm() are not used for the main interpretation, because they
would understate the uncertainty under this dependence.
Two models are fitted:
regression_predictors <- c(
"yearstart",
"locationdesc",
"stratification1"
)
regression_predictors <- regression_predictors[
regression_predictors %in% names(obesity_data)
]
regression_df <- obesity_data %>%
filter(
!str_to_lower(as.character(stratificationcategory1)) %in% c("total", "overall"),
!str_to_lower(as.character(stratification1)) %in% c("total", "overall")
) %>%
select(data_value, all_of(regression_predictors)) %>%
drop_na(data_value) %>%
mutate(across(where(is.character), as.factor))
set.seed(7004)
regression_train_index <- sample.int(
nrow(regression_df),
floor(0.8 * nrow(regression_df))
)
reg_train <- regression_df[regression_train_index, ]
reg_test <- regression_df[-regression_train_index, ]
drop_unseen_levels <- function(train, test) {
keep <- rep(TRUE, nrow(test))
for (column_name in names(test)) {
if (is.factor(test[[column_name]])) {
seen_levels <- unique(as.character(train[[column_name]]))
keep <- keep & as.character(test[[column_name]]) %in% seen_levels
}
}
list(
test = test[keep, ],
dropped = sum(!keep)
)
}
reg_guard <- drop_unseen_levels(reg_train, reg_test)
reg_test <- reg_guard$test
reg_train <- droplevels(reg_train)
regression_design_summary <- tibble(
target = "Adult obesity data_value",
predictors = paste(regression_predictors, collapse = " + "),
modelling_rows = nrow(regression_df),
train_rows = nrow(reg_train),
test_rows_after_guard = nrow(reg_test),
test_rows_dropped_unseen_levels = reg_guard$dropped
)
reg_formula <- as.formula(
paste("data_value ~", paste(regression_predictors, collapse = " + "))
)
reg_lm <- lm(reg_formula, data = reg_train)
reg_tree <- rpart(reg_formula, data = reg_train, method = "anova")
reg_pred_lm <- predict(reg_lm, newdata = reg_test)
reg_pred_tree <- predict(reg_tree, newdata = reg_test)
reg_baseline_pred <- rep(mean(reg_train$data_value, na.rm = TRUE), nrow(reg_test))
reg_baseline_rmse <- sqrt(mean((reg_test$data_value - reg_baseline_pred)^2))
regression_metric_row <- function(model_name, actual, predicted, n_train, baseline_rmse) {
rmse_value <- sqrt(mean((actual - predicted)^2))
tibble(
model = model_name,
target = "Adult obesity data_value",
n_train = n_train,
n_test = length(actual),
rmse = rmse_value,
mae = mean(abs(actual - predicted)),
baseline_rmse = baseline_rmse,
rmse_improvement_pct = (baseline_rmse - rmse_value) / baseline_rmse * 100,
test_r2 = 1 - sum((actual - predicted)^2) /
sum((actual - mean(actual))^2)
)
}
regression_metrics <- bind_rows(
regression_metric_row(
"Linear regression",
reg_test$data_value,
reg_pred_lm,
nrow(reg_train),
reg_baseline_rmse
),
regression_metric_row(
"Regression tree",
reg_test$data_value,
reg_pred_tree,
nrow(reg_train),
reg_baseline_rmse
)
)
regression_design_summary %>%
kable(caption = "Regression Modelling Design Summary")| target | predictors | modelling_rows | train_rows | test_rows_after_guard | test_rows_dropped_unseen_levels |
|---|---|---|---|---|---|
| Adult obesity data_value | yearstart + locationdesc + stratification1 | 14367 | 11493 | 2874 | 0 |
| model | target | n_train | n_test | rmse | mae | baseline_rmse | rmse_improvement_pct | test_r2 |
|---|---|---|---|---|---|---|---|---|
| Linear regression | Adult obesity data_value | 11493 | 2874 | 3.7328 | 2.5359 | 7.4214 | 49.7016 | 0.7469 |
| Regression tree | Adult obesity data_value | 11493 | 2874 | 4.4940 | 3.3592 | 7.4214 | 39.4460 | 0.6332 |
readr::write_csv(regression_design_summary, "project_output_regression_design_summary.csv")
readr::write_csv(regression_metrics, "project_output_regression_metrics.csv")After the train-test split, the unseen-factor guard removed 0 test records, so the test set has no unseen location or stratification levels. Under this row-level random split, the linear regression explains 74.7% of the test-set variation in adult obesity, with an RMSE of 3.73 percentage points. This should be read as within-panel explanation, not forecasting skill. The same locations and demographic groups appear in both the training and test sets, so the model is mainly recovering already-observed profile means. The mean-only baseline is also weak, so beating it by 49.7% mainly shows that location and group structure is useful, not that the model can predict new locations or groups.
The regression tree gives a simple non-linear comparison. In this run it does not improve on the linear model. So the linear regression is kept as the main model, because it is simpler, clearer, and fits the observed profiles at least as well under the same split.
regression_metrics %>%
select(model, rmse, mae, test_r2) %>%
pivot_longer(
cols = c(rmse, mae, test_r2),
names_to = "metric",
values_to = "value"
) %>%
mutate(
metric = recode(
metric,
rmse = "RMSE",
mae = "MAE",
test_r2 = "Test R-squared"
)
) %>%
ggplot(aes(x = model, y = value, fill = model)) +
geom_col(width = 0.62) +
facet_wrap(~metric, scales = "free_y") +
scale_fill_manual(values = c(
"Linear regression" = "#B4443F",
"Regression tree" = "#496A81"
)) +
labs(
title = "Regression Model Performance Comparison",
subtitle = "Lower RMSE/MAE is better; higher test R-squared is better",
x = NULL,
y = NULL
) +
theme_report(base_size = 11) +
theme(axis.text.x = element_text(angle = 15, hjust = 1))The comparison points the same way. Under the row-level split, the linear regression has lower error than the simple regression tree. This suggests that additive location and demographic differences capture much of the observed profile-level variation. Again, this is within-panel explanation, not future forecasting ability, and the linear model is preferred mainly because it is simpler and easier to explain in a public health report.
regression_predictions <- reg_test %>%
mutate(
predicted_obesity = reg_pred_lm,
residual = data_value - predicted_obesity,
absolute_residual = abs(residual)
)
regression_error_summary <- regression_predictions %>%
summarise(
mean_observed = mean(data_value, na.rm = TRUE),
mean_predicted = mean(predicted_obesity, na.rm = TRUE),
mean_residual = mean(residual, na.rm = TRUE),
median_absolute_residual = median(absolute_residual, na.rm = TRUE),
pct_within_3_points = mean(absolute_residual <= 3) * 100,
pct_within_5_points = mean(absolute_residual <= 5) * 100
)
regression_error_summary %>%
kable(digits = 3, caption = "Linear Regression Prediction Error Summary")| mean_observed | mean_predicted | mean_residual | median_absolute_residual | pct_within_3_points | pct_within_5_points |
|---|---|---|---|---|---|
| 30.58 | 30.554 | 0.026 | 1.743 | 72.199 | 87.752 |
readr::write_csv(regression_predictions, "project_output_regression_predictions.csv")
readr::write_csv(regression_error_summary, "project_output_regression_error_summary.csv")set.seed(7004)
regression_predictions %>%
slice_sample(n = min(5000, nrow(regression_predictions))) %>%
ggplot(aes(x = predicted_obesity, y = data_value)) +
geom_point(alpha = 0.3, colour = "#385A6B", size = 1.6) +
geom_abline(intercept = 0, slope = 1, colour = "#B4443F", linewidth = 1) +
scale_x_continuous(labels = scales::label_number(suffix = "%")) +
scale_y_continuous(labels = scales::label_number(suffix = "%")) +
labs(
title = "Observed Versus Predicted Adult Obesity",
subtitle = "Linear regression predictions on the held-out test set",
x = "Predicted adult obesity",
y = "Observed adult obesity"
) +
theme_report()Most points lie close to the 45-degree line, so the model captures the main structure of obesity differences across public health profiles. The median absolute residual is 1.74 percentage points, and 87.8% of test records are predicted within five percentage points of the observed value.
A single train-test split can be lucky or unlucky. To check that the result is not driven by one split, a simple 5-fold cross-validation is run for the linear regression. No hyperparameters are tuned; the goal is only to check whether performance is stable across different row-level splits.
rng_state <- .Random.seed
set.seed(2026)
cv_folds <- sample(rep(1:5, length.out = nrow(regression_df)))
cv_results <- map_dfr(1:5, function(fold_id) {
cv_train <- regression_df[cv_folds != fold_id, ]
cv_test <- regression_df[cv_folds == fold_id, ]
cv_guard <- drop_unseen_levels(cv_train, cv_test)
cv_test <- cv_guard$test
cv_train <- droplevels(cv_train)
cv_fit <- lm(reg_formula, data = cv_train)
cv_pred <- predict(cv_fit, newdata = cv_test)
tibble(
fold = fold_id,
n_test = nrow(cv_test),
rows_dropped = cv_guard$dropped,
rmse = sqrt(mean((cv_test$data_value - cv_pred)^2)),
test_r2 = 1 - sum((cv_test$data_value - cv_pred)^2) /
sum((cv_test$data_value - mean(cv_test$data_value))^2)
)
})
cv_summary <- cv_results %>%
summarise(
mean_rmse = mean(rmse),
sd_rmse = sd(rmse),
mean_test_r2 = mean(test_r2),
sd_test_r2 = sd(test_r2),
total_rows_dropped = sum(rows_dropped)
)
cv_results %>%
kable(digits = 4, caption = "Five-Fold Cross-Validation Results")| fold | n_test | rows_dropped | rmse | test_r2 |
|---|---|---|---|---|
| 1 | 2874 | 0 | 3.5271 | 0.7697 |
| 2 | 2874 | 0 | 3.6339 | 0.7591 |
| 3 | 2873 | 0 | 3.7346 | 0.7518 |
| 4 | 2873 | 0 | 3.5833 | 0.7552 |
| 5 | 2873 | 0 | 3.5892 | 0.7637 |
| mean_rmse | sd_rmse | mean_test_r2 | sd_test_r2 | total_rows_dropped |
|---|---|---|---|---|
| 3.6136 | 0.0776 | 0.7599 | 0.007 | 0 |
readr::write_csv(cv_results, "project_output_regression_cv.csv")
readr::write_csv(cv_summary, "project_output_regression_cv_summary.csv")
.Random.seed <- rng_stateAcross the five folds, the average test R-squared is 0.76, with a standard deviation of 0.007. The low variation shows the result is not caused by one special split. But this check is still limited: every fold shares most of the same locations and demographic groups, so it only checks interpolation stability within the observed panel. It does not test generalisation to new locations, future years, or unseen demographic groups, which would need leave-one-state-out or forward-in-time validation (noted as future work).
The diagnostics check whether the linear model is a reasonable working description of the data. The main questions are simple: are the residuals centred near zero, do the errors show strong patterns, and do a few hard-to-fit profiles dominate the results?
residual_diagnostics <- regression_predictions %>%
transmute(
observed = data_value,
fitted = predicted_obesity,
residual = residual,
absolute_residual = absolute_residual,
standardised_residual = as.numeric(scale(residual))
)
residual_summary <- residual_diagnostics %>%
summarise(
mean_residual = mean(residual, na.rm = TRUE),
sd_residual = sd(residual, na.rm = TRUE),
median_residual = median(residual, na.rm = TRUE),
p95_absolute_residual = quantile(abs(residual), 0.95, na.rm = TRUE),
max_absolute_residual = max(abs(residual), na.rm = TRUE)
)
large_residual_profiles <- regression_predictions %>%
arrange(desc(absolute_residual)) %>%
slice_head(n = 10)
residual_summary %>%
kable(digits = 3, caption = "Residual Summary for Linear Regression")| mean_residual | sd_residual | median_residual | p95_absolute_residual | max_absolute_residual |
|---|---|---|---|---|
| 0.026 | 3.733 | 0.078 | 7.72 | 23.691 |
large_residual_profiles %>%
kable(digits = 3, caption = "Largest Absolute Residuals in the Test Set")| data_value | yearstart | locationdesc | stratification1 | predicted_obesity | residual | absolute_residual |
|---|---|---|---|---|---|---|
| 16.7 | 2018 | Ohio | American Indian/Alaska Native | 40.391 | -23.691 | 23.691 |
| 54.3 | 2014 | South Dakota | 2 or more races | 32.298 | 22.002 | 22.002 |
| 56.2 | 2013 | Hawaii | Hawaiian/Pacific Islander | 34.471 | 21.729 | 21.729 |
| 12.6 | 2021 | Virgin Islands | Non-Hispanic White | 34.279 | -21.679 | 21.679 |
| 17.5 | 2018 | Georgia | American Indian/Alaska Native | 38.500 | -21.000 | 21.000 |
| 15.0 | 2011 | Maine | Non-Hispanic Black | 34.260 | -19.260 | 19.260 |
| 19.3 | 2011 | Arkansas | American Indian/Alaska Native | 37.986 | -18.686 | 18.686 |
| 53.6 | 2014 | Arkansas | 2 or more races | 35.166 | 18.434 | 18.434 |
| 8.6 | 2011 | Georgia | Other | 26.639 | -18.039 | 18.039 |
| 17.4 | 2015 | Kentucky | 2 or more races | 35.025 | -17.625 | 17.625 |
readr::write_csv(residual_diagnostics, "project_output_regression_residual_diagnostics.csv")
readr::write_csv(residual_summary, "project_output_regression_residual_summary.csv")
readr::write_csv(large_residual_profiles, "project_output_regression_largest_residuals.csv")residual_diagnostics %>%
ggplot(aes(x = fitted, y = residual)) +
geom_hline(yintercept = 0, colour = "#5D6975", linetype = "dashed") +
geom_point(alpha = 0.3, colour = "#385A6B", size = 1.4) +
geom_smooth(se = FALSE, colour = "#B4443F", linewidth = 0.9) +
scale_x_continuous(labels = scales::label_number(suffix = "%")) +
scale_y_continuous(labels = scales::label_number(suffix = " pts")) +
labs(
title = "Residuals Versus Fitted Values",
x = "Fitted adult obesity",
y = "Residual"
) +
theme_report(base_size = 11)residual_diagnostics %>%
ggplot(aes(x = residual)) +
geom_histogram(bins = 40, fill = "#496A81", colour = "white") +
geom_vline(xintercept = 0, colour = "#B4443F", linewidth = 1) +
scale_x_continuous(labels = scales::label_number(suffix = " pts")) +
labs(
title = "Residual Distribution",
x = "Residual",
y = "Profiles"
) +
theme_report(base_size = 11)qq_values <- qqnorm(residual_diagnostics$standardised_residual, plot.it = FALSE)
qq_data <- tibble(
theoretical = qq_values$x,
sample = qq_values$y
)
qq_data %>%
ggplot(aes(x = theoretical, y = sample)) +
geom_point(alpha = 0.35, colour = "#385A6B", size = 1.4) +
geom_abline(intercept = 0, slope = 1, colour = "#B4443F", linewidth = 1) +
labs(
title = "Normal Q-Q Check",
x = "Theoretical quantiles",
y = "Sample quantiles"
) +
theme_report(base_size = 11)residual_diagnostics %>%
mutate(sqrt_abs_standardised = sqrt(abs(standardised_residual))) %>%
ggplot(aes(x = fitted, y = sqrt_abs_standardised)) +
geom_point(alpha = 0.3, colour = "#385A6B", size = 1.4) +
geom_smooth(se = FALSE, colour = "#B4443F", linewidth = 0.9) +
scale_x_continuous(labels = scales::label_number(suffix = "%")) +
labs(
title = "Scale-Location Check",
x = "Fitted adult obesity",
y = "Sqrt(abs(standardised residual))"
) +
theme_report(base_size = 11)The residuals are centred close to zero, so the model does not strongly overpredict or underpredict obesity overall. The histogram and Q-Q plot show most residuals near zero, but the tails are heavier than a perfect normal distribution. So most profiles are fitted reasonably well, while a few profiles are still hard to explain using only year, location, and stratification group. The scale-location plot does not show an extreme funnel shape, but some heterogeneity is still expected: the rows are aggregated survey estimates, and their precision varies with sample size, which is not modelled here (no weighting is applied). So these plots are used as practical checks, not as a formal test that the constant-variance assumption fully holds.
The metrics show the model captures much of the observed
profile-level structure, but it is more useful once the coefficients are
read. Because locationdesc and stratification1
are categorical, the linear regression uses dummy variables. Each
location or stratification coefficient is a difference from a reference
level, holding year and the other profile variables constant.
lm_coef_tbl <- tibble(
term = names(coef(reg_lm)),
estimate = as.numeric(coef(reg_lm))
) %>%
filter(!is.na(estimate))
year_effect <- lm_coef_tbl %>%
filter(term == "yearstart") %>%
pull(estimate)
ref_location <- levels(reg_train$locationdesc)[1]
ref_stratification <- levels(reg_train$stratification1)[1]
location_effects <- lm_coef_tbl %>%
filter(str_detect(term, "^locationdesc")) %>%
transmute(
location = str_remove(term, "^locationdesc"),
effect_vs_reference = estimate
) %>%
arrange(desc(effect_vs_reference))
stratification_effects <- lm_coef_tbl %>%
filter(str_detect(term, "^stratification1")) %>%
transmute(
stratification_group = str_remove(term, "^stratification1"),
effect_vs_reference = estimate
) %>%
arrange(desc(effect_vs_reference))
location_effect_extremes <- bind_rows(
location_effects %>%
slice_head(n = 8) %>%
mutate(direction = "Higher than reference"),
location_effects %>%
slice_tail(n = 8) %>%
mutate(direction = "Lower than reference")
) %>%
mutate(direction = factor(direction, levels = c("Higher than reference", "Lower than reference")))
stratification_effect_extremes <- bind_rows(
stratification_effects %>%
slice_head(n = 8) %>%
mutate(direction = "Higher than reference"),
stratification_effects %>%
slice_tail(n = 8) %>%
mutate(direction = "Lower than reference")
) %>%
mutate(direction = factor(direction, levels = c("Higher than reference", "Lower than reference")))
tibble(
coefficient = "yearstart",
interpretation = "Estimated yearly change in adult obesity, holding location and stratification group constant",
estimate_percentage_points = year_effect
) %>%
kable(digits = 3, caption = "Year Effect from Linear Regression")| coefficient | interpretation | estimate_percentage_points |
|---|---|---|
| yearstart | Estimated yearly change in adult obesity, holding location and stratification group constant | 0.6 |
location_effect_extremes %>%
kable(digits = 3, caption = paste0(
"Largest Location Effects Relative to Reference Location: ",
ref_location
))| location | effect_vs_reference | direction |
|---|---|---|
| West Virginia | 2.047 | Higher than reference |
| Mississippi | 1.920 | Higher than reference |
| Louisiana | 0.340 | Higher than reference |
| Arkansas | -0.074 | Higher than reference |
| Oklahoma | -0.574 | Higher than reference |
| Kentucky | -0.816 | Higher than reference |
| Indiana | -1.464 | Higher than reference |
| Iowa | -1.487 | Higher than reference |
| Vermont | -7.976 | Lower than reference |
| Utah | -8.066 | Lower than reference |
| California | -8.284 | Lower than reference |
| New York | -8.443 | Lower than reference |
| Massachusetts | -9.096 | Lower than reference |
| Hawaii | -9.665 | Lower than reference |
| District of Columbia | -9.794 | Lower than reference |
| Colorado | -11.933 | Lower than reference |
stratification_effect_extremes %>%
kable(digits = 3, caption = paste0(
"Largest Stratification Group Effects Relative to Reference Group: ",
ref_stratification
))| stratification_group | effect_vs_reference | direction |
|---|---|---|
| Hawaiian/Pacific Islander | 7.801 | Higher than reference |
| Non-Hispanic Black | 4.236 | Higher than reference |
| American Indian/Alaska Native | 2.927 | Higher than reference |
| 45 - 54 | 2.201 | Higher than reference |
| Less than $15,000 | 1.530 | Higher than reference |
| 55 - 64 | 1.047 | Higher than reference |
| Less than high school | 0.373 | Higher than reference |
| 35 - 44 | 0.121 | Higher than reference |
| 25 - 34 | -4.880 | Lower than reference |
| Non-Hispanic White | -5.133 | Lower than reference |
| 65 or older | -6.010 | Lower than reference |
| $75,000 or greater | -6.376 | Lower than reference |
| Data not reported | -7.972 | Lower than reference |
| College graduate | -9.425 | Lower than reference |
| 18 - 24 | -16.029 | Lower than reference |
| Asian | -22.070 | Lower than reference |
readr::write_csv(lm_coef_tbl, "project_output_regression_coefficients.csv")
readr::write_csv(location_effects, "project_output_regression_location_effects.csv")
readr::write_csv(stratification_effects, "project_output_regression_stratification_effects.csv")location_effect_extremes %>%
mutate(location = reorder(location, effect_vs_reference)) %>%
ggplot(aes(x = location, y = effect_vs_reference, fill = direction)) +
geom_col(width = 0.72) +
coord_flip() +
facet_wrap(~direction, scales = "free_y") +
scale_fill_manual(values = c(
"Higher than reference" = "#B4443F",
"Lower than reference" = "#4E7D63"
)) +
scale_y_continuous(labels = scales::label_number(suffix = " pts")) +
labs(
title = "Largest Location Effects in the Linear Regression",
subtitle = paste0("Effects are percentage-point differences relative to ", ref_location),
x = NULL,
y = "Effect versus reference"
) +
theme_report(base_size = 11)stratification_effect_extremes %>%
mutate(stratification_group = reorder(stratification_group, effect_vs_reference)) %>%
ggplot(aes(x = stratification_group, y = effect_vs_reference, fill = direction)) +
geom_col(width = 0.72) +
coord_flip() +
facet_wrap(~direction, scales = "free_y") +
scale_fill_manual(values = c(
"Higher than reference" = "#B4443F",
"Lower than reference" = "#4E7D63"
)) +
scale_y_continuous(labels = scales::label_number(suffix = " pts")) +
labs(
title = "Largest Demographic Group Effects in the Linear Regression",
subtitle = paste0("Effects are percentage-point differences relative to ", ref_stratification),
x = NULL,
y = "Effect versus reference"
) +
theme_report(base_size = 11)All coefficients below are descriptive, model-adjusted differences within this additive profile model. They are not formal inferential or causal estimates, and the standard errors are not interpreted (see the design note on non-independent subgroup rows).
In this additive model, the yearstart coefficient is
0.6. This suggests an average upward time pattern of about 0.6
percentage points per year, or roughly 6 percentage points over ten
years, which matches the upward trend in the EDA. The model has no
location-specific time terms, so this is an average pattern across
profiles. It does not mean every location or every group rises at the
same rate each year.
The location effects also match the location ranking. Locations with the largest positive coefficients include West Virginia, Mississippi, Louisiana, and those with the most negative include Colorado, District of Columbia, Hawaii. These differences stay visible after adjusting for year and demographic stratification group.
The stratification effects show demographic group is also linked with
obesity variation. Groups with the largest positive coefficients include
Hawaiian/Pacific Islander, Non-Hispanic Black, American Indian/Alaska
Native, and those with the most negative include Asian, 18 - 24, College
graduate. This connects with the demographic EDA: obesity is not evenly
distributed across reported profiles. These subgroup coefficients
compare reported stratification labels with the reference label. Because
stratification1 mixes age, gender, income, education, and
race or ethnicity into one label variable, they should be read as
descriptive profile differences from the reference group, not as
separate effects of age, income, education, or race.
The model above mainly restates, in one equation, what the EDA already showed: obesity differs by year, place, and demographic group. By itself, this does not add much new information. A more useful question for this project is whether physical inactivity explains obesity beyond those structural differences. The EDA found a moderate subgroup aligned correlation (0.457) but a much smaller within-profile correlation (0.073), which already hints that much of the obesity-inactivity link is stable between-profile structure. The regression can make this clear by adding controls one block at a time and watching the inactivity coefficient.
This uses the same subgroup aligned modelling dataset created in the
EDA section, so each row has both an obesity and an inactivity value for
the same demographic subgroup profile and excludes Total or
Overall rows. Four models are fitted on the full sample.
This is not a train-test prediction task; the goal is to watch how the
inactivity coefficient changes. The ladder keeps the same outcome and
adds controls step by step: M1 inactivity only; M2 adds year; M3 adds
location; M4 adds demographic group.
ladder_df <- modelling_aligned_health %>%
select(obesity_value, inactivity_value, yearstart, locationdesc, stratification1) %>%
drop_na() %>%
mutate(across(where(is.character), as.factor))
ladder_models <- list(
"M1: inactivity only" = lm(obesity_value ~ inactivity_value, data = ladder_df),
"M2: + year" = lm(obesity_value ~ inactivity_value + yearstart, data = ladder_df),
"M3: + location" = lm(obesity_value ~ inactivity_value + yearstart + locationdesc, data = ladder_df),
"M4: + location + demographic group" = lm(obesity_value ~ inactivity_value + yearstart + locationdesc + stratification1, data = ladder_df)
)
ladder_summary <- imap_dfr(ladder_models, function(fit, label) {
tibble(
model = label,
inactivity_coefficient = unname(coef(fit)["inactivity_value"]),
adj_r2 = summary(fit)$adj.r.squared
)
})
ladder_summary <- ladder_summary %>%
mutate(
coef_retained_vs_m1 = inactivity_coefficient / inactivity_coefficient[model == "M1: inactivity only"]
)
ladder_summary %>%
kable(digits = 3, caption = "Inactivity Coefficient and Fit as Controls Are Added (Adjustment Ladder)")| model | inactivity_coefficient | adj_r2 | coef_retained_vs_m1 |
|---|---|---|---|
| M1: inactivity only | 0.401 | 0.209 | 1.000 |
| M2: + year | 0.397 | 0.270 | 0.990 |
| M3: + location | 0.357 | 0.384 | 0.892 |
| M4: + location + demographic group | 0.139 | 0.769 | 0.347 |
ladder_summary %>%
mutate(model = factor(model, levels = ladder_summary$model)) %>%
ggplot(aes(x = model, y = inactivity_coefficient, group = 1)) +
geom_col(fill = bar_fill_neutral, width = 0.62) +
geom_text(aes(label = round(inactivity_coefficient, 3)), vjust = -0.35, colour = "#243447", size = 3.6) +
geom_hline(yintercept = 0, colour = "#5D6975", linetype = "dashed") +
labs(
title = "Inactivity Coefficient Shrinks as Geography and Demographics Are Controlled",
subtitle = "Each step adds a control block; a coefficient that collapses signals between-profile confounding",
x = NULL,
y = "Coefficient on inactivity (pts obesity per pt inactivity)"
) +
theme_report(base_size = 11) +
theme(axis.text.x = element_text(angle = 15, hjust = 1))The inactivity coefficient falls from 0.401 in M1 to 0.139 in M4. After location and demographic group are added, it keeps only about 35% of its unadjusted size. Most of the drop happens when demographic group enters. This suggests that the raw association is strongly linked to demographic profile differences. For example, some income, education, and race or ethnicity profiles may be high on both indicators. This is the most useful result, and one the EDA cannot give: the coefficient pattern suggests that a large part of the apparent obesity-inactivity association is tied to between-profile structure, while a modest part survives adjustment.
These numbers also fit the earlier within-profile check on a sensible gradient. M4 still leaves some association because it only adjusts for average location and group differences. The within-profile demeaned correlation (0.073) is stricter, because it removes each full profile’s mean and keeps only year-to-year movement, and it is close to zero. The more completely profile identity is absorbed, the less remains – all pointing the same way: the obesity-inactivity link is mainly stable between-profile differences, not an independent within-profile relationship. The remaining M4 coefficient should not be over-read as a clean effect.
There are three cautions. First, this is an ecological
regression: obesity and inactivity are both survey-estimated outcomes
for aggregated groups, so the coefficient is an association between two
outcomes, not a behavioural or causal effect, and it has no direction.
Second, inactivity_value is also an estimate with
measurement error, so the coefficient is pulled toward zero (regression
dilution); the true link could be a bit stronger than M4 shows. Third,
no precision weighting is used because the aligned data does not carry
sample size; inverse-variance weighting is left as future work. So the
ladder answers the variable-value question descriptively: most of the
obesity-inactivity association is structural, and after geography and
demographics are accounted for, physical inactivity adds only modest
additional descriptive explanatory power for obesity in this aggregated
surveillance panel.
One more point on the two regressions. The structural model at the start of this section is not the main evidence about the obesity-inactivity relationship; it just summarises how obesity itself varies by year, location, and demographic group. The adjustment ladder is the main regression evidence on variable value, and its message is that inactivity is mostly a marker of between-profile structure, not an independent explainer.
The regression shows that adult obesity is structured by year,
location, and demographic group. Its performance should be read as
profile-level explanation within the observed surveillance panel.
Because locationdesc and stratification1 are
predictors, the model is mainly learning differences among
already-observed locations and groups. This fits the project goal of
profiling existing records, but it should not be described as strong
future forecasting ability for new locations or new demographic
categories.
The practical interpretation is:
Overall, the regression has two roles. The structural model confirms the EDA findings and puts them in one additive equation: obesity varies by time, place, and demographic group. The adjustment ladder then goes one step further, showing that most of the obesity-inactivity association is carried by that same structure, with only a modest residual after adjustment. Together they set up the classification stage, where inactivity is still useful as one profile feature for ranking and screening, not as a clean causal driver.
Future work on inference. This report treats the
coefficients as descriptive and does not interpret p-values. A more
formal analysis could use cluster-robust standard errors clustered by
locationdesc, or a linear mixed-effects model. These
methods would give more careful uncertainty estimates. However, they
would still need caution because demographic rows can be different
slices of the same survey sample.
The classification section addresses the fourth research question:
Can high-obesity demographic subgroup public health profiles be screened using physical inactivity levels together with year, location, and demographic characteristics?
This is a different task from the regression model. The regression model explains the continuous adult obesity percentage across public health profiles, while the classification model converts the same profile-level information into a screening task: identifying which profiles are more likely to fall into the high-obesity group.
To keep the modelling level consistent with the revised EDA and
regression sections, the classification analysis uses the same
subgroup aligned modelling dataset created after the
obesity-inactivity alignment. Total or Overall
rows are excluded because they are aggregate summaries rather than
independent subgroup observations. This avoids mixing statewide total
records with demographic subgroup records in the same screening
model.
The target is defined as:
high_obesity = 1 if obesity_value >= 75th percentile of TRAINING obesity values
high_obesity = 0 otherwise
The 75th percentile threshold is calculated from the training split only. This avoids test-set leakage and keeps the classification task honest. The threshold is not a medical standard; it is a relative rule for public health screening and prioritisation.
classification_health <- modelling_aligned_health
classification_predictors <- c(
"inactivity_value",
"yearstart",
"locationdesc",
"stratification1"
)
classification_predictors <- classification_predictors[
classification_predictors %in% names(classification_health)
]
classification_df <- classification_health %>%
select(obesity_value, all_of(classification_predictors)) %>%
drop_na() %>%
mutate(across(where(is.character), as.factor))
set.seed(7004)
classification_train_index <- sample.int(
nrow(classification_df),
floor(0.8 * nrow(classification_df))
)
cls_train <- classification_df[classification_train_index, ]
cls_test <- classification_df[-classification_train_index, ]
cls_guard <- drop_unseen_levels(cls_train, cls_test)
cls_test <- cls_guard$test
cls_train <- droplevels(cls_train)
obesity_threshold <- unname(quantile(cls_train$obesity_value, 0.75, na.rm = TRUE))
cls_train <- cls_train %>%
mutate(high_obesity = if_else(obesity_value >= obesity_threshold, 1, 0))
cls_test <- cls_test %>%
mutate(high_obesity = if_else(obesity_value >= obesity_threshold, 1, 0))
cls_formula <- as.formula(
paste("high_obesity ~", paste(classification_predictors, collapse = " + "))
)
cls_glm <- glm(cls_formula, data = cls_train, family = binomial())
cls_tree <- rpart(
as.formula(paste(
"factor(high_obesity) ~",
paste(classification_predictors, collapse = " + ")
)),
data = cls_train,
method = "class"
)
cls_prob_glm <- predict(cls_glm, newdata = cls_test, type = "response")
cls_prob_tree <- predict(cls_tree, newdata = cls_test, type = "prob")[, "1"]
classification_design_summary <- tibble(
target = "High-obesity subgroup public health profile",
analysis_scope = "Demographic subgroup profiles only; Total/Overall rows excluded",
threshold_source = "75th percentile of training-set obesity values",
obesity_threshold = obesity_threshold,
predictors = paste(classification_predictors, collapse = " + "),
original_aligned_profiles = nrow(aligned_health),
subgroup_profiles_used = nrow(classification_health),
excluded_total_or_overall_profiles = nrow(aligned_health) - nrow(classification_health),
modelling_rows = nrow(classification_df),
train_rows = nrow(cls_train),
test_rows_after_guard = nrow(cls_test),
test_rows_dropped_unseen_levels = cls_guard$dropped,
positive_rate_test = mean(cls_test$high_obesity == 1)
)
classification_design_summary %>%
kable(digits = 4, caption = "Classification Modelling Design Summary")| target | analysis_scope | threshold_source | obesity_threshold | predictors | original_aligned_profiles | subgroup_profiles_used | excluded_total_or_overall_profiles | modelling_rows | train_rows | test_rows_after_guard | test_rows_dropped_unseen_levels | positive_rate_test |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| High-obesity subgroup public health profile | Demographic subgroup profiles only; Total/Overall rows excluded | 75th percentile of training-set obesity values | 35.2 | inactivity_value + yearstart + locationdesc + stratification1 | 14922 | 14334 | 588 | 14334 | 11467 | 2867 | 0 | 0.2483 |
Logistic regression is used as the main classification model because it is interpretable and suitable for a binary screening task. A classification tree is included as a benchmark model. The purpose is not to diagnose individuals or predict future obesity status, but to rank observed subgroup profiles for public health attention.
For public health prioritisation, the model should not only classify profiles at one threshold. It should also rank profiles so that the highest-risk subgroup profiles can be reviewed first. Therefore, the evaluation reports both threshold-based metrics and ranking-based metrics.
calc_auc_pr <- function(actual, prob, positive = 1) {
y <- if_else(actual == positive, 1, 0)
positive_n <- sum(y == 1)
if (positive_n == 0) {
return(NA_real_)
}
ordered_index <- order(prob, decreasing = TRUE)
y_sorted <- y[ordered_index]
tp <- cumsum(y_sorted == 1)
fp <- cumsum(y_sorted == 0)
recall <- tp / positive_n
precision <- tp / pmax(tp + fp, 1)
recall <- c(0, recall)
precision <- c(1, precision)
sum(diff(recall) * (precision[-1] + precision[-length(precision)]) / 2)
}
calc_auc_roc <- function(actual, prob, positive = 1) {
y <- if_else(actual == positive, 1, 0)
positive_n <- sum(y == 1)
negative_n <- sum(y == 0)
if (positive_n == 0 || negative_n == 0) {
return(NA_real_)
}
rank_sum_positive <- sum(rank(prob)[y == 1])
(rank_sum_positive - positive_n * (positive_n + 1) / 2) /
(positive_n * negative_n)
}
top_fraction_capture <- function(actual, prob, positive = 1, fraction = 0.20) {
ranking_df <- tibble(actual = actual, prob = prob) %>%
arrange(desc(prob))
n_top <- max(1, floor(nrow(ranking_df) * fraction))
top_df <- ranking_df %>% slice_head(n = n_top)
overall_rate <- mean(ranking_df$actual == positive)
top_rate <- mean(top_df$actual == positive)
tibble(
top_fraction = fraction,
overall_rate = overall_rate,
top_rate = top_rate,
lift = top_rate / overall_rate,
capture = sum(top_df$actual == positive) / sum(ranking_df$actual == positive)
)
}
classification_metric_row <- function(model_name, actual, prob, threshold = 0.5) {
predicted <- if_else(prob >= threshold, 1, 0)
tp <- sum(predicted == 1 & actual == 1)
fp <- sum(predicted == 1 & actual == 0)
tn <- sum(predicted == 0 & actual == 0)
fn <- sum(predicted == 0 & actual == 1)
precision <- tp / max(tp + fp, 1)
recall <- tp / max(tp + fn, 1)
specificity <- tn / max(tn + fp, 1)
accuracy <- (tp + tn) / max(tp + tn + fp + fn, 1)
f1 <- if ((precision + recall) > 0) {
2 * precision * recall / (precision + recall)
} else {
0
}
top20 <- top_fraction_capture(actual, prob, fraction = 0.20)
auc_pr <- calc_auc_pr(actual, prob)
auc_roc <- calc_auc_roc(actual, prob)
tibble(
model = model_name,
target = "High-obesity subgroup profile",
obesity_threshold_q75 = obesity_threshold,
n_train = nrow(cls_train),
n_test = length(actual),
positive_rate = mean(actual == 1),
roc_auc = auc_roc,
auc_pr = auc_pr,
auc_pr_lift = auc_pr / mean(actual == 1),
top20_lift = top20$lift,
top20_capture = top20$capture,
accuracy_threshold_0_5 = accuracy,
precision_threshold_0_5 = precision,
recall_threshold_0_5 = recall,
specificity_threshold_0_5 = specificity,
f1_threshold_0_5 = f1
)
}
classification_metrics <- bind_rows(
classification_metric_row("Logistic regression", cls_test$high_obesity, cls_prob_glm),
classification_metric_row("Classification tree", cls_test$high_obesity, cls_prob_tree)
)
classification_metrics %>%
kable(digits = 4, caption = "Classification Model Performance")| model | target | obesity_threshold_q75 | n_train | n_test | positive_rate | roc_auc | auc_pr | auc_pr_lift | top20_lift | top20_capture | accuracy_threshold_0_5 | precision_threshold_0_5 | recall_threshold_0_5 | specificity_threshold_0_5 | f1_threshold_0_5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Logistic regression | High-obesity subgroup profile | 35.2 | 11467 | 2867 | 0.2483 | 0.9281 | 0.8152 | 3.2824 | 3.2818 | 0.6559 | 0.8804 | 0.7896 | 0.7065 | 0.9378 | 0.7457 |
| Classification tree | High-obesity subgroup profile | 35.2 | 11467 | 2867 | 0.2483 | 0.8478 | 0.6792 | 2.7349 | 3.0358 | 0.6067 | 0.8556 | 0.7435 | 0.6390 | 0.9271 | 0.6873 |
classification_metrics %>%
select(model, roc_auc, auc_pr, top20_capture, f1_threshold_0_5) %>%
pivot_longer(
cols = c(roc_auc, auc_pr, top20_capture, f1_threshold_0_5),
names_to = "metric",
values_to = "value"
) %>%
mutate(
metric = recode(
metric,
roc_auc = "ROC-AUC",
auc_pr = "AUC-PR",
top20_capture = "Top-20% capture",
f1_threshold_0_5 = "F1 at 0.5"
)
) %>%
ggplot(aes(x = model, y = value, fill = model)) +
geom_col(width = 0.62) +
facet_wrap(~metric, scales = "free_y") +
scale_fill_manual(values = c(
"Logistic regression" = "#B4443F",
"Classification tree" = "#496A81"
)) +
labs(
title = "Classification Model Performance Comparison",
subtitle = "Ranking metrics are especially important for public health prioritisation",
x = NULL,
y = NULL
) +
theme_report(base_size = 11) +
theme(axis.text.x = element_text(angle = 15, hjust = 1))The logistic regression is carried forward as the main classification model because it provides strong ranking performance while remaining interpretable. Its AUC-PR is 0.815, which is 3.28 times the test-set positive rate. The top 20% highest-ranked subgroup profiles capture 65.6% of actual high-obesity subgroup profiles, which is useful when public health attention must be prioritised.
The confusion matrix uses 0.5 as a simple reference threshold for interpretation. This is not treated as the final public health decision rule, because prioritisation is better supported by ranking metrics such as AUC-PR, top-20% capture, and risk decile validation.
cls_pred_glm <- if_else(cls_prob_glm >= 0.5, 1, 0)
confusion_table <- table(
Predicted = factor(
cls_pred_glm,
levels = c(0, 1),
labels = c("Lower risk", "High risk")
),
Actual = factor(
cls_test$high_obesity,
levels = c(0, 1),
labels = c("Not high obesity", "High obesity")
)
)
tn_cm <- confusion_table["Lower risk", "Not high obesity"]
fp_cm <- confusion_table["High risk", "Not high obesity"]
fn_cm <- confusion_table["Lower risk", "High obesity"]
tp_cm <- confusion_table["High risk", "High obesity"]
confusion_summary <- tibble(
metric = c(
"Sensitivity / Recall",
"Specificity",
"Positive Predictive Value / Precision",
"Negative Predictive Value",
"Accuracy"
),
value = c(
tp_cm / max(tp_cm + fn_cm, 1),
tn_cm / max(tn_cm + fp_cm, 1),
tp_cm / max(tp_cm + fp_cm, 1),
tn_cm / max(tn_cm + fn_cm, 1),
(tp_cm + tn_cm) / max(tp_cm + tn_cm + fp_cm + fn_cm, 1)
)
)
confusion_table %>%
kable(caption = "Confusion Matrix for Logistic Regression at 0.5 Threshold")| Not high obesity | High obesity | |
|---|---|---|
| Lower risk | 2021 | 209 |
| High risk | 134 | 503 |
| metric | value |
|---|---|
| Sensitivity / Recall | 0.7065 |
| Specificity | 0.9378 |
| Positive Predictive Value / Precision | 0.7896 |
| Negative Predictive Value | 0.9063 |
| Accuracy | 0.8804 |
readr::write_csv(
as.data.frame.matrix(confusion_table) %>% rownames_to_column("Predicted"),
"project_output_confusion_matrix.csv"
)
readr::write_csv(confusion_summary, "project_output_confusion_summary.csv")classification_predictions <- cls_test %>%
mutate(
high_obesity_score = cls_prob_glm,
predicted_high_obesity = cls_pred_glm
)
classification_predictions %>%
ggplot(aes(x = high_obesity_score, fill = factor(high_obesity))) +
geom_histogram(bins = 40, alpha = 0.65, position = "identity") +
scale_fill_manual(values = c("0" = "#496A81", "1" = "#B4443F")) +
labs(
title = "Predicted High-Obesity Score Distribution",
subtitle = "Higher scores indicate higher predicted probability of belonging to the high-obesity group",
x = "Predicted high-obesity score",
y = "Profiles",
fill = "Actual high obesity"
) +
theme_report()At the 0.5 reference threshold, the logistic regression achieves an accuracy of 0.88 and a recall of 0.706. For public health screening, recall and ranking performance are especially important because the goal is to avoid missing many high-obesity subgroup profiles when prioritising follow-up attention.
The classification model is most useful if its risk scores rank subgroup profiles meaningfully. To validate the ranking, test profiles are divided into ten deciles based on predicted high-obesity score. Decile 10 contains the highest predicted-risk subgroup profiles.
risk_decile_summary <- classification_predictions %>%
mutate(
risk_decile = ntile(high_obesity_score, 10)
) %>%
group_by(risk_decile) %>%
summarise(
profiles = n(),
average_score = mean(high_obesity_score, na.rm = TRUE),
observed_high_obesity_rate = mean(high_obesity == 1),
average_obesity_value = mean(obesity_value, na.rm = TRUE),
average_inactivity_value = mean(inactivity_value, na.rm = TRUE),
.groups = "drop"
) %>%
arrange(risk_decile)
top_decile <- risk_decile_summary %>%
filter(risk_decile == max(risk_decile))
bottom_decile <- risk_decile_summary %>%
filter(risk_decile == min(risk_decile))
top_risk_profiles <- classification_predictions %>%
arrange(desc(high_obesity_score)) %>%
slice_head(n = 10) %>%
transmute(
Year = as.integer(yearstart),
Location = locationdesc,
Group = stratification1,
Inactivity = round(inactivity_value, 1),
Obesity = round(obesity_value, 1),
Score = round(high_obesity_score, 3),
ActualHighObesity = high_obesity
)
risk_decile_summary %>%
kable(digits = 4, caption = "Risk Decile Validation for Logistic Regression")| risk_decile | profiles | average_score | observed_high_obesity_rate | average_obesity_value | average_inactivity_value |
|---|---|---|---|---|---|
| 1 | 287 | 0.0000 | 0.0070 | 17.7571 | 19.2976 |
| 2 | 287 | 0.0018 | 0.0035 | 24.3613 | 21.9387 |
| 3 | 287 | 0.0086 | 0.0035 | 26.9509 | 22.6714 |
| 4 | 287 | 0.0255 | 0.0418 | 29.1000 | 23.6540 |
| 5 | 287 | 0.0624 | 0.0592 | 30.4547 | 24.9732 |
| 6 | 287 | 0.1322 | 0.0976 | 31.6300 | 26.9049 |
| 7 | 287 | 0.2498 | 0.2195 | 33.0031 | 27.4808 |
| 8 | 286 | 0.4361 | 0.4266 | 34.7374 | 28.7986 |
| 9 | 286 | 0.6780 | 0.7308 | 36.9612 | 30.4559 |
| 10 | 286 | 0.9104 | 0.8986 | 40.4563 | 33.3958 |
| Year | Location | Group | Inactivity | Obesity | Score | ActualHighObesity |
|---|---|---|---|---|---|---|
| 2018 | Mississippi | Non-Hispanic Black | 34.6 | 45.7 | 0.998 | 1 |
| 2017 | Mississippi | Non-Hispanic Black | 36.2 | 45.9 | 0.997 | 1 |
| 2019 | Mississippi | Less than $15,000 | 54.3 | 42.8 | 0.997 | 1 |
| 2021 | Kentucky | Non-Hispanic Black | 37.4 | 46.8 | 0.996 | 1 |
| 2021 | Mississippi | 45 - 54 | 30.0 | 51.6 | 0.996 | 1 |
| 2019 | Mississippi | 45 - 54 | 40.0 | 50.1 | 0.996 | 1 |
| 2020 | Kentucky | Non-Hispanic Black | 39.3 | 45.4 | 0.996 | 1 |
| 2020 | Mississippi | 55 - 64 | 37.5 | 43.9 | 0.995 | 1 |
| 2019 | West Virginia | 45 - 54 | 32.5 | 50.2 | 0.994 | 1 |
| 2017 | Louisiana | Non-Hispanic Black | 36.9 | 42.2 | 0.993 | 1 |
readr::write_csv(risk_decile_summary, "project_output_high_obesity_risk_decile.csv")
readr::write_csv(top_risk_profiles, "project_output_top_risk_profiles.csv")risk_decile_summary %>%
ggplot(aes(x = factor(risk_decile), y = observed_high_obesity_rate)) +
geom_col(fill = "#B4443F", width = 0.68) +
geom_line(aes(group = 1), colour = "#243447", linewidth = 0.8) +
geom_point(colour = "#243447", size = 2) +
scale_y_continuous(labels = scales::label_percent(accuracy = 1)) +
labs(
title = "Observed High-Obesity Rate by Predicted Risk Decile",
subtitle = "A useful screening model should show higher observed risk in higher score deciles",
x = "Predicted risk decile",
y = "Observed high-obesity rate"
) +
theme_report()The risk decile validation shows a clear ranking pattern. The lowest decile has an observed high-obesity rate of 0.7%, while the highest decile has an observed high-obesity rate of 89.9%. In decision terms, if a public health team works down the ranked list, it reaches genuinely high-risk subgroup profiles much earlier than random selection. This means the model is not only classifying profiles at a fixed threshold; it is also ordering subgroup profiles in a way that is meaningful for prioritisation within the observed surveillance panel.
The classification model provides a screening score, but public health planning also needs interpretable action groups. The intervention matrix translates the two aligned indicators into four public health segments using the 75th percentile cut-offs for obesity and physical inactivity. It uses the same subgroup-level base as the classification model, so the intervention groups describe demographic subgroup profiles rather than a mixture of subgroup records and aggregate summaries.
The four segments are:
| Segment | Profile Pattern | Suggested Public Health Focus |
|---|---|---|
| Priority Intervention | High obesity and high inactivity | Combined obesity prevention and physical activity promotion |
| Weight Management Focus | High obesity and lower inactivity | Nutrition, weight management, and obesity prevention support |
| Preventive Physical Activity Focus | Lower obesity and high inactivity | Early physical activity promotion and prevention |
| Standard Monitoring | Lower obesity and lower inactivity | Routine surveillance and monitoring |
priority_base <- classification_health
obesity_q75 <- unname(quantile(priority_base$obesity_value, 0.75, na.rm = TRUE))
inactivity_q75 <- unname(quantile(priority_base$inactivity_value, 0.75, na.rm = TRUE))
action_matrix_data <- priority_base %>%
mutate(
obesity_level = if_else(obesity_value >= obesity_q75, "High obesity", "Lower obesity"),
inactivity_level = if_else(inactivity_value >= inactivity_q75, "High inactivity", "Lower inactivity"),
public_health_segment = case_when(
obesity_level == "High obesity" & inactivity_level == "High inactivity" ~
"Priority Intervention",
obesity_level == "High obesity" & inactivity_level == "Lower inactivity" ~
"Weight Management Focus",
obesity_level == "Lower obesity" & inactivity_level == "High inactivity" ~
"Preventive Physical Activity Focus",
TRUE ~ "Standard Monitoring"
),
public_health_segment = factor(
public_health_segment,
levels = c(
"Priority Intervention",
"Weight Management Focus",
"Preventive Physical Activity Focus",
"Standard Monitoring"
)
)
)
action_matrix_summary <- action_matrix_data %>%
group_by(public_health_segment) %>%
summarise(
profiles = n(),
share = profiles / nrow(action_matrix_data),
average_obesity = mean(obesity_value, na.rm = TRUE),
average_inactivity = mean(inactivity_value, na.rm = TRUE),
.groups = "drop"
) %>%
arrange(public_health_segment)
action_matrix_summary %>%
kable(digits = 4, caption = "Public Health Intervention Matrix Summary")| public_health_segment | profiles | share | average_obesity | average_inactivity |
|---|---|---|---|---|
| Priority Intervention | 1625 | 0.1134 | 39.3788 | 37.8977 |
| Weight Management Focus | 1970 | 0.1374 | 38.9139 | 25.1503 |
| Preventive Physical Activity Focus | 1960 | 0.1367 | 30.4933 | 36.5277 |
| Standard Monitoring | 8779 | 0.6125 | 26.9292 | 21.5793 |
readr::write_csv(action_matrix_data, "project_output_public_health_action_matrix_data.csv")
readr::write_csv(action_matrix_summary, "project_output_public_health_action_matrix_summary.csv")action_matrix_data %>%
ggplot(aes(x = inactivity_value, y = obesity_value, colour = public_health_segment)) +
geom_point(alpha = 0.35, size = 1.5) +
geom_vline(xintercept = inactivity_q75, linetype = "dashed", colour = "#5D6975") +
geom_hline(yintercept = obesity_q75, linetype = "dashed", colour = "#5D6975") +
scale_colour_manual(values = c(
"Priority Intervention" = "#B4443F",
"Weight Management Focus" = "#D18F3F",
"Preventive Physical Activity Focus" = "#2F6F8F",
"Standard Monitoring" = "#7A8793"
)) +
scale_x_continuous(labels = scales::label_number(suffix = "%")) +
scale_y_continuous(labels = scales::label_number(suffix = "%")) +
labs(
title = "Public Health Intervention Matrix",
subtitle = "Segments are based on 75th percentile cut-offs for subgroup aligned obesity and inactivity values",
x = "No leisure-time physical activity",
y = "Adult obesity",
colour = "Segment"
) +
theme_report()The Priority Intervention segment contains subgroup
profiles where both obesity and inactivity are relatively high. This
segment represents 11.3% of subgroup aligned profiles, with an average
obesity value of 39.38% and an average inactivity value of 37.9%. These
profiles are the most natural starting point for combined obesity
prevention and physical activity promotion, but the segment should be
read as decision support rather than an automatic policy list.
The 75th percentile cut-off is a design choice, not a clinical threshold. To check whether the intervention priority result depends too heavily on one cut-off, the matrix is recalculated using 70th, 75th, and 80th percentile thresholds on the same subgroup-level aligned data.
priority_flag_at <- function(df, q) {
obesity_cutoff <- unname(quantile(df$obesity_value, q, na.rm = TRUE))
inactivity_cutoff <- unname(quantile(df$inactivity_value, q, na.rm = TRUE))
df$obesity_value >= obesity_cutoff &
df$inactivity_value >= inactivity_cutoff
}
priority_base <- classification_health
sensitivity_quantiles <- c(0.70, 0.75, 0.80)
sensitivity_summary <- map_dfr(sensitivity_quantiles, function(q) {
priority_flag <- priority_flag_at(priority_base, q)
tibble(
threshold = paste0("Q", q * 100),
obesity_cutoff = unname(quantile(priority_base$obesity_value, q, na.rm = TRUE)),
inactivity_cutoff = unname(quantile(priority_base$inactivity_value, q, na.rm = TRUE)),
priority_profiles = sum(priority_flag),
priority_share = mean(priority_flag),
average_obesity_priority = mean(priority_base$obesity_value[priority_flag], na.rm = TRUE),
average_inactivity_priority = mean(priority_base$inactivity_value[priority_flag], na.rm = TRUE)
)
})
robust_priority_data <- priority_base %>%
mutate(
priority_q70 = priority_flag_at(priority_base, 0.70),
priority_q75 = priority_flag_at(priority_base, 0.75),
priority_q80 = priority_flag_at(priority_base, 0.80),
robust_priority = priority_q70 & priority_q75 & priority_q80
)
robust_priority_profiles <- robust_priority_data %>%
filter(robust_priority) %>%
arrange(desc(obesity_value + inactivity_value))
robust_core_top10 <- robust_priority_profiles %>%
slice_head(n = 10) %>%
transmute(
Year = as.integer(yearstart),
Location = locationdesc,
Group = stratification1,
Obesity = round(obesity_value, 1),
Inactivity = round(inactivity_value, 1),
CombinedBurden = round(obesity_value + inactivity_value, 1)
)
robust_priority_summary <- tibble(
n_robust_priority_profiles = nrow(robust_priority_profiles),
robust_priority_share = nrow(robust_priority_profiles) / nrow(priority_base),
average_obesity = mean(robust_priority_profiles$obesity_value, na.rm = TRUE),
average_inactivity = mean(robust_priority_profiles$inactivity_value, na.rm = TRUE)
)
priority_q70_n <- sensitivity_summary$priority_profiles[sensitivity_summary$threshold == "Q70"]
priority_q75_n <- sensitivity_summary$priority_profiles[sensitivity_summary$threshold == "Q75"]
priority_q80_n <- sensitivity_summary$priority_profiles[sensitivity_summary$threshold == "Q80"]
robust_priority_n <- nrow(robust_priority_profiles)
robust_priority_share <- robust_priority_n / nrow(priority_base)
sensitivity_summary %>%
kable(digits = 4, caption = "Threshold Sensitivity for Priority Intervention Profiles")| threshold | obesity_cutoff | inactivity_cutoff | priority_profiles | priority_share | average_obesity_priority | average_inactivity_priority |
|---|---|---|---|---|---|---|
| Q70 | 34.2 | 29.9 | 2178 | 0.1519 | 38.6547 | 36.8449 |
| Q75 | 35.2 | 31.3 | 1625 | 0.1134 | 39.3788 | 37.8977 |
| Q80 | 36.2 | 32.8 | 1137 | 0.0793 | 40.1659 | 38.9839 |
| n_robust_priority_profiles | robust_priority_share | average_obesity | average_inactivity |
|---|---|---|---|
| 1137 | 0.0793 | 40.1659 | 38.9839 |
robust_priority_profiles %>%
slice_head(n = 20) %>%
kable(digits = 3, caption = "Top 20 Robust Priority Profiles")| yearstart | locationdesc | stratificationcategory1 | stratification1 | obesity_value | inactivity_value | priority_q70 | priority_q75 | priority_q80 | robust_priority |
|---|---|---|---|---|---|---|---|---|---|
| 2019 | North Dakota | Race/Ethnicity | American Indian/Alaska Native | 57.8 | 49.5 | TRUE | TRUE | TRUE | TRUE |
| 2017 | Puerto Rico | Education | Less than high school | 39.0 | 66.8 | TRUE | TRUE | TRUE | TRUE |
| 2015 | Ohio | Race/Ethnicity | American Indian/Alaska Native | 51.6 | 51.4 | TRUE | TRUE | TRUE | TRUE |
| 2017 | Puerto Rico | Age (years) | 45 - 54 | 41.8 | 59.6 | TRUE | TRUE | TRUE | TRUE |
| 2021 | Puerto Rico | Education | Less than high school | 39.0 | 61.4 | TRUE | TRUE | TRUE | TRUE |
| 2013 | Tennessee | Education | Less than high school | 41.3 | 58.9 | TRUE | TRUE | TRUE | TRUE |
| 2017 | Puerto Rico | Age (years) | 55 - 64 | 40.8 | 58.8 | TRUE | TRUE | TRUE | TRUE |
| 2019 | Mississippi | Income | Less than $15,000 | 42.8 | 54.3 | TRUE | TRUE | TRUE | TRUE |
| 2016 | New Jersey | Education | Less than high school | 41.5 | 54.5 | TRUE | TRUE | TRUE | TRUE |
| 2017 | Puerto Rico | Income | Less than $15,000 | 37.5 | 57.9 | TRUE | TRUE | TRUE | TRUE |
| 2019 | Mississippi | Education | Less than high school | 43.1 | 52.3 | TRUE | TRUE | TRUE | TRUE |
| 2017 | Arkansas | Race/Ethnicity | American Indian/Alaska Native | 50.3 | 44.8 | TRUE | TRUE | TRUE | TRUE |
| 2019 | Kentucky | Education | Less than high school | 41.2 | 53.9 | TRUE | TRUE | TRUE | TRUE |
| 2018 | Kentucky | Income | Less than $15,000 | 39.8 | 55.2 | TRUE | TRUE | TRUE | TRUE |
| 2019 | Puerto Rico | Age (years) | 55 - 64 | 38.8 | 56.2 | TRUE | TRUE | TRUE | TRUE |
| 2018 | Arkansas | Income | Less than $15,000 | 50.5 | 44.4 | TRUE | TRUE | TRUE | TRUE |
| 2012 | Arkansas | Race/Ethnicity | American Indian/Alaska Native | 46.0 | 48.8 | TRUE | TRUE | TRUE | TRUE |
| 2021 | Puerto Rico | Income | Less than $15,000 | 39.4 | 55.4 | TRUE | TRUE | TRUE | TRUE |
| 2018 | Mississippi | Income | Less than $15,000 | 46.1 | 48.4 | TRUE | TRUE | TRUE | TRUE |
| 2020 | Puerto Rico | Age (years) | 55 - 64 | 41.8 | 52.6 | TRUE | TRUE | TRUE | TRUE |
| Year | Location | Group | Obesity | Inactivity | CombinedBurden |
|---|---|---|---|---|---|
| 2019 | North Dakota | American Indian/Alaska Native | 57.8 | 49.5 | 107.3 |
| 2017 | Puerto Rico | Less than high school | 39.0 | 66.8 | 105.8 |
| 2015 | Ohio | American Indian/Alaska Native | 51.6 | 51.4 | 103.0 |
| 2017 | Puerto Rico | 45 - 54 | 41.8 | 59.6 | 101.4 |
| 2021 | Puerto Rico | Less than high school | 39.0 | 61.4 | 100.4 |
| 2013 | Tennessee | Less than high school | 41.3 | 58.9 | 100.2 |
| 2017 | Puerto Rico | 55 - 64 | 40.8 | 58.8 | 99.6 |
| 2019 | Mississippi | Less than $15,000 | 42.8 | 54.3 | 97.1 |
| 2016 | New Jersey | Less than high school | 41.5 | 54.5 | 96.0 |
| 2017 | Puerto Rico | Less than $15,000 | 37.5 | 57.9 | 95.4 |
readr::write_csv(sensitivity_summary, "project_output_threshold_sensitivity.csv")
readr::write_csv(robust_priority_data, "project_output_robust_priority_data.csv")
readr::write_csv(robust_priority_profiles, "project_output_robust_priority_profiles.csv")
readr::write_csv(robust_core_top10, "project_output_robust_core_top10_profiles.csv")
readr::write_csv(robust_priority_summary, "project_output_robust_priority_summary.csv")sensitivity_summary %>%
ggplot(aes(x = threshold, y = priority_profiles, fill = threshold)) +
geom_col(width = 0.62) +
geom_text(
aes(label = priority_profiles),
vjust = -0.35,
colour = "#243447",
size = 3.8
) +
scale_fill_manual(values = c(
"Q70" = "#D18F3F",
"Q75" = "#B4443F",
"Q80" = "#7D2E2A"
)) +
labs(
title = "Priority Intervention Profiles Under Alternative Thresholds",
subtitle = "Higher thresholds define a smaller but stricter intervention core",
x = "Quantile threshold",
y = "Priority profiles"
) +
theme_report()The number of priority profiles changes from 2178 at Q70 to 1625 at Q75 and 1137 at Q80. Because both obesity and inactivity cut-offs rise together under the same quantile rule, the three priority sets are nested: every Q80 priority profile is also a priority profile at Q75 and Q70. The robust core contains 1137 subgroup profiles, or 7.93% of subgroup aligned profiles. These profiles remain high-priority under all three reasonable cut-off choices, so they form a defensible planning starting point rather than a final automatic policy decision.
A Shiny dashboard, app.R, accompanies this report. The
dashboard is designed as an interactive public health profile
explorer. Its purpose is to help non-technical users explore
the main findings of the project without reading every table or code
block in the full report.
The dashboard does not provide individual-level medical diagnosis, personal obesity prediction, or causal recommendations. It works with the same aggregated surveillance logic used throughout the report: each record represents a public health profile defined by year, location, indicator, and demographic stratification group.
The dashboard supports three practical user questions:
The app is organised into three pages. Each page corresponds to a major analytical stage in the report.
| Dashboard Page | Main Purpose | Related Report Section |
|---|---|---|
| Overview and Trend | Explore yearly trends by indicator, year range, location, and stratification category | Exploratory Data Analysis |
| Location and Demographic Explorer | Identify locations and demographic groups with higher average obesity or inactivity values | Location and Demographic Comparison |
| Priority Matrix Explorer | Classify aligned subgroup profiles into intervention segments under Q70, Q75, or Q80 thresholds | Public Health Intervention Matrix and Threshold Sensitivity |
This structure makes the dashboard a practical extension of the analysis. The first page explains the when, the second explains the where and who, and the third translates the results into public health action groups.
The Overview and Trend page allows users to select an
indicator, year range, location, and stratification category. It then
displays a yearly trend plot and a small summary table containing the
number of records, number of years, number of locations, average value,
minimum value, and maximum value.
This page is useful for quickly checking whether a selected indicator is rising, falling, or fluctuating across time. For example, users can compare the adult obesity trend with the no leisure-time physical activity trend, or focus on a specific location or stratification category.
The Location and Demographic Explorer page helps users
identify where higher values are concentrated. Users select an indicator
and year option, and the app displays:
This page supports the location and demographic comparison logic from the EDA. It helps users move beyond national averages and see which public health profiles deserve closer attention.
The Priority Matrix Explorer page is the most
decision-oriented part of the app. It uses aligned subgroup
obesity-inactivity profiles and classifies them into four intervention
segments:
| Segment | Interpretation |
|---|---|
| Priority Intervention | High obesity and high physical inactivity |
| Weight Management Focus | High obesity and lower physical inactivity |
| Preventive Physical Activity Focus | Lower obesity and high physical inactivity |
| Standard Monitoring | Lower obesity and lower physical inactivity |
Users can select Q70, Q75, or Q80 as the high-risk quantile threshold. The app then shows the number of subgroup profiles in each segment, segment averages, and a filtered table of the highest combined-burden subgroup profiles. This directly connects the dashboard to the threshold sensitivity analysis in the report.
The app design follows the same principles as the analysis:
To run the dashboard:
Nutrition__Physical_Activity__and_Obesity.csv in
the same folder as app.R.app.R in RStudio.The dashboard requires the following R packages:
shiny, dplyr, tidyr, ggplot2, readr, stringr
The dashboard is intended for exploration and communication, not for clinical or causal decision-making. Several limitations should be kept in mind:
Overall, the dashboard strengthens the project by turning the statistical analysis into an interactive tool. It allows users to explore trends, compare locations and demographic groups, and inspect intervention priority profiles in a way that is more accessible than static tables alone.
This project shows how a large public health surveillance dataset can
be transformed into an interpretable workflow for obesity and physical
inactivity profiling. The analysis begins with careful question
selection, because data_value has different meanings under
different public health questions. By focusing on adult obesity
(Q036) and no leisure-time physical activity
(Q047), the project avoids mixing incompatible indicators
and keeps the interpretation consistent.
The exploratory analysis shows that adult obesity increased over the study period. In the all-record trend, adult obesity rose by 6.3 percentage points from 2011 to 2021. The total-stratification trend confirms that the increase remains visible after reducing demographic composition effects. This supports the conclusion that the rising obesity pattern is not only a result of how stratified records are pooled.
Location and demographic comparison further show that obesity is unevenly distributed. The highest average obesity locations include Mississippi, West Virginia, Louisiana, while lower-obesity locations include Colorado, District of Columbia, Hawaii. Demographic profiles also differ strongly, especially across race or ethnicity, age, education, and income groups. These findings justify the modelling strategy: year, location, and demographic stratification are not arbitrary predictors, but variables that the EDA already shows to be important.
The aligned obesity-inactivity analysis is a central design choice. Because obesity and physical inactivity are reported in separate rows, the two indicators are compared only after matching by year, location, stratification category, and stratification group. The report then separates the broad all-profile alignment from the subgroup aligned modelling scope. Within the subgroup modelling scope, the raw correlation between adult obesity and no leisure-time physical activity is 0.457, indicating a positive descriptive association. The within-profile demeaned correlation (0.073) shows how much of this survives once stable between-profile differences are removed, and the subgroup-level lagged association check stays positive across lag 0, lag 1, and lag 2. Taken together, these are stable subgroup profile-level associations rather than validated predictive or causal relationships, so they motivate examining inactivity alongside obesity while still requiring a cautious non-causal interpretation.
The same pattern reappears in the regression adjustment ladder, which forms a clean gradient with the EDA: the inactivity coefficient falls from about 0.4 when unadjusted to about 0.14 once location and demographic group are controlled (roughly 35% retained), and the stricter within-profile correlation (0.073) sits closer to zero still. The more completely profile identity is absorbed, the less of the association survives, so the EDA and the regression tell one consistent story: the obesity-inactivity link is carried mostly by stable differences between locations and demographic subgroup profiles, with only a modest residual remaining after full adjustment.
The regression model confirms that adult obesity values are strongly
structured by public health profile characteristics. The linear
regression explains 74.7% of test-set variation in adult obesity values
and performs better than the regression tree benchmark. The coefficient
interpretation strengthens the result: the yearstart
coefficient is approximately 0.6, meaning that obesity increases by
about 0.6 percentage points per year after controlling for location and
stratification group. This connects the model directly back to the EDA
trend.
The classification model provides a practical subgroup-level
screening layer. After excluding Total or
Overall rows to match the regression scope, the logistic
regression achieves a ROC-AUC of 0.928 and an AUC-PR of 0.815. More
importantly for prioritisation, the top 20% highest-ranked subgroup
profiles capture 65.6% of true high-obesity profiles. The risk-decile
validation shows that observed high-obesity rates rise sharply from the
lowest to the highest predicted-risk deciles, confirming that the model
ranks subgroup profiles in a meaningful order within the observed
surveillance panel.
Finally, the public health intervention matrix translates statistical
results into action categories. The Priority Intervention
segment contains subgroup profiles with both high obesity and high
inactivity, representing 11.3% of subgroup aligned profiles. Threshold
sensitivity analysis shows that the priority list contracts as the
threshold moves from Q70 to Q80, and the robust core contains 1137
subgroup profiles that remain high-priority under all three cut-off
choices. This robust core is a defensible starting point for
intervention planning, not an automatic policy decision.
Several limitations should be considered when interpreting the results.
The dataset is aggregated, not
individual-level.
Each row represents a public health profile such as a year, location,
indicator, and stratification group. The results cannot be used to
predict whether a specific person is obese.
The analysis is associational, not causal.
The aligned analysis shows that obesity and physical inactivity tend to
be higher in the same public health profiles, but this does not prove
that physical inactivity causes obesity. Other factors such as diet,
income, local environment, healthcare access, and policy context may
also contribute.
Model performance reflects interpolation within the
observed surveillance panel.
Because location and stratification group are predictors, the regression
and classification models mainly learn patterns among already-observed
locations and demographic groups. The results should not be treated as
strong evidence of performance for entirely new locations, future policy
contexts, or unseen demographic categories.
High-risk thresholds are relative.
The project uses quantile thresholds such as Q75 to define high-obesity
and high-inactivity profiles. These thresholds are useful for
prioritisation, but they are not medical, clinical, or policy-mandated
thresholds.
Total-level and subgroup-level records have different
meanings.
Total or Overall rows summarise a location-year, while subgroup rows
describe demographic slices. The report keeps these levels separate, but
users should not interpret subgroup models as population-weighted
national estimates.
Some strata have fewer records than
others.
Certain demographic groups and locations have smaller coverage or fewer
years of data. This can make some subgroup averages and extreme rankings
less stable.
The models are intentionally simple.
The regression and classification models use interpretable predictors
and additive structures. They do not fully model complex interactions
between location, income, education, age, race or ethnicity, and
physical inactivity.
The intervention matrix supports prioritisation, not
final policy decisions.
The matrix identifies profiles that may deserve attention, but real
interventions should also consider local context, feasibility,
resources, community needs, and external validation.
The dashboard is exploratory.
The Shiny app helps users inspect trends, rankings, and priority
segments, but it is not a production surveillance system or clinical
decision tool.
This project analysed adult obesity and no leisure-time physical activity using the Nutrition, Physical Activity, and Obesity public health surveillance dataset. The dataset satisfies the course dimension requirement and supports a complete data science workflow: data cleaning, exploratory analysis, aligned relationship analysis, regression modelling, classification modelling, intervention prioritisation, threshold sensitivity analysis, and dashboard design.
The main findings are clear. Adult obesity increased from 2011 to 2021, and the increase remains visible in total-stratification records. Obesity also differs substantially across locations and demographic groups, confirming that public health profiles are not evenly distributed. After fair alignment, obesity and physical inactivity show a positive subgroup-level association with a correlation of 0.457 within the modelling scope.
The regression model explains 74.7% of test-set obesity variation, showing that year, location, and demographic group account for a large share of observed obesity differences. The classification model adds subgroup-level screening value: the logistic regression captures 65.6% of true high-obesity subgroup profiles within the top 20% highest-ranked subgroup profiles.
The final public health intervention matrix turns these results into
practical segments. The Priority Intervention group
combines high obesity with high physical inactivity and represents the
strongest candidate group for combined obesity prevention and physical
activity promotion. The robust core of 1137 subgroup profiles remains
high-priority under Q70, Q75, and Q80 threshold choices, making it a
defensible starting point for planning.
Overall, the project demonstrates how programming and data science can convert large-scale public health surveillance data into interpretable insights, screening tools, intervention priorities, and an interactive dashboard. All conclusions should be understood as group-level public health insights rather than individual medical predictions or causal claims.