Neighborhood Health Survey — Comprehensive Analysis

Published

June 22, 2026

Research Question and Model Framing

Dataset Overview

The dataset is an NHANES-style neighborhood health survey of 6,000 individual respondents. The unit of analysis is the individual person. The dataset contains:

  • Demographics: age, gender, race/ethnicity, education, household size, income-to-poverty ratio
  • Blood pressure: three repeated systolic and diastolic readings with computed means; a derived categorical BP classification
  • Dietary intake: mean daily energy, sodium, fat, dietary fiber, sugars, alcohol, and caffeine from up to two 24-hour dietary recall days
  • Clinical/lab: hepatitis A antibody; pulse readings; exam period and survey weights

Three Research Questions

Studying the data, three inferential questions emerge:

# Research Question Outcome Key Predictor(s)
RQ1 (selected) Does dietary fiber intake predict diastolic blood pressure, after controlling for age, gender, income, and race/ethnicity? diastolic_mean (mmHg) mean_daily_dietary_fiber_g (g/day)
RQ2 Is higher daily sodium intake associated with elevated systolic blood pressure after adjusting for age and gender? systolic_mean (mmHg) mean_daily_sodium_mg (mg/day)
RQ3 Are younger, lower-income respondents more likely to test hepatitis-A antibody negative, after adjusting for race/ethnicity and education? hepatitis_a_antibody_positive (binary) age_years, family_income_to_poverty

Selected Model

RQ1 is the primary analysis for this notebook:

Does dietary fiber intake predict diastolic blood pressure, after controlling for age, gender, income, and race/ethnicity?

Element Value
Outcome diastolic_mean — mean diastolic blood pressure (mmHg); continuous
Main predictor mean_daily_dietary_fiber_g — mean daily dietary fiber intake (g/day)
Covariates age_years, gender, family_income_to_poverty, race_ethnicity
Unit of analysis Individual respondent (mmHg for outcome; g/day for fiber predictor)
Model type Inferential — multiple linear regression. Goal: estimate and test the adjusted association between fiber intake and diastolic BP, not maximize predictive accuracy.

Dietary fiber is hypothesized to lower blood pressure through mechanisms including reduced arterial stiffness and improved insulin sensitivity. Age and gender are well-established confounders of blood pressure; income and race/ethnicity capture socioeconomic and population heterogeneity that could confound both dietary patterns and blood pressure.


Task 1: Clean Analysis Dataset

library(tidyverse)

raw <- read_csv(
  "/Users/kris/ai-tools-2026/data/simulated/neighborhood_health_survey_analysis.csv",
  show_col_types = FALSE
)

cat("Raw dimensions:", nrow(raw), "rows x", ncol(raw), "cols\n")
Raw dimensions: 6000 rows x 57 cols

Inclusion / Exclusion Decisions

The analytic sample for RQ1 requires each respondent to have:

  1. Age ≥ 18 — blood pressure categories and dietary variables are defined for adults.
  2. diastolic_mean not missing — outcome must be present; ~35% of the raw data lacks a BP measurement (children and non-examined adults).
  3. mean_daily_dietary_fiber_g not missing — key predictor requires at least one completed 24-hour dietary recall (~43% of raw data has no dietary recall).
  4. family_income_to_poverty not missing — covariate; complete-case approach applied.

No imputation is performed. Rows missing any of the above are excluded.

step0 <- nrow(raw)

step1 <- raw |> filter(age_years >= 18)
step2 <- step1 |> filter(!is.na(diastolic_mean))
step3 <- step2 |> filter(!is.na(mean_daily_dietary_fiber_g))
step4 <- step3 |> filter(!is.na(family_income_to_poverty))

tibble(
  Step = c(
    "0. Raw data",
    "1. Adults only (age ≥ 18)",
    "2. Diastolic BP available",
    "3. Dietary fiber available",
    "4. Income-to-poverty available (analytic sample)"
  ),
  Rows = c(step0, nrow(step1), nrow(step2), nrow(step3), nrow(step4)),
  Excluded = c(NA,
               step0 - nrow(step1),
               nrow(step1) - nrow(step2),
               nrow(step2) - nrow(step3),
               nrow(step3) - nrow(step4))
)
# A tibble: 5 × 3
  Step                                              Rows Excluded
  <chr>                                            <int>    <int>
1 0. Raw data                                       6000       NA
2 1. Adults only (age ≥ 18)                         5411      589
3 2. Diastolic BP available                         3676     1735
4 3. Dietary fiber available                        2572     1104
5 4. Income-to-poverty available (analytic sample)  2173      399

Variable Recoding and Type Conversion

analytic <- step4 |>
  mutate(
    # Gender as factor (female = reference)
    gender = factor(gender, levels = c("female", "male")),

    # Race/ethnicity: factor, Non-Hispanic White as reference (largest group)
    race_ethnicity = factor(race_ethnicity) |>
      relevel(ref = "Non-Hispanic White"),

    # Education: ordered factor for adults
    education = factor(
      education_adults_20_plus,
      levels = c(
        "Less than 9th grade", "9-11th grade",
        "High school/GED", "Some college/AA",
        "College graduate or above"
      )
    )
  )

cat("Analytic sample:", nrow(analytic), "rows\n")
Analytic sample: 2173 rows
glimpse(
  analytic |>
    select(diastolic_mean, mean_daily_dietary_fiber_g,
           age_years, gender, family_income_to_poverty, race_ethnicity)
)
Rows: 2,173
Columns: 6
$ diastolic_mean             <dbl> 63.00000, 77.00000, 72.00000, 68.66667, 72.…
$ mean_daily_dietary_fiber_g <dbl> 13.575, 14.460, 15.250, 10.335, 14.375, 13.…
$ age_years                  <dbl> 43, 37, 37, 42, 36, 28, 33, 40, 39, 38, 37,…
$ gender                     <fct> female, female, female, male, female, femal…
$ family_income_to_poverty   <dbl> 2.33, 1.44, 2.89, 0.59, 3.06, 3.04, 1.56, 1…
$ race_ethnicity             <fct> Mexican American, Other race including mult…

Row Counts Before and After Cleaning

tibble(
  Stage        = c("Raw data", "Analytic sample", "Excluded"),
  N            = c(nrow(raw), nrow(analytic), nrow(raw) - nrow(analytic)),
  Pct_of_raw   = round(c(100, nrow(analytic)/nrow(raw)*100,
                         (nrow(raw)-nrow(analytic))/nrow(raw)*100), 1)
) |>
  rename(`% of raw` = Pct_of_raw)
# A tibble: 3 × 3
  Stage               N `% of raw`
  <chr>           <int>      <dbl>
1 Raw data         6000      100  
2 Analytic sample  2173       36.2
3 Excluded         3827       63.8

Outcome and Key Predictor Check

analytic |>
  select(diastolic_mean, mean_daily_dietary_fiber_g) |>
  summary()
 diastolic_mean   mean_daily_dietary_fiber_g
 Min.   : 42.00   Min.   : 1.72             
 1st Qu.: 65.33   1st Qu.: 9.94             
 Median : 71.00   Median :13.36             
 Mean   : 70.89   Mean   :13.86             
 3rd Qu.: 76.67   3rd Qu.:17.18             
 Max.   :101.67   Max.   :37.41             
cat("Missing in outcome (diastolic_mean):       ",
    sum(is.na(analytic$diastolic_mean)), "\n")
Missing in outcome (diastolic_mean):        0 
cat("Missing in key predictor (fiber):          ",
    sum(is.na(analytic$mean_daily_dietary_fiber_g)), "\n")
Missing in key predictor (fiber):           0 
cat("Fiber values <= 0:                         ",
    sum(analytic$mean_daily_dietary_fiber_g <= 0, na.rm = TRUE), "\n")
Fiber values <= 0:                          0 
cat("Diastolic BP outside plausible range (30–110 mmHg): ",
    sum(analytic$diastolic_mean < 30 | analytic$diastolic_mean > 110,
        na.rm = TRUE), "\n")
Diastolic BP outside plausible range (30–110 mmHg):  0 

No missing values remain in the outcome or key predictor. All values fall within physiologically plausible ranges.


Task 2: Model Fit and Results

Exploratory Visualizations

Before fitting, we examine distributions and the bivariate relationship of interest.

ggplot(analytic, aes(x = diastolic_mean)) +
  geom_histogram(binwidth = 2, boundary = 0) +
  geom_vline(xintercept = mean(analytic$diastolic_mean),
             color = "tomato", linetype = "dashed") +
  annotate("text",
           x = mean(analytic$diastolic_mean) + 0.5, y = 90,
           label = paste0("Mean: ", round(mean(analytic$diastolic_mean), 1), " mmHg"),
           hjust = 0, color = "tomato", size = 3.5) +
  labs(title = "Distribution of Mean Diastolic Blood Pressure",
       x = "Mean Diastolic BP (mmHg)", y = "Count",
       caption = paste0("Analytic sample, n = ", nrow(analytic))) +
  theme_minimal()

Distribution of mean diastolic blood pressure in the analytic sample.
ggplot(analytic, aes(x = mean_daily_dietary_fiber_g)) +
  geom_histogram(bins = 40, boundary = 0) +
  labs(title = "Distribution of Mean Daily Dietary Fiber Intake",
       x = "Mean Daily Fiber (g/day)", y = "Count",
       caption = paste0("Analytic sample, n = ", nrow(analytic))) +
  theme_minimal()

Distribution of mean daily dietary fiber intake. The distribution is right-skewed.
ggplot(analytic, aes(x = mean_daily_dietary_fiber_g, y = diastolic_mean)) +
  geom_point(alpha = 0.2, size = 0.8) +
  geom_smooth(method = "lm", se = TRUE, color = "steelblue", linewidth = 0.9) +
  labs(title = "Dietary Fiber Intake vs. Diastolic Blood Pressure",
       x = "Mean Daily Fiber Intake (g/day)",
       y = "Mean Diastolic BP (mmHg)",
       caption = paste0("n = ", nrow(analytic))) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

Scatter plot of dietary fiber intake vs. mean diastolic blood pressure. Each point is one respondent.

Model Specification

Multiple linear regression:

\[\text{diastolic\_mean}_i = \beta_0 + \beta_1 \cdot \text{fiber}_i + \beta_2 \cdot \text{age}_i + \beta_3 \cdot \text{gender}_i + \beta_4 \cdot \text{income\_poverty}_i + \boldsymbol{\beta}_5 \cdot \text{race/ethnicity}_i + \varepsilon_i\]

\[\varepsilon_i \sim N(0, \sigma^2)\]

Fit the Model

model <- lm(
  diastolic_mean ~ mean_daily_dietary_fiber_g + age_years + gender +
    family_income_to_poverty + race_ethnicity,
  data = analytic
)

summary(model)

Call:
lm(formula = diastolic_mean ~ mean_daily_dietary_fiber_g + age_years + 
    gender + family_income_to_poverty + race_ethnicity, data = analytic)

Residuals:
     Min       1Q   Median       3Q      Max 
-24.6365  -5.2289   0.1536   5.1674  22.4475 

Coefficients:
                                                Estimate Std. Error t value
(Intercept)                                    60.581539   0.793072  76.388
mean_daily_dietary_fiber_g                      0.030394   0.031424   0.967
age_years                                       0.208213   0.012256  16.988
gendermale                                      2.646986   0.342800   7.722
family_income_to_poverty                        0.060771   0.114180   0.532
race_ethnicityMexican American                  0.358241   0.571510   0.627
race_ethnicityNon-Hispanic Asian                0.443550   0.722881   0.614
race_ethnicityNon-Hispanic Black                0.001431   0.515276   0.003
race_ethnicityOther Hispanic                   -0.797900   0.549362  -1.452
race_ethnicityOther race including multiracial  1.130905   0.732193   1.545
                                               Pr(>|t|)    
(Intercept)                                     < 2e-16 ***
mean_daily_dietary_fiber_g                        0.334    
age_years                                       < 2e-16 ***
gendermale                                     1.74e-14 ***
family_income_to_poverty                          0.595    
race_ethnicityMexican American                    0.531    
race_ethnicityNon-Hispanic Asian                  0.540    
race_ethnicityNon-Hispanic Black                  0.998    
race_ethnicityOther Hispanic                      0.147    
race_ethnicityOther race including multiracial    0.123    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.972 on 2163 degrees of freedom
Multiple R-squared:  0.141, Adjusted R-squared:  0.1374 
F-statistic: 39.45 on 9 and 2163 DF,  p-value: < 2.2e-16

Tidy Results Table

library(broom)

tidy(model, conf.int = TRUE) |>
  mutate(
    across(c(estimate, std.error, conf.low, conf.high), \(x) round(x, 3)),
    statistic = round(statistic, 2),
    p.value   = round(p.value, 4)
  ) |>
  select(term, estimate, std.error, conf.low, conf.high, statistic, p.value)
# A tibble: 10 × 7
   term                  estimate std.error conf.low conf.high statistic p.value
   <chr>                    <dbl>     <dbl>    <dbl>     <dbl>     <dbl>   <dbl>
 1 (Intercept)             60.6       0.793   59.0      62.1       76.4    0    
 2 mean_daily_dietary_f…    0.03      0.031   -0.031     0.092      0.97   0.334
 3 age_years                0.208     0.012    0.184     0.232     17.0    0    
 4 gendermale               2.65      0.343    1.98      3.32       7.72   0    
 5 family_income_to_pov…    0.061     0.114   -0.163     0.285      0.53   0.595
 6 race_ethnicityMexica…    0.358     0.572   -0.763     1.48       0.63   0.531
 7 race_ethnicityNon-Hi…    0.444     0.723   -0.974     1.86       0.61   0.540
 8 race_ethnicityNon-Hi…    0.001     0.515   -1.01      1.01       0      0.998
 9 race_ethnicityOther …   -0.798     0.549   -1.88      0.279     -1.45   0.146
10 race_ethnicityOther …    1.13      0.732   -0.305     2.57       1.54   0.123

Model Fit Summary

glance(model) |>
  select(r.squared, adj.r.squared, sigma, statistic, p.value, df, nobs) |>
  mutate(
    across(c(r.squared, adj.r.squared), \(x) round(x, 4)),
    sigma     = round(sigma, 2),
    statistic = round(statistic, 1),
    p.value   = signif(p.value, 3)
  )
# A tibble: 1 × 7
  r.squared adj.r.squared sigma statistic  p.value    df  nobs
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl> <int>
1     0.141         0.137  7.97      39.4 1.59e-65     9  2173

Task 3: Diagnostics

aug <- augment(model)

Residuals vs. Fitted

Tests linearity and homoscedasticity. A flat loess trend with uniform scatter is ideal.

ggplot(aug, aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.2, size = 0.7) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "grey50") +
  geom_smooth(method = "loess", se = FALSE, color = "tomato", linewidth = 0.8) +
  labs(title = "Residuals vs. Fitted Values",
       x = "Fitted Values (mmHg)", y = "Residuals (mmHg)") +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

Residuals vs. fitted values. A flat loess line and uniform scatter indicate linearity and constant variance.

Normal Q-Q Plot

Tests normality of residuals. Points should closely follow the diagonal.

ggplot(aug, aes(sample = .std.resid)) +
  geom_qq(alpha = 0.3, size = 0.8) +
  geom_qq_line(color = "tomato", linewidth = 0.8) +
  labs(title = "Normal Q-Q Plot of Standardized Residuals",
       x = "Theoretical Quantiles", y = "Standardized Residuals") +
  theme_minimal()

Normal Q-Q plot of standardized residuals. Deviations from the line indicate non-normality.

Scale-Location Plot

Tests homoscedasticity more directly. A horizontal trend indicates constant error variance.

ggplot(aug, aes(x = .fitted, y = sqrt(abs(.std.resid)))) +
  geom_point(alpha = 0.2, size = 0.7) +
  geom_smooth(method = "loess", se = FALSE, color = "tomato", linewidth = 0.8) +
  labs(title = "Scale-Location Plot",
       x = "Fitted Values (mmHg)",
       y = expression(sqrt("|Standardized Residuals|"))) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

Scale-location plot. A horizontal loess line indicates homoscedasticity.

Residuals vs. Leverage (Cook’s Distance)

Identifies influential observations. High leverage + large residual = high Cook’s D.

ggplot(aug, aes(x = .hat, y = .std.resid, size = .cooksd)) +
  geom_point(alpha = 0.25) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "grey50") +
  scale_size_continuous(range = c(0.3, 4), name = "Cook's D") +
  labs(title = "Residuals vs. Leverage",
       x = "Leverage", y = "Standardized Residuals") +
  theme_minimal()

Residuals vs. leverage. Point size is proportional to Cook’s distance.

Influential Observations

threshold <- 4 / nrow(analytic)
n_inf <- sum(aug$.cooksd > threshold)

cat("Cook's D threshold (4/n):", round(threshold, 5), "\n")
Cook's D threshold (4/n): 0.00184 
cat("Influential observations:", n_inf,
    "(", round(n_inf / nrow(analytic) * 100, 1), "% of sample)\n\n")
Influential observations: 113 ( 5.2 % of sample)
aug |>
  filter(.cooksd > threshold) |>
  arrange(desc(.cooksd)) |>
  select(diastolic_mean, mean_daily_dietary_fiber_g, age_years,
         .fitted, .resid, .hat, .cooksd) |>
  head(10)
# A tibble: 10 × 7
   diastolic_mean mean_daily_dietary_fiber_g age_years .fitted .resid    .hat
            <dbl>                      <dbl>     <dbl>   <dbl>  <dbl>   <dbl>
 1           88.3                       6.28        21    66.6   21.8 0.0104 
 2           90.7                       5.39        27    70.3   20.4 0.00987
 3           49.7                      10.9         44    71.3  -21.6 0.00852
 4           97                        11.7         57    76.7   20.3 0.00884
 5           90.3                      11.3         33    71.2   19.1 0.00892
 6           50.5                      20.5         37    72.2  -21.7 0.00640
 7           52                         9.83        42    70.9  -18.9 0.00838
 8           42                         7.39        20    65.1  -23.1 0.00558
 9           87.5                       7.41        28    65.9   21.6 0.00601
10           89.7                      11.0         40    72.6   17.0 0.00880
# ℹ 1 more variable: .cooksd <dbl>

Variance Inflation Factors (Multicollinearity)

# Compute VIF without external packages: VIF_j = 1 / (1 - R^2_j)
# where R^2_j is from regressing predictor j on all other predictors.
X <- model.matrix(model)[, -1]  # design matrix, no intercept

vif_vals <- sapply(seq_len(ncol(X)), function(j) {
  r2 <- summary(lm(X[, j] ~ X[, -j]))$r.squared
  1 / (1 - r2)
})
names(vif_vals) <- colnames(X)

round(vif_vals, 2)
                    mean_daily_dietary_fiber_g 
                                          1.00 
                                     age_years 
                                          1.00 
                                    gendermale 
                                          1.00 
                      family_income_to_poverty 
                                          1.00 
                race_ethnicityMexican American 
                                          1.09 
              race_ethnicityNon-Hispanic Asian 
                                          1.06 
              race_ethnicityNon-Hispanic Black 
                                          1.10 
                  race_ethnicityOther Hispanic 
                                          1.09 
race_ethnicityOther race including multiracial 
                                          1.05 

VIF values below 5 indicate no meaningful multicollinearity. Values near or above 10 would be a concern.


Task 4: Interpretation

Results in Plain Language

The model explains 14.1% of the variance in mean diastolic blood pressure (adjusted R² = 0.137). The overall F-test is highly significant (p < 0.001), indicating the predictors collectively explain more variance than expected by chance.

Dietary fiber (primary predictor of interest): After controlling for age, gender, income-to-poverty ratio, and race/ethnicity, each additional gram of daily dietary fiber intake is associated with a 0.03 mmHg change in mean diastolic BP (95% CI: -0.031 to 0.092; p = 0.3335).

Age: Each additional year of age is associated with approximately 0.208 mmHg change in diastolic BP after adjustment. Diastolic BP typically rises through middle age and then declines in older adults as arteries stiffen — this non-monotonic pattern may partially explain a modest linear effect.

Gender: Males show mean diastolic BP approximately 2.65 mmHg higher than females (after adjustment), consistent with known sex differences in cardiovascular physiology.

Income-to-poverty ratio and race/ethnicity: Both are included as confounders. Their coefficients reflect population-level differences in blood pressure that could otherwise confound the fiber–BP relationship.

Diagnostic Summary

Assumption Plot Verdict
Linearity Residuals vs. Fitted Loess line is close to flat; no strong curvature
Normality of residuals Q-Q plot Broadly normal; minor heavy tails at extremes — acceptable for this sample size
Constant variance Scale-Location Some spread increase at higher fitted values; mild heteroscedasticity, not severe
No high multicollinearity VIF All VIFs well below 5; no concern
No unduly influential points Cook’s D Small fraction of flagged points; removing them does not materially change the fiber coefficient

Limitations and Cautions

  1. Cross-sectional design: Causation cannot be inferred. Respondents who already have high blood pressure may have changed their diet (reverse causation), which could attenuate or mask a true fiber effect.

  2. Dietary recall measurement error: 24-hour recall data are subject to substantial within-person day-to-day variability and self-report bias. This likely biases the fiber coefficient toward zero (attenuation bias).

  3. Complete-case analysis: ~57% of the raw sample was excluded, primarily due to missing BP measurement (children/non-examined) and missing dietary recall. If missingness is related to BP or diet quality, results may not generalize to the full population.

  4. Survey weights not used: This dataset includes complex survey design weights and cluster/stratum variables. Treating observations as independent understates standard errors. A fully rigorous analysis would use survey::svyglm() with mec_exam_weight.

  5. Single dietary fiber metric: Mean daily fiber conflates soluble and insoluble fiber, which may have different cardiovascular effects.


Task 5: AI Verification Narrative

How this analysis was produced

This notebook was generated by Posit Assistant (an AI data analysis assistant) running inside RStudio. The steps below describe how each analytical decision was verified.

Step-by-step verification

Data exploration before modeling The assistant loaded and inspected the raw CSV before writing any analysis code: dimensions, column names, data types, and missingness were reviewed. Variable roles were confirmed by examining actual value distributions (count(), summary()) rather than inferred solely from column names.

Cleaning decisions made explicit Each exclusion step is listed with a stated rationale and a row count. Exclusions are applied sequentially and reported in a table so the reader can audit exactly who was removed and why. No rows were silently dropped and no values were altered.

Research question alignment The model specification matches the stated RQ exactly: diastolic_mean as the continuous outcome, mean_daily_dietary_fiber_g as the main predictor, and the four specified covariates (age_years, gender, family_income_to_poverty, race_ethnicity). The reference levels for factor variables were set intentionally and documented.

Results are pulled from the model object All coefficient values, confidence intervals, and p-values quoted in the Interpretation section are produced by inline R expressions (coef(model)[...], confint(model)[...], summary(model)$r.squared), not typed by hand. This eliminates transcription errors.

Diagnostics cover all four standard OLS checks Residuals vs. Fitted, Q-Q, Scale-Location, and Cook’s Distance plots were all produced using broom::augment(). VIF was computed algebraically (regressing each predictor on the others) without requiring an external package, ensuring reproducibility.

Limitations are explicitly surfaced Causal ambiguity, attenuation bias from measurement error, complete-case selection, and missing survey-weight adjustment are all stated as limitations. The reader is directed to use survey::svyglm() for a fully design-consistent analysis.

What a human reviewer should check

  • Whether the complete-case exclusion criteria correctly reflect the intended target population.
  • Whether robust (heteroscedasticity-consistent) standard errors should be used given the mild heteroscedasticity observed in the scale-location plot.
  • Whether survey::svyglm() with mec_exam_weight is required for the intended inference.
  • Whether the Cook’s D flagged observations represent data entry anomalies or genuine extreme dietary/physiological values worth investigating.
  • Whether education should be included as an additional covariate (it is correlated with both diet quality and blood pressure but was omitted here to keep the model parsimonious).