1 What we are building and why

We are building an individual-level cumulative environmental stress score (and a parallel restoration score) from element-level survey responses across seven property types. This is a measurement + aggregation problem: we are compressing many observed indicators into a smaller set of scores that behave like coherent exposure indices. The output is intended to be used as a predictor of cognitive well-being outcomes (PSS, affect items, and/or an overall experience composite), and to support demographic prediction (expected burden given a demographic profile).

Two principles govern the approach:

  1. Two-stage aggregation prevents structural weighting artifacts. If one property type has more element items than another, summing raw elements across properties silently weights that property higher. We therefore aggregate within property first, then aggregate across properties after standardizing.

  2. Composite scores must be validated, not assumed. We use (a) convergent validity checks against each property’s single-item overall rating and (b) construct validity checks against PSS/affect outcomes. We also use reliability / dimensionality checks to confirm that it is statistically reasonable to collapse elements into a property score.


2 Columns used (based on our prior datasets)

This how-to assumes the same column families used previously:

If any names differ, we update them once in prop_dict and the outcome-mapping block.


3 Step-by-step (statistical logic + tests + code)

3.1 1) We define a property dictionary and fail loudly if required columns are missing.

Statistical rationale. Composite construction is extremely sensitive to mis-specified column groupings. If we accidentally omit a stressor family or include the wrong items, the resulting aggregate is not “slightly off”—it is a different measurement instrument. We therefore encode a transparent mapping and explicitly check column presence.

library(readxl)
library(dplyr)
library(tidyr)
library(stringr)

df <- read_excel("cleaned data.xlsx")

prop_dict <- tibble::tribble(
  ~property,     ~stress_col,               ~restor_col,               ~overall_col,
  "home",        "home_aspect_stress",      "home_aspect_restore",     "home_overall_num",
  "office",      "office_aspect_stress",    "office_aspect_restore",   "office_overall_num",
  "park",        "park_aspect_stress",      "park_aspect_restore",     "park_overall_num",
  "education",   "education_space_stress",  NA_character_,             "education_overall_num",
  "sport",       "sport_aspect_stress",     "sport_aspect_restore",    "sport_overall_num",
  "hotel",       "hotel_aspect_stress",     "hotel_aspect_restore",    "hotel_overall_num",
  "healthcare",  "healthcare_space_stress", NA_character_,             "healthcare_overall_num"
)

id_col <- "response_id"
places_col <- "number_location_daily"

required_cols <- c(id_col, places_col, prop_dict$stress_col, prop_dict$overall_col)
required_cols <- required_cols[!is.na(required_cols)]
missing_cols <- setdiff(required_cols, names(df))
stopifnot(length(missing_cols) == 0)

3.2 2) We one-hot encode check-all columns so elements become analyzable indicators.

Statistical rationale. A check-all cell containing multiple selections is not a single variable; it is a sparse set of binary indicators. If we do not separate it, we (a) undercount exposure, (b) cannot test which elements drive overall ratings, and (c) cannot compute reliable composites.

Tests supported. Element prevalence; regression of overall ratings on elements; tetrachoric correlations; EFA/CFA.

one_hot_checkall <- function(data, id_col, checkall_col, prefix) {
  if (is.na(checkall_col) || !(checkall_col %in% names(data))) return(NULL)

  data %>%
    select(all_of(id_col), all_of(checkall_col)) %>%
    filter(!is.na(.data[[checkall_col]]), .data[[checkall_col]] != "") %>%
    mutate(.option = .data[[checkall_col]]) %>%
    separate_rows(.option, sep = ",") %>%
    mutate(
      .option = str_squish(.option),
      .option = str_replace_all(.option, "[^A-Za-z0-9]+", "_"),
      .option = str_replace_all(.option, "_+", "_"),
      .option = str_replace_all(.option, "^_|_$", "")
    ) %>%
    distinct(.data[[id_col]], .option) %>%
    mutate(value = 1L) %>%
    pivot_wider(
      names_from = .option,
      values_from = value,
      values_fill = 0L,
      names_prefix = paste0(prefix, "__")
    )
}

3.3 3) We test whether within-property aggregation is statistically reasonable.

Statistical rationale. Summing elements assumes the indicators can be collapsed into a single dimension (or at least a defensible burden index). We confirm coherence using reliability and dimensionality checks.

Concrete options. - Binary elements: tetrachoric correlations; EFA on tetrachoric matrix; KR-20 (or alpha as a proxy). - Likert elements: Cronbach’s alpha; omega; EFA/CFA.

If the first factor is dominant and reliability is acceptable, collapsing to a single property score is justified. If clearly multi-dimensional, we either keep subscales or justify a general factor only with evidence.

# Example for PARK stress after one-hot encoding:
# library(psych)
# park_vars <- grep("^park_stress__", names(df), value = TRUE)
# X <- df[, park_vars]
# tetra <- psych::tetrachoric(X)$rho
# fa1 <- psych::fa(tetra, nfactors = 1, fm = "ml")
# print(fa1)
# psych::alpha(X)

3.4 4) We compute property-level stress/restoration scores as composites.

Statistical rationale. For check-all items, the composite is naturally a burden count (a formative index): endorsing more stressors implies greater burden. This supports using rowSums().

Assumption. Equal contribution per stressor unless weighting is justified and stable.

stress_onehots <- list()
restor_onehots <- list()

for (i in seq_len(nrow(prop_dict))) {
  p <- prop_dict$property[i]
  stress_onehots[[p]] <- one_hot_checkall(df, id_col, prop_dict$stress_col[i], paste0(p, "_stress"))
  restor_onehots[[p]] <- one_hot_checkall(df, id_col, prop_dict$restor_col[i], paste0(p, "_restor"))
}

for (p in names(stress_onehots)) if (!is.null(stress_onehots[[p]])) df <- df %>% left_join(stress_onehots[[p]], by = id_col)
for (p in names(restor_onehots)) if (!is.null(restor_onehots[[p]])) df <- df %>% left_join(restor_onehots[[p]], by = id_col)

for (i in seq_len(nrow(prop_dict))) {
  p <- prop_dict$property[i]
  st_vars <- grep(paste0("^", p, "_stress__"), names(df), value = TRUE)
  rs_vars <- grep(paste0("^", p, "_restor__"), names(df), value = TRUE)
  if (length(st_vars) > 0) df[[paste0(p, "_stress_score")]] <- rowSums(df[, st_vars], na.rm = TRUE)
  if (length(rs_vars) > 0) df[[paste0(p, "_restor_score")]] <- rowSums(df[, rs_vars], na.rm = TRUE)
}

3.5 5) We test convergent validity against single-item overall property ratings.

Statistical rationale. If the composite captures the major stressors/restoratives, it should align with overall appraisal. Because overall items are often ordinal, Spearman is a safe default; OLS regression provides an interpretable effect size and incremental validity.

Tests. Spearman correlations; regressions overall ~ stress_score + restor_score; nested-model comparisons.

validation <- prop_dict %>%
  mutate(stress_score_col = paste0(property, "_stress_score"),
         restor_score_col = paste0(property, "_restor_score")) %>%
  rowwise() %>%
  mutate(
    spearman_stress = ifelse(
      overall_col %in% names(df) && stress_score_col %in% names(df),
      cor(df[[stress_score_col]], df[[overall_col]], use = "complete.obs", method = "spearman"),
      NA_real_
    ),
    spearman_restor = ifelse(
      overall_col %in% names(df) && restor_score_col %in% names(df),
      cor(df[[restor_score_col]], df[[overall_col]], use = "complete.obs", method = "spearman"),
      NA_real_
    )
  ) %>% ungroup()

validation

3.6 6) If we weight elements, we use regularized methods and test stability.

Statistical rationale. OLS coefficients can be unstable with correlated stressors. Ridge regression (alpha=0) is preferred for stable weights; lasso can be used for sparse selection but requires careful validation.

Tests. CV RMSE; coefficient stability; out-of-sample correlation with overall anchors.

# library(glmnet)
# park_overall <- "park_overall_num"
# park_vars <- grep("^park_stress__", names(df), value = TRUE)
# X <- model.matrix(~ . - 1, data = df[, park_vars, drop = FALSE])
# y <- df[[park_overall]]
# fit_ridge <- cv.glmnet(X, y, alpha = 0)
# w <- as.numeric(coef(fit_ridge, s = "lambda.min"))[-1]
# df$park_stress_weighted <- as.numeric(X %*% w)

3.7 7) We standardize property scores before summing across properties.

Statistical rationale. Standardization prevents scale and variance differences from turning into implicit weights. This is standard practice in index construction when combining subscales.

stress_score_cols <- grep("_stress_score$", names(df), value = TRUE)
restor_score_cols <- grep("_restor_score$", names(df), value = TRUE)

for (c in stress_score_cols) df[[paste0(c, "_z")]] <- as.numeric(scale(df[[c]]))
for (c in restor_score_cols) df[[paste0(c, "_z")]] <- as.numeric(scale(df[[c]]))

3.8 8) We compute cumulative stress and restoration.

Statistical rationale. Cumulative indices are sums of standardized property scores, making the aggregate multi-context by construction. We compute net balance as stress − restoration but retain both components for modeling.

stress_z_cols <- grep("_stress_score_z$", names(df), value = TRUE)
restor_z_cols <- grep("_restor_score_z$", names(df), value = TRUE)

df <- df %>%
  mutate(
    cumulative_stress = rowSums(across(all_of(stress_z_cols)), na.rm = TRUE),
    cumulative_restor = rowSums(across(all_of(restor_z_cols)), na.rm = TRUE),
    cumulative_net    = cumulative_stress - cumulative_restor
  )

3.9 9) We test construct validity against PSS and affect outcomes.

Statistical rationale. We expect stress burden to associate with higher PSS and negative affect, and restoration to associate with lower PSS and higher positive affect. We also test incremental validity beyond demographics using nested-model ANOVA.

pss_items <- c("pss4_1","pss4_2","pss4_3","pss4_4")
df <- df %>% mutate(across(all_of(pss_items), ~ as.numeric(.) - 1))
df <- df %>% mutate(pss4_2_r = 4 - pss4_2, pss4_3_r = 4 - pss4_3)
df <- df %>% mutate(PSS_total = pss4_1 + pss4_2_r + pss4_3_r + pss4_4)

fit_demo <- lm(PSS_total ~ age + gender + income_us + city + mobility, data = df)
fit_full <- lm(PSS_total ~ age + gender + income_us + city + mobility + cumulative_stress + cumulative_restor, data = df)
anova(fit_demo, fit_full)
summary(fit_full)

3.10 10) We incorporate exposure opportunity (places/day) using moderation.

Statistical rationale. Without time-in-place, moderation is the cleanest test: it asks whether burden translates to outcomes differently at higher exposure breadth.

fit_dose <- lm(PSS_total ~ cumulative_stress * number_location_daily + cumulative_restor, data = df)
summary(fit_dose)

3.11 11) We build the “similar people” prediction model.

Statistical rationale. This estimates expected cumulative stress given demographics, enabling an interface (input demographics → predicted burden). We validate with train/test or cross-validation for predictive performance.

age_levels <- c("18–24","25–34","35–44","45–54","55–64","65+")
df <- df %>% mutate(age_cohort = cut(age, breaks = c(17,24,34,44,54,64,150), labels = age_levels, right = TRUE))

fit_pred <- lm(cumulative_stress ~ age_cohort + city + gender + income_us + mobility, data = df)
summary(fit_pred)

new_person <- data.frame(
  age_cohort = factor("35–44", levels = age_levels),
  city = factor("New York City", levels = levels(df$city)),
  gender = factor("Man", levels = levels(df$gender)),
  income_us = factor("75k–100k", levels = levels(df$income_us)),
  mobility = factor("Commuter", levels = levels(df$mobility))
)

predict(fit_pred, newdata = new_person, interval = "confidence")

4 Sensitivity analyses (minimum set)

We treat these as required because aggregation choices can change results.

# Drop-one-property sensitivity scaffold
# stress_z_cols <- grep("_stress_score_z$", names(df), value = TRUE)
# for (drop in stress_z_cols) {
#   tmp <- setdiff(stress_z_cols, drop)
#   df[[paste0("cum_stress_minus_", drop)]] <- rowSums(df[, tmp], na.rm = TRUE)
# }

5 Assumptions (documented)