We are building an individual-level cumulative environmental stress score (and a parallel restoration score) from element-level survey responses across seven property types. This is a measurement + aggregation problem: we are compressing many observed indicators into a smaller set of scores that behave like coherent exposure indices. The output is intended to be used as a predictor of cognitive well-being outcomes (PSS, affect items, and/or an overall experience composite), and to support demographic prediction (expected burden given a demographic profile).
Two principles govern the approach:
Two-stage aggregation prevents structural weighting artifacts. If one property type has more element items than another, summing raw elements across properties silently weights that property higher. We therefore aggregate within property first, then aggregate across properties after standardizing.
Composite scores must be validated, not assumed. We use (a) convergent validity checks against each property’s single-item overall rating and (b) construct validity checks against PSS/affect outcomes. We also use reliability / dimensionality checks to confirm that it is statistically reasonable to collapse elements into a property score.
This how-to assumes the same column families used previously:
response_idage, city,
gender, income_us (or equivalent),
mobilitynumber_location_dailyhome_overall_num,
office_overall_num, park_overall_num,
education_overall_num, sport_overall_num,
hotel_overall_num, healthcare_overall_numhome_aspect_stress,
office_aspect_stress, park_aspect_stress,
education_space_stress, sport_aspect_stress,
hotel_aspect_stress,
healthcare_space_stresshome_aspect_restore,
office_aspect_restore, park_aspect_restore,
sport_aspect_restore,
hotel_aspect_restorepss4_1–pss4_4) and
short affect items (e.g., mood_tense,
mood_sad, mood_happy,
mood_calm)If any names differ, we update them once in prop_dict
and the outcome-mapping block.
Statistical rationale. Composite construction is extremely sensitive to mis-specified column groupings. If we accidentally omit a stressor family or include the wrong items, the resulting aggregate is not “slightly off”—it is a different measurement instrument. We therefore encode a transparent mapping and explicitly check column presence.
library(readxl)
library(dplyr)
library(tidyr)
library(stringr)
df <- read_excel("cleaned data.xlsx")
prop_dict <- tibble::tribble(
~property, ~stress_col, ~restor_col, ~overall_col,
"home", "home_aspect_stress", "home_aspect_restore", "home_overall_num",
"office", "office_aspect_stress", "office_aspect_restore", "office_overall_num",
"park", "park_aspect_stress", "park_aspect_restore", "park_overall_num",
"education", "education_space_stress", NA_character_, "education_overall_num",
"sport", "sport_aspect_stress", "sport_aspect_restore", "sport_overall_num",
"hotel", "hotel_aspect_stress", "hotel_aspect_restore", "hotel_overall_num",
"healthcare", "healthcare_space_stress", NA_character_, "healthcare_overall_num"
)
id_col <- "response_id"
places_col <- "number_location_daily"
required_cols <- c(id_col, places_col, prop_dict$stress_col, prop_dict$overall_col)
required_cols <- required_cols[!is.na(required_cols)]
missing_cols <- setdiff(required_cols, names(df))
stopifnot(length(missing_cols) == 0)
Statistical rationale. A check-all cell containing multiple selections is not a single variable; it is a sparse set of binary indicators. If we do not separate it, we (a) undercount exposure, (b) cannot test which elements drive overall ratings, and (c) cannot compute reliable composites.
Tests supported. Element prevalence; regression of overall ratings on elements; tetrachoric correlations; EFA/CFA.
one_hot_checkall <- function(data, id_col, checkall_col, prefix) {
if (is.na(checkall_col) || !(checkall_col %in% names(data))) return(NULL)
data %>%
select(all_of(id_col), all_of(checkall_col)) %>%
filter(!is.na(.data[[checkall_col]]), .data[[checkall_col]] != "") %>%
mutate(.option = .data[[checkall_col]]) %>%
separate_rows(.option, sep = ",") %>%
mutate(
.option = str_squish(.option),
.option = str_replace_all(.option, "[^A-Za-z0-9]+", "_"),
.option = str_replace_all(.option, "_+", "_"),
.option = str_replace_all(.option, "^_|_$", "")
) %>%
distinct(.data[[id_col]], .option) %>%
mutate(value = 1L) %>%
pivot_wider(
names_from = .option,
values_from = value,
values_fill = 0L,
names_prefix = paste0(prefix, "__")
)
}
Statistical rationale. Summing elements assumes the indicators can be collapsed into a single dimension (or at least a defensible burden index). We confirm coherence using reliability and dimensionality checks.
Concrete options. - Binary elements: tetrachoric correlations; EFA on tetrachoric matrix; KR-20 (or alpha as a proxy). - Likert elements: Cronbach’s alpha; omega; EFA/CFA.
If the first factor is dominant and reliability is acceptable, collapsing to a single property score is justified. If clearly multi-dimensional, we either keep subscales or justify a general factor only with evidence.
# Example for PARK stress after one-hot encoding:
# library(psych)
# park_vars <- grep("^park_stress__", names(df), value = TRUE)
# X <- df[, park_vars]
# tetra <- psych::tetrachoric(X)$rho
# fa1 <- psych::fa(tetra, nfactors = 1, fm = "ml")
# print(fa1)
# psych::alpha(X)
Statistical rationale. For check-all items, the
composite is naturally a burden count (a formative
index): endorsing more stressors implies greater burden. This supports
using rowSums().
Assumption. Equal contribution per stressor unless weighting is justified and stable.
stress_onehots <- list()
restor_onehots <- list()
for (i in seq_len(nrow(prop_dict))) {
p <- prop_dict$property[i]
stress_onehots[[p]] <- one_hot_checkall(df, id_col, prop_dict$stress_col[i], paste0(p, "_stress"))
restor_onehots[[p]] <- one_hot_checkall(df, id_col, prop_dict$restor_col[i], paste0(p, "_restor"))
}
for (p in names(stress_onehots)) if (!is.null(stress_onehots[[p]])) df <- df %>% left_join(stress_onehots[[p]], by = id_col)
for (p in names(restor_onehots)) if (!is.null(restor_onehots[[p]])) df <- df %>% left_join(restor_onehots[[p]], by = id_col)
for (i in seq_len(nrow(prop_dict))) {
p <- prop_dict$property[i]
st_vars <- grep(paste0("^", p, "_stress__"), names(df), value = TRUE)
rs_vars <- grep(paste0("^", p, "_restor__"), names(df), value = TRUE)
if (length(st_vars) > 0) df[[paste0(p, "_stress_score")]] <- rowSums(df[, st_vars], na.rm = TRUE)
if (length(rs_vars) > 0) df[[paste0(p, "_restor_score")]] <- rowSums(df[, rs_vars], na.rm = TRUE)
}
Statistical rationale. If the composite captures the major stressors/restoratives, it should align with overall appraisal. Because overall items are often ordinal, Spearman is a safe default; OLS regression provides an interpretable effect size and incremental validity.
Tests. Spearman correlations; regressions
overall ~ stress_score + restor_score; nested-model
comparisons.
validation <- prop_dict %>%
mutate(stress_score_col = paste0(property, "_stress_score"),
restor_score_col = paste0(property, "_restor_score")) %>%
rowwise() %>%
mutate(
spearman_stress = ifelse(
overall_col %in% names(df) && stress_score_col %in% names(df),
cor(df[[stress_score_col]], df[[overall_col]], use = "complete.obs", method = "spearman"),
NA_real_
),
spearman_restor = ifelse(
overall_col %in% names(df) && restor_score_col %in% names(df),
cor(df[[restor_score_col]], df[[overall_col]], use = "complete.obs", method = "spearman"),
NA_real_
)
) %>% ungroup()
validation
Statistical rationale. OLS coefficients can be unstable with correlated stressors. Ridge regression (alpha=0) is preferred for stable weights; lasso can be used for sparse selection but requires careful validation.
Tests. CV RMSE; coefficient stability; out-of-sample correlation with overall anchors.
# library(glmnet)
# park_overall <- "park_overall_num"
# park_vars <- grep("^park_stress__", names(df), value = TRUE)
# X <- model.matrix(~ . - 1, data = df[, park_vars, drop = FALSE])
# y <- df[[park_overall]]
# fit_ridge <- cv.glmnet(X, y, alpha = 0)
# w <- as.numeric(coef(fit_ridge, s = "lambda.min"))[-1]
# df$park_stress_weighted <- as.numeric(X %*% w)
Statistical rationale. Standardization prevents scale and variance differences from turning into implicit weights. This is standard practice in index construction when combining subscales.
stress_score_cols <- grep("_stress_score$", names(df), value = TRUE)
restor_score_cols <- grep("_restor_score$", names(df), value = TRUE)
for (c in stress_score_cols) df[[paste0(c, "_z")]] <- as.numeric(scale(df[[c]]))
for (c in restor_score_cols) df[[paste0(c, "_z")]] <- as.numeric(scale(df[[c]]))
Statistical rationale. Cumulative indices are sums
of standardized property scores, making the aggregate multi-context by
construction. We compute net balance as
stress − restoration but retain both components for
modeling.
stress_z_cols <- grep("_stress_score_z$", names(df), value = TRUE)
restor_z_cols <- grep("_restor_score_z$", names(df), value = TRUE)
df <- df %>%
mutate(
cumulative_stress = rowSums(across(all_of(stress_z_cols)), na.rm = TRUE),
cumulative_restor = rowSums(across(all_of(restor_z_cols)), na.rm = TRUE),
cumulative_net = cumulative_stress - cumulative_restor
)
Statistical rationale. We expect stress burden to associate with higher PSS and negative affect, and restoration to associate with lower PSS and higher positive affect. We also test incremental validity beyond demographics using nested-model ANOVA.
pss_items <- c("pss4_1","pss4_2","pss4_3","pss4_4")
df <- df %>% mutate(across(all_of(pss_items), ~ as.numeric(.) - 1))
df <- df %>% mutate(pss4_2_r = 4 - pss4_2, pss4_3_r = 4 - pss4_3)
df <- df %>% mutate(PSS_total = pss4_1 + pss4_2_r + pss4_3_r + pss4_4)
fit_demo <- lm(PSS_total ~ age + gender + income_us + city + mobility, data = df)
fit_full <- lm(PSS_total ~ age + gender + income_us + city + mobility + cumulative_stress + cumulative_restor, data = df)
anova(fit_demo, fit_full)
summary(fit_full)
Statistical rationale. Without time-in-place, moderation is the cleanest test: it asks whether burden translates to outcomes differently at higher exposure breadth.
fit_dose <- lm(PSS_total ~ cumulative_stress * number_location_daily + cumulative_restor, data = df)
summary(fit_dose)
Statistical rationale. This estimates expected cumulative stress given demographics, enabling an interface (input demographics → predicted burden). We validate with train/test or cross-validation for predictive performance.
age_levels <- c("18–24","25–34","35–44","45–54","55–64","65+")
df <- df %>% mutate(age_cohort = cut(age, breaks = c(17,24,34,44,54,64,150), labels = age_levels, right = TRUE))
fit_pred <- lm(cumulative_stress ~ age_cohort + city + gender + income_us + mobility, data = df)
summary(fit_pred)
new_person <- data.frame(
age_cohort = factor("35–44", levels = age_levels),
city = factor("New York City", levels = levels(df$city)),
gender = factor("Man", levels = levels(df$gender)),
income_us = factor("75k–100k", levels = levels(df$income_us)),
mobility = factor("Commuter", levels = levels(df$mobility))
)
predict(fit_pred, newdata = new_person, interval = "confidence")
We treat these as required because aggregation choices can change results.
# Drop-one-property sensitivity scaffold
# stress_z_cols <- grep("_stress_score_z$", names(df), value = TRUE)
# for (drop in stress_z_cols) {
# tmp <- setdiff(stress_z_cols, drop)
# df[[paste0("cum_stress_minus_", drop)]] <- rowSums(df[, tmp], na.rm = TRUE)
# }