Master analysis Rmd for the wellbeing paper preregistration AND the
dissertation Chapter 2 cuts. It reads the cleaned Qualtrics export
(Data/wellbeing.csv), applies exclusions, computes scale
scores, runs the prereg’s H1–H5 plus order effects, and then runs the
dissertation-specific Interest #1 (PIL vs. EPS-BTS purpose comparison)
and Interest #2 (L-BAM for TRWOC transfer).
Pipeline-to-prereg map:
| Section | Prereg item | Notes |
|---|---|---|
| Setup → Reliability | Indices, transformations | Reverse-keys per Ryff/BFI scoring keys |
| Embeddings | “Convergent validity” embedding spec | Layer 23, mean-pooled, both text + word_type aggregations |
| H1 train | H1 confirmatory CV | Penalty grid 10^seq(-6, 6), FDR-corrected, lambda boundary check |
| H1 disattenuated | H1 secondary | r / sqrt(α_observed) |
| H1 prospective | H1 sequential evaluation | Frozen model on N=100; benchmark r ≥ .45 lower bound |
| H1 pooled panel | Setup for H2–H5 | CV preds (N=400) + prospective preds (N=100) |
| H2 nomological | H2 | Two-tailed Pearson, FDR within predicted dim |
| H3 external | H3 i–iv (one-tailed, FDR), exploratory (two-tailed) | Religious commitment = Q202 ≥ 3 |
| H4 face validity | H4 | textProjection + LDA per construct + rater export packet |
| H5 discriminant | H5 + Brendan’s mixed-effects layer | 6×6 corr matrix (observed + disattenuated); lmer on z-scored sq err with emmeans for 30 contrasts; per-construct paired t-tests |
| Order effects | “Other planned analyses” | t-tests on scale means by condition |
| Interest #1 | Dissertation Ch2 | Purpose-language comparison |
| Interest #2 | Dissertation Ch2 | L-BAM build |
The launch began 2026-04-27. All earlier responses (Nov 2025 – Apr 2026 dev/pilot) are excluded by a pre-launch date filter in Section 4. On the current launch batch (N=40), the prereg analyses currently set to eval = FALSE are blocked on the following:

- v1_pre_republish responses are missing condition, ethnicity (Q190), Q199 employment, Q200 religious affiliation, Q201 relationship, and Q202 religion importance. Until the survey is republished and new responses come in, H3iv (religious commitment), H3 exploratory (relationship/employment), H2/H3 ethnicity reporting, and order effects all cannot run.
- clean has no wave column yet (1 = N=400 train, 2 = N=100 held-out). Add that upstream once wave 2 is collected.
- The embedding chunk is eval = FALSE; flip it when ready (10–30+ min per prompt).

library(tidyverse)
library(here)
library(janitor)
library(psych)
library(lavaan)
library(broom)
library(broom.mixed)
library(scales)
library(knitr)
library(gt)
library(lme4)
library(lmerTest)
library(emmeans)
results_dir <- here::here("Data", "results")
dir.create(results_dir, showWarnings = FALSE, recursive = TRUE)
theme_set(theme_minimal(base_size = 12))
brand_blue <- "#013d5aff"
brand_red <- "#cd5d5bff"
update_geom_defaults("point", list(colour = brand_blue))
update_geom_defaults("line", list(colour = brand_blue))
update_geom_defaults("bar", list(fill = brand_blue))
update_geom_defaults("col", list(fill = brand_blue))
update_geom_defaults("density",list(colour = brand_blue, fill = brand_blue, alpha = 0.3))
Qualtrics CSV exports include two metadata header rows above the actual data. We strip them, retype, and keep the original column names.
strip_qualtrics_header <- function(path) {
readr::read_csv(path, show_col_types = FALSE, na = c("", "NA")) |>
slice(-(1:2))
}
dat <- strip_qualtrics_header(here::here("Data", "wellbeing.csv")) |>
mutate(
across(c(`Duration (in seconds)`, Progress, Finished), as.numeric),
duration_min = `Duration (in seconds)` / 60,
StartDate = as.POSIXct(StartDate, tz = "America/Los_Angeles"),
EndDate = as.POSIXct(EndDate, tz = "America/Los_Angeles"),
RecordedDate = as.POSIXct(RecordedDate, tz = "America/Los_Angeles")
)
dat_labeled <- strip_qualtrics_header(here::here("Data", "wellbeing_labeled.csv"))
build_value_labels <- function(numeric_df, labeled_df) {
shared <- intersect(names(numeric_df), names(labeled_df))
out <- list()
for (q in shared) {
pairs <- tibble(num = as.character(numeric_df[[q]]),
lab = as.character(labeled_df[[q]])) |>
filter(!is.na(num), !is.na(lab), num != lab) |>
distinct()
if (nrow(pairs) > 0) out[[q]] <- setNames(pairs$lab, pairs$num)
}
out
}
value_labels <- build_value_labels(dat, dat_labeled)
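Each element of value_labels is a named character vector keyed by the Qualtrics numeric code. A minimal usage sketch (assuming the gender coding used later for H3, where "1" = Male and "2" = Female):

# Translate a coded column to its labels; codes absent from the lookup become NA.
head(value_labels$Q147[as.character(dat$Q147)])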
cat("Numeric rows:", nrow(dat), " Labeled rows:", nrow(dat_labeled), "\n")
## Numeric rows: 63 Labeled rows: 63
cat("Date range:", format(min(dat$RecordedDate, na.rm = TRUE)), "to",
format(max(dat$RecordedDate, na.rm = TRUE)), "\n")
## Date range: 2025-11-17 16:30:46 to 2026-04-27 22:01:32
cat("Questions with value labels resolved:", length(value_labels), "\n")
## Questions with value labels resolved: 102
This is the canonical mapping from Qualtrics question codes to scales/subscales for this study. Reverse-key status (verified against docs/survey_codebook.md, 2026-04-28):

- SWLS and HILS items are reverse = TRUE: Qualtrics built those scales with "Strongly Agree" as choice 1 and "Strongly Disagree" as choice 7, the opposite of convention. The launch-batch correlation pattern (SWLS/HILS correlate negatively with Ryff PWB and positively with PHQ-9/GAD-7 in raw data) confirms the flip.
- All other items are reverse = FALSE, except the Ryff and BFI-10 items marked below per their published scoring keys. Reverse-keying recodes a response x to (likert_min + likert_max) - x (see score_items below).

scale_map <- tribble(
~scale, ~subscale, ~item, ~likert_min, ~likert_max, ~reverse,
"EPS", "meaningfulness", "QID181", 1, 5, FALSE,
"EPS", "meaningfulness", "Q189", 1, 5, FALSE,
"EPS", "meaningfulness", "Q188", 1, 5, FALSE,
"EPS", "commitment", "Q187", 1, 5, FALSE,
"EPS", "commitment", "Q186", 1, 5, FALSE,
"EPS", "commitment", "Q185", 1, 5, FALSE,
"EPS", "beyond_self", "Q184", 1, 5, FALSE,
"EPS", "beyond_self", "Q183", 1, 5, FALSE,
"EPS", "beyond_self", "Q182", 1, 5, FALSE,
"SWLS", "swls", "Q61", 1, 7, TRUE,
"SWLS", "swls", "Q63", 1, 7, TRUE,
"SWLS", "swls", "Q64", 1, 7, TRUE,
"HILS", "hils", "Q94", 1, 7, TRUE,
"HILS", "hils", "Q95", 1, 7, TRUE,
"HILS", "hils", "Q96", 1, 7, TRUE,
"Ryff", "autonomy", "Q161", 1, 6, FALSE,
"Ryff", "autonomy", "Q178", 1, 6, FALSE,
"Ryff", "autonomy", "Q177", 1, 6, FALSE,
"Ryff", "env_mastery", "Q176", 1, 6, FALSE,
"Ryff", "env_mastery", "Q175", 1, 6, TRUE,
"Ryff", "env_mastery", "Q174", 1, 6, FALSE,
"Ryff", "personal_growth", "Q173", 1, 6, FALSE,
"Ryff", "personal_growth", "Q172", 1, 6, TRUE,
"Ryff", "personal_growth", "Q171", 1, 6, FALSE,
"Ryff", "pos_relations", "Q170", 1, 6, FALSE,
"Ryff", "pos_relations", "Q169", 1, 6, FALSE,
"Ryff", "pos_relations", "Q168", 1, 6, TRUE,
"Ryff", "self_acceptance", "Q167", 1, 6, FALSE,
"Ryff", "self_acceptance", "Q166", 1, 6, FALSE,
"Ryff", "self_acceptance", "Q165", 1, 6, FALSE,
"Ryff", "purpose_in_life", "Q164", 1, 6, FALSE,
"Ryff", "purpose_in_life", "Q163", 1, 6, FALSE,
"Ryff", "purpose_in_life", "Q179", 1, 6, FALSE,
"PANAS", "positive_affect", "interested", 1, 5, FALSE,
"PANAS", "positive_affect", "Q88", 1, 5, FALSE,
"PANAS", "positive_affect", "Q87", 1, 5, FALSE,
"PANAS", "positive_affect", "Q86", 1, 5, FALSE,
"PANAS", "positive_affect", "Q85", 1, 5, FALSE,
"PANAS", "positive_affect", "Q84", 1, 5, FALSE,
"PANAS", "positive_affect", "Q83", 1, 5, FALSE,
"PANAS", "positive_affect", "Q82", 1, 5, FALSE,
"PANAS", "positive_affect", "Q81", 1, 5, FALSE,
"PANAS", "positive_affect", "Q80", 1, 5, FALSE,
"PANAS", "negative_affect", "Q79", 1, 5, FALSE,
"PANAS", "negative_affect", "Q78", 1, 5, FALSE,
"PANAS", "negative_affect", "Q77", 1, 5, FALSE,
"PANAS", "negative_affect", "Q76", 1, 5, FALSE,
"PANAS", "negative_affect", "Q75", 1, 5, FALSE,
"PANAS", "negative_affect", "Q74", 1, 5, FALSE,
"PANAS", "negative_affect", "Q73", 1, 5, FALSE,
"PANAS", "negative_affect", "Q72", 1, 5, FALSE,
"PANAS", "negative_affect", "Q71", 1, 5, FALSE,
"PANAS", "negative_affect", "Q70", 1, 5, FALSE,
"Agency", "agency", "Q89", 1, 8, FALSE,
"Agency", "agency", "Q92", 1, 8, FALSE,
"Agency", "agency", "Q91", 1, 8, FALSE,
"Agency", "agency", "Q90", 1, 8, FALSE,
"PHQ9", "phq9", "Q101", 1, 4, FALSE,
"PHQ9", "phq9", "Q110", 1, 4, FALSE,
"PHQ9", "phq9", "Q114", 1, 4, FALSE,
"PHQ9", "phq9", "Q115", 1, 4, FALSE,
"PHQ9", "phq9", "Q113", 1, 4, FALSE,
"PHQ9", "phq9", "Q112", 1, 4, FALSE,
"PHQ9", "phq9", "Q111", 1, 4, FALSE,
"PHQ9", "phq9", "Q117", 1, 4, FALSE,
"PHQ9", "phq9", "Q116", 1, 4, FALSE,
"BFI10", "bfi_extraversion", "Q127", 1, 5, TRUE,
"BFI10", "bfi_agreeableness", "Q128", 1, 5, FALSE,
"BFI10", "bfi_conscientious", "Q137", 1, 5, TRUE,
"BFI10", "bfi_neuroticism", "Q136", 1, 5, TRUE,
"BFI10", "bfi_openness", "Q135", 1, 5, TRUE,
"BFI10", "bfi_extraversion", "Q134", 1, 5, FALSE,
"BFI10", "bfi_agreeableness", "Q133", 1, 5, TRUE,
"BFI10", "bfi_conscientious", "Q132", 1, 5, FALSE,
"BFI10", "bfi_neuroticism", "Q131", 1, 5, FALSE,
"BFI10", "bfi_openness", "Q130", 1, 5, FALSE,
"GAD7", "gad7", "Q139", 1, 4, FALSE,
"GAD7", "gad7", "Q145", 1, 4, FALSE,
"GAD7", "gad7", "Q144", 1, 4, FALSE,
"GAD7", "gad7", "Q143", 1, 4, FALSE,
"GAD7", "gad7", "Q142", 1, 4, FALSE,
"GAD7", "gad7", "Q141", 1, 4, FALSE,
"GAD7", "gad7", "Q140", 1, 4, FALSE,
"Responsibility","responsibility", "Q192", 1, 4, FALSE,
"Responsibility","responsibility", "Q193", 1, 4, FALSE,
"Responsibility","responsibility", "Q194", 1, 4, FALSE
)
text_prompts <- tribble(
~prompt_var, ~purpose,
"purpose_goal", "Goal aspiration (EPS-BTS prompt)",
"purpose_goal_reason", "Why this goal (purpose motivation)",
"purpose_goal_action", "What you're doing (commitment)",
"selfacceptance_text", "Ryff Self-Acceptance bidirectional",
"confidence_text", "Confidence/Autonomy unidirectional",
"environmental_text", "Ryff Environmental Mastery bidirectional",
"posrel_text", "Ryff Positive Relations bidirectional",
"autonomy_text", "Ryff Autonomy bidirectional",
"purposeinlife_text", "Ryff Purpose in Life bidirectional",
"personalgrowth_text", "Ryff Personal Growth bidirectional",
"college_goals_text", "TRWOC-mirrored: college goals",
"transformative_text", "TRWOC-mirrored: transformation"
)
scale_map |>
count(scale, subscale) |>
gt() |>
tab_header(title = "Scale items in the survey")
| Scale items in the survey | ||
| scale | subscale | n |
|---|---|---|
| Agency | agency | 4 |
| BFI10 | bfi_agreeableness | 2 |
| BFI10 | bfi_conscientious | 2 |
| BFI10 | bfi_extraversion | 2 |
| BFI10 | bfi_neuroticism | 2 |
| BFI10 | bfi_openness | 2 |
| EPS | beyond_self | 3 |
| EPS | commitment | 3 |
| EPS | meaningfulness | 3 |
| GAD7 | gad7 | 7 |
| HILS | hils | 3 |
| PANAS | negative_affect | 10 |
| PANAS | positive_affect | 10 |
| PHQ9 | phq9 | 9 |
| Responsibility | responsibility | 3 |
| Ryff | autonomy | 3 |
| Ryff | env_mastery | 3 |
| Ryff | personal_growth | 3 |
| Ryff | pos_relations | 3 |
| Ryff | purpose_in_life | 3 |
| Ryff | self_acceptance | 3 |
| SWLS | swls | 3 |
Build per-respondent flags. We flag rather than auto-drop so the exclusion rules are visible.
attention_check_var <- "Q125"
attention_correct <- 5
launch_date <- as.Date("2026-04-27")
dat_text_chars <- dat |>
rowwise() |>
mutate(
text_total_chars = sum(nchar(c_across(any_of(text_prompts$prompt_var))), na.rm = TRUE),
text_n_nonempty = sum(nchar(c_across(any_of(text_prompts$prompt_var))) > 0, na.rm = TRUE)
) |>
ungroup() |>
select(ResponseId, text_total_chars, text_n_nonempty)
junk_pattern <- "^(asdf|sdf|sadf|asdfd|qwer|test|none?|na)\\.?$"
dat_flagged <- dat |>
left_join(dat_text_chars, by = "ResponseId") |>
mutate(
survey_version = case_when(
!is.na(condition) | !is.na(Q202) | !is.na(Q199) | !is.na(Q200) | !is.na(Q201) ~ "v2_post_republish",
TRUE ~ "v1_pre_republish"
),
flag_pre_launch = as.Date(RecordedDate) < launch_date,
flag_preview = DistributionChannel == "preview",
flag_unfinished = is.na(Finished) | Finished != 1 | Progress < 100,
flag_too_fast = duration_min < 5,
flag_too_slow = duration_min > 60,
flag_attention = !is.na(.data[[attention_check_var]]) &
as.numeric(.data[[attention_check_var]]) != attention_correct,
flag_junk_text = rowSums(across(
any_of(text_prompts$prompt_var),
~ str_detect(tolower(replace_na(.x, "")), junk_pattern)
)) >= 3,
flag_short_text = text_total_chars < 200,
flag_any_exclude = flag_pre_launch | flag_preview | flag_unfinished |
flag_too_fast | flag_junk_text
)
dat_flagged |>
count(survey_version) |>
gt() |>
tab_header(title = "Survey version split (detected from field presence)",
subtitle = "v1 = collected before Qualtrics republish (missing condition + new demos); v2 = post-republish")
| Survey version split (detected from field presence) | |
| v1 = collected before Qualtrics republish (missing condition + new demos); v2 = post-republish | |
| survey_version | n |
|---|---|
| v1_pre_republish | 61 |
| v2_post_republish | 2 |
dat_flagged |>
summarise(
n_total = n(),
n_pre_launch = sum(flag_pre_launch),
n_preview = sum(flag_preview),
n_unfinished = sum(flag_unfinished),
n_too_fast = sum(flag_too_fast),
n_too_slow = sum(flag_too_slow),
n_attention_fail = sum(flag_attention, na.rm = TRUE),
n_junk_text = sum(flag_junk_text),
n_short_text = sum(flag_short_text),
n_excluded = sum(flag_any_exclude),
n_kept = sum(!flag_any_exclude)
) |>
pivot_longer(everything(), names_to = "criterion", values_to = "n") |>
gt() |>
tab_header(title = "Quality flag summary")
| Quality flag summary | |
| criterion | n |
|---|---|
| n_total | 63 |
| n_pre_launch | 23 |
| n_preview | 5 |
| n_unfinished | 3 |
| n_too_fast | 5 |
| n_too_slow | 4 |
| n_attention_fail | 7 |
| n_junk_text | 2 |
| n_short_text | 5 |
| n_excluded | 23 |
| n_kept | 40 |
completion_time_tbl <- bind_rows(
dat_flagged |> mutate(sample = "All responses"),
dat_flagged |> filter(!flag_any_exclude) |> mutate(sample = "Analytic sample"),
dat_flagged |> filter(flag_any_exclude) |> mutate(sample = "Excluded")
) |>
group_by(sample) |>
summarise(
n = n(),
median_min = median(duration_min, na.rm = TRUE),
max_min = max(duration_min, na.rm = TRUE),
.groups = "drop"
) |>
mutate(sample = factor(sample, levels = c("All responses", "Analytic sample", "Excluded"))) |>
arrange(sample)
completion_time_tbl |>
gt() |>
tab_header(title = "Completion time summary",
subtitle = "Minutes, from Qualtrics duration") |>
fmt_number(columns = c(median_min, max_min), decimals = 1)
| Completion time summary | |||
| Minutes, from Qualtrics duration | |||
| sample | n | median_min | max_min |
|---|---|---|---|
| All responses | 63 | 23.6 | 77.0 |
| Analytic sample | 40 | 22.1 | 77.0 |
| Excluded | 23 | 25.0 | 68.5 |
dat_flagged |>
filter(flag_any_exclude | flag_attention | flag_short_text | flag_too_slow) |>
select(ResponseId, DistributionChannel, Progress, duration_min,
text_total_chars, interested,
starts_with("flag_")) |>
gt() |>
tab_header(title = "Responses tripping any quality flag",
subtitle = "Inspect before applying exclusions")
| Responses tripping any quality flag | ||||||||||||||
| Inspect before applying exclusions | ||||||||||||||
| ResponseId | DistributionChannel | Progress | duration_min | text_total_chars | interested | flag_pre_launch | flag_preview | flag_unfinished | flag_too_fast | flag_too_slow | flag_attention | flag_junk_text | flag_short_text | flag_any_exclude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| R_3Bu1hK1zYqDsM86 | preview | 100 | 0.350 | 0 | NA | TRUE | TRUE | FALSE | TRUE | FALSE | FALSE | FALSE | TRUE | TRUE |
| R_1CE0QON8Nm8OWUy | anonymous | 100 | 15.200 | 429 | 4 | TRUE | FALSE | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE | TRUE |
| R_7TKTAsrbYYfbBoL | anonymous | 100 | 7.550 | 249 | 3 | TRUE | FALSE | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE | TRUE |
| R_8LeN190VWzrOSVH | anonymous | 100 | 24.467 | 354 | 3 | TRUE | FALSE | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE | TRUE |
| R_214npk8n9eAFLQW | anonymous | 100 | 37.233 | 822 | 4 | TRUE | FALSE | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE | TRUE |
| R_1YxhmhI9ebtfKen | anonymous | 100 | 25.033 | 513 | 2 | TRUE | FALSE | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE | TRUE |
| R_5HxEq14dgBns2FH | preview | 100 | 1.467 | 16 | NA | TRUE | TRUE | FALSE | TRUE | FALSE | FALSE | FALSE | TRUE | TRUE |
| R_34d4np5PTGcD2Zj | anonymous | 100 | 13.900 | 1096 | NA | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE |
| R_9r6Or7W8GYnYreN | anonymous | 100 | 21.200 | 1936 | 4 | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE |
| R_2FXVfJYbE5r4mUV | anonymous | 100 | 25.883 | 2785 | 2 | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE |
| R_6KuBlG28QSUOAHu | anonymous | 100 | 25.583 | 2949 | 3 | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE |
| R_6ISePjAjPLwhylk | anonymous | 100 | 27.633 | 2152 | 3 | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE |
| R_5WHLTgLvBtEjv8t | anonymous | 100 | 40.983 | 2491 | 3 | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE |
| R_5Cd78srxQDU4SlN | anonymous | 100 | 32.867 | 4673 | 4 | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE |
| R_7fxtcq5udPpmGOL | anonymous | 100 | 68.467 | 2741 | 2 | TRUE | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | TRUE |
| R_6HiaYG3kQ6n0KzZ | anonymous | 100 | 67.050 | 3661 | 4 | TRUE | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | TRUE |
| R_1jNueOqRkJAMTot | anonymous | 100 | 34.267 | 3408 | 4 | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE |
| R_57h5OopJnvxY5gd | anonymous | 23 | 8.333 | 316 | NA | TRUE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE |
| R_5sBVbjYpQJoC01z | anonymous | 24 | 34.333 | 2731 | NA | TRUE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE |
| R_12RZjVmrsd0zOOV | anonymous | 27 | 30.883 | 2096 | NA | TRUE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE |
| R_6LY158qtkL39bPx | preview | 100 | 0.383 | 0 | NA | TRUE | TRUE | FALSE | TRUE | FALSE | FALSE | FALSE | TRUE | TRUE |
| R_60ZosKHqL4uamZV | preview | 100 | 4.833 | 45 | 3 | TRUE | TRUE | FALSE | TRUE | FALSE | TRUE | TRUE | TRUE | TRUE |
| R_6e4WeAXFgzGL1Zl | preview | 100 | 4.200 | 33 | 2 | TRUE | TRUE | FALSE | TRUE | FALSE | TRUE | TRUE | TRUE | TRUE |
| R_6IWrMKVp7EH3QEp | anonymous | 100 | 77.000 | 2397 | 2 | FALSE | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE |
| R_5CwiWoKtvBh3iaS | anonymous | 100 | 73.467 | 4828 | 5 | FALSE | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE |
The exclusion rule we apply is conservative: drop preview, unfinished, too-fast, and obvious junk text. Attention-check failures and short-text responses are kept but flagged for sensitivity analyses (a stricter sample is sketched after the cleaning block below).
clean <- dat_flagged |>
filter(!flag_any_exclude) |>
mutate(across(all_of(scale_map$item), as.numeric))
cat("Cleaned N:", nrow(clean), "\n")
## Cleaned N: 40
cat("By survey version:\n")
## By survey version:
print(table(clean$survey_version))
##
## v1_pre_republish
## 40
readr::write_csv(clean, here::here("Data", "wellbeing_clean.csv"))
flow <- tibble(
step = c("Total responses",
sprintf("− Pre-launch (before %s)", format(launch_date)),
"− Preview / admin",
"− Unfinished or <100% progress",
"− Duration < 5 minutes",
"− Junk text (≥3 prompts gibberish)",
"= Analytic sample"),
n = c(
nrow(dat),
-sum(dat_flagged$flag_pre_launch),
-sum(dat_flagged$flag_preview & !dat_flagged$flag_pre_launch),
-sum(dat_flagged$flag_unfinished & !dat_flagged$flag_pre_launch &
!dat_flagged$flag_preview),
-sum(dat_flagged$flag_too_fast & !dat_flagged$flag_pre_launch &
!dat_flagged$flag_preview & !dat_flagged$flag_unfinished),
-sum(dat_flagged$flag_junk_text & !dat_flagged$flag_pre_launch &
!dat_flagged$flag_preview & !dat_flagged$flag_unfinished &
!dat_flagged$flag_too_fast),
nrow(clean)
)
)
flow |> gt() |> tab_header(title = "Sample-flow accounting")
| Sample-flow accounting | |
| step | n |
|---|---|
| Total responses | 63 |
| − Pre-launch (before 2026-04-27) | -23 |
| − Preview / admin | 0 |
| − Unfinished or <100% progress | 0 |
| − Duration < 5 minutes | 0 |
| − Junk text (≥3 prompts gibberish) | 0 |
| = Analytic sample | 40 |
demog_summary <- function(varname, var, lookup) {
  if (varname == "Q146") {
    x <- as.numeric(var)
    return(tibble(level = "Age (mean ± sd)",
                  n = sprintf("%.1f ± %.1f", mean(x, na.rm = TRUE), sd(x, na.rm = TRUE))))
  }
  # Tolerate questions whose value labels did not resolve (lookup = NULL).
  if (is.null(lookup)) lookup <- character()
  tibble(level = as.character(var)) |>
    count(level) |>
    mutate(level = case_when(
      is.na(level) ~ "Missing",
      level %in% names(lookup) ~ unname(lookup[level]),
      TRUE ~ level
    ),
    n = as.character(n))
}
demog_tbl <- bind_rows(
demog_summary("Q146", clean$Q146, NULL) |> mutate(var = "Age"),
demog_summary("Q147", clean$Q147, value_labels$Q147) |> mutate(var = "Gender"),
demog_summary("Q148", clean$Q148, value_labels$Q148) |> mutate(var = "Financial situation"),
demog_summary("Q190", clean$Q190, value_labels$Q190) |> mutate(var = "Ethnicity"),
demog_summary("highest_education", clean$highest_education, value_labels$highest_education) |>
mutate(var = "Education"),
demog_summary("in_college", clean$in_college, value_labels$in_college) |>
mutate(var = "Currently a student")
) |>
select(var, level, n)
demog_tbl |>
gt(groupname_col = "var") |>
tab_header(title = "Demographic composition (analytic sample)",
subtitle = "Value labels pulled from wellbeing_labeled.csv")
| Demographic composition (analytic sample) | |
| Value labels pulled from wellbeing_labeled.csv | |
| level | n |
|---|---|
| Age | |
| Age (mean ± sd) | 31.9 ± 11.0 |
| Gender | |
| Male | 17 |
| Female | 22 |
| Prefer not to say | 1 |
| Financial situation | |
| 5 - Living comfortably | 7 |
| 4 - Doing well financially | 5 |
| 3 - Doing okay financially | 4 |
| 4 | 8 |
| 2 - Difficult to get by | 12 |
| 1 - Finding it very difficult to get by | 3 |
| Prefer not to say | 1 |
| Ethnicity | |
| Missing | 40 |
| Education | |
| High school degree or GED | 5 |
| Some college but no degree | 10 |
| Associate's degree | 5 |
| Certificate or technical degree | 1 |
| Bachelor's degree | 13 |
| Some graduate or professional school | 1 |
| Completed Master's, Doctorate, or Professional degree (post-Bachelor's) | 5 |
| Currently a student | |
| No | 21 |
| Yes | 19 |
# score_items(): item matrix with reverse-keyed items recoded to
# (likert_min + likert_max) - x, per the scale_map flags.
score_items <- function(data, items_df) {
  items <- items_df$item
  rev <- items_df$reverse
  mins <- items_df$likert_min
  maxs <- items_df$likert_max
  m <- as.matrix(data[, items, drop = FALSE])
  for (j in seq_along(items)) {
    if (isTRUE(rev[j])) m[, j] <- (mins[j] + maxs[j]) - m[, j]
  }
  m
}
# score_subscale(): row mean of answered items; all-NA rows return NA, not NaN.
score_subscale <- function(data, items_df) {
  m <- score_items(data, items_df)
  out <- rowMeans(m, na.rm = TRUE)
  out[is.nan(out)] <- NA_real_
  out
}
subscales <- scale_map |>
group_by(scale, subscale) |>
group_split() |>
(\(x) set_names(x, map_chr(x, ~ paste(.x$scale[1], .x$subscale[1], sep = "_"))))()
scores <- map_dfc(subscales, ~ score_subscale(clean, .x)) |>
mutate(ResponseId = clean$ResponseId, .before = 1)
scored <- clean |>
select(ResponseId, all_of(text_prompts$prompt_var),
Q146, Q147, Q148, Q190, highest_education, in_college,
condition, interested, duration_min, text_total_chars) |>
left_join(scores, by = "ResponseId")
readr::write_csv(scored, here::here("Data", "wellbeing_scored.csv"))
scored |> select(starts_with(c("EPS_","Ryff_","SWLS_","HILS_","PANAS_","Agency_","PHQ9_","BFI10_","GAD7_"))) |>
psych::describe() |>
as_tibble(rownames = "scale") |>
select(scale, n, mean, sd, min, max, skew, kurtosis) |>
gt() |>
tab_header(title = "Scale-score descriptives") |>
fmt_number(columns = c(mean, sd, min, max, skew, kurtosis), decimals = 2)
| Scale-score descriptives | |||||||
| scale | n | mean | sd | min | max | skew | kurtosis |
|---|---|---|---|---|---|---|---|
| EPS_beyond_self | 40 | 4.12 | 1.17 | 1.00 | 5.00 | −1.38 | 0.60 |
| EPS_commitment | 40 | 4.24 | 0.71 | 1.67 | 5.00 | −1.84 | 4.24 |
| EPS_meaningfulness | 40 | 4.41 | 0.70 | 1.67 | 5.00 | −1.85 | 4.15 |
| Ryff_autonomy | 40 | 4.38 | 0.75 | 2.67 | 6.00 | 0.29 | −0.39 |
| Ryff_env_mastery | 40 | 4.04 | 0.90 | 2.67 | 6.00 | 0.22 | −0.97 |
| Ryff_personal_growth | 40 | 5.11 | 0.74 | 3.33 | 6.00 | −0.81 | 0.20 |
| Ryff_pos_relations | 40 | 4.42 | 0.86 | 2.00 | 6.00 | −0.58 | 0.24 |
| Ryff_purpose_in_life | 40 | 4.53 | 0.92 | 2.33 | 6.00 | −0.43 | −0.62 |
| Ryff_self_acceptance | 40 | 4.32 | 1.11 | 1.33 | 6.00 | −0.76 | 0.09 |
| SWLS_swls | 40 | 4.47 | 1.44 | 1.67 | 6.67 | −0.40 | −1.11 |
| HILS_hils | 40 | 4.44 | 1.43 | 2.00 | 7.00 | −0.21 | −1.21 |
| PANAS_negative_affect | 40 | 2.03 | 0.89 | 1.00 | 4.70 | 0.95 | 0.45 |
| PANAS_positive_affect | 40 | 3.20 | 0.81 | 1.50 | 4.80 | −0.25 | −0.69 |
| Agency_agency | 40 | 5.70 | 1.16 | 3.00 | 7.50 | −0.25 | −0.63 |
| PHQ9_phq9 | 40 | 1.86 | 0.71 | 1.00 | 4.00 | 1.13 | 1.01 |
| BFI10_bfi_agreeableness | 40 | 3.46 | 1.04 | 1.00 | 5.00 | −0.50 | −0.68 |
| BFI10_bfi_conscientious | 40 | 3.80 | 0.99 | 2.00 | 5.00 | −0.33 | −1.28 |
| BFI10_bfi_extraversion | 40 | 2.71 | 1.07 | 1.00 | 4.50 | 0.21 | −1.26 |
| BFI10_bfi_neuroticism | 40 | 3.23 | 1.24 | 1.00 | 5.00 | −0.36 | −1.11 |
| BFI10_bfi_openness | 40 | 3.74 | 0.99 | 1.00 | 5.00 | −0.58 | −0.19 |
| GAD7_gad7 | 40 | 2.04 | 0.84 | 1.00 | 4.00 | 0.57 | −0.84 |
pwb_dims <- c("self_acceptance","autonomy","env_mastery",
"pos_relations","purpose_in_life","personal_growth")
Reliability and distributional summaries per subscale. With pilot N these estimates are noisy — treat as sanity checks, not final values. Omega is McDonald’s omega total from a one-factor solution. Cronbach’s alpha is retained as the primary reliability coefficient for disattenuated correlations to match the preregistered formula; omega total is reported descriptively and can be used as a sensitivity check.
mean_or_na <- function(x) {
x <- x[is.finite(x)]
if (length(x) == 0) NA_real_ else mean(x)
}
omega_total <- function(m) {
if (ncol(m) < 2 || nrow(m) < 5) return(NA_real_)
res <- tryCatch(
suppressWarnings(suppressMessages(psych::omega(m, nfactors = 1, plot = FALSE))),
error = function(e) NULL
)
if (is.null(res) || is.null(res$omega.tot)) NA_real_ else unname(res$omega.tot)
}
scale_metrics <- function(data, items_df) {
m <- score_items(data, items_df)
score <- rowMeans(m, na.rm = TRUE)
score[is.nan(score)] <- NA_real_
ok <- complete.cases(m)
m_ok <- m[ok, , drop = FALSE]
alpha_res <- tryCatch(
suppressWarnings(psych::alpha(m_ok, warnings = FALSE, check.keys = FALSE)),
error = function(e) NULL
)
item_cors <- if (ncol(m_ok) >= 2 && nrow(m_ok) >= 5) {
r <- suppressWarnings(cor(m_ok, use = "pairwise.complete.obs"))
r[lower.tri(r)]
} else {
NA_real_
}
corrected_item_total <- if (ncol(m_ok) >= 2 && nrow(m_ok) >= 5) {
map_dbl(seq_len(ncol(m_ok)), function(j) {
item <- m_ok[, j]
total_without_item <- rowMeans(m_ok[, -j, drop = FALSE])
if (sd(item) == 0 || sd(total_without_item) == 0) {
NA_real_
} else {
cor(item, total_without_item)
}
})
} else {
NA_real_
}
tibble(
n_items = nrow(items_df),
n_complete = sum(ok),
alpha = if (is.null(alpha_res)) NA_real_ else unname(alpha_res$total$raw_alpha),
omega_total = omega_total(m_ok),
mean_inter_item_r = mean_or_na(item_cors),
mean_corrected_item_total_r = mean_or_na(corrected_item_total),
mean = mean(score, na.rm = TRUE),
sd = sd(score, na.rm = TRUE),
skew = psych::skew(score, na.rm = TRUE),
kurtosis = psych::kurtosi(score, na.rm = TRUE)
)
}
scale_descriptives_tbl <- scale_map |>
group_by(scale, subscale) |>
group_modify(~ scale_metrics(clean, .x)) |>
ungroup()
alpha_tbl <- scale_descriptives_tbl
readr::write_csv(scale_descriptives_tbl,
here::here("Data", "results", "scale_reliability_descriptives.csv"))
reliability_gt <- scale_descriptives_tbl |>
gt(groupname_col = "scale") |>
cols_label(
subscale = "Subscale",
n_items = "k items",
n_complete = "n",
alpha = "α",
omega_total = "ω",
mean_inter_item_r = "mean r (inter-item)",
mean_corrected_item_total_r = "mean r (item-total)",
mean = "M",
sd = "SD",
skew = "Skew",
kurtosis = "Kurt"
) |>
tab_spanner(label = "Reliability",
columns = c(alpha, omega_total,
mean_inter_item_r, mean_corrected_item_total_r)) |>
tab_spanner(label = "Distribution",
columns = c(mean, sd, skew, kurtosis)) |>
tab_header(
title = "Scale reliability and descriptives (pilot)",
subtitle = "α is the primary reliability for disattenuation; ω (McDonald's total) is sensitivity. Inter-item r and corrected item-total r are unweighted means across items in the subscale. M/SD on the subscale-mean score."
) |>
fmt_number(
columns = c(alpha, omega_total, mean_inter_item_r, mean_corrected_item_total_r,
mean, sd, skew, kurtosis),
decimals = 2
) |>
sub_missing(missing_text = "—") |>
data_color(
columns = alpha,
fn = function(x) {
dplyr::case_when(
is.na(x) ~ "#f0f0f0",
x >= 0.70 ~ "#cfe8d6",
x >= 0.50 ~ "#fff3cd",
TRUE ~ "#f4cdcd"
)
}
)
reliability_gt
| Scale reliability and descriptives (pilot) | ||||||||||
| α is the primary reliability for disattenuation; ω (McDonald's total) is sensitivity. Inter-item r and corrected item-total r are unweighted means across items in the subscale. M/SD on the subscale-mean score. | ||||||||||
| Subscale | k items | n | α | ω | mean r (inter-item) | mean r (item-total) | M | SD | Skew | Kurt |
|---|---|---|---|---|---|---|---|---|---|---|
| Agency | ||||||||||
| agency | 4 | 40 | 0.79 | 0.80 | 0.49 | 0.61 | 5.70 | 1.16 | −0.25 | −0.63 |
| BFI10 | ||||||||||
| bfi_agreeableness | 2 | 40 | 0.47 | 0.47 | 0.31 | 0.31 | 3.46 | 1.04 | −0.50 | −0.68 |
| bfi_conscientious | 2 | 40 | 0.44 | 0.52 | 0.35 | 0.35 | 3.80 | 0.99 | −0.33 | −1.28 |
| bfi_extraversion | 2 | 40 | 0.64 | 0.65 | 0.48 | 0.48 | 2.71 | 1.07 | 0.21 | −1.26 |
| bfi_neuroticism | 2 | 40 | 0.85 | 0.86 | 0.75 | 0.75 | 3.23 | 1.24 | −0.36 | −1.11 |
| bfi_openness | 2 | 39 | 0.46 | 0.46 | 0.30 | 0.30 | 3.74 | 0.99 | −0.58 | −0.19 |
| EPS | ||||||||||
| beyond_self | 3 | 40 | 0.95 | 0.95 | 0.85 | 0.89 | 4.12 | 1.17 | −1.38 | 0.60 |
| commitment | 3 | 40 | 0.66 | 0.71 | 0.39 | 0.49 | 4.24 | 0.71 | −1.84 | 4.24 |
| meaningfulness | 3 | 40 | 0.88 | 0.88 | 0.72 | 0.77 | 4.41 | 0.70 | −1.85 | 4.15 |
| GAD7 | ||||||||||
| gad7 | 7 | 38 | 0.92 | 0.92 | 0.63 | 0.76 | 2.04 | 0.84 | 0.57 | −0.84 |
| HILS | ||||||||||
| hils | 3 | 40 | 0.90 | 0.91 | 0.76 | 0.81 | 4.44 | 1.43 | −0.21 | −1.21 |
| PANAS | ||||||||||
| negative_affect | 10 | 40 | 0.95 | 0.95 | 0.66 | 0.79 | 2.03 | 0.89 | 0.95 | 0.45 |
| positive_affect | 10 | 39 | 0.91 | 0.91 | 0.51 | 0.68 | 3.20 | 0.81 | −0.25 | −0.69 |
| PHQ9 | ||||||||||
| phq9 | 9 | 39 | 0.91 | 0.91 | 0.53 | 0.69 | 1.86 | 0.71 | 1.13 | 1.01 |
| Responsibility | ||||||||||
| responsibility | 3 | 0 | — | — | — | — | — | — | — | — |
| Ryff | ||||||||||
| autonomy | 3 | 40 | 0.39 | 0.41 | 0.18 | 0.24 | 4.38 | 0.75 | 0.29 | −0.39 |
| env_mastery | 3 | 40 | 0.63 | 0.67 | 0.36 | 0.45 | 4.04 | 0.90 | 0.22 | −0.97 |
| personal_growth | 3 | 39 | 0.53 | 0.56 | 0.29 | 0.35 | 5.11 | 0.74 | −0.81 | 0.20 |
| pos_relations | 3 | 40 | 0.54 | 0.66 | 0.31 | 0.37 | 4.42 | 0.86 | −0.58 | 0.24 |
| purpose_in_life | 3 | 39 | 0.74 | 0.78 | 0.51 | 0.58 | 4.53 | 0.92 | −0.43 | −0.62 |
| self_acceptance | 3 | 40 | 0.83 | 0.86 | 0.64 | 0.71 | 4.32 | 1.11 | −0.76 | 0.09 |
| SWLS | ||||||||||
| swls | 3 | 40 | 0.83 | 0.84 | 0.62 | 0.69 | 4.47 | 1.44 | −0.40 | −1.11 |
needed_pkgs <- setdiff(c("webshot2", "chromote"), rownames(installed.packages()))
if (length(needed_pkgs)) install.packages(needed_pkgs)
gt::gtsave(reliability_gt,
here::here("Data", "results", "scale_reliability.html"))
tryCatch(
gt::gtsave(reliability_gt,
here::here("Data", "results", "scale_reliability.png"),
vwidth = 1400, vheight = 1000, expand = 5),
error = function(e) {
message("PNG save failed (",
conditionMessage(e),
"). HTML version is still available at Data/results/scale_reliability.html — ",
"open in a browser and File → Print → Save as PDF for a slide-ready view.")
}
)
ryff_model <- "
SA =~ Q161 + Q178 + Q177
AU =~ Q176 + Q175 + Q174
EM =~ Q173 + Q172 + Q171
PR =~ Q170 + Q169 + Q168
PIL =~ Q167 + Q166 + Q165
PG =~ Q164 + Q163 + Q179
"
fit_ryff <- cfa(ryff_model, data = clean, missing = "fiml", estimator = "MLR")
summary(fit_ryff, fit.measures = TRUE, standardized = TRUE)
eps_model <- "
Mean =~ QID181 + Q189 + Q188
Commit =~ Q187 + Q186 + Q185
BTS =~ Q184 + Q183 + Q182
"
fit_eps <- cfa(eps_model, data = clean, missing = "fiml", estimator = "MLR")
summary(fit_eps, fit.measures = TRUE, standardized = TRUE)
text_long <- clean |>
select(ResponseId, all_of(text_prompts$prompt_var)) |>
pivot_longer(-ResponseId, names_to = "prompt_var", values_to = "text") |>
left_join(text_prompts, by = "prompt_var") |>
mutate(
text = replace_na(text, ""),
chars = nchar(text),
words = str_count(text, "\\S+")
)
text_long |>
group_by(prompt_var, purpose) |>
summarise(
n_responded = sum(chars > 0),
median_chars = median(chars[chars > 0]),
median_words = median(words[chars > 0]),
p10_words = quantile(words[chars > 0], 0.10),
p90_words = quantile(words[chars > 0], 0.90),
.groups = "drop"
) |>
arrange(desc(median_chars)) |>
gt() |>
tab_header(title = "Open-ended response lengths") |>
fmt_number(columns = c(median_chars, median_words, p10_words, p90_words), decimals = 0)
| Open-ended response lengths | ||||||
| prompt_var | purpose | n_responded | median_chars | median_words | p10_words | p90_words |
|---|---|---|---|---|---|---|
| posrel_text | Ryff Positive Relations bidirectional | 39 | 379 | 70 | 33 | 119 |
| selfacceptance_text | Ryff Self-Acceptance bidirectional | 40 | 312 | 61 | 29 | 97 |
| environmental_text | Ryff Environmental Mastery bidirectional | 40 | 297 | 59 | 23 | 103 |
| confidence_text | Confidence/Autonomy unidirectional | 40 | 292 | 57 | 25 | 94 |
| purposeinlife_text | Ryff Purpose in Life bidirectional | 40 | 273 | 52 | 25 | 104 |
| autonomy_text | Ryff Autonomy bidirectional | 40 | 257 | 50 | 23 | 98 |
| personalgrowth_text | Ryff Personal Growth bidirectional | 40 | 233 | 45 | 17 | 77 |
| transformative_text | TRWOC-mirrored: transformation | 21 | 175 | 33 | 11 | 65 |
| college_goals_text | TRWOC-mirrored: college goals | 19 | 134 | 28 | 13 | 72 |
| purpose_goal_action | What you're doing (commitment) | 40 | 95 | 18 | 5 | 36 |
| purpose_goal_reason | Why this goal (purpose motivation) | 40 | 87 | 16 | 7 | 41 |
| purpose_goal | Goal aspiration (EPS-BTS prompt) | 40 | 69 | 14 | 4 | 36 |
text_long |>
filter(chars > 0) |>
ggplot(aes(words)) +
geom_histogram(bins = 30) +
facet_wrap(~ prompt_var, scales = "free_y") +
labs(title = "Word count distribution per prompt",
x = "Words", y = "Respondents")
Per the prereg: layer 23 of mixedbread-large (24-layer model),
mean-pooled across tokens. We retain both text-level embeddings (for
ridge prediction) and word-type embeddings (for
textProjection in H4 face validity). Heavy compute
(10–30+ min per prompt on CPU; longer the first time the model
downloads).
library(text)
embed_one <- function(prompt_var) {
message("Embedding: ", prompt_var)
if (!prompt_var %in% names(clean)) stop("Missing prompt column: ", prompt_var)
texts <- as.character(clean[[prompt_var]])
missing_text <- is.na(texts) | !nzchar(trimws(texts))
if (any(missing_text)) {
message(" replacing ", sum(missing_text), " missing/blank rows with sentinel text")
texts[missing_text] <- "missing response"
}
emb <- textEmbed(
texts = tibble(texts = texts),
model = "mixedbread-ai/mxbai-embed-large-v1",
layers = 23,
aggregation_from_tokens_to_texts = "mean",
aggregation_from_tokens_to_word_types = "mean",
keep_token_embeddings = FALSE,
batch_size = 16
)
list(prompt_var = prompt_var, embeddings = emb)
}
embeddings_dir <- here::here("Data", "embeddings")
dir.create(embeddings_dir, showWarnings = FALSE, recursive = TRUE)
for (pv in text_prompts$prompt_var) {
out_path <- file.path(embeddings_dir, paste0(pv, ".rds"))
if (file.exists(out_path)) next
res <- embed_one(pv)
saveRDS(res$embeddings, out_path)
}
list.files(embeddings_dir)
embeddings_dir <- here::here("Data", "embeddings")
embeddings <- map(set_names(text_prompts$prompt_var),
~ readRDS(file.path(embeddings_dir, paste0(.x, ".rds"))))
# Drop the id columns textEmbed adds so only embedding dimensions remain.
embedding_matrix <- function(emb) {
  emb$texts$texts |>
    select(-matches("^id(_|$)"))
}
# word_types may be nested under $texts depending on the text package version.
word_type_matrix <- function(emb) {
  wt <- emb$word_types
  if (is.list(wt) && "texts" %in% names(wt)) wt$texts else wt
}
text_embed <- map(embeddings, embedding_matrix)
word_type_embed <- map(embeddings, word_type_matrix)
prompt_available <- clean |>
transmute(across(all_of(text_prompts$prompt_var), ~ !is.na(.x) & nzchar(trimws(.x))))
complete_prompt_score_idx <- function(prompt_var, outcome) {
which(
prompt_available[[prompt_var]] &
!is.na(scored[[outcome]]) &
complete.cases(text_embed[[prompt_var]])
)
}
# Recover any id-role variables (e.g., id_nr) that the frozen workflow's
# blueprint expects, so predict() on new data does not fail on missing columns.
model_id_vars <- function(final_model) {
blueprint <- final_model$pre$mold$blueprint
id_ptype <- tryCatch(
blueprint$extra_role_ptypes[["id variable"]],
error = function(e) NULL
)
id_vars <- if (is.null(id_ptype)) character() else names(id_ptype)
recipe_cols <- tryCatch(names(blueprint$recipe$ptype), error = function(e) character())
predictor_cols <- tryCatch(names(blueprint$ptypes$predictors), error = function(e) character())
id_vars <- union(id_vars, intersect(setdiff(recipe_cols, c(predictor_cols, "y")), "id_nr"))
id_vars
}
# Assemble new data with columns in the exact predictor order the frozen model
# was trained on, adding placeholder id columns where the blueprint needs them.
embedding_new_data <- function(final_model, emb_var, idx) {
predictor_cols <- tryCatch(
names(final_model$pre$mold$blueprint$ptypes$predictors),
error = function(e) names(text_embed[[emb_var]])
)
new_x <- as_tibble(text_embed[[emb_var]][idx, predictor_cols, drop = FALSE])
for (id_var in model_id_vars(final_model)) {
if (!id_var %in% names(new_x)) new_x[[id_var]] <- seq_along(idx)
}
new_x
}
map(text_embed, dim)
Per prereg: ridge regression via textTrainRegression on
the N=400 training sample with the package nested-CV defaults (10 outer
folds × inner 75/25 validation split, RMSE-minimizing with
lowest_penalty tie-breaking, outer + inner stratified by
outcome quartiles), with the penalty grid widened from the
package default 10^seq(-3, 3) to
10^seq(-6, 6) (13 candidates). The
final_model returned by the package is the parsnip workflow
we’ll freeze for Prereg 2.
The six PWB constructs are tested with two-tailed Pearson correlations, FDR-corrected (Benjamini-Hochberg) across the six tests. We also report disattenuated correlations using observed-scale reliabilities from Section 9.
library(text)
penalty_grid <- 10^seq(-6, 6)
train_idx <- if ("wave" %in% names(clean)) which(clean$wave == 1) else seq_len(nrow(clean))
h1_specs <- tribble(
~emb_var, ~outcome,
"selfacceptance_text", "Ryff_self_acceptance",
"autonomy_text", "Ryff_autonomy",
"environmental_text", "Ryff_env_mastery",
"posrel_text", "Ryff_pos_relations",
"purposeinlife_text", "Ryff_purpose_in_life",
"personalgrowth_text", "Ryff_personal_growth"
)
h1_train_one <- function(emb_var, outcome) {
idx <- train_idx
y <- scored[[outcome]][idx]
ok <- prompt_available[[emb_var]][idx] & !is.na(y) & complete.cases(text_embed[[emb_var]][idx, ])
idx <- idx[ok]
x <- text_embed[[emb_var]][idx, , drop = FALSE]
y <- scored[[outcome]][idx]
if (length(y) < 10) stop("Too few complete rows for ", emb_var, " -> ", outcome)
fit <- textTrainRegression(
x = x,
y = y,
penalty = penalty_grid,
multi_cores = TRUE,
seed = 2020
)
list(emb_var = emb_var, outcome = outcome, model_idx = idx, fit = fit,
lambda_chosen = fit$final_model$fit$fit$spec$args$penalty)
}
h1_models <- pmap(h1_specs, ~ h1_train_one(..1, ..2))
models_dir <- here::here("Data", "models", "h1_train")
dir.create(models_dir, showWarnings = FALSE, recursive = TRUE)
walk(h1_models, ~ saveRDS(.x, file.path(models_dir,
paste0(.x$outcome, ".rds"))))
h1_cv_pred_tbl <- function(m) {
pred_tbl <- m$fit$predictions
if (!is.data.frame(pred_tbl) || !all(c("predictions", "y") %in% names(pred_tbl))) {
stop("Could not find numeric CV predictions for ", m$outcome)
}
id_nr <- if ("id_nr" %in% names(pred_tbl)) pred_tbl$id_nr else seq_len(nrow(pred_tbl))
pred_tbl |>
transmute(
id_nr = coalesce(id_nr, row_number()),
predictions = as.numeric(.data$predictions),
y = as.numeric(.data$y)
)
}
h1_lambda <- function(m) {
lambda <- m$lambda_chosen
if (is.numeric(lambda) && length(lambda) == 1) return(lambda)
lambda <- m$fit$final_model$fit$fit$spec$args$penalty
if (is.numeric(lambda) && length(lambda) == 1) return(lambda)
lambda_line <- str_subset(m$fit$model_description, "^penalty in final model =")
if (length(lambda_line) == 1) {
return(readr::parse_number(lambda_line))
}
NA_real_
}
h1_cv <- map_dfr(h1_models, function(m) {
pred_tbl <- h1_cv_pred_tbl(m)
ok <- complete.cases(pred_tbl$predictions, pred_tbl$y)
ct <- cor.test(pred_tbl$predictions[ok], pred_tbl$y[ok], alternative = "two.sided")
tibble(outcome = m$outcome, n = sum(ok),
cv_r = unname(ct$estimate), lcl = ct$conf.int[1], ucl = ct$conf.int[2],
p_raw = ct$p.value, lambda = h1_lambda(m))
}) |>
mutate(p_fdr = p.adjust(p_raw, method = "BH"))
h1_cv |>
gt() |>
tab_header(title = "H1 (cross-validated training sample, N=400)",
subtitle = "Two-tailed Pearson, FDR-corrected across 6 PWB constructs") |>
fmt_number(columns = c(cv_r, lcl, ucl, p_raw, p_fdr, lambda), decimals = 3)
boundary_check <- h1_cv |>
mutate(at_boundary = lambda <= min(penalty_grid) | lambda >= max(penalty_grid))
boundary_check |>
filter(at_boundary) |>
gt() |>
tab_header(title = "Lambda boundary check",
subtitle = "If any rows appear, widen the penalty grid before locking the prereg amendment")
readr::write_csv(h1_cv, here::here("Data", "results", "h1_cv.csv"))
h1_cv_predictions <- map_dfc(h1_models, function(m) {
preds <- rep(NA_real_, nrow(clean))
pred_tbl <- h1_cv_pred_tbl(m)
idx <- if (!is.null(m$model_idx)) m$model_idx else train_idx
preds[idx[pred_tbl$id_nr]] <- pred_tbl$predictions
setNames(tibble(preds), paste0("pred_", m$outcome))
}) |>
mutate(ResponseId = clean$ResponseId, .before = 1)
readr::write_csv(h1_cv_predictions, here::here("Data", "results", "predictions_cv_train.csv"))
Disattenuation per construct uses observed-scale reliability only,
treating the predicted language score as the true-score proxy. The
primary preprint value uses Cronbach’s alpha to match the preregistered
formula:
r_disattenuated_alpha = r_observed / sqrt(alpha_outcome).
McDonald’s omega total is reported as a sensitivity check:
r_disattenuated_omega = r_observed / sqrt(omega_total_outcome).
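A worked example with illustrative numbers only: an observed cv_r of .50 against a subscale with alpha = .74 disattenuates to .50 / sqrt(.74).

0.50 / sqrt(0.74)
## [1] 0.5812382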
alpha_lookup <- alpha_tbl |>
mutate(scale_subscale = paste(scale, subscale, sep = "_")) |>
select(scale_subscale, alpha, omega_total)
h1_disattenuated <- h1_cv |>
left_join(alpha_lookup, by = c("outcome" = "scale_subscale")) |>
mutate(
reliability_primary = "Cronbach alpha",
r_disattenuated_alpha = cv_r / sqrt(pmax(alpha, .Machine$double.eps)),
r_disattenuated_omega = cv_r / sqrt(pmax(omega_total, .Machine$double.eps))
)
h1_disattenuated |>
select(outcome, n, cv_r, alpha, omega_total,
r_disattenuated_alpha, r_disattenuated_omega, reliability_primary) |>
gt() |>
tab_header(title = "H1 — observed and disattenuated correlations",
subtitle = "Primary disattenuated estimate uses Cronbach's alpha; omega total is sensitivity.") |>
fmt_number(columns = c(cv_r, alpha, omega_total,
r_disattenuated_alpha, r_disattenuated_omega), decimals = 3)
readr::write_csv(h1_disattenuated, here::here("Data", "results", "h1_disattenuated.csv"))
Frozen models applied to the held-out sample. Benchmark: the lower bound of the 95% CI on r must be at least .45 (the meets_benchmark flag below).
stopifnot("Need a 'wave' column on `clean` (1 = train, 2 = held-out)" = "wave" %in% names(clean))
test_idx <- which(clean$wave == 2)
h1_prospective <- map_dfr(h1_models, function(m) {
idx <- test_idx[prompt_available[[m$emb_var]][test_idx]]
new_x <- embedding_new_data(m$fit$final_model, m$emb_var, idx)
preds <- predict(m$fit$final_model, new_data = new_x)$.pred
truth <- scored[[m$outcome]][idx]
ok <- complete.cases(preds, truth)
ct <- cor.test(preds[ok], truth[ok], alternative = "two.sided")
tibble(outcome = m$outcome, n_test = sum(ok),
prospective_r = unname(ct$estimate),
lcl = ct$conf.int[1], ucl = ct$conf.int[2],
p_raw = ct$p.value, lambda = h1_lambda(m))
}) |>
mutate(p_fdr = p.adjust(p_raw, method = "BH"),
meets_benchmark = lcl >= 0.45)
h1_prospective |>
gt() |>
tab_header(title = "H1 (prospective held-out sample, N=100)",
subtitle = "Benchmark: r ≥ .45 lower bound of 95% CI") |>
fmt_number(columns = c(prospective_r, lcl, ucl, p_raw, p_fdr, lambda), decimals = 3)
h1_prospective_predictions <- map_dfc(h1_models, function(m) {
preds <- rep(NA_real_, nrow(clean))
idx <- test_idx[prompt_available[[m$emb_var]][test_idx]]
new_x <- embedding_new_data(m$fit$final_model, m$emb_var, idx)
preds[idx] <- predict(m$fit$final_model, new_data = new_x)$.pred
setNames(tibble(preds), paste0("pred_", m$outcome))
}) |>
mutate(ResponseId = clean$ResponseId, .before = 1)
readr::write_csv(h1_prospective, here::here("Data", "results", "h1_prospective.csv"))
readr::write_csv(h1_prospective_predictions, here::here("Data", "results", "predictions_prospective.csv"))
Per prereg’s “Other planned analyses”: compare the unidirectional
confidence_text prompt (which mirrors Ryff’s autonomy item
wording) with the dimensional autonomy_text prompt for
predicting Ryff Autonomy. The confirmatory H1 result above uses
autonomy_text (the registered predictor); this block
reports confidence_text as a parallel sensitivity model
with identical settings (penalty grid 10^seq(-6, 6),
package nested-CV defaults, layer 23 mixedbread embeddings). Saved
separately so it doesn’t pollute the prereg-aligned
pred_Ryff_autonomy panel used for H2/H3/H5.
library(text)
idx <- complete_prompt_score_idx("confidence_text", "Ryff_autonomy")
h1_autonomy_alt <- textTrainRegression(
x = text_embed[["confidence_text"]][idx, ],
y = scored$Ryff_autonomy[idx],
penalty = penalty_grid,
multi_cores = TRUE,
seed = 2020
)
saveRDS(h1_autonomy_alt,
here::here("Data", "models", "h1_train", "Ryff_autonomy__confidence_text.rds"))
# Reuse the H1 helpers so lambda and CV predictions are extracted the same way
# as for the primary models, rather than via version-fragile slot paths.
alt_lambda <- h1_lambda(list(fit = h1_autonomy_alt))
alt_pred_tbl <- h1_cv_pred_tbl(list(fit = h1_autonomy_alt, outcome = "Ryff_autonomy"))
alt_ok <- complete.cases(alt_pred_tbl$predictions, alt_pred_tbl$y)
alt_ct <- cor.test(alt_pred_tbl$predictions[alt_ok], alt_pred_tbl$y[alt_ok], alternative = "two.sided")
primary_row <- h1_cv |> filter(outcome == "Ryff_autonomy")
autonomy_comparison <- tibble(
predictor = c("autonomy_text (primary, prereg-aligned)", "confidence_text (sensitivity)"),
n = c(primary_row$n, sum(alt_ok)),
cv_r = c(primary_row$cv_r, unname(alt_ct$estimate)),
lcl = c(primary_row$lcl, alt_ct$conf.int[1]),
ucl = c(primary_row$ucl, alt_ct$conf.int[2]),
p_raw = c(primary_row$p_raw, alt_ct$p.value),
lambda = c(primary_row$lambda, alt_lambda),
at_lower_bound = c(primary_row$lambda, alt_lambda) <= min(penalty_grid),
at_upper_bound = c(primary_row$lambda, alt_lambda) >= max(penalty_grid)
)
autonomy_comparison |>
gt() |>
tab_header(title = "Autonomy prompt comparison (sensitivity per prereg)",
subtitle = "Both predict Ryff Autonomy; primary is locked, sensitivity is exploratory") |>
fmt_number(columns = c(cv_r, lcl, ucl, p_raw, lambda), decimals = 3)
readr::write_csv(autonomy_comparison,
here::here("Data", "results", "h1_autonomy_prompt_comparison.csv"))
Per prereg: combine CV training predictions (N=400) with held-out prospective predictions (N=100) into a single panel of out-of-sample predicted scores per participant per construct. This panel is the predictor input for H2–H5.
# pred_Ryff_purpose_in_life acts as a row-has-predictions indicator here;
# distinct() keeps the CV row if a participant somehow appears in both sets.
h1_pooled <- bind_rows(
  h1_cv_predictions |> filter(!is.na(pred_Ryff_purpose_in_life)),
  h1_prospective_predictions |> filter(!is.na(pred_Ryff_purpose_in_life))
) |>
  distinct(ResponseId, .keep_all = TRUE)
readr::write_csv(h1_pooled, here::here("Data", "results", "predictions_pooled.csv"))
cat("Pooled prediction panel rows:", nrow(h1_pooled), "\n")
Pearson correlations between each language-based PWB score and: (a) the five non-matched PWB observed subscales, (b) SWLS, HILS, PANAS PA (positive expected), (c) PANAS NA, PHQ-9, GAD-7 (negative expected). Two-tailed per prereg, FDR-corrected within each predicted construct.
h2_panel <- scored |>
select(ResponseId, starts_with("Ryff_"), SWLS_swls, HILS_hils,
PANAS_positive_affect, PANAS_negative_affect,
PHQ9_phq9, GAD7_gad7, starts_with("BFI10_"), Agency_agency) |>
inner_join(h1_pooled, by = "ResponseId")
h2_other_observed <- c("SWLS_swls","HILS_hils",
"PANAS_positive_affect","PANAS_negative_affect",
"PHQ9_phq9","GAD7_gad7")
h2_pairs <- crossing(pred_dim = pwb_dims, observed = c(paste0("Ryff_", pwb_dims), h2_other_observed)) |>
filter(observed != paste0("Ryff_", pred_dim))
h2_results <- h2_pairs |>
pmap_dfr(function(pred_dim, observed) {
pred_col <- paste0("pred_Ryff_", pred_dim)
df <- h2_panel |> select(p = all_of(pred_col), o = all_of(observed)) |> drop_na()
if (nrow(df) < 10) return(NULL)
ct <- cor.test(df$p, df$o, alternative = "two.sided")
tibble(pred_dim = pred_dim, observed = observed,
n = nrow(df), r = unname(ct$estimate),
lcl = ct$conf.int[1], ucl = ct$conf.int[2], p_raw = ct$p.value)
}) |>
group_by(pred_dim) |>
mutate(p_fdr = p.adjust(p_raw, method = "BH")) |>
ungroup()
h2_results |>
gt(groupname_col = "pred_dim") |>
tab_header(title = "H2 — nomological validity (N=500, two-tailed, FDR within predicted dim)") |>
fmt_number(columns = c(r, lcl, ucl, p_raw, p_fdr), decimals = 3)
readr::write_csv(h2_results, here::here("Data", "results", "h2_results.csv"))
h2_results |>
ggplot(aes(r, fct_rev(pred_dim), color = sign(r) > 0)) +
geom_point(size = 2) +
geom_errorbarh(aes(xmin = lcl, xmax = ucl), height = 0) +
geom_vline(xintercept = 0, linetype = 2, color = "grey60") +
facet_wrap(~ observed, scales = "free_x") +
scale_color_manual(values = c(`TRUE` = brand_blue, `FALSE` = brand_red), guide = "none") +
labs(title = "H2 — predicted PWB ↔ other constructs",
x = "Pearson r (95% CI)", y = NULL)
Confirmatory: H3i age × {EM, AU}, H3ii gender × pos relations, H3iii education × {PG, PIL}, H3iv religious commitment × PIL. Religious commitment = Q202 ≥ 3 (“Somewhat” or higher). Education tested ordinal + sensitivity (continuous and binary at bachelor’s). All confirmatory tests one-tailed, FDR-corrected within H3 confirmatory family. Exploratory: SES (Q148), relationship status, employment.
h3_panel <- h2_panel |>
left_join(clean |> select(ResponseId, Q146, Q147, Q148, highest_education,
Q199, Q201, Q202),
by = "ResponseId") |>
mutate(
age_num = as.numeric(Q146),
gender_binary = case_when(Q147 == "1" ~ "Male", Q147 == "2" ~ "Female", TRUE ~ NA_character_),
ses_continuous = as.numeric(Q148),
edu_ordinal = as.numeric(highest_education),
edu_continuous = edu_ordinal,
edu_binary_BAplus = if_else(edu_ordinal >= 6, "BA+", "below_BA"),
# Level order matters: the one-tailed t-test below tests level 1 > level 2,
# so put "Religious" first to test Religious > Not religious (H3iv).
relig_commit_binary = factor(case_when(
  as.numeric(Q202) >= 3 ~ "Religious",
  as.numeric(Q202) %in% 1:2 ~ "Not religious",
  TRUE ~ NA_character_
), levels = c("Religious", "Not religious")),
relationship_partnered = case_when(
as.numeric(Q201) %in% 3:6 ~ "Partnered",
as.numeric(Q201) %in% c(1, 2, 7, 8, 9) ~ "Not partnered",
TRUE ~ NA_character_
),
employed_binary = case_when(
as.numeric(Q199) %in% 1:2 ~ "Employed",
as.numeric(Q199) %in% 3:4 ~ "Not employed",
TRUE ~ NA_character_
)
)
confirmatory_one_tailed <- function(predicted_col, x, alternative, kind = c("cor","ttest")) {
  kind <- match.arg(kind)
  d <- tibble(p = h3_panel[[predicted_col]], x = x) |> drop_na()
  if (nrow(d) < 10) return(NULL)
  if (kind == "cor") {
    ct <- cor.test(d$p, d$x, alternative = alternative)
    tibble(stat = "Pearson r", estimate = unname(ct$estimate),
           lcl = ct$conf.int[1], ucl = ct$conf.int[2], p_raw = ct$p.value, n = nrow(d))
  } else {
    # Direction: t.test(p ~ x, alternative = "greater") tests mean(level 1) >
    # mean(level 2). Alphabetical levels give Female > Male and BA+ > below_BA;
    # relig_commit_binary is releveled upstream so "Religious" is level 1.
    d$x <- factor(d$x)
    tt <- t.test(p ~ x, data = d, alternative = alternative, var.equal = FALSE)
means <- d |> group_by(x) |> summarise(m = mean(p), .groups = "drop")
pooled_sd <- sqrt(((sum(d$x == levels(d$x)[1])-1)*var(d$p[d$x == levels(d$x)[1]]) +
(sum(d$x == levels(d$x)[2])-1)*var(d$p[d$x == levels(d$x)[2]])) /
(nrow(d) - 2))
cohens_d <- (means$m[1] - means$m[2]) / pooled_sd
tibble(stat = "Welch t", estimate = unname(tt$statistic),
lcl = NA_real_, ucl = NA_real_, p_raw = tt$p.value, n = nrow(d),
cohens_d = cohens_d, mean_diff = means$m[1] - means$m[2])
}
}
h3_confirmatory <- bind_rows(
confirmatory_one_tailed("pred_Ryff_env_mastery", h3_panel$age_num, "greater", "cor") |> mutate(test = "H3i: age × EM"),
confirmatory_one_tailed("pred_Ryff_autonomy", h3_panel$age_num, "greater", "cor") |> mutate(test = "H3i: age × AU"),
confirmatory_one_tailed("pred_Ryff_pos_relations", h3_panel$gender_binary, "greater", "ttest") |> mutate(test = "H3ii: gender × PR (women > men)"),
confirmatory_one_tailed("pred_Ryff_personal_growth", h3_panel$edu_ordinal, "greater", "cor") |> mutate(test = "H3iii: edu × PG (ordinal)"),
confirmatory_one_tailed("pred_Ryff_purpose_in_life", h3_panel$edu_ordinal, "greater", "cor") |> mutate(test = "H3iii: edu × PIL (ordinal)"),
confirmatory_one_tailed("pred_Ryff_purpose_in_life", h3_panel$relig_commit_binary, "greater", "ttest") |> mutate(test = "H3iv: religious commitment × PIL")
) |>
mutate(p_fdr = p.adjust(p_raw, method = "BH")) |>
select(test, n, stat, estimate, cohens_d, mean_diff, lcl, ucl, p_raw, p_fdr)
h3_confirmatory |>
gt() |>
tab_header(title = "H3 confirmatory (one-tailed, FDR-corrected within H3 family)") |>
fmt_number(columns = c(estimate, cohens_d, mean_diff, lcl, ucl, p_raw, p_fdr), decimals = 3)
h3_edu_sensitivity <- bind_rows(
confirmatory_one_tailed("pred_Ryff_personal_growth", h3_panel$edu_continuous, "greater", "cor") |> mutate(test = "edu × PG (continuous)"),
confirmatory_one_tailed("pred_Ryff_purpose_in_life", h3_panel$edu_continuous, "greater", "cor") |> mutate(test = "edu × PIL (continuous)"),
confirmatory_one_tailed("pred_Ryff_personal_growth", h3_panel$edu_binary_BAplus, "greater", "ttest") |> mutate(test = "edu × PG (binary BA+)"),
confirmatory_one_tailed("pred_Ryff_purpose_in_life", h3_panel$edu_binary_BAplus, "greater", "ttest") |> mutate(test = "edu × PIL (binary BA+)")
)
h3_edu_sensitivity |>
gt() |>
tab_header(title = "H3 education sensitivity analyses") |>
fmt_number(columns = c(estimate, cohens_d, mean_diff, lcl, ucl, p_raw), decimals = 3)
h3_exploratory_specs <- expand_grid(
predicted_col = paste0("pred_Ryff_", pwb_dims),
pair = list(
list(label = "SES (Q148)", x = "ses_continuous", kind = "cor", alt = "two.sided"),
list(label = "Partnered", x = "relationship_partnered", kind = "ttest", alt = "two.sided"),
list(label = "Employed", x = "employed_binary", kind = "ttest", alt = "two.sided")
)
) |> unnest_wider(pair)
h3_exploratory <- pmap_dfr(h3_exploratory_specs, function(predicted_col, label, x, kind, alt) {
res <- confirmatory_one_tailed(predicted_col, h3_panel[[x]], alt, kind)
if (is.null(res)) return(NULL)
res |> mutate(predicted = predicted_col, covariate = label)
})
h3_exploratory |>
gt(groupname_col = "predicted") |>
tab_header(title = "H3 exploratory (two-tailed, no FDR)") |>
fmt_number(columns = c(estimate, cohens_d, mean_diff, lcl, ucl, p_raw), decimals = 3)
readr::write_csv(h3_confirmatory, here::here("Data", "results", "h3_confirmatory.csv"))
readr::write_csv(h3_edu_sensitivity, here::here("Data", "results", "h3_edu_sensitivity.csv"))
readr::write_csv(h3_exploratory, here::here("Data", "results", "h3_exploratory.csv"))
For each PWB dimension: (a) supervised projection of word-type embeddings onto the direction vector that separates low vs. high observed scale scores, (b) LDA topics from the dimension’s open-ended responses. Outputs are exported as tables + plots for two independent raters (blind to hypotheses, per appendix) to qualitatively compare against Ryff’s theoretical definitions.
library(text)
library(topics)
h4_specs <- h1_specs
h4_projection <- map(seq_len(nrow(h4_specs)), function(i) {
pv <- h4_specs$emb_var[i]; oc <- h4_specs$outcome[i]
idx <- complete_prompt_score_idx(pv, oc)
proj <- textProjection(
words = clean[[pv]][idx],
word_embeddings = text_embed[[pv]][idx, , drop = FALSE],
word_types_embeddings = word_type_embed[[pv]],
x = scored[[oc]][idx],
pca = NULL,
aggregation = "mean",
split = "quartile",
word_weight_power = 1,
min_freq_words_test = 2
)
list(prompt = pv, outcome = oc, projection = proj)
})
proj_dir <- here::here("Data", "results", "h4_projections")
dir.create(proj_dir, showWarnings = FALSE, recursive = TRUE)
walk(h4_projection, ~ saveRDS(.x, file.path(proj_dir, paste0(.x$outcome, ".rds"))))
h4_top_words <- map_dfr(h4_projection, function(p) {
df <- p$projection$word_data
bind_rows(
df |> arrange(desc(dot.x)) |> head(15) |> mutate(pole = "high"),
df |> arrange(dot.x) |> head(15) |> mutate(pole = "low")
) |>
select(words, dot.x, p_values_dot.x, cohens_d.x, n, n_g1.x, n_g2.x, pole) |>
mutate(outcome = p$outcome, .before = 1)
})
h4_top_words |>
gt(groupname_col = c("outcome","pole")) |>
tab_header(title = "H4 — top projected words/phrases per dimension (high vs. low observed score)") |>
fmt_number(columns = c(dot.x, p_values_dot.x, cohens_d.x), decimals = 3)
set.seed(2020)
h4_rater_packet <- map_dfr(seq_len(nrow(h4_specs)), function(i) {
oc <- h4_specs$outcome[i]; pv <- h4_specs$emb_var[i]
scored_long <- tibble(text = clean[[pv]], score = scored[[oc]]) |>
drop_na() |> mutate(quartile = ntile(score, 4))
q1 <- scored_long |> filter(quartile == 1)
q4 <- scored_long |> filter(quartile == 4)
bind_rows(
q1 |> slice_sample(n = min(5, nrow(q1))) |> mutate(pole = "low quartile"),
q4 |> slice_sample(n = min(5, nrow(q4))) |> mutate(pole = "high quartile")
) |> mutate(outcome = oc, .before = 1)
})
readr::write_csv(h4_rater_packet, here::here("Data", "results", "h4_rater_packet.csv"))
readr::write_csv(h4_top_words, here::here("Data", "results", "h4_top_words.csv"))
Topic-count selection — fit LDA at several k per construct, pick by mean per-topic coherence. At pilot N=40 these estimates are unstable; treat as an ordering, not a precise pick. Re-run at full N to lock k per construct.
library(topics)
candidate_k <- c(4, 6, 8, 10, 12, 15)
build_dtm_for_construct <- function(prompt_var) {
texts <- clean[[prompt_var]][prompt_available[[prompt_var]]]
topicsDtm(
data = tibble(text = texts),
ngram_window = c(1, 1),
removal_mode = "frequency",
removal_rate_least = 2,
shuffle = FALSE,
seed = 2020
)
}
h4_dtm <- map(set_names(h4_specs$emb_var), build_dtm_for_construct)
h4_k_diagnostic <- map_dfr(seq_len(nrow(h4_specs)), function(i) {
pv <- h4_specs$emb_var[i]; oc <- h4_specs$outcome[i]
dtm <- h4_dtm[[pv]]
map_dfr(candidate_k, function(k) {
mod <- topicsModel(dtm = dtm, num_topics = k, num_iterations = 1500, seed = 2020)
tibble(
prompt = pv,
outcome = oc,
k = k,
mean_coherence = mean(mod$coherence, na.rm = TRUE),
median_coherence = median(mod$coherence, na.rm = TRUE),
min_coherence = min(mod$coherence, na.rm = TRUE),
prevalence_sd = sd(mod$prevalence, na.rm = TRUE)
)
})
})
h4_k_diagnostic |>
ggplot(aes(k, mean_coherence)) +
geom_line(color = brand_blue) +
geom_point(color = brand_blue, size = 2) +
facet_wrap(~ outcome, scales = "free_y") +
labs(title = "H4 — LDA topic-count selection",
subtitle = "Mean per-topic coherence across candidate k. Pick the elbow / peak per construct.",
x = "Number of topics (k)", y = "Mean coherence") +
theme_minimal(base_size = 11)
h4_k_chosen <- h4_k_diagnostic |>
group_by(outcome) |>
slice_max(mean_coherence, n = 1, with_ties = FALSE) |>
ungroup() |>
select(outcome, prompt, k_chosen = k, mean_coherence)
h4_k_chosen |>
gt() |>
tab_header(title = "Chosen k per construct (peak coherence)",
subtitle = "Override manually below if visual inspection prefers a different k.") |>
fmt_number(columns = mean_coherence, decimals = 3)
readr::write_csv(h4_k_diagnostic, here::here("Data", "results", "h4_topic_k_diagnostic.csv"))
readr::write_csv(h4_k_chosen, here::here("Data", "results", "h4_topic_k_chosen.csv"))
Final LDA fit at the chosen k per construct, plus topicsTest() correlating each topic’s prevalence with the matched observed scale score. The test result powers the inline plot below — top topics most strongly associated with high vs. low scale scores.
library(topics)
h4_k_lookup <- setNames(h4_k_chosen$k_chosen, h4_k_chosen$outcome)
h4_lda <- map(seq_len(nrow(h4_specs)), function(i) {
pv <- h4_specs$emb_var[i]; oc <- h4_specs$outcome[i]
ok <- prompt_available[[pv]]
k <- h4_k_lookup[[oc]]
texts <- clean[[pv]][ok]
dtm <- h4_dtm[[pv]]
mod <- topicsModel(dtm = dtm, num_topics = k, num_iterations = 1500, seed = 2020)
data_for_test <- tibble(text = texts)
data_for_test[[oc]] <- scored[[oc]][ok]
test <- tryCatch(
topicsTest(
data = data_for_test,
model = mod,
x_variable = oc,
test_method = "default",
p_adjust_method = "fdr",
seed = 2020
),
error = function(e) {
message("topicsTest failed for ", oc, ": ", conditionMessage(e))
NULL
}
)
list(prompt = pv, outcome = oc, k = k, dtm = dtm, model = mod, test = test,
topics_pkg_version = packageVersion("topics"))
})
lda_dir <- here::here("Data", "topics", "h4")
dir.create(lda_dir, showWarnings = FALSE, recursive = TRUE)
walk(h4_lda, ~ saveRDS(.x, file.path(lda_dir,
paste0(.x$outcome, "_k", .x$k, ".rds"))))
Inline visuals — for each PWB construct, render the supervised word-projection plot (low ↔ high observed score) and the topic-prevalence × scale association plot. Read across them to label what each topic / each projection pole is “about”, then record those labels for the rater packet.
library(text)
library(topics)
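# text/topics plotting helpers return a bare ggplot, a list carrying a
# final_plot element, or a nested list of plots depending on package version;
# recurse until something printable turns up.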
print_any_plot <- function(obj) {
if (inherits(obj, c("ggplot", "gg", "patchwork"))) {
print(obj)
} else if (is.list(obj)) {
if ("final_plot" %in% names(obj)) {
print_any_plot(obj$final_plot)
} else {
for (el in obj) print_any_plot(el)
}
}
invisible(NULL)
}
extract_test_table <- function(test_obj) {
if (is.null(test_obj)) return(NULL)
if (is.data.frame(test_obj)) return(tibble::as_tibble(test_obj))
if (is.list(test_obj)) {
for (nm in c("test_results", "results", "test", "topic_results")) {
if (!is.null(test_obj[[nm]]) && is.data.frame(test_obj[[nm]])) {
return(tibble::as_tibble(test_obj[[nm]]))
}
}
df_el <- purrr::detect(test_obj, is.data.frame)
if (!is.null(df_el)) return(tibble::as_tibble(df_el))
}
NULL
}
for (i in seq_len(nrow(h4_specs))) {
oc <- h4_specs$outcome[i]
pv <- h4_specs$emb_var[i]
k_used <- h4_lda[[i]]$k
cat("\n\n### ", oc, " · prompt: `", pv, "` · k = ", k_used, "\n\n", sep = "")
cat("\n**Word-level projection — low ↔ high observed `", oc, "`**\n\n", sep = "")
proj <- h4_projection[[i]]$projection
pp <- tryCatch(
textProjectionPlot(
word_data = proj,
k_n_words_to_test = FALSE,
min_freq_words_test = 2,
plot_n_word_extreme = 10,
plot_n_word_frequency = 5,
plot_n_words_middle = 5,
plot_n_words_p = 5,
title_top = paste0("Projection: ", oc, " (high pole right, low pole left)")
),
error = function(e) {
cat("\n*textProjectionPlot failed:* ", conditionMessage(e), "\n\n", sep = "")
NULL
}
)
print_any_plot(pp)
cat("\n\n**Top 10 topics most associated with `", oc, "` (linear regression on topic prevalence, FDR-corrected)**\n\n", sep = "")
td <- extract_test_table(h4_lda[[i]]$test)
if (is.null(td)) {
cat("\n*Could not extract topicsTest result table for ", oc, ".*\n\n", sep = "")
} else {
estimate_col <- intersect(c("estimate","beta","coef","Estimate","b","z"), names(td))[1]
p_col <- intersect(c("p.adjusted","p.adj","p_value_adj","p","p_value"),names(td))[1]
if (is.na(estimate_col)) {
cat("\n*Effect-size column not found in topicsTest output. Columns: ",
paste(names(td), collapse = ", "), "*\n\n", sep = "")
} else {
td_top <- td |>
mutate(.estimate = .data[[estimate_col]]) |>
arrange(desc(abs(.estimate))) |>
head(10) |>
select(-.estimate)
print(
gt(td_top) |>
tab_header(title = paste0("Top topics — ", oc),
subtitle = paste0("Ranked by |", estimate_col, "|; ",
if (!is.na(p_col)) paste0("p column: ", p_col) else "p column not detected")) |>
fmt_number(columns = where(is.numeric), decimals = 3)
)
}
}
cat("\n\n**Topic plot — prevalence × scale association**\n\n")
tp <- NULL
for (test_arg in list(h4_lda[[i]]$test, td, NULL)) {
tp <- tryCatch(
topicsPlot(
model = h4_lda[[i]]$model,
test = test_arg,
p_alpha = 0.05,
p_adjust_method = "fdr",
ngrams_max = 20,
plot_n_most_prevalent_topics = 10,
overview_plot = FALSE
),
error = function(e) NULL
)
if (!is.null(tp)) break
}
if (is.null(tp)) {
cat("\n*topicsPlot failed for all test-arg variants; see the top-10 table above.*\n\n")
}
print_any_plot(tp)
cat("\n\n---\n\n")
}
After viewing the projections and topics, fill in human-readable labels for each pole and each topic so the rater packet has interpretable anchors. The blank scaffold below writes to disk; expand columns if you keep more than 6 topics.
h4_labels <- tribble(
~outcome, ~high_pole_label, ~low_pole_label,
~topic_1, ~topic_2, ~topic_3, ~topic_4, ~topic_5, ~topic_6,
"Ryff_self_acceptance", "", "", "", "", "", "", "", "",
"Ryff_autonomy", "", "", "", "", "", "", "", "",
"Ryff_env_mastery", "", "", "", "", "", "", "", "",
"Ryff_pos_relations", "", "", "", "", "", "", "", "",
"Ryff_purpose_in_life", "", "", "", "", "", "", "", "",
"Ryff_personal_growth", "", "", "", "", "", "", "", ""
)
readr::write_csv(h4_labels, here::here("Data", "results", "h4_labels.csv"))
Per prereg: 6×6 predicted-by-observed correlation matrix with both observed and disattenuated correlations, plus descriptive pairwise comparisons of matched vs. each non-matched subscale.
Layered on top: a participant-level mixed-effects model on z-scored squared prediction errors, with a random participant intercept and a fixed effect of is_target (1 if matched). emmeans extracts the 30 individual target-vs-off-target pairwise contrasts.
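A quick numeric sanity check on the correction formula (invented numbers for illustration, not study estimates):
# Illustrative disattenuation arithmetic — invented numbers, not study estimates
r_obs     <- 0.42        # observed predicted-observed correlation
alpha_obs <- 0.81        # Cronbach's alpha of the observed subscale
r_obs / sqrt(alpha_obs)  # 0.467 — the estimate if the subscale were perfectly reliable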
if (!exists("pwb_dims")) {
pwb_dims <- c("self_acceptance", "autonomy", "env_mastery",
"pos_relations", "purpose_in_life", "personal_growth")
}
h5_obs <- paste0("Ryff_", pwb_dims)
h5_pred <- paste0("pred_Ryff_", pwb_dims)
h5_panel <- h2_panel |> select(ResponseId, all_of(h5_obs), all_of(h5_pred))
h5_corr <- expand_grid(predicted = h5_pred, observed = h5_obs) |>
pmap_dfr(function(predicted, observed) {
d <- h5_panel |> select(p = all_of(predicted), o = all_of(observed)) |> drop_na()
if (nrow(d) < 10) return(NULL)
ct <- cor.test(d$p, d$o)
tibble(predicted = predicted, observed = observed, n = nrow(d),
r = unname(ct$estimate), lcl = ct$conf.int[1], ucl = ct$conf.int[2])
}) |>
mutate(
pred_dim = str_remove(predicted, "pred_Ryff_"),
obs_dim = str_remove(observed, "Ryff_"),
is_target = pred_dim == obs_dim
)
h5_corr_matrix <- h5_corr |>
select(pred_dim, obs_dim, r) |>
pivot_wider(names_from = obs_dim, values_from = r)
h5_corr_matrix |>
gt() |>
tab_header(title = "H5 — predicted × observed PWB correlation matrix",
subtitle = "Diagonal = matched (target); off-diagonal = non-matched")
h5_corr_disattenuated <- h5_corr |>
left_join(alpha_lookup |> rename(observed = scale_subscale,
alpha_o = alpha,
omega_total_o = omega_total),
by = "observed") |>
mutate(
reliability_primary = "Cronbach alpha",
r_disattenuated_alpha = r / sqrt(pmax(alpha_o, .Machine$double.eps)),
r_disattenuated_omega = r / sqrt(pmax(omega_total_o, .Machine$double.eps))
)
h5_corr_disattenuated |>
select(pred_dim, obs_dim, n, r, alpha_o, omega_total_o,
r_disattenuated_alpha, r_disattenuated_omega,
reliability_primary, is_target) |>
gt(groupname_col = "pred_dim") |>
tab_header(title = "H5 — observed and disattenuated correlations",
subtitle = "Primary disattenuated estimate uses Cronbach's alpha; omega total is sensitivity.") |>
fmt_number(columns = c(r, alpha_o, omega_total_o,
r_disattenuated_alpha, r_disattenuated_omega), decimals = 3)
readr::write_csv(h5_corr_disattenuated, here::here("Data", "results", "h5_corr_disattenuated.csv"))
if (!exists("pwb_dims")) {
pwb_dims <- c("self_acceptance", "autonomy", "env_mastery",
"pos_relations", "purpose_in_life", "personal_growth")
}
h5_obs <- paste0("Ryff_", pwb_dims)
h5_pred <- paste0("pred_Ryff_", pwb_dims)
if (!exists("h1_pooled")) {
pooled_path <- here::here("Data", "results", "predictions_pooled.csv")
cv_path <- here::here("Data", "results", "predictions_cv_train.csv")
prospective_path <- here::here("Data", "results", "predictions_prospective.csv")
if (file.exists(pooled_path)) {
h1_pooled <- readr::read_csv(pooled_path, show_col_types = FALSE)
} else {
prediction_pieces <- list()
if (exists("h1_cv_predictions")) {
prediction_pieces <- append(prediction_pieces, list(h1_cv_predictions))
} else if (file.exists(cv_path)) {
prediction_pieces <- append(
prediction_pieces,
list(readr::read_csv(cv_path, show_col_types = FALSE))
)
}
if (exists("h1_prospective_predictions")) {
prediction_pieces <- append(prediction_pieces, list(h1_prospective_predictions))
} else if (file.exists(prospective_path)) {
prediction_pieces <- append(
prediction_pieces,
list(readr::read_csv(prospective_path, show_col_types = FALSE))
)
}
if (length(prediction_pieces) == 0) {
stop("No H1 prediction panel found. Run H1 prediction chunks or create Data/results/predictions_cv_train.csv.")
}
h1_pooled <- bind_rows(prediction_pieces) |>
distinct(ResponseId, .keep_all = TRUE)
}
if ("pred_Ryff_purpose_in_life" %in% names(h1_pooled)) {
h1_pooled <- h1_pooled |> filter(!is.na(pred_Ryff_purpose_in_life))
}
}
if (!exists("h2_panel")) {
h2_panel <- scored |>
select(ResponseId, starts_with("Ryff_"), SWLS_swls, HILS_hils,
PANAS_positive_affect, PANAS_negative_affect,
PHQ9_phq9, GAD7_gad7, starts_with("BFI10_"), Agency_agency) |>
inner_join(h1_pooled, by = "ResponseId")
}
if (!exists("h5_panel")) {
h5_panel <- h2_panel |> select(ResponseId, all_of(h5_obs), all_of(h5_pred))
}
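# Long panel: all 36 predicted × observed pairings per participant. Predictions
# are z-scored within pred_dim and observations within obs_dim so squared
# errors are comparable across dimensions.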
h5_long <- h5_panel |>
pivot_longer(all_of(h5_pred), names_to = "predicted", values_to = "pred_score") |>
mutate(pred_dim = str_remove(predicted, "pred_Ryff_")) |>
select(-predicted) |>
pivot_longer(all_of(h5_obs), names_to = "observed", values_to = "obs_score") |>
mutate(obs_dim = str_remove(observed, "Ryff_")) |>
select(-observed) |>
group_by(pred_dim) |>
mutate(z_pred = as.numeric(scale(pred_score))) |>
group_by(obs_dim) |>
mutate(z_obs = as.numeric(scale(obs_score))) |>
ungroup() |>
mutate(
sq_err = (z_pred - z_obs)^2,
is_target = factor(if_else(pred_dim == obs_dim, "target", "off_target"),
levels = c("off_target", "target")),
pred_dim = factor(pred_dim, levels = pwb_dims),
obs_dim = factor(obs_dim, levels = pwb_dims)
) |>
drop_na(sq_err)
h5_lmer_overall <- lmer(
sq_err ~ is_target + (1 | ResponseId) + (1 | pred_dim),
data = h5_long
)
summary(h5_lmer_overall)
h5_lmer_pairwise <- lmer(
sq_err ~ pred_dim * obs_dim + (1 | ResponseId),
data = h5_long
)
# trt.vs.ctrl cannot take a different reference level per pred_dim group in a
# single call (a vector-valued ref averages the controls instead), so build the
# 30 target-vs-off-target contrasts one predicted dimension at a time.
h5_emm_pairwise <- map_dfr(seq_along(pwb_dims), function(j) {
  emmeans(h5_lmer_pairwise, ~ obs_dim, at = list(pred_dim = pwb_dims[j])) |>
    contrast(method = "trt.vs.ctrl", ref = j) |>
    as_tibble() |>
    mutate(pred_dim = pwb_dims[j], .before = 1)
})
h5_emm_pairwise |>
  gt(groupname_col = "pred_dim") |>
  tab_header(title = "H5 — 30 pairwise contrasts (target as reference; predicted dim grouped)") |>
fmt_number(columns = any_of(c("estimate", "SE", "df", "t.ratio", "z.ratio", "p.value")),
decimals = 3)
readr::write_csv(h5_emm_pairwise, here::here("Data", "results", "h5_pairwise_contrasts.csv"))
h5_per_construct_paired <- h5_long |>
group_by(ResponseId, pred_dim, is_target) |>
summarise(mean_sq_err = mean(sq_err), .groups = "drop") |>
pivot_wider(names_from = is_target, values_from = mean_sq_err) |>
drop_na(target, off_target)
h5_per_construct_t <- h5_per_construct_paired |>
  group_by(pred_dim) |>
  summarise(
    n = n(),
    mean_target = mean(target),
    mean_off = mean(off_target),
    diff = mean(off_target - target),
    # run the one-sided paired test once per construct and keep the htest object
    tt = list(t.test(off_target, target, paired = TRUE, alternative = "greater")),
    .groups = "drop"
  ) |>
  mutate(
    t_stat = map_dbl(tt, ~ unname(.x$statistic)),
    p_one_sided = map_dbl(tt, "p.value"),
    p_fdr = p.adjust(p_one_sided, method = "BH")
  ) |>
  select(-tt)
h5_per_construct_t |>
gt() |>
tab_header(title = "H5 — paired t-tests per construct (target vs. mean of 5 off-target errors)") |>
fmt_number(columns = c(mean_target, mean_off, diff, t_stat, p_one_sided, p_fdr), decimals = 3)
readr::write_csv(h5_per_construct_t, here::here("Data", "results", "h5_per_construct_paired.csv"))
Per prereg “Other planned analyses.” Compares scale means between text_first vs. scales_first conditions via t-tests.
condition_panel <- scored |>
left_join(clean |> select(ResponseId, condition), by = "ResponseId") |>
mutate(
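    # condition names the block shown last, so "scales_last" means text came first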
text_first = case_when(
condition == "scales_last" ~ "text_first",
condition == "text_last" ~ "scales_first",
TRUE ~ NA_character_
)
) |>
filter(!is.na(text_first))
scale_cols <- c(paste0("Ryff_", pwb_dims), "SWLS_swls", "HILS_hils",
"PANAS_positive_affect", "PANAS_negative_affect",
"PHQ9_phq9", "GAD7_gad7", paste0("EPS_", c("meaningfulness","commitment","beyond_self")))
order_effects <- map_dfr(scale_cols, function(s) {
d <- condition_panel |> select(y = all_of(s), text_first) |> drop_na()
if (length(unique(d$text_first)) < 2 || nrow(d) < 10) return(NULL)
tt <- t.test(y ~ text_first, data = d, var.equal = FALSE)
means <- d |> group_by(text_first) |> summarise(m = mean(y), s = sd(y), n = n(), .groups = "drop")
tibble(scale = s, n_text_first = means$n[means$text_first == "text_first"],
n_scales_first = means$n[means$text_first == "scales_first"],
mean_text_first = means$m[means$text_first == "text_first"],
mean_scales_first = means$m[means$text_first == "scales_first"],
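# group levels sort alphabetically ("scales_first" < "text_first"), so
# diff() below is mean_text_first minus mean_scales_first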
diff = unname(diff(means$m)),
t_stat = unname(tt$statistic), p_raw = tt$p.value)
}) |>
mutate(p_fdr = p.adjust(p_raw, method = "BH"))
order_effects |>
gt() |>
tab_header(title = "Order effects: scale means by counterbalancing condition",
subtitle = "Welch's t-tests, FDR-corrected across scales") |>
fmt_number(columns = c(mean_text_first, mean_scales_first, diff, t_stat, p_raw, p_fdr), decimals = 3)
readr::write_csv(order_effects, here::here("Data", "results", "order_effects.csv"))
The dissertation’s central question for this chapter: do the language profiles of sense of purpose (Ryff PIL) and beyond-the-self purpose (Engaged Purpose Scale BTS subscale) differ, and do those differences predict different downstream wellbeing outcomes?
Four sub-analyses:
1. textProjection() — which words/phrases load on each construct.
2. Cross-prediction — a ridge model per construct (PIL, EPS-BTS), each applied to both texts for the full predicted × observed matrix.
3. Differential effects — each downstream wellbeing outcome regressed on both predicted scores simultaneously.
4. Embedding-source comparison — every wellbeing outcome predicted separately from each text; CV correlations and paired standardized errors decide which register carries more outcome-relevant signal.
library(text)
pil_projection_idx <- complete_prompt_score_idx("purposeinlife_text", "Ryff_purpose_in_life")
bts_projection_idx <- complete_prompt_score_idx("purpose_goal", "EPS_beyond_self")
proj_pil <- textProjection(
words = clean$purposeinlife_text[pil_projection_idx],
word_embeddings = text_embed$purposeinlife_text[pil_projection_idx, , drop = FALSE],
word_types_embeddings = word_type_embed$purposeinlife_text,
x = scored$Ryff_purpose_in_life[pil_projection_idx],
y = NULL,
pca = NULL,
aggregation = "mean",
split = "quartile",
word_weight_power = 1,
min_freq_words_test = 2
)
proj_bts <- textProjection(
words = clean$purpose_goal[bts_projection_idx],
word_embeddings = text_embed$purpose_goal[bts_projection_idx, , drop = FALSE],
word_types_embeddings = word_type_embed$purpose_goal,
x = scored$EPS_beyond_self[bts_projection_idx],
y = NULL,
pca = NULL,
aggregation = "mean",
split = "quartile",
word_weight_power = 1,
min_freq_words_test = 2
)
saveRDS(proj_pil, here::here("Data", "results", "projection_pil.rds"))
saveRDS(proj_bts, here::here("Data", "results", "projection_bts.rds"))
library(text)
semp_specs <- bind_rows(
h1_specs,
tribble(
~emb_var, ~outcome,
"purpose_goal", "EPS_beyond_self"
)
)
semp_train_one <- function(emb_var, outcome) {
idx <- which(
prompt_available[[emb_var]] &
!is.na(scored[[outcome]]) &
complete.cases(text_embed[[emb_var]])
)
if (length(idx) < 10) stop("Too few complete rows for ", emb_var, " -> ", outcome)
fit <- textTrainRegression(
x = text_embed[[emb_var]][idx, , drop = FALSE],
y = scored[[outcome]][idx],
penalty = penalty_grid,
multi_cores = TRUE,
seed = 2020
)
list(
emb_var = emb_var,
outcome = outcome,
model_idx = idx,
fit = fit,
lambda_chosen = fit$final_model$fit$fit$spec$args$penalty,
text_pkg_version = packageVersion("text"),
built_on = Sys.time()
)
}
semp_models <- pmap(semp_specs, ~ semp_train_one(..1, ..2))
semp_dir <- here::here("Data", "models", "semp")
dir.create(semp_dir, showWarnings = FALSE, recursive = TRUE)
walk(semp_models, ~ saveRDS(
.x,
file.path(semp_dir, paste0(.x$emb_var, "__", .x$outcome, ".rds"))
))
if (!exists("model_id_vars")) {
model_id_vars <- function(final_model) {
blueprint <- final_model$pre$mold$blueprint
id_ptype <- tryCatch(
blueprint$extra_role_ptypes[["id variable"]],
error = function(e) NULL
)
id_vars <- if (is.null(id_ptype)) character() else names(id_ptype)
recipe_cols <- tryCatch(names(blueprint$recipe$ptype), error = function(e) character())
predictor_cols <- tryCatch(names(blueprint$ptypes$predictors), error = function(e) character())
id_vars <- union(id_vars, intersect(setdiff(recipe_cols, c(predictor_cols, "y")), "id_nr"))
id_vars
}
}
if (!exists("embedding_new_data")) {
embedding_new_data <- function(final_model, emb_var, idx) {
predictor_cols <- tryCatch(
names(final_model$pre$mold$blueprint$ptypes$predictors),
error = function(e) names(text_embed[[emb_var]])
)
new_x <- as_tibble(text_embed[[emb_var]][idx, predictor_cols, drop = FALSE])
for (id_var in model_id_vars(final_model)) {
if (!id_var %in% names(new_x)) new_x[[id_var]] <- seq_along(idx)
}
new_x
}
}
predict_sem_model <- function(model_path, emb_var) {
obj <- readRDS(model_path)
out <- rep(NA_real_, nrow(scored))
idx <- which(prompt_available[[emb_var]] & complete.cases(text_embed[[emb_var]]))
out[idx] <- predict(
obj$fit$final_model,
new_data = embedding_new_data(obj$fit$final_model, emb_var, idx)
)$.pred
out
}
pil_path <- here::here("Data", "models", "semp", "purposeinlife_text__Ryff_purpose_in_life.rds")
bts_path <- here::here("Data", "models", "semp", "purpose_goal__EPS_beyond_self.rds")
pred_pil_on_pil_text <- predict_sem_model(pil_path, "purposeinlife_text")
pred_bts_on_pil_text <- predict_sem_model(bts_path, "purposeinlife_text")
pred_pil_on_bts_text <- predict_sem_model(pil_path, "purpose_goal")
pred_bts_on_bts_text <- predict_sem_model(bts_path, "purpose_goal")
cross_pred <- tibble(
ResponseId = scored$ResponseId,
obs_pil = scored$Ryff_purpose_in_life,
obs_bts = scored$EPS_beyond_self,
pred_pil_on_pil_text = pred_pil_on_pil_text,
pred_bts_on_pil_text = pred_bts_on_pil_text,
pred_pil_on_bts_text = pred_pil_on_bts_text,
pred_bts_on_bts_text = pred_bts_on_bts_text
)
cor_matrix <- cross_pred |>
select(-ResponseId) |>
cor(use = "pairwise.complete.obs")
cor_matrix |> as_tibble(rownames = "var") |> gt() |>
tab_header(title = "Cross-prediction matrix (PIL vs. EPS-BTS)")
readr::write_csv(cross_pred, here::here("Data", "results", "purpose_cross_pred.csv"))
other_outcomes <- c("Ryff_self_acceptance", "Ryff_autonomy", "Ryff_env_mastery",
"Ryff_pos_relations", "Ryff_personal_growth",
"SWLS_swls", "HILS_hils",
"PANAS_positive_affect", "PANAS_negative_affect",
"PHQ9_phq9", "GAD7_gad7")
diff_eff <- map_dfr(other_outcomes, function(y) {
  d <- scored |>
    transmute(y = .data[[y]], pil = cross_pred$pred_pil_on_pil_text,
              bts = cross_pred$pred_bts_on_bts_text) |>
    drop_na() |>
    # z-score outcome and both predictors so the coefficients are standardized,
    # as the table subtitle states
    mutate(across(everything(), ~ as.numeric(scale(.x))))
  if (nrow(d) < 10) return(NULL)
  m <- lm(y ~ pil + bts, data = d)
  broom::tidy(m, conf.int = TRUE) |>
    filter(term %in% c("pil", "bts")) |>
    mutate(outcome = y, .before = 1)
})
diff_eff |> gt() |>
tab_header(title = "Differential effects of predicted PIL vs. predicted EPS-BTS",
subtitle = "Standardized regression of each outcome on both predicted scores") |>
fmt_number(columns = c(estimate, std.error, statistic, p.value, conf.low, conf.high), decimals = 3)
diff_eff |>
ggplot(aes(estimate, fct_rev(outcome), color = term)) +
geom_point(position = position_dodge(width = 0.5), size = 2.5) +
geom_errorbarh(aes(xmin = conf.low, xmax = conf.high),
position = position_dodge(width = 0.5), height = 0) +
geom_vline(xintercept = 0, linetype = 2, color = "grey60") +
scale_color_manual(values = c(pil = brand_blue, bts = brand_red),
labels = c(pil = "Predicted PIL (sense of purpose)",
bts = "Predicted EPS-BTS (beyond-the-self)")) +
labs(title = "Differential effects on downstream wellbeing outcomes",
x = "Coefficient (95% CI)", y = NULL, color = NULL) +
theme(legend.position = "bottom")
library(text)
if (!exists("penalty_grid")) penalty_grid <- 10^seq(-6, 6)
if (!exists("brand_blue")) brand_blue <- "#013d5aff"
if (!exists("brand_red")) brand_red <- "#cd5d5bff"
if (!exists("embedding_matrix")) {
embedding_matrix <- function(emb) {
emb$texts$texts |>
select(-matches("^id(_|$)"), -matches("^id_"))
}
}
purpose_embedding_vars <- c("purposeinlife_text", "purpose_goal")
if (!exists("text_embed")) text_embed <- list()
if (!all(purpose_embedding_vars %in% names(text_embed))) {
embeddings_dir <- here::here("Data", "embeddings")
missing_embedding_vars <- setdiff(purpose_embedding_vars, names(text_embed))
text_embed[missing_embedding_vars] <- map(
missing_embedding_vars,
~ readRDS(file.path(embeddings_dir, paste0(.x, ".rds"))) |> embedding_matrix()
)
}
if (!exists("prompt_available")) {
prompt_available <- clean |>
transmute(across(any_of(purpose_embedding_vars),
~ !is.na(.x) & nzchar(trimws(.x))))
}
if (!exists("pwb_dims")) {
pwb_dims <- c("self_acceptance", "autonomy", "env_mastery",
"pos_relations", "purpose_in_life", "personal_growth")
}
purpose_embedding_sources <- tribble(
~source, ~emb_var, ~label,
"sense_of_purpose", "purposeinlife_text", "Sense of purpose text",
"beyond_self", "purpose_goal", "Beyond-the-self goal text"
)
wellbeing_outcome_candidates <- c(
paste0("Ryff_", pwb_dims),
"SWLS_swls", "HILS_hils",
"PANAS_positive_affect", "PANAS_negative_affect",
"PHQ9_phq9", "GAD7_gad7",
"EPS_meaningfulness", "EPS_commitment", "EPS_beyond_self",
"Agency_agency", "Responsibility_responsibility"
)
wellbeing_outcome_candidates <- wellbeing_outcome_candidates[wellbeing_outcome_candidates %in% names(scored)]
purpose_embedding_outcome_availability <- tibble(outcome = wellbeing_outcome_candidates) |>
mutate(
n_complete = map_int(outcome, ~ sum(!is.na(scored[[.x]]))),
included = n_complete >= 10
)
readr::write_csv(
purpose_embedding_outcome_availability,
here::here("Data", "results", "purpose_embedding_wellbeing_outcome_availability.csv")
)
wellbeing_outcomes <- purpose_embedding_outcome_availability |>
filter(included) |>
pull(outcome)
purpose_embedding_outcome_availability |>
gt() |>
tab_header(
title = "Purpose Embedding Outcome Availability",
subtitle = "Outcomes with fewer than 10 complete cases are listed but not modeled."
)
purpose_embedding_model_dir <- here::here("Data", "models", "purpose_embedding_wellbeing")
dir.create(purpose_embedding_model_dir, showWarnings = FALSE, recursive = TRUE)
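# Recover the tuned ridge penalty: try the parsnip spec slot first, then fall
# back to parsing the "penalty in final model =" line of model_description.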
extract_text_lambda <- function(fit) {
lambda <- tryCatch(fit$final_model$fit$fit$spec$args$penalty, error = function(e) NA_real_)
if (is.numeric(lambda) && length(lambda) == 1) return(lambda)
lambda_line <- tryCatch(
str_subset(fit$model_description, "^penalty in final model ="),
error = function(e) character()
)
if (length(lambda_line) == 1) return(readr::parse_number(lambda_line))
NA_real_
}
get_cv_predictions <- function(fit) {
pred_tbl <- fit$predictions
if (!is.data.frame(pred_tbl) || !all(c("predictions", "y") %in% names(pred_tbl))) {
stop("Could not find out-of-fold predictions at fit$predictions.")
}
pred_tbl |>
transmute(
id_nr = if ("id_nr" %in% names(pred_tbl)) as.integer(id_nr) else row_number(),
predictions = as.numeric(.data$predictions),
y = as.numeric(.data$y)
)
}
fit_embedding_outcome <- function(source, emb_var, outcome) {
idx <- which(
prompt_available[[emb_var]] &
!is.na(scored[[outcome]]) &
complete.cases(text_embed[[emb_var]])
)
if (length(idx) < 10) return(NULL)
model_path <- file.path(
purpose_embedding_model_dir,
paste0(source, "__", outcome, ".rds")
)
if (file.exists(model_path)) {
obj <- readRDS(model_path)
} else {
fit <- textTrainRegression(
x = text_embed[[emb_var]][idx, , drop = FALSE],
y = scored[[outcome]][idx],
penalty = penalty_grid,
multi_cores = TRUE,
seed = 2020
)
obj <- list(source = source, emb_var = emb_var, outcome = outcome,
model_idx = idx, fit = fit,
text_pkg_version = packageVersion("text"),
built_on = Sys.time())
saveRDS(obj, model_path)
}
pred_tbl <- get_cv_predictions(obj$fit)
pred_tbl <- pred_tbl |>
mutate(ResponseId = scored$ResponseId[obj$model_idx[id_nr]])
ok <- complete.cases(pred_tbl$predictions, pred_tbl$y)
ct <- cor.test(pred_tbl$predictions[ok], pred_tbl$y[ok], alternative = "two.sided")
list(
result = tibble(
source = source,
emb_var = emb_var,
outcome = outcome,
n = sum(ok),
cv_r = unname(ct$estimate),
lcl = ct$conf.int[1],
ucl = ct$conf.int[2],
p_raw = ct$p.value,
lambda = extract_text_lambda(obj$fit)
),
predictions = pred_tbl |>
transmute(source, emb_var, outcome, ResponseId,
pred = predictions, truth = y)
)
}
purpose_embedding_fits <- pmap(
crossing(purpose_embedding_sources, outcome = wellbeing_outcomes),
\(source, emb_var, label, outcome) fit_embedding_outcome(source, emb_var, outcome)
) |>
compact()
purpose_embedding_cv <- map_dfr(purpose_embedding_fits, "result") |>
left_join(purpose_embedding_sources |> select(source, label), by = "source") |>
group_by(outcome) |>
mutate(
p_fdr_within_outcome = p.adjust(p_raw, method = "BH"),
winner_cv_r = source[which.max(cv_r)]
) |>
ungroup()
purpose_embedding_predictions <- map_dfr(purpose_embedding_fits, "predictions")
paired_error_test <- function(err_pil, err_bts) {
out <- tryCatch(
t.test(err_pil, err_bts, paired = TRUE),
error = function(e) NULL
)
if (is.null(out)) {
tibble(t_stat = NA_real_, p_raw = NA_real_)
} else {
tibble(t_stat = unname(out$statistic), p_raw = out$p.value)
}
}
purpose_embedding_cv_wide <- purpose_embedding_cv |>
select(outcome, source, cv_r) |>
pivot_wider(names_from = source, values_from = cv_r) |>
mutate(delta_bts_minus_pil = beyond_self - sense_of_purpose)
purpose_embedding_error_compare <- purpose_embedding_predictions |>
select(source, outcome, ResponseId, pred, truth) |>
group_by(outcome) |>
mutate(
truth_z = as.numeric(scale(truth)),
pred_z = as.numeric(scale(pred))
) |>
ungroup() |>
mutate(sq_err = (pred_z - truth_z)^2) |>
select(source, outcome, ResponseId, sq_err) |>
pivot_wider(names_from = source, values_from = sq_err, names_prefix = "err_") |>
drop_na(err_sense_of_purpose, err_beyond_self) |>
group_by(outcome) |>
summarise(
n = n(),
mean_err_pil = mean(err_sense_of_purpose),
mean_err_bts = mean(err_beyond_self),
diff_err_pil_minus_bts = mean(err_sense_of_purpose - err_beyond_self),
paired_error_test(err_sense_of_purpose, err_beyond_self),
.groups = "drop"
) |>
mutate(
p_fdr = p.adjust(p_raw, method = "BH"),
lower_error_winner = if_else(diff_err_pil_minus_bts > 0,
"beyond_self", "sense_of_purpose")
) |>
left_join(purpose_embedding_cv_wide, by = "outcome")
scale_to_wellbeing <- map_dfr(wellbeing_outcomes, function(outcome) {
map_dfr(c(obs_pil = "Ryff_purpose_in_life", obs_bts = "EPS_beyond_self"), function(predictor) {
d <- scored |> select(x = all_of(predictor), y = all_of(outcome)) |> drop_na()
if (nrow(d) < 10) return(NULL)
ct <- cor.test(d$x, d$y)
tibble(outcome = outcome, scale_predictor = predictor, n = nrow(d),
r = unname(ct$estimate), lcl = ct$conf.int[1], ucl = ct$conf.int[2],
p_raw = ct$p.value)
}, .id = "scale_source")
})
readr::write_csv(purpose_embedding_cv,
here::here("Data", "results", "purpose_embedding_wellbeing_cv.csv"))
readr::write_csv(purpose_embedding_predictions,
here::here("Data", "results", "purpose_embedding_wellbeing_predictions.csv"))
readr::write_csv(purpose_embedding_error_compare,
here::here("Data", "results", "purpose_embedding_wellbeing_error_compare.csv"))
readr::write_csv(scale_to_wellbeing,
here::here("Data", "results", "purpose_scale_to_wellbeing_correlations.csv"))
purpose_embedding_cv |>
select(label, outcome, n, cv_r, lcl, ucl, p_raw, p_fdr_within_outcome, lambda, winner_cv_r) |>
arrange(outcome, desc(cv_r)) |>
gt(groupname_col = "outcome") |>
tab_header(
title = "Purpose Embeddings Predicting Wellbeing Outcomes",
subtitle = "Same outcome predicted separately from sense-of-purpose text and beyond-the-self goal text."
) |>
fmt_number(columns = c(cv_r, lcl, ucl, p_raw, p_fdr_within_outcome, lambda), decimals = 3)
purpose_embedding_error_compare |>
arrange(desc(diff_err_pil_minus_bts)) |>
gt() |>
tab_header(
title = "Paired Error Comparison: BTS Embeddings vs. PIL Embeddings",
subtitle = "Positive error difference means BTS embeddings had lower standardized squared error."
) |>
fmt_number(columns = c(mean_err_pil, mean_err_bts, diff_err_pil_minus_bts,
t_stat, p_raw, p_fdr, sense_of_purpose, beyond_self,
delta_bts_minus_pil),
decimals = 3)
purpose_embedding_cv |>
mutate(outcome = fct_reorder(outcome, cv_r, .fun = max, .desc = FALSE)) |>
ggplot(aes(cv_r, outcome, color = label)) +
geom_vline(xintercept = 0, linetype = 2, color = "grey65") +
geom_point(position = position_dodge(width = 0.55), size = 2.6) +
geom_errorbarh(aes(xmin = lcl, xmax = ucl),
position = position_dodge(width = 0.55), height = 0) +
scale_color_manual(values = c("Sense of purpose text" = brand_blue,
"Beyond-the-self goal text" = brand_red)) +
labs(title = "Out-of-fold prediction of wellbeing from purpose-language embeddings",
subtitle = "Higher CV correlation means the embedding source carries more outcome-relevant signal.",
x = "CV correlation between predicted and observed outcome",
y = NULL, color = NULL) +
theme_minimal(base_size = 11) +
theme(legend.position = "bottom")
purpose_embedding_error_compare |>
mutate(outcome = fct_reorder(outcome, delta_bts_minus_pil)) |>
ggplot(aes(delta_bts_minus_pil, outcome)) +
geom_vline(xintercept = 0, linetype = 2, color = "grey65") +
geom_col(aes(fill = delta_bts_minus_pil > 0), width = 0.7) +
scale_fill_manual(values = c(`TRUE` = brand_red, `FALSE` = brand_blue),
labels = c(`TRUE` = "BTS higher CV r", `FALSE` = "PIL higher CV r")) +
labs(title = "Which purpose embedding source better predicts each wellbeing outcome?",
subtitle = "Positive values favor beyond-the-self goal text; negative values favor sense-of-purpose text.",
x = "CV r difference: BTS embeddings minus PIL embeddings",
y = NULL, fill = NULL) +
theme_minimal(base_size = 11) +
theme(legend.position = "bottom")
Train a ridge model with the TRWOC-mirrored prompts (college_goals_text, transformative_text) as input and EPS-BTS as outcome. The result is the L-BAM that Chapter 3 applies to actual TRWOC interview transcripts.
The linguistic-register-matched prompts are the bridge: participants answer the same questions TRWOC interviewers asked, with their EPS-BTS scores attached, so the model learns what beyond-the-self purpose “sounds like” in that register.
library(text)
lbam_text <- paste(
replace_na(clean$college_goals_text, ""),
replace_na(clean$transformative_text, ""),
sep = " || "
)
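# A row counts as missing only if nothing but whitespace and the " || "
# separator survives, i.e. both prompts were empty.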
lbam_text_missing <- !nzchar(str_remove_all(lbam_text, "[\\s|]"))
lbam_text_for_embedding <- lbam_text
lbam_text_for_embedding[lbam_text_missing] <- "missing response"
lbam_emb <- textEmbed(
texts = tibble(text = lbam_text_for_embedding),
model = "mixedbread-ai/mxbai-embed-large-v1",
layers = -2,
aggregation_from_layers_to_tokens = "concatenate",
aggregation_from_tokens_to_texts = "mean",
keep_token_embeddings = FALSE,
batch_size = 16
)
lbam_x <- lbam_emb$texts[[1]] |>
select(-starts_with("id_"))
ok <- !is.na(scored$EPS_beyond_self) &
!lbam_text_missing &
complete.cases(lbam_x)
lbam_fit <- textTrainRegression(
x = lbam_x[ok, , drop = FALSE],
y = scored$EPS_beyond_self[ok],
penalty = penalty_grid,
multi_cores = TRUE,
seed = 2020
)
lbam_artifact <- list(
name = "EPS_BTS_TRWOC_register_v1",
outcome = "Engaged Purpose Scale — Beyond-the-Self subscale (3 items, mean of 1–5 Likert)",
prompts = c("Howard's college goals", "Howard's transformative experience"),
text_combiner = "Pasted with ' || ' separator",
embedder = "mixedbread-ai/mxbai-embed-large-v1, layer -2, mean-pooled",
text_pkg_version = packageVersion("text"),
training_rows = which(ok),
fit = lbam_fit,
built_on = Sys.time(),
n_train = sum(ok),
cv_r = lbam_fit$results$estimate
)
lbam_dir <- here::here("Data", "lbam")
dir.create(lbam_dir, showWarnings = FALSE, recursive = TRUE)
saveRDS(lbam_artifact, file.path(lbam_dir, "eps_bts_lbam.rds"))
lbam_artifact[c("name","outcome","prompts","embedder","n_train","cv_r")]
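For orientation, a minimal sketch of the Chapter 3 hand-off, assuming a trwoc tibble with a transcript_text column — both hypothetical placeholders for the actual transcript object; the embedding settings must match the artifact exactly:
# Sketch only: score new TRWOC transcripts with the frozen L-BAM.
# trwoc / transcript_text are hypothetical names — substitute the real
# Chapter 3 data. If the model blueprint expects an id_nr column, add one
# as in embedding_new_data() above.
lbam <- readRDS(here::here("Data", "lbam", "eps_bts_lbam.rds"))
trwoc_emb <- textEmbed(
  texts = tibble(text = trwoc$transcript_text),
  model = "mixedbread-ai/mxbai-embed-large-v1",
  layers = -2,
  aggregation_from_layers_to_tokens = "concatenate",
  aggregation_from_tokens_to_texts = "mean",
  keep_token_embeddings = FALSE
)
trwoc_x <- trwoc_emb$texts[[1]] |> select(-starts_with("id_"))
trwoc$pred_eps_bts <- predict(lbam$fit$final_model, new_data = trwoc_x)$.pred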
LDA on the goal/why/doing prompts to surface interpretable topics.
The topics package wraps the heavy lifting.
library(topics)
target_outcomes <- c("EPS_beyond_self", "Ryff_purpose_in_life")
purpose_corpus_df <- clean |>
transmute(
ResponseId,
text = paste(replace_na(purpose_goal, ""),
replace_na(purpose_goal_reason, ""),
replace_na(purpose_goal_action, ""), sep = " ")
) |>
filter(nchar(trimws(text)) > 0) |>
left_join(
scored |> select(ResponseId, all_of(target_outcomes)),
by = "ResponseId"
)
purpose_corpus <- purpose_corpus_df$text
cat("Corpus size:", length(purpose_corpus), "documents\n")
cat("Mean tokens / doc:", round(mean(lengths(strsplit(purpose_corpus, "\\s+"))), 1), "\n")
if (length(purpose_corpus) < 100) {
dtm <- topicsDtm(
data = tibble(text = purpose_corpus),
ngram_window = c(1, 1),
shuffle = FALSE,
seed = 2020
)
} else {
dtm <- topicsDtm(
data = tibble(text = purpose_corpus),
ngram_window = c(1, 1),
removal_mode = "frequency",
removal_rate_least = 2,
shuffle = FALSE,
seed = 2020
)
}
cat("\nDTM object structure:\n")
print(str(dtm, max.level = 1, give.attr = FALSE))
cat("Slot names:", paste(names(dtm), collapse = ", "), "\n\n")
extract_dtm_matrix <- function(dtm) {
candidates <- c("train_dtm", "dtm", "train_data", "matrix", "dtm_train",
"trainset", "DTM", "x")
for (nm in candidates) {
if (!is.null(dtm[[nm]])) {
m <- dtm[[nm]]
if (inherits(m, c("dgCMatrix","dgTMatrix","Matrix","matrix",
"DocumentTermMatrix","simple_triplet_matrix",
"data.frame","tbl_df"))) {
return(list(slot = nm, matrix = m))
}
}
}
for (nm in names(dtm)) {
m <- dtm[[nm]]
if (inherits(m, c("dgCMatrix","dgTMatrix","Matrix",
"DocumentTermMatrix","simple_triplet_matrix")) ||
(is.matrix(m) && is.numeric(m)) ||
(is.data.frame(m) && ncol(m) > 1)) {
return(list(slot = nm, matrix = m))
}
}
list(slot = NA, matrix = NULL)
}
mat_info <- extract_dtm_matrix(dtm)
if (is.null(mat_info$matrix)) {
stop("Could not locate the DTM matrix in the topicsDtm() return. ",
"Slot names returned: ", paste(names(dtm), collapse = ", "), ". ",
"Run str(dtm, max.level = 2) interactively and update extract_dtm_matrix().")
}
dtm_mat <- mat_info$matrix
cat("DTM matrix found in slot: $", mat_info$slot, "\n", sep = "")
cat("DTM dimensions: ", paste(dim(dtm_mat), collapse = " x "), "\n", sep = "")
cat("Vocab size: ", ncol(dtm_mat), "\n", sep = "")
if (ncol(dtm_mat) < 20) {
warning("DTM vocabulary is very small (<20 terms) — LDA results will be unstable. ",
"Consider lowering removal_rate_least or relaxing stopword removal.")
}
topic_model <- tryCatch(
topicsModel(
dtm = dtm,
num_topics = 8,
num_iterations = 1000,
seed = 2020
),
error = function(e) {
message("topicsModel failed: ", conditionMessage(e))
NULL
}
)
if (is.null(topic_model)) {
stop("topicsModel returned NULL. Common causes: (1) Java/MALLET not initialized — ",
"run library(rJava); .jinit() and check Sys.getenv('JAVA_HOME'); ",
"(2) DTM too sparse — current vocab = ", ncol(dtm_mat), " terms; ",
"(3) num_topics > vocabulary. Re-run topicsDtm with removal_rate_least = 0 ",
"and re-check.")
}
topic_model$topics_pkg_version <- packageVersion("topics")
theta <- topic_model$theta
if (is.null(theta)) stop("topic_model$theta is NULL — cannot compute topic-outcome correlations.")
theta_tbl <- tibble::as_tibble(theta)
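# Topic-prevalence columns are usually named t_1..t_k; fall back to all
# numeric columns if this topics version names them differently.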
candidate_topic_cols <- grep("^t_|^Topic|^topic", names(theta_tbl), value = TRUE)
if (length(candidate_topic_cols) == 0) {
candidate_topic_cols <- names(theta_tbl)[vapply(theta_tbl, is.numeric, logical(1))]
}
stopifnot(length(candidate_topic_cols) > 0)
top_terms_collapsed <- topic_model$top_terms |>
summarise(across(everything(),
~ paste(head(.x[!is.na(.x)], 10), collapse = ", "))) |>
pivot_longer(everything(), names_to = "topic", values_to = "top_terms")
compute_topic_assoc <- function(theta_tbl, topic_cols, outcome_vec) {
map_dfr(topic_cols, function(tc) {
ok <- complete.cases(theta_tbl[[tc]], outcome_vec)
if (sum(ok) < 5) return(NULL)
ct <- cor.test(theta_tbl[[tc]][ok], outcome_vec[ok])
tibble(topic = tc,
n = sum(ok),
estimate = unname(ct$estimate),
lcl = ct$conf.int[1],
ucl = ct$conf.int[2],
p_value = ct$p.value)
}) |>
mutate(p_fdr = p.adjust(p_value, method = "fdr"),
direction = case_when(
p_fdr < 0.05 & estimate > 0 ~ "↑ positive",
p_fdr < 0.05 & estimate < 0 ~ "↓ negative",
TRUE ~ "n.s."
))
}
render_ranked_table <- function(ranked, outcome_label) {
ranked |>
select(topic, direction, estimate, lcl, ucl, p_value, p_fdr, n, top_terms) |>
gt() |>
tab_header(
title = paste0("LDA topics ranked by association with ", outcome_label),
subtitle = "Pearson r between topic prevalence (theta) and outcome; FDR-adjusted p."
) |>
fmt_number(columns = c(estimate, lcl, ucl, p_value, p_fdr), decimals = 3) |>
data_color(
columns = direction,
fn = function(x) {
dplyr::case_when(
x == "↑ positive" ~ "#cfe8d6",
x == "↓ negative" ~ "#f4cdcd",
TRUE ~ "#f0f0f0"
)
}
)
}
topic_assoc_by_outcome <- list()
ranked_by_outcome <- list()
for (oc in target_outcomes) {
outcome_vec <- purpose_corpus_df[[oc]]
stopifnot(length(outcome_vec) == nrow(theta_tbl))
ta <- compute_topic_assoc(theta_tbl, candidate_topic_cols, outcome_vec)
rk <- ta |>
left_join(top_terms_collapsed, by = "topic") |>
arrange(desc(estimate))
topic_assoc_by_outcome[[oc]] <- ta
ranked_by_outcome[[oc]] <- rk
cat("\n\n### Topics × ", oc, "\n\n", sep = "")
print(render_ranked_table(rk, oc))
readr::write_csv(
rk,
here::here("Data", "results",
paste0("topics_purpose_ranked__", oc, ".csv"))
)
}
Per-topic word clouds — try topicsPlot() first; if it fails (version drift), fall back to ggplot scatter clouds built directly from topic_model$phi. Each topic’s header annotates its sign and effect size against each target outcome.
library(topics)
topic_plots <- tryCatch(
topicsPlot(model = topic_model, overview_plot = FALSE),
error = function(e) { message("topicsPlot failed: ", conditionMessage(e)); NULL }
)
manual_word_cloud <- function(model, topic_name, n = 30) {
phi <- model$phi
if (is.null(phi)) return(NULL)
phi_mat <- as.matrix(phi)
if (topic_name %in% rownames(phi_mat)) {
probs <- phi_mat[topic_name, ]
words <- colnames(phi_mat)
} else if (topic_name %in% colnames(phi_mat)) {
probs <- phi_mat[, topic_name]
words <- rownames(phi_mat)
} else {
j <- match(topic_name, paste0("t_", seq_len(min(dim(phi_mat)))))
if (is.na(j)) return(NULL)
if (nrow(phi_mat) >= ncol(phi_mat)) {
probs <- phi_mat[, j]; words <- rownames(phi_mat)
} else {
probs <- phi_mat[j, ]; words <- colnames(phi_mat)
}
}
ord <- order(probs, decreasing = TRUE)[seq_len(min(n, length(probs)))]
d <- tibble(term = words[ord], prob = probs[ord])
# which()[1] returns NA (not NULL) for an unmatched name, so %||% alone would
# pass NA to set.seed(); guard explicitly, and use the model argument rather
# than the global topic_model
seed_idx <- which(names(model$theta) == topic_name)[1]
set.seed(if (is.na(seed_idx)) 1 else seed_idx)
d$x <- runif(nrow(d), -1, 1)
d$y <- runif(nrow(d), -1, 1)
ggplot(d, aes(x, y, label = term, size = prob)) +
geom_text(check_overlap = TRUE, color = brand_blue) +
scale_size(range = c(3, 10)) +
theme_void(base_size = 11) +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5))
}
topic_names <- candidate_topic_cols
build_topic_annotation <- function(nm, ranked_by_outcome) {
parts <- character(0)
for (oc in names(ranked_by_outcome)) {
row <- ranked_by_outcome[[oc]] |> filter(topic == nm)
if (nrow(row) == 1) {
parts <- c(parts,
sprintf("%s: %s (r = %.3f, p_fdr = %.3f)",
oc, row$direction, row$estimate, row$p_fdr))
}
}
if (length(parts) == 0) "" else paste0(" · ", paste(parts, collapse = " · "))
}
for (nm in topic_names) {
ann <- build_topic_annotation(nm, ranked_by_outcome)
cat("\n#### ", nm, ann, "\n\n", sep = "")
rendered <- FALSE
if (!is.null(topic_plots) && nm %in% names(topic_plots)) {
obj <- topic_plots[[nm]]
if (inherits(obj, c("ggplot","gg","patchwork"))) {
print(obj); rendered <- TRUE
} else if (is.list(obj)) {
for (el in obj) {
if (inherits(el, c("ggplot","gg","patchwork"))) {
print(el); rendered <- TRUE
}
}
}
}
if (!rendered) {
fallback <- manual_word_cloud(topic_model, nm, n = 30)
if (!is.null(fallback)) {
print(fallback +
labs(title = paste0(nm, ann)))
} else {
cat("\n*Could not render word cloud for ", nm, ".*\n\n", sep = "")
}
}
}
Word-level supervised projection — for the same purpose corpus, project word-type embeddings onto the difference vector between high and low quartiles of (a) EPS_beyond_self and (b) Ryff_purpose_in_life. Words on the right pole of each plot are most associated with HIGH scale scores; words on the left, with LOW.
library(text)
emb_purpose_goal <- readRDS(here::here("Data", "embeddings", "purpose_goal.rds"))
emb_purpose_in_life <- readRDS(here::here("Data", "embeddings", "purposeinlife_text.rds"))
embedding_matrix <- function(emb) {
m <- emb$texts$texts |> dplyr::select(-dplyr::any_of("id_texts"))
as.matrix(m)
}
word_type_matrix <- function(emb) {
wt <- emb$word_types
if (is.list(wt) && "texts" %in% names(wt)) wt$texts else wt
}
run_projection <- function(prompt_var, outcome_var, emb_obj) {
txt <- clean[[prompt_var]]
ok <- !is.na(txt) & nzchar(trimws(txt)) &
!is.na(scored[[outcome_var]])
txt_mat <- embedding_matrix(emb_obj)
ok <- ok & complete.cases(txt_mat)
textProjection(
words = txt[ok],
word_embeddings = txt_mat[ok, , drop = FALSE],
word_types_embeddings = word_type_matrix(emb_obj),
x = scored[[outcome_var]][ok],
pca = NULL,
aggregation = "mean",
split = "quartile",
word_weight_power = 1,
min_freq_words_test = 2
)
}
proj_bts_inline <- run_projection("purpose_goal", "EPS_beyond_self", emb_purpose_goal)
proj_pil_inline <- run_projection("purposeinlife_text", "Ryff_purpose_in_life", emb_purpose_in_life)
cat("\n#### Beyond-the-self purpose (EPS-BTS) — high vs low pole\n\n")
print(textProjectionPlot(
word_data = proj_bts_inline,
k_n_words_to_test = FALSE,
min_freq_words_test = 2,
plot_n_word_extreme = 10,
plot_n_word_frequency = 5,
plot_n_words_middle = 5,
plot_n_words_p = 5,
title_top = "EPS-BTS: high pole right (high score), low pole left"
))
cat("\n#### Sense of purpose (Ryff PIL) — high vs low pole\n\n")
print(textProjectionPlot(
word_data = proj_pil_inline,
k_n_words_to_test = FALSE,
min_freq_words_test = 2,
plot_n_word_extreme = 10,
plot_n_word_frequency = 5,
plot_n_words_middle = 5,
plot_n_words_p = 5,
title_top = "Ryff PIL: high pole right (high score), low pole left"
))
dir.create(here::here("Data", "results"), showWarnings = FALSE, recursive = TRUE)
saveRDS(proj_bts_inline,
here::here("Data", "results", "projection_bts_inline.rds"))
saveRDS(proj_pil_inline,
here::here("Data", "results", "projection_pil_inline.rds"))
topics_dir <- here::here("Data", "topics")
dir.create(topics_dir, showWarnings = FALSE, recursive = TRUE)
saveRDS(list(model = topic_model,
theta = theta_tbl,
topic_assoc_by_outcome = topic_assoc_by_outcome,
ranked_by_outcome = ranked_by_outcome,
corpus_df = purpose_corpus_df,
target_outcomes = target_outcomes),
file.path(topics_dir, "purpose_lda_k8.rds"))
The pipeline writes:

- Data/wellbeing_clean.csv — analytic sample after exclusions
- Data/wellbeing_scored.csv — clean + scale scores joined
- Data/embeddings/<prompt>.rds — per-prompt mixedbread embeddings
- Data/models/pilot/*.rds — pilot 5-fold CV ridge models
- Data/models/semp/*.rds — SEMP-frozen ridge models (when N=500)
- Data/results/projection_*.rds, purpose_cross_pred.csv — Interest #1 outputs
- Data/lbam/eps_bts_lbam.rds — L-BAM artifact for Chapter 3
- Data/topics/purpose_lda_k8.rds — LDA model

Chapter 3 (R Projects/The Real World of College/) reads Data/lbam/eps_bts_lbam.rds and applies it to TRWOC transcripts.
What this Rmd will produce on the current pilot CSV:

- The eval = FALSE chunks (embeddings → models → Interest #1 → Interest #2 → topics) are blocked on having mixedbread embeddings extracted. Flip eval = TRUE and run when ready, but be realistic about pilot N for the modeling sections.

What the Rmd CAN’T exercise on pilot (data-design issues, not pipeline issues):

- Order effects (condition is NaN for all real respondents).