1 Overview

Master analysis Rmd for the wellbeing paper preregistration AND the dissertation Chapter 2 cuts. It reads the raw Qualtrics export (Data/wellbeing.csv), applies exclusions, computes scale scores, runs the prereg’s H1–H5 plus order effects, and then runs the dissertation-specific Interest #1 (PIL vs. EPS-BTS purpose comparison) and Interest #2 (L-BAM for TRWOC transfer).

Pipeline-to-prereg map:

Section | Prereg item | Notes
Setup → Reliability | Indices, transformations | Reverse-keys per Ryff/BFI scoring keys
Embeddings | “Convergent validity” embedding spec | Layer 23, mean-pooled, both text + word_type aggregations
H1 train | H1 confirmatory CV | Penalty grid 10^seq(-6, 6), FDR-corrected, lambda boundary check
H1 disattenuated | H1 secondary | r / sqrt(α_observed)
H1 prospective | H1 sequential evaluation | Frozen model on N=100; benchmark: 95% CI lower bound on r ≥ .45
H1 pooled panel | Setup for H2–H5 | CV preds (N=400) + prospective preds (N=100)
H2 nomological | H2 | Two-tailed Pearson, FDR within predicted dim
H3 external | H3 | i–iv one-tailed with FDR; exploratory two-tailed; religious commitment = Q202 ≥ 3
H4 face validity | H4 | textProjection + LDA per construct + rater export packet
H5 discriminant | H5 + Brendan’s mixed-effects layer | 6×6 corr matrix (observed + disattenuated); lmer on z-scored sq err with emmeans for 30 contrasts; per-construct paired t-tests
Order effects | “Other planned analyses” | t-tests on scale means by condition
Interest #1 | Dissertation Ch2 | Purpose-language comparison
Interest #2 | Dissertation Ch2 | L-BAM build

The launch began 2026-04-27. All earlier responses (Nov 2025 – Apr 2026 dev/pilot) are excluded by the pre-launch date filter in Section 5. On the current launch batch (N=40), the prereg analyses whose chunks are set to eval = FALSE are blocked on:

  • Survey republish. All 40 launch responses are v1_pre_republish, so they are missing condition, ethnicity (Q190), employment (Q199), religious affiliation (Q200), relationship status (Q201), and religion importance (Q202). Until post-republish responses come in, H3iv (religious commitment), the H3 exploratory tests (relationship/employment), H2/H3 ethnicity reporting, and the order-effects analyses cannot run.
  • N for SEMP. The H1 train/prospective split assumes a wave column on clean (1 = N=400 train, 2 = N=100 held-out). Add that upstream once wave 2 is collected; see the sketch after this list.
  • Embeddings. The embedding chunk is eval = FALSE; flip it when ready (10–30+ min per prompt).
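
A minimal sketch of that upstream wave derivation, assuming wave membership can be keyed off RecordedDate (wave2_start is a hypothetical placeholder, not a real milestone):

# Hypothetical sketch: belongs in the Section 6 cleaning chunk once wave 2 opens.
# wave2_start is a placeholder date; replace with the actual wave-2 launch date.
wave2_start <- as.Date("2026-06-01")
clean <- clean |>
  mutate(wave = if_else(as.Date(RecordedDate) < wave2_start, 1L, 2L))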

2 Setup

library(tidyverse)
library(here)
library(janitor)
library(psych)
library(lavaan)
library(broom)
library(broom.mixed)
library(scales)
library(knitr)
library(gt)
library(lme4)
library(lmerTest)
library(emmeans)

results_dir <- here::here("Data", "results")
dir.create(results_dir, showWarnings = FALSE, recursive = TRUE)

theme_set(theme_minimal(base_size = 12))
brand_blue <- "#013d5aff"
brand_red  <- "#cd5d5bff"
update_geom_defaults("point",  list(colour = brand_blue))
update_geom_defaults("line",   list(colour = brand_blue))
update_geom_defaults("bar",    list(fill   = brand_blue))
update_geom_defaults("col",    list(fill   = brand_blue))
update_geom_defaults("density",list(colour = brand_blue, fill = brand_blue, alpha = 0.3))

3 Data import

Qualtrics CSV exports include two metadata header rows above the actual data. We strip them, retype, and keep the original column names.

strip_qualtrics_header <- function(path) {
  readr::read_csv(path, show_col_types = FALSE, na = c("", "NA")) |>
    slice(-(1:2))
}

dat <- strip_qualtrics_header(here::here("Data", "wellbeing.csv")) |>
  mutate(
    across(c(`Duration (in seconds)`, Progress, Finished), as.numeric),
    duration_min = `Duration (in seconds)` / 60,
    StartDate    = as.POSIXct(StartDate,    tz = "America/Los_Angeles"),
    EndDate      = as.POSIXct(EndDate,      tz = "America/Los_Angeles"),
    RecordedDate = as.POSIXct(RecordedDate, tz = "America/Los_Angeles")
  )

dat_labeled <- strip_qualtrics_header(here::here("Data", "wellbeing_labeled.csv"))

build_value_labels <- function(numeric_df, labeled_df) {
  shared <- intersect(names(numeric_df), names(labeled_df))
  out <- list()
  for (q in shared) {
    pairs <- tibble(num = as.character(numeric_df[[q]]),
                    lab = as.character(labeled_df[[q]])) |>
      filter(!is.na(num), !is.na(lab), num != lab) |>
      distinct()
    if (nrow(pairs) > 0) out[[q]] <- setNames(pairs$lab, pairs$num)
  }
  out
}

value_labels <- build_value_labels(dat, dat_labeled)

cat("Numeric rows:", nrow(dat), "  Labeled rows:", nrow(dat_labeled), "\n")
## Numeric rows: 63   Labeled rows: 63
cat("Date range:", format(min(dat$RecordedDate, na.rm = TRUE)), "to",
    format(max(dat$RecordedDate, na.rm = TRUE)), "\n")
## Date range: 2025-11-17 16:30:46 to 2026-04-27 22:01:32
cat("Questions with value labels resolved:", length(value_labels), "\n")
## Questions with value labels resolved: 102

4 Scale → item map

This is the canonical mapping from Qualtrics question codes to scales/subscales for this study.

Reverse-key status (verified against docs/survey_codebook.md, 2026-04-28):

  • SWLS (Q61, Q63, Q64) and HILS (Q94, Q95, Q96) are all reverse = TRUE. Qualtrics built the scale with “Strongly Agree” as choice 1 and “Strongly Disagree” as choice 7 — opposite of convention. The launch-batch correlation pattern confirms the flip: in raw data, SWLS/HILS correlate negatively with Ryff PWB and positively with PHQ-9/GAD-7.
  • Ryff PWB reverse keys filled in from the codebook item wording: Q175 (Env Mastery: “demands of everyday life get me down”), Q172 (Personal Growth: “not interested in trying to grow”), Q168 (Pos Relations: “close relationships have been difficult”). The other 15 items are positively worded — no reversal. Autonomy (Q161, Q178, Q177), Self-Acceptance (Q167, Q166, Q165), and Purpose in Life (Q164, Q163, Q179) all use only positively-worded items in this survey, so all three subscales have all reverse = FALSE.
  • BFI-10 reverse keys filled in from the codebook item wording per Rammstedt & John (2007): Q127 (reserved → R Extraversion), Q137 (lazy → R Conscientiousness), Q136 (relaxed → R Neuroticism), Q135 (few artistic interests → R Openness), Q133 (tends to find fault → R Agreeableness).
  • Sanity check at N=40 (with reverse keys applied): all sign expectations recovered (Ryff PIL ↔︎ SWLS = +.53, Ryff PIL ↔︎ PHQ-9 = -.55, BFI Neuroticism ↔︎ GAD-7 = +.65). Pilot alphas reasonable except Ryff Autonomy α = .39 (small-N noise on a historically lower-α subscale; should firm up at N=400).
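
The transform score_items() applies in Section 9 is reversed = likert_min + likert_max - raw. A quick worked check on a 1–7 SWLS item:

# Reverse-key worked example: raw responses 1, 2, 6, 7 on a 1-7 item flip to 7, 6, 2, 1.
(1 + 7) - c(1, 2, 6, 7)
## [1] 7 6 2 1
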
scale_map <- tribble(
  ~scale,        ~subscale,            ~item,    ~likert_min, ~likert_max, ~reverse,
  "EPS",         "meaningfulness",     "QID181",  1, 5, FALSE,
  "EPS",         "meaningfulness",     "Q189",    1, 5, FALSE,
  "EPS",         "meaningfulness",     "Q188",    1, 5, FALSE,
  "EPS",         "commitment",         "Q187",    1, 5, FALSE,
  "EPS",         "commitment",         "Q186",    1, 5, FALSE,
  "EPS",         "commitment",         "Q185",    1, 5, FALSE,
  "EPS",         "beyond_self",        "Q184",    1, 5, FALSE,
  "EPS",         "beyond_self",        "Q183",    1, 5, FALSE,
  "EPS",         "beyond_self",        "Q182",    1, 5, FALSE,
  "SWLS",        "swls",               "Q61",     1, 7, TRUE,
  "SWLS",        "swls",               "Q63",     1, 7, TRUE,
  "SWLS",        "swls",               "Q64",     1, 7, TRUE,
  "HILS",        "hils",               "Q94",     1, 7, TRUE,
  "HILS",        "hils",               "Q95",     1, 7, TRUE,
  "HILS",        "hils",               "Q96",     1, 7, TRUE,
  "Ryff",        "autonomy",           "Q161",    1, 6, FALSE,
  "Ryff",        "autonomy",           "Q178",    1, 6, FALSE,
  "Ryff",        "autonomy",           "Q177",    1, 6, FALSE,
  "Ryff",        "env_mastery",        "Q176",    1, 6, FALSE,
  "Ryff",        "env_mastery",        "Q175",    1, 6, TRUE,
  "Ryff",        "env_mastery",        "Q174",    1, 6, FALSE,
  "Ryff",        "personal_growth",    "Q173",    1, 6, FALSE,
  "Ryff",        "personal_growth",    "Q172",    1, 6, TRUE,
  "Ryff",        "personal_growth",    "Q171",    1, 6, FALSE,
  "Ryff",        "pos_relations",      "Q170",    1, 6, FALSE,
  "Ryff",        "pos_relations",      "Q169",    1, 6, FALSE,
  "Ryff",        "pos_relations",      "Q168",    1, 6, TRUE,
  "Ryff",        "self_acceptance",    "Q167",    1, 6, FALSE,
  "Ryff",        "self_acceptance",    "Q166",    1, 6, FALSE,
  "Ryff",        "self_acceptance",    "Q165",    1, 6, FALSE,
  "Ryff",        "purpose_in_life",    "Q164",    1, 6, FALSE,
  "Ryff",        "purpose_in_life",    "Q163",    1, 6, FALSE,
  "Ryff",        "purpose_in_life",    "Q179",    1, 6, FALSE,
  "PANAS",       "positive_affect",    "interested", 1, 5, FALSE,
  "PANAS",       "positive_affect",    "Q88",     1, 5, FALSE,
  "PANAS",       "positive_affect",    "Q87",     1, 5, FALSE,
  "PANAS",       "positive_affect",    "Q86",     1, 5, FALSE,
  "PANAS",       "positive_affect",    "Q85",     1, 5, FALSE,
  "PANAS",       "positive_affect",    "Q84",     1, 5, FALSE,
  "PANAS",       "positive_affect",    "Q83",     1, 5, FALSE,
  "PANAS",       "positive_affect",    "Q82",     1, 5, FALSE,
  "PANAS",       "positive_affect",    "Q81",     1, 5, FALSE,
  "PANAS",       "positive_affect",    "Q80",     1, 5, FALSE,
  "PANAS",       "negative_affect",    "Q79",     1, 5, FALSE,
  "PANAS",       "negative_affect",    "Q78",     1, 5, FALSE,
  "PANAS",       "negative_affect",    "Q77",     1, 5, FALSE,
  "PANAS",       "negative_affect",    "Q76",     1, 5, FALSE,
  "PANAS",       "negative_affect",    "Q75",     1, 5, FALSE,
  "PANAS",       "negative_affect",    "Q74",     1, 5, FALSE,
  "PANAS",       "negative_affect",    "Q73",     1, 5, FALSE,
  "PANAS",       "negative_affect",    "Q72",     1, 5, FALSE,
  "PANAS",       "negative_affect",    "Q71",     1, 5, FALSE,
  "PANAS",       "negative_affect",    "Q70",     1, 5, FALSE,
  "Agency",      "agency",             "Q89",     1, 8, FALSE,
  "Agency",      "agency",             "Q92",     1, 8, FALSE,
  "Agency",      "agency",             "Q91",     1, 8, FALSE,
  "Agency",      "agency",             "Q90",     1, 8, FALSE,
  "PHQ9",        "phq9",               "Q101",    1, 4, FALSE,
  "PHQ9",        "phq9",               "Q110",    1, 4, FALSE,
  "PHQ9",        "phq9",               "Q114",    1, 4, FALSE,
  "PHQ9",        "phq9",               "Q115",    1, 4, FALSE,
  "PHQ9",        "phq9",               "Q113",    1, 4, FALSE,
  "PHQ9",        "phq9",               "Q112",    1, 4, FALSE,
  "PHQ9",        "phq9",               "Q111",    1, 4, FALSE,
  "PHQ9",        "phq9",               "Q117",    1, 4, FALSE,
  "PHQ9",        "phq9",               "Q116",    1, 4, FALSE,
  "BFI10",       "bfi_extraversion",   "Q127",    1, 5, TRUE,
  "BFI10",       "bfi_agreeableness",  "Q128",    1, 5, FALSE,
  "BFI10",       "bfi_conscientious",  "Q137",    1, 5, TRUE,
  "BFI10",       "bfi_neuroticism",    "Q136",    1, 5, TRUE,
  "BFI10",       "bfi_openness",       "Q135",    1, 5, TRUE,
  "BFI10",       "bfi_extraversion",   "Q134",    1, 5, FALSE,
  "BFI10",       "bfi_agreeableness",  "Q133",    1, 5, TRUE,
  "BFI10",       "bfi_conscientious",  "Q132",    1, 5, FALSE,
  "BFI10",       "bfi_neuroticism",    "Q131",    1, 5, FALSE,
  "BFI10",       "bfi_openness",       "Q130",    1, 5, FALSE,
  "GAD7",        "gad7",               "Q139",    1, 4, FALSE,
  "GAD7",        "gad7",               "Q145",    1, 4, FALSE,
  "GAD7",        "gad7",               "Q144",    1, 4, FALSE,
  "GAD7",        "gad7",               "Q143",    1, 4, FALSE,
  "GAD7",        "gad7",               "Q142",    1, 4, FALSE,
  "GAD7",        "gad7",               "Q141",    1, 4, FALSE,
  "GAD7",        "gad7",               "Q140",    1, 4, FALSE,
  "Responsibility","responsibility",   "Q192",    1, 4, FALSE,
  "Responsibility","responsibility",   "Q193",    1, 4, FALSE,
  "Responsibility","responsibility",   "Q194",    1, 4, FALSE
)
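
A guard like the following is not currently in the pipeline, but would fail fast if a Qualtrics republish renames or drops a mapped item (a sketch, ours to add):

missing_items <- setdiff(scale_map$item, names(dat))
if (length(missing_items) > 0) {
  stop("scale_map items absent from the export: ",
       paste(missing_items, collapse = ", "))
}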

text_prompts <- tribble(
  ~prompt_var,             ~purpose,
  "purpose_goal",          "Goal aspiration (EPS-BTS prompt)",
  "purpose_goal_reason",   "Why this goal (purpose motivation)",
  "purpose_goal_action",   "What you're doing (commitment)",
  "selfacceptance_text",   "Ryff Self-Acceptance bidirectional",
  "confidence_text",       "Confidence/Autonomy unidirectional",
  "environmental_text",    "Ryff Environmental Mastery bidirectional",
  "posrel_text",           "Ryff Positive Relations bidirectional",
  "autonomy_text",         "Ryff Autonomy bidirectional",
  "purposeinlife_text",    "Ryff Purpose in Life bidirectional",
  "personalgrowth_text",   "Ryff Personal Growth bidirectional",
  "college_goals_text",    "TRWOC-mirrored: college goals",
  "transformative_text",   "TRWOC-mirrored: transformation"
)

scale_map |>
  count(scale, subscale) |>
  gt() |>
  tab_header(title = "Scale items in the survey")
Scale items in the survey
scale subscale n
Agency agency 4
BFI10 bfi_agreeableness 2
BFI10 bfi_conscientious 2
BFI10 bfi_extraversion 2
BFI10 bfi_neuroticism 2
BFI10 bfi_openness 2
EPS beyond_self 3
EPS commitment 3
EPS meaningfulness 3
GAD7 gad7 7
HILS hils 3
PANAS negative_affect 10
PANAS positive_affect 10
PHQ9 phq9 9
Responsibility responsibility 3
Ryff autonomy 3
Ryff env_mastery 3
Ryff personal_growth 3
Ryff pos_relations 3
Ryff purpose_in_life 3
Ryff self_acceptance 3
SWLS swls 3

5 Data quality flagging

Build per-respondent flags. We flag rather than auto-drop so the exclusion rules are visible.

attention_check_var <- "Q125"
attention_correct   <- 5
launch_date <- as.Date("2026-04-27")

dat_text_chars <- dat |>
  rowwise() |>
  mutate(
    text_total_chars = sum(nchar(c_across(any_of(text_prompts$prompt_var))), na.rm = TRUE),
    text_n_nonempty  = sum(nchar(c_across(any_of(text_prompts$prompt_var))) > 0, na.rm = TRUE)
  ) |>
  ungroup() |>
  select(ResponseId, text_total_chars, text_n_nonempty)

# Whole-response gibberish: keyboard mashes (asdf/qwer variants), "test",
# "non"/"none", "na", with an optional trailing period.
junk_pattern <- "^(asdf|sdf|sadf|asdfd|qwer|test|none?|na)\\.?$"

dat_flagged <- dat |>
  left_join(dat_text_chars, by = "ResponseId") |>
  mutate(
    survey_version = case_when(
      !is.na(condition) | !is.na(Q202) | !is.na(Q199) | !is.na(Q200) | !is.na(Q201) ~ "v2_post_republish",
      TRUE ~ "v1_pre_republish"
    ),
    flag_pre_launch   = as.Date(RecordedDate) < launch_date,
    flag_preview      = DistributionChannel == "preview",
    flag_unfinished   = is.na(Finished) | Finished != 1 | Progress < 100,
    flag_too_fast     = duration_min < 5,
    flag_too_slow     = duration_min > 60,
    flag_attention    = !is.na(.data[[attention_check_var]]) &
                         as.numeric(.data[[attention_check_var]]) != attention_correct,
    flag_junk_text    = rowSums(across(
      any_of(text_prompts$prompt_var),
      ~ str_detect(tolower(replace_na(.x, "")), junk_pattern)
    )) >= 3,
    flag_short_text   = text_total_chars < 200,
    flag_any_exclude  = flag_pre_launch | flag_preview | flag_unfinished |
                        flag_too_fast | flag_junk_text
  )

dat_flagged |>
  count(survey_version) |>
  gt() |>
  tab_header(title = "Survey version split (detected from field presence)",
             subtitle = "v1 = collected before Qualtrics republish (missing condition + new demos); v2 = post-republish")
Survey version split (detected from field presence)
v1 = collected before Qualtrics republish (missing condition + new demos); v2 = post-republish
survey_version n
v1_pre_republish 61
v2_post_republish 2
dat_flagged |>
  summarise(
    n_total          = n(),
    n_pre_launch     = sum(flag_pre_launch),
    n_preview        = sum(flag_preview),
    n_unfinished     = sum(flag_unfinished),
    n_too_fast       = sum(flag_too_fast),
    n_too_slow       = sum(flag_too_slow),
    n_attention_fail = sum(flag_attention, na.rm = TRUE),
    n_junk_text      = sum(flag_junk_text),
    n_short_text     = sum(flag_short_text),
    n_excluded       = sum(flag_any_exclude),
    n_kept           = sum(!flag_any_exclude)
  ) |>
  pivot_longer(everything(), names_to = "criterion", values_to = "n") |>
  gt() |>
  tab_header(title = "Quality flag summary")
Quality flag summary
criterion n
n_total 63
n_pre_launch 23
n_preview 5
n_unfinished 3
n_too_fast 5
n_too_slow 4
n_attention_fail 7
n_junk_text 2
n_short_text 5
n_excluded 23
n_kept 40
completion_time_tbl <- bind_rows(
  dat_flagged |> mutate(sample = "All responses"),
  dat_flagged |> filter(!flag_any_exclude) |> mutate(sample = "Analytic sample"),
  dat_flagged |> filter(flag_any_exclude) |> mutate(sample = "Excluded")
) |>
  group_by(sample) |>
  summarise(
    n = n(),
    median_min = median(duration_min, na.rm = TRUE),
    max_min = max(duration_min, na.rm = TRUE),
    .groups = "drop"
  ) |>
  mutate(sample = factor(sample, levels = c("All responses", "Analytic sample", "Excluded"))) |>
  arrange(sample)

completion_time_tbl |>
  gt() |>
  tab_header(title = "Completion time summary",
             subtitle = "Minutes, from Qualtrics duration") |>
  fmt_number(columns = c(median_min, max_min), decimals = 1)
Completion time summary
Minutes, from Qualtrics duration
sample n median_min max_min
All responses 63 23.6 77.0
Analytic sample 40 22.1 77.0
Excluded 23 25.0 68.5
dat_flagged |>
  filter(flag_any_exclude | flag_attention | flag_short_text | flag_too_slow) |>
  select(ResponseId, DistributionChannel, Progress, duration_min,
         text_total_chars, interested,
         starts_with("flag_")) |>
  gt() |>
  tab_header(title = "Responses tripping any quality flag",
             subtitle = "Inspect before applying exclusions")
Responses tripping any quality flag
Inspect before applying exclusions
ResponseId DistributionChannel Progress duration_min text_total_chars interested flag_pre_launch flag_preview flag_unfinished flag_too_fast flag_too_slow flag_attention flag_junk_text flag_short_text flag_any_exclude
R_3Bu1hK1zYqDsM86 preview 100 0.350 0 NA TRUE TRUE FALSE TRUE FALSE FALSE FALSE TRUE TRUE
R_1CE0QON8Nm8OWUy anonymous 100 15.200 429 4 TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
R_7TKTAsrbYYfbBoL anonymous 100 7.550 249 3 TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
R_8LeN190VWzrOSVH anonymous 100 24.467 354 3 TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
R_214npk8n9eAFLQW anonymous 100 37.233 822 4 TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
R_1YxhmhI9ebtfKen anonymous 100 25.033 513 2 TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
R_5HxEq14dgBns2FH preview 100 1.467 16 NA TRUE TRUE FALSE TRUE FALSE FALSE FALSE TRUE TRUE
R_34d4np5PTGcD2Zj anonymous 100 13.900 1096 NA TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
R_9r6Or7W8GYnYreN anonymous 100 21.200 1936 4 TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
R_2FXVfJYbE5r4mUV anonymous 100 25.883 2785 2 TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
R_6KuBlG28QSUOAHu anonymous 100 25.583 2949 3 TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
R_6ISePjAjPLwhylk anonymous 100 27.633 2152 3 TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
R_5WHLTgLvBtEjv8t anonymous 100 40.983 2491 3 TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
R_5Cd78srxQDU4SlN anonymous 100 32.867 4673 4 TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
R_7fxtcq5udPpmGOL anonymous 100 68.467 2741 2 TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
R_6HiaYG3kQ6n0KzZ anonymous 100 67.050 3661 4 TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
R_1jNueOqRkJAMTot anonymous 100 34.267 3408 4 TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
R_57h5OopJnvxY5gd anonymous 23 8.333 316 NA TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE
R_5sBVbjYpQJoC01z anonymous 24 34.333 2731 NA TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE
R_12RZjVmrsd0zOOV anonymous 27 30.883 2096 NA TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE
R_6LY158qtkL39bPx preview 100 0.383 0 NA TRUE TRUE FALSE TRUE FALSE FALSE FALSE TRUE TRUE
R_60ZosKHqL4uamZV preview 100 4.833 45 3 TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE
R_6e4WeAXFgzGL1Zl preview 100 4.200 33 2 TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE
R_6IWrMKVp7EH3QEp anonymous 100 77.000 2397 2 FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
R_5CwiWoKtvBh3iaS anonymous 100 73.467 4828 5 FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE

6 Cleaning

The exclusion rule we apply is conservative: drop pre-launch, preview, unfinished, too-fast, and obvious junk-text responses. Attention-check failures and short-text responses are kept and flagged for sensitivity analyses.

clean <- dat_flagged |>
  filter(!flag_any_exclude) |>
  mutate(across(all_of(scale_map$item), as.numeric))

cat("Cleaned N:", nrow(clean), "\n")
## Cleaned N: 40
cat("By survey version:\n")
## By survey version:
print(table(clean$survey_version))
## 
## v1_pre_republish 
##               40
readr::write_csv(clean, here::here("Data", "wellbeing_clean.csv"))
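
For those sensitivity analyses, a stricter sample is one filter away. A sketch (not written to disk and not used downstream yet):

clean_strict <- dat_flagged |>
  filter(!flag_any_exclude, !flag_attention, !flag_short_text)
cat("Strict sensitivity N:", nrow(clean_strict), "\n")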

7 Sample flow (CONSORT-style)

flow <- tibble(
  step = c("Total responses",
           sprintf("− Pre-launch (before %s)", format(launch_date)),
           "− Preview / admin",
           "− Unfinished or <100% progress",
           "− Duration < 5 minutes",
           "− Junk text (≥3 prompts gibberish)",
           "= Analytic sample"),
  n = c(
    nrow(dat),
    -sum(dat_flagged$flag_pre_launch),
    -sum(dat_flagged$flag_preview & !dat_flagged$flag_pre_launch),
    -sum(dat_flagged$flag_unfinished & !dat_flagged$flag_pre_launch &
         !dat_flagged$flag_preview),
    -sum(dat_flagged$flag_too_fast & !dat_flagged$flag_pre_launch &
         !dat_flagged$flag_preview & !dat_flagged$flag_unfinished),
    -sum(dat_flagged$flag_junk_text & !dat_flagged$flag_pre_launch &
         !dat_flagged$flag_preview & !dat_flagged$flag_unfinished &
         !dat_flagged$flag_too_fast),
    nrow(clean)
  )
)
flow |> gt() |> tab_header(title = "Sample-flow accounting")
Sample-flow accounting
step n
Total responses 63
− Pre-launch (before 2026-04-27) -23
− Preview / admin 0
− Unfinished or <100% progress 0
− Duration < 5 minutes 0
− Junk text (≥3 prompts gibberish) 0
= Analytic sample 40

8 Demographics

demog_summary <- function(varname, var, lookup) {
  if (varname == "Q146") {
    x <- as.numeric(var)
    return(tibble(level = "Age (mean ± sd)",
                  n    = sprintf("%.1f ± %.1f", mean(x, na.rm=TRUE), sd(x, na.rm=TRUE))))
  }
  # Tolerate questions with no resolved labels (e.g., all-missing v1 fields).
  if (is.null(lookup)) lookup <- character()
  tibble(level = as.character(var)) |>
    count(level) |>
    mutate(level = case_when(
             is.na(level) ~ "Missing",
             level %in% names(lookup) ~ unname(lookup[level]),
             TRUE ~ level
           ),
           n = as.character(n))
}

demog_tbl <- bind_rows(
  demog_summary("Q146", clean$Q146, NULL) |> mutate(var = "Age"),
  demog_summary("Q147", clean$Q147, value_labels$Q147) |> mutate(var = "Gender"),
  demog_summary("Q148", clean$Q148, value_labels$Q148) |> mutate(var = "Financial situation"),
  demog_summary("Q190", clean$Q190, value_labels$Q190) |> mutate(var = "Ethnicity"),
  demog_summary("highest_education", clean$highest_education, value_labels$highest_education) |>
    mutate(var = "Education"),
  demog_summary("in_college", clean$in_college, value_labels$in_college) |>
    mutate(var = "Currently a student")
) |>
  select(var, level, n)

demog_tbl |>
  gt(groupname_col = "var") |>
  tab_header(title = "Demographic composition (analytic sample)",
             subtitle = "Value labels pulled from wellbeing_labeled.csv")
Demographic composition (analytic sample)
Value labels pulled from wellbeing_labeled.csv
level n
Age
Age (mean ± sd) 31.9 ± 11.0
Gender
Male 17
Female 22
Prefer not to say 1
Financial situation
5 - Living comfortably 7
4 - Doing well financially 5
3 - Doing okay financially 4
4 8
2 - Difficult to get by 12
1 - Finding it very difficult to get by 3
Prefer not to say 1
Ethnicity
Missing 40
Education
High school degree or GED 5
Some college but no degree 10
Associate's degree 5
Certificate or technical degree 1
Bachelor's degree 13
Some graduate or professional school 1
Completed Master's, Doctorate, or Professional degree (post-Bachelor's) 5
Currently a student
No 21
Yes 19

9 Scale construction

# Apply reverse-keying item-wise: reversed = likert_min + likert_max - raw.
# Returns the (possibly reversed) item matrix.
score_items <- function(data, items_df) {
  items <- items_df$item
  rev   <- items_df$reverse
  mins  <- items_df$likert_min
  maxs  <- items_df$likert_max
  m <- as.matrix(data[, items, drop = FALSE])
  for (j in seq_along(items)) {
    if (isTRUE(rev[j])) m[, j] <- (mins[j] + maxs[j]) - m[, j]
  }
  m
}

# Subscale score = mean of the reverse-keyed items; all-NA rows stay NA.
score_subscale <- function(data, items_df) {
  m <- score_items(data, items_df)
  out <- rowMeans(m, na.rm = TRUE)
  out[is.nan(out)] <- NA_real_
  out
}

subscales <- scale_map |>
  group_by(scale, subscale) |>
  group_split() |>
  (\(x) set_names(x, map_chr(x, ~ paste(.x$scale[1], .x$subscale[1], sep = "_"))))()

scores <- map_dfc(subscales, ~ score_subscale(clean, .x)) |>
  mutate(ResponseId = clean$ResponseId, .before = 1)

scored <- clean |>
  select(ResponseId, all_of(text_prompts$prompt_var),
         Q146, Q147, Q148, Q190, highest_education, in_college,
         condition, interested, duration_min, text_total_chars) |>
  left_join(scores, by = "ResponseId")

readr::write_csv(scored, here::here("Data", "wellbeing_scored.csv"))

scored |> select(starts_with(c("EPS_","Ryff_","SWLS_","HILS_","PANAS_","Agency_","PHQ9_","BFI10_","GAD7_"))) |>
  psych::describe() |>
  as_tibble(rownames = "scale") |>
  select(scale, n, mean, sd, min, max, skew, kurtosis) |>
  gt() |>
  tab_header(title = "Scale-score descriptives") |>
  fmt_number(columns = c(mean, sd, min, max, skew, kurtosis), decimals = 2)
Scale-score descriptives
scale n mean sd min max skew kurtosis
EPS_beyond_self 40 4.12 1.17 1.00 5.00 −1.38 0.60
EPS_commitment 40 4.24 0.71 1.67 5.00 −1.84 4.24
EPS_meaningfulness 40 4.41 0.70 1.67 5.00 −1.85 4.15
Ryff_autonomy 40 4.38 0.75 2.67 6.00 0.29 −0.39
Ryff_env_mastery 40 4.04 0.90 2.67 6.00 0.22 −0.97
Ryff_personal_growth 40 5.11 0.74 3.33 6.00 −0.81 0.20
Ryff_pos_relations 40 4.42 0.86 2.00 6.00 −0.58 0.24
Ryff_purpose_in_life 40 4.53 0.92 2.33 6.00 −0.43 −0.62
Ryff_self_acceptance 40 4.32 1.11 1.33 6.00 −0.76 0.09
SWLS_swls 40 4.47 1.44 1.67 6.67 −0.40 −1.11
HILS_hils 40 4.44 1.43 2.00 7.00 −0.21 −1.21
PANAS_negative_affect 40 2.03 0.89 1.00 4.70 0.95 0.45
PANAS_positive_affect 40 3.20 0.81 1.50 4.80 −0.25 −0.69
Agency_agency 40 5.70 1.16 3.00 7.50 −0.25 −0.63
PHQ9_phq9 40 1.86 0.71 1.00 4.00 1.13 1.01
BFI10_bfi_agreeableness 40 3.46 1.04 1.00 5.00 −0.50 −0.68
BFI10_bfi_conscientious 40 3.80 0.99 2.00 5.00 −0.33 −1.28
BFI10_bfi_extraversion 40 2.71 1.07 1.00 4.50 0.21 −1.26
BFI10_bfi_neuroticism 40 3.23 1.24 1.00 5.00 −0.36 −1.11
BFI10_bfi_openness 40 3.74 0.99 1.00 5.00 −0.58 −0.19
GAD7_gad7 40 2.04 0.84 1.00 4.00 0.57 −0.84
pwb_dims <- c("self_acceptance","autonomy","env_mastery",
              "pos_relations","purpose_in_life","personal_growth")

10 Scale reliability and descriptives

Reliability and distributional summaries per subscale. With pilot N these estimates are noisy — treat as sanity checks, not final values. Omega is McDonald’s omega total from a one-factor solution. Cronbach’s alpha is retained as the primary reliability coefficient for disattenuated correlations to match the preregistered formula; omega total is reported descriptively and can be used as a sensitivity check.

mean_or_na <- function(x) {
  x <- x[is.finite(x)]
  if (length(x) == 0) NA_real_ else mean(x)
}

omega_total <- function(m) {
  if (ncol(m) < 2 || nrow(m) < 5) return(NA_real_)
  res <- tryCatch(
    suppressWarnings(suppressMessages(psych::omega(m, nfactors = 1, plot = FALSE))),
    error = function(e) NULL
  )
  if (is.null(res) || is.null(res$omega.tot)) NA_real_ else unname(res$omega.tot)
}

scale_metrics <- function(data, items_df) {
  m <- score_items(data, items_df)
  score <- rowMeans(m, na.rm = TRUE)
  score[is.nan(score)] <- NA_real_
  ok <- complete.cases(m)
  m_ok <- m[ok, , drop = FALSE]

  alpha_res <- tryCatch(
    suppressWarnings(psych::alpha(m_ok, warnings = FALSE, check.keys = FALSE)),
    error = function(e) NULL
  )

  item_cors <- if (ncol(m_ok) >= 2 && nrow(m_ok) >= 5) {
    r <- suppressWarnings(cor(m_ok, use = "pairwise.complete.obs"))
    r[lower.tri(r)]
  } else {
    NA_real_
  }

  corrected_item_total <- if (ncol(m_ok) >= 2 && nrow(m_ok) >= 5) {
    map_dbl(seq_len(ncol(m_ok)), function(j) {
      item <- m_ok[, j]
      total_without_item <- rowMeans(m_ok[, -j, drop = FALSE])
      if (sd(item) == 0 || sd(total_without_item) == 0) {
        NA_real_
      } else {
        cor(item, total_without_item)
      }
    })
  } else {
    NA_real_
  }

  tibble(
    n_items = nrow(items_df),
    n_complete = sum(ok),
    alpha = if (is.null(alpha_res)) NA_real_ else unname(alpha_res$total$raw_alpha),
    omega_total = omega_total(m_ok),
    mean_inter_item_r = mean_or_na(item_cors),
    mean_corrected_item_total_r = mean_or_na(corrected_item_total),
    mean = mean(score, na.rm = TRUE),
    sd = sd(score, na.rm = TRUE),
    skew = psych::skew(score, na.rm = TRUE),
    kurtosis = psych::kurtosi(score, na.rm = TRUE)
  )
}

scale_descriptives_tbl <- scale_map |>
  group_by(scale, subscale) |>
  group_modify(~ scale_metrics(clean, .x)) |>
  ungroup()

alpha_tbl <- scale_descriptives_tbl

readr::write_csv(scale_descriptives_tbl,
                 here::here("Data", "results", "scale_reliability_descriptives.csv"))

reliability_gt <- scale_descriptives_tbl |>
  gt(groupname_col = "scale") |>
  cols_label(
    subscale                    = "Subscale",
    n_items                     = "k items",
    n_complete                  = "n",
    alpha                       = "α",
    omega_total                 = "ω",
    mean_inter_item_r           = "mean r (inter-item)",
    mean_corrected_item_total_r = "mean r (item-total)",
    mean                        = "M",
    sd                          = "SD",
    skew                        = "Skew",
    kurtosis                    = "Kurt"
  ) |>
  tab_spanner(label = "Reliability",
              columns = c(alpha, omega_total,
                          mean_inter_item_r, mean_corrected_item_total_r)) |>
  tab_spanner(label = "Distribution",
              columns = c(mean, sd, skew, kurtosis)) |>
  tab_header(
    title    = "Scale reliability and descriptives (pilot)",
    subtitle = "α is the primary reliability for disattenuation; ω (McDonald's total) is sensitivity. Inter-item r and corrected item-total r are unweighted means across items in the subscale. M/SD on the subscale-mean score."
  ) |>
  fmt_number(
    columns = c(alpha, omega_total, mean_inter_item_r, mean_corrected_item_total_r,
                mean, sd, skew, kurtosis),
    decimals = 2
  ) |>
  sub_missing(missing_text = "—") |>
  data_color(
    columns = alpha,
    fn = function(x) {
      dplyr::case_when(
        is.na(x)   ~ "#f0f0f0",
        x >= 0.70  ~ "#cfe8d6",
        x >= 0.50  ~ "#fff3cd",
        TRUE       ~ "#f4cdcd"
      )
    }
  )

reliability_gt
Scale reliability and descriptives (pilot)
α is the primary reliability for disattenuation; ω (McDonald's total) is sensitivity. Inter-item r and corrected item-total r are unweighted means across items in the subscale. M/SD on the subscale-mean score.
Subscale k items n α ω mean r (inter-item) mean r (item-total) M SD Skew Kurt
Agency
agency 4 40 0.79 0.80 0.49 0.61 5.70 1.16 −0.25 −0.63
BFI10
bfi_agreeableness 2 40 0.47 0.47 0.31 0.31 3.46 1.04 −0.50 −0.68
bfi_conscientious 2 40 0.44 0.52 0.35 0.35 3.80 0.99 −0.33 −1.28
bfi_extraversion 2 40 0.64 0.65 0.48 0.48 2.71 1.07 0.21 −1.26
bfi_neuroticism 2 40 0.85 0.86 0.75 0.75 3.23 1.24 −0.36 −1.11
bfi_openness 2 39 0.46 0.46 0.30 0.30 3.74 0.99 −0.58 −0.19
EPS
beyond_self 3 40 0.95 0.95 0.85 0.89 4.12 1.17 −1.38 0.60
commitment 3 40 0.66 0.71 0.39 0.49 4.24 0.71 −1.84 4.24
meaningfulness 3 40 0.88 0.88 0.72 0.77 4.41 0.70 −1.85 4.15
GAD7
gad7 7 38 0.92 0.92 0.63 0.76 2.04 0.84 0.57 −0.84
HILS
hils 3 40 0.90 0.91 0.76 0.81 4.44 1.43 −0.21 −1.21
PANAS
negative_affect 10 40 0.95 0.95 0.66 0.79 2.03 0.89 0.95 0.45
positive_affect 10 39 0.91 0.91 0.51 0.68 3.20 0.81 −0.25 −0.69
PHQ9
phq9 9 39 0.91 0.91 0.53 0.69 1.86 0.71 1.13 1.01
Responsibility
responsibility 3 0
Ryff
autonomy 3 40 0.39 0.41 0.18 0.24 4.38 0.75 0.29 −0.39
env_mastery 3 40 0.63 0.67 0.36 0.45 4.04 0.90 0.22 −0.97
personal_growth 3 39 0.53 0.56 0.29 0.35 5.11 0.74 −0.81 0.20
pos_relations 3 40 0.54 0.66 0.31 0.37 4.42 0.86 −0.58 0.24
purpose_in_life 3 39 0.74 0.78 0.51 0.58 4.53 0.92 −0.43 −0.62
self_acceptance 3 40 0.83 0.86 0.64 0.71 4.32 1.11 −0.76 0.09
SWLS
swls 3 40 0.83 0.84 0.62 0.69 4.47 1.44 −0.40 −1.11
needed_pkgs <- setdiff(c("webshot2", "chromote"), rownames(installed.packages()))
if (length(needed_pkgs)) install.packages(needed_pkgs)

gt::gtsave(reliability_gt,
           here::here("Data", "results", "scale_reliability.html"))

tryCatch(
  gt::gtsave(reliability_gt,
             here::here("Data", "results", "scale_reliability.png"),
             vwidth = 1400, vheight = 1000, expand = 5),
  error = function(e) {
    message("PNG save failed (",
            conditionMessage(e),
            "). HTML version is still available at Data/results/scale_reliability.html — ",
            "open in a browser and File → Print → Save as PDF for a slide-ready view.")
  }
)

11 CFA (Ryff 6-factor + EPS 3-factor)

ryff_model <- "
  AU  =~ Q161 + Q178 + Q177   # Autonomy (items per scale_map, Section 4)
  EM  =~ Q176 + Q175 + Q174   # Environmental Mastery
  PG  =~ Q173 + Q172 + Q171   # Personal Growth
  PR  =~ Q170 + Q169 + Q168   # Positive Relations
  SA  =~ Q167 + Q166 + Q165   # Self-Acceptance
  PIL =~ Q164 + Q163 + Q179   # Purpose in Life
"
fit_ryff <- cfa(ryff_model, data = clean, missing = "fiml", estimator = "MLR")
summary(fit_ryff, fit.measures = TRUE, standardized = TRUE)

eps_model <- "
  Mean   =~ QID181 + Q189 + Q188
  Commit =~ Q187  + Q186 + Q185
  BTS    =~ Q184  + Q183 + Q182
"
fit_eps <- cfa(eps_model, data = clean, missing = "fiml", estimator = "MLR")
summary(fit_eps, fit.measures = TRUE, standardized = TRUE)
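
Neither summary() call tabulates fit compactly. A minimal sketch for pulling the usual indices into one table (fitMeasures() is lavaan's accessor; the particular index selection is ours):

# Robust indices, since both models use estimator = "MLR".
fit_indices <- c("chisq.scaled", "df", "cfi.robust", "tli.robust",
                 "rmsea.robust", "srmr")
imap_dfr(list(Ryff_6factor = fit_ryff, EPS_3factor = fit_eps),
         ~ as_tibble_row(fitMeasures(.x, fit_indices)) |>
             mutate(model = .y, .before = 1)) |>
  gt() |>
  tab_header(title = "CFA fit indices (MLR robust)")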

12 Open-ended response descriptives

text_long <- clean |>
  select(ResponseId, all_of(text_prompts$prompt_var)) |>
  pivot_longer(-ResponseId, names_to = "prompt_var", values_to = "text") |>
  left_join(text_prompts, by = "prompt_var") |>
  mutate(
    text  = replace_na(text, ""),
    chars = nchar(text),
    words = str_count(text, "\\S+")
  )

text_long |>
  group_by(prompt_var, purpose) |>
  summarise(
    n_responded = sum(chars > 0),
    median_chars = median(chars[chars > 0]),
    median_words = median(words[chars > 0]),
    p10_words    = quantile(words[chars > 0], 0.10),
    p90_words    = quantile(words[chars > 0], 0.90),
    .groups = "drop"
  ) |>
  arrange(desc(median_chars)) |>
  gt() |>
  tab_header(title = "Open-ended response lengths") |>
  fmt_number(columns = c(median_chars, median_words, p10_words, p90_words), decimals = 0)
Open-ended response lengths
prompt_var purpose n_responded median_chars median_words p10_words p90_words
posrel_text Ryff Positive Relations bidirectional 39 379 70 33 119
selfacceptance_text Ryff Self-Acceptance bidirectional 40 312 61 29 97
environmental_text Ryff Environmental Mastery bidirectional 40 297 59 23 103
confidence_text Confidence/Autonomy unidirectional 40 292 57 25 94
purposeinlife_text Ryff Purpose in Life bidirectional 40 273 52 25 104
autonomy_text Ryff Autonomy bidirectional 40 257 50 23 98
personalgrowth_text Ryff Personal Growth bidirectional 40 233 45 17 77
transformative_text TRWOC-mirrored: transformation 21 175 33 11 65
college_goals_text TRWOC-mirrored: college goals 19 134 28 13 72
purpose_goal_action What you're doing (commitment) 40 95 18 5 36
purpose_goal_reason Why this goal (purpose motivation) 40 87 16 7 41
purpose_goal Goal aspiration (EPS-BTS prompt) 40 69 14 4 36
text_long |>
  filter(chars > 0) |>
  ggplot(aes(words)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ prompt_var, scales = "free_y") +
  labs(title = "Word count distribution per prompt",
       x = "Words", y = "Respondents")

13 Embeddings

Per the prereg: layer 23 of mixedbread-large (24-layer model), mean-pooled across tokens. We retain both text-level embeddings (for ridge prediction) and word-type embeddings (for textProjection in H4 face validity). Heavy compute (10–30+ min per prompt on CPU; longer the first time the model downloads).

library(text)

embed_one <- function(prompt_var) {
  message("Embedding: ", prompt_var)
  if (!prompt_var %in% names(clean)) stop("Missing prompt column: ", prompt_var)
  texts <- as.character(clean[[prompt_var]])
  missing_text <- is.na(texts) | !nzchar(trimws(texts))
  if (any(missing_text)) {
    message("  replacing ", sum(missing_text), " missing/blank rows with sentinel text")
    texts[missing_text] <- "missing response"
  }
  emb <- textEmbed(
    texts = tibble(texts = texts),
    model = "mixedbread-ai/mxbai-embed-large-v1",
    layers = 23,
    aggregation_from_tokens_to_texts       = "mean",
    aggregation_from_tokens_to_word_types  = "mean",
    keep_token_embeddings = FALSE,
    batch_size = 16
  )
  list(prompt_var = prompt_var, embeddings = emb)
}

embeddings_dir <- here::here("Data", "embeddings")
dir.create(embeddings_dir, showWarnings = FALSE, recursive = TRUE)

for (pv in text_prompts$prompt_var) {
  out_path <- file.path(embeddings_dir, paste0(pv, ".rds"))
  if (file.exists(out_path)) next
  res <- embed_one(pv)
  saveRDS(res$embeddings, out_path)
}

list.files(embeddings_dir)
embeddings_dir <- here::here("Data", "embeddings")
embeddings <- map(set_names(text_prompts$prompt_var),
                  ~ readRDS(file.path(embeddings_dir, paste0(.x, ".rds"))))
# Text-level embedding matrix for one prompt (drop the id bookkeeping columns).
embedding_matrix <- function(emb) {
  emb$texts$texts |>
    select(-matches("^id(_|$)"))
}
# Word-type embeddings can come back bare or nested under $texts; handle both shapes.
word_type_matrix <- function(emb) {
  wt <- emb$word_types
  if (is.list(wt) && "texts" %in% names(wt)) wt$texts else wt
}
text_embed <- map(embeddings, embedding_matrix)
word_type_embed <- map(embeddings, word_type_matrix)
prompt_available <- clean |>
  transmute(across(all_of(text_prompts$prompt_var), ~ !is.na(.x) & nzchar(trimws(.x))))
# Rows where the prompt was answered, the outcome is scored, and the embedding
# row is complete.
complete_prompt_score_idx <- function(prompt_var, outcome) {
  which(
    prompt_available[[prompt_var]] &
      !is.na(scored[[outcome]]) &
      complete.cases(text_embed[[prompt_var]])
  )
}
# Recover the id-role variables a frozen tidymodels workflow expects at
# predict() time (the text package carries an id_nr bookkeeping column).
model_id_vars <- function(final_model) {
  blueprint <- final_model$pre$mold$blueprint
  id_ptype <- tryCatch(
    blueprint$extra_role_ptypes[["id variable"]],
    error = function(e) NULL
  )
  id_vars <- if (is.null(id_ptype)) character() else names(id_ptype)

  recipe_cols <- tryCatch(names(blueprint$recipe$ptype), error = function(e) character())
  predictor_cols <- tryCatch(names(blueprint$ptypes$predictors), error = function(e) character())
  id_vars <- union(id_vars, intersect(setdiff(recipe_cols, c(predictor_cols, "y")), "id_nr"))

  id_vars
}
# Assemble new_data for a frozen model: the predictor columns it was trained
# on, plus dummy values for any id-role columns it still expects.
embedding_new_data <- function(final_model, emb_var, idx) {
  predictor_cols <- tryCatch(
    names(final_model$pre$mold$blueprint$ptypes$predictors),
    error = function(e) names(text_embed[[emb_var]])
  )
  new_x <- as_tibble(text_embed[[emb_var]][idx, predictor_cols, drop = FALSE])
  for (id_var in model_id_vars(final_model)) {
    if (!id_var %in% names(new_x)) new_x[[id_var]] <- seq_along(idx)
  }
  new_x
}
map(text_embed, dim)

14 H1 — Convergent validity (training-sample CV)

Per prereg: ridge regression via textTrainRegression on the N=400 training sample with the package nested-CV defaults (10 outer folds × inner 75/25 validation split, RMSE-minimizing with lowest_penalty tie-breaking, outer + inner stratified by outcome quartiles), with the penalty grid widened from the package default 10^seq(-3, 3) to 10^seq(-6, 6) (13 candidates). The final_model returned by the package is the parsnip workflow we’ll freeze for Prereg 2.

The six PWB constructs are tested with two-tailed Pearson correlations, FDR-corrected (Benjamini-Hochberg) across the six tests. We also report disattenuated correlations using the observed-scale reliabilities from Section 10.

library(text)

penalty_grid <- 10^seq(-6, 6)

train_idx <- if ("wave" %in% names(clean)) which(clean$wave == 1) else seq_len(nrow(clean))

h1_specs <- tribble(
  ~emb_var,              ~outcome,
  "selfacceptance_text", "Ryff_self_acceptance",
  "autonomy_text",       "Ryff_autonomy",
  "environmental_text",  "Ryff_env_mastery",
  "posrel_text",         "Ryff_pos_relations",
  "purposeinlife_text",  "Ryff_purpose_in_life",
  "personalgrowth_text", "Ryff_personal_growth"
)

h1_train_one <- function(emb_var, outcome) {
  idx <- train_idx
  y <- scored[[outcome]][idx]
  ok <- prompt_available[[emb_var]][idx] & !is.na(y) & complete.cases(text_embed[[emb_var]][idx, ])
  idx <- idx[ok]
  x <- text_embed[[emb_var]][idx, , drop = FALSE]
  y <- scored[[outcome]][idx]
  if (length(y) < 10) stop("Too few complete rows for ", emb_var, " -> ", outcome)
  fit <- textTrainRegression(
    x = x,
    y = y,
    penalty = penalty_grid,
    multi_cores = TRUE,
    seed = 2020
  )
  list(emb_var = emb_var, outcome = outcome, model_idx = idx, fit = fit,
       lambda_chosen = fit$final_model$fit$fit$spec$args$penalty)
}

h1_models <- pmap(h1_specs, ~ h1_train_one(..1, ..2))

models_dir <- here::here("Data", "models", "h1_train")
dir.create(models_dir, showWarnings = FALSE, recursive = TRUE)
walk(h1_models, ~ saveRDS(.x, file.path(models_dir,
       paste0(.x$outcome, ".rds"))))

h1_cv_pred_tbl <- function(m) {
  pred_tbl <- m$fit$predictions
  if (!is.data.frame(pred_tbl) || !all(c("predictions", "y") %in% names(pred_tbl))) {
    stop("Could not find numeric CV predictions for ", m$outcome)
  }
  id_nr <- if ("id_nr" %in% names(pred_tbl)) pred_tbl$id_nr else seq_len(nrow(pred_tbl))
  pred_tbl |>
    transmute(
      id_nr = coalesce(id_nr, row_number()),
      predictions = as.numeric(.data$predictions),
      y = as.numeric(.data$y)
    )
}

h1_lambda <- function(m) {
  lambda <- m$lambda_chosen
  if (is.numeric(lambda) && length(lambda) == 1) return(lambda)

  lambda <- m$fit$final_model$fit$fit$spec$args$penalty
  if (is.numeric(lambda) && length(lambda) == 1) return(lambda)

  lambda_line <- str_subset(m$fit$model_description, "^penalty in final model =")
  if (length(lambda_line) == 1) {
    return(readr::parse_number(lambda_line))
  }

  NA_real_
}

h1_cv <- map_dfr(h1_models, function(m) {
  pred_tbl <- h1_cv_pred_tbl(m)
  ok <- complete.cases(pred_tbl$predictions, pred_tbl$y)
  ct <- cor.test(pred_tbl$predictions[ok], pred_tbl$y[ok], alternative = "two.sided")
  tibble(outcome = m$outcome, n = sum(ok),
         cv_r = unname(ct$estimate), lcl = ct$conf.int[1], ucl = ct$conf.int[2],
         p_raw = ct$p.value, lambda = h1_lambda(m))
}) |>
  mutate(p_fdr = p.adjust(p_raw, method = "BH"))

h1_cv |>
  gt() |>
  tab_header(title = "H1 (cross-validated training sample, N=400)",
             subtitle = "Two-tailed Pearson, FDR-corrected across 6 PWB constructs") |>
  fmt_number(columns = c(cv_r, lcl, ucl, p_raw, p_fdr, lambda), decimals = 3)

boundary_check <- h1_cv |>
  mutate(at_boundary = lambda <= min(penalty_grid) | lambda >= max(penalty_grid))
boundary_check |>
  filter(at_boundary) |>
  gt() |>
  tab_header(title = "Lambda boundary check",
             subtitle = "If any rows appear, widen the penalty grid before locking the prereg amendment")

readr::write_csv(h1_cv, here::here("Data", "results", "h1_cv.csv"))
h1_cv_predictions <- map_dfc(h1_models, function(m) {
  preds <- rep(NA_real_, nrow(clean))
  pred_tbl <- h1_cv_pred_tbl(m)
  idx <- if (!is.null(m$model_idx)) m$model_idx else train_idx
  preds[idx[pred_tbl$id_nr]] <- pred_tbl$predictions
  setNames(tibble(preds), paste0("pred_", m$outcome))
}) |>
  mutate(ResponseId = clean$ResponseId, .before = 1)

readr::write_csv(h1_cv_predictions, here::here("Data", "results", "predictions_cv_train.csv"))

15 H1 — Disattenuated correlations

Disattenuation per construct uses observed-scale reliability only, treating the predicted language score as the true-score proxy. The primary preprint value uses Cronbach’s alpha to match the preregistered formula: r_disattenuated_alpha = r_observed / sqrt(alpha_outcome). McDonald’s omega total is reported as a sensitivity check: r_disattenuated_omega = r_observed / sqrt(omega_total_outcome).
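
A worked example with illustrative numbers (not results): if cv_r = .50 and the outcome’s α = .74, the disattenuated estimate is

.50 / sqrt(.74)
## [1] 0.5812382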

alpha_lookup <- alpha_tbl |>
  mutate(scale_subscale = paste(scale, subscale, sep = "_")) |>
  select(scale_subscale, alpha, omega_total)

h1_disattenuated <- h1_cv |>
  left_join(alpha_lookup, by = c("outcome" = "scale_subscale")) |>
  mutate(
    reliability_primary = "Cronbach alpha",
    r_disattenuated_alpha = cv_r / sqrt(pmax(alpha, .Machine$double.eps)),
    r_disattenuated_omega = cv_r / sqrt(pmax(omega_total, .Machine$double.eps))
  )

h1_disattenuated |>
  select(outcome, n, cv_r, alpha, omega_total,
         r_disattenuated_alpha, r_disattenuated_omega, reliability_primary) |>
  gt() |>
  tab_header(title = "H1 — observed and disattenuated correlations",
             subtitle = "Primary disattenuated estimate uses Cronbach's alpha; omega total is sensitivity.") |>
  fmt_number(columns = c(cv_r, alpha, omega_total,
                         r_disattenuated_alpha, r_disattenuated_omega), decimals = 3)

readr::write_csv(h1_disattenuated, here::here("Data", "results", "h1_disattenuated.csv"))

16 H1 — Prospective evaluation (held-out N=100)

Frozen models applied to the held-out sample. Benchmark: the lower bound of the 95% CI on r must reach .45 (meets_benchmark below).

stopifnot("Need a 'wave' column on `clean` (1 = train, 2 = held-out)" = "wave" %in% names(clean))
test_idx <- which(clean$wave == 2)

h1_prospective <- map_dfr(h1_models, function(m) {
  idx <- test_idx[prompt_available[[m$emb_var]][test_idx]]
  new_x <- embedding_new_data(m$fit$final_model, m$emb_var, idx)
  preds <- predict(m$fit$final_model, new_data = new_x)$.pred
  truth <- scored[[m$outcome]][idx]
  ok <- complete.cases(preds, truth)
  ct <- cor.test(preds[ok], truth[ok], alternative = "two.sided")
  tibble(outcome = m$outcome, n_test = sum(ok),
         prospective_r = unname(ct$estimate),
         lcl = ct$conf.int[1], ucl = ct$conf.int[2],
         p_raw = ct$p.value, lambda = h1_lambda(m))
}) |>
  mutate(p_fdr = p.adjust(p_raw, method = "BH"),
         meets_benchmark = lcl >= 0.45)

h1_prospective |>
  gt() |>
  tab_header(title = "H1 (prospective held-out sample, N=100)",
             subtitle = "Benchmark: 95% CI lower bound on r ≥ .45") |>
  fmt_number(columns = c(prospective_r, lcl, ucl, p_raw, p_fdr, lambda), decimals = 3)

h1_prospective_predictions <- map_dfc(h1_models, function(m) {
  preds <- rep(NA_real_, nrow(clean))
  idx <- test_idx[prompt_available[[m$emb_var]][test_idx]]
  new_x <- embedding_new_data(m$fit$final_model, m$emb_var, idx)
  preds[idx] <- predict(m$fit$final_model, new_data = new_x)$.pred
  setNames(tibble(preds), paste0("pred_", m$outcome))
}) |>
  mutate(ResponseId = clean$ResponseId, .before = 1)

readr::write_csv(h1_prospective, here::here("Data", "results", "h1_prospective.csv"))
readr::write_csv(h1_prospective_predictions, here::here("Data", "results", "predictions_prospective.csv"))

17 H1 sensitivity — autonomy prompt comparison

Per prereg’s “Other planned analyses”: compare the unidirectional confidence_text prompt (which mirrors Ryff’s autonomy item wording) with the dimensional autonomy_text prompt for predicting Ryff Autonomy. The confirmatory H1 result above uses autonomy_text (the registered predictor); this block reports confidence_text as a parallel sensitivity model with identical settings (penalty grid 10^seq(-6, 6), package nested-CV defaults, layer 23 mixedbread embeddings). Saved separately so it doesn’t pollute the prereg-aligned pred_Ryff_autonomy panel used for H2/H3/H5.

library(text)

idx <- complete_prompt_score_idx("confidence_text", "Ryff_autonomy")

h1_autonomy_alt <- textTrainRegression(
  x = text_embed[["confidence_text"]][idx, ],
  y = scored$Ryff_autonomy[idx],
  penalty = penalty_grid,
  multi_cores = TRUE,
  seed = 2020
)

saveRDS(h1_autonomy_alt,
        here::here("Data", "models", "h1_train", "Ryff_autonomy__confidence_text.rds"))

# Reuse Section 14's validated accessors by wrapping the sensitivity fit in
# the same list shape as h1_models.
alt_m      <- list(outcome = "Ryff_autonomy (confidence_text)",
                   fit = h1_autonomy_alt,
                   lambda_chosen = h1_autonomy_alt$final_model$fit$fit$spec$args$penalty)
alt_preds  <- h1_cv_pred_tbl(alt_m)
alt_ok     <- complete.cases(alt_preds$predictions, alt_preds$y)
alt_ct     <- cor.test(alt_preds$predictions[alt_ok], alt_preds$y[alt_ok],
                       alternative = "two.sided")
alt_lambda <- h1_lambda(alt_m)

primary_row <- h1_cv |> filter(outcome == "Ryff_autonomy")

autonomy_comparison <- tibble(
  predictor       = c("autonomy_text (primary, prereg-aligned)", "confidence_text (sensitivity)"),
  n               = c(primary_row$n, sum(alt_ok)),
  cv_r            = c(primary_row$cv_r, unname(alt_ct$estimate)),
  lcl             = c(primary_row$lcl, alt_ct$conf.int[1]),
  ucl             = c(primary_row$ucl, alt_ct$conf.int[2]),
  p_raw           = c(primary_row$p_raw, alt_ct$p.value),
  lambda          = c(primary_row$lambda, alt_lambda),
  at_lower_bound  = c(primary_row$lambda, alt_lambda) <= min(penalty_grid),
  at_upper_bound  = c(primary_row$lambda, alt_lambda) >= max(penalty_grid)
)

autonomy_comparison |>
  gt() |>
  tab_header(title = "Autonomy prompt comparison (sensitivity per prereg)",
             subtitle = "Both predict Ryff Autonomy; primary is locked, sensitivity is exploratory") |>
  fmt_number(columns = c(cv_r, lcl, ucl, p_raw, lambda), decimals = 3)

readr::write_csv(autonomy_comparison,
                 here::here("Data", "results", "h1_autonomy_prompt_comparison.csv"))

18 Pooled out-of-sample predictions for H2–H5 (N=500)

Per prereg: combine CV training predictions (N=400) with held-out prospective predictions (N=100) into a single panel of out-of-sample predicted scores per participant per construct. This panel is the predictor input for H2–H5.

# Prompt availability can differ across constructs, so keep any row with at
# least one out-of-sample prediction rather than keying on a single column.
# distinct() keeps the first row per ResponseId, so CV predictions win overlaps.
h1_pooled <- bind_rows(h1_cv_predictions, h1_prospective_predictions) |>
  filter(if_any(starts_with("pred_"), ~ !is.na(.x))) |>
  distinct(ResponseId, .keep_all = TRUE)

readr::write_csv(h1_pooled, here::here("Data", "results", "predictions_pooled.csv"))
cat("Pooled prediction panel rows:", nrow(h1_pooled), "\n")

19 H2 — Nomological validity (N=500, two-tailed)

Pearson correlations between each language-based PWB score and: (a) the five non-matched PWB observed subscales, (b) SWLS, HILS, PANAS PA (positive expected), (c) PANAS NA, PHQ-9, GAD-7 (negative expected). Two-tailed per prereg, FDR-corrected within each predicted construct.

h2_panel <- scored |>
  select(ResponseId, starts_with("Ryff_"), SWLS_swls, HILS_hils,
         PANAS_positive_affect, PANAS_negative_affect,
         PHQ9_phq9, GAD7_gad7, starts_with("BFI10_"), Agency_agency) |>
  inner_join(h1_pooled, by = "ResponseId")

h2_other_observed <- c("SWLS_swls","HILS_hils",
                       "PANAS_positive_affect","PANAS_negative_affect",
                       "PHQ9_phq9","GAD7_gad7")

h2_pairs <- crossing(pred_dim = pwb_dims, observed = c(paste0("Ryff_", pwb_dims), h2_other_observed)) |>
  filter(observed != paste0("Ryff_", pred_dim))

h2_results <- h2_pairs |>
  pmap_dfr(function(pred_dim, observed) {
    pred_col <- paste0("pred_Ryff_", pred_dim)
    df <- h2_panel |> select(p = all_of(pred_col), o = all_of(observed)) |> drop_na()
    if (nrow(df) < 10) return(NULL)
    ct <- cor.test(df$p, df$o, alternative = "two.sided")
    tibble(pred_dim = pred_dim, observed = observed,
           n = nrow(df), r = unname(ct$estimate),
           lcl = ct$conf.int[1], ucl = ct$conf.int[2], p_raw = ct$p.value)
  }) |>
  group_by(pred_dim) |>
  mutate(p_fdr = p.adjust(p_raw, method = "BH")) |>
  ungroup()

h2_results |>
  gt(groupname_col = "pred_dim") |>
  tab_header(title = "H2 — nomological validity (N=500, two-tailed, FDR within predicted dim)") |>
  fmt_number(columns = c(r, lcl, ucl, p_raw, p_fdr), decimals = 3)

readr::write_csv(h2_results, here::here("Data", "results", "h2_results.csv"))

h2_results |>
  ggplot(aes(r, fct_rev(pred_dim), color = sign(r) > 0)) +
  geom_point(size = 2) +
  geom_errorbarh(aes(xmin = lcl, xmax = ucl), height = 0) +
  geom_vline(xintercept = 0, linetype = 2, color = "grey60") +
  facet_wrap(~ observed, scales = "free_x") +
  scale_color_manual(values = c(`TRUE` = brand_blue, `FALSE` = brand_red), guide = "none") +
  labs(title = "H2 — predicted PWB ↔ other constructs",
       x = "Pearson r (95% CI)", y = NULL)

20 H3 — External criteria validity (one-tailed for confirmatory i–iv)

Confirmatory: H3i age × {EM, AU}, H3ii gender × pos relations, H3iii education × {PG, PIL}, H3iv religious commitment × PIL. Religious commitment = Q202 ≥ 3 (“Somewhat” or higher). Education is tested on the ordinal coding, with continuous and binary-at-bachelor’s codings as sensitivity checks. All confirmatory tests are one-tailed, FDR-corrected within the H3 confirmatory family. Exploratory: SES (Q148), relationship status, employment.

h3_panel <- h2_panel |>
  left_join(clean |> select(ResponseId, Q146, Q147, Q148, highest_education,
                            Q199, Q201, Q202),
            by = "ResponseId") |>
  mutate(
    age_num         = as.numeric(Q146),
    # Explicit factor levels put the hypothesized-higher group first so the
    # "greater" alternative in the one-tailed t-tests points the right way
    # (and doesn't depend on locale-specific alphabetical ordering).
    gender_binary   = factor(case_when(Q147 == "1" ~ "Male", Q147 == "2" ~ "Female",
                                       TRUE ~ NA_character_),
                             levels = c("Female", "Male")),
    ses_continuous  = as.numeric(Q148),
    edu_ordinal     = as.numeric(highest_education),
    edu_continuous  = edu_ordinal,  # currently identical to edu_ordinal; swap in a years-of-education recode for a distinct continuous sensitivity
    edu_binary_BAplus = factor(if_else(edu_ordinal >= 6, "BA+", "below_BA"),
                               levels = c("BA+", "below_BA")),
    relig_commit_binary = factor(case_when(
      as.numeric(Q202) >= 3 ~ "Religious",
      as.numeric(Q202) %in% 1:2 ~ "Not religious",
      TRUE ~ NA_character_
    ), levels = c("Religious", "Not religious")),
    relationship_partnered = case_when(
      as.numeric(Q201) %in% 3:6 ~ "Partnered",
      as.numeric(Q201) %in% c(1, 2, 7, 8, 9) ~ "Not partnered",
      TRUE ~ NA_character_
    ),
    employed_binary = case_when(
      as.numeric(Q199) %in% 1:2 ~ "Employed",
      as.numeric(Q199) %in% 3:4 ~ "Not employed",
      TRUE ~ NA_character_
    )
  )
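
Because v1 launch responses lack Q199/Q201/Q202 entirely (see the Overview blockers), it is worth checking covariate availability before reading the tables below. A quick sketch:

h3_panel |>
  summarise(across(c(relig_commit_binary, relationship_partnered, employed_binary),
                   ~ sum(!is.na(.x)), .names = "n_{.col}"))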

confirmatory_one_tailed <- function(predicted_col, x, alternative, kind = c("cor","ttest")) {
  kind <- match.arg(kind)
  d <- tibble(p = h3_panel[[predicted_col]], x = x) |> drop_na()
  if (nrow(d) < 10) return(NULL)
  if (kind == "cor") {
    ct <- cor.test(d$p, d$x, alternative = alternative)
    tibble(stat = "Pearson r", estimate = unname(ct$estimate),
           lcl = ct$conf.int[1], ucl = ct$conf.int[2], p_raw = ct$p.value, n = nrow(d))
  } else {
    d$x <- factor(d$x)
    tt <- t.test(p ~ x, data = d, alternative = alternative, var.equal = FALSE)
    means <- d |> group_by(x) |> summarise(m = mean(p), .groups = "drop")
    pooled_sd <- sqrt(((sum(d$x == levels(d$x)[1])-1)*var(d$p[d$x == levels(d$x)[1]]) +
                       (sum(d$x == levels(d$x)[2])-1)*var(d$p[d$x == levels(d$x)[2]])) /
                      (nrow(d) - 2))
    cohens_d <- (means$m[1] - means$m[2]) / pooled_sd
    tibble(stat = "Welch t", estimate = unname(tt$statistic),
           lcl = NA_real_, ucl = NA_real_, p_raw = tt$p.value, n = nrow(d),
           cohens_d = cohens_d, mean_diff = means$m[1] - means$m[2])
  }
}

h3_confirmatory <- bind_rows(
  confirmatory_one_tailed("pred_Ryff_env_mastery",    h3_panel$age_num, "greater", "cor")  |> mutate(test = "H3i: age × EM"),
  confirmatory_one_tailed("pred_Ryff_autonomy",       h3_panel$age_num, "greater", "cor")  |> mutate(test = "H3i: age × AU"),
  confirmatory_one_tailed("pred_Ryff_pos_relations",  h3_panel$gender_binary, "greater", "ttest") |> mutate(test = "H3ii: gender × PR (women > men)"),
  confirmatory_one_tailed("pred_Ryff_personal_growth", h3_panel$edu_ordinal, "greater", "cor") |> mutate(test = "H3iii: edu × PG (ordinal)"),
  confirmatory_one_tailed("pred_Ryff_purpose_in_life", h3_panel$edu_ordinal, "greater", "cor") |> mutate(test = "H3iii: edu × PIL (ordinal)"),
  confirmatory_one_tailed("pred_Ryff_purpose_in_life", h3_panel$relig_commit_binary, "greater", "ttest") |> mutate(test = "H3iv: religious commitment × PIL")
) |>
  mutate(p_fdr = p.adjust(p_raw, method = "BH")) |>
  select(test, n, stat, estimate, cohens_d, mean_diff, lcl, ucl, p_raw, p_fdr)

h3_confirmatory |>
  gt() |>
  tab_header(title = "H3 confirmatory (one-tailed, FDR-corrected within H3 family)") |>
  fmt_number(columns = c(estimate, cohens_d, mean_diff, lcl, ucl, p_raw, p_fdr), decimals = 3)

h3_edu_sensitivity <- bind_rows(
  confirmatory_one_tailed("pred_Ryff_personal_growth",  h3_panel$edu_continuous, "greater", "cor") |> mutate(test = "edu × PG (continuous)"),
  confirmatory_one_tailed("pred_Ryff_purpose_in_life",  h3_panel$edu_continuous, "greater", "cor") |> mutate(test = "edu × PIL (continuous)"),
  confirmatory_one_tailed("pred_Ryff_personal_growth",  h3_panel$edu_binary_BAplus, "greater", "ttest") |> mutate(test = "edu × PG (binary BA+)"),
  confirmatory_one_tailed("pred_Ryff_purpose_in_life",  h3_panel$edu_binary_BAplus, "greater", "ttest") |> mutate(test = "edu × PIL (binary BA+)")
)

h3_edu_sensitivity |>
  gt() |>
  tab_header(title = "H3 education sensitivity analyses") |>
  fmt_number(columns = c(estimate, cohens_d, mean_diff, lcl, ucl, p_raw), decimals = 3)

h3_exploratory_specs <- expand_grid(
  predicted_col = paste0("pred_Ryff_", pwb_dims),
  pair = list(
    list(label = "SES (Q148)",        x = "ses_continuous",         kind = "cor",   alt = "two.sided"),
    list(label = "Partnered",         x = "relationship_partnered", kind = "ttest", alt = "two.sided"),
    list(label = "Employed",          x = "employed_binary",        kind = "ttest", alt = "two.sided")
  )
) |> unnest_wider(pair)

h3_exploratory <- pmap_dfr(h3_exploratory_specs, function(predicted_col, label, x, kind, alt) {
  res <- confirmatory_one_tailed(predicted_col, h3_panel[[x]], alt, kind)
  if (is.null(res)) return(NULL)
  res |> mutate(predicted = predicted_col, covariate = label)
})

h3_exploratory |>
  gt(groupname_col = "predicted") |>
  tab_header(title = "H3 exploratory (two-tailed, no FDR)") |>
  fmt_number(columns = c(estimate, cohens_d, mean_diff, lcl, ucl, p_raw), decimals = 3)

readr::write_csv(h3_confirmatory, here::here("Data", "results", "h3_confirmatory.csv"))
readr::write_csv(h3_edu_sensitivity, here::here("Data", "results", "h3_edu_sensitivity.csv"))
readr::write_csv(h3_exploratory, here::here("Data", "results", "h3_exploratory.csv"))

21 H4 — Face validity (textProjection + LDA per construct)

For each PWB dimension: (a) a supervised projection of word-type embeddings onto the direction vector that separates low from high observed scale scores, and (b) LDA topics from the dimension’s open-ended responses. Outputs are exported as tables and plots for two independent raters (blind to hypotheses, per appendix) to compare qualitatively against Ryff’s theoretical definitions.

library(text)
library(topics)

h4_specs <- h1_specs

h4_projection <- map(seq_len(nrow(h4_specs)), function(i) {
  pv <- h4_specs$emb_var[i]; oc <- h4_specs$outcome[i]
  idx <- complete_prompt_score_idx(pv, oc)
  proj <- textProjection(
    words           = clean[[pv]][idx],
    word_embeddings = text_embed[[pv]][idx, , drop = FALSE],
    word_types_embeddings = word_type_embed[[pv]],
    x = scored[[oc]][idx],
    pca = NULL,
    aggregation = "mean",
    split = "quartile",
    word_weight_power = 1,
    min_freq_words_test = 2
  )
  list(prompt = pv, outcome = oc, projection = proj)
})

proj_dir <- here::here("Data", "results", "h4_projections")
dir.create(proj_dir, showWarnings = FALSE, recursive = TRUE)
walk(h4_projection, ~ saveRDS(.x, file.path(proj_dir, paste0(.x$outcome, ".rds"))))

h4_top_words <- map_dfr(h4_projection, function(p) {
  df <- p$projection$word_data
  bind_rows(
    df |> arrange(desc(dot.x)) |> head(15) |> mutate(pole = "high"),
    df |> arrange(dot.x)        |> head(15) |> mutate(pole = "low")
  ) |>
    select(words, dot.x, p_values_dot.x, cohens_d.x, n, n_g1.x, n_g2.x, pole) |>
    mutate(outcome = p$outcome, .before = 1)
})

h4_top_words |>
  gt(groupname_col = c("outcome","pole")) |>
  tab_header(title = "H4 — top projected words/phrases per dimension (high vs. low observed score)") |>
  fmt_number(columns = c(dot.x, p_values_dot.x, cohens_d.x), decimals = 3)

set.seed(2020)
h4_rater_packet <- map_dfr(seq_len(nrow(h4_specs)), function(i) {
  oc <- h4_specs$outcome[i]; pv <- h4_specs$emb_var[i]
  scored_long <- tibble(text = clean[[pv]], score = scored[[oc]]) |>
    drop_na() |> mutate(quartile = ntile(score, 4))

  q1 <- scored_long |> filter(quartile == 1)
  q4 <- scored_long |> filter(quartile == 4)

  bind_rows(
    q1 |> slice_sample(n = min(5, nrow(q1))) |> mutate(pole = "low quartile"),
    q4 |> slice_sample(n = min(5, nrow(q4))) |> mutate(pole = "high quartile")
  ) |> mutate(outcome = oc, .before = 1)
})

readr::write_csv(h4_rater_packet, here::here("Data", "results", "h4_rater_packet.csv"))
readr::write_csv(h4_top_words, here::here("Data", "results", "h4_top_words.csv"))

Topic-count selection — fit LDA at several k per construct, pick by mean per-topic coherence. At pilot N=40 these estimates are unstable; treat as an ordering, not a precise pick. Re-run at full N to lock k per construct.

library(topics)

candidate_k <- c(4, 6, 8, 10, 12, 15)

build_dtm_for_construct <- function(prompt_var) {
  texts <- clean[[prompt_var]][prompt_available[[prompt_var]]]
  topicsDtm(
    data = tibble(text = texts),
    ngram_window = c(1, 1),
    removal_mode = "frequency",
    removal_rate_least = 2,
    shuffle = FALSE,
    seed = 2020
  )
}

h4_dtm <- map(set_names(h4_specs$emb_var), build_dtm_for_construct)

h4_k_diagnostic <- map_dfr(seq_len(nrow(h4_specs)), function(i) {
  pv <- h4_specs$emb_var[i]; oc <- h4_specs$outcome[i]
  dtm <- h4_dtm[[pv]]
  map_dfr(candidate_k, function(k) {
    mod <- topicsModel(dtm = dtm, num_topics = k, num_iterations = 1500, seed = 2020)
    tibble(
      prompt           = pv,
      outcome          = oc,
      k                = k,
      mean_coherence   = mean(mod$coherence,   na.rm = TRUE),
      median_coherence = median(mod$coherence, na.rm = TRUE),
      min_coherence    = min(mod$coherence,    na.rm = TRUE),
      prevalence_sd    = sd(mod$prevalence,    na.rm = TRUE)
    )
  })
})

h4_k_diagnostic |>
  ggplot(aes(k, mean_coherence)) +
  geom_line(color = brand_blue) +
  geom_point(color = brand_blue, size = 2) +
  facet_wrap(~ outcome, scales = "free_y") +
  labs(title = "H4 — LDA topic-count selection",
       subtitle = "Mean per-topic coherence across candidate k. Pick the elbow / peak per construct.",
       x = "Number of topics (k)", y = "Mean coherence") +
  theme_minimal(base_size = 11)

h4_k_chosen <- h4_k_diagnostic |>
  group_by(outcome) |>
  slice_max(mean_coherence, n = 1, with_ties = FALSE) |>
  ungroup() |>
  select(outcome, prompt, k_chosen = k, mean_coherence)

h4_k_chosen |>
  gt() |>
  tab_header(title = "Chosen k per construct (peak coherence)",
             subtitle = "Override manually below if visual inspection prefers a different k.") |>
  fmt_number(columns = mean_coherence, decimals = 3)

readr::write_csv(h4_k_diagnostic, here::here("Data", "results", "h4_topic_k_diagnostic.csv"))
readr::write_csv(h4_k_chosen,     here::here("Data", "results", "h4_topic_k_chosen.csv"))

Final LDA fit at the chosen k per construct, plus topicsTest() correlating each topic’s prevalence with the matched observed scale score. The test result powers the inline plot below — top topics most strongly associated with high vs. low scale scores.

library(topics)

h4_k_lookup <- setNames(h4_k_chosen$k_chosen, h4_k_chosen$outcome)

h4_lda <- map(seq_len(nrow(h4_specs)), function(i) {
  pv <- h4_specs$emb_var[i]; oc <- h4_specs$outcome[i]
  ok <- prompt_available[[pv]]
  k  <- h4_k_lookup[[oc]]
  texts <- clean[[pv]][ok]

  dtm <- h4_dtm[[pv]]
  mod <- topicsModel(dtm = dtm, num_topics = k, num_iterations = 1500, seed = 2020)

  data_for_test <- tibble(text = texts)
  data_for_test[[oc]] <- scored[[oc]][ok]

  test <- tryCatch(
    topicsTest(
      data            = data_for_test,
      model           = mod,
      x_variable      = oc,
      test_method     = "default",
      p_adjust_method = "fdr",
      seed            = 2020
    ),
    error = function(e) {
      message("topicsTest failed for ", oc, ": ", conditionMessage(e))
      NULL
    }
  )

  list(prompt = pv, outcome = oc, k = k, dtm = dtm, model = mod, test = test,
       topics_pkg_version = packageVersion("topics"))
})

lda_dir <- here::here("Data", "topics", "h4")
dir.create(lda_dir, showWarnings = FALSE, recursive = TRUE)
walk(h4_lda, ~ saveRDS(.x, file.path(lda_dir,
                                     paste0(.x$outcome, "_k", .x$k, ".rds"))))

Inline visuals — for each PWB construct, render the supervised word-projection plot (low ↔ high observed score) and the topic-prevalence × scale association plot. Read across them to label what each topic / each projection pole is “about”, then record those labels for the rater packet.

library(text)
library(topics)
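
# Plot objects returned by text/topics vary across package versions (a bare
# ggplot, a list of ggplots, or a list carrying a $final_plot element), so
# recurse and print whatever ggplot-like object turns up.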

print_any_plot <- function(obj) {
  if (inherits(obj, c("ggplot", "gg", "patchwork"))) {
    print(obj)
  } else if (is.list(obj)) {
    if ("final_plot" %in% names(obj)) {
      print_any_plot(obj$final_plot)
    } else {
      for (el in obj) print_any_plot(el)
    }
  }
  invisible(NULL)
}

extract_test_table <- function(test_obj) {
  if (is.null(test_obj)) return(NULL)
  if (is.data.frame(test_obj)) return(tibble::as_tibble(test_obj))
  if (is.list(test_obj)) {
    for (nm in c("test_results", "results", "test", "topic_results")) {
      if (!is.null(test_obj[[nm]]) && is.data.frame(test_obj[[nm]])) {
        return(tibble::as_tibble(test_obj[[nm]]))
      }
    }
    df_el <- purrr::detect(test_obj, is.data.frame)
    if (!is.null(df_el)) return(tibble::as_tibble(df_el))
  }
  NULL
}

for (i in seq_len(nrow(h4_specs))) {
  oc <- h4_specs$outcome[i]
  pv <- h4_specs$emb_var[i]
  k_used <- h4_lda[[i]]$k

  cat("\n\n### ", oc, "  ·  prompt: `", pv, "`  ·  k = ", k_used, "\n\n", sep = "")

  cat("\n**Word-level projection — low ↔ high observed `", oc, "`**\n\n", sep = "")
  proj <- h4_projection[[i]]$projection
  pp <- tryCatch(
    textProjectionPlot(
      word_data             = proj,
      k_n_words_to_test     = FALSE,
      min_freq_words_test   = 2,
      plot_n_word_extreme   = 10,
      plot_n_word_frequency = 5,
      plot_n_words_middle   = 5,
      plot_n_words_p        = 5,
      title_top             = paste0("Projection: ", oc, " (high pole right, low pole left)")
    ),
    error = function(e) {
      cat("\n*textProjectionPlot failed:* ", conditionMessage(e), "\n\n", sep = "")
      NULL
    }
  )
  print_any_plot(pp)

  cat("\n\n**Top 10 topics most associated with `", oc, "` (linear regression on topic prevalence, FDR-corrected)**\n\n", sep = "")
  td <- extract_test_table(h4_lda[[i]]$test)
  if (is.null(td)) {
    cat("\n*Could not extract topicsTest result table for ", oc, ".*\n\n", sep = "")
  } else {
    estimate_col <- intersect(c("estimate","beta","coef","Estimate","b","z"),     names(td))[1]
    p_col        <- intersect(c("p.adjusted","p.adj","p_value_adj","p","p_value"),names(td))[1]

    if (is.na(estimate_col)) {
      cat("\n*Effect-size column not found in topicsTest output. Columns: ",
          paste(names(td), collapse = ", "), "*\n\n", sep = "")
    } else {
      td_top <- td |>
        mutate(.estimate = .data[[estimate_col]]) |>
        arrange(desc(abs(.estimate))) |>
        head(10) |>
        select(-.estimate)
      print(
        gt(td_top) |>
          tab_header(title = paste0("Top topics — ", oc),
                     subtitle = paste0("Ranked by |", estimate_col, "|; ",
                                       if (!is.na(p_col)) paste0("p column: ", p_col) else "p column not detected")) |>
          fmt_number(columns = where(is.numeric), decimals = 3)
      )
    }
  }

  cat("\n\n**Topic plot — prevalence × scale association**\n\n")
  tp <- NULL
  for (test_arg in list(h4_lda[[i]]$test, td, NULL)) {
    tp <- tryCatch(
      topicsPlot(
        model           = h4_lda[[i]]$model,
        test            = test_arg,
        p_alpha         = 0.05,
        p_adjust_method = "fdr",
        ngrams_max      = 20,
        plot_n_most_prevalent_topics = 10,
        overview_plot   = FALSE
      ),
      error = function(e) NULL
    )
    if (!is.null(tp)) break
  }
  if (is.null(tp)) {
    cat("\n*topicsPlot failed for all test-arg variants; see the top-10 table above.*\n\n")
  }
  print_any_plot(tp)

  cat("\n\n---\n\n")
}

After viewing the projections and topics, fill in human-readable labels for each pole and each topic so the rater packet has interpretable anchors. The blank scaffold below writes to disk; expand columns if you keep more than 6 topics.

h4_labels <- tribble(
  ~outcome,               ~high_pole_label, ~low_pole_label,
  ~topic_1, ~topic_2, ~topic_3, ~topic_4, ~topic_5, ~topic_6,
  "Ryff_self_acceptance",  "", "", "", "", "", "", "", "",
  "Ryff_autonomy",         "", "", "", "", "", "", "", "",
  "Ryff_env_mastery",      "", "", "", "", "", "", "", "",
  "Ryff_pos_relations",    "", "", "", "", "", "", "", "",
  "Ryff_purpose_in_life",  "", "", "", "", "", "", "", "",
  "Ryff_personal_growth",  "", "", "", "", "", "", "", ""
)
readr::write_csv(h4_labels, here::here("Data", "results", "h4_labels.csv"))

22 H5 — Discriminant validity (correlation matrix + mixed-effects errors)

Per prereg: a 6×6 predicted-by-observed correlation matrix with both observed and disattenuated correlations, plus descriptive pairwise comparisons of each matched subscale against its five non-matched counterparts.

Layered on top: a participant-level mixed-effects model on z-scored squared prediction errors, with a random participant intercept (plus a random predicted-dimension intercept in the overall model) and a fixed effect of is_target (1 if matched). emmeans then extracts the 30 individual target-vs-off-target pairwise contrasts (6 predicted dimensions × 5 off-target observed dimensions each).
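
A quick sanity check of the disattenuation arithmetic (toy numbers, not pipeline output): an observed r of .40 against a subscale with α = .70 corrects to about .48.

r_obs   <- 0.40
alpha_o <- 0.70
r_obs / sqrt(alpha_o)   # ≈ 0.478; the correction rescales but never flips sign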

if (!exists("pwb_dims")) {
  pwb_dims <- c("self_acceptance", "autonomy", "env_mastery",
                "pos_relations", "purpose_in_life", "personal_growth")
}

h5_obs <- paste0("Ryff_", pwb_dims)
h5_pred <- paste0("pred_Ryff_", pwb_dims)

h5_panel <- h2_panel |> select(ResponseId, all_of(h5_obs), all_of(h5_pred))

h5_corr <- expand_grid(predicted = h5_pred, observed = h5_obs) |>
  pmap_dfr(function(predicted, observed) {
    d <- h5_panel |> select(p = all_of(predicted), o = all_of(observed)) |> drop_na()
    if (nrow(d) < 10) return(NULL)
    ct <- cor.test(d$p, d$o)
    tibble(predicted = predicted, observed = observed, n = nrow(d),
           r = unname(ct$estimate), lcl = ct$conf.int[1], ucl = ct$conf.int[2])
  }) |>
  mutate(
    pred_dim = str_remove(predicted, "pred_Ryff_"),
    obs_dim  = str_remove(observed, "Ryff_"),
    is_target = pred_dim == obs_dim
  )

h5_corr_matrix <- h5_corr |>
  select(pred_dim, obs_dim, r) |>
  pivot_wider(names_from = obs_dim, values_from = r)

h5_corr_matrix |>
  gt() |>
  tab_header(title = "H5 — predicted × observed PWB correlation matrix",
             subtitle = "Diagonal = matched (target); off-diagonal = non-matched")

h5_corr_disattenuated <- h5_corr |>
  left_join(alpha_lookup |> rename(observed = scale_subscale,
                                   alpha_o = alpha,
                                   omega_total_o = omega_total),
            by = "observed") |>
  mutate(
    reliability_primary = "Cronbach alpha",
    r_disattenuated_alpha = r / sqrt(pmax(alpha_o, .Machine$double.eps)),
    r_disattenuated_omega = r / sqrt(pmax(omega_total_o, .Machine$double.eps))
  )

h5_corr_disattenuated |>
  select(pred_dim, obs_dim, n, r, alpha_o, omega_total_o,
         r_disattenuated_alpha, r_disattenuated_omega,
         reliability_primary, is_target) |>
  gt(groupname_col = "pred_dim") |>
  tab_header(title = "H5 — observed and disattenuated correlations",
             subtitle = "Primary disattenuated estimate uses Cronbach's alpha; omega total is sensitivity.") |>
  fmt_number(columns = c(r, alpha_o, omega_total_o,
                         r_disattenuated_alpha, r_disattenuated_omega), decimals = 3)

readr::write_csv(h5_corr_disattenuated, here::here("Data", "results", "h5_corr_disattenuated.csv"))

# Standalone guards so this chunk can run without re-executing the chunks above.
if (!exists("pwb_dims")) {
  pwb_dims <- c("self_acceptance", "autonomy", "env_mastery",
                "pos_relations", "purpose_in_life", "personal_growth")
}

h5_obs <- paste0("Ryff_", pwb_dims)
h5_pred <- paste0("pred_Ryff_", pwb_dims)

if (!exists("h1_pooled")) {
  pooled_path <- here::here("Data", "results", "predictions_pooled.csv")
  cv_path <- here::here("Data", "results", "predictions_cv_train.csv")
  prospective_path <- here::here("Data", "results", "predictions_prospective.csv")

  if (file.exists(pooled_path)) {
    h1_pooled <- readr::read_csv(pooled_path, show_col_types = FALSE)
  } else {
    prediction_pieces <- list()

    if (exists("h1_cv_predictions")) {
      prediction_pieces <- append(prediction_pieces, list(h1_cv_predictions))
    } else if (file.exists(cv_path)) {
      prediction_pieces <- append(
        prediction_pieces,
        list(readr::read_csv(cv_path, show_col_types = FALSE))
      )
    }

    if (exists("h1_prospective_predictions")) {
      prediction_pieces <- append(prediction_pieces, list(h1_prospective_predictions))
    } else if (file.exists(prospective_path)) {
      prediction_pieces <- append(
        prediction_pieces,
        list(readr::read_csv(prospective_path, show_col_types = FALSE))
      )
    }

    if (length(prediction_pieces) == 0) {
      stop("No H1 prediction panel found. Run H1 prediction chunks or create Data/results/predictions_cv_train.csv.")
    }

    h1_pooled <- bind_rows(prediction_pieces) |>
      distinct(ResponseId, .keep_all = TRUE)
  }

  if ("pred_Ryff_purpose_in_life" %in% names(h1_pooled)) {
    h1_pooled <- h1_pooled |> filter(!is.na(pred_Ryff_purpose_in_life))
  }
}

if (!exists("h2_panel")) {
  h2_panel <- scored |>
    select(ResponseId, starts_with("Ryff_"), SWLS_swls, HILS_hils,
           PANAS_positive_affect, PANAS_negative_affect,
           PHQ9_phq9, GAD7_gad7, starts_with("BFI10_"), Agency_agency) |>
    inner_join(h1_pooled, by = "ResponseId")
}

if (!exists("h5_panel")) {
  h5_panel <- h2_panel |> select(ResponseId, all_of(h5_obs), all_of(h5_pred))
}
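
# Long panel: one row per participant × predicted × observed dimension, with
# predictions and observations each z-scored within their own dimension so
# squared errors are comparable across all 36 cells.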

h5_long <- h5_panel |>
  pivot_longer(all_of(h5_pred), names_to = "predicted", values_to = "pred_score") |>
  mutate(pred_dim = str_remove(predicted, "pred_Ryff_")) |>
  select(-predicted) |>
  pivot_longer(all_of(h5_obs), names_to = "observed", values_to = "obs_score") |>
  mutate(obs_dim = str_remove(observed, "Ryff_")) |>
  select(-observed) |>
  group_by(pred_dim) |>
  mutate(z_pred = as.numeric(scale(pred_score))) |>
  group_by(obs_dim) |>
  mutate(z_obs = as.numeric(scale(obs_score))) |>
  ungroup() |>
  mutate(
    sq_err = (z_pred - z_obs)^2,
    is_target = factor(if_else(pred_dim == obs_dim, "target", "off_target"),
                       levels = c("off_target", "target")),
    pred_dim = factor(pred_dim, levels = pwb_dims),
    obs_dim  = factor(obs_dim,  levels = pwb_dims)
  ) |>
  drop_na(sq_err)

h5_lmer_overall <- lmer(
  sq_err ~ is_target + (1 | ResponseId) + (1 | pred_dim),
  data = h5_long
)
summary(h5_lmer_overall)

h5_lmer_pairwise <- lmer(
  sq_err ~ pred_dim * obs_dim + (1 | ResponseId),
  data = h5_long
)

# The reference level must be the matched obs_dim, which differs for each
# pred_dim, so build the target-vs-off-target contrasts one predicted
# dimension at a time (a single ref vector would average across levels, and
# no contrast label actually contains the string "ref").
h5_emm_pairwise <- map_dfr(seq_along(pwb_dims), function(j) {
  emmeans(h5_lmer_pairwise, ~ obs_dim | pred_dim,
          at = list(pred_dim = pwb_dims[j])) |>
    contrast(method = "trt.vs.ctrl", ref = j) |>
    as_tibble()
})

h5_emm_pairwise |>
  gt(groupname_col = "pred_dim") |>
  tab_header(title = "H5 — 30 pairwise contrasts (matched obs_dim as reference, grouped by predicted dim)") |>
  fmt_number(columns = any_of(c("estimate", "SE", "df", "t.ratio", "z.ratio", "p.value")),
             decimals = 3)

readr::write_csv(h5_emm_pairwise, here::here("Data", "results", "h5_pairwise_contrasts.csv"))

h5_per_construct_paired <- h5_long |>
  group_by(ResponseId, pred_dim, is_target) |>
  summarise(mean_sq_err = mean(sq_err), .groups = "drop") |>
  pivot_wider(names_from = is_target, values_from = mean_sq_err) |>
  drop_na(target, off_target)

h5_per_construct_t <- h5_per_construct_paired |>
  group_by(pred_dim) |>
  summarise(
    n          = n(),
    mean_target = mean(target),
    mean_off    = mean(off_target),
    diff        = mean(off_target - target),
    t_stat      = t.test(off_target, target, paired = TRUE, alternative = "greater")$statistic,
    p_one_sided = t.test(off_target, target, paired = TRUE, alternative = "greater")$p.value
  ) |>
  mutate(p_fdr = p.adjust(p_one_sided, method = "BH"))

h5_per_construct_t |>
  gt() |>
  tab_header(title = "H5 — paired t-tests per construct (target vs. mean of 5 off-target errors)") |>
  fmt_number(columns = c(mean_target, mean_off, diff, t_stat, p_one_sided, p_fdr), decimals = 3)

readr::write_csv(h5_per_construct_t, here::here("Data", "results", "h5_per_construct_paired.csv"))

23 Order effects (counterbalancing condition)

Per prereg “Other planned analyses”: compares scale means between text_first and scales_first conditions via Welch t-tests, FDR-corrected across scales.

condition_panel <- scored |>
  left_join(clean |> select(ResponseId, condition), by = "ResponseId") |>
  mutate(
    text_first = case_when(
      condition == "scales_last" ~ "text_first",
      condition == "text_last"   ~ "scales_first",
      TRUE ~ NA_character_
    )
  ) |>
  filter(!is.na(text_first))

scale_cols <- c(paste0("Ryff_", pwb_dims), "SWLS_swls", "HILS_hils",
                "PANAS_positive_affect", "PANAS_negative_affect",
                "PHQ9_phq9", "GAD7_gad7", paste0("EPS_", c("meaningfulness","commitment","beyond_self")))

order_effects <- map_dfr(scale_cols, function(s) {
  d <- condition_panel |> select(y = all_of(s), text_first) |> drop_na()
  if (length(unique(d$text_first)) < 2 || nrow(d) < 10) return(NULL)
  tt <- t.test(y ~ text_first, data = d, var.equal = FALSE)
  means <- d |> group_by(text_first) |> summarise(m = mean(y), s = sd(y), n = n(), .groups = "drop")
  tibble(scale = s, n_text_first = means$n[means$text_first == "text_first"],
         n_scales_first = means$n[means$text_first == "scales_first"],
         mean_text_first = means$m[means$text_first == "text_first"],
         mean_scales_first = means$m[means$text_first == "scales_first"],
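         # means rows sort by factor level ("scales_first" then "text_first"),
         # so diff() below is text_first mean minus scales_first mean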
         diff = unname(diff(means$m)),
         t_stat = unname(tt$statistic), p_raw = tt$p.value)
}) |>
  mutate(p_fdr = p.adjust(p_raw, method = "BH"))

order_effects |>
  gt() |>
  tab_header(title = "Order effects: scale means by counterbalancing condition",
             subtitle = "Welch's t-tests, FDR-corrected across scales") |>
  fmt_number(columns = c(mean_text_first, mean_scales_first, diff, t_stat, p_raw, p_fdr), decimals = 3)

readr::write_csv(order_effects, here::here("Data", "results", "order_effects.csv"))

24 Interest #1 — Purpose-language comparison (PIL vs. EPS-BTS)

The dissertation’s central question for this chapter: do the language profiles of sense of purpose (Ryff PIL) and beyond-the-self purpose (Engaged Purpose Scale BTS subscale) differ, and do those differences predict different downstream wellbeing outcomes?

Four sub-analyses:

  1. Top-feature comparison via textProjection() — which words/phrases load on each construct.
  2. Cross-prediction — does the PIL model predict EPS-BTS scores (and vice versa)? If they’re capturing the same construct, cross-r ≈ within-r; if different, cross-r is meaningfully smaller (see the sketch after this list).
  3. Differential effects regression — predicted PIL and predicted EPS-BTS scores as predictors of the OTHER PWB outcomes (SA, AU, EM, PR, PG). Do they predict different downstream outcomes?
  4. Direct embedding prediction of wellbeing — train the same wellbeing outcome from PIL-text embeddings vs. BTS-goal embeddings and compare out-of-fold prediction accuracy.
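
A minimal sketch of the cross- vs. within-r logic in sub-analysis 2, on hypothetical correlations (the real values come from the cross_pred matrix below):

within_r <- 0.50  # hypothetical: PIL model scoring observed PIL
cross_r  <- 0.45  # hypothetical: PIL model scoring observed EPS-BTS
cross_r / within_r  # near 1 suggests one construct; well below 1, distinct signal
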
library(text)

pil_projection_idx <- complete_prompt_score_idx("purposeinlife_text", "Ryff_purpose_in_life")
bts_projection_idx <- complete_prompt_score_idx("purpose_goal", "EPS_beyond_self")

proj_pil <- textProjection(
  words      = clean$purposeinlife_text[pil_projection_idx],
  word_embeddings   = text_embed$purposeinlife_text[pil_projection_idx, , drop = FALSE],
  word_types_embeddings = word_type_embed$purposeinlife_text,
  x = scored$Ryff_purpose_in_life[pil_projection_idx],
  y = NULL,
  pca = NULL,
  aggregation = "mean",
  split = "quartile",
  word_weight_power = 1,
  min_freq_words_test = 2
)

proj_bts <- textProjection(
  words      = clean$purpose_goal[bts_projection_idx],
  word_embeddings   = text_embed$purpose_goal[bts_projection_idx, , drop = FALSE],
  word_types_embeddings = word_type_embed$purpose_goal,
  x = scored$EPS_beyond_self[bts_projection_idx],
  y = NULL,
  pca = NULL,
  aggregation = "mean",
  split = "quartile",
  word_weight_power = 1,
  min_freq_words_test = 2
)

saveRDS(proj_pil, here::here("Data", "results", "projection_pil.rds"))
saveRDS(proj_bts, here::here("Data", "results", "projection_bts.rds"))

library(text)

semp_specs <- bind_rows(
  h1_specs,
  tribble(
    ~emb_var,        ~outcome,
    "purpose_goal",  "EPS_beyond_self"
  )
)

semp_train_one <- function(emb_var, outcome) {
  idx <- which(
    prompt_available[[emb_var]] &
      !is.na(scored[[outcome]]) &
      complete.cases(text_embed[[emb_var]])
  )
  if (length(idx) < 10) stop("Too few complete rows for ", emb_var, " -> ", outcome)

  fit <- textTrainRegression(
    x = text_embed[[emb_var]][idx, , drop = FALSE],
    y = scored[[outcome]][idx],
    penalty = penalty_grid,
    multi_cores = TRUE,
    seed = 2020
  )

  list(
    emb_var = emb_var,
    outcome = outcome,
    model_idx = idx,
    fit = fit,
    lambda_chosen = fit$final_model$fit$fit$spec$args$penalty,
    text_pkg_version = packageVersion("text"),
    built_on = Sys.time()
  )
}

semp_models <- pmap(semp_specs, ~ semp_train_one(..1, ..2))

semp_dir <- here::here("Data", "models", "semp")
dir.create(semp_dir, showWarnings = FALSE, recursive = TRUE)
walk(semp_models, ~ saveRDS(
  .x,
  file.path(semp_dir, paste0(.x$emb_var, "__", .x$outcome, ".rds"))
))
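
# Helpers shared with the H1 prediction chunks: recover the id-variable and
# predictor columns a fitted parsnip workflow expects, so a frozen model can
# score new rows.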
if (!exists("model_id_vars")) {
  model_id_vars <- function(final_model) {
    blueprint <- final_model$pre$mold$blueprint
    id_ptype <- tryCatch(
      blueprint$extra_role_ptypes[["id variable"]],
      error = function(e) NULL
    )
    id_vars <- if (is.null(id_ptype)) character() else names(id_ptype)

    recipe_cols <- tryCatch(names(blueprint$recipe$ptype), error = function(e) character())
    predictor_cols <- tryCatch(names(blueprint$ptypes$predictors), error = function(e) character())
    id_vars <- union(id_vars, intersect(setdiff(recipe_cols, c(predictor_cols, "y")), "id_nr"))

    id_vars
  }
}

if (!exists("embedding_new_data")) {
  embedding_new_data <- function(final_model, emb_var, idx) {
    predictor_cols <- tryCatch(
      names(final_model$pre$mold$blueprint$ptypes$predictors),
      error = function(e) names(text_embed[[emb_var]])
    )
    new_x <- as_tibble(text_embed[[emb_var]][idx, predictor_cols, drop = FALSE])
    for (id_var in model_id_vars(final_model)) {
      if (!id_var %in% names(new_x)) new_x[[id_var]] <- seq_along(idx)
    }
    new_x
  }
}

predict_sem_model <- function(model_path, emb_var) {
  obj <- readRDS(model_path)
  out <- rep(NA_real_, nrow(scored))
  idx <- which(prompt_available[[emb_var]] & complete.cases(text_embed[[emb_var]]))
  out[idx] <- predict(
    obj$fit$final_model,
    new_data = embedding_new_data(obj$fit$final_model, emb_var, idx)
  )$.pred
  out
}

pil_path <- here::here("Data", "models", "semp", "purposeinlife_text__Ryff_purpose_in_life.rds")
bts_path <- here::here("Data", "models", "semp", "purpose_goal__EPS_beyond_self.rds")

pred_pil_on_pil_text <- predict_sem_model(pil_path, "purposeinlife_text")
pred_bts_on_pil_text <- predict_sem_model(bts_path, "purposeinlife_text")
pred_pil_on_bts_text <- predict_sem_model(pil_path, "purpose_goal")
pred_bts_on_bts_text <- predict_sem_model(bts_path, "purpose_goal")

cross_pred <- tibble(
  ResponseId  = scored$ResponseId,
  obs_pil     = scored$Ryff_purpose_in_life,
  obs_bts     = scored$EPS_beyond_self,
  pred_pil_on_pil_text = pred_pil_on_pil_text,
  pred_bts_on_pil_text = pred_bts_on_pil_text,
  pred_pil_on_bts_text = pred_pil_on_bts_text,
  pred_bts_on_bts_text = pred_bts_on_bts_text
)

cor_matrix <- cross_pred |>
  select(-ResponseId) |>
  cor(use = "pairwise.complete.obs")

cor_matrix |> as_tibble(rownames = "var") |> gt() |>
  tab_header(title = "Cross-prediction matrix (PIL vs. EPS-BTS)")

readr::write_csv(cross_pred, here::here("Data", "results", "purpose_cross_pred.csv"))

other_outcomes <- c("Ryff_self_acceptance", "Ryff_autonomy", "Ryff_env_mastery",
                    "Ryff_pos_relations", "Ryff_personal_growth",
                    "SWLS_swls", "HILS_hils",
                    "PANAS_positive_affect", "PANAS_negative_affect",
                    "PHQ9_phq9", "GAD7_gad7")

diff_eff <- map_dfr(other_outcomes, function(y) {
  d <- scored |>
    transmute(y = .data[[y]], pil = cross_pred$pred_pil_on_pil_text,
              bts = cross_pred$pred_bts_on_bts_text) |>
    drop_na()
  if (nrow(d) < 10) return(NULL)
  m <- lm(y ~ pil + bts, data = d)
  broom::tidy(m, conf.int = TRUE) |>
    filter(term %in% c("pil", "bts")) |>
    mutate(outcome = y, .before = 1)
})

diff_eff |> gt() |>
  tab_header(title = "Differential effects of predicted PIL vs. predicted EPS-BTS",
             subtitle = "Standardized regression of each outcome on both predicted scores") |>
  fmt_number(columns = c(estimate, std.error, statistic, p.value, conf.low, conf.high), decimals = 3)

diff_eff |>
  ggplot(aes(estimate, fct_rev(outcome), color = term)) +
  geom_point(position = position_dodge(width = 0.5), size = 2.5) +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high),
                 position = position_dodge(width = 0.5), height = 0) +
  geom_vline(xintercept = 0, linetype = 2, color = "grey60") +
  scale_color_manual(values = c(pil = brand_blue, bts = brand_red),
                     labels = c(pil = "Predicted PIL (sense of purpose)",
                                bts = "Predicted EPS-BTS (beyond-the-self)")) +
  labs(title = "Differential effects on downstream wellbeing outcomes",
       x = "Coefficient (95% CI)", y = NULL, color = NULL) +
  theme(legend.position = "bottom")

library(text)

if (!exists("penalty_grid")) penalty_grid <- 10^seq(-6, 6)
if (!exists("brand_blue")) brand_blue <- "#013d5aff"
if (!exists("brand_red")) brand_red <- "#cd5d5bff"

if (!exists("embedding_matrix")) {
  embedding_matrix <- function(emb) {
    emb$texts$texts |>
      select(-matches("^id(_|$)"), -matches("^id_"))
  }
}

purpose_embedding_vars <- c("purposeinlife_text", "purpose_goal")
if (!exists("text_embed")) text_embed <- list()
if (!all(purpose_embedding_vars %in% names(text_embed))) {
  embeddings_dir <- here::here("Data", "embeddings")
  missing_embedding_vars <- setdiff(purpose_embedding_vars, names(text_embed))
  text_embed[missing_embedding_vars] <- map(
    missing_embedding_vars,
    ~ readRDS(file.path(embeddings_dir, paste0(.x, ".rds"))) |> embedding_matrix()
  )
}

if (!exists("prompt_available")) {
  prompt_available <- clean |>
    transmute(across(any_of(purpose_embedding_vars),
                     ~ !is.na(.x) & nzchar(trimws(.x))))
}

if (!exists("pwb_dims")) {
  pwb_dims <- c("self_acceptance", "autonomy", "env_mastery",
                "pos_relations", "purpose_in_life", "personal_growth")
}

purpose_embedding_sources <- tribble(
  ~source,             ~emb_var,              ~label,
  "sense_of_purpose",  "purposeinlife_text",  "Sense of purpose text",
  "beyond_self",       "purpose_goal",        "Beyond-the-self goal text"
)

wellbeing_outcome_candidates <- c(
  paste0("Ryff_", pwb_dims),
  "SWLS_swls", "HILS_hils",
  "PANAS_positive_affect", "PANAS_negative_affect",
  "PHQ9_phq9", "GAD7_gad7",
  "EPS_meaningfulness", "EPS_commitment", "EPS_beyond_self",
  "Agency_agency", "Responsibility_responsibility"
)
wellbeing_outcome_candidates <- wellbeing_outcome_candidates[wellbeing_outcome_candidates %in% names(scored)]

purpose_embedding_outcome_availability <- tibble(outcome = wellbeing_outcome_candidates) |>
  mutate(
    n_complete = map_int(outcome, ~ sum(!is.na(scored[[.x]]))),
    included = n_complete >= 10
  )

readr::write_csv(
  purpose_embedding_outcome_availability,
  here::here("Data", "results", "purpose_embedding_wellbeing_outcome_availability.csv")
)

wellbeing_outcomes <- purpose_embedding_outcome_availability |>
  filter(included) |>
  pull(outcome)

purpose_embedding_outcome_availability |>
  gt() |>
  tab_header(
    title = "Purpose Embedding Outcome Availability",
    subtitle = "Outcomes with fewer than 10 complete cases are listed but not modeled."
  )

purpose_embedding_model_dir <- here::here("Data", "models", "purpose_embedding_wellbeing")
dir.create(purpose_embedding_model_dir, showWarnings = FALSE, recursive = TRUE)
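
# textTrainRegression stores the tuned penalty in the parsnip model spec;
# some text-package versions surface it only in the model_description string,
# so extract_text_lambda() falls back to parsing that line.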

extract_text_lambda <- function(fit) {
  lambda <- tryCatch(fit$final_model$fit$fit$spec$args$penalty, error = function(e) NA_real_)
  if (is.numeric(lambda) && length(lambda) == 1) return(lambda)

  lambda_line <- tryCatch(
    str_subset(fit$model_description, "^penalty in final model ="),
    error = function(e) character()
  )
  if (length(lambda_line) == 1) return(readr::parse_number(lambda_line))

  NA_real_
}

get_cv_predictions <- function(fit) {
  pred_tbl <- fit$predictions
  if (!is.data.frame(pred_tbl) || !all(c("predictions", "y") %in% names(pred_tbl))) {
    stop("Could not find out-of-fold predictions at fit$predictions.")
  }
  pred_tbl |>
    transmute(
      id_nr = if ("id_nr" %in% names(pred_tbl)) as.integer(id_nr) else row_number(),
      predictions = as.numeric(.data$predictions),
      y = as.numeric(.data$y)
    )
}

fit_embedding_outcome <- function(source, emb_var, outcome) {
  idx <- which(
    prompt_available[[emb_var]] &
      !is.na(scored[[outcome]]) &
      complete.cases(text_embed[[emb_var]])
  )
  if (length(idx) < 10) return(NULL)

  model_path <- file.path(
    purpose_embedding_model_dir,
    paste0(source, "__", outcome, ".rds")
  )

  if (file.exists(model_path)) {
    obj <- readRDS(model_path)
  } else {
    fit <- textTrainRegression(
      x = text_embed[[emb_var]][idx, , drop = FALSE],
      y = scored[[outcome]][idx],
      penalty = penalty_grid,
      multi_cores = TRUE,
      seed = 2020
    )
    obj <- list(source = source, emb_var = emb_var, outcome = outcome,
                model_idx = idx, fit = fit,
                text_pkg_version = packageVersion("text"),
                built_on = Sys.time())
    saveRDS(obj, model_path)
  }

  pred_tbl <- get_cv_predictions(obj$fit)
  pred_tbl <- pred_tbl |>
    mutate(ResponseId = scored$ResponseId[obj$model_idx[id_nr]])

  ok <- complete.cases(pred_tbl$predictions, pred_tbl$y)
  ct <- cor.test(pred_tbl$predictions[ok], pred_tbl$y[ok], alternative = "two.sided")

  list(
    result = tibble(
      source = source,
      emb_var = emb_var,
      outcome = outcome,
      n = sum(ok),
      cv_r = unname(ct$estimate),
      lcl = ct$conf.int[1],
      ucl = ct$conf.int[2],
      p_raw = ct$p.value,
      lambda = extract_text_lambda(obj$fit)
    ),
    predictions = pred_tbl |>
      transmute(source, emb_var, outcome, ResponseId,
                pred = predictions, truth = y)
  )
}

purpose_embedding_fits <- pmap(
  crossing(purpose_embedding_sources, outcome = wellbeing_outcomes),
  \(source, emb_var, label, outcome) fit_embedding_outcome(source, emb_var, outcome)
) |>
  compact()

purpose_embedding_cv <- map_dfr(purpose_embedding_fits, "result") |>
  left_join(purpose_embedding_sources |> select(source, label), by = "source") |>
  group_by(outcome) |>
  mutate(
    p_fdr_within_outcome = p.adjust(p_raw, method = "BH"),
    winner_cv_r = source[which.max(cv_r)]
  ) |>
  ungroup()

purpose_embedding_predictions <- map_dfr(purpose_embedding_fits, "predictions")

paired_error_test <- function(err_pil, err_bts) {
  out <- tryCatch(
    t.test(err_pil, err_bts, paired = TRUE),
    error = function(e) NULL
  )
  if (is.null(out)) {
    tibble(t_stat = NA_real_, p_raw = NA_real_)
  } else {
    tibble(t_stat = unname(out$statistic), p_raw = out$p.value)
  }
}

purpose_embedding_cv_wide <- purpose_embedding_cv |>
  select(outcome, source, cv_r) |>
  pivot_wider(names_from = source, values_from = cv_r) |>
  mutate(delta_bts_minus_pil = beyond_self - sense_of_purpose)

purpose_embedding_error_compare <- purpose_embedding_predictions |>
  select(source, outcome, ResponseId, pred, truth) |>
  group_by(outcome) |>
  mutate(
    truth_z = as.numeric(scale(truth)),
    pred_z = as.numeric(scale(pred))
  ) |>
  ungroup() |>
  mutate(sq_err = (pred_z - truth_z)^2) |>
  select(source, outcome, ResponseId, sq_err) |>
  pivot_wider(names_from = source, values_from = sq_err, names_prefix = "err_") |>
  drop_na(err_sense_of_purpose, err_beyond_self) |>
  group_by(outcome) |>
  summarise(
    n = n(),
    mean_err_pil = mean(err_sense_of_purpose),
    mean_err_bts = mean(err_beyond_self),
    diff_err_pil_minus_bts = mean(err_sense_of_purpose - err_beyond_self),
    paired_error_test(err_sense_of_purpose, err_beyond_self),
    .groups = "drop"
  ) |>
  mutate(
    p_fdr = p.adjust(p_raw, method = "BH"),
    lower_error_winner = if_else(diff_err_pil_minus_bts > 0,
                                 "beyond_self", "sense_of_purpose")
  ) |>
  left_join(purpose_embedding_cv_wide, by = "outcome")

scale_to_wellbeing <- map_dfr(wellbeing_outcomes, function(outcome) {
  map_dfr(c(obs_pil = "Ryff_purpose_in_life", obs_bts = "EPS_beyond_self"), function(predictor) {
    d <- scored |> select(x = all_of(predictor), y = all_of(outcome)) |> drop_na()
    if (nrow(d) < 10) return(NULL)
    ct <- cor.test(d$x, d$y)
    tibble(outcome = outcome, scale_predictor = predictor, n = nrow(d),
           r = unname(ct$estimate), lcl = ct$conf.int[1], ucl = ct$conf.int[2],
           p_raw = ct$p.value)
  }, .id = "scale_source")
})

readr::write_csv(purpose_embedding_cv,
                 here::here("Data", "results", "purpose_embedding_wellbeing_cv.csv"))
readr::write_csv(purpose_embedding_predictions,
                 here::here("Data", "results", "purpose_embedding_wellbeing_predictions.csv"))
readr::write_csv(purpose_embedding_error_compare,
                 here::here("Data", "results", "purpose_embedding_wellbeing_error_compare.csv"))
readr::write_csv(scale_to_wellbeing,
                 here::here("Data", "results", "purpose_scale_to_wellbeing_correlations.csv"))

purpose_embedding_cv |>
  select(label, outcome, n, cv_r, lcl, ucl, p_raw, p_fdr_within_outcome, lambda, winner_cv_r) |>
  arrange(outcome, desc(cv_r)) |>
  gt(groupname_col = "outcome") |>
  tab_header(
    title = "Purpose Embeddings Predicting Wellbeing Outcomes",
    subtitle = "Same outcome predicted separately from sense-of-purpose text and beyond-the-self goal text."
  ) |>
  fmt_number(columns = c(cv_r, lcl, ucl, p_raw, p_fdr_within_outcome, lambda), decimals = 3)

purpose_embedding_error_compare |>
  arrange(desc(diff_err_pil_minus_bts)) |>
  gt() |>
  tab_header(
    title = "Paired Error Comparison: BTS Embeddings vs. PIL Embeddings",
    subtitle = "Positive error difference means BTS embeddings had lower standardized squared error."
  ) |>
  fmt_number(columns = c(mean_err_pil, mean_err_bts, diff_err_pil_minus_bts,
                         t_stat, p_raw, p_fdr, sense_of_purpose, beyond_self,
                         delta_bts_minus_pil),
             decimals = 3)

purpose_embedding_cv |>
  mutate(outcome = fct_reorder(outcome, cv_r, .fun = max, .desc = FALSE)) |>
  ggplot(aes(cv_r, outcome, color = label)) +
  geom_vline(xintercept = 0, linetype = 2, color = "grey65") +
  geom_point(position = position_dodge(width = 0.55), size = 2.6) +
  geom_errorbarh(aes(xmin = lcl, xmax = ucl),
                 position = position_dodge(width = 0.55), height = 0) +
  scale_color_manual(values = c("Sense of purpose text" = brand_blue,
                                "Beyond-the-self goal text" = brand_red)) +
  labs(title = "Out-of-fold prediction of wellbeing from purpose-language embeddings",
       subtitle = "Higher CV correlation means the embedding source carries more outcome-relevant signal.",
       x = "CV correlation between predicted and observed outcome",
       y = NULL, color = NULL) +
  theme_minimal(base_size = 11) +
  theme(legend.position = "bottom")

purpose_embedding_error_compare |>
  mutate(outcome = fct_reorder(outcome, delta_bts_minus_pil)) |>
  ggplot(aes(delta_bts_minus_pil, outcome)) +
  geom_vline(xintercept = 0, linetype = 2, color = "grey65") +
  geom_col(aes(fill = delta_bts_minus_pil > 0), width = 0.7) +
  scale_fill_manual(values = c(`TRUE` = brand_red, `FALSE` = brand_blue),
                    labels = c(`TRUE` = "BTS higher CV r", `FALSE` = "PIL higher CV r")) +
  labs(title = "Which purpose embedding source better predicts each wellbeing outcome?",
       subtitle = "Positive values favor beyond-the-self goal text; negative values favor sense-of-purpose text.",
       x = "CV r difference: BTS embeddings minus PIL embeddings",
       y = NULL, fill = NULL) +
  theme_minimal(base_size = 11) +
  theme(legend.position = "bottom")

25 Interest #2 — L-BAM build for TRWOC transfer

Train a ridge model with the TRWOC-mirrored prompts (college_goals_text, transformative_text) as input and EPS-BTS as outcome. The result is the L-BAM that Chapter 3 applies to actual TRWOC interview transcripts.

The linguistic-register-matched prompt is the bridge: participants answer the same questions TRWOC interviewers asked, with their EPS-BTS scores attached. The model learns what beyond-the-self purpose “sounds like” in that register.

library(text)

lbam_text <- paste(
  replace_na(clean$college_goals_text, ""),
  replace_na(clean$transformative_text, ""),
  sep = " || "
)
lbam_text_missing <- !nzchar(str_remove_all(lbam_text, "[\\s|]"))
lbam_text_for_embedding <- lbam_text
lbam_text_for_embedding[lbam_text_missing] <- "missing response"

lbam_emb <- textEmbed(
  texts = tibble(text = lbam_text_for_embedding),
  model = "mixedbread-ai/mxbai-embed-large-v1",
  layers = -2,
  aggregation_from_layers_to_tokens = "concatenate",
  aggregation_from_tokens_to_texts  = "mean",
  keep_token_embeddings = FALSE,
  batch_size = 16
)

lbam_x <- lbam_emb$texts[[1]] |>
  select(-starts_with("id_"))

ok <- !is.na(scored$EPS_beyond_self) &
  !lbam_text_missing &
  complete.cases(lbam_x)

lbam_fit <- textTrainRegression(
  x = lbam_x[ok, , drop = FALSE],
  y = scored$EPS_beyond_self[ok],
  penalty = penalty_grid,
  multi_cores = TRUE,
  seed = 2020
)

lbam_artifact <- list(
  name        = "EPS_BTS_TRWOC_register_v1",
  outcome     = "Engaged Purpose Scale — Beyond-the-Self subscale (3 items, mean of 1–5 Likert)",
  prompts     = c("Howard's college goals", "Howard's transformative experience"),
  text_combiner = "Pasted with ' || ' separator",
  embedder    = "mixedbread-ai/mxbai-embed-large-v1, layer -2, mean-pooled",
  text_pkg_version = packageVersion("text"),
  training_rows = which(ok),
  fit         = lbam_fit,
  built_on    = Sys.time(),
  n_train     = sum(ok),
  cv_r        = lbam_fit$results$estimate
)

lbam_dir <- here::here("Data", "lbam")
dir.create(lbam_dir, showWarnings = FALSE, recursive = TRUE)
saveRDS(lbam_artifact, file.path(lbam_dir, "eps_bts_lbam.rds"))

lbam_artifact[c("name","outcome","prompts","embedder","n_train","cv_r")]

26 Topic modeling

LDA on the goal/why/doing prompts to surface interpretable topics. The topics package wraps the heavy lifting.

library(topics)

target_outcomes <- c("EPS_beyond_self", "Ryff_purpose_in_life")

purpose_corpus_df <- clean |>
  transmute(
    ResponseId,
    text = paste(replace_na(purpose_goal, ""),
                 replace_na(purpose_goal_reason, ""),
                 replace_na(purpose_goal_action, ""), sep = " ")
  ) |>
  filter(nchar(trimws(text)) > 0) |>
  left_join(
    scored |> select(ResponseId, all_of(target_outcomes)),
    by = "ResponseId"
  )
purpose_corpus <- purpose_corpus_df$text

cat("Corpus size:", length(purpose_corpus), "documents\n")
cat("Mean tokens / doc:", round(mean(lengths(strsplit(purpose_corpus, "\\s+"))), 1), "\n")

if (length(purpose_corpus) < 100) {
  dtm <- topicsDtm(
    data         = tibble(text = purpose_corpus),
    ngram_window = c(1, 1),
    shuffle      = FALSE,
    seed         = 2020
  )
} else {
  dtm <- topicsDtm(
    data               = tibble(text = purpose_corpus),
    ngram_window       = c(1, 1),
    removal_mode       = "frequency",
    removal_rate_least = 2,
    shuffle            = FALSE,
    seed               = 2020
  )
}

cat("\nDTM object structure:\n")
print(str(dtm, max.level = 1, give.attr = FALSE))
cat("Slot names:", paste(names(dtm), collapse = ", "), "\n\n")

extract_dtm_matrix <- function(dtm) {
  candidates <- c("train_dtm", "dtm", "train_data", "matrix", "dtm_train",
                  "trainset", "DTM", "x")
  for (nm in candidates) {
    if (!is.null(dtm[[nm]])) {
      m <- dtm[[nm]]
      if (inherits(m, c("dgCMatrix","dgTMatrix","Matrix","matrix",
                        "DocumentTermMatrix","simple_triplet_matrix",
                        "data.frame","tbl_df"))) {
        return(list(slot = nm, matrix = m))
      }
    }
  }
  for (nm in names(dtm)) {
    m <- dtm[[nm]]
    if (inherits(m, c("dgCMatrix","dgTMatrix","Matrix",
                      "DocumentTermMatrix","simple_triplet_matrix")) ||
        (is.matrix(m) && is.numeric(m)) ||
        (is.data.frame(m) && ncol(m) > 1)) {
      return(list(slot = nm, matrix = m))
    }
  }
  list(slot = NA, matrix = NULL)
}

mat_info <- extract_dtm_matrix(dtm)
if (is.null(mat_info$matrix)) {
  stop("Could not locate the DTM matrix in the topicsDtm() return. ",
       "Slot names returned: ", paste(names(dtm), collapse = ", "), ". ",
       "Run str(dtm, max.level = 2) interactively and update extract_dtm_matrix().")
}
dtm_mat <- mat_info$matrix
cat("DTM matrix found in slot: $", mat_info$slot, "\n", sep = "")
cat("DTM dimensions: ", paste(dim(dtm_mat), collapse = " x "), "\n", sep = "")
cat("Vocab size: ", ncol(dtm_mat), "\n", sep = "")

if (ncol(dtm_mat) < 20) {
  warning("DTM vocabulary is very small (<20 terms) — LDA results will be unstable. ",
          "Consider lowering removal_rate_least or relaxing stopword removal.")
}

topic_model <- tryCatch(
  topicsModel(
    dtm            = dtm,
    num_topics     = 8,
    num_iterations = 1000,
    seed           = 2020
  ),
  error = function(e) {
    message("topicsModel failed: ", conditionMessage(e))
    NULL
  }
)

if (is.null(topic_model)) {
  stop("topicsModel returned NULL. Common causes: (1) Java/MALLET not initialized — ",
       "run library(rJava); .jinit() and check Sys.getenv('JAVA_HOME'); ",
       "(2) DTM too sparse — current vocab = ", ncol(dtm_mat), " terms; ",
       "(3) num_topics > vocabulary. Re-run topicsDtm with removal_rate_least = 0 ",
       "and re-check.")
}

topic_model$topics_pkg_version <- packageVersion("topics")

theta <- topic_model$theta
if (is.null(theta)) stop("topic_model$theta is NULL — cannot compute topic-outcome correlations.")

theta_tbl <- tibble::as_tibble(theta)
candidate_topic_cols <- grep("^t_|^Topic|^topic", names(theta_tbl), value = TRUE)
if (length(candidate_topic_cols) == 0) {
  candidate_topic_cols <- names(theta_tbl)[vapply(theta_tbl, is.numeric, logical(1))]
}
stopifnot(length(candidate_topic_cols) > 0)

top_terms_collapsed <- topic_model$top_terms |>
  summarise(across(everything(),
                   ~ paste(head(.x[!is.na(.x)], 10), collapse = ", "))) |>
  pivot_longer(everything(), names_to = "topic", values_to = "top_terms")

compute_topic_assoc <- function(theta_tbl, topic_cols, outcome_vec) {
  map_dfr(topic_cols, function(tc) {
    ok <- complete.cases(theta_tbl[[tc]], outcome_vec)
    if (sum(ok) < 5) return(NULL)
    ct <- cor.test(theta_tbl[[tc]][ok], outcome_vec[ok])
    tibble(topic    = tc,
           n        = sum(ok),
           estimate = unname(ct$estimate),
           lcl      = ct$conf.int[1],
           ucl      = ct$conf.int[2],
           p_value  = ct$p.value)
  }) |>
    mutate(p_fdr     = p.adjust(p_value, method = "fdr"),
           direction = case_when(
             p_fdr < 0.05 & estimate > 0 ~ "↑ positive",
             p_fdr < 0.05 & estimate < 0 ~ "↓ negative",
             TRUE                        ~ "n.s."
           ))
}

render_ranked_table <- function(ranked, outcome_label) {
  ranked |>
    select(topic, direction, estimate, lcl, ucl, p_value, p_fdr, n, top_terms) |>
    gt() |>
    tab_header(
      title    = paste0("LDA topics ranked by association with ", outcome_label),
      subtitle = "Pearson r between topic prevalence (theta) and outcome; FDR-adjusted p."
    ) |>
    fmt_number(columns = c(estimate, lcl, ucl, p_value, p_fdr), decimals = 3) |>
    data_color(
      columns = direction,
      fn = function(x) {
        dplyr::case_when(
          x == "↑ positive" ~ "#cfe8d6",
          x == "↓ negative" ~ "#f4cdcd",
          TRUE              ~ "#f0f0f0"
        )
      }
    )
}

topic_assoc_by_outcome <- list()
ranked_by_outcome      <- list()

for (oc in target_outcomes) {
  outcome_vec <- purpose_corpus_df[[oc]]
  stopifnot(length(outcome_vec) == nrow(theta_tbl))

  ta <- compute_topic_assoc(theta_tbl, candidate_topic_cols, outcome_vec)
  rk <- ta |>
    left_join(top_terms_collapsed, by = "topic") |>
    arrange(desc(estimate))

  topic_assoc_by_outcome[[oc]] <- ta
  ranked_by_outcome[[oc]]      <- rk

  cat("\n\n### Topics × ", oc, "\n\n", sep = "")
  print(render_ranked_table(rk, oc))

  readr::write_csv(
    rk,
    here::here("Data", "results",
               paste0("topics_purpose_ranked__", oc, ".csv"))
  )
}

Per-topic word clouds — try topicsPlot() first; if it fails (version drift), fall back to ggplot scatter clouds built directly from topic_model$phi. Each topic’s header annotates its sign and effect size against the target outcome.

library(topics)

topic_plots <- tryCatch(
  topicsPlot(model = topic_model, overview_plot = FALSE),
  error = function(e) { message("topicsPlot failed: ", conditionMessage(e)); NULL }
)
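
# Fallback cloud built straight from phi. Its orientation (topics × terms vs.
# terms × topics) varies with the topics-package version, so probe the
# dimnames and fall back to positional matching on "t_<j>".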

manual_word_cloud <- function(model, topic_name, n = 30) {
  phi <- model$phi
  if (is.null(phi)) return(NULL)
  phi_mat <- as.matrix(phi)
  if (topic_name %in% rownames(phi_mat)) {
    probs <- phi_mat[topic_name, ]
    words <- colnames(phi_mat)
  } else if (topic_name %in% colnames(phi_mat)) {
    probs <- phi_mat[, topic_name]
    words <- rownames(phi_mat)
  } else {
    j <- match(topic_name, paste0("t_", seq_len(min(dim(phi_mat)))))
    if (is.na(j)) return(NULL)
    if (nrow(phi_mat) >= ncol(phi_mat)) {
      probs <- phi_mat[, j]; words <- rownames(phi_mat)
    } else {
      probs <- phi_mat[j, ]; words <- colnames(phi_mat)
    }
  }
  ord <- order(probs, decreasing = TRUE)[seq_len(min(n, length(probs)))]
  d <- tibble(term = words[ord], prob = probs[ord])
  # Deterministic jitter per topic; which() returns integer(0) when the name
  # is missing, and `[1]` would then hand set.seed() an NA, so guard explicitly.
  seed_i <- which(names(topic_model$theta) == topic_name)[1]
  set.seed(if (is.na(seed_i)) 1 else seed_i)
  d$x <- runif(nrow(d), -1, 1)
  d$y <- runif(nrow(d), -1, 1)
  ggplot(d, aes(x, y, label = term, size = prob)) +
    geom_text(check_overlap = TRUE, color = brand_blue) +
    scale_size(range = c(3, 10)) +
    theme_void(base_size = 11) +
    theme(legend.position = "none",
          plot.title      = element_text(hjust = 0.5))
}

topic_names <- candidate_topic_cols
build_topic_annotation <- function(nm, ranked_by_outcome) {
  parts <- character(0)
  for (oc in names(ranked_by_outcome)) {
    row <- ranked_by_outcome[[oc]] |> filter(topic == nm)
    if (nrow(row) == 1) {
      parts <- c(parts,
                 sprintf("%s: %s (r = %.3f, p_fdr = %.3f)",
                         oc, row$direction, row$estimate, row$p_fdr))
    }
  }
  if (length(parts) == 0) "" else paste0("  ·  ", paste(parts, collapse = "  ·  "))
}

for (nm in topic_names) {
  ann <- build_topic_annotation(nm, ranked_by_outcome)
  cat("\n#### ", nm, ann, "\n\n", sep = "")

  rendered <- FALSE
  if (!is.null(topic_plots) && nm %in% names(topic_plots)) {
    obj <- topic_plots[[nm]]
    if (inherits(obj, c("ggplot","gg","patchwork"))) {
      print(obj); rendered <- TRUE
    } else if (is.list(obj)) {
      for (el in obj) {
        if (inherits(el, c("ggplot","gg","patchwork"))) {
          print(el); rendered <- TRUE
        }
      }
    }
  }
  if (!rendered) {
    fallback <- manual_word_cloud(topic_model, nm, n = 30)
    if (!is.null(fallback)) {
      print(fallback +
              labs(title = paste0(nm, ann)))
    } else {
      cat("\n*Could not render word cloud for ", nm, ".*\n\n", sep = "")
    }
  }
}

Word-level supervised projection — for the same purpose corpus, project word-type embeddings onto the difference vector between high and low quartiles of (a) EPS_beyond_self and (b) Ryff_purpose_in_life. Words on the right pole of each plot are most associated with HIGH scale scores; words on the left, with LOW.

library(text)

emb_purpose_goal      <- readRDS(here::here("Data", "embeddings", "purpose_goal.rds"))
emb_purpose_in_life   <- readRDS(here::here("Data", "embeddings", "purposeinlife_text.rds"))

embedding_matrix <- function(emb) {
  m <- emb$texts$texts |> dplyr::select(-dplyr::any_of("id_texts"))
  as.matrix(m)
}
word_type_matrix <- function(emb) {
  wt <- emb$word_types
  if (is.list(wt) && "texts" %in% names(wt)) wt$texts else wt
}

run_projection <- function(prompt_var, outcome_var, emb_obj) {
  txt <- clean[[prompt_var]]
  ok  <- !is.na(txt) & nzchar(trimws(txt)) &
         !is.na(scored[[outcome_var]])
  txt_mat <- embedding_matrix(emb_obj)
  ok      <- ok & complete.cases(txt_mat)
  textProjection(
    words                 = txt[ok],
    word_embeddings       = txt_mat[ok, , drop = FALSE],
    word_types_embeddings = word_type_matrix(emb_obj),
    x = scored[[outcome_var]][ok],
    pca = NULL,
    aggregation = "mean",
    split = "quartile",
    word_weight_power = 1,
    min_freq_words_test = 2
  )
}

proj_bts_inline <- run_projection("purpose_goal",        "EPS_beyond_self",      emb_purpose_goal)
proj_pil_inline <- run_projection("purposeinlife_text",  "Ryff_purpose_in_life", emb_purpose_in_life)

cat("\n#### Beyond-the-self purpose (EPS-BTS) — high vs low pole\n\n")
print(textProjectionPlot(
  word_data             = proj_bts_inline,
  k_n_words_to_test     = FALSE,
  min_freq_words_test   = 2,
  plot_n_word_extreme   = 10,
  plot_n_word_frequency = 5,
  plot_n_words_middle   = 5,
  plot_n_words_p        = 5,
  title_top             = "EPS-BTS: high pole right (high score), low pole left"
))

cat("\n#### Sense of purpose (Ryff PIL) — high vs low pole\n\n")
print(textProjectionPlot(
  word_data             = proj_pil_inline,
  k_n_words_to_test     = FALSE,
  min_freq_words_test   = 2,
  plot_n_word_extreme   = 10,
  plot_n_word_frequency = 5,
  plot_n_words_middle   = 5,
  plot_n_words_p        = 5,
  title_top             = "Ryff PIL: high pole right (high score), low pole left"
))

dir.create(here::here("Data", "results"), showWarnings = FALSE, recursive = TRUE)
saveRDS(proj_bts_inline,
        here::here("Data", "results", "projection_bts_inline.rds"))
saveRDS(proj_pil_inline,
        here::here("Data", "results", "projection_pil_inline.rds"))
topics_dir <- here::here("Data", "topics")
dir.create(topics_dir, showWarnings = FALSE, recursive = TRUE)
saveRDS(list(model                  = topic_model,
             theta                  = theta_tbl,
             topic_assoc_by_outcome = topic_assoc_by_outcome,
             ranked_by_outcome      = ranked_by_outcome,
             corpus_df              = purpose_corpus_df,
             target_outcomes        = target_outcomes),
        file.path(topics_dir, "purpose_lda_k8.rds"))

27 Outputs and handoff

The pipeline writes:

  • Data/wellbeing_clean.csv — analytic sample after exclusions
  • Data/wellbeing_scored.csv — clean + scale scores joined
  • Data/embeddings/<prompt>.rds — per-prompt mixedbread embeddings
  • Data/models/pilot/*.rds — pilot 5-fold CV ridge models
  • Data/models/semp/*.rds — SEMP-frozen ridge models (when N=500)
  • Data/results/projection_*.rds, purpose_cross_pred.csv — Interest #1 outputs
  • Data/lbam/eps_bts_lbam.rds — L-BAM artifact for Chapter 3
  • Data/topics/purpose_lda_k8.rds — LDA model

Chapter 3 (R Projects/The Real World of College/) reads Data/lbam/eps_bts_lbam.rds and applies it to TRWOC transcripts.
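
A minimal sketch of that application step, assuming on the Chapter 3 side a transcripts tibble with a text column (hypothetical name; the embedder settings must match the artifact’s):

lbam <- readRDS(here::here("Data", "lbam", "eps_bts_lbam.rds"))

trwoc_emb <- textEmbed(
  texts = transcripts["text"],                    # hypothetical Chapter 3 input
  model = "mixedbread-ai/mxbai-embed-large-v1",   # must match lbam$embedder
  layers = -2,
  aggregation_from_layers_to_tokens = "concatenate",
  aggregation_from_tokens_to_texts  = "mean",
  keep_token_embeddings = FALSE
)

# If the frozen workflow expects an id variable, add it the way
# embedding_new_data() does above before predicting.
transcripts$pred_eps_bts <- predict(
  lbam$fit$final_model,
  new_data = trwoc_emb$texts[[1]] |> dplyr::select(-dplyr::starts_with("id_"))
)$.pred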

28 Pilot status

What this Rmd will produce on the current pilot CSV:

  • Sections 1–5 (setup through scale construction): run.
  • Section 6 (CFA): runs but estimation is unreliable at N≈55.
  • Section 7 (text descriptives): runs.
  • Sections 8–14 (eval = FALSE chunks: embeddings → models → Interest #1 → Interest #2 → topics): blocked on having mixedbread embeddings extracted. Flip eval = TRUE and run when ready, but be realistic about pilot N for the modeling sections.

What the Rmd CAN’T exercise on pilot (data-design issues, not pipeline issues):

  • The H1 SEMP train/test split (need wave 1/wave 2 column populated).
  • The order-effects counterbalancing test (condition is NaN for all real respondents).
  • H3iv and the H3 exploratory covariates (employment, religion importance, religious affiliation, relationship status all NaN for real respondents).
  • Responsibility scale analyses (Q192–194 NaN for real respondents).