Text Processing

Text data processing is necessary to transform the raw responses into a document-term matrix (DTM) to be used later for the sentiment analysis. This processing was the same for all dictionaries, since every analysis was performed on the final output of this section.

Loading and Cleaning the Data

library(readr)    # read_csv(), write_csv()
library(janitor)  # clean_names()
library(dplyr)    # %>% pipe and data wrangling

data <- read_csv("db_online.csv")
data <- clean_names(data)

Create a corpus from the uploaded database

The database contains each participant's responses. The responses are in their raw state, without any pre-processing. All the text data appears in the column text.

corpus_data <- quanteda::corpus(data, text_field = "text")

Apply custom stopwords

A pre-elaborated stopword list was used to customize the analysis. This custom list controls the text filtering and avoids losing relevant information that a conventional stopword list would remove.

custom_stopwords <- tolower(trimws(readLines("custom_stopwords.txt")))

Tokenization and other text processing

The text data was tokenized, lemmatized, and filtered to remove punctuation, symbols, numbers, and the customized stopwords. Tokenization and filtering were done with the quanteda package, and lemmatization with the textstem package.

# Tokenize and drop punctuation, symbols, numbers, and the custom stopwords
tokens_data <- quanteda::tokens(corpus_data,
                                remove_punct = TRUE,
                                remove_symbols = TRUE,
                                remove_numbers = TRUE) %>%
  quanteda::tokens_tolower() %>%
  quanteda::tokens_select(pattern = custom_stopwords, selection = "remove")

# Lemmatize each document's tokens with textstem, then rebuild the tokens object
tokens_lem <- tokens_data %>%
  as.list() %>%
  lapply(textstem::lemmatize_words, language = "en") %>%
  quanteda::tokens()

Text output

A final output file (tokens.csv) is generated containing each participant's processed text responses in document-term-matrix form.

dfm_clean <- quanteda::dfm(tokens_lem)
dfm_df <- quanteda::convert(dfm_clean, to = "data.frame")
write_csv(dfm_df, "tokens.csv")
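
As an optional sanity check (not part of the original pipeline), the resulting matrix can be inspected with quanteda utilities:

# Optional check of the document-feature matrix (not in the original pipeline)
quanteda::ndoc(dfm_clean)              # number of participant documents
quanteda::nfeat(dfm_clean)             # number of unique lemmas
quanteda::topfeatures(dfm_clean, 10)   # ten most frequent lemmas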

Sentiment Analysis

For the dictionaries GI, HE, and QDAP, we used the SentimentAnalysis package. A final output file with the sentiment score for each dictionary was generated.

library(SentimentAnalysis)
texts <- data$text
sentiment_results <- SentimentAnalysis::analyzeSentiment(texts)
results <- cbind(data, sentiment_results)
write.csv(results, "quanteda_sentiment.csv", row.names = FALSE)
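
The columns used downstream (after clean_names()) are the per-dictionary scores and the word count; a quick inspection, assuming the standard analyzeSentiment() column names:

# Inspect the dictionary scores and word counts used later in the pipeline
# (assumes the standard analyzeSentiment() output columns)
head(sentiment_results[, c("WordCount", "SentimentGI", "SentimentHE", "SentimentQDAP")])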

For the dictionaries TextBlob and VADER, the analysis was done in Python; each generated a .csv file similar to the one produced in the previous analysis.

Pre-processing for Supervised Machine Learning

Unified database

For the SML analysis, we started by creating a single database with the scores from the five dictionaries and the human scoring. The human scores were rescaled to a -2 to 2 scale to match the polarity conventions used by the dictionaries. The database also includes a variable indicating the participants' age group.

# SML Analysis (split by age)
sent <- read.csv("quanteda_sentiment.csv") %>% clean_names()
blob <- read.csv("blob.csv") %>% clean_names()
vader <- read.csv("vader.csv") %>% clean_names()
sent$polarity_blob  <- blob$blob_polarity    # TextBlob polarity score
sent$compound_vader <- vader$compound        # VADER compound score
sent$emotions <- factor(sent$valence - 3, ordered = TRUE)    # human scores rescaled to -2..2
sent$age <- factor(sent$age, levels = c("younger", "older"))

Normalize sentiment

All sentiment scores were normalized by the word count of each narrative to control for verbosity. A total sentiment score (sentiment_total) was also calculated for exploratory purposes (a possible computation is sketched after the code below).

# Normalize and compute total
sent <- sent %>%
  mutate(
    sentiment_gi    = sentiment_gi / word_count,
    sentiment_he    = sentiment_he / word_count,
    sentiment_qdap  = sentiment_qdap / word_count,
    sentiment_blob  = polarity_blob / word_count,
    sentiment_vader = compound_vader / word_count
  )
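
The chunk above normalizes the five dictionary scores but does not show how sentiment_total is built. A minimal sketch, assuming the total is the simple sum of the five normalized scores (the exact formula is not specified in this section):

# Possible computation of sentiment_total (assumption: sum of the five
# normalized dictionary scores; the exact formula is not shown here)
sent <- sent %>%
  mutate(
    sentiment_total = sentiment_gi + sentiment_he + sentiment_qdap +
      sentiment_blob + sentiment_vader
  )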

Original input

We explored class imbalance in the human scoring, since imbalance (toward positive scores in this case) could bias the SML analysis by leading the models to over-predict the majority class.

The following plot confirms that the human scoring was indeed skewed toward positive values.
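
The plotting code is not shown in this section; a minimal sketch of how the class distribution could be inspected, assuming ggplot2 is available (illustrative, not necessarily the original figure code):

# Illustrative check of class imbalance in the human ratings (assumes ggplot2)
library(ggplot2)

ggplot(sent, aes(x = emotions)) +
  geom_bar() +
  labs(
    x = "Human-rated emotion class (-2 to +2)",
    y = "Number of narratives",
    title = "Distribution of human emotion ratings"
  )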

Metrics

The function eval_metrics() evaluates model predictions by deriving performance metrics from the confusion matrix. First, a confusion matrix is computed using caret::confusionMatrix(preds, actuals), which cross-tabulates predicted emotion classes against the true classes. The contingency table underlying the confusion matrix is extracted and converted to a numeric matrix (tab) to allow manual computation of performance indices.

For each emotion class, the confusion matrix is decomposed into its four fundamental components: true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN). True positives are obtained from the diagonal of the confusion matrix (diag(tab)), representing cases in which a class is correctly predicted. False positives are calculated as the number of observations incorrectly assigned to a given class, defined as the column sum minus the true positives (colSums(tab) − diag(tab)). False negatives correspond to observations belonging to a given class that were misclassified as another class and are computed as the row sum minus the true positives (rowSums(tab) − diag(tab)). True negatives are defined as all remaining observations that are neither false positives nor false negatives for a given class; these are computed by summing all cells outside the row and column of the target class (sum(tab[−i, −i])).

Using these quantities, standard performance metrics are computed separately for each class. Precision is defined as the proportion of predicted instances of a class that are correct (TP / (TP + FP)). Recall, or sensitivity, reflects the proportion of true instances of a class that are correctly identified (TP / (TP + FN)). Specificity quantifies the proportion of non-target instances that are correctly rejected (TN / (TN + FP)). The F1 score is computed as the harmonic mean of precision and recall, capturing the balance between these two measures. Finally, false positive rate (FPR) and false negative rate (FNR) are computed to quantify error tendencies, defined respectively as FP / (FP + TN) and FN / (FN + TP).

# Helper: evaluation metrics (must be defined BEFORE the main function)
eval_metrics <- function(preds, actuals, predictor = NA, group = NA) {
  cm  <- caret::confusionMatrix(preds, actuals)
  tab <- as.matrix(cm$table)

  # Per-class components
  TN <- sapply(seq_len(nrow(tab)), function(i) sum(tab[-i, -i]))
  FP <- colSums(tab) - diag(tab)
  FN <- rowSums(tab) - diag(tab)
  TP <- diag(tab)

  # Per-class rates
  precision   <- TP / (TP + FP)
  recall      <- TP / (TP + FN)
  specificity <- TN / (TN + FP)
  f1          <- 2 * (precision * recall) / (precision + recall)
  fpr         <- FP / (FP + TN)
  fnr         <- FN / (FN + TP)

  # Macro-averaged metrics (equal weight to each class)
  metrics <- data.frame(
    Accuracy    = sum(TP) / sum(tab),
    Precision   = mean(precision, na.rm = TRUE),
    Recall      = mean(recall, na.rm = TRUE),
    F1          = mean(f1, na.rm = TRUE),
    Specificity = mean(specificity, na.rm = TRUE),
    FPR         = mean(fpr, na.rm = TRUE),
    FNR         = mean(fnr, na.rm = TRUE)
  )

  list(metrics = metrics, confusion = cm$table, Predictor = predictor, Group = group)
}
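
For illustration, eval_metrics() can be applied to any pair of predicted and observed factors that share the same levels; a toy example with hypothetical values (not study data):

# Toy illustration of eval_metrics() with hypothetical predictions
toy_actual <- factor(c(-1, 0, 1, 1, 0, -1, 1, 0), levels = c(-1, 0, 1))
toy_pred   <- factor(c(-1, 1, 1, 0, 0, -1, 1, 1), levels = c(-1, 0, 1))
eval_metrics(toy_pred, toy_actual, predictor = "toy", group = "example")$metrics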

Data handling and supervised machine learning

The function run_age_group_analysis() evaluates how well each sentiment dictionary predicts human-rated emotional valence separately for younger and older adults. The analysis is conducted independently for each age group by subsetting the dataset to the requested group (df <- df[df$age == group_label, ]) and ensuring the outcome variable (emotions) is treated as an ordered factor (factor(..., ordered = TRUE)).

Weights

To address class imbalance in the human ratings, the function first computes class weights from the original, pre-oversampling distribution of emotion categories. Specifically, it tabulates the number of observations in each emotion class (class_counts <- table(df$emotions)), computes inverse-frequency weights (inv_weights <- 1 / class_counts), and applies a log transformation (log_weights <- log(1 + inv_weights)) to reduce extreme penalties for rare classes. These class weights are then normalised to have a mean of 1 (class_w <- log_weights / mean(log_weights)), which stabilises the numerical scale of weights during model fitting.
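
To make the weighting scheme concrete, a brief standalone illustration of these three steps using hypothetical class counts (the actual counts come from the data inside the function below):

# Toy illustration of the class-weight computation (hypothetical counts)
class_counts <- c(`-2` = 5, `-1` = 10, `0` = 20, `1` = 60, `2` = 25)
inv_weights  <- 1 / class_counts                  # rare classes get larger weights
log_weights  <- log(1 + inv_weights)              # dampen extreme penalties
class_w      <- log_weights / mean(log_weights)   # normalise to mean 1
round(class_w, 2)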

Next, the dataset is split into training and testing partitions before any oversampling using stratified sampling (createDataPartition(df$emotions, p = 0.8)). This ensures that evaluation is performed on an untouched test set of genuinely unseen human-rated texts, preventing information leakage that could occur if duplicated observations appeared in both training and test sets.

Oversampling

Oversampling is then applied only to the training data to ensure equal representation of emotion categories during model induction. The training set is resampled with replacement within each emotion class until all classes match the frequency of the most common class in the training data (sample_n(target_freq, replace = TRUE)). After oversampling, the previously computed class weights (derived from the original data) are mapped onto the oversampled training rows (train_weights <- class_w[...]). Weights are rounded and constrained to be at least 1 to ensure all observations retain non-zero influence during training.

Conditional Inference Tree Models

Models are fitted using weighted conditional inference trees (ctree) in a one-predictor-at-a-time framework. For each sentiment dictionary (GI, HE, QDAP, TextBlob, and VADER), a separate model is trained on the oversampled, weighted training set and then used to predict emotion classes in the untouched test set. Performance is evaluated using macro-averaged classification metrics computed by eval_metrics(), including accuracy, precision, recall, F1, specificity, false positive rate, and false negative rate. Macro-averaging provides a single score per metric for each dictionary by averaging class-specific values, ensuring each emotion category contributes equally to performance estimates. Finally, the function returns a combined table of metrics for all dictionaries within the specified age group.

run_age_group_analysis <- function(df, group_label) {
  cat("\nRunning analysis for:", group_label, "\n")
  
  # 1) Subset to the requested age group
  df <- df[df$age == group_label, ]
  df$emotions <- factor(df$emotions, ordered = TRUE)
  
  # 2) Compute class weights from ORIGINAL (pre-oversampling) distribution
  class_counts <- table(df$emotions)
  inv_weights  <- 1 / class_counts
  log_weights  <- log(1 + inv_weights)
  
  # Normalise for numerical stability (mean weight ~ 1)
  class_w <- log_weights / mean(log_weights)
  
  # 3) Train/test split on ORIGINAL data (pre-oversampling)
  set.seed(123)
  train_idx <- caret::createDataPartition(df$emotions, p = 0.8, list = FALSE)
  train_raw <- df[train_idx, ]
  test      <- df[-train_idx, ]
  
  # 4) Oversample TRAINING data only (to avoid leakage)
  target_freq <- max(table(train_raw$emotions))
  train_bal <- train_raw %>%
    dplyr::group_by(emotions) %>%
    dplyr::sample_n(target_freq, replace = TRUE) %>%
    dplyr::ungroup()
  
  # 5) Map ORIGINAL class weights onto the oversampled training set
  train_weights <- class_w[as.character(train_bal$emotions)]
  train_weights <- as.integer(round(pmax(train_weights, 1)))  # enforce minimum weight = 1
  
  # 6) Fit and evaluate one-predictor models
  train_model <- function(pred, name) {
    
    m <- partykit::ctree(
      emotions ~ .,
      data    = train_bal[, c("emotions", pred)],
      weights = train_weights
    )
    
    preds <- predict(m, newdata = test, type = "response")
    
    # Macro-averaged metrics (from your existing eval_metrics function)
    eval <- eval_metrics(preds, test$emotions, predictor = name, group = group_label)
    
    cat(name, "metrics:\n")
    print(eval$metrics)
    
    cbind(Predictor = name, eval$metrics, Group = group_label)
  }
  
  # 7) Return combined results for this age group
  do.call(rbind, list(
    train_model("sentiment_gi",    "GI"),
    train_model("sentiment_he",    "HE"),
    train_model("sentiment_qdap",  "QDAP"),
    train_model("sentiment_blob",  "Blob"),
    train_model("sentiment_vader", "VADER")
  ))
}


# Run analysis for each age group
# ============================================================
metrics_younger <- run_age_group_analysis(sent, group_label = "younger")
## 
## Running analysis for: younger 
## GI metrics:
##    Accuracy Precision   Recall        F1 Specificity       FPR      FNR
## 1 0.1575342  0.134281 0.140625 0.2009951   0.7927995 0.2072005 0.859375
## HE metrics:
##     Accuracy Precision    Recall        F1 Specificity       FPR       FNR
## 1 0.06849315 0.2470588 0.1214065 0.1564799   0.8069631 0.1930369 0.8785935
## QDAP metrics:
##    Accuracy Precision    Recall        F1 Specificity       FPR       FNR
## 1 0.1780822 0.2604248 0.1890814 0.1604167   0.8090176 0.1909824 0.8109186
## Blob metrics:
##    Accuracy Precision    Recall        F1 Specificity       FPR       FNR
## 1 0.2808219 0.2777778 0.1860051 0.2860963   0.7961902 0.2038098 0.8139949
## VADER metrics:
##    Accuracy Precision    Recall        F1 Specificity       FPR       FNR
## 1 0.2739726 0.3380117 0.2986486 0.2479238   0.8160976 0.1839024 0.7013514
metrics_older   <- run_age_group_analysis(sent, group_label = "older")
## 
## Running analysis for: older 
## GI metrics:
##    Accuracy Precision    Recall        F1 Specificity       FPR       FNR
## 1 0.1955307 0.1454987 0.1974907 0.2315084   0.8066298 0.1933702 0.8025093
## HE metrics:
##    Accuracy  Precision    Recall        F1 Specificity       FPR       FNR
## 1 0.1731844 0.08698135 0.2210227 0.2588076   0.8074854 0.1925146 0.7789773
## QDAP metrics:
##   Accuracy Precision    Recall        F1 Specificity       FPR       FNR
## 1 0.122905 0.1439189 0.1337963 0.2173252   0.8059532 0.1940468 0.8662037
## Blob metrics:
##   Accuracy Precision    Recall        F1 Specificity       FPR       FNR
## 1 0.396648 0.1950895 0.2898482 0.4558824   0.8099197 0.1900803 0.7101518
## VADER metrics:
##    Accuracy Precision    Recall        F1 Specificity       FPR       FNR
## 1 0.2905028 0.2571987 0.1427884 0.2939394   0.8118885 0.1881115 0.8572116
# Compare all metrics (stack results)
# ============================================================
metrics_all <- rbind(metrics_younger, metrics_older)

Tables 1 and 2 show the results for the younger and older adults, respectively.

library(knitr)

# Table: Younger group
kable(
  metrics_younger,
  caption = "Classification performance by dictionary (Younger group)",
  digits = 3
)
Classification performance by dictionary (Younger group)

Predictor   Accuracy  Precision  Recall     F1  Specificity    FPR    FNR    Group
GI             0.158      0.134   0.141  0.201        0.793  0.207  0.859  younger
HE             0.068      0.247   0.121  0.156        0.807  0.193  0.879  younger
QDAP           0.178      0.260   0.189  0.160        0.809  0.191  0.811  younger
Blob           0.281      0.278   0.186  0.286        0.796  0.204  0.814  younger
VADER          0.274      0.338   0.299  0.248        0.816  0.184  0.701  younger
# Table: Older group
kable(
  metrics_older,
  caption = "Classification performance by dictionary (Older group)",
  digits = 3
)
Classification performance by dictionary (Older group)

Predictor   Accuracy  Precision  Recall     F1  Specificity    FPR    FNR    Group
GI             0.196      0.145   0.197  0.232        0.807  0.193  0.803    older
HE             0.173      0.087   0.221  0.259        0.807  0.193  0.779    older
QDAP           0.123      0.144   0.134  0.217        0.806  0.194  0.866    older
Blob           0.397      0.195   0.290  0.456        0.810  0.190  0.710    older
VADER          0.291      0.257   0.143  0.294        0.812  0.188  0.857    older

Composite score

A composite performance score was computed to provide a single summary index of classification performance across multiple evaluation metrics. For each predictor, the false positive rate (FPR) and false negative rate (FNR) were first reverse-coded so that higher values consistently reflected better performance. The composite score was then calculated as the mean of seven metrics: accuracy, precision, recall, F1 score, specificity, reversed FPR, and reversed FNR. Finally, predictors were ranked by their composite scores, with higher scores indicating better overall performance and ties assigned the same minimum rank. This procedure was applied separately to the younger and older groups to allow within-group comparisons of predictor performance.

compute_composite_scores_mean <- function(metrics_df) {
  
  df <- metrics_df
  
  # Reverse cost metrics so that higher values indicate better performance
  df$FPR_rev <- max(df$FPR, na.rm = TRUE) - df$FPR
  df$FNR_rev <- max(df$FNR, na.rm = TRUE) - df$FNR
  
  # Compute composite score as a simple average (equal weighting)
  df$CompositeScore <- rowMeans(
    df[, c(
      "Accuracy",
      "Precision",
      "Recall",
      "F1",
      "Specificity",
      "FPR_rev",
      "FNR_rev"
    )],
    na.rm = TRUE
  )
  
  # Rank predictors (higher composite score = better performance)
  df$Rank <- rank(-df$CompositeScore, ties.method = "min")
  
  df
}

metrics_younger_comp <- compute_composite_scores_mean(metrics_younger)
metrics_older_comp   <- compute_composite_scores_mean(metrics_older)

# Younger adults
metrics_younger_comp[
  order(metrics_younger_comp$Rank),
  c("Predictor", "CompositeScore", "Rank")
]
##   Predictor CompositeScore Rank
## 5     VADER      0.3107421    1
## 4      Blob      0.2706972    2
## 3      QDAP      0.2401308    3
## 1        GI      0.2064933    4
## 2        HE      0.2020807    5
# Older adults
metrics_older_comp[
  order(metrics_older_comp$Rank),
  c("Predictor", "CompositeScore", "Rank")
]
##   Predictor CompositeScore Rank
## 4      Blob      0.3296295    1
## 5     VADER      0.2587493    2
## 1        GI      0.2344328    3
## 2        HE      0.2337486    4
## 3      QDAP      0.2034141    5

Because each dictionary yielded a single performance estimate per age group, these aggregate results were used to characterise relative differences and rank dictionaries descriptively rather than to support statistical inference. Accordingly, aggregate comparisons reflect overall model behaviour and practical performance differences across dictionaries, but do not capture variability across individual participants.

Mixed-Effects Models with Participant-Level Performance

While the previous analyses answered the question "Which sentiment dictionaries perform better?", the participant-level analysis answers "Are these performance differences reliable across individuals?"

To assess whether observed differences in dictionary performance were reliable across individuals, performance was additionally evaluated at the participant level. Participant-level scores were computed by comparing predicted and observed emotion ratings within each participant’s narratives, yielding one composite performance score per participant and dictionary. This approach preserves between-participant variability and enables the use of inferential statistical models, such as mixed-effects analyses, with participant treated as a random factor. Participant-level analyses therefore allow performance differences between dictionaries to be tested for consistency across individuals and provide an appropriate basis for statistical inference beyond descriptive model-level comparisons.
In both models, the random intercept for participant had an estimated variance of zero, resulting in a singular fit. This indicates that, after accounting for dictionary and age-group effects, there was no remaining systematic between-participant variability in composite agreement scores. Fixed-effect estimates are therefore interpreted as population-level effects of dictionary reliability.

library(lme4)
library(lmerTest)
library(emmeans)

# 0) Detect participant ID column in `sent`
candidate_ids <- c("participant", "participant_id", "participantid", "participantID")
id_col <- candidate_ids[candidate_ids %in% names(sent)][1]

if (is.na(id_col)) {
  stop(
    "I couldn't find a participant ID column in `sent`.\n",
    "Rename your ID column to one of: ",
    paste(candidate_ids, collapse = ", "),
    "\nOr set it manually: id_col <- 'your_column_name'"
  )
}

# Ensure consistent types
sent[[id_col]] <- factor(sent[[id_col]])
sent$age <- factor(sent$age, levels = c("younger", "older"))
sent$emotions <- factor(sent$emotions, ordered = TRUE)

# 1) Per-participant metric helper (same macro-averaging logic)
eval_metrics_one_participant <- function(preds, actuals, all_levels) {
  preds   <- factor(preds,   levels = all_levels, ordered = TRUE)
  actuals <- factor(actuals, levels = all_levels, ordered = TRUE)
  
  cm  <- caret::confusionMatrix(preds, actuals)
  tab <- as.matrix(cm$table)
  
  TN <- sapply(seq_len(nrow(tab)), function(i) sum(tab[-i, -i]))
  FP <- colSums(tab) - diag(tab)
  FN <- rowSums(tab) - diag(tab)
  TP <- diag(tab)
  
  precision   <- TP / (TP + FP)
  recall      <- TP / (TP + FN)
  specificity <- TN / (TN + FP)
  f1          <- 2 * (precision * recall) / (precision + recall)
  fpr         <- FP / (FP + TN)
  fnr         <- FN / (FN + TP)
  
  data.frame(
    Accuracy    = sum(TP) / sum(tab),
    Precision   = mean(precision, na.rm = TRUE),
    Recall      = mean(recall, na.rm = TRUE),
    F1          = mean(f1, na.rm = TRUE),
    Specificity = mean(specificity, na.rm = TRUE),
    FPR         = mean(fpr, na.rm = TRUE),
    FNR         = mean(fnr, na.rm = TRUE)
  )
}

eval_metrics_by_participant <- function(df_test, pred_col, actual_col, id_col) {
  all_levels <- levels(df_test[[actual_col]])
  
  df_test %>%
    group_by(.data[[id_col]]) %>%
    group_modify(~{
      eval_metrics_one_participant(
        preds = .x[[pred_col]],
        actuals = .x[[actual_col]],
        all_levels = all_levels
      )
    }) %>%
    ungroup() %>%
    rename(participant = 1)
}

# 2) Run the same training pipeline, but return test predictions
run_age_group_preds <- function(df, group_label, pred, name) {
  
  df <- df[df$age == group_label, ]
  df$emotions <- factor(df$emotions, ordered = TRUE)
  
  # ---- grouped split by participant (NOT rows) ----
  set.seed(123)
  pids <- unique(df[[id_col]])
  train_pids <- sample(pids, size = floor(0.8 * length(pids)), replace = FALSE)
  
  train_raw <- df[df[[id_col]] %in% train_pids, ]
  test      <- df[!(df[[id_col]] %in% train_pids), ]
  
  # weights computed from ORIGINAL distribution (in this age group)
  class_counts <- table(train_raw$emotions)   # use train distribution
  inv_weights  <- 1 / class_counts
  log_weights  <- log(1 + inv_weights)
  class_w <- log_weights / mean(log_weights)
  
  # oversample TRAINING only
  target_freq <- max(table(train_raw$emotions))
  train_bal <- train_raw %>%
    dplyr::group_by(emotions) %>%
    dplyr::sample_n(target_freq, replace = TRUE) %>%
    dplyr::ungroup()
  
  train_weights <- class_w[as.character(train_bal$emotions)]
  train_weights <- as.numeric(pmax(train_weights, 1))  # keep numeric (no rounding)
  
  m <- partykit::ctree(
    emotions ~ .,
    data    = train_bal[, c("emotions", pred), drop = FALSE],
    weights = train_weights
  )
  
  test$pred <- predict(m, newdata = test, type = "response")
  
  test %>%
    transmute(
      participant = .data[[id_col]],
      Group = group_label,
      Predictor = name,
      actual = emotions,
      pred = pred
    )
}


# 3) Collect test predictions for each age group and dictionary
preds_younger <- bind_rows(
  run_age_group_preds(sent, "younger", "sentiment_gi",    "GI"),
  run_age_group_preds(sent, "younger", "sentiment_he",    "HE"),
  run_age_group_preds(sent, "younger", "sentiment_qdap",  "QDAP"),
  run_age_group_preds(sent, "younger", "sentiment_blob",  "Blob"),
  run_age_group_preds(sent, "younger", "sentiment_vader", "VADER")
)

preds_older <- bind_rows(
  run_age_group_preds(sent, "older", "sentiment_gi",    "GI"),
  run_age_group_preds(sent, "older", "sentiment_he",    "HE"),
  run_age_group_preds(sent, "older", "sentiment_qdap",  "QDAP"),
  run_age_group_preds(sent, "older", "sentiment_blob",  "Blob"),
  run_age_group_preds(sent, "older", "sentiment_vader", "VADER")
)

preds_all <- bind_rows(preds_younger, preds_older) %>%
  mutate(
    participant = factor(participant),
    Group = factor(Group, levels = c("younger", "older")),
    Predictor = factor(Predictor, levels = c("GI","HE","QDAP","Blob","VADER")),
    actual = factor(actual, levels = levels(sent$emotions), ordered = TRUE),
    pred   = factor(pred,   levels = levels(sent$emotions), ordered = TRUE)
  )


# 4) Participant-level metrics per dictionary x group
metrics_by_participant <- preds_all %>%
  group_by(Group, Predictor) %>%
  group_modify(~{
    eval_metrics_by_participant(
      df_test = .x,
      pred_col = "pred",
      actual_col = "actual",
      id_col = "participant"
    )
  }) %>%
  ungroup() %>%
  mutate(
    participant = factor(participant),
    Group = factor(Group, levels = c("younger", "older")),
    Predictor = factor(Predictor, levels = c("GI","HE","QDAP","Blob","VADER"))
  )

# 5) Composite score per participant (rowMeans)
# Use 1 - FPR and 1 - FNR so higher means better
metric_cols <- c("Accuracy", "Precision", "Recall", "F1", "Specificity", "FPR_rev", "FNR_rev")

metrics_by_participant <- metrics_by_participant %>%
  mutate(
    FPR_rev = 1 - FPR,
    FNR_rev = 1 - FNR
  ) %>%
  mutate(
    CompositeScore = rowMeans(
      as.matrix(dplyr::select(., dplyr::all_of(metric_cols))),
      na.rm = TRUE
    )
  )

#mean composite per participant x dictionary (one row per cell)
participant_means <- metrics_by_participant %>%
  group_by(participant, Group, Predictor) %>%
  summarise(CompositeScore_mean = mean(CompositeScore, na.rm = TRUE), .groups = "drop")

Analysis with All the Dictionaries

When considering all sentiment dictionaries together, there is no overall age-related difference in agreement with human emotional ratings. Instead, the reliability of the dictionaries varies substantially across tools and depends on age group. QDAP shows higher agreement with human scores in younger adults, HE and Blob show higher agreement in older adults, and GI and VADER show little to no age-related difference. The significant Group × Dictionary interaction indicates that dictionary-based sentiment measures do not uniformly capture human-rated emotional content across age groups. Consequently, conclusions about age differences in sentiment reliability are contingent on the specific dictionary used, underscoring the importance of evaluating dictionary performance separately across populations rather than assuming age-invariant validity.

lme_comp <- lmer(
  CompositeScore ~ Group * Predictor + (1 | participant),
  data = metrics_by_participant,
  REML = TRUE
)

summary(lme_comp)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: CompositeScore ~ Group * Predictor + (1 | participant)
##    Data: metrics_by_participant
## 
## REML criterion at convergence: -250
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.1461 -0.5796 -0.0384  0.5243  3.2382 
## 
## Random effects:
##  Groups      Name        Variance Std.Dev.
##  participant (Intercept) 0.00000  0.00000 
##  Residual                0.00919  0.09586 
## Number of obs: 160, groups:  participant, 32
## 
## Fixed effects:
##                             Estimate Std. Error         df t value Pr(>|t|)    
## (Intercept)                 0.389332   0.023966 150.000000  16.245  < 2e-16 ***
## Groupolder                  0.012590   0.033893 150.000000   0.371 0.710814    
## PredictorHE                -0.092946   0.033893 150.000000  -2.742 0.006843 ** 
## PredictorQDAP               0.017394   0.033893 150.000000   0.513 0.608575    
## PredictorBlob               0.008341   0.033893 150.000000   0.246 0.805938    
## PredictorVADER              0.126453   0.033893 150.000000   3.731 0.000270 ***
## Groupolder:PredictorHE      0.084744   0.047932 150.000000   1.768 0.079095 .  
## Groupolder:PredictorQDAP   -0.165636   0.047932 150.000000  -3.456 0.000714 ***
## Groupolder:PredictorBlob    0.062141   0.047932 150.000000   1.296 0.196817    
## Groupolder:PredictorVADER  -0.070210   0.047932 150.000000  -1.465 0.145073    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) Grpldr PrdcHE PrQDAP PrdctB PVADER Gr:PHE G:PQDA Grp:PB
## Groupolder  -0.707                                                        
## PredictorHE -0.707  0.500                                                 
## PredctrQDAP -0.707  0.500  0.500                                          
## PredictrBlb -0.707  0.500  0.500  0.500                                   
## PrdctrVADER -0.707  0.500  0.500  0.500  0.500                            
## Grpldr:PrHE  0.500 -0.707 -0.707 -0.354 -0.354 -0.354                     
## Grpld:PQDAP  0.500 -0.707 -0.354 -0.707 -0.354 -0.354  0.500              
## Grpldr:PrdB  0.500 -0.707 -0.354 -0.354 -0.707 -0.354  0.500  0.500       
## Grpl:PVADER  0.500 -0.707 -0.354 -0.354 -0.354 -0.707  0.500  0.500  0.500
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')
anova(lme_comp)
## Type III Analysis of Variance Table with Satterthwaite's method
##                  Sum Sq  Mean Sq NumDF DenDF F value    Pr(>F)    
## Group           0.00108 0.001082     1   150  0.1178    0.7319    
## Predictor       0.53415 0.133539     4   150 14.5309 4.722e-10 ***
## Group:Predictor 0.33460 0.083650     4   150  9.1023 1.307e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
emm_age_by_dic <- emmeans(lme_comp, ~ Group | Predictor)
pairs(emm_age_by_dic, adjust = "holm")
## Predictor = GI:
##  contrast        estimate     SE  df t.ratio p.value
##  younger - older  -0.0126 0.0339 150  -0.371  0.7108
## 
## Predictor = HE:
##  contrast        estimate     SE  df t.ratio p.value
##  younger - older  -0.0973 0.0339 150  -2.872  0.0047
## 
## Predictor = QDAP:
##  contrast        estimate     SE  df t.ratio p.value
##  younger - older   0.1530 0.0339 150   4.516  <.0001
## 
## Predictor = Blob:
##  contrast        estimate     SE  df t.ratio p.value
##  younger - older  -0.0747 0.0339 150  -2.205  0.0290
## 
## Predictor = VADER:
##  contrast        estimate     SE  df t.ratio p.value
##  younger - older   0.0576 0.0339 150   1.700  0.0912
## 
## Degrees-of-freedom method: kenward-roger
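
The boundary (singular) fit flagged in the model summary can also be confirmed programmatically; a minimal check using lme4:

# Confirm the singular fit noted above (participant-intercept variance is ~0)
lme4::isSingular(lme_comp)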

Analysis with VADER and Blob only

When restricting the analysis to Blob and VADER, there is no overall age difference in agreement with human emotional ratings. Instead, the reliability of the sentiment dictionaries depends on both age group and dictionary type. Blob shows higher agreement with human scores in older adults, whereas VADER shows higher agreement with human scores in younger adults. This crossover interaction indicates that the extent to which a dictionary captures human-rated emotional content is age-dependent, highlighting that dictionary reliability is not uniform across populations and that conclusions about sentiment accuracy can shift depending on both the tool used and the characteristics of the sample.

# Subset to Blob and VADER only
metrics_BV <- metrics_by_participant %>%
  dplyr::filter(Predictor %in% c("Blob", "VADER")) %>%
  dplyr::mutate(
    Predictor = droplevels(Predictor),
    Group = droplevels(Group)
  )

participant_means_BV <- participant_means %>%
  dplyr::filter(Predictor %in% c("Blob", "VADER")) %>%
  dplyr::mutate(
    Predictor = droplevels(Predictor),
    Group = droplevels(Group)
  )

lme_comp_BV <- lmer(
  CompositeScore ~ Group * Predictor + (1 | participant),
  data = metrics_BV,
  REML = TRUE
)

summary(lme_comp_BV)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: CompositeScore ~ Group * Predictor + (1 | participant)
##    Data: metrics_BV
## 
## REML criterion at convergence: -93.8
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.0370 -0.6123  0.1496  0.5133  3.0736 
## 
## Random effects:
##  Groups      Name        Variance Std.Dev.
##  participant (Intercept) 0.0000   0.000   
##  Residual                0.0102   0.101   
## Number of obs: 64, groups:  participant, 32
## 
## Fixed effects:
##                           Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)                0.39767    0.02525 60.00000  15.750  < 2e-16 ***
## Groupolder                 0.07473    0.03571 60.00000   2.093  0.04060 *  
## PredictorVADER             0.11811    0.03571 60.00000   3.308  0.00159 ** 
## Groupolder:PredictorVADER -0.13235    0.05050 60.00000  -2.621  0.01110 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) Grpldr PVADER
## Groupolder  -0.707              
## PrdctrVADER -0.707  0.500       
## Grpl:PVADER  0.500 -0.707 -0.707
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')
anova(lme_comp_BV)
## Type III Analysis of Variance Table with Satterthwaite's method
##                   Sum Sq  Mean Sq NumDF DenDF F value  Pr(>F)  
## Group           0.001171 0.001171     1    60  0.1148 0.73592  
## Predictor       0.043159 0.043159     1    60  4.2309 0.04405 *
## Group:Predictor 0.070067 0.070067     1    60  6.8688 0.01110 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
emm_age_by_dic <- emmeans(lme_comp_BV, ~ Group | Predictor)
pairs(emm_age_by_dic, adjust = "holm")
## Predictor = Blob:
##  contrast        estimate     SE df t.ratio p.value
##  younger - older  -0.0747 0.0357 60  -2.093  0.0406
## 
## Predictor = VADER:
##  contrast        estimate     SE df t.ratio p.value
##  younger - older   0.0576 0.0357 60   1.614  0.1119
## 
## Degrees-of-freedom method: kenward-roger
emm_dic_by_age <- emmeans(lme_comp_BV, ~ Predictor | Group)
pairs(emm_dic_by_age, adjust = "holm")
## Group = younger:
##  contrast     estimate     SE df t.ratio p.value
##  Blob - VADER  -0.1181 0.0357 30  -3.308  0.0024
## 
## Group = older:
##  contrast     estimate     SE df t.ratio p.value
##  Blob - VADER   0.0142 0.0357 30   0.399  0.6929
## 
## Degrees-of-freedom method: kenward-roger