Purpose and analysis overview

This document reproduces the full pipeline used to (1) preprocess narrative text, (2) extract sentiment scores, (3) evaluate multiple sentiment dictionaries as predictors of ordinal valence ratings using supervised machine learning, and (4) statistically compare dictionary performance across age groups.

At a high level, the pipeline has five stages:

  1. Text preprocessing (quanteda): convert raw text into tokens and a document-feature matrix (DFM).

  2. Sentiment scoring: compute dictionary-based sentiment features from the text.

  3. Supervised machine learning (ctree): train classifiers separately by age group to predict the ordinal outcome.

  4. Group-level inference: compare model scores across predictors and age using ANOVA and mixed-effects models.

  5. Visualization: plot class balance, interaction plots, and radar plots summarizing performance.
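
The code in the remaining sections assumes that the required packages are attached in a setup chunk (not shown here). The list below is a minimal sketch inferred from the functions that are called; the exact setup may differ.

library(readr)              # read_csv(), write_csv()
library(janitor)            # clean_names()
library(dplyr)              # pipes, mutate(), group_by(), count()
library(tidyr)              # pivot_longer(), drop_na()
library(quanteda)           # corpus(), tokens(), dfm(), convert()
library(textstem)           # lemmatize_words()
library(SentimentAnalysis)  # analyzeSentiment()
library(ggplot2)            # bar charts and interaction plots
library(caret)              # createDataPartition(), confusionMatrix()
library(partykit)           # ctree()
library(lmerTest)           # lmer() with Satterthwaite tests (loads lme4)
library(emmeans)            # estimated marginal means
library(knitr)              # kable()
library(kableExtra)         # kable_styling()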

Load and clean the raw dataset

data <- read_csv("db_online.csv", show_col_types = FALSE)
data <- clean_names(data)
# Corpus ------------------------------------------------------------------
# Create a corpus from the text column
corpus_data <- quanteda::corpus(data, text_field = "text")
# Load custom stopwords
custom_stopwords <- tolower(trimws(readLines("custom_stopwords.txt")))
# Tokenize, lowercase, remove stopwords, and lemmatize
tokens_data <- quanteda::tokens(corpus_data,
                                remove_punct = TRUE,
                                remove_symbols = TRUE,
                                remove_numbers = TRUE) %>%
  quanteda::tokens_tolower() %>%
  quanteda::tokens_select(pattern = custom_stopwords, selection = "remove")

tokens_lem <- tokens_data %>%
  as.list() %>%
  lapply(textstem::lemmatize_words, language = "en") %>%
  quanteda::tokens()
# Create and export DFM
dfm_clean <- quanteda::dfm(tokens_lem)
dfm_df <- quanteda::convert(dfm_clean, to = "data.frame")
write_csv(dfm_df, "tokens.csv")

Sentiment Analysis

# Sentiment_analysis ------------------------------------------------------
texts <- data$text
sentiment_results <- SentimentAnalysis::analyzeSentiment(texts)
results <- cbind(data, sentiment_results)
write.csv(results, "quanteda_sentiment.csv", row.names = FALSE)
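
As an optional sanity check (not part of the original pipeline), the dictionary columns returned by analyzeSentiment() can be inspected directly; the column names below are those produced by the SentimentAnalysis package before clean_names() converts them to snake case (SentimentGI becomes sentiment_gi, and so on).

# Optional check: dictionary-based columns used downstream
head(sentiment_results[, c("WordCount", "SentimentGI", "SentimentHE", "SentimentQDAP")])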

Supervised Machine Learning

# SML Analysis (split by age)
sent <- read.csv("quanteda_sentiment.csv") %>% clean_names()
blob <- read.csv("blob.csv") %>% clean_names()
vader <- read.csv("vader.csv") %>% clean_names()
sent$polarity_blob <- blob$blob_polarity
sent$compound_vader <- vader$compound
sent$emotions <- factor(sent$valence - 3, ordered = TRUE)
sent$age <- factor(sent$age, levels = c("younger", "older"))

# Normalize and compute total
sent <- sent %>%
  mutate(sentiment_gi = sentiment_gi / word_count,
         sentiment_he = sentiment_he / word_count,
         sentiment_qdap = sentiment_qdap / word_count,
         sentiment_blob = polarity_blob / word_count,
         sentiment_vader = compound_vader / word_count,
         sentiment_total = rowSums(across(starts_with("sentiment_")), na.rm = TRUE))

# Count how many responses were in each emotion class
class_counts <- sent %>%
  count(age, emotions)


ggplot(class_counts, aes(x = emotions, y = n, fill = age)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = n),
            position = position_dodge(width = 0.9),
            vjust = -0.5,
            size = 4) +
  labs(
    title = "Class Distribution Before Oversampling by Age Group",
    x = "Emotion Class",
    y = "Number of Responses",
    fill = "Age Group"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(hjust = 0.5)
  )

Data Sampling and Weights

For the age-stratified analyses, the dataset was first subset by age group so that all modelling procedures were conducted separately for younger and older participants. Within each age group, the ordinal outcome variable (emotions) was explicitly coded as an ordered factor to preserve the inherent ranking of the response categories. To address class imbalance, we implemented class-balanced resampling with replacement. Specifically, the frequency of the most prevalent emotion category within each age group was identified, and all other categories were oversampled to match this target frequency. This approach ensured equal representation of each emotion class within the training data, thereby reducing bias toward majority classes.

Following oversampling, class weights were computed to further stabilise model estimation. Inverse-frequency weights were first calculated based on the balanced class counts to prioritise underrepresented categories. To prevent extreme weight inflation, a logarithmic transformation was applied to these inverse weights. The transformed weights were then scaled to maintain proportionality relative to the total sample size and adjusted to align with the oversampled target frequency. Finally, weights were constrained to a minimum value of one and rounded to integers for compatibility with modelling procedures. This combined strategy of stratified analysis, oversampling, and weight adjustment was implemented to ensure robust and comparable model performance across age groups while preserving the ordinal structure of the outcome variable.

# Updated run_age_group_analysis to retain participant ID
run_age_group_analysis <- function(df, group_label) {
  cat("\nRunning analysis for:", group_label, "\n")
  df <- df[df$age == group_label, ]
  df$emotions <- factor(df$emotions, ordered = TRUE)
  
  # Oversample every class (with replacement) up to the size of the largest class
  target_freq <- max(table(df$emotions))
  df_bal <- df %>% group_by(emotions) %>% sample_n(target_freq, replace = TRUE) %>% ungroup()
  
  # Class weights: inverse frequency, log-damped, rescaled to the sample size,
  # adjusted to the oversampling target, then floored at 1 and rounded to integers
  class_counts <- table(df_bal$emotions)
  inv_weights <- 1 / class_counts
  log_weights <- log(1 + inv_weights)
  scaled_weights <- (log_weights / sum(log_weights)) * length(df_bal$emotions)
  adj_weights <- scaled_weights[as.character(df_bal$emotions)] * (target_freq / mean(scaled_weights))
  adj_weights <- as.integer(round(pmax(adj_weights, 1)))
  
  
  # Summarise the adjusted weights actually assigned to each class
  adj_summary <- df_bal %>%
    mutate(weight = adj_weights) %>%
    group_by(emotions) %>%
    summarise(mean_weight = mean(weight), sd_weight = sd(weight), .groups = "drop")
  
  #print(adj_summary)
  # Stratified 80/20 train/test split (fixed seed for reproducibility);
  # participant IDs are kept for the test set and weights are subset to the training rows
  set.seed(123)
  train_idx <- createDataPartition(df_bal$emotions, p = 0.8, list = FALSE)
  train <- df_bal[train_idx, ]
  test <- df_bal[-train_idx, ]
  test_participant <- test$participant
  train_weights <- adj_weights[train_idx]
  
  # Macro-averaged classification metrics derived from the confusion matrix
  eval_metrics <- function(preds, actuals) {
    cm <- confusionMatrix(preds, actuals)
    tab <- as.matrix(cm$table)
    TN <- sapply(1:nrow(tab), function(i) sum(tab[-i, -i]))
    FP <- colSums(tab) - diag(tab)
    FN <- rowSums(tab) - diag(tab)
    TP <- diag(tab)
    precision <- TP / (TP + FP)
    recall <- TP / (TP + FN)
    specificity <- TN / (TN + FP)
    f1 <- 2 * (precision * recall) / (precision + recall)
    fpr <- FP / (FP + TN)
    fnr <- FN / (FN + TP)
    data.frame(
      Accuracy = sum(TP) / sum(tab),
      Precision = mean(precision, na.rm = TRUE),
      Recall = mean(recall, na.rm = TRUE),
      F1 = mean(f1, na.rm = TRUE),
      Specificity = mean(specificity, na.rm = TRUE),
      FPR = mean(fpr, na.rm = TRUE),
      FNR = mean(fnr, na.rm = TRUE)
    )
  }
  
  # Fit a weighted conditional inference tree for a single predictor and
  # evaluate it on the held-out test set (participant IDs retained)
  train_model <- function(pred, name) {
    m <- ctree(emotions ~ ., data = train[, c("emotions", pred)], weights = train_weights)
    preds <- predict(m, newdata = test, type = "response")
    test_df <- data.frame(predicted = preds, actual = test$emotions, participant = test$participant)
    metrics <- eval_metrics(test_df$predicted, test_df$actual)
    cbind(metrics, Predictor = name, Group = group_label, participant = test_df$participant)
  }
  
  bind_rows(
    train_model("sentiment_gi", "GI"),
    train_model("sentiment_he", "HE"),
    train_model("sentiment_qdap", "QDAP"),
    train_model("sentiment_blob", "Blob"),
    train_model("sentiment_vader", "VADER"),
    train_model("sentiment_total", "Total")
  )
}

# Run analysis
metrics_younger <- run_age_group_analysis(sent, "younger")
## 
## Running analysis for: younger
metrics_older <- run_age_group_analysis(sent, "older")
## 
## Running analysis for: older

Conditional Trees

After computing adjusted class weights, the balanced dataset was randomly partitioned into training (80%) and test (20%) subsets using stratified sampling to preserve the proportional distribution of the ordinal outcome across splits. A fixed random seed was set to ensure reproducibility of the partitioning procedure. Participant identifiers were retained in the test set to allow for participant-level tracking of predictions. The adjusted weights derived from the resampling procedure were applied to the training data to maintain class balance during model estimation.

Model performance was evaluated using a custom function that derived classification metrics from the confusion matrix. For each model, overall accuracy was calculated as the proportion of correctly classified observations. In addition, macro-averaged precision, recall, F1 score, and specificity were computed across classes to ensure that performance estimates were not disproportionately influenced by any single category. False positive rate (FPR) and false negative rate (FNR) were also calculated to provide complementary information about error patterns. These metrics were averaged across classes to yield a balanced evaluation of multi-class ordinal classification performance.
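
As a toy illustration of how the macro-averaged metrics are derived from a confusion matrix (hypothetical counts only, not study data; rows are taken as actual classes and columns as predicted classes):

# Hypothetical 3-class confusion matrix (rows = actual, columns = predicted)
tab <- matrix(c(10, 2, 1,
                 3, 8, 2,
                 1, 2, 9),
              nrow = 3, byrow = TRUE)
TP <- diag(tab)
FP <- colSums(tab) - TP   # predicted as the class but actually another class
FN <- rowSums(tab) - TP   # actually the class but predicted as another class
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
f1        <- 2 * precision * recall / (precision + recall)
c(Accuracy = sum(TP) / sum(tab),
  Precision = mean(precision), Recall = mean(recall), F1 = mean(f1))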

Separate conditional inference tree models (ctree) were trained for each sentiment predictor (GI, HE, QDAP, Blob, VADER, and a composite Total score), with emotions specified as the ordered outcome variable. Each model was estimated using the weighted training data and subsequently used to generate predictions for the held-out test set. Performance metrics were computed for each predictor independently, and results were combined into a single output table containing classification indices, predictor labels, age group identifiers, and participant-level prediction information. This approach enabled direct comparison of dictionary-based sentiment measures within each age group while maintaining consistent evaluation procedures across models.
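
For inspecting an individual fitted tree, partykit provides a plot method. The snippet below is an illustrative sketch only (the original analysis stores the metrics, not the fitted tree objects):

# Illustrative only: fit and plot one conditional inference tree
# for a single dictionary within one age group
tree_example <- partykit::ctree(emotions ~ sentiment_blob,
                                data = subset(sent, age == "older"))
plot(tree_example)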

Metrics

# Combine and reshape for comparison
participant_metrics_all <- bind_rows(metrics_younger, metrics_older)

# Filter and reshape to long format
metrics_filtered <- participant_metrics_all %>%
  filter(Predictor != "Total") %>%
  select(participant, Group, Predictor, Accuracy, Precision, Recall, F1, Specificity)

metrics_long <- metrics_filtered %>%
  pivot_longer(cols = c(Accuracy, Precision, Recall, F1, Specificity),
               names_to = "Metric", values_to = "Score") %>%
  drop_na()

# Ensure participants have complete data for each metric
complete_participants <- metrics_long %>%
  group_by(participant, Metric) %>%
  summarise(n_dict = n_distinct(Predictor), .groups = "drop") %>%
  filter(n_dict == 5) %>%
  pull(participant) %>%
  unique()

metrics_balanced <- metrics_long %>%
  filter(participant %in% complete_participants)

write_csv(metrics_balanced, "participant_metric_scores.csv")

# 1) Summarise metrics + Total_Metric, rounded, split into two tables ---
summary_by_group <- participant_metrics_all %>%
  filter(Predictor != "Total") %>%
  group_by(Group, Predictor) %>%
  summarise(
    Accuracy    = mean(Accuracy, na.rm = TRUE),
    Precision   = mean(Precision, na.rm = TRUE),
    Recall      = mean(Recall, na.rm = TRUE),
    F1          = mean(F1, na.rm = TRUE),
    Specificity = mean(Specificity, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(
    Total_Metric = rowMeans(
      dplyr::select(., Accuracy, Precision, Recall, F1, Specificity),
      na.rm = TRUE
    )
  ) %>%
  mutate(across(where(is.numeric), ~ round(.x, 3)))

# Create one table per age group, sorted by Total_Metric (descending) ---
table_younger <- summary_by_group %>%
  filter(Group == "younger") %>%
  arrange(desc(Total_Metric)) %>%
  select(Predictor, Accuracy, Precision, Recall, F1, Specificity, Total_Metric)

table_older <- summary_by_group %>%
  filter(Group == "older") %>%
  arrange(desc(Total_Metric)) %>%
  select(Predictor, Accuracy, Precision, Recall, F1, Specificity, Total_Metric)

# Print tables ---
kable(table_younger, caption = "Average performance metrics by dictionary (Younger adults)") %>%
  kable_styling(full_width = FALSE)
Average performance metrics by dictionary (Younger adults)

Predictor   Accuracy   Precision   Recall   F1      Specificity   Total_Metric
Blob        0.589      0.589       0.719    0.604   0.901         0.681
VADER       0.572      0.572       0.660    0.550   0.901         0.651
GI          0.589      0.589       0.576    0.577   0.899         0.646
QDAP        0.586      0.586       0.581    0.566   0.901         0.644
HE          0.263      0.263       0.667    0.179   0.849         0.444
kable(table_older, caption = "Average performance metrics by dictionary (Older adults)") %>%
  kable_styling(full_width = FALSE)
Average performance metrics by dictionary (Older adults)

Predictor   Accuracy   Precision   Recall   F1      Specificity   Total_Metric
Blob        0.751      0.751       0.757    0.751   0.938         0.790
VADER       0.719      0.719       0.734    0.716   0.931         0.764
GI          0.589      0.589       0.600    0.564   0.903         0.649
QDAP        0.581      0.581       0.587    0.568   0.898         0.643
HE          0.392      0.392       0.567    0.381   0.859         0.518

Mixed Effects Model

To examine differences in model performance across age groups and sentiment dictionaries, participant-level classification metrics were first combined across younger and older samples into a single dataset. Performance indices (Accuracy, Precision, Recall, F1, and Specificity) were then reshaped into long format to facilitate mixed-effects modelling, with each row representing a single metric score for a given participant, predictor, and age group. Cases with missing values were removed to ensure valid estimation.

To maintain balanced comparisons across dictionaries, participants were retained only if they had complete data for all five sentiment predictors within each performance metric. This step ensured that differences across predictors were evaluated within the same individuals, thereby reducing bias introduced by unequal representation. The resulting balanced dataset was exported for reproducibility and transparency of analysis.

Linear mixed-effects models were estimated using restricted maximum likelihood (REML). Performance score served as the dependent variable, with Age Group, Predictor, and their interaction specified as fixed effects. A random intercept for participant was included to account for repeated observations within individuals and to model between-participant variability in overall performance. Model significance was evaluated using analysis of variance tables and fixed-effect summaries. To directly compare the two strongest-performing dictionaries, a second mixed-effects model was estimated on a subset including only the Blob and VADER predictors, using the same fixed- and random-effects structure. This approach allowed for a within-participant comparison of dictionary performance across age groups while appropriately accounting for dependency in repeated measurements.

lmm <- lmer(Score ~ Group * Predictor + (1|participant), data = metrics_balanced)
## boundary (singular) fit: see help('isSingular')
anova(lmm)
## Type III Analysis of Variance Table with Satterthwaite's method
##                  Sum Sq Mean Sq NumDF DenDF F value    Pr(>F)    
## Group            14.264 14.2643     1 16365  692.63 < 2.2e-16 ***
## Predictor       125.024 31.2561     4 16365 1517.70 < 2.2e-16 ***
## Group:Predictor   9.998  2.4994     4 16365  121.36 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lmm)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: Score ~ Group * Predictor + (1 | participant)
##    Data: metrics_balanced
## 
## REML criterion at convergence: -17025.2
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.8462 -0.5303 -0.3916  0.2662  2.8207 
## 
## Random effects:
##  Groups      Name        Variance Std.Dev.
##  participant (Intercept) 0.00000  0.0000  
##  Residual                0.02059  0.1435  
## Number of obs: 16375, groups:  participant, 127
## 
## Fixed effects:
##                               Estimate Std. Error         df t value Pr(>|t|)
## (Intercept)                  7.898e-01  3.336e-03  1.636e+04 236.727  < 2e-16
## Groupyounger                -1.092e-01  5.058e-03  1.636e+04 -21.596  < 2e-16
## PredictorGI                 -1.409e-01  4.718e-03  1.636e+04 -29.868  < 2e-16
## PredictorHE                 -2.716e-01  4.718e-03  1.636e+04 -57.569  < 2e-16
## PredictorQDAP               -1.468e-01  4.718e-03  1.636e+04 -31.107  < 2e-16
## PredictorVADER              -2.613e-02  4.718e-03  1.636e+04  -5.539 3.09e-08
## Groupyounger:PredictorGI     1.067e-01  7.153e-03  1.636e+04  14.912  < 2e-16
## Groupyounger:PredictorHE     3.529e-02  7.153e-03  1.636e+04   4.933 8.16e-07
## Groupyounger:PredictorQDAP   1.102e-01  7.153e-03  1.636e+04  15.399  < 2e-16
## Groupyounger:PredictorVADER -3.611e-03  7.153e-03  1.636e+04  -0.505    0.614
##                                
## (Intercept)                 ***
## Groupyounger                ***
## PredictorGI                 ***
## PredictorHE                 ***
## PredictorQDAP               ***
## PredictorVADER              ***
## Groupyounger:PredictorGI    ***
## Groupyounger:PredictorHE    ***
## Groupyounger:PredictorQDAP  ***
## Groupyounger:PredictorVADER    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) Grpyng PrdcGI PrdcHE PrQDAP PVADER Gr:PGI Gr:PHE G:PQDA
## Groupyoungr -0.660                                                        
## PredictorGI -0.707  0.466                                                 
## PredictorHE -0.707  0.466  0.500                                          
## PredctrQDAP -0.707  0.466  0.500  0.500                                   
## PrdctrVADER -0.707  0.466  0.500  0.500  0.500                            
## Grpyngr:PGI  0.466 -0.707 -0.660 -0.330 -0.330 -0.330                     
## Grpyngr:PHE  0.466 -0.707 -0.330 -0.660 -0.330 -0.330  0.500              
## Grpyn:PQDAP  0.466 -0.707 -0.330 -0.330 -0.660 -0.330  0.500  0.500       
## Grpy:PVADER  0.466 -0.707 -0.330 -0.330 -0.330 -0.660  0.500  0.500  0.500
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')
# Subset only Blob and VADER for mixed model
metrics_subset <- metrics_balanced %>%
  filter(Predictor %in% c("Blob", "VADER"))

# Run linear mixed-effects model
lmm_blob_vader <- lmer(Score ~ Group * Predictor + (1 | participant), data = metrics_subset)
## boundary (singular) fit: see help('isSingular')
# Output ANOVA and summary
anova(lmm_blob_vader)
## Type III Analysis of Variance Table with Satterthwaite's method
##                  Sum Sq Mean Sq NumDF DenDF   F value Pr(>F)    
## Group           19.8494 19.8494     1  6546 1905.5465 <2e-16 ***
## Predictor        1.2568  1.2568     1  6546  120.6562 <2e-16 ***
## Group:Predictor  0.0052  0.0052     1  6546    0.5039 0.4778    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lmm_blob_vader)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: Score ~ Group * Predictor + (1 | participant)
##    Data: metrics_subset
## 
## REML criterion at convergence: -11271.9
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -0.99292 -0.74572 -0.37704  0.08869  2.45083 
## 
## Random effects:
##  Groups      Name        Variance Std.Dev.
##  participant (Intercept) 0.00000  0.0000  
##  Residual                0.01042  0.1021  
## Number of obs: 6550, groups:  participant, 127
## 
## Fixed effects:
##                               Estimate Std. Error         df t value Pr(>|t|)
## (Intercept)                  7.898e-01  2.373e-03  6.546e+03 332.856  < 2e-16
## Groupyounger                -1.092e-01  3.597e-03  6.546e+03 -30.365  < 2e-16
## PredictorVADER              -2.613e-02  3.356e-03  6.546e+03  -7.788 7.86e-15
## Groupyounger:PredictorVADER -3.611e-03  5.087e-03  6.546e+03  -0.710    0.478
##                                
## (Intercept)                 ***
## Groupyounger                ***
## PredictorVADER              ***
## Groupyounger:PredictorVADER    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) Grpyng PVADER
## Groupyoungr -0.660              
## PrdctrVADER -0.707  0.466       
## Grpy:PVADER  0.466 -0.707 -0.660
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')
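
Both models report a singular fit because the participant random-intercept variance is estimated at zero (see the Random effects tables above). An optional check, sketched below, confirms this directly:

# Optional check: confirm the singular fits and inspect the variance components
lme4::isSingular(lmm, tol = 1e-4)
VarCorr(lmm)   # participant intercept variance is (effectively) zero
lme4::isSingular(lmm_blob_vader, tol = 1e-4)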

A linear mixed-effects model was conducted to examine whether classification performance differed as a function of age group, sentiment dictionary, and their interaction, with participant included as a random intercept. The analysis revealed significant main effects of age group, F(1, 16365) = 692.63, p < .001, and predictor, F(4, 16365) = 1517.70, p < .001, qualified by a significant Age Group × Predictor interaction, F(4, 16365) = 121.36, p < .001. Younger participants showed lower performance scores than older participants for the reference dictionary, Blob (β = −0.11, SE = 0.005, p < .001). Relative to Blob, GI (β = −0.14), HE (β = −0.27), QDAP (β = −0.15), and VADER (β = −0.03) all showed significantly lower performance (ps < .001). The Age Group × Predictor interaction terms were significant for GI, HE, and QDAP (ps < .001) but not for VADER (p = .614), indicating that the magnitude of the age-group difference in performance depended on the dictionary.
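
One way to probe this interaction, sketched here with the emmeans package (not part of the original output), is to compare dictionaries within each age group:

# Illustrative sketch: pairwise dictionary comparisons within each age group
emm_full <- emmeans::emmeans(lmm, ~ Predictor | Group)
pairs(emm_full, adjust = "tukey")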

A second linear mixed-effects model was conducted to compare the performance of Blob and VADER across age groups, with participant included as a random intercept. The analysis revealed significant main effects of age group, F(1, 6546) = 1905.55, p < .001, and predictor, F(1, 6546) = 120.66, p < .001, with no significant Age Group × Predictor interaction, F(1, 6546) = 0.50, p = .478. Blob yielded higher performance scores than VADER (β = −0.026, SE = 0.003, p < .001), and younger adults showed lower overall performance than older adults (β = −0.11, SE = 0.004, p < .001). The non-significant interaction (β = −0.004, SE = 0.005, p = .478) indicates that Blob's advantage over VADER was of similar size in both age groups. Thus, even between the two strongest-performing dictionaries, performance differed reliably, but this difference was not moderated by age group.

Interaction Plots

Estimated marginal means of Score were computed with emmeans for each Predictor × Group combination and used for the interaction plots (output columns: Predictor, Group, emmean, SE, df, asymp.LCL, asymp.UCL). Because the number of observations exceeds the default pbkrtest/lmerTest limits (6550 observations in the Blob/VADER model), degrees-of-freedom adjustments were disabled and asymptotic confidence limits are reported.
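
The plotting code was not echoed in the original document. The sketch below shows one way the interaction plot could be reproduced from the emmeans output (column names as listed above), assuming ggplot2 is available:

# Illustrative sketch: interaction plot of estimated marginal means
emm_bv <- emmeans::emmeans(lmm_blob_vader, ~ Predictor * Group)
emm_df <- as.data.frame(emm_bv)  # Predictor, Group, emmean, SE, df, asymp.LCL, asymp.UCL

ggplot(emm_df, aes(x = Predictor, y = emmean, colour = Group, group = Group)) +
  geom_point(size = 3, position = position_dodge(width = 0.3)) +
  geom_line(position = position_dodge(width = 0.3)) +
  geom_errorbar(aes(ymin = asymp.LCL, ymax = asymp.UCL),
                width = 0.2, position = position_dodge(width = 0.3)) +
  labs(y = "Estimated marginal mean score",
       title = "Dictionary x Age Group (Blob vs. VADER)") +
  theme_minimal(base_size = 14)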