Text data processing is required to transform the raw responses into document-term matrices (DTM) that are later used for the sentiment analysis. This processing was the same for all dictionaries, since every analysis was run on the final output of this section.
library(readr)   # read_csv(), write_csv()
library(janitor) # clean_names()
data <- read_csv("db_online.csv")
data <- clean_names(data)
The database contains each participant's responses. The responses are in their raw state, without any pre-processing, and all text data appears in the text column.
corpus_data <- quanteda::corpus(data, text_field = "text")
A custom, pre-compiled list of stopwords was used to tailor the analysis. This list controls the text filtering and prevents relevant information from being lost through conventional stopword removal.
custom_stopwords <- tolower(trimws(readLines("custom_stopwords.txt")))
The text data were tokenized, lemmatized, and filtered to remove punctuation, symbols, numbers, and the custom stopwords. This processing was done with the quanteda package, with lemmatization provided by textstem.
library(dplyr)  # provides the %>% pipe used in this chunk
tokens_data <- quanteda::tokens(corpus_data,
                                remove_punct = TRUE,
                                remove_symbols = TRUE,
                                remove_numbers = TRUE) %>%
  quanteda::tokens_tolower() %>%
  quanteda::tokens_select(pattern = custom_stopwords, selection = "remove")
# Lemmatize token by token with textstem (the English lemma dictionary is the default)
tokens_lem <- tokens_data %>%
  as.list() %>%
  lapply(textstem::lemmatize_words) %>%
  quanteda::tokens()
A final output file (tokens.csv) is generated containing each participant's information and their processed text responses.
dfm_clean <- quanteda::dfm(tokens_lem)
dfm_df <- quanteda::convert(dfm_clean, to = "data.frame")
write_csv(dfm_df, "tokens.csv")
For the GI, HE, and QDAP dictionaries, we used the SentimentAnalysis package. A final output with the sentiment score for each dictionary was generated at the end.
library(SentimentAnalysis)
texts <- data$text
sentiment_results <- SentimentAnalysis::analyzeSentiment(texts)
results <- cbind(data, sentiment_results)
write.csv(results, "quanteda_sentiment.csv", row.names = FALSE)
For the TextBlob and VADER dictionaries, the analysis was done in Python; each generated a .csv file similar to the one produced in the previous analysis.
For the SML analysis, we started by creating a single database containing the scores from the five dictionaries and the human scoring. The human scores were rescaled to a -2 to 2 scale to match the polarity conventions used by the dictionaries. The database also includes a variable indicating each participant's age group.
# SML Analysis (split by age)
sent <- read.csv("quanteda_sentiment.csv") %>% clean_names()
blob <- read.csv("blob.csv") %>% clean_names()
vader <- read.csv("vader.csv") %>% clean_names()
sent$polarity_blob <- blob$blob_polarity
sent$compound_vader <- vader$compound
sent$emotions <- factor(sent$valence - 3, ordered = TRUE)
sent$age <- factor(sent$age, levels = c("younger", "older"))
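Because the TextBlob and VADER scores are attached by row position rather than joined on a participant identifier, a quick alignment check helps guard against silently mismatched rows. A minimal sketch, assuming the three files share the same row order:

```r
# Sanity check (sketch): the three files are assumed to be row-aligned
stopifnot(
  nrow(blob)  == nrow(sent),
  nrow(vader) == nrow(sent)
)
```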
All sentiment scores were normalized by the word count of each narration to control for verbosity. A total sentiment score (sentiment_total) was also calculated for exploratory purposes (a sketch of that step follows the chunk below).
# Normalize each dictionary score by the word count of the narration
sent <- sent %>%
  mutate(
    sentiment_gi    = sentiment_gi / word_count,
    sentiment_he    = sentiment_he / word_count,
    sentiment_qdap  = sentiment_qdap / word_count,
    sentiment_blob  = polarity_blob / word_count,
    sentiment_vader = compound_vader / word_count
  )
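The sentiment_total variable mentioned above is not computed in the chunk itself. A minimal sketch, assuming it is simply the sum of the five normalized dictionary scores (the original definition is not shown), would be:

```r
# Hypothetical reconstruction: sentiment_total as the sum of the five
# normalized dictionary scores (the original definition is not shown)
sent <- sent %>%
  mutate(
    sentiment_total = sentiment_gi + sentiment_he + sentiment_qdap +
      sentiment_blob + sentiment_vader
  )
```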
We explored class imbalance in the human scoring, since imbalance (toward positive scores in this case) could bias the SML analysis by inflating performance for the majority class.
A frequency plot of the human ratings confirms that the scoring was indeed skewed toward positive values (a sketch for reproducing it is shown below).
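The plot is not reproduced in this section; a minimal sketch for generating it from the rescaled emotions variable is:

```r
library(ggplot2)
# Sketch: distribution of human-rated emotion classes (rescaled to -2..2)
ggplot(sent, aes(x = emotions)) +
  geom_bar() +
  labs(x = "Human emotion rating (-2 to 2)", y = "Count")
```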
The function eval_metrics() evaluates model predictions
by deriving performance metrics from the confusion matrix. First, a
confusion matrix is computed using
caret::confusionMatrix(preds, actuals), which
cross-tabulates predicted emotion classes against the true classes. The
contingency table underlying the confusion matrix is extracted and
converted to a numeric matrix (tab) to allow manual
computation of performance indices.
For each emotion class, the confusion matrix is decomposed into its four fundamental components: true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN). True positives are obtained from the diagonal of the confusion matrix (diag(tab)), representing cases in which a class is correctly predicted. False positives are calculated as the number of observations incorrectly assigned to a given class, defined as the column sum minus the true positives (colSums(tab) - diag(tab)). False negatives correspond to observations belonging to a given class that were misclassified as another class and are computed as the row sum minus the true positives (rowSums(tab) - diag(tab)). True negatives are defined as all remaining observations, i.e. those falling outside both the row and the column of the target class, computed as sum(tab[-i, -i]).
Using these quantities, standard performance metrics are computed separately for each class. Precision is defined as the proportion of predicted instances of a class that are correct (TP / (TP + FP)). Recall, or sensitivity, reflects the proportion of true instances of a class that are correctly identified (TP / (TP + FN)). Specificity quantifies the proportion of non-target instances that are correctly rejected (TN / (TN + FP)). The F1 score is computed as the harmonic mean of precision and recall, capturing the balance between these two measures. Finally, false positive rate (FPR) and false negative rate (FNR) are computed to quantify error tendencies, defined respectively as FP / (FP + TN) and FN / (FN + TP).
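For reference, the per-class definitions described above can be written compactly (the reported values are the macro-averages, i.e. unweighted means over classes, with overall accuracy computed as the sum of the diagonal divided by the total count):

$$
\mathrm{Precision}_i = \frac{TP_i}{TP_i + FP_i}, \qquad
\mathrm{Recall}_i = \frac{TP_i}{TP_i + FN_i}, \qquad
\mathrm{Specificity}_i = \frac{TN_i}{TN_i + FP_i}
$$

$$
F1_i = \frac{2\,\mathrm{Precision}_i \cdot \mathrm{Recall}_i}{\mathrm{Precision}_i + \mathrm{Recall}_i}, \qquad
\mathrm{FPR}_i = \frac{FP_i}{FP_i + TN_i}, \qquad
\mathrm{FNR}_i = \frac{FN_i}{FN_i + TP_i}
$$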
# Helper: evaluation metrics (must be defined BEFORE the main function)
eval_metrics <- function(preds, actuals, predictor = NA, group = NA) {
cm <- caret::confusionMatrix(preds, actuals)
tab <- as.matrix(cm$table)
# Per-class components
TN <- sapply(seq_len(nrow(tab)), function(i) sum(tab[-i, -i]))
FP <- colSums(tab) - diag(tab)
FN <- rowSums(tab) - diag(tab)
TP <- diag(tab)
# Per-class rates
precision <- TP / (TP + FP)
recall <- TP / (TP + FN)
specificity <- TN / (TN + FP)
f1 <- 2 * (precision * recall) / (precision + recall)
fpr <- FP / (FP + TN)
fnr <- FN / (FN + TP)
# Macro-averaged metrics (equal weight to each class)
metrics <- data.frame(
Accuracy = sum(TP) / sum(tab),
Precision = mean(precision, na.rm = TRUE),
Recall = mean(recall, na.rm = TRUE),
F1 = mean(f1, na.rm = TRUE),
Specificity = mean(specificity, na.rm = TRUE),
FPR = mean(fpr, na.rm = TRUE),
FNR = mean(fnr, na.rm = TRUE)
)
list(metrics = metrics, confusion = cm$table, Predictor = predictor, Group = group)
}
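A quick toy run (a sketch with hypothetical labels on the -2 to 2 scale) illustrates the expected output structure:

```r
# Toy usage sketch for eval_metrics(); the labels below are hypothetical
lv <- c("-2", "-1", "0", "1", "2")
actuals <- factor(c("1", "0", "2", "-1", "1", "0"), levels = lv)
preds   <- factor(c("1", "1", "2", "-2", "0", "0"), levels = lv)
toy <- eval_metrics(preds, actuals, predictor = "toy", group = "demo")
toy$metrics    # one row of macro-averaged metrics
toy$confusion  # underlying 5 x 5 confusion matrix
```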
The function run_age_group_analysis() evaluates how well
each sentiment dictionary predicts human-rated emotional valence
separately for younger and older adults. The analysis is conducted
independently for each age group by subsetting the dataset to the
requested group (df <- df[df$age == group_label, ]) and
ensuring the outcome variable (emotions) is treated as an
ordered factor (factor(..., ordered = TRUE)).
To address class imbalance in the human ratings, the function first
computes class weights from the original,
pre-oversampling distribution of emotion categories.
Specifically, it tabulates the number of observations in each emotion
class (class_counts <- table(df$emotions)), computes
inverse-frequency weights
(inv_weights <- 1 / class_counts), and applies a log
transformation (log_weights <- log(1 + inv_weights)) to
reduce extreme penalties for rare classes. These class weights are then
normalised to have a mean of 1
(class_w <- log_weights / mean(log_weights)), which
stabilises the numerical scale of weights during model fitting.
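For illustration, the weighting scheme can be reproduced on a small, hypothetical class distribution:

```r
# Hypothetical class counts illustrating the inverse-frequency log weights
class_counts <- c(`-2` = 5, `-1` = 10, `0` = 20, `1` = 60, `2` = 30)
inv_weights  <- 1 / class_counts                 # rare classes get larger raw weights
log_weights  <- log(1 + inv_weights)             # dampen extreme penalties for rare classes
class_w      <- log_weights / mean(log_weights)  # normalise to a mean of ~1
round(class_w, 2)
```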
Next, the dataset is split into training and testing partitions
before any oversampling using stratified sampling
(createDataPartition(df$emotions, p = 0.8)). This ensures
that evaluation is performed on an untouched test set of genuinely
unseen human-rated texts, preventing information leakage that could
occur if duplicated observations appeared in both training and test
sets.
Oversampling is then applied only to the training
data to ensure equal representation of emotion categories
during model induction. The training set is resampled with replacement
within each emotion class until all classes match the frequency of the
most common class in the training data
(sample_n(target_freq, replace = TRUE)). After
oversampling, the previously computed class weights (derived from the
original data) are mapped onto the oversampled training rows
(train_weights <- class_w[...]). Weights are rounded and
constrained to be at least 1 to ensure all observations retain non-zero
influence during training.
Models are fitted using weighted conditional inference trees
(ctree) in a one-predictor-at-a-time framework. For each
sentiment dictionary (GI, HE, QDAP, TextBlob, and VADER), a separate
model is trained on the oversampled, weighted training set and then used
to predict emotion classes in the untouched test set. Performance is
evaluated using macro-averaged classification metrics computed by
eval_metrics(), including accuracy, precision, recall, F1,
specificity, false positive rate, and false negative rate.
Macro-averaging provides a single score per metric for each dictionary
by averaging class-specific values, ensuring each emotion category
contributes equally to performance estimates. Finally, the function
returns a combined table of metrics for all dictionaries within the
specified age group.
run_age_group_analysis <- function(df, group_label) {
cat("\nRunning analysis for:", group_label, "\n")
# 1) Subset to the requested age group
df <- df[df$age == group_label, ]
df$emotions <- factor(df$emotions, ordered = TRUE)
# 2) Compute class weights from ORIGINAL (pre-oversampling) distribution
class_counts <- table(df$emotions)
inv_weights <- 1 / class_counts
log_weights <- log(1 + inv_weights)
# Normalise for numerical stability (mean weight ~ 1)
class_w <- log_weights / mean(log_weights)
# 3) Train/test split on ORIGINAL data (pre-oversampling)
set.seed(123)
train_idx <- caret::createDataPartition(df$emotions, p = 0.8, list = FALSE)
train_raw <- df[train_idx, ]
test <- df[-train_idx, ]
# 4) Oversample TRAINING data only (to avoid leakage)
target_freq <- max(table(train_raw$emotions))
train_bal <- train_raw %>%
dplyr::group_by(emotions) %>%
dplyr::sample_n(target_freq, replace = TRUE) %>%
dplyr::ungroup()
# 5) Map ORIGINAL class weights onto the oversampled training set
train_weights <- class_w[as.character(train_bal$emotions)]
train_weights <- as.integer(round(pmax(train_weights, 1))) # enforce minimum weight = 1
# 6) Fit and evaluate one-predictor models
train_model <- function(pred, name) {
m <- partykit::ctree(
emotions ~ .,
data = train_bal[, c("emotions", pred)],
weights = train_weights
)
preds <- predict(m, newdata = test, type = "response")
# Macro-averaged metrics (from your existing eval_metrics function)
eval <- eval_metrics(preds, test$emotions, predictor = name, group = group_label)
cat(name, "metrics:\n")
print(eval$metrics)
cbind(Predictor = name, eval$metrics, Group = group_label)
}
# 7) Return combined results for this age group
do.call(rbind, list(
train_model("sentiment_gi", "GI"),
train_model("sentiment_he", "HE"),
train_model("sentiment_qdap", "QDAP"),
train_model("sentiment_blob", "Blob"),
train_model("sentiment_vader", "VADER")
))
}
# Run analysis for each age group
# ============================================================
metrics_younger <- run_age_group_analysis(sent, group_label = "younger")
##
## Running analysis for: younger
## GI metrics:
## Accuracy Precision Recall F1 Specificity FPR FNR
## 1 0.1575342 0.134281 0.140625 0.2009951 0.7927995 0.2072005 0.859375
## HE metrics:
## Accuracy Precision Recall F1 Specificity FPR FNR
## 1 0.06849315 0.2470588 0.1214065 0.1564799 0.8069631 0.1930369 0.8785935
## QDAP metrics:
## Accuracy Precision Recall F1 Specificity FPR FNR
## 1 0.1780822 0.2604248 0.1890814 0.1604167 0.8090176 0.1909824 0.8109186
## Blob metrics:
## Accuracy Precision Recall F1 Specificity FPR FNR
## 1 0.2808219 0.2777778 0.1860051 0.2860963 0.7961902 0.2038098 0.8139949
## VADER metrics:
## Accuracy Precision Recall F1 Specificity FPR FNR
## 1 0.2739726 0.3380117 0.2986486 0.2479238 0.8160976 0.1839024 0.7013514
metrics_older <- run_age_group_analysis(sent, group_label = "older")
##
## Running analysis for: older
## GI metrics:
## Accuracy Precision Recall F1 Specificity FPR FNR
## 1 0.1955307 0.1454987 0.1974907 0.2315084 0.8066298 0.1933702 0.8025093
## HE metrics:
## Accuracy Precision Recall F1 Specificity FPR FNR
## 1 0.1731844 0.08698135 0.2210227 0.2588076 0.8074854 0.1925146 0.7789773
## QDAP metrics:
## Accuracy Precision Recall F1 Specificity FPR FNR
## 1 0.122905 0.1439189 0.1337963 0.2173252 0.8059532 0.1940468 0.8662037
## Blob metrics:
## Accuracy Precision Recall F1 Specificity FPR FNR
## 1 0.396648 0.1950895 0.2898482 0.4558824 0.8099197 0.1900803 0.7101518
## VADER metrics:
## Accuracy Precision Recall F1 Specificity FPR FNR
## 1 0.2905028 0.2571987 0.1427884 0.2939394 0.8118885 0.1881115 0.8572116
# Compare all metrics (stack results)
# ============================================================
metrics_all <- rbind(metrics_younger, metrics_older)
Tables 1 and 2 show the results for younger and older adults, respectively.
library(knitr)
# Table: Younger group
kable(
metrics_younger,
caption = "Classification performance by dictionary (Younger group)",
digits = 3
)
| Predictor | Accuracy | Precision | Recall | F1 | Specificity | FPR | FNR | Group |
|---|---|---|---|---|---|---|---|---|
| GI | 0.158 | 0.134 | 0.141 | 0.201 | 0.793 | 0.207 | 0.859 | younger |
| HE | 0.068 | 0.247 | 0.121 | 0.156 | 0.807 | 0.193 | 0.879 | younger |
| QDAP | 0.178 | 0.260 | 0.189 | 0.160 | 0.809 | 0.191 | 0.811 | younger |
| Blob | 0.281 | 0.278 | 0.186 | 0.286 | 0.796 | 0.204 | 0.814 | younger |
| VADER | 0.274 | 0.338 | 0.299 | 0.248 | 0.816 | 0.184 | 0.701 | younger |
# Table: Older group
kable(
metrics_older,
caption = "Classification performance by dictionary (Older group)",
digits = 3
)
| Predictor | Accuracy | Precision | Recall | F1 | Specificity | FPR | FNR | Group |
|---|---|---|---|---|---|---|---|---|
| GI | 0.196 | 0.145 | 0.197 | 0.232 | 0.807 | 0.193 | 0.803 | older |
| HE | 0.173 | 0.087 | 0.221 | 0.259 | 0.807 | 0.193 | 0.779 | older |
| QDAP | 0.123 | 0.144 | 0.134 | 0.217 | 0.806 | 0.194 | 0.866 | older |
| Blob | 0.397 | 0.195 | 0.290 | 0.456 | 0.810 | 0.190 | 0.710 | older |
| VADER | 0.291 | 0.257 | 0.143 | 0.294 | 0.812 | 0.188 | 0.857 | older |
A composite performance score was computed to provide a single summary index of classification performance across the evaluation metrics. For each predictor, false positive rate (FPR) and false negative rate (FNR) were first reverse-coded so that higher values consistently reflected better performance. The composite score was then calculated as the mean of the seven metrics: accuracy, precision, recall, F1 score, specificity, reversed FPR, and reversed FNR. Finally, predictors were ranked by their composite scores, with higher scores indicating better overall performance and ties assigned the same (minimum) rank. This procedure was applied separately to the younger and older groups to allow within-group comparisons of predictor performance.
compute_composite_scores_mean <- function(metrics_df) {
df <- metrics_df
# Reverse cost metrics so that higher values indicate better performance
df$FPR_rev <- max(df$FPR, na.rm = TRUE) - df$FPR
df$FNR_rev <- max(df$FNR, na.rm = TRUE) - df$FNR
# Compute composite score as a simple average (equal weighting)
df$CompositeScore <- rowMeans(
df[, c(
"Accuracy",
"Precision",
"Recall",
"F1",
"Specificity",
"FPR_rev",
"FNR_rev"
)],
na.rm = TRUE
)
# Rank predictors (higher composite score = better performance)
df$Rank <- rank(-df$CompositeScore, ties.method = "min")
df
}
metrics_younger_comp <- compute_composite_scores_mean(metrics_younger)
metrics_older_comp <- compute_composite_scores_mean(metrics_older)
# Younger adults
metrics_younger_comp[
order(metrics_younger_comp$Rank),
c("Predictor", "CompositeScore", "Rank")
]
## Predictor CompositeScore Rank
## 5 VADER 0.3107421 1
## 4 Blob 0.2706972 2
## 3 QDAP 0.2401308 3
## 1 GI 0.2064933 4
## 2 HE 0.2020807 5
# Older adults
metrics_older_comp[
order(metrics_older_comp$Rank),
c("Predictor", "CompositeScore", "Rank")
]
## Predictor CompositeScore Rank
## 4 Blob 0.3296295 1
## 5 VADER 0.2587493 2
## 1 GI 0.2344328 3
## 2 HE 0.2337486 4
## 3 QDAP 0.2034141 5
Because each dictionary yielded a single performance estimate per age group, these aggregate results were used to characterise relative differences and rank dictionaries descriptively rather than to support statistical inference. Accordingly, aggregate comparisons reflect overall model behaviour and practical performance differences across dictionaries, but do not capture variability across individual participants.
While the previous analyses addressed the question "Which sentiment dictionaries perform better?", the participant-level analysis answers "Are these performance differences reliable across individuals?"
To assess whether observed differences in dictionary performance were
reliable across individuals, performance was additionally evaluated at
the participant level. Participant-level scores were computed by
comparing predicted and observed emotion ratings within each
participant’s narratives, yielding one composite performance score per
participant and dictionary. This approach preserves between-participant
variability and enables the use of inferential statistical models, such
as mixed-effects analyses, with participant treated as a random factor.
Participant-level analyses therefore allow performance differences
between dictionaries to be tested for consistency across individuals and
provide an appropriate basis for statistical inference beyond
descriptive model-level comparisons.
In both models, the random intercept for participant had an estimated
variance of zero, resulting in a singular fit. This indicates that,
after accounting for dictionary and age-group effects, there was no
remaining systematic between-participant variability in composite
agreement scores. Fixed-effect estimates are therefore interpreted as
population-level effects of dictionary reliability.
library(lme4)
library(lmerTest)
library(emmeans)
# 0) Detect participant ID column in `sent`
candidate_ids <- c("participant", "participant_id", "participantid", "participantID")
id_col <- candidate_ids[candidate_ids %in% names(sent)][1]
if (is.na(id_col)) {
stop(
"I couldn't find a participant ID column in `sent`.\n",
"Rename your ID column to one of: ",
paste(candidate_ids, collapse = ", "),
"\nOr set it manually: id_col <- 'your_column_name'"
)
}
# Ensure consistent types
sent[[id_col]] <- factor(sent[[id_col]])
sent$age <- factor(sent$age, levels = c("younger", "older"))
sent$emotions <- factor(sent$emotions, ordered = TRUE)
# 1) Per-participant metric helper (same macro-averaging logic)
eval_metrics_one_participant <- function(preds, actuals, all_levels) {
preds <- factor(preds, levels = all_levels, ordered = TRUE)
actuals <- factor(actuals, levels = all_levels, ordered = TRUE)
cm <- caret::confusionMatrix(preds, actuals)
tab <- as.matrix(cm$table)
TN <- sapply(seq_len(nrow(tab)), function(i) sum(tab[-i, -i]))
FP <- colSums(tab) - diag(tab)
FN <- rowSums(tab) - diag(tab)
TP <- diag(tab)
precision <- TP / (TP + FP)
recall <- TP / (TP + FN)
specificity <- TN / (TN + FP)
f1 <- 2 * (precision * recall) / (precision + recall)
fpr <- FP / (FP + TN)
fnr <- FN / (FN + TP)
data.frame(
Accuracy = sum(TP) / sum(tab),
Precision = mean(precision, na.rm = TRUE),
Recall = mean(recall, na.rm = TRUE),
F1 = mean(f1, na.rm = TRUE),
Specificity = mean(specificity, na.rm = TRUE),
FPR = mean(fpr, na.rm = TRUE),
FNR = mean(fnr, na.rm = TRUE)
)
}
eval_metrics_by_participant <- function(df_test, pred_col, actual_col, id_col) {
all_levels <- levels(df_test[[actual_col]])
df_test %>%
group_by(.data[[id_col]]) %>%
group_modify(~{
eval_metrics_one_participant(
preds = .x[[pred_col]],
actuals = .x[[actual_col]],
all_levels = all_levels
)
}) %>%
ungroup() %>%
rename(participant = 1)
}
# 2) Run the same training pipeline, but return test predictions
run_age_group_preds <- function(df, group_label, pred, name) {
df <- df[df$age == group_label, ]
df$emotions <- factor(df$emotions, ordered = TRUE)
# ---- grouped split by participant (NOT rows) ----
set.seed(123)
pids <- unique(df[[id_col]])
train_pids <- sample(pids, size = floor(0.8 * length(pids)), replace = FALSE)
train_raw <- df[df[[id_col]] %in% train_pids, ]
test <- df[!(df[[id_col]] %in% train_pids), ]
# class weights computed from the pre-oversampling TRAINING distribution (within this age group)
class_counts <- table(train_raw$emotions)
inv_weights <- 1 / class_counts
log_weights <- log(1 + inv_weights)
class_w <- log_weights / mean(log_weights)
# oversample TRAINING only
target_freq <- max(table(train_raw$emotions))
train_bal <- train_raw %>%
dplyr::group_by(emotions) %>%
dplyr::sample_n(target_freq, replace = TRUE) %>%
dplyr::ungroup()
train_weights <- class_w[as.character(train_bal$emotions)]
train_weights <- as.numeric(pmax(train_weights, 1)) # keep numeric (no rounding)
m <- partykit::ctree(
emotions ~ .,
data = train_bal[, c("emotions", pred), drop = FALSE],
weights = train_weights
)
test$pred <- predict(m, newdata = test, type = "response")
test %>%
transmute(
participant = .data[[id_col]],
Group = group_label,
Predictor = name,
actual = emotions,
pred = pred
)
}
# 3) Collect test predictions for each age group and dictionary
preds_younger <- bind_rows(
run_age_group_preds(sent, "younger", "sentiment_gi", "GI"),
run_age_group_preds(sent, "younger", "sentiment_he", "HE"),
run_age_group_preds(sent, "younger", "sentiment_qdap", "QDAP"),
run_age_group_preds(sent, "younger", "sentiment_blob", "Blob"),
run_age_group_preds(sent, "younger", "sentiment_vader", "VADER")
)
preds_older <- bind_rows(
run_age_group_preds(sent, "older", "sentiment_gi", "GI"),
run_age_group_preds(sent, "older", "sentiment_he", "HE"),
run_age_group_preds(sent, "older", "sentiment_qdap", "QDAP"),
run_age_group_preds(sent, "older", "sentiment_blob", "Blob"),
run_age_group_preds(sent, "older", "sentiment_vader", "VADER")
)
preds_all <- bind_rows(preds_younger, preds_older) %>%
mutate(
participant = factor(participant),
Group = factor(Group, levels = c("younger", "older")),
Predictor = factor(Predictor, levels = c("GI","HE","QDAP","Blob","VADER")),
actual = factor(actual, levels = levels(sent$emotions), ordered = TRUE),
pred = factor(pred, levels = levels(sent$emotions), ordered = TRUE)
)
# 4) Participant-level metrics per dictionary x group
metrics_by_participant <- preds_all %>%
group_by(Group, Predictor) %>%
group_modify(~{
eval_metrics_by_participant(
df_test = .x,
pred_col = "pred",
actual_col = "actual",
id_col = "participant"
)
}) %>%
ungroup() %>%
mutate(
participant = factor(participant),
Group = factor(Group, levels = c("younger", "older")),
Predictor = factor(Predictor, levels = c("GI","HE","QDAP","Blob","VADER"))
)
# 5) Composite score per participant (rowMeans)
# Use 1 - FPR and 1 - FNR so higher means better
metric_cols <- c("Accuracy", "Precision", "Recall", "F1", "Specificity", "FPR_rev", "FNR_rev")
metrics_by_participant <- metrics_by_participant %>%
mutate(
FPR_rev = 1 - FPR,
FNR_rev = 1 - FNR
) %>%
mutate(
CompositeScore = rowMeans(
as.matrix(dplyr::select(., dplyr::all_of(metric_cols))),
na.rm = TRUE
)
)
# Mean composite score per participant x dictionary (one row per cell)
participant_means <- metrics_by_participant %>%
group_by(participant, Group, Predictor) %>%
summarise(CompositeScore_mean = mean(CompositeScore, na.rm = TRUE), .groups = "drop")
When considering all sentiment dictionaries together, there is no overall age-related difference in agreement with human emotional ratings. Instead, the reliability of the dictionaries varies substantially across tools and depends on age group. QDAP shows higher agreement with human scores in younger adults, HE and Blob show higher agreement in older adults, and GI and VADER show little to no age-related difference. The significant Group × Dictionary interaction indicates that dictionary-based sentiment measures do not uniformly capture human-rated emotional content across age groups. Consequently, conclusions about age differences in sentiment reliability are contingent on the specific dictionary used, underscoring the importance of evaluating dictionary performance separately across populations rather than assuming age-invariant validity.
lme_comp <- lmer(
CompositeScore ~ Group * Predictor + (1 | participant),
data = metrics_by_participant,
REML = TRUE
)
summary(lme_comp)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: CompositeScore ~ Group * Predictor + (1 | participant)
## Data: metrics_by_participant
##
## REML criterion at convergence: -250
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.1461 -0.5796 -0.0384 0.5243 3.2382
##
## Random effects:
## Groups Name Variance Std.Dev.
## participant (Intercept) 0.00000 0.00000
## Residual 0.00919 0.09586
## Number of obs: 160, groups: participant, 32
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 0.389332 0.023966 150.000000 16.245 < 2e-16 ***
## Groupolder 0.012590 0.033893 150.000000 0.371 0.710814
## PredictorHE -0.092946 0.033893 150.000000 -2.742 0.006843 **
## PredictorQDAP 0.017394 0.033893 150.000000 0.513 0.608575
## PredictorBlob 0.008341 0.033893 150.000000 0.246 0.805938
## PredictorVADER 0.126453 0.033893 150.000000 3.731 0.000270 ***
## Groupolder:PredictorHE 0.084744 0.047932 150.000000 1.768 0.079095 .
## Groupolder:PredictorQDAP -0.165636 0.047932 150.000000 -3.456 0.000714 ***
## Groupolder:PredictorBlob 0.062141 0.047932 150.000000 1.296 0.196817
## Groupolder:PredictorVADER -0.070210 0.047932 150.000000 -1.465 0.145073
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) Grpldr PrdcHE PrQDAP PrdctB PVADER Gr:PHE G:PQDA Grp:PB
## Groupolder -0.707
## PredictorHE -0.707 0.500
## PredctrQDAP -0.707 0.500 0.500
## PredictrBlb -0.707 0.500 0.500 0.500
## PrdctrVADER -0.707 0.500 0.500 0.500 0.500
## Grpldr:PrHE 0.500 -0.707 -0.707 -0.354 -0.354 -0.354
## Grpld:PQDAP 0.500 -0.707 -0.354 -0.707 -0.354 -0.354 0.500
## Grpldr:PrdB 0.500 -0.707 -0.354 -0.354 -0.707 -0.354 0.500 0.500
## Grpl:PVADER 0.500 -0.707 -0.354 -0.354 -0.354 -0.707 0.500 0.500 0.500
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')
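The boundary (singular) fit flagged above can be confirmed directly, and a fixed-effects-only model offers a simple sensitivity check (a sketch; not part of the original output):

```r
# Sketch: confirm the zero-variance random intercept and run a sensitivity check
lme4::isSingular(lme_comp)   # TRUE when the participant intercept variance is estimated at zero
lm_comp <- lm(CompositeScore ~ Group * Predictor, data = metrics_by_participant)
anova(lm_comp)               # fixed-effects-only comparison
```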
anova(lme_comp)
## Type III Analysis of Variance Table with Satterthwaite's method
## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## Group 0.00108 0.001082 1 150 0.1178 0.7319
## Predictor 0.53415 0.133539 4 150 14.5309 4.722e-10 ***
## Group:Predictor 0.33460 0.083650 4 150 9.1023 1.307e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
emm_age_by_dic <- emmeans(lme_comp, ~ Group | Predictor)
pairs(emm_age_by_dic, adjust = "holm")
## Predictor = GI:
## contrast estimate SE df t.ratio p.value
## younger - older -0.0126 0.0339 150 -0.371 0.7108
##
## Predictor = HE:
## contrast estimate SE df t.ratio p.value
## younger - older -0.0973 0.0339 150 -2.872 0.0047
##
## Predictor = QDAP:
## contrast estimate SE df t.ratio p.value
## younger - older 0.1530 0.0339 150 4.516 <.0001
##
## Predictor = Blob:
## contrast estimate SE df t.ratio p.value
## younger - older -0.0747 0.0339 150 -2.205 0.0290
##
## Predictor = VADER:
## contrast estimate SE df t.ratio p.value
## younger - older 0.0576 0.0339 150 1.700 0.0912
##
## Degrees-of-freedom method: kenward-roger
When restricting the analysis to Blob and VADER, there is no overall age difference in agreement with human emotional ratings. Instead, the reliability of the sentiment dictionaries depends on both age group and dictionary type. Blob shows higher agreement with human scores in older adults, whereas the corresponding difference for VADER runs in the direction of younger adults but does not reach significance; within the younger group, VADER outperforms Blob. This crossover interaction indicates that the extent to which a dictionary captures human-rated emotional content is age-dependent, highlighting that dictionary reliability is not uniform across populations and that conclusions about sentiment accuracy can shift depending on both the tool used and the characteristics of the sample.
# Subset to Blob and VADER only
metrics_BV <- metrics_by_participant %>%
dplyr::filter(Predictor %in% c("Blob", "VADER")) %>%
dplyr::mutate(
Predictor = droplevels(Predictor),
Group = droplevels(Group)
)
participant_means_BV <- participant_means %>%
dplyr::filter(Predictor %in% c("Blob", "VADER")) %>%
dplyr::mutate(
Predictor = droplevels(Predictor),
Group = droplevels(Group)
)
lme_comp_BV <- lmer(
CompositeScore ~ Group * Predictor + (1 | participant),
data = metrics_BV,
REML = TRUE
)
summary(lme_comp_BV)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: CompositeScore ~ Group * Predictor + (1 | participant)
## Data: metrics_BV
##
## REML criterion at convergence: -93.8
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.0370 -0.6123 0.1496 0.5133 3.0736
##
## Random effects:
## Groups Name Variance Std.Dev.
## participant (Intercept) 0.0000 0.000
## Residual 0.0102 0.101
## Number of obs: 64, groups: participant, 32
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 0.39767 0.02525 60.00000 15.750 < 2e-16 ***
## Groupolder 0.07473 0.03571 60.00000 2.093 0.04060 *
## PredictorVADER 0.11811 0.03571 60.00000 3.308 0.00159 **
## Groupolder:PredictorVADER -0.13235 0.05050 60.00000 -2.621 0.01110 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) Grpldr PVADER
## Groupolder -0.707
## PrdctrVADER -0.707 0.500
## Grpl:PVADER 0.500 -0.707 -0.707
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')
anova(lme_comp_BV)
## Type III Analysis of Variance Table with Satterthwaite's method
## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## Group 0.001171 0.001171 1 60 0.1148 0.73592
## Predictor 0.043159 0.043159 1 60 4.2309 0.04405 *
## Group:Predictor 0.070067 0.070067 1 60 6.8688 0.01110 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
emm_age_by_dic <- emmeans(lme_comp_BV, ~ Group | Predictor)
pairs(emm_age_by_dic, adjust = "holm")
## Predictor = Blob:
## contrast estimate SE df t.ratio p.value
## younger - older -0.0747 0.0357 60 -2.093 0.0406
##
## Predictor = VADER:
## contrast estimate SE df t.ratio p.value
## younger - older 0.0576 0.0357 60 1.614 0.1119
##
## Degrees-of-freedom method: kenward-roger
emm_dic_by_age <- emmeans(lme_comp_BV, ~ Predictor | Group)
pairs(emm_dic_by_age, adjust = "holm")
## Group = younger:
## contrast estimate SE df t.ratio p.value
## Blob - VADER -0.1181 0.0357 30 -3.308 0.0024
##
## Group = older:
## contrast estimate SE df t.ratio p.value
## Blob - VADER 0.0142 0.0357 30 0.399 0.6929
##
## Degrees-of-freedom method: kenward-roger
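A simple way to visualize the crossover pattern is to plot the estimated marginal means from the Blob/VADER model (a sketch; not part of the original output):

```r
# Sketch: interaction plot of estimated marginal means (Group x Predictor)
emmeans::emmip(lme_comp_BV, Predictor ~ Group, CIs = TRUE)
```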