Objective and Dataset Design
Data preparation
To prepare therapist and client utterances for large language model
(LLM) analysis, we applied a standardized text-cleaning pipeline to
remove extraneous content and normalize input across sessions. Each
utterance was stripped of timestamp markers (e.g., [00:15], [1:15:30])
and non-verbal annotations such as [laughs] or (sigh). Contractions were
expanded (e.g., “won’t” → “will not”) to increase clarity and
compatibility with downstream models. Repeated punctuation (e.g., …, !!,
??) was reduced to single marks, and all irregular spacing was
normalized. The resulting cleaned utterances (ClientClean,
TherapistClean) preserved the semantic content of the
original dialogue while minimizing noise, enabling more reliable
summarization and classification.
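For concreteness, the cleaning steps can be sketched as a small set of regular-expression replacements in R. This is a minimal illustration rather than the exact pipeline: the patterns, the contraction table, and the raw column names (Client, Therapist) are assumptions; only the cleaned column names (ClientClean, TherapistClean) come from the text above.

```r
# Illustrative cleaning rules; patterns and contraction table are examples only.
clean_utterance <- function(x) {
  x <- gsub("\\[\\d{1,2}:\\d{2}(:\\d{2})?\\]", " ", x, perl = TRUE)   # timestamps, e.g. [00:15], [1:15:30]
  x <- gsub("\\[[^\\]]*\\]|\\([^)]*\\)", " ", x, perl = TRUE)         # non-verbal annotations, e.g. [laughs], (sigh)
  contractions <- c("won't" = "will not", "can't" = "cannot", "n't" = " not")
  for (i in seq_along(contractions)) {
    x <- gsub(names(contractions)[i], contractions[i], x, fixed = TRUE)
  }
  x <- gsub("([!?.])\\1+", "\\1", x, perl = TRUE)                     # collapse repeated punctuation (!!, ??, ...)
  x <- gsub("\u2026", ".", x, fixed = TRUE)                           # ellipsis character to a single period
  x <- gsub("\\s+", " ", trimws(x), perl = TRUE)                      # normalise whitespace
  x
}

sessions$ClientClean    <- clean_utterance(sessions$Client)
sessions$TherapistClean <- clean_utterance(sessions$Therapist)
```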
To identify key psychological shifts and communicative patterns in
therapy sessions, we used a large language model
(GPT-3.5-turbo) to generate rolling summaries based on
sequential conversational windows. We piloted 1-, 2-, 3-, and 4-turn
windows and found that increasing the number of turns improved summary
reliability. A 3-turn window offered the best balance between
interpretive richness and moment-to-moment precision, and was used for
all downstream analyses. The dataset was sorted chronologically by
participant, and a moving window captured up to three consecutive
therapist–client exchanges at each point. The model received a prompt
instructing it to “Summarize the key psychological themes or shifts
in this interaction. Avoid quoting. Use 1–2 sentences.” For each
eligible turn, we saved the raw text window (RawText_3Turn)
and the corresponding summary (Summary_3Turn). This
structured, mid-level summarization provided a scalable and
interpretable representation of therapeutic process across sessions.
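The rolling-window construction can be sketched as follows. This is a schematic outline only: summarise_window() is a hypothetical placeholder for a call to the GPT-3.5-turbo chat API with the prompt quoted above, and the Participant and TurnIndex column names are assumptions.

```r
# Schematic sketch of the 3-turn rolling-summary loop.
prompt <- paste("Summarize the key psychological themes or shifts in this interaction.",
                "Avoid quoting. Use 1-2 sentences.")

window_size <- 3
data <- data[order(data$Participant, data$TurnIndex), ]   # chronological within participant
data$RawText_3Turn <- NA_character_
data$Summary_3Turn <- NA_character_

for (p in unique(data$Participant)) {
  rows <- which(data$Participant == p)
  for (j in seq_along(rows)) {
    idx <- rows[max(1, j - window_size + 1):j]            # up to three consecutive exchanges
    window_text <- paste(
      paste("Therapist:", data$TherapistClean[idx],
            "Client:",    data$ClientClean[idx]),
      collapse = "\n")
    i <- rows[j]
    data$RawText_3Turn[i] <- window_text
    data$Summary_3Turn[i] <- summarise_window(window_text, prompt)  # hypothetical API helper
  }
}
```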
Scoring transcripts
We used a three-stage Build–Filter–Train approach to classify therapy transcript segments for two distinct coding schemes.

In the Build stage, Claudia classified seven psychological processes (Motivation, Self, Cognition, Affect, Attention, Overt Behavior, Context) using three methods: instruction-based LLM classification (GPT-3.5 Turbo) with both full and short category definitions, embedding-based similarity (text-embedding-ada-002) comparing segments to category definitions in semantic space, and rule-based keyword matching. Natasa classified five social-interpersonal categories (Relational Content Focus, Understanding, Interpersonal Effectiveness, Collaboration, Ineffective) using the same LLM and keyword methods but not embeddings, as her focus was on social and contextual dimensions rather than content.

In the Filter stage, outputs from all methods were combined and run through the Boruta feature selection algorithm, separately for each task, to retain only the most reliable predictors. In the Train stage, these confirmed predictors were used to train XGBoost models, producing robust, accurate predictions. This framework combined the interpretive flexibility of LLMs, the semantic precision of embeddings (Claudia only), and the transparency of rule-based methods into a unified, data-driven classification pipeline.
For each target category, we used the Boruta feature selection algorithm to identify the most important predictors from the full set of model-generated features. The procedure was run separately for every dependent variable (e.g., Motivation, Cognition, Self, Affect, Attention, Overt Behavior, Context), using a dynamically generated formula that included all candidate predictors. Boruta iteratively compared each predictor’s importance against “shadow” variables created from randomised data, retaining only those that consistently outperformed the shadows. For each category, we saved both the complete Boruta model object and a cleaned importance table containing the mean importance score and selection decision (confirmed, tentative, or rejected) for every predictor. The resulting collection of Boruta outputs was stored in a single list object and saved as an RDS file for later use in model building and inspection.
A shadow feature in Boruta is a copy of one of your real predictors, but with its values randomly shuffled so that any relationship to the outcome variable is destroyed. These shadow features act as a baseline for importance—because they are just noise, they represent the level of importance you might expect by chance alone.
Boruta compares the importance score of each real predictor to the best-performing shadow feature.
This comparison helps ensure that only predictors with importance well above random noise are selected, making the feature selection process more reliable.
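As a sketch of how this stage can be run in R, assuming the candidate predictors and binary outcome labels are already assembled in one data frame (the data frame and outcome column names here are illustrative, not the actual object names):

```r
library(Boruta)

outcomes   <- c("label_Motivation", "label_Cognition", "label_Self", "label_Affect",
                "label_Attention", "label_OvertBehavior", "label_Context")   # assumed names
predictors <- setdiff(names(model_features), outcomes)

boruta_results <- list()
for (dv in outcomes) {
  # dynamically generated formula including all candidate predictors
  fml <- as.formula(paste(dv, "~", paste(predictors, collapse = " + ")))
  fit <- Boruta(fml, data = model_features, doTrace = 0)

  stats <- attStats(fit)                               # mean importance and decision per predictor
  boruta_results[[dv]] <- list(
    model      = fit,
    importance = data.frame(Predictor      = rownames(stats),
                            MeanImportance = round(stats$meanImp, 3),
                            Decision       = stats$decision))
}

saveRDS(boruta_results, "boruta_results.rds")          # stored for later model building and inspection
```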
See: Ciarrochi, J., Sahdra, B., Hofmann, S. G., & Hayes, S. C. (2022). Developing an item pool to assess processes of change in psychological interventions: The process-based assessment tool (PBAT). Journal of Contextual Behavioral Science, 23, 200–213.
Across text types, RawText produced more confirmed predictors (n = 115) than Summary (n = 85), while Summary had a higher number of rejected features (n = 69 vs. n = 45 for RawText). At the model level, the embedding-based method contributed the most confirmed predictors (n = 75), followed by the short-prompt LLM (n = 59), full-prompt LLM (n = 49), and rule-based method (n = 17), which also had the largest number of rejected features (n = 58). By target category, Affect yielded the most confirmed predictors (n = 32), closely followed by Cognition (n = 31), Overt Behavior (n = 31), and Attention (n = 30), with Motivation producing the fewest (n = 23). These patterns suggest that predictor confirmation rates varied systematically by both data source and modelling approach, with embedding-based features and raw text generally contributing the strongest signals across multiple categories.
The sortable table below shows these results.
This table presents all predictors retained from the Boruta feature selection output, with each row corresponding to a single predictor and its attributes. The columns identify the text type the feature came from (e.g., raw or summarised transcript), the turn in the dialogue, the feature dimension it belongs to, and the model or method that produced it. For each target category, the table displays the predictor’s Boruta importance score, colour-coded to reflect its selection decision (confirmed, tentative, or rejected). The table is interactive and sortable, allowing the user to filter or arrange predictors based on their source, position in the transcript, feature grouping, method of generation, or category-specific importance.
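A minimal sketch of how such an interactive, colour-coded table can be produced, assuming the DT package was used for rendering; the data frame and column names (boruta_table, Context_Importance, Context_Decision) are illustrative:

```r
library(DT)

# One row per predictor; importance and decision columns per category are assumed names.
datatable(boruta_table, filter = "top", rownames = FALSE,
          options = list(pageLength = 25)) |>
  formatStyle("Context_Importance",                      # importance score for one category
              valueColumns = "Context_Decision",         # colour by Boruta decision
              backgroundColor = styleEqual(
                c("Confirmed", "Tentative", "Rejected"),
                c("palegreen", "khaki", "lightcoral")))
```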
XGBoost is a machine learning method that builds a series of decision trees, where each new tree tries to fix the mistakes made by the previous ones (Friedman, 2001; Chen & Guestrin, 2016). Because it adds trees one at a time and learns from errors as it goes, it can model very complex patterns, including situations where predictors interact in non-obvious ways. XGBoost is designed to be fast, work well with large numbers of predictors, and avoid “overfitting” (when a model learns the training data too perfectly but performs poorly on new data). It does this by limiting tree depth, using random subsets of the data and predictors for each tree, and slowing down learning so that each step makes only a small adjustment.
Procedure. For each outcome we wanted to predict (e.g., Motivation), we started with the features Boruta had marked as “confirmed” for that outcome and built a dataset using only those predictors. We trained an XGBoost model with settings designed to balance complexity and generalization: tree depth limited to 3 (max_depth = 3), a modest learning rate of 0.05 (eta = 0.05), sampling 80% of the data for each tree (subsample = 0.8), and sampling 80% of the predictors for each tree (colsample_bytree = 0.8). The model’s objective was binary logistic classification (objective = “binary:logistic”) and performance was tracked using the area under the ROC curve (eval_metric = “auc”).
To make sure the model wasn’t just memorizing the training data, we used 5-fold cross-validation. This means we split the data into five equal parts, trained the model on four parts, and tested it on the one part left out. We repeated this process five times so that every part of the data was used once for testing. The results from the five test runs were averaged to give a more reliable measure of how the model would perform on new data. This process greatly reduces the risk of overfitting because the model is tested on data it has never seen during training in each round. We also used early stopping—ending training if performance didn’t improve for 10 rounds—to prevent the model from getting too complex.
Model performance was evaluated using accuracy, a confusion matrix (showing correct and incorrect predictions), the AUC, and a ranked list of the most important predictors based on the model’s internal feature importance scores.
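A condensed sketch of this step for a single outcome, following on from the earlier Boruta sketch and mirroring the settings described above; it uses the xgboost and caret packages, and the object names are assumptions:

```r
library(xgboost)
library(caret)

# Keep only the Boruta-confirmed predictors for this outcome
confirmed <- boruta_results[["label_Context"]]$importance
confirmed <- confirmed$Predictor[confirmed$Decision == "Confirmed"]

X <- as.matrix(model_features[, confirmed])
y <- model_features$label_Context                        # 0/1 outcome
dtrain <- xgb.DMatrix(data = X, label = y)

params <- list(objective = "binary:logistic", eval_metric = "auc",
               max_depth = 3, eta = 0.05,
               subsample = 0.8, colsample_bytree = 0.8)

# 5-fold cross-validation with early stopping after 10 rounds without improvement
cv  <- xgb.cv(params = params, data = dtrain, nrounds = 500,
              nfold = 5, early_stopping_rounds = 10, verbose = 0)

fit <- xgb.train(params = params, data = dtrain, nrounds = cv$best_iteration)

pred_class <- as.integer(predict(fit, dtrain) > 0.5)
confusionMatrix(factor(pred_class), factor(y))           # accuracy, kappa, sensitivity, etc. (positive class "0")
xgb.importance(model = fit)                              # ranked predictor importance
```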
References
Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16), San Francisco, CA, USA. https://doi.org/10.1145/2939672.2939785
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5), 1189–1232. https://doi.org/10.1214/aos/1013203451
The XGBoost model achieved an overall accuracy of 91.80% (95% CI: 89.52%–93.71%) in predicting whether therapy sessions contained context-related discussion. This performance was significantly better than the no-information rate of 86.99% (p < .001). The model's kappa statistic was 0.545, indicating moderate agreement beyond chance.
Sensitivity was very high at 98.86%, indicating that the model correctly identified almost all context-absent sessions. Specificity, however, was lower at 44.57%, meaning the model was less effective at correctly identifying sessions in which context was present.
The positive predictive value (92.26%) indicates that when the model predicted a session to be in the “context absent” class (the positive class in this coding), it was correct about 92% of the time. The negative predictive value (85.42%) means that when the model predicted a session to be in the “context present” class, it was correct about 85% of the time. Balanced accuracy, which averages sensitivity and specificity to account for class imbalance, was 71.71%, indicating that while the model was excellent at identifying the majority class, performance was more modest when both classes were weighted equally.
McNemar’s test was significant (p < .001), suggesting systematic differences in the types of classification errors, with the model more likely to misclassify context-present sessions as context-absent than vice versa.
## Predictor Importance
## <char> <num>
## 1: Summary_3Turn_Context_short 0.526
## 2: RawText_3Turn_Context_embed 0.219
## 3: RawText_3Turn_Context_short 0.192
## 4: Summary_3Turn_Context_rule 0.032
## 5: RawText_3Turn_Context_full 0.019
## 6: Summary_3Turn_Context_embed 0.009
## 7: Summary_3Turn_Context_full 0.001
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 608 51
## 1 7 41
##
## Accuracy : 0.918
## 95% CI : (0.8952, 0.9371)
## No Information Rate : 0.8699
## P-Value [Acc > NIR] : 3.714e-05
##
## Kappa : 0.5451
##
## Mcnemar's Test P-Value : 1.641e-08
##
## Sensitivity : 0.9886
## Specificity : 0.4457
## Pos Pred Value : 0.9226
## Neg Pred Value : 0.8542
## Prevalence : 0.8699
## Detection Rate : 0.8600
## Detection Prevalence : 0.9321
## Balanced Accuracy : 0.7171
##
## 'Positive' Class : 0
##
We used SHAP values (SHapley Additive exPlanations) to interpret the trained XGBoost models and quantify the contribution of each predictor to the classification of the target category. SHAP values were computed using XGBoost’s built-in TreeSHAP algorithm, which calculates, for every case and every feature, how much that feature shifts the prediction relative to the model’s average prediction (the “baseline”). We summarised these local contributions into a global SHAP table with three metrics for each predictor:
MeanAbsSHAP – the average absolute magnitude of a feature’s SHAP values across all cases, reflecting its overall influence on the model’s predictions regardless of direction.
MeanSHAP – the average signed SHAP value, indicating whether a feature tends to increase (positive) or decrease (negative) the predicted probability of the target category.
PctPositiveSHAP – the proportion of cases in which a feature’s SHAP value was positive, showing how often it pushes the prediction upward.
All values were rounded to four decimal places for clarity. Features with higher MeanAbsSHAP values have a larger overall impact on the model’s decisions, while MeanSHAP and PctPositiveSHAP help explain the typical direction and consistency of that effect.
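A brief sketch of how these three summaries can be computed from xgboost's built-in TreeSHAP output (predcontrib = TRUE); the fit and dtrain objects follow the earlier training sketch and are assumptions:

```r
# Per-case, per-feature SHAP contributions; the final column is the model baseline (BIAS).
shap <- predict(fit, dtrain, predcontrib = TRUE)
shap <- shap[, colnames(shap) != "BIAS", drop = FALSE]

shap_summary <- data.frame(
  Predictor       = colnames(shap),
  MeanAbsSHAP     = round(colMeans(abs(shap)), 4),   # overall magnitude of influence
  MeanSHAP        = round(colMeans(shap), 4),        # typical direction of influence
  PctPositiveSHAP = round(colMeans(shap > 0), 4)     # how often the feature pushes the prediction up
)
shap_summary[order(-shap_summary$MeanAbsSHAP), ]
```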
The first table summarises the global SHAP values for predictors of the “Context” category. The largest overall contributions to the model’s predictions (MeanAbsSHAP) came from Summary_3Turn_Context_short (0.2871) and RawText_3Turn_Context_short (0.2010), indicating that these features had the strongest influence on classification decisions, regardless of whether that influence was positive or negative. Other predictors, such as RawText_3Turn_Context_embed (0.1119) and Summary_3Turn_Context_embed (0.0269), had more modest overall contributions, while features like RawText_3Turn_Context_rule (0.0012) contributed very little.
In contrast, the average signed SHAP values (MeanSHAP) for all predictors were close to zero, despite some features having high overall importance. This occurs because the same feature can push predictions upward in some cases and downward in others, leading these effects to cancel out when averaged. For example, Summary_3Turn_Context_short had a substantial overall impact but a MeanSHAP of only –0.016, reflecting a near balance between positive and negative contributions across cases.
The PctPositiveSHAP values provide additional insight into this variability. They indicate the proportion of cases in which a feature pushed the prediction upward. For instance, Summary_3Turn_Context_full had a relatively low overall importance (0.0021) but pushed predictions upward in 66.5% of cases, whereas Summary_3Turn_Context_rule, with a similarly modest overall importance (0.0266), did so in only 2.1% of cases. This reveals that some features consistently act in one direction, while others have mixed directional effects across individuals.
Together, these results illustrate our ergodicity argument: averages alone can be misleading when effects vary substantially across cases. A near-zero mean does not mean a feature is unimportant—it can still have large and meaningful effects in individual instances, but in different directions. Considering both the magnitude (MeanAbsSHAP) and the directional consistency (PctPositiveSHAP) provides a fuller picture of a predictor’s role than mean values alone.
When we examine simple Pearson correlations between each predictor and the binary outcome label_context, all correlations are positive, ranging from 0.191 (RawText_3Turn_Context_rule) to 0.407 (Summary_3Turn_Context_short). This shows that, on average, higher values of each predictor are associated with a greater likelihood of “Context” being present. However, the SHAP results reveal that the model does not use these predictors in a uniformly positive way across all cases. A feature can have a positive global correlation with the outcome while still contributing negatively to the prediction for many individuals, due to interactions with other features in the model. This divergence underscores our ergodicity argument: population-level summaries such as Pearson correlations can mask substantial variability in how predictors operate at the individual level. SHAP-based metrics, by decomposing predictions case-by-case, reveal the heterogeneous and sometimes opposing contributions that are hidden in the average.
One global model is trained on all the data. That model has learned patterns — “if feature X is high and feature Y is low, the probability of Context increases,” etc.
But those learned patterns can interact in complicated ways.
For one case, a particular feature might be in a range that increases the predicted probability (positive SHAP value).
For another case, that same feature’s value, combined with the other features for that person, might push the probability down (negative SHAP value).
Example: Say the model uses Summary_3Turn_Context_short as one of its predictors.
Case A: This feature’s value is high, and in combination with other features, the model has learned that this pattern often signals “Context present.” SHAP shows a positive contribution — it pushes the prediction up.
Case B: This feature’s value is also high, but the other features for this case are in a pattern that the model has learned usually means “Context not present.” In that context, the same feature value interacts differently with the others, and SHAP shows a negative contribution — it pushes the prediction down.
So, each individual is scored separately by the same model, and SHAP explains for that individual whether each feature was pushing the probability higher or lower given all the other feature values for that person.
This is exactly why the MeanSHAP values can hover near zero: the same feature can push in opposite directions for different cases, so the average cancels out even though the feature is influential in almost every case. This is also where the ergodicity point fits in: the population-average effect does not tell you what is happening for individuals.
## Predictor PearsonCorrelation
## Summary_3Turn_Context_short Summary_3Turn_Context_short 0.407
## RawText_3Turn_Context_short RawText_3Turn_Context_short 0.370
## RawText_3Turn_Context_embed RawText_3Turn_Context_embed 0.274
## Summary_3Turn_Context_embed Summary_3Turn_Context_embed 0.211
## Summary_3Turn_Context_full Summary_3Turn_Context_full 0.253
## RawText_3Turn_Context_full RawText_3Turn_Context_full 0.325
## Summary_3Turn_Context_rule Summary_3Turn_Context_rule 0.206
## RawText_3Turn_Context_rule RawText_3Turn_Context_rule 0.191
Most models classify something as “present” if the probability is greater than 0.50 (or 50%). But this default threshold isn’t always the best choice. In our case, we used a method called Youden’s Index to find the optimal cutoff — the point that gives us the best balance between correctly identifying when “Context” is present (sensitivity) and correctly identifying when it’s not (specificity).
The model found that the best cutoff was not 50%, but 41.7%. That means:
If the model predicts there is more than a 41.7% chance that “Context” is present in a given therapy turn, we classify it as present. Otherwise, we classify it as absent.
This lower threshold can help catch more true cases of “Context” being present, especially if those cases are harder to detect and tend to have lower predicted probabilities.
## Optimal threshold used to classify 'Context_Present': 0.417
We evaluate the performance of our classifier using the Receiver Operating Characteristic (ROC) curve. The ROC curve plots sensitivity (true positive rate) against 1 – specificity (false positive rate) across all possible thresholds. This allows us to visualize how well the model distinguishes between the two outcome classes.
The Area Under the Curve (AUC) summarizes the overall ability of the model to discriminate:
AUC = 0.5 implies no discrimination (random chance)
AUC = 1 implies perfect discrimination
AUC ≥ 0.8 is typically considered good
To choose an optimal classification cutoff, we apply Youden’s index to find the threshold that maximizes sensitivity + specificity. We display this threshold on the plot using red dashed lines. The crossing point of these lines helps identify the cutoff where we best separate predicted “present” from “absent”.
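A minimal sketch of the ROC and Youden's-index step, assuming the pROC package and assuming y holds the true 0/1 labels and probs the out-of-fold predicted probabilities (both object names are illustrative):

```r
library(pROC)

roc_obj <- roc(response = y, predictor = probs, quiet = TRUE)
auc(roc_obj)                                             # area under the ROC curve

best <- coords(roc_obj, x = "best", best.method = "youden",
               ret = c("threshold", "sensitivity", "specificity"))
best                                                     # optimal cutoff and its sensitivity/specificity

plot(roc_obj)                                            # specificity (decreasing) vs sensitivity
abline(v = best$specificity, h = best$sensitivity,       # dashed lines marking the optimal cutoff
       col = "red", lty = 2)
```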
In the present figure:
AUC = 0.915 → Excellent model discrimination.
Threshold = 0.157 → This is the probability cutoff that best balances sensitivity and specificity (using Youden’s index).
Sensitivity = 0.815 → Of all actual “Context Present” lines, 81.5% are correctly detected.
Specificity = 0.857 → Of all “Context Not Present” lines, 85.7% are correctly identified as such.
The red dashed lines show this optimal threshold on the ROC space.
## Predictor Importance
## <char> <num>
## 1: RawText_3Turn_Affect_embed 0.545
## 2: Summary_3Turn_Affect_embed 0.183
## 3: RawText_3Turn_Affect_rule 0.115
## 4: RawText_3Turn_Affect_full 0.073
## 5: Summary_3Turn_Affect_rule 0.057
## 6: RawText_3Turn_Affect_short 0.023
## 7: Summary_3Turn_Affect_short 0.003
## 8: Summary_3Turn_Affect_full 0.002
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 591 55
## 1 13 48
##
## Accuracy : 0.9038
## 95% CI : (0.8797, 0.9245)
## No Information Rate : 0.8543
## P-Value [Acc > NIR] : 5.533e-05
##
## Kappa : 0.535
##
## Mcnemar's Test P-Value : 6.627e-07
##
## Sensitivity : 0.9785
## Specificity : 0.4660
## Pos Pred Value : 0.9149
## Neg Pred Value : 0.7869
## Prevalence : 0.8543
## Detection Rate : 0.8359
## Detection Prevalence : 0.9137
## Balanced Accuracy : 0.7222
##
## 'Positive' Class : 0
##
## Predictor Importance
## <char> <num>
## 1: RawText_3Turn_OvertBehavior_short 0.498
## 2: RawText_3Turn_OvertBehavior_embed 0.223
## 3: RawText_3Turn_OvertBehavior_full 0.153
## 4: Summary_3Turn_OvertBehavior_embed 0.100
## 5: Summary_3Turn_OvertBehavior_short 0.014
## 6: Summary_3Turn_OvertBehavior_full 0.013
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 605 60
## 1 11 31
##
## Accuracy : 0.8996
## 95% CI : (0.875, 0.9207)
## No Information Rate : 0.8713
## P-Value [Acc > NIR] : 0.01225
##
## Kappa : 0.4189
##
## Mcnemar's Test P-Value : 1.223e-08
##
## Sensitivity : 0.9821
## Specificity : 0.3407
## Pos Pred Value : 0.9098
## Neg Pred Value : 0.7381
## Prevalence : 0.8713
## Detection Rate : 0.8557
## Detection Prevalence : 0.9406
## Balanced Accuracy : 0.6614
##
## 'Positive' Class : 0
##
## Predictor Importance
## <char> <num>
## 1: Summary_3Turn_Motivation_embed 0.302
## 2: RawText_3Turn_Motivation_short 0.267
## 3: RawText_3Turn_Motivation_embed 0.192
## 4: RawText_3Turn_Motivation_rule 0.078
## 5: Summary_3Turn_Motivation_full 0.075
## 6: Summary_3Turn_Motivation_short 0.035
## 7: Summary_3Turn_Motivation_rule 0.028
## 8: RawText_3Turn_Motivation_full 0.023
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 615 51
## 1 7 34
##
## Accuracy : 0.918
## 95% CI : (0.8952, 0.9371)
## No Information Rate : 0.8798
## P-Value [Acc > NIR] : 0.0006661
##
## Kappa : 0.5006
##
## Mcnemar's Test P-Value : 1.641e-08
##
## Sensitivity : 0.9887
## Specificity : 0.4000
## Pos Pred Value : 0.9234
## Neg Pred Value : 0.8293
## Prevalence : 0.8798
## Detection Rate : 0.8699
## Detection Prevalence : 0.9420
## Balanced Accuracy : 0.6944
##
## 'Positive' Class : 0
##
## Predictor Importance
## <char> <num>
## 1: Summary_3Turn_Self_short 0.443
## 2: RawText_3Turn_Self_short 0.294
## 3: Summary_3Turn_Self_rule 0.073
## 4: RawText_3Turn_Self_embed 0.064
## 5: Summary_3Turn_Self_embed 0.051
## 6: RawText_3Turn_Self_full 0.049
## 7: Summary_3Turn_Self_full 0.025
## 8: RawText_3Turn_Self_rule 0.001
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 567 48
## 1 18 74
##
## Accuracy : 0.9066
## 95% CI : (0.8828, 0.9271)
## No Information Rate : 0.8274
## P-Value [Acc > NIR] : 1.449e-09
##
## Kappa : 0.6379
##
## Mcnemar's Test P-Value : 0.0003575
##
## Sensitivity : 0.9692
## Specificity : 0.6066
## Pos Pred Value : 0.9220
## Neg Pred Value : 0.8043
## Prevalence : 0.8274
## Detection Rate : 0.8020
## Detection Prevalence : 0.8699
## Balanced Accuracy : 0.7879
##
## 'Positive' Class : 0
##
## Predictor Importance
## <char> <num>
## 1: Summary_3Turn_Attention_embed 0.383
## 2: Summary_3Turn_Attention_rule 0.192
## 3: RawText_3Turn_Attention_embed 0.138
## 4: RawText_3Turn_Attention_short 0.126
## 5: Summary_3Turn_Attention_short 0.072
## 6: RawText_3Turn_Attention_full 0.040
## 7: RawText_3Turn_Attention_rule 0.030
## 8: Summary_3Turn_Attention_full 0.019
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 602 26
## 1 3 76
##
## Accuracy : 0.959
## 95% CI : (0.9416, 0.9724)
## No Information Rate : 0.8557
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8167
##
## Mcnemar's Test P-Value : 4.402e-05
##
## Sensitivity : 0.9950
## Specificity : 0.7451
## Pos Pred Value : 0.9586
## Neg Pred Value : 0.9620
## Prevalence : 0.8557
## Detection Rate : 0.8515
## Detection Prevalence : 0.8883
## Balanced Accuracy : 0.8701
##
## 'Positive' Class : 0
##
## Predictor Importance
## <char> <num>
## 1: RawText_3Turn_Cognition_full 0.363
## 2: Summary_3Turn_Cognition_short 0.242
## 3: RawText_3Turn_Cognition_embed 0.216
## 4: Summary_3Turn_Cognition_embed 0.112
## 5: RawText_3Turn_Cognition_short 0.040
## 6: Summary_3Turn_Cognition_full 0.027
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 589 54
## 1 6 58
##
## Accuracy : 0.9151
## 95% CI : (0.8921, 0.9346)
## No Information Rate : 0.8416
## P-Value [Acc > NIR] : 5.484e-09
##
## Kappa : 0.6147
##
## Mcnemar's Test P-Value : 1.298e-09
##
## Sensitivity : 0.9899
## Specificity : 0.5179
## Pos Pred Value : 0.9160
## Neg Pred Value : 0.9063
## Prevalence : 0.8416
## Detection Rate : 0.8331
## Detection Prevalence : 0.9095
## Balanced Accuracy : 0.7539
##
## 'Positive' Class : 0
##