Objective and Dataset Design
Data preparation
To prepare therapist and client utterances for large language model
(LLM) analysis, we applied a standardized text-cleaning pipeline to
remove extraneous content and normalize input across sessions. Each
utterance was stripped of timestamp markers (e.g., [00:15], [1:15:30])
and non-verbal annotations such as [laughs] or (sigh). Contractions were
expanded (e.g., “won’t” → “will not”) to increase clarity and
compatibility with downstream models. Repeated punctuation (e.g., …, !!,
??) was reduced to single marks, and all irregular spacing was
normalized. The resulting cleaned utterances (ClientClean,
TherapistClean) preserved the semantic content of the
original dialogue while minimizing noise, enabling more reliable
summarization and classification.
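The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline actually used: the contraction table, the regular expressions, and the function name `clean_utterance` are assumptions made for the example, and the order of operations is a guess.

```python
import re

# Illustrative contraction table (an assumption); the full list used in the
# study is not given in the text.
CONTRACTIONS = {
    "won't": "will not",
    "can't": "cannot",
    "don't": "do not",
    "it's": "it is",
}

TIMESTAMP = re.compile(r"\[\d{1,2}:\d{2}(?::\d{2})?\]")  # [00:15], [1:15:30]
ANNOTATION = re.compile(r"\[[^\]]*\]|\([^)]*\)")         # [laughs], (sigh)
REPEATED_PUNCT = re.compile(r"([.!?])\1+")                # ..., !!, ??


def clean_utterance(text: str) -> str:
    """Strip timestamps and annotations, expand contractions, tidy punctuation."""
    # Normalize the typographic apostrophe and ellipsis to ASCII first.
    text = text.replace("\u2019", "'").replace("\u2026", ".")
    text = TIMESTAMP.sub(" ", text)
    # Caution: this removes every bracketed or parenthesized span, so a genuine
    # parenthetical remark would be dropped as well.
    text = ANNOTATION.sub(" ", text)
    for short, full in CONTRACTIONS.items():
        # Case-insensitive match; the replacement is always lowercase.
        text = re.sub(rf"\b{re.escape(short)}\b", full, text, flags=re.IGNORECASE)
    text = REPEATED_PUNCT.sub(r"\1", text)
    text = re.sub(r"\s+([.,!?])", r"\1", text)  # no space before punctuation
    return re.sub(r"\s+", " ", text).strip()
```

Applied to each row, a function of this kind would produce the ClientClean and TherapistClean columns from the Client and Therapist columns.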
To identify key psychological shifts and communicative patterns in
therapy sessions, we used a large language model
(GPT-3.5-turbo) to generate rolling summaries based on
sequential conversational windows. We piloted 1-, 2-, 3-, and 4-turn
windows and found that increasing the number of turns improved summary
reliability. A 3-turn window offered the best balance between
interpretive richness and moment-to-moment precision, and was used for
all downstream analyses. The dataset was sorted chronologically by
participant, and a moving window captured up to three consecutive
therapist–client exchanges at each point. The model received a prompt
instructing it to “Summarize the key psychological themes or shifts
in this interaction. Avoid quoting. Use 1–2 sentences.” For each
eligible turn, we saved the raw text window (RawText_3Turn)
and the corresponding summary (Summary_3Turn). This
structured, mid-level summarization provided a scalable and
interpretable representation of therapeutic process across sessions.
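As a rough sketch of the windowing step, the code below builds the RawText_3Turn text for every turn. It assumes a trailing window (the current exchange plus up to two earlier ones from the same participant) and a therapist-then-client order within each exchange; neither detail is stated in the text. The function name `rolling_windows` is hypothetical, and the call to the language model and any rule for which turns count as eligible are left out.

```python
from collections import defaultdict

# Prompt text quoted from the description above.
PROMPT = (
    "Summarize the key psychological themes or shifts in this interaction. "
    "Avoid quoting. Use 1–2 sentences."
)


def rolling_windows(rows, size=3):
    """Return one window per turn: the current exchange plus up to size - 1 earlier ones."""
    by_participant = defaultdict(list)
    for row in rows:
        by_participant[row["ID"]].append(row)

    windows = []
    for participant in sorted(by_participant):
        # Sort each participant's exchanges chronologically by turn number.
        turns = sorted(by_participant[participant], key=lambda r: r["Turn"])
        for i, current in enumerate(turns):
            span = turns[max(0, i - size + 1): i + 1]
            raw_text = "\n".join(
                f"Therapist: {t['TherapistClean']}\nClient: {t['ClientClean']}"
                for t in span
            )
            windows.append(
                {"ID": participant, "Turn": current["Turn"], "RawText_3Turn": raw_text}
            )
    return windows
```

Each RawText_3Turn string would then be sent to the model together with PROMPT, and the reply stored as Summary_3Turn.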
Scoring transcripts
We used a three-stage Build–Filter–Train approach to classify therapy transcript segments under two distinct coding schemes.
In the Build stage, candidate predictors were generated for each coding scheme. Claudia classified seven psychological processes (Motivation, Self, Cognition, Affect, Attention, Overt Behavior, Context) using three methods: instruction-based LLM classification (GPT-3.5 Turbo) with both full and short category definitions, embedding-based similarity (text-embedding-ada-002) to compare segments to category definitions in semantic space, and rule-based keyword matching. Natasa classified five social-interpersonal categories (Relational Content Focus, Understanding, Interpersonal Effectiveness, Collaboration, Ineffective) using the same LLM and keyword methods but not embeddings, as her focus was on social and contextual dimensions rather than content.
In the Filter stage, outputs from all methods were combined and run through the Boruta feature selection algorithm, separately for each task, to retain only the most reliable predictors. In the Train stage, these confirmed predictors were used to train XGBoost models that predict each category. This framework combined the interpretive flexibility of LLMs, the semantic precision of embeddings (Claudia only), and the transparency of rule-based methods into a unified, data-driven classification pipeline.
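To make two of the three feature types concrete, here is a small sketch of count-based keyword matching and of cosine similarity between two embedding vectors. The keyword lists and function names are hypothetical; the study's actual dictionaries are not shown. Scoring by counting matches is also an assumption, though it fits the integer-valued `_rule` columns in the summary output. The embedding vectors themselves would come from the embedding model, which is not called here.

```python
import math
import re

# Hypothetical keyword lists, for illustration only.
KEYWORDS = {
    "Collaboration": ["together", "we could", "let's"],
    "Understanding": ["i hear", "sounds like", "makes sense"],
}


def rule_score(text, keywords):
    """Count whole-phrase keyword matches in the text, ignoring case."""
    lowered = text.lower()
    return sum(len(re.findall(rf"\b{re.escape(k)}\b", lowered)) for k in keywords)


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

For the embedding method, `u` would be the embedding of a transcript segment and `v` the embedding of a category definition, so a higher value means the segment sits closer to that category in semantic space.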
Rule-based scoring did not work for the summaries: every summary-based rule variable was 0, except for Relational Content Focus. It did work for the raw text, however. Summary-based rule models were therefore excluded from further analysis.
summary(QdataSocial)
## ID Turn Therapy friendly Dimension1
## Min. : 1.0 Min. : 1.000 Length:707 Length:707
## 1st Qu.:13.0 1st Qu.: 4.000 Class :character Class :character
## Median :26.0 Median : 7.000 Mode :character Mode :character
## Mean :26.6 Mean : 7.306
## 3rd Qu.:40.0 3rd Qu.:11.000
## Max. :53.0 Max. :20.000
##
## Dimension2 Dimension 3 Social dimension
## Length:707 Length:707 Collaboration : 93
## Class :character Class :character Ineffective :110
## Mode :character Mode :character Interpersonal effectiveness:124
## None explictely simulated :256
## Understanding :124
##
##
## Presenting problem Intervention Strategy Therapist approach
## Length:707 Length:707 Length:707
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Therapist example language Intended effects Client
## Length:707 Length:707 Length:707
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Therapist ClientClean TherapistClean Summary_3Turn
## Length:707 Length:707 Length:707 Length:707
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## RawText_3Turn Summary_3Turn_RelationalContentFocus_rule
## Length:707 Min. :0.0000
## Class :character 1st Qu.:0.0000
## Mode :character Median :0.0000
## Mean :0.3508
## 3rd Qu.:0.0000
## Max. :4.0000
##
## Summary_3Turn_Understanding_rule Summary_3Turn_InterpersonalEffectiveness_rule
## Min. :0 Min. :0
## 1st Qu.:0 1st Qu.:0
## Median :0 Median :0
## Mean :0 Mean :0
## 3rd Qu.:0 3rd Qu.:0
## Max. :0 Max. :0
##
## Summary_3Turn_Collaboration_rule Summary_3Turn_Ineffective_rule
## Min. :0 Min. :0
## 1st Qu.:0 1st Qu.:0
## Median :0 Median :0
## Mean :0 Mean :0
## 3rd Qu.:0 3rd Qu.:0
## Max. :0 Max. :0
##
## Summary_3Turn_status_rule RawText_3Turn_RelationalContentFocus_rule
## Length:707 Min. :0.0000
## Class :character 1st Qu.:0.0000
## Mode :character Median :0.0000
## Mean :0.4328
## 3rd Qu.:0.0000
## Max. :4.0000
##
## RawText_3Turn_Understanding_rule RawText_3Turn_InterpersonalEffectiveness_rule
## Min. : 0.00 Min. :0.0000
## 1st Qu.: 0.00 1st Qu.:0.0000
## Median : 0.00 Median :0.0000
## Mean : 1.38 Mean :0.1075
## 3rd Qu.: 2.00 3rd Qu.:0.0000
## Max. :10.00 Max. :2.0000
##
## RawText_3Turn_Collaboration_rule RawText_3Turn_Ineffective_rule
## Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.00000 Median :0.0000
## Mean :0.09052 Mean :0.1188
## 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :6.00000 Max. :4.0000
##
## RawText_3Turn_status_rule Client_RelationalContentFocus_rule
## Length:707 Min. : 0.0000
## Class :character 1st Qu.: 0.0000
## Mode :character Median : 0.0000
## Mean : 0.5021
## 3rd Qu.: 0.0000
## Max. :10.0000
## NA's :2
## Client_Understanding_rule Client_InterpersonalEffectiveness_rule
## Min. : 0.0000 Min. : 0.0000
## 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median : 0.0000 Median : 0.0000
## Mean : 0.5702 Mean : 0.4482
## 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :10.0000 Max. :10.0000
## NA's :2 NA's :2
## Client_Collaboration_rule Client_Ineffective_rule Client_status_rule
## Min. : 0.0000 Min. : 0.0000 Length:707
## 1st Qu.: 0.0000 1st Qu.: 0.0000 Class :character
## Median : 0.0000 Median : 0.0000 Mode :character
## Mean : 0.2014 Mean : 0.7773
## 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :10.0000 Max. :10.0000
## NA's :2 NA's :2
## Therapist_RelationalContentFocus_rule Therapist_Understanding_rule
## Min. :0.0000 Min. : 0.000
## 1st Qu.:0.0000 1st Qu.: 0.000
## Median :0.0000 Median : 0.000
## Mean :0.3914 Mean : 1.099
## 3rd Qu.:0.0000 3rd Qu.: 2.000
## Max. :6.0000 Max. :10.000
## NA's :12 NA's :12
## Therapist_InterpersonalEffectiveness_rule Therapist_Collaboration_rule
## Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000
## Mean :0.4777 Mean :0.1065
## 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :6.0000 Max. :6.0000
## NA's :12 NA's :12
## Therapist_Ineffective_rule Therapist_status_rule
## Min. :0.0000 Length:707
## 1st Qu.:0.0000 Class :character
## Median :0.0000 Mode :character
## Mean :0.3712
## 3rd Qu.:0.0000
## Max. :6.0000
## NA's :12
## Summary_3Turn_RelationalContentFocus_full Summary_3Turn_Understanding_full
## Min. : 3.000 Min. : 6.500
## 1st Qu.: 8.500 1st Qu.: 9.000
## Median : 8.500 Median : 9.000
## Mean : 8.544 Mean : 8.868
## 3rd Qu.: 8.500 3rd Qu.: 9.000
## Max. :10.000 Max. :10.000
##
## Summary_3Turn_Interpersonaleffectiveness_full Summary_3Turn_Collaboration_full
## Min. :6.000 Min. :3.000
## 1st Qu.:7.500 1st Qu.:8.000
## Median :9.000 Median :8.000
## Mean :8.571 Mean :8.041
## 3rd Qu.:9.500 3rd Qu.:9.000
## Max. :9.500 Max. :9.500
##
## Summary_3Turn_Ineffective_full Summary_3Turn_RelationalContentFocus_short
## Min. :0.000 Min. : 8.000
## 1st Qu.:2.000 1st Qu.: 8.000
## Median :2.000 Median : 8.500
## Mean :1.917 Mean : 8.498
## 3rd Qu.:2.000 3rd Qu.: 9.000
## Max. :9.000 Max. :10.000
##
## Summary_3Turn_Understanding_short
## Min. : 6.500
## 1st Qu.: 8.000
## Median : 9.000
## Mean : 8.579
## 3rd Qu.: 9.000
## Max. :10.000
##
## Summary_3Turn_Interpersonaleffectiveness_short
## Min. :6.000
## 1st Qu.:7.000
## Median :7.000
## Mean :7.422
## 3rd Qu.:7.500
## Max. :9.500
##
## Summary_3Turn_Collaboration_short Summary_3Turn_Ineffective_short
## Min. :4.000 Min. :1.000
## 1st Qu.:8.000 1st Qu.:2.000
## Median :8.000 Median :2.000
## Mean :8.056 Mean :2.295
## 3rd Qu.:8.500 3rd Qu.:3.000
## Max. :9.500 Max. :5.000
##
## RawText_3Turn_RelationalContentFocus_full RawText_3Turn_Understanding_full
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 8.000 1st Qu.: 8.500
## Median : 8.000 Median : 9.000
## Mean : 8.088 Mean : 8.544
## 3rd Qu.: 9.000 3rd Qu.: 9.500
## Max. :10.000 Max. :10.000
##
## RawText_3Turn_Interpersonaleffectiveness_full RawText_3Turn_Collaboration_full
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 7.500 1st Qu.: 7.500
## Median : 9.000 Median : 8.000
## Mean : 8.176 Mean : 7.574
## 3rd Qu.: 9.500 3rd Qu.: 9.000
## Max. :10.000 Max. :10.000
##
## RawText_3Turn_Ineffective_full RawText_3Turn_RelationalContentFocus_short
## Min. : 0.000 Min. : 2.000
## 1st Qu.: 0.000 1st Qu.: 8.000
## Median : 0.000 Median : 8.000
## Mean : 1.439 Mean : 8.073
## 3rd Qu.: 2.000 3rd Qu.: 8.000
## Max. :10.000 Max. :10.000
##
## RawText_3Turn_Understanding_short
## Min. : 3.000
## 1st Qu.: 8.500
## Median : 9.000
## Mean : 8.588
## 3rd Qu.: 9.000
## Max. :10.000
##
## RawText_3Turn_Interpersonaleffectiveness_short
## Min. : 3.000
## 1st Qu.: 7.000
## Median : 8.000
## Mean : 7.857
## 3rd Qu.: 9.000
## Max. :10.000
##
## RawText_3Turn_Collaboration_short RawText_3Turn_Ineffective_short
## Min. : 1.000 Min. :0.000
## 1st Qu.: 7.000 1st Qu.:2.000
## Median : 8.000 Median :2.000
## Mean : 7.718 Mean :2.187
## 3rd Qu.: 9.000 3rd Qu.:2.000
## Max. :10.000 Max. :9.000
##
## Client_RelationalContentFocus_full Client_Understanding_full
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 8.000 1st Qu.: 8.000
## Median : 8.500 Median : 9.000
## Mean : 7.961 Mean : 8.349
## 3rd Qu.: 8.500 3rd Qu.: 9.000
## Max. :10.000 Max. :10.000
##
## Client_Interpersonaleffectiveness_full Client_Collaboration_full
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 7.000 1st Qu.: 6.000
## Median : 7.500 Median : 6.000
## Mean : 7.278 Mean : 6.098
## 3rd Qu.: 7.500 3rd Qu.: 7.500
## Max. :10.000 Max. :10.000
##
## Client_Ineffective_full Client_RelationalContentFocus_short
## Min. :0.000 Min. : 0.000
## 1st Qu.:2.000 1st Qu.: 8.000
## Median :2.000 Median : 8.000
## Mean :3.076 Mean : 8.061
## 3rd Qu.:3.000 3rd Qu.: 8.000
## Max. :9.500 Max. :10.000
##
## Client_Understanding_short Client_Interpersonaleffectiveness_short
## Min. : 0.00 Min. :0.000
## 1st Qu.: 7.00 1st Qu.:6.000
## Median : 7.50 Median :7.000
## Mean : 7.67 Mean :6.401
## 3rd Qu.: 9.00 3rd Qu.:7.000
## Max. :10.00 Max. :9.000
##
## Client_Collaboration_short Client_Ineffective_short
## Min. :0.000 Min. :0.000
## 1st Qu.:5.000 1st Qu.:3.000
## Median :6.000 Median :3.000
## Mean :5.583 Mean :3.475
## 3rd Qu.:6.000 3rd Qu.:3.000
## Max. :9.000 Max. :9.000
##
## Therapist_RelationalContentFocus_full Therapist_Understanding_full
## Min. :0.000 Min. : 0.00
## 1st Qu.:8.500 1st Qu.: 9.00
## Median :8.500 Median : 9.00
## Mean :8.149 Mean : 8.61
## 3rd Qu.:8.500 3rd Qu.: 9.00
## Max. :9.000 Max. :10.00
##
## Therapist_Interpersonaleffectiveness_full Therapist_Collaboration_full
## Min. :0.000 Min. :0.000
## 1st Qu.:7.500 1st Qu.:6.500
## Median :7.500 Median :8.000
## Mean :8.081 Mean :7.474
## 3rd Qu.:9.500 3rd Qu.:9.000
## Max. :9.500 Max. :9.500
##
## Therapist_Ineffective_full Therapist_RelationalContentFocus_short
## Min. :0.000 Min. : 0.000
## 1st Qu.:1.000 1st Qu.: 8.000
## Median :2.000 Median : 8.000
## Mean :1.731 Mean : 7.946
## 3rd Qu.:2.000 3rd Qu.: 8.000
## Max. :9.000 Max. :10.000
##
## Therapist_Understanding_short Therapist_Interpersonaleffectiveness_short
## Min. : 0.000 Min. :0.000
## 1st Qu.: 7.500 1st Qu.:7.000
## Median : 9.000 Median :7.000
## Mean : 8.219 Mean :7.178
## 3rd Qu.: 9.000 3rd Qu.:7.500
## Max. :10.000 Max. :9.500
##
## Therapist_Collaboration_short Therapist_Ineffective_short
## Min. :0.000 Min. :0.000
## 1st Qu.:6.000 1st Qu.:2.000
## Median :7.000 Median :2.000
## Mean :7.195 Mean :2.214
## 3rd Qu.:8.500 3rd Qu.:2.500
## Max. :9.500 Max. :9.000
##
## Summary_3Turn_status_full Summary_3Turn_status_short RawText_3Turn_status_full
## Length:707 Length:707 Length:707
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## RawText_3Turn_status_short Client_status_full Client_status_short
## Length:707 Length:707 Length:707
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Therapist_status_full Therapist_status_short MasterID
## Length:707 Length:707 Min. : 1.0
## Class :character Class :character 1st Qu.:177.5
## Mode :character Mode :character Median :354.0
## Mean :354.0
## 3rd Qu.:530.5
## Max. :707.0
##
## Dimension1_clean label_motivation label_cognition label_self
## Length:707 Min. :0.0000 Min. :0.0000 Min. :0.0000
## Class :character 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Mode :character Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.1202 Mean :0.1584 Mean :0.1726
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
##
## label_affect label_attention label_overt_behavior label_context
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.1457 Mean :0.1443 Mean :0.1287 Mean :0.1301
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
##
## Social_binary SocialDim_Collaboration SocialDim_Ineffective
## Other : 0 Min. :0.0000 Min. :0.0000
## Ineffective:110 1st Qu.:0.0000 1st Qu.:0.0000
## NA's :597 Median :0.0000 Median :0.0000
## Mean :0.1315 Mean :0.1556
## 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000
##
## SocialDim_Interpersonaleffectiveness SocialDim_Noneexplictelysimulated
## Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000
## Mean :0.1754 Mean :0.3621
## 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000
##
## SocialDim_Understanding
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.1754
## 3rd Qu.:0.0000
## Max. :1.0000
##
To identify the most reliable predictors for each social-interpersonal coding category, we applied the Boruta feature selection algorithm to five target outcomes: SocialDim_Collaboration, SocialDim_Ineffective, SocialDim_Interpersonaleffectiveness, SocialDim_Noneexplictelysimulated, and SocialDim_Understanding.
For each of these variables, Boruta was run separately using all available model-generated features (LLM-based scores, rule-based matching, etc.). The algorithm works by comparing the importance of real predictors to “shadow” variables, which are random-noise versions of each feature. Predictors are retained only if they consistently outperform this noise baseline.
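Each feature is classified as confirmed (consistently more important than the best shadow feature), tentative (no clear decision within the allotted iterations), or rejected (not more important than the shadow features).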
This process ensures that only robust predictors enter the next modeling stage.
A shadow feature is a shuffled version of a real predictor. It has no actual relationship to the outcome, so it acts as a baseline for what “random importance” looks like. Boruta tests whether each real predictor consistently performs better than the best-performing shadow. This allows it to confidently reject weak or noisy features.
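The shadow-feature idea can be illustrated with a deliberately simplified sketch. Boruta itself uses random-forest importance and a statistical test on the hit counts; here absolute Pearson correlation stands in for importance, and the function only counts how often each real feature beats the best shadow. All names (`abs_corr`, `shadow_hits`) are hypothetical.

```python
import random
import statistics


def abs_corr(xs, ys):
    """Absolute Pearson correlation, used as a stand-in for forest importance."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return abs(sxy) / (sxx * syy) ** 0.5 if sxx and syy else 0.0


def shadow_hits(features, outcome, n_runs=100, seed=0):
    """Count, per feature, the runs in which it beats the best shuffled copy."""
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(n_runs):
        shadow_scores = []
        for values in features.values():
            shuffled = list(values)
            rng.shuffle(shuffled)  # shuffling breaks any link to the outcome
            shadow_scores.append(abs_corr(shuffled, outcome))
        best_shadow = max(shadow_scores)
        for name, values in features.items():
            if abs_corr(values, outcome) > best_shadow:
                hits[name] += 1
    return hits
```

A feature that wins in nearly every run is a candidate for "confirmed", while one that rarely wins behaves like noise.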
All five target categories produced meaningful results:
SocialDim_Ineffective: 15 confirmed predictors
SocialDim_Noneexplictelysimulated: 14 confirmed (this category is mostly effective behavior, because the instructions and the default mode of LLM generation are to be interpersonally validating and effective)
SocialDim_Interpersonaleffectiveness: 11 confirmed
SocialDim_Understanding: 10 confirmed
SocialDim_Collaboration: 6 confirmed
This suggests that ineffective and unstructured behaviors are the easiest to predict, while collaborative behaviors may be harder to isolate.
These patterns highlight the strength of raw textual features and LLM-based classifications (especially full-prompt) in capturing interpersonal and contextual nuances. In contrast, rule-based methods are more likely to introduce noise or ambiguity.
The interactive tables below break down the Boruta results by text source, model type, and coding category.
This table presents all predictors retained from the Boruta feature selection output, with each row corresponding to a single predictor and its attributes. The columns identify the text type the feature came from (e.g., raw or summarised transcript), the turn in the dialogue, the feature dimension it belongs to, and the model or method that produced it. For each target category, the table displays the predictor’s Boruta importance score, colour-coded to reflect its selection decision (confirmed, tentative, or rejected). The table is interactive and sortable, allowing the user to filter or arrange predictors based on their source, position in the transcript, feature grouping, method of generation, or category-specific importance.
XGBoost is a machine learning method that builds a series of decision trees, where each new tree tries to fix the mistakes made by the previous ones (Friedman, 2001; Chen & Guestrin, 2016). Because it adds trees one at a time and learns from errors as it goes, it can model very complex patterns, including situations where predictors interact in non-obvious ways. XGBoost is designed to be fast, work well with large numbers of predictors, and avoid “overfitting” (when a model learns the training data too perfectly but performs poorly on new data). It does this by limiting tree depth, using random subsets of the data and predictors for each tree, and slowing down learning so that each step makes only a small adjustment.
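The core idea, that each new tree corrects the errors left by the previous ones in small steps, can be shown with a toy example. The sketch below is plain gradient boosting with one-split trees (stumps) on squared error for a single predictor. It is not XGBoost, which adds a logistic objective, regularization, and row and column subsampling. Function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single split on xs that best fits the residuals (least squares)."""
    best = None
    # Candidate thresholds exclude the maximum so the right side is never empty;
    # xs must therefore contain at least two distinct values.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_trees=50, eta=0.05):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        # eta shrinks each correction so no single tree dominates.
        preds = [
            p + eta * (left_mean if x <= threshold else right_mean)
            for x, p in zip(xs, preds)
        ]
    return base, stumps, preds
```

With a small `eta`, the training error falls gradually as trees are added, which is why the number of rounds matters when the learning rate is low.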
Procedure. For each outcome we wanted to predict (e.g., Motivation), we started with the features Boruta had marked as “confirmed” for that outcome and built a dataset using only those predictors. We trained an XGBoost model with settings designed to balance complexity and generalization: tree depth limited to 3 (max_depth = 3), a modest learning rate of 0.05 (eta = 0.05), sampling 80% of the data for each tree (subsample = 0.8), and sampling 80% of the predictors for each tree (colsample_bytree = 0.8). The model’s objective was binary logistic classification (objective = “binary:logistic”) and performance was tracked using the area under the ROC curve (eval_metric = “auc”).
To check that the model wasn’t just memorizing the training data, we used 5-fold cross-validation. We split the data into five roughly equal parts, trained the model on four parts, and tested it on the part left out. We repeated this process five times so that every part of the data was used once for testing. The results from the five test runs were averaged to give a more reliable measure of how the model would perform on new data. This guards against an overly optimistic performance estimate, because in each round the model is tested on data it has not seen during training. We also used early stopping (ending training if performance didn’t improve for 10 rounds) to prevent the model from getting too complex.
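The procedure can be outlined in code as follows. The parameter values are the ones stated in the text; the way the folds are built, the random seed, and the helper names `k_fold_indices` and `best_round` are assumptions made for illustration. No boosting library is called here, so the per-round AUC values would have to come from the actual training run.

```python
import random

# Settings as reported in the text.
PARAMS = {
    "max_depth": 3,
    "eta": 0.05,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "objective": "binary:logistic",
    "eval_metric": "auc",
}


def k_fold_indices(n_rows, k=5, seed=0):
    """Return k (train_indices, test_indices) pairs; each row is tested exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    all_rows = set(indices)
    return [(sorted(all_rows - set(test)), sorted(test)) for test in folds]


def best_round(auc_per_round, patience=10):
    """Early stopping: stop once AUC has not improved for `patience` rounds."""
    best_auc, best_idx = float("-inf"), -1
    for idx, auc in enumerate(auc_per_round):
        if auc > best_auc:
            best_auc, best_idx = auc, idx
        elif idx - best_idx >= patience:
            break
    return best_idx, best_auc
```

With 707 rows the five folds cannot be exactly equal, so this split yields two test folds of 142 rows and three of 141.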
Model performance was evaluated using accuracy, a confusion matrix (showing correct and incorrect predictions), the AUC, and a ranked list of the most important predictors based on the model’s internal feature importance scores.
References
Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16), San Francisco, California, USA. https://doi.org/10.1145/2939672.2939785
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5), 1189–1232. https://doi.org/10.1214/aos/1013203451
Did the therapist show behavior that was collaborative, understanding, or interpersonally effective? It may be hard to distinguish among the positive behaviors, but easy to distinguish positive from negative.
(PREDICT STAGE). Everything explained here
Material for results section
We trained a model to predict when therapist–client interactions would be classified as ineffective, using features that focused specifically on signs of ineffective behavior. The model was tested on 707 examples and achieved an overall accuracy of 94.3%, meaning it gave the correct answer in about 94 of every 100 cases. The 95% confidence interval for this accuracy ranged from 92.4% to 95.9%.
When the actual label was “effective” (coded as 0), the model correctly predicted this 590 times and incorrectly called it “ineffective” only 7 times. Because 0 is treated as the positive class, this gives a sensitivity of 98.8%, showing that the model is very good at spotting effective interactions. It was less accurate at identifying truly ineffective behavior (coded as 1): it correctly labeled 77 of the 110 ineffective cases and missed 33, giving a specificity of 70.0%. In other words, the model tended to over-predict effectiveness, sometimes failing to recognize when behavior was actually ineffective.
Despite this, when the model predicted a case as effective, it was right 94.7% of the time (positive predictive value), and when it predicted ineffective, it was correct 91.7% of the time (negative predictive value). The balanced accuracy, which averages sensitivity and specificity, was 84.4%. This helps account for the fact that most of the data (about 84%) belonged to the “effective” group.
We also ran McNemar’s test, which checks whether the two types of error were equally frequent. The result (p < 0.001) showed a statistically significant asymmetry: the model was more likely to miss an ineffective case (33 times) than to wrongly flag an effective one (7 times). This means that while the model performs very well overall, it tends to err on the side of assuming behavior is effective unless it’s clearly not.
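For reference, McNemar's test uses only the two off-diagonal cells of the confusion matrix. The sketch below assumes the continuity-corrected chi-square version with one degree of freedom, which appears to be what produced the p-value in the printed output for this model. The function name `mcnemar_p` is hypothetical.

```python
import math


def mcnemar_p(b, c, correction=True):
    """P-value of McNemar's test from the two off-diagonal counts of a 2x2 table."""
    if b + c == 0:
        return 1.0  # no disagreements, so there is nothing to test
    diff = abs(b - c) - (1 if correction else 0)
    chi2 = max(diff, 0) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


# Off-diagonal counts from the printed matrix for the Ineffective model:
# 33 ineffective cases predicted as 0, and 7 effective cases predicted as 1.
p_ineffective = mcnemar_p(33, 7)
```

With these counts the statistic is (|33 - 7| - 1)^2 / 40 = 15.625, and the resulting p-value is close to the 7.723e-05 shown in the output.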
Overall, the model did an excellent job of predicting ineffective behavior using only behavior-focused features, though it was slightly better at ruling out ineffectiveness than confirming it.
## Predictor Importance
## <char> <num>
## 1: RawText_3Turn_Ineffective_full 0.711
## 2: RawText_3Turn_Ineffective_short 0.264
## 3: Therapist_Ineffective_full 0.016
## 4: Summary_3Turn_Ineffective_short 0.009
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 590 33
## 1 7 77
##
## Accuracy : 0.9434
## 95% CI : (0.9238, 0.9593)
## No Information Rate : 0.8444
## P-Value [Acc > NIR] : 2.729e-16
##
## Kappa : 0.7617
##
## Mcnemar's Test P-Value : 7.723e-05
##
## Sensitivity : 0.9883
## Specificity : 0.7000
## Pos Pred Value : 0.9470
## Neg Pred Value : 0.9167
## Prevalence : 0.8444
## Detection Rate : 0.8345
## Detection Prevalence : 0.8812
## Balanced Accuracy : 0.8441
##
## 'Positive' Class : 0
##
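The statistics printed above follow directly from the four cell counts. The sketch below recomputes them, treating class 0 as the positive class, as the output states. The function name `confusion_stats` is hypothetical, and no guard against empty rows or columns is included.

```python
def confusion_stats(tp, fn, fp, tn):
    """Summary statistics for a 2x2 confusion matrix with a chosen positive class."""
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)  # share of actual positives that were found
    specificity = tn / (tn + fp)  # share of actual negatives that were found
    # Agreement expected by chance, needed for Cohen's kappa.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "kappa": (accuracy - expected) / (1 - expected),
    }


# Cells of the Ineffective model with class 0 as positive:
# predicted 0 / actual 0 = 590, predicted 1 / actual 0 = 7,
# predicted 0 / actual 1 = 33,  predicted 1 / actual 1 = 77.
INEFFECTIVE = confusion_stats(tp=590, fn=7, fp=33, tn=77)
```

Because 0 is the positive class, sensitivity here describes how well the model recovers cases that are not ineffective, while specificity (77 of 110) describes how well it catches ineffective ones.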
No effectiveness variable survives the filter.
We could predict effective behavior, as Boruta shows that many variables are important for it. However, our measure of effectiveness was not specific to any one effective behavior (e.g., collaboration, understanding, and so on may all have contributed to the effectiveness score).
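We can probably predict collaboration to a limited degree. Whether the result can be called significant is uncertain: in the output below, accuracy (0.8812) does not significantly exceed the no-information rate (0.8685; p = 0.1724).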
## Predictor Importance
## <char> <num>
## 1: RawText_3Turn_Collaboration_rule 0.577
## 2: Therapist_Collaboration_rule 0.261
## 3: RawText_3Turn_Collaboration_full 0.163
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 612 82
## 1 2 11
##
## Accuracy : 0.8812
## 95% CI : (0.855, 0.9041)
## No Information Rate : 0.8685
## P-Value [Acc > NIR] : 0.1724
##
## Kappa : 0.1811
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9967
## Specificity : 0.1183
## Pos Pred Value : 0.8818
## Neg Pred Value : 0.8462
## Prevalence : 0.8685
## Detection Rate : 0.8656
## Detection Prevalence : 0.9816
## Balanced Accuracy : 0.5575
##
## 'Positive' Class : 0
##
## Predictor Importance
## <char> <num>
## 1: RawText_3Turn_Understanding_rule 0.587
## 2: Therapist_Understanding_rule 0.159
## 3: RawText_3Turn_Understanding_full 0.146
## 4: RawText_3Turn_Understanding_short 0.108
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 583 115
## 1 0 9
##
## Accuracy : 0.8373
## 95% CI : (0.808, 0.8638)
## No Information Rate : 0.8246
## P-Value [Acc > NIR] : 0.2012
##
## Kappa : 0.1143
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 1.00000
## Specificity : 0.07258
## Pos Pred Value : 0.83524
## Neg Pred Value : 1.00000
## Prevalence : 0.82461
## Detection Rate : 0.82461
## Detection Prevalence : 0.98727
## Balanced Accuracy : 0.53629
##
## 'Positive' Class : 0
##