Load needed libraries

library(openxlsx)
library(readr)
library(readxl)
library(dplyr)
library(irr)
library(bookdown)
library(scales)
library(RColorBrewer)
library(ggplot2)
library(caret)
library(stringr)
library(stringi)
library(viridis)
library(tibble)
library(writexl)
library(corrplot)
library(Hmisc)

Pilot analysis

We conducted a pilot study to analyze two critical design choices (beyond selecting the large language model (LLM)) that would be essential for any subsequent analysis:

  1. How to formulate prompts.

  2. How to convert BART’s scoring estimates into binary (or ordinal) classifications suitable for comparison with human coding.

For this pilot, we focused on a small set of eight political issues: “immigration” (l1), “climate change” (l2), “health care” (l3), “elections” (l4), “Trump” (l5), “foreign affairs” (l6), “civil rights (l7),” and “culture” (l8).

We created synthetic posts simulating content from X/Twitter using ChatGPT-4, prompted with iterative instructions to generate posts reflecting diverse perspectives on each issue.

This method yielded 777 synthetic posts, roughly balanced across the eight issues. Each post was labeled with a 1 for at least one relevant issue and typically received a 0 for the remaining seven, except in incidental cases.

Two expert coders have classified each sentence into 0/1.

If one prefers to work interactively with an R script, all the information and processes developed in this file regarding the Pilot analysis can be reproduced in the script Markdown_PilotDataset.R.

Load the dataset

The input file containing the pilot dataset is Pilot Data.xlsx, located in the Data folder.

pilot_data <- read_excel("Data/Pilot Data.xlsx",sheet='Complete Data All Issues')
#head(pilot_data) #Uncomment to view the data header.

Initially, pilot_data has 777 observations and 26 variables: each text is assigned an ID, along with the sentence itself (sequence), and for each of the 8 topics, three classification columns: one ending in ConsensusTruth, which represents the consensus between expert 1 and expert 2, one ending in Code1, and another ending in Code2.

For example, for immigration we have Imm_ConsensusTruth, Imm_Code1 and Imm_Code2.

Classification process

The first step is to obtain the classifications for each text in the pilot dataset, considering four hypotheses:

  • h1: “This text is about []”
  • h2: “This text focuses on []”
  • h3: “The main subject of this text is []”
  • h4: “This text is related to []”, and the eight previously defined issues (l1–l8).

To do so, we apply the textZeroShot function of the text library combined with two loops to consider all the h1-h4 and all the issues, labeled as l1-l8.

Since the pilot dataset was synthetically generated, there is no need to perform any text-cleaning process in this case.

After the classification process, pilot_data will contain 58 variables: the initial 26, plus 32 classifications for each hi_lj combination (referring to the i-th prompt and the j-th topic).

labels= c("Immigration", "Health Care", "Foreign Affairs",'Elections',"Civil Rights","Culture","Trump","Climate Change")
hypothesis=c("This text is about {}.", "This text focuses on {}.", "The main subject of this text is {}", "This text is related to {}")

To run the classification process using BART, apply the ZSC function included in Markdown_PilotDataset.R. This function requires the following input parameters: (bart function - Bayesian Additive Regression Trees - RDocumentation).

  • pilot_data: a dataframe containing the 777 texts to be classified.
  • labels: a vector comprising the 8 issues to be considered.
  • hypothesis: a vector encompassing the 4 working prompts.

The classification results are appended to the pilot_data dataframe, resulting in 32 new columns corresponding to each hi_lj combination, where hi denotes the hypothesis i and lj represents the label j.

This approach facilitates zero-shot classification using the BART model, enabling the evaluation of the relationship between texts and formulated hypotheses without additional training.

This part of the process is provided in scripts separate from the rest of the analysis, as the execution time is high and may slow down or cause memory issues during the compilation of the Markdown document.

Analysis of the metrics

The input for this section is the pilot_data dataframe created in the previous section Pilot analysis/Classification process. Excel file Pilot_classification.xlsx, located in the Data folder can be used instead.

# Path to the Excel file
excel_file <- "Data/Pilot_classification.xlsx"
LLM <- read_excel(excel_file)%>%select(ID,TEXT, h1_l1, h1_l2, h1_l3, h1_l4, h1_l5, h1_l6, h1_l7, h1_l8, h2_l1, h2_l2, h2_l3, h2_l4, h2_l5, h2_l6, h2_l7, h2_l8, h3_l1, h3_l2, h3_l3, h3_l4, h3_l5, h3_l6,  h3_l7, h3_l8, h4_l1, h4_l2, h4_l3, h4_l4, h4_l5, h4_l6, h4_l7, h4_l8)
human <- read_excel(excel_file)%>%
  select(ID,TEXT, Imm_ConsensusTruth, Imm_Code1, Imm_Code2, HealthCare_ConsensusTruth, HealthCare_Code1, HealthCare_Code2, ForAff_ConsensusTruth, ForAff_Code1, ForAff_Code2, Elections_ConsensusTruth, Elections_Code1, Elections_Code2, CivRights_ConsensusTruth, CivRights_Code1, CivRights_Code2, Culture_ConsensusTruth, Culture_Code1, Culture_Code2, Trump_ConsensusTruth, Trump_Code1, Trump_Code2, Env_ConsensusTruth, Env_Code1, Env_Code2)

BART scores have to be binarized in order to compare them agains human classifications. A grid of binarization thresholds between 0.3 and 0.7 was employed to assess the impact of different threshold values on the comparison metrics.

thresholds <- c(0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7)

# Function to apply threshold
apply_threshold <- function(value, threshold) {
  ifelse(value >= threshold, 1, 0)
}
# Columns to apply the threshold. It considers only the relevant columns that include classifications.
classifications <- setdiff(colnames(LLM), c("ID", "TEXT"))

# Initialize a list to store the DataFrames
threshold_dataframes <- list()
for (threshold in thresholds) {
  # Generate DataFrame name dynamically (ensure integer conversion)
  df_name <- sprintf("LLM_%02d", as.integer(threshold * 100))
  
  # Create a copy of the DataFrame and apply the threshold
  LLM_copy <- LLM
  LLM_copy[classifications] <- lapply(LLM_copy[classifications], apply_threshold, threshold)
  LLM_copy$TH <- threshold
  
  # Store in the list
  threshold_dataframes[[df_name]] <- LLM_copy
}
# Uncomment following lines to save different binarizations as Excel
# wb <- createWorkbook()
# for (name in names(threshold_dataframes)) { 
# addWorksheet(wb, name)  
# writeData(wb, name, threshold_dataframes[[name]])}
# saveWorkbook(wb, "Data/binarizations.xlsx", overwrite = TRUE)

Next, we organize the variables in order to proceed with the comparison between the binarized BART classifications and those provided by the human coders.

# Define the column groups for humans
cols_data2 <- c('l1', 'l2', 'l3', 'l4', 'l5', 'l6', 'l7', 'l8')
# Make a copy of 'human' without 'ID' and 'TEXT'
human_copy <- human %>% select('Imm_ConsensusTruth','HealthCare_ConsensusTruth','ForAff_ConsensusTruth','Elections_ConsensusTruth','CivRights_ConsensusTruth','Culture_ConsensusTruth','Trump_ConsensusTruth','Env_ConsensusTruth')
human2 <- subset(human, select=-c(ID,TEXT))
# Rename columns
colnames(human_copy) <- cols_data2

Create an empty results DataFrame where we store, for each threshold and each hypothesis–issue combination, different goodness-of-fit metrics: False Positives (FP), False Negatives (FN), True Positives (TP), True Negatives (TN), threshold value (TH), recall, precision, accuracy, f1, Cohen’s Kappa between human consensus and BART (kappa), Fleiss’s kappa between human Code 1, Code 2 and BART (kappa_fleiss), Cohen’s Kappa between human Code 1 and Code 2 (kappa_C1C2), Cohen’s Kappa between human Code 1 and BART (kappa_C1BART), Cohen’s Kappa between human Code 2 and BART (kappa_C2BART).

For the calculation of recall, precision, accuracy, and F1, the consensus value of the human coders is considered as the ground truth.

results_df <- data.frame(Issue=character(), FP=integer(), FN=integer(), TP=integer(), TN=integer(), TH=numeric(), recall=numeric(), precision=numeric(), accuracy=numeric(), f1 = numeric(),kappa=numeric(), topic=character(),hypothesis=character(), kappa_fleiss=numeric(), kappa_C1C2=numeric(),kappa_C1BART=numeric(), kappa_C2BART=numeric(), stringsAsFactors = FALSE)

Next, through the use of loops, all possible binarizations are iterated over, and the previously defined metrics are computed accordingly.

for(df_name in names(threshold_dataframes)){
  LLM_copy <- threshold_dataframes[[df_name]]
  TH <- unique(LLM_copy$TH)
  for (i in 0:3) {  # 4 hypotheses
    for (k in 1:8) {  # 8 labels/topics per hypothesis
      c <- i * 8 + k  # Adjust index accordingly
      # Define true/predicted labels
      y_true <- human_copy[[cols_data2[k]]]
      y_pred <- LLM_copy[[classifications[c]]]
      selected_columns <- human2[, c(3*k-1, 3*k)]     
      #first column in selected_columns is BART classification
      #second column in selected_columns is Coder1 classification
      #third column in selected_columns is Coder2 classification
      selected_columns <- cbind(y_pred, selected_columns)
      # Compute confusion matrix
      cm <- table(y_true, y_pred)
      FP <- cm[1,2]
      FN <- cm[2,1]
      TP <- cm[2,2]
      TN <- cm[1,1]
      # Compute classification metrics
      precision <- ifelse((TP+FP) != 0, TP / (TP+FP), 0)
      recall <- ifelse((TP+FN) != 0, TP / (TP+FN), 0)
      accuracy <- ifelse((TP+TN+FP+FN) != 0, (TP+TN) / (TP+TN +FP+FN), 0)
      f1_score <- ifelse((precision+recall) != 0, (2*precision*recall)/(precision+recall), 0)
      # Compute Cohen's Kappa
      kappa_stat <- kappa2(data.frame(y_true, y_pred))$value
      kappa_C1C2 <- kappa2(data.frame(selected_columns[,2],selected_columns[,3]))$value
      kappa_C1BART <- kappa2(data.frame(selected_columns[,2], y_pred))$value
      kappa_C2BART <- kappa2(data.frame(selected_columns[,3], y_pred))$value
      kappa_fleiss=kappam.fleiss(as.matrix(selected_columns))$value
      # Extract 'hypothesis' from 'Issue' column (only numbers, removing 'h')
      hypothesis <- as.character(str_extract(classifications[c], "(?<=h)\\d+"))
      # Extract 'topic' from 'Issue' column (only numbers, removing 'l')
      topic <- as.character(str_extract(classifications[c], "(?<=l)\\d+"))
      # Create a new row with the results      
      new_row <- data.frame(Issue = classifications[c], FP = FP, FN = FN, TP = TP, TN = TN, TH = TH, recall = recall, precision = precision, accuracy = accuracy, f1 = f1_score, kappa=kappa_stat,topic=topic, hypothesis=hypothesis, kappa_fleiss=kappa_fleiss,kappa_C1C2=kappa_C1C2, kappa_C1BART=kappa_C1BART,kappa_C2BART=kappa_C2BART)
      # Append new row to results DataFrame
      results_df <- bind_rows(results_df, new_row) 
      # Define class labels
      classes <- c("Class 0", "Class 1")
      # Create heatmap of the confusion matrix
      cm_matrix <- matrix(c(TN, FP, FN, TP), nrow = 2, byrow = TRUE)
      cm_df <- as.data.frame(as.table(cm_matrix))
      cm_plot<-ggplot(cm_df, aes(Var2, Var1, fill = Freq))+geom_tile(color = "white")+scale_fill_gradientn(colors = c("LightYellow", "yellow", "orange", "red"), guide = "none")+geom_text(aes(label = Freq), color = "black", size =  5) + theme_minimal() + labs(title = paste("Evolution of Metrics for Prompt", hypothesis, "and Topic", topic),x = "Predicted Labels", y = "True Labels") +scale_x_discrete(labels = classes) +scale_y_discrete(labels = classes) + theme(axis.text.x = element_text(angle = 45, hjust = 1),plot.background = element_rect(fill = "white", color = NA))  # Ensure white background
      # Uncomment to print the plot
      # print(cm_plot)
      # Uncomment to save the figure
       #filename <- sprintf("Results/Pilot/Confusion Matrices/TH=%s_Confusion_Matrix_%s.png", TH, gsub("[ \n.]", "_",classifications[c]))
      #ggsave(filename=filename,plot=cm_plot,dpi=300,width=5,height=5,bg="white")
    }}}
# Uncomment to save ´results_df´as a XLSX file.
# write_xlsx(results_df, "Results/Pilot/results_pilot.xlsx")

Once all the metrics are calculated, they are represented in different contexts to facilitate their analysis and enable a decision to be made regarding the threshold.

The metrics obtained for the different thresholds and issues are located in the file results_pilot.xlsx, within the Results/Pilot folder.

colors <- c("Recall" = "#FFD700", "Precision" = "#66CD00", "Accuracy" = "#FF3030", "F1" = "#00BFFF", "Human Consensus-BART 2 codes kappa"="#D15FEE", "Humans-BART 3 codes kappa"="#FFC0CB", "Code 1-Code 2 kappa"="#C1CDC1", "Code 1-BART kappa"="#8B0000", "Code 2-BART kappa"="#27408B")


# Obtain unique combinations of hypotheses and topics
unique_combinations <- unique(results_df[, c("hypothesis", "topic")])
labels= c("Immigration", "Health Care", "Foreign Affairs",'Elections',"Civil Rights","Culture","Trump","Climate Change")
hypothesis=c("This text is about {}.", "This text focuses on {}.", "The main subject of this text is {}", "This text is related to {}")

Plot recall, accuracy, precision and f1 across the threshold

# Iterate over each unique combination of hypotheses and topics
for (i in 1:nrow(unique_combinations)) {
  hyp <- unique_combinations$hypothesis[i]
  top <- unique_combinations$topic[i]
  
  # Filter the data for the current combination
  subset <- results_df[results_df$hypothesis == hyp & results_df$topic == top, ]
  plot <- ggplot(subset, aes(x = TH)) +
    geom_line(aes(y = recall, color = "Recall"), linewidth =  1) +
    geom_point(aes(y = recall, color = "Recall"), size =  5) +
    geom_line(aes(y = precision, color = "Precision"), linewidth =  1) +
    geom_point(aes(y = precision, color = "Precision"), size =  5) +
    geom_line(aes(y = accuracy, color = "Accuracy"), linewidth =  1) +
    geom_point(aes(y = accuracy, color = "Accuracy"), size =  5) +
    geom_line(aes(y = f1, color = "F1"), linewidth =  1) +
    geom_point(aes(y = f1, color = "F1"), size =  5) +
    scale_color_manual(values = colors) +  
    scale_y_continuous(limits=c(0,1),breaks=seq(0,1,by = 0.1))+
    labs(title = paste("Prompt:", hypothesis[as.numeric(unique_combinations$hyp[i])], " Topic:", labels[as.numeric(unique_combinations$top[i])]), x = "Threshold", y = "Value", color = "Metric") + theme_minimal() + theme(panel.background = element_rect(fill = "white", color = NA), legend.position = "bottom", text = element_text(size =  12))
 # filename <- sprintf("Results/Pilot/Metrics Evolution2/%s_%s_Evolution_of_Metrics.png", hyp, top)
#  ggsave(filename = filename, plot = plot, dpi = 300, width = 10, height = 6, bg = "white")
  # Uncomment to plot
  print(plot)
}

Plot recall, accuracy, precision, f1 and human consensus-BART kappa across the threshold

for (i in 1:nrow(unique_combinations)) {
  hyp <- unique_combinations$hypothesis[i]
  top <- unique_combinations$topic[i]
  # Filter the data for the current combination
  subset <- results_df[results_df$hypothesis == hyp & results_df$topic == top, ]
  # Define the graphic
  plot <- ggplot(subset, aes(x = TH)) +
    geom_line(aes(y = recall, color = "Recall"), linewidth =  1) +
    geom_point(aes(y = recall, color = "Recall"), size =  5) +
    geom_line(aes(y = precision, color = "Precision"), linewidth =  1) +
    geom_point(aes(y = precision, color = "Precision"), size =  5) +
    geom_line(aes(y = accuracy, color = "Accuracy"), linewidth =  1) +
    geom_point(aes(y = accuracy, color = "Accuracy"), size =  5) +
    geom_line(aes(y = f1, color = "F1"), linewidth =  1) +
    geom_point(aes(y = f1, color = "F1"), size =  5) +
    geom_line(aes(y = kappa, color = "Human Consensus-BART 2 codes kappa"), linewidth =  1) +
    geom_point(aes(y = kappa, color = "Human Consensus-BART 2 codes kappa"), size =  5) +
    scale_color_manual(values = colors) + 
    scale_y_continuous(limits=c(0,1),breaks=seq(0,1,by = 0.1))+
    labs(title = paste("Prompt:", hypothesis[as.numeric(unique_combinations$hyp[i])], " Topic:", labels[as.numeric(unique_combinations$top[i])]),
         x = "Threshold", y = "Value", color = "Metric") +
    theme_minimal() +
    theme(panel.background = element_rect(fill = "white", color = NA),
          legend.position = "bottom",
          text = element_text(size =  12))
  # Uncomment to save the figure
 # filename <- sprintf("Results/Pilot/Metrics Evolution with Kappa/%s_%s_Evolution_of_Metrics_kappa.png", hyp, top)
 # ggsave(filename = filename, plot = plot, dpi = 300, width = 10, height = 6, bg = "white")
  print(plot)
}

Plot recall, accuracy, precision, f1 and Fleiss kappa across the threshold

for (i in 1:nrow(unique_combinations)) {
  hyp <- unique_combinations$hypothesis[i]
  top <- unique_combinations$topic[i]
  
  # Filter the data for the current combination
  subset <- results_df[results_df$hypothesis == hyp & results_df$topic == top, ]
  plot <- ggplot(subset, aes(x = TH)) +
    geom_line(aes(y = recall, color = "Recall"), linewidth =  1) +
    geom_point(aes(y = recall, color = "Recall"), size =  5) +
    geom_line(aes(y = precision, color = "Precision"), linewidth =  1) +
    geom_point(aes(y = precision, color = "Precision"), size =  5) +
    geom_line(aes(y = accuracy, color = "Accuracy"), linewidth =  1) +
    geom_point(aes(y = accuracy, color = "Accuracy"), size =  5) +
    geom_line(aes(y = f1, color = "F1"), linewidth =  1) +
    geom_point(aes(y = f1, color = "F1"), size =  5) +
    geom_line(aes(y = kappa, color = "Humans-BART 3 codes kappa"), linewidth =  1) +
    geom_point(aes(y = kappa, color = "Humans-BART 3 codes kappa"), size =  5) +
    scale_color_manual(values = colors) + 
    scale_y_continuous(limits=c(-0.07,1),breaks=seq(0,1,by = 0.1))+
    labs(title = paste("Prompt:", hypothesis[as.numeric(unique_combinations$hyp[i])], " Topic:", labels[as.numeric(unique_combinations$top[i])]),
         x = "Threshold", y = "Value", color = "Metric") +
    theme_minimal() +
    theme(panel.background = element_rect(fill = "white", color = NA),
          legend.position = "bottom",
          text = element_text(size =  12))
  # Uncomment to save the plot
#  filename <- sprintf("Results/Pilot/With Kappa Fleiss/%s_%s_Evolution_of_Metrics_kappa_fleiss.png", hyp, top)
#  ggsave(filename = filename, plot = plot, dpi = 300, width = 10, height = 6, bg = "white")
  print(plot)
}

Plot different combinations of Cohen’s kappas and Fleiss’ Kappa

for (i in 1:nrow(unique_combinations)) {
    hyp <- unique_combinations$hypothesis[i]
    top <- unique_combinations$topic[i]
    
    # Filter the data for the current combination
    subset <- results_df[results_df$hypothesis == hyp & results_df$topic == top, ]
    
    # Define the graphic
    plot <- ggplot(subset, aes(x = TH)) +
      geom_line(aes(y = kappa_C1C2, color = "Code 1-Code 2 kappa"), linewidth =  1) +
      geom_point(aes(y = kappa_C1C2, color = "Code 1-Code 2 kappa"), size =  5) +
      geom_line(aes(y = kappa_C1BART, color = "Code 1-BART kappa"), linewidth =  1) +
      geom_point(aes(y = kappa_C1BART, color = "Code 1-BART kappa"), size =  5) +
      geom_line(aes(y = kappa_C2BART, color = "Code 2-BART kappa"), linewidth =  1) +
      geom_point(aes(y = kappa_C2BART, color = "Code 2-BART kappa"), size =  5) +
      geom_line(aes(y = kappa_fleiss, color = "Humans-BART 3 codes kappa"), linewidth =  1) +
      geom_point(aes(y = kappa_fleiss, color = "Humans-BART 3 codes kappa"), size =  5) +
      geom_line(aes(y = kappa, color = "Human Consensus-BART 2 codes kappa"), linewidth =  1) +
      geom_point(aes(y = kappa, color = "Human Consensus-BART 2 codes kappa"), size =  5) +
      scale_color_manual(values = colors) + 
      scale_y_continuous(limits=c(-0.07,1),breaks=seq(0,1,by = 0.1))+
      labs(title = paste("Prompt:", hypothesis[as.numeric(unique_combinations$hyp[i])], " Topic:", labels[as.numeric(unique_combinations$top[i])]),
           x = "Threshold", y = "Value", color = "Metric") +
      theme_minimal() +
      theme(panel.background = element_rect(fill = "white", color = NA),
            legend.position = "bottom",
            text = element_text(size =  7))
   # Uncomment to save the plot 
   # filename <- sprintf("Results/Pilot/kappas evolution/%s_%s_Evolution_of_kappas.png", hyp, top)
   # ggsave(filename = filename, plot = plot, dpi = 300, width = 10, height = 6, bg = "white")
    print(plot)
  }  

Analysis of F1

Next, a detailed analysis of the F1 score is conducted

# Step 1: Create a list to store the tables for each hypothesis
hypothesis_tables <- list()

# Step 2: Loop through each hypothesis to create the corresponding table
for (hyp in unique(results_df$hypothesis)) {
  # Filter the data for the current hypothesis
  hypothesis_data <- results_df %>%filter(hypothesis == hyp) 

# Step 3: For each topic, find the maximum F1 score and the corresponding thresholds and accuracy
  
   max_f1_table <- hypothesis_data %>%
    group_by(topic) %>%
    # Find the maximum F1 score for each topic
    mutate(max_f1 = max(f1)) %>%
    # Filter rows where F1 is equal to the maximum F1 score
    filter(f1 == max_f1) %>%
    # Arrange by TH (threshold) to keep all thresholds where the max F1 is achieved
    arrange(topic, TH) %>%
    # Summarize the results for each topic, keeping the max F1, thresholds, and accuracy
    summarise(
      max_f1 = first(max_f1),  # Maximum F1 score (all rows will have the same value)
      thresholds = paste(TH, collapse = ", "),  # Thresholds where max F1 is achieved
      accuracy = first(accuracy)  # Accuracy corresponding to max F1 score
    ) %>%
    ungroup()
  
  # Store the table for the current hypothesis in the list
  hypothesis_tables[[hyp]] <- max_f1_table
}

Now hypothesis_tables contains a table for each hypothesis. Access the table for a specific hypothesis like this: hypothesis_tables[["Hypothesis1"]].

print(hypothesis_tables[[1]]) #This text is about
## # A tibble: 8 × 4
##   topic max_f1 thresholds accuracy
##   <chr>  <dbl> <chr>         <dbl>
## 1 1      0.936 0.3           0.991
## 2 2      0.887 0.45          0.979
## 3 3      0.688 0.3           0.923
## 4 4      0.716 0.3, 0.35     0.915
## 5 5      0.596 0.45          0.887
## 6 6      0.477 0.35          0.842
## 7 7      0.985 0.5           0.995
## 8 8      0.802 0.3           0.945
print(hypothesis_tables[[2]]) #This text focuses on
## # A tibble: 8 × 4
##   topic max_f1 thresholds      accuracy
##   <chr>  <dbl> <chr>              <dbl>
## 1 1      0.895 0.3                0.986
## 2 2      0.853 0.45, 0.5, 0.55    0.974
## 3 3      0.651 0.3                0.921
## 4 4      0.633 0.3                0.897
## 5 5      0.510 0.35               0.871
## 6 6      0.317 0.3                0.834
## 7 7      0.982 0.4, 0.45, 0.5     0.994
## 8 8      0.747 0.35, 0.4          0.936
print(hypothesis_tables[[3]]) #The main subject of this text is
## # A tibble: 8 × 4
##   topic max_f1 thresholds accuracy
##   <chr>  <dbl> <chr>         <dbl>
## 1 1      0.84  0.35          0.979
## 2 2      0.827 0.35          0.970
## 3 3      0.575 0.3           0.912
## 4 4      0.598 0.3           0.889
## 5 5      0.506 0.35          0.892
## 6 6      0.395 0.3           0.803
## 7 7      0.951 0.3           0.983
## 8 8      0.687 0.3           0.921
print(hypothesis_tables[[4]]) #This text is related to
## # A tibble: 8 × 4
##   topic max_f1 thresholds accuracy
##   <chr>  <dbl> <chr>         <dbl>
## 1 1      0.755 0.7           0.955
## 2 2      0.714 0.7           0.928
## 3 3      0.464 0.7           0.705
## 4 4      0.600 0.7           0.780
## 5 5      0.501 0.7           0.775
## 6 6      0.347 0.7           0.326
## 7 7      0.932 0.7           0.974
## 8 8      0.743 0.7           0.907

Analysis of 0.4 threshold

Preliminary analyses of the metrics allow focusing the process on the 0.4 threshold. Below is the graph presented in Figure 3 of the manuscript.

results_04 <- results_df[results_df$TH == "0.4",]
results_04H1 <- results_04[results_04$hypothesis == "1",]
results_04H1$topic_names<-c('Immigration','Health Care','Foreign Affairs', 'Elecctions', 'Civil Rights','Culture','Trump','Climate Change')

plot <- ggplot(results_04H1, aes(x = topic_names)) +
  geom_line(aes(y = recall, color = "Recall", group=1), linewidth =1) +
  geom_point(aes(y = recall, color = "Recall"), size =  5) +
  geom_line(aes(y = precision, color = "Precision", group=1), linewidth =  1) +
  geom_point(aes(y = precision, color = "Precision"), size =  5) +
  geom_line(aes(y = accuracy, color = "Accuracy", group=1), linewidth =  1) +
  geom_point(aes(y = accuracy, color = "Accuracy"), size =  5) +
  geom_line(aes(y = f1, color = "F1", group=1), linewidth =  1) +
  geom_point(aes(y = f1, color = "F1"), size =  5) +
  geom_line(aes(y = kappa_fleiss, color = "Humans-BART 3 codes kappa", group=1), linewidth =  1) +
  geom_point(aes(y = kappa_fleiss, color = "Humans-BART 3 codes kappa"), size =  5) +
  geom_line(aes(y = kappa, color = "Human Consensus-BART 2 codes kappa", group=1), linewidth =  1) +
  geom_point(aes(y = kappa, color = "Human Consensus-BART 2 codes kappa"), size =  5) +
  scale_color_manual(values = colors) + 
  scale_y_continuous(limits=c(0,1),breaks=seq(0,1,by = 0.2))+
  labs(x = "", y = "Value", color = "Metric") +
  theme_minimal() +
  theme(panel.background = element_rect(fill = "white", color = NA),
        legend.position = "bottom",
        text = element_text(size =  12))
# Uncomment to save the plot
# filename <- sprintf("Results/Pilot/Evolution_of_topics4_th04.png")
# ggsave(filename = filename, plot = plot, dpi = 300, width = 10, height = 6, bg = "white")
print(plot)

Correlations analysis

This section presents a correlation analysis for each hypothesis–issue combination.

# Generate names with structure hypothesis/topics
column_names <- as.vector(outer(labels, hypothesis, paste, sep = "_"))
# Rename variables to automatically calculate correlation matrices
colnames(LLM)[3:34] <- column_names

# List to store the correlation matrices
correlation_matrices <- list()

# Iterate over the topics
for (topic in labels) {
  # Select the columns for the topic with the different hypotheses
  columns_to_select <- paste(topic, hypothesis, sep = "_")
  selected_columns <- LLM[, columns_to_select]
  
  # Calculate the correlation matrix
  cor_matrix <- cor(selected_columns, use = "complete.obs")  
  # Use "complete.obs" to omit NA values if needed
  
  # Store the correlation matrix in the list
  correlation_matrices[[topic]] <- cor_matrix
}

# View the correlation matrix for a specific topic, for example, "Immigration"
print(correlation_matrices[["Immigration"]])
##                                                 Immigration_This text is about {}.
## Immigration_This text is about {}.                                       1.0000000
## Immigration_This text focuses on {}.                                     0.9928256
## Immigration_The main subject of this text is {}                          0.9400273
## Immigration_This text is related to {}                                   0.6758948
##                                                 Immigration_This text focuses on {}.
## Immigration_This text is about {}.                                         0.9928256
## Immigration_This text focuses on {}.                                       1.0000000
## Immigration_The main subject of this text is {}                            0.9416629
## Immigration_This text is related to {}                                     0.6564242
##                                                 Immigration_The main subject of this text is {}
## Immigration_This text is about {}.                                                    0.9400273
## Immigration_This text focuses on {}.                                                  0.9416629
## Immigration_The main subject of this text is {}                                       1.0000000
## Immigration_This text is related to {}                                                0.6728810
##                                                 Immigration_This text is related to {}
## Immigration_This text is about {}.                                           0.6758948
## Immigration_This text focuses on {}.                                         0.6564242
## Immigration_The main subject of this text is {}                              0.6728810
## Immigration_This text is related to {}                                       1.0000000
corr_foraff = correlation_matrices[["ForAff"]]
corr_culture = correlation_matrices[["Culture"]]
corr_immigration = correlation_matrices[["Immigration"]]
corr_trump = correlation_matrices[["Trump"]]
corr_health = correlation_matrices[["HealthCare"]]
corr_elections = correlation_matrices[["Elections"]]
corr_civ_rights= correlation_matrices[["CivRights"]]
corr_EnvEnCli = correlation_matrices[["EnvEnCli"]]

Use the following line to save each correlation matrix as an XLSX file. ´write_xlsx(corr_foraff, “Results/Pilot/Correlations/corr_foraff”)´.

Main project

Data for the main study are posts from the official X (Twitter) accounts of US Members of Congress and from a set of political influencers in the US. A list of all Member accounts is generated and added to this the accounts of fifty top political influencers identified by the advertising firm Amra & Elma. An initial random sample of 70,546 posts was obtained for the combined list of accounts covering the period from January 1 to June 30, 2024, using the Brandwatch archive (see Appendix for more information about the accounts). This was then scaled down to a dataset manageable for human coders by drawing a random subsample of 4,500 posts for the main dataset. Results of the pilot analysis encourage the consideration of the prompt h1: “This text is about []”. An expanded set of eighteen issues, labeled as l1 - l18, is considered: Conspiratorial Logic, The Economy in General, Donald Trump, Joe Biden, Democrats, Republicans, MAGA, Jews and Antisemitism in the US, Healthcare, Reproductive Rights, Homelessness, Immigration, Climate Change, Electric Vehicles, Elections, January 6 Insurrection, Race Relations, Resistance to Social Change or Traditional Values.

This project involved the work of three human coders, who were tasked with classifying the posts as 0/1 for the issues considered. A sampling scheme was implemented to ensure that each post and each issue received two independent human codes, from coders A and B. The PIs of teh project have reviewed and adjudicated the disagreements between coders. As a result, for each issue, there is a consensus truth variable and two independent human codings.

If one prefers to work interactively with an R script, all the information and processes developed in this file regarding the Pilot analysis can be reproduced in the script Markdown_MainProject.R.

Load the dataset

The input file containing the pilot dataset is Main Data.xlsx, located in the Data folder.

main_data <- read_excel("Data/Main Data.xlsx",sheet='Final')
#head(main_data) #Uncomment to view the data header.

Initially, main_data has 4500 observations and 56 variables: each text is assigned an ID, along with the sentence itself, and for each of the 8 topics, three classification columns: one ending in Truth, which represents the consensus between code A and code B, one ending in CodeA, and another ending in CodeB.

For example, for economy we have EconTruth, EconCodeA and EconCodeB.

In this main porject, as these are real texts obtained from a social network, the writing is quite imprecise and irregular. Although BART performs very well with this type of writing, we briefly standardize the texts to be classified by converting everything to lowercase, cleaning up contractions, and removing emojis. We keep retweets, hashtags, and user mentions. If user mentions or hashtags should also be removed, it is enough to uncomment the corresponding line in the clean_text function.

clean_text <- function(text) {
  # Convert to uppercase: consistency, pattern recognition and reduction of complexity
  text <- tolower(text)
  # Remove encoding artifacts (UTF-8 misinterpreted issues)
  text <- iconv(text, from = "UTF-8", to = "ASCII//TRANSLIT", sub = "")
  text <- stri_replace_all_regex(text, "[\\p{So}\\p{Cn}]", "")  # Elimina símbolos y caracteres desconocidos
  text <- str_replace_all(text, "\\b(i'm)\\b", "i am")
  text <- str_replace_all(text, "\\b(you're)\\b", "you are")
  text <- str_replace_all(text, "\\b(he's)\\b", "he is")
  text <- str_replace_all(text, "\\b(she's)\\b", "she is")
  text <- str_replace_all(text, "\\b(it's)\\b", "it is")
  text <- str_replace_all(text, "\\b(we're)\\b", "we are")
  text <- str_replace_all(text, "\\b(they're)\\b", "they are")
  #text <- str_replace_all(text, "[^\x01-\x7F]", "")
  # text <- str_replace_all(text, "[\\p{So}\\p{Cn}]", "")
  # Replace contractions for the verb 'to have'
  text <- str_replace_all(text, "\\b(i've)\\b", "i have")
  text <- str_replace_all(text, "\\b(you've)\\b", "you have")
  text <- str_replace_all(text, "\\b(he's)\\b", "he has")
  text <- str_replace_all(text, "\\b(she's)\\b", "she has")
  text <- str_replace_all(text, "\\b(it's)\\b", "it has")
  text <- str_replace_all(text, "\\b(we've)\\b", "we have")
  text <- str_replace_all(text, "\\b(they've)\\b", "they have")
  # Remove user mentions
  #text <- str_remove(text, "^rt\\s+@\\S+\\s*")  
  #text <- str_remove_all(text, "@\\S+")
  # Remove hashtags
  #text <- str_remove_all(text, "#\\S+")
  # Remove URLs
  text <- str_remove_all(text, "http\\S+|www\\S+")
  # New line to remove everything after 'HTTP' until a space
  text <- str_remove_all(text, "HTTP\\S*") # Removes everything from 'https:' until the first space
  # Remove problematic characters (single quotes, double quotes, apostrophes, etc.)
  text <- str_remove_all(text, "[\"'`´^¨]")
  # Remove non-alphanumeric characters, except spaces
  #    text <- str_remove_all(text, "[^\\w\\s]")
  # Remove non-alphanumeric characters **EXCEPT @ and #**
  text <- str_replace_all(text, "[^\\w\\s@#]", "")
  # Remove extra spaces
  text <- str_squish(text)
  
  return(text)
}

Apply ´clean_text´ function to the ´sequence´ variable in ´main_data´. The newly created variable with the debugged text is named ´clean_text´.

main_data$clean_text <- clean_text(main_data$sequence)
if (anyNA(main_data$sequence)) {
  warning("The 'sequence' column contains missing values (NA). It is recommended to handle these values before proceeding.")
  # Optional: Remove rows with missing values
  main_data <- main_data[!is.na(main_data$sequence), ]
}

main_data$clean_text <- ifelse(main_data$clean_text == "", "empty sentence", main_data$clean_text)
# Uncomment to save as xlsx 
#write_xlsx(main_data, "Data/Main Data.xlsx")

Classification process

The first step is to obtain the classifications for each text in the main project dataset, considering the hypothesis h1: “This text is about []”, and the eighteen previously defined issues (l1–l18).

To do so, we apply the textZeroShot function of the text library combined with a loop to consider h1 and all the issues, labeled as l1-l18, applied over the ´clean_text´ variable obtained after posts debugg (see Section Main project/Load the dataset).

After the classification process, main_data will contain 75 variables: the initial 57, plus 18 classifications for each hi_lj combination (referring to the i-th prompt and the j-th topic)

labels= c("Conspiratorial Logic", "The Economy in General","Donald Trump","Joe Biden","Democrats","Republicans","MAGA","Jews and Antisemitism in the US","Healthcare","Reproductive Rights","Homelessness","Immigration","Climate Change","Electric Vehicles","Elections","January 6 Insurrection","Race Relations","Resistance to Social Change or Traditional Values")
hypothesis=c("This text is about {}.")

To run the classification process using BART, apply the ZSC_main function included in Markdown_MainProject.R. This function requires the following input parameters: (bart function - Bayesian Additive Regression Trees - RDocumentation)

  • main_data: a dataframe containing the 4500 texts to be classified.
  • labels: a vector comprising the 18 issues to be considered.
  • hypothesis: a vector encompassing the working prompt.

The classification results are appended to the main_data dataframe, resulting in 18 new columns corresponding to each hi_lj combination, where hi denotes the hypothesis/prompt i and lj represents the label j.

This approach facilitates zero-shot classification using the BART model, enabling the evaluation of the relationship between texts and formulated hypotheses without additional training.

This part of the process is provided in scripts separate from the rest of the analysis, as the execution time is high and may slow down or cause memory issues during the compilation of the Markdown document.

It is recommended not to apply the classification procedure to all posts and all issues at once, as this may cause performance issues. The preferred approach is to partition the dataset by rows. The script provides an example of this application, using the first 200 posts and the 18 issues: ´main_data_1_200 <- ZSC_main(main_data[1:200, ], labels, hypothesis)´.

Analysis of the metrics

The input for this section is the main_data dataframe created in the previous section Main analysis/Classification process. Excel file Main_classification.xlsx, located in the Data folder can be used instead.

excel_file <- "Data/Main_classification.xlsx"
LLM <- read_excel(excel_file)%>%
  select(ID,sequence, h1_l1, h1_l2, h1_l3, h1_l4, h1_l5, h1_l6, h1_l7, h1_l8, h1_l9, h1_l10, h1_l11, h1_l12, h1_l13, h1_l14, h1_l15, h1_l16, h1_l17, h1_l18)

human <- read_excel(excel_file)%>%
  select(ID,sequence, ConsTruth, ConsCodeA, ConsCodeB, EconTruth, EconCodeA, EconCodeB, TrumpTruth, TrumpCodeA, TrumpCodeB, BidenTruth, BidenCodeA, BidenCodeB, DemsTruth, DemsCodeA, DemcCodeB, GOPTruth, GOPCodeA, GOPCodeB, MAGATruth, MAGACodeA, MAGACodeB, JewsTruth, JewsCodeA, JewsCodeB, HealthTruth, HealthCodeA, HealthCodeB, ReproTruth, ReproCodeA, ReproCodeB, HomelessTruth, HomelessCodeA, HomeLessCodeB, ImmTruth, ImmCodeA, ImmCodeB, ClimateTruth, ClimateCodeA, ClimateCodeB, ElecVTruth, ElecVCodeA, ElecVCodeB, ElectionsTruth, ElectionsCodeA, ElectionsCodeB, Jan6Truth, Jan6CodeA, Jan6CodeB, RaceTruth, RaceCodeA, RaceCodeB, SocChgTruth, SocChgCodeA, SocChgCodeB)

BART scores have to be binarized in order to compare them agains human classifications. A grid of binarization thresholds between 0.3 and 0.7 was employed to assess the impact of different threshold values on the comparison metrics.

thresholds <- c(0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7)

# Function to apply threshold
apply_threshold <- function(value, threshold) {
  ifelse(value >= threshold, 1, 0)
}
# Columns to apply the threshold. It considers only the relevant columns that include classifications.
classifications <- setdiff(colnames(LLM), c("ID", "TEXT"))

# Initialize a list to store the DataFrames
threshold_dataframes <- list()
for (threshold in thresholds) {
  # Generate DataFrame name dynamically (ensure integer conversion)
  df_name <- sprintf("LLM_%02d", as.integer(threshold * 100))
  
  # Create a copy of the DataFrame and apply the threshold
  LLM_copy <- LLM
  LLM_copy[classifications] <- lapply(LLM_copy[classifications], apply_threshold, threshold)
  LLM_copy$TH <- threshold
  
  # Store in the list
  threshold_dataframes[[df_name]] <- LLM_copy
}
# Uncomment following lines to save different binarizations as Excel
# wb <- createWorkbook()
# for (name in names(threshold_dataframes)) { 
# addWorksheet(wb, name)  
# writeData(wb, name, threshold_dataframes[[name]])}
# saveWorkbook(wb, "Data/binarizations.xlsx", overwrite = TRUE)

Next, we organize the variables in order to proceed with the comparison between the binarized BART classifications and those provided by the human coders.

# Make a copy of 'human' without 'ID' and 'TEXT'
human_copy <- human %>% select(matches("Truth$"))
human_all <- human %>% select(-"ID", -"sequence")

cols_data1 <- setdiff(names(LLM_copy), c("ID", "sequence", "clean_text", "TH"))

Create an empty results DataFrame where we store, for each threshold and each hypothesis–issue combination, different goodness-of-fit metrics: False Positives (FP), False Negatives (FN), True Positives (TP), True Negatives (TN), threshold value (TH), recall, precision, accuracy, f1, Cohen’s Kappa between human consensus and BART (kappa), Fleiss’s kappa between human Code 1, Code 2 and BART (kappa_fleiss), Cohen’s Kappa between human Code 1 and Code 2 (kappa_C1C2), Cohen’s Kappa between human Code 1 and BART (kappa_C1BART), Cohen’s Kappa between human Code 2 and BART (kappa_C2BART).

For the calculation of recall, precision, accuracy, and F1, the consensus value of the human coders is considered as the ground truth.

results_df <- data.frame(Issue=character(), FP=integer(), FN=integer(), TP=integer(), TN=integer(), TH=numeric(), recall=numeric(), precision=numeric(), accuracy=numeric(), f1 = numeric(),kappa=numeric(), topic=character(),hypothesis=character(), kappa_fleiss=numeric(), kappa_C1C2=numeric(),kappa_C1BART=numeric(), kappa_C2BART=numeric(), stringsAsFactors = FALSE)

Next, through the use of loops, all possible binarizations are iterated over, and the previously defined metrics are computed accordingly.

for(df_name in names(threshold_dataframes)){
  LLM_copy <- threshold_dataframes[[df_name]]
  TH <- unique(LLM_copy$TH)
  for (k in 1:18) {  # 8 labels/topics per hypothesis
    # Define true/predicted labels
    y_true <- human_copy[k]
    y_pred <- LLM_copy[[cols_data1[k]]]
    selected_columns <- human_all[, c(3*k-1, 3*k)]     
    selected_columns <- cbind(y_pred, selected_columns)
    
    #first column in selected_columns is BART classification
    #second column in selected_columns is Coder1 classification
    #third column in selected_columns is Coder2 classification
    # Compute confusion matrix
    cm <- table(unlist(y_true), y_pred)
    FP <- cm[1,2]
    FN <- cm[2,1]
    TP <- cm[2,2]
    TN <- cm[1,1]
    # Compute classification metrics
    precision <- ifelse((TP + FP) != 0, TP / (TP + FP), 0)
    recall <- ifelse((TP + FN) != 0, TP / (TP + FN), 0)
    accuracy <- ifelse((TP + TN + FP + FN) != 0, (TP + TN) / (TP + TN + FP + FN), 0)
    f1_score <- ifelse((precision + recall) != 0, (2 * precision * recall) / (precision + recall), 0)
    # Compute Cohen's Kappa
    kappa_stat <- kappa2(data.frame(y_true, y_pred))$value
    kappa_C1C2 <- kappa2(data.frame(selected_columns[,2], selected_columns[,3]))$value
    kappa_C1BART <- kappa2(data.frame(selected_columns[,2], y_pred))$value
    kappa_C2BART <- kappa2(data.frame(selected_columns[,3], y_pred))$value
    kappa_fleiss=kappam.fleiss(as.matrix(selected_columns))$value
    # Create a new row with the results      
    new_row <- data.frame(
      Issue = labels[k], FP = FP, FN = FN, TP = TP, TN = TN,TH=TH,
      recall = recall, precision = precision, accuracy = accuracy, 
      f1 = f1_score, kappa=kappa_stat,
      kappa_fleiss=kappa_fleiss,kappa_C1C2=kappa_C1C2,
      kappa_C1BART=kappa_C1BART,kappa_C2BART=kappa_C2BART
    )
    
    # Append new row to results DataFrame
    results_df <- bind_rows(results_df, new_row)
    # Define class labels
    classes <- c("Class 0", "Class 1")
    
  }}

# Uncomment to save ´results_df´as a XLSX file.
# write_xlsx(results_df, "Results/Main project/results_main.xlsx")

Once all the metrics are calculated, they are represented in different contexts to facilitate their analysis and enable a decision to be made regarding the threshold.

The metrics obtained for the different thresholds and issues are located in the file results_main.xlsx, within the Results/Main Project folder.

colors <- c("Recall" = "#FFD700", "Precision" = "#66CD00", "Accuracy" = "#FF3030", "F1" = "#00BFFF", "Human Consensus-BART 2 codes kappa"="#D15FEE", "Humans-BART 3 codes kappa"="#FFC0CB", "Code 1-Code 2 kappa"="#C1CDC1", "Code 1-BART kappa"="#8B0000", "Code 2-BART kappa"="#27408B")

unique_combinations <- unique(results_df[, c("Issue")])

Plot recall, accuracy, precision and f1 across the threshold

for (i in 1:length(unique_combinations)) {
  top <- unique_combinations[i]
  # Filter the data for the current combination
  subset <- results_df[results_df$Issue == top, ]
  plot <- ggplot(subset, aes(x = TH)) +
    geom_line(aes(y = recall, color = "Recall"), linewidth =  1) +
    geom_point(aes(y = recall, color = "Recall"), size =  5) +
    geom_line(aes(y = precision, color = "Precision"), linewidth =  1) +
    geom_point(aes(y = precision, color = "Precision"), size =  5) +
    geom_line(aes(y = accuracy, color = "Accuracy"), linewidth =  1) +
    geom_point(aes(y = accuracy, color = "Accuracy"), size =  5) +
    geom_line(aes(y = f1, color = "F1"), linewidth =  1) +
    geom_point(aes(y = f1, color = "F1"), size =  5) +
    scale_color_manual(values = colors) +  
    scale_y_continuous(limits=c(0,1),breaks=seq(0,1,by = 0.1))+
    labs(title = paste("Topic: ", top), x = "Threshold", y = "Value", color = "Metric") + theme_minimal() + theme(panel.background = element_rect(fill = "white", color = NA), legend.position = "bottom", text = element_text(size =  12))
  filename <- sprintf("Results/Main Project/Metrics Evolution/%s_Evolution_of_Metrics.png",  top)
  ## Uncomment to save the figure
##  ggsave(filename = filename, plot = plot, dpi = 300, width = 10, height = 6, bg = "white")
  print(plot)
}

Plot recall, accuracy, precision, f1 and human consensus-BART kappa across the threshold

for (i in 1:length(unique_combinations)) {
  top <- unique_combinations[i]
  # Filter the data for the current combination
  subset <- results_df[results_df$Issue == top, ]
  # Define the graphic
  plot <- ggplot(subset, aes(x = TH)) +
    geom_line(aes(y = recall, color = "Recall"), linewidth =  1) +
    geom_point(aes(y = recall, color = "Recall"), size =  5) +
    geom_line(aes(y = precision, color = "Precision"), linewidth =  1) +
    geom_point(aes(y = precision, color = "Precision"), size =  5) +
    geom_line(aes(y = accuracy, color = "Accuracy"), linewidth =  1) +
    geom_point(aes(y = accuracy, color = "Accuracy"), size =  5) +
    geom_line(aes(y = f1, color = "F1"), linewidth =  1) +
    geom_point(aes(y = f1, color = "F1"), size =  5) +
    geom_line(aes(y = kappa, color = "Human Consensus-BART 2 codes kappa"), linewidth =  1) +
    geom_point(aes(y = kappa, color = "Human Consensus-BART 2 codes kappa"), size =  5) +
    scale_color_manual(values = colors) + 
    scale_y_continuous(limits=c(0,1),breaks=seq(0,1,by = 0.1))+
    labs(title = paste("Topic: ", top), x = "Threshold", y = "Value", color = "Metric") + theme_minimal() + theme(panel.background = element_rect(fill = "white", color = NA), legend.position = "bottom", text = element_text(size =  12))+
    theme_minimal() +
    theme(panel.background = element_rect(fill = "white", color = NA),
          legend.position = "bottom",
          text = element_text(size =  12))
  filename <- sprintf("Results/Main Project/Metrics Evolution with Kappa/%s_Evolution_of_Metrics_kappa.png",  top)
 ## ggsave(filename = filename, plot = plot, dpi = 300, width = 10, height = 6, bg = "white")
  print(plot)
}

Plot recall, accuracy, precision, f1 and Fleiss kappa across the threshold

for (i in 1:length(unique_combinations)) {
  topic <- unique_combinations[i]
  
  # Filter the data for the current combination
  subset <- results_df[results_df$Issue == topic, ]
  # Define the graphic
  plot <- ggplot(subset, aes(x = TH)) +
    geom_line(aes(y = recall, color = "Recall"), linewidth =  1) +
    geom_point(aes(y = recall, color = "Recall"), size =  5) +
    geom_line(aes(y = precision, color = "Precision"), linewidth =  1) +
    geom_point(aes(y = precision, color = "Precision"), size =  5) +
    geom_line(aes(y = accuracy, color = "Accuracy"), linewidth =  1) +
    geom_point(aes(y = accuracy, color = "Accuracy"), size =  5) +
    geom_line(aes(y = f1, color = "F1"), linewidth =  1) +
    geom_point(aes(y = f1, color = "F1"), size =  5) +
    geom_line(aes(y = kappa_fleiss, color = "Humans-BART 3 codes kappa"), linewidth =  1) +
    geom_point(aes(y = kappa_fleiss, color = "Humans-BART 3 codes kappa"), size =  5) +
    scale_color_manual(values = colors) + 
    scale_y_continuous(limits=c(-0.09,1),breaks=seq(0,1,by = 0.1))+
    labs(title = paste("Topic: ", topic),
         x = "Threshold", y = "Value", color = "Metric")+
    theme_minimal() +
    theme(panel.background = element_rect(fill = "white", color = NA),
          legend.position = "bottom",
          text = element_text(size =  12))
  filename <- sprintf("Results/Main Project/With Kappa Fleiss/%s_Evolution_of_Metrics_kappa_fleiss.png", topic)
 ## ggsave(filename = filename, plot = plot, dpi = 300, width = 10, height = 6, bg = "white")
  print(plot)
}

Plot different combinations of Cohen’s kappas and Fleiss’ Kappa

for (i in 1:length(unique_combinations)) {
  topic <- unique_combinations[i]
  
  # Filter the data for the current combination
  subset <- results_df[results_df$Issue == topic, ]  
  # Define the graphic
  plot <- ggplot(subset, aes(x = TH)) +
    geom_line(aes(y = kappa_C1C2, color = "Code 1-Code 2 kappa"), linewidth =  1) +
    geom_point(aes(y = kappa_C1C2, color = "Code 1-Code 2 kappa"), size =  5) +
    geom_line(aes(y = kappa_C1BART, color = "Code 1-BART kappa"), linewidth =  1) +
    geom_point(aes(y = kappa_C1BART, color = "Code 1-BART kappa"), size =  5) +
    geom_line(aes(y = kappa_C2BART, color = "Code 2-BART kappa"), linewidth =  1) +
    geom_point(aes(y = kappa_C2BART, color = "Code 2-BART kappa"), size =  5) +
    geom_line(aes(y = kappa_fleiss, color = "Humans-BART 3 codes kappa"), linewidth =  1) +
    geom_point(aes(y = kappa_fleiss, color = "Humans-BART 3 codes kappa"), size =  5) +
    geom_line(aes(y = kappa, color = "Human Consensus-BART 2 codes kappa"), linewidth =  1) +
    geom_point(aes(y = kappa, color = "Human Consensus-BART 2 codes kappa"), size =  5) +
    scale_color_manual(values = colors) + 
    scale_y_continuous(limits=c(-0.09,1),breaks=seq(0,1,by = 0.1))+
    labs(title = paste("Topic: ", topic),
         x = "Threshold", y = "Value", color = "Metric") +
    scale_fill_manual(
      labels = label_wrap(35))+
    theme_minimal() +
    theme(panel.background = element_rect(fill = "white", color = NA),
          legend.position = "bottom",
          text = element_text(size =  12))
  
  filename <- sprintf("Results/Main Project/kappas evolution/%s_Evolution_of_kappas.png", topic)
  ggsave(filename = filename, plot = plot, dpi = 300, width = 10, height = 6, bg = "white")
  print(plot)
}  

Analysis of 0.4 threshold

Preliminary analyses of the metrics allow focusing the process on the 0.4 threshold. Below is the graph presented in Figure 4 of the manuscript.

results_04 <- results_df[results_df$TH == "0.4",]
results_04$Issue <- factor(results_04$Issue, levels = unique(results_04$Issue))
plot <- ggplot(results_04, aes(x = Issue)) +
  geom_line(aes(y = recall, color = "Recall", group=1), linewidth =1) +
  geom_point(aes(y = recall, color = "Recall"), size =  5) +
  geom_line(aes(y = precision, color = "Precision", group=1), linewidth =  1) +
  geom_point(aes(y = precision, color = "Precision"), size =  5) +
  geom_line(aes(y = accuracy, color = "Accuracy", group=1), linewidth =  1) +
  geom_point(aes(y = accuracy, color = "Accuracy"), size =  5) +
  geom_line(aes(y = f1, color = "F1", group=1), linewidth =  1) +
  geom_point(aes(y = f1, color = "F1"), size =  5) +
  geom_line(aes(y = kappa_fleiss, color = "Humans-BART 3 codes kappa", group=1), linewidth =  1) +
  geom_point(aes(y = kappa_fleiss, color = "Humans-BART 3 codes kappa"), size =  5) +
  geom_line(aes(y = kappa, color = "Human Consensus-BART 2 codes kappa", group=1), linewidth =  1) +
  geom_point(aes(y = kappa, color = "Human Consensus-BART 2 codes kappa"), size =  5) +
  scale_color_manual(values = colors) + 
  scale_y_continuous(limits=c(-0.09,1),breaks=seq(0,1,by = 0.2))+
  labs(  
    #labs(title = paste("Evolution by Topics for Prompt 1"),
    x = "", y = "Value", color = "Metric") +
  theme_minimal() +
  theme(panel.background = element_rect(fill = "white", color = NA),
        legend.position = "bottom",
        text = element_text(size =  12),
        axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +  # <--- Ajuste clave
  scale_x_discrete(expand = expansion(mult = c(0.05, 0.05))) 

filename <- sprintf("Results/Main Project/Main Study-Evolution_of_topics4_04.png")
##ggsave(filename = filename, plot = plot, dpi = 300, width = 10, height = 6, bg = "white")
print(plot)