Main project
Data for the main study are posts from the official X (Twitter) accounts of US Members of Congress and from a set of political influencers in the US. A list of all Member accounts is generated and added to this the accounts of fifty top political influencers identified by the advertising firm Amra & Elma. An initial random sample of 70,546 posts was obtained for the combined list of accounts covering the period from January 1 to June 30, 2024, using the Brandwatch archive (see Appendix for more information about the accounts). This was then scaled down to a dataset manageable for human coders by drawing a random subsample of 4,500 posts for the main dataset. Results of the pilot analysis encourage the consideration of the prompt h1: “This text is about []”. An expanded set of eighteen issues, labeled as l1 - l18, is considered: Conspiratorial Logic, The Economy in General, Donald Trump, Joe Biden, Democrats, Republicans, MAGA, Jews and Antisemitism in the US, Healthcare, Reproductive Rights, Homelessness, Immigration, Climate Change, Electric Vehicles, Elections, January 6 Insurrection, Race Relations, Resistance to Social Change or Traditional Values.
This project involved the work of three human coders, who were tasked with classifying the posts as 0/1 for the issues considered. A sampling scheme was implemented to ensure that each post and each issue received two independent human codes, from coders A and B. The PIs of teh project have reviewed and adjudicated the disagreements between coders. As a result, for each issue, there is a consensus truth variable and two independent human codings.
If one prefers to work interactively with an R script, all the
information and processes developed in this file regarding the Pilot
analysis can be reproduced in the script
Markdown_MainProject.R.
Load the dataset
The input file containing the pilot dataset is Main Data.xlsx, located in the Data folder.
main_data <- read_excel("Data/Main Data.xlsx",sheet='Final')
#head(main_data) #Uncomment to view the data header.
Initially, main_data has 4500 observations and 56 variables: each text is assigned an ID, along with the sentence itself, and for each of the 8 topics, three classification columns: one ending in Truth, which represents the consensus between code A and code B, one ending in CodeA, and another ending in CodeB.
For example, for economy we have EconTruth, EconCodeA and EconCodeB.
In this main porject, as these are real texts obtained from a social network, the writing is quite imprecise and irregular. Although BART performs very well with this type of writing, we briefly standardize the texts to be classified by converting everything to lowercase, cleaning up contractions, and removing emojis. We keep retweets, hashtags, and user mentions. If user mentions or hashtags should also be removed, it is enough to uncomment the corresponding line in the clean_text function.
clean_text <- function(text) {
# Convert to uppercase: consistency, pattern recognition and reduction of complexity
text <- tolower(text)
# Remove encoding artifacts (UTF-8 misinterpreted issues)
text <- iconv(text, from = "UTF-8", to = "ASCII//TRANSLIT", sub = "")
text <- stri_replace_all_regex(text, "[\\p{So}\\p{Cn}]", "") # Elimina símbolos y caracteres desconocidos
text <- str_replace_all(text, "\\b(i'm)\\b", "i am")
text <- str_replace_all(text, "\\b(you're)\\b", "you are")
text <- str_replace_all(text, "\\b(he's)\\b", "he is")
text <- str_replace_all(text, "\\b(she's)\\b", "she is")
text <- str_replace_all(text, "\\b(it's)\\b", "it is")
text <- str_replace_all(text, "\\b(we're)\\b", "we are")
text <- str_replace_all(text, "\\b(they're)\\b", "they are")
#text <- str_replace_all(text, "[^\x01-\x7F]", "")
# text <- str_replace_all(text, "[\\p{So}\\p{Cn}]", "")
# Replace contractions for the verb 'to have'
text <- str_replace_all(text, "\\b(i've)\\b", "i have")
text <- str_replace_all(text, "\\b(you've)\\b", "you have")
text <- str_replace_all(text, "\\b(he's)\\b", "he has")
text <- str_replace_all(text, "\\b(she's)\\b", "she has")
text <- str_replace_all(text, "\\b(it's)\\b", "it has")
text <- str_replace_all(text, "\\b(we've)\\b", "we have")
text <- str_replace_all(text, "\\b(they've)\\b", "they have")
# Remove user mentions
#text <- str_remove(text, "^rt\\s+@\\S+\\s*")
#text <- str_remove_all(text, "@\\S+")
# Remove hashtags
#text <- str_remove_all(text, "#\\S+")
# Remove URLs
text <- str_remove_all(text, "http\\S+|www\\S+")
# New line to remove everything after 'HTTP' until a space
text <- str_remove_all(text, "HTTP\\S*") # Removes everything from 'https:' until the first space
# Remove problematic characters (single quotes, double quotes, apostrophes, etc.)
text <- str_remove_all(text, "[\"'`´^¨]")
# Remove non-alphanumeric characters, except spaces
# text <- str_remove_all(text, "[^\\w\\s]")
# Remove non-alphanumeric characters **EXCEPT @ and #**
text <- str_replace_all(text, "[^\\w\\s@#]", "")
# Remove extra spaces
text <- str_squish(text)
return(text)
}
Apply ´clean_text´ function to the ´sequence´ variable in ´main_data´. The newly created variable with the debugged text is named ´clean_text´.
main_data$clean_text <- clean_text(main_data$sequence)
if (anyNA(main_data$sequence)) {
warning("The 'sequence' column contains missing values (NA). It is recommended to handle these values before proceeding.")
# Optional: Remove rows with missing values
main_data <- main_data[!is.na(main_data$sequence), ]
}
main_data$clean_text <- ifelse(main_data$clean_text == "", "empty sentence", main_data$clean_text)
# Uncomment to save as xlsx
#write_xlsx(main_data, "Data/Main Data.xlsx")
Classification process
The first step is to obtain the classifications for each text in the main project dataset, considering the hypothesis h1: “This text is about []”, and the eighteen previously defined issues (l1–l18).
To do so, we apply the textZeroShot function of the
text library combined with a loop to consider h1
and all the issues, labeled as l1-l18, applied over the
´clean_text´ variable obtained after posts debugg (see Section Main
project/Load the dataset).
After the classification process, main_data will contain 75 variables: the initial 57, plus 18 classifications for each hi_lj combination (referring to the i-th prompt and the j-th topic)
labels= c("Conspiratorial Logic", "The Economy in General","Donald Trump","Joe Biden","Democrats","Republicans","MAGA","Jews and Antisemitism in the US","Healthcare","Reproductive Rights","Homelessness","Immigration","Climate Change","Electric Vehicles","Elections","January 6 Insurrection","Race Relations","Resistance to Social Change or Traditional Values")
hypothesis=c("This text is about {}.")
To run the classification process using BART, apply the
ZSC_main function included in
Markdown_MainProject.R. This function requires the
following input parameters: (bart
function - Bayesian Additive Regression Trees - RDocumentation)
main_data: a dataframe containing the 4500 texts to be classified.labels: a vector comprising the 18 issues to be considered.hypothesis: a vector encompassing the working prompt.
The classification results are appended to the main_data
dataframe, resulting in 18 new columns corresponding to each
hi_lj combination, where hi denotes the
hypothesis/prompt i and lj represents the label
j.
This approach facilitates zero-shot classification using the BART model, enabling the evaluation of the relationship between texts and formulated hypotheses without additional training.
This part of the process is provided in scripts separate from the rest of the analysis, as the execution time is high and may slow down or cause memory issues during the compilation of the Markdown document.
It is recommended not to apply the classification procedure to all posts and all issues at once, as this may cause performance issues. The preferred approach is to partition the dataset by rows. The script provides an example of this application, using the first 200 posts and the 18 issues: ´main_data_1_200 <- ZSC_main(main_data[1:200, ], labels, hypothesis)´.
Analysis of the metrics
The input for this section is the main_data dataframe
created in the previous section Main analysis/Classification
process. Excel file Main_classification.xlsx, located in
the Data folder can be used instead.
excel_file <- "Data/Main_classification.xlsx"
LLM <- read_excel(excel_file)%>%
select(ID,sequence, h1_l1, h1_l2, h1_l3, h1_l4, h1_l5, h1_l6, h1_l7, h1_l8, h1_l9, h1_l10, h1_l11, h1_l12, h1_l13, h1_l14, h1_l15, h1_l16, h1_l17, h1_l18)
human <- read_excel(excel_file)%>%
select(ID,sequence, ConsTruth, ConsCodeA, ConsCodeB, EconTruth, EconCodeA, EconCodeB, TrumpTruth, TrumpCodeA, TrumpCodeB, BidenTruth, BidenCodeA, BidenCodeB, DemsTruth, DemsCodeA, DemcCodeB, GOPTruth, GOPCodeA, GOPCodeB, MAGATruth, MAGACodeA, MAGACodeB, JewsTruth, JewsCodeA, JewsCodeB, HealthTruth, HealthCodeA, HealthCodeB, ReproTruth, ReproCodeA, ReproCodeB, HomelessTruth, HomelessCodeA, HomeLessCodeB, ImmTruth, ImmCodeA, ImmCodeB, ClimateTruth, ClimateCodeA, ClimateCodeB, ElecVTruth, ElecVCodeA, ElecVCodeB, ElectionsTruth, ElectionsCodeA, ElectionsCodeB, Jan6Truth, Jan6CodeA, Jan6CodeB, RaceTruth, RaceCodeA, RaceCodeB, SocChgTruth, SocChgCodeA, SocChgCodeB)
BART scores have to be binarized in order to compare them agains human classifications. A grid of binarization thresholds between 0.3 and 0.7 was employed to assess the impact of different threshold values on the comparison metrics.
thresholds <- c(0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7)
# Function to apply threshold
apply_threshold <- function(value, threshold) {
ifelse(value >= threshold, 1, 0)
}
# Columns to apply the threshold. It considers only the relevant columns that include classifications.
classifications <- setdiff(colnames(LLM), c("ID", "TEXT"))
# Initialize a list to store the DataFrames
threshold_dataframes <- list()
for (threshold in thresholds) {
# Generate DataFrame name dynamically (ensure integer conversion)
df_name <- sprintf("LLM_%02d", as.integer(threshold * 100))
# Create a copy of the DataFrame and apply the threshold
LLM_copy <- LLM
LLM_copy[classifications] <- lapply(LLM_copy[classifications], apply_threshold, threshold)
LLM_copy$TH <- threshold
# Store in the list
threshold_dataframes[[df_name]] <- LLM_copy
}
# Uncomment following lines to save different binarizations as Excel
# wb <- createWorkbook()
# for (name in names(threshold_dataframes)) {
# addWorksheet(wb, name)
# writeData(wb, name, threshold_dataframes[[name]])}
# saveWorkbook(wb, "Data/binarizations.xlsx", overwrite = TRUE)
Next, we organize the variables in order to proceed with the comparison between the binarized BART classifications and those provided by the human coders.
# Make a copy of 'human' without 'ID' and 'TEXT'
human_copy <- human %>% select(matches("Truth$"))
human_all <- human %>% select(-"ID", -"sequence")
cols_data1 <- setdiff(names(LLM_copy), c("ID", "sequence", "clean_text", "TH"))
Create an empty results DataFrame where we store, for each threshold and each hypothesis–issue combination, different goodness-of-fit metrics: False Positives (FP), False Negatives (FN), True Positives (TP), True Negatives (TN), threshold value (TH), recall, precision, accuracy, f1, Cohen’s Kappa between human consensus and BART (kappa), Fleiss’s kappa between human Code 1, Code 2 and BART (kappa_fleiss), Cohen’s Kappa between human Code 1 and Code 2 (kappa_C1C2), Cohen’s Kappa between human Code 1 and BART (kappa_C1BART), Cohen’s Kappa between human Code 2 and BART (kappa_C2BART).
For the calculation of recall, precision, accuracy, and F1, the consensus value of the human coders is considered as the ground truth.
results_df <- data.frame(Issue=character(), FP=integer(), FN=integer(), TP=integer(), TN=integer(), TH=numeric(), recall=numeric(), precision=numeric(), accuracy=numeric(), f1 = numeric(),kappa=numeric(), topic=character(),hypothesis=character(), kappa_fleiss=numeric(), kappa_C1C2=numeric(),kappa_C1BART=numeric(), kappa_C2BART=numeric(), stringsAsFactors = FALSE)
Next, through the use of loops, all possible binarizations are iterated over, and the previously defined metrics are computed accordingly.
for(df_name in names(threshold_dataframes)){
LLM_copy <- threshold_dataframes[[df_name]]
TH <- unique(LLM_copy$TH)
for (k in 1:18) { # 8 labels/topics per hypothesis
# Define true/predicted labels
y_true <- human_copy[k]
y_pred <- LLM_copy[[cols_data1[k]]]
selected_columns <- human_all[, c(3*k-1, 3*k)]
selected_columns <- cbind(y_pred, selected_columns)
#first column in selected_columns is BART classification
#second column in selected_columns is Coder1 classification
#third column in selected_columns is Coder2 classification
# Compute confusion matrix
cm <- table(unlist(y_true), y_pred)
FP <- cm[1,2]
FN <- cm[2,1]
TP <- cm[2,2]
TN <- cm[1,1]
# Compute classification metrics
precision <- ifelse((TP + FP) != 0, TP / (TP + FP), 0)
recall <- ifelse((TP + FN) != 0, TP / (TP + FN), 0)
accuracy <- ifelse((TP + TN + FP + FN) != 0, (TP + TN) / (TP + TN + FP + FN), 0)
f1_score <- ifelse((precision + recall) != 0, (2 * precision * recall) / (precision + recall), 0)
# Compute Cohen's Kappa
kappa_stat <- kappa2(data.frame(y_true, y_pred))$value
kappa_C1C2 <- kappa2(data.frame(selected_columns[,2], selected_columns[,3]))$value
kappa_C1BART <- kappa2(data.frame(selected_columns[,2], y_pred))$value
kappa_C2BART <- kappa2(data.frame(selected_columns[,3], y_pred))$value
kappa_fleiss=kappam.fleiss(as.matrix(selected_columns))$value
# Create a new row with the results
new_row <- data.frame(
Issue = labels[k], FP = FP, FN = FN, TP = TP, TN = TN,TH=TH,
recall = recall, precision = precision, accuracy = accuracy,
f1 = f1_score, kappa=kappa_stat,
kappa_fleiss=kappa_fleiss,kappa_C1C2=kappa_C1C2,
kappa_C1BART=kappa_C1BART,kappa_C2BART=kappa_C2BART
)
# Append new row to results DataFrame
results_df <- bind_rows(results_df, new_row)
# Define class labels
classes <- c("Class 0", "Class 1")
}}
# Uncomment to save ´results_df´as a XLSX file.
# write_xlsx(results_df, "Results/Main project/results_main.xlsx")
Once all the metrics are calculated, they are represented in different contexts to facilitate their analysis and enable a decision to be made regarding the threshold.
The metrics obtained for the different thresholds and issues are
located in the file results_main.xlsx, within the
Results/Main Project folder.
colors <- c("Recall" = "#FFD700", "Precision" = "#66CD00", "Accuracy" = "#FF3030", "F1" = "#00BFFF", "Human Consensus-BART 2 codes kappa"="#D15FEE", "Humans-BART 3 codes kappa"="#FFC0CB", "Code 1-Code 2 kappa"="#C1CDC1", "Code 1-BART kappa"="#8B0000", "Code 2-BART kappa"="#27408B")
unique_combinations <- unique(results_df[, c("Issue")])
Plot recall, accuracy, precision and f1 across the threshold
for (i in 1:length(unique_combinations)) {
top <- unique_combinations[i]
# Filter the data for the current combination
subset <- results_df[results_df$Issue == top, ]
plot <- ggplot(subset, aes(x = TH)) +
geom_line(aes(y = recall, color = "Recall"), linewidth = 1) +
geom_point(aes(y = recall, color = "Recall"), size = 5) +
geom_line(aes(y = precision, color = "Precision"), linewidth = 1) +
geom_point(aes(y = precision, color = "Precision"), size = 5) +
geom_line(aes(y = accuracy, color = "Accuracy"), linewidth = 1) +
geom_point(aes(y = accuracy, color = "Accuracy"), size = 5) +
geom_line(aes(y = f1, color = "F1"), linewidth = 1) +
geom_point(aes(y = f1, color = "F1"), size = 5) +
scale_color_manual(values = colors) +
scale_y_continuous(limits=c(0,1),breaks=seq(0,1,by = 0.1))+
labs(title = paste("Topic: ", top), x = "Threshold", y = "Value", color = "Metric") + theme_minimal() + theme(panel.background = element_rect(fill = "white", color = NA), legend.position = "bottom", text = element_text(size = 12))
filename <- sprintf("Results/Main Project/Metrics Evolution/%s_Evolution_of_Metrics.png", top)
## Uncomment to save the figure
## ggsave(filename = filename, plot = plot, dpi = 300, width = 10, height = 6, bg = "white")
print(plot)
}
Plot recall, accuracy, precision, f1 and human consensus-BART kappa across the threshold
for (i in 1:length(unique_combinations)) {
top <- unique_combinations[i]
# Filter the data for the current combination
subset <- results_df[results_df$Issue == top, ]
# Define the graphic
plot <- ggplot(subset, aes(x = TH)) +
geom_line(aes(y = recall, color = "Recall"), linewidth = 1) +
geom_point(aes(y = recall, color = "Recall"), size = 5) +
geom_line(aes(y = precision, color = "Precision"), linewidth = 1) +
geom_point(aes(y = precision, color = "Precision"), size = 5) +
geom_line(aes(y = accuracy, color = "Accuracy"), linewidth = 1) +
geom_point(aes(y = accuracy, color = "Accuracy"), size = 5) +
geom_line(aes(y = f1, color = "F1"), linewidth = 1) +
geom_point(aes(y = f1, color = "F1"), size = 5) +
geom_line(aes(y = kappa, color = "Human Consensus-BART 2 codes kappa"), linewidth = 1) +
geom_point(aes(y = kappa, color = "Human Consensus-BART 2 codes kappa"), size = 5) +
scale_color_manual(values = colors) +
scale_y_continuous(limits=c(0,1),breaks=seq(0,1,by = 0.1))+
labs(title = paste("Topic: ", top), x = "Threshold", y = "Value", color = "Metric") + theme_minimal() + theme(panel.background = element_rect(fill = "white", color = NA), legend.position = "bottom", text = element_text(size = 12))+
theme_minimal() +
theme(panel.background = element_rect(fill = "white", color = NA),
legend.position = "bottom",
text = element_text(size = 12))
filename <- sprintf("Results/Main Project/Metrics Evolution with Kappa/%s_Evolution_of_Metrics_kappa.png", top)
## ggsave(filename = filename, plot = plot, dpi = 300, width = 10, height = 6, bg = "white")
print(plot)
}
Plot recall, accuracy, precision, f1 and Fleiss kappa across the threshold
for (i in 1:length(unique_combinations)) {
topic <- unique_combinations[i]
# Filter the data for the current combination
subset <- results_df[results_df$Issue == topic, ]
# Define the graphic
plot <- ggplot(subset, aes(x = TH)) +
geom_line(aes(y = recall, color = "Recall"), linewidth = 1) +
geom_point(aes(y = recall, color = "Recall"), size = 5) +
geom_line(aes(y = precision, color = "Precision"), linewidth = 1) +
geom_point(aes(y = precision, color = "Precision"), size = 5) +
geom_line(aes(y = accuracy, color = "Accuracy"), linewidth = 1) +
geom_point(aes(y = accuracy, color = "Accuracy"), size = 5) +
geom_line(aes(y = f1, color = "F1"), linewidth = 1) +
geom_point(aes(y = f1, color = "F1"), size = 5) +
geom_line(aes(y = kappa_fleiss, color = "Humans-BART 3 codes kappa"), linewidth = 1) +
geom_point(aes(y = kappa_fleiss, color = "Humans-BART 3 codes kappa"), size = 5) +
scale_color_manual(values = colors) +
scale_y_continuous(limits=c(-0.09,1),breaks=seq(0,1,by = 0.1))+
labs(title = paste("Topic: ", topic),
x = "Threshold", y = "Value", color = "Metric")+
theme_minimal() +
theme(panel.background = element_rect(fill = "white", color = NA),
legend.position = "bottom",
text = element_text(size = 12))
filename <- sprintf("Results/Main Project/With Kappa Fleiss/%s_Evolution_of_Metrics_kappa_fleiss.png", topic)
## ggsave(filename = filename, plot = plot, dpi = 300, width = 10, height = 6, bg = "white")
print(plot)
}
Plot different combinations of Cohen’s kappas and Fleiss’ Kappa
for (i in 1:length(unique_combinations)) {
topic <- unique_combinations[i]
# Filter the data for the current combination
subset <- results_df[results_df$Issue == topic, ]
# Define the graphic
plot <- ggplot(subset, aes(x = TH)) +
geom_line(aes(y = kappa_C1C2, color = "Code 1-Code 2 kappa"), linewidth = 1) +
geom_point(aes(y = kappa_C1C2, color = "Code 1-Code 2 kappa"), size = 5) +
geom_line(aes(y = kappa_C1BART, color = "Code 1-BART kappa"), linewidth = 1) +
geom_point(aes(y = kappa_C1BART, color = "Code 1-BART kappa"), size = 5) +
geom_line(aes(y = kappa_C2BART, color = "Code 2-BART kappa"), linewidth = 1) +
geom_point(aes(y = kappa_C2BART, color = "Code 2-BART kappa"), size = 5) +
geom_line(aes(y = kappa_fleiss, color = "Humans-BART 3 codes kappa"), linewidth = 1) +
geom_point(aes(y = kappa_fleiss, color = "Humans-BART 3 codes kappa"), size = 5) +
geom_line(aes(y = kappa, color = "Human Consensus-BART 2 codes kappa"), linewidth = 1) +
geom_point(aes(y = kappa, color = "Human Consensus-BART 2 codes kappa"), size = 5) +
scale_color_manual(values = colors) +
scale_y_continuous(limits=c(-0.09,1),breaks=seq(0,1,by = 0.1))+
labs(title = paste("Topic: ", topic),
x = "Threshold", y = "Value", color = "Metric") +
scale_fill_manual(
labels = label_wrap(35))+
theme_minimal() +
theme(panel.background = element_rect(fill = "white", color = NA),
legend.position = "bottom",
text = element_text(size = 12))
filename <- sprintf("Results/Main Project/kappas evolution/%s_Evolution_of_kappas.png", topic)
ggsave(filename = filename, plot = plot, dpi = 300, width = 10, height = 6, bg = "white")
print(plot)
}
Analysis of 0.4 threshold
Preliminary analyses of the metrics allow focusing the process on the 0.4 threshold. Below is the graph presented in Figure 4 of the manuscript.
results_04 <- results_df[results_df$TH == "0.4",]
results_04$Issue <- factor(results_04$Issue, levels = unique(results_04$Issue))
plot <- ggplot(results_04, aes(x = Issue)) +
geom_line(aes(y = recall, color = "Recall", group=1), linewidth =1) +
geom_point(aes(y = recall, color = "Recall"), size = 5) +
geom_line(aes(y = precision, color = "Precision", group=1), linewidth = 1) +
geom_point(aes(y = precision, color = "Precision"), size = 5) +
geom_line(aes(y = accuracy, color = "Accuracy", group=1), linewidth = 1) +
geom_point(aes(y = accuracy, color = "Accuracy"), size = 5) +
geom_line(aes(y = f1, color = "F1", group=1), linewidth = 1) +
geom_point(aes(y = f1, color = "F1"), size = 5) +
geom_line(aes(y = kappa_fleiss, color = "Humans-BART 3 codes kappa", group=1), linewidth = 1) +
geom_point(aes(y = kappa_fleiss, color = "Humans-BART 3 codes kappa"), size = 5) +
geom_line(aes(y = kappa, color = "Human Consensus-BART 2 codes kappa", group=1), linewidth = 1) +
geom_point(aes(y = kappa, color = "Human Consensus-BART 2 codes kappa"), size = 5) +
scale_color_manual(values = colors) +
scale_y_continuous(limits=c(-0.09,1),breaks=seq(0,1,by = 0.2))+
labs(
#labs(title = paste("Evolution by Topics for Prompt 1"),
x = "", y = "Value", color = "Metric") +
theme_minimal() +
theme(panel.background = element_rect(fill = "white", color = NA),
legend.position = "bottom",
text = element_text(size = 12),
axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) + # <--- Ajuste clave
scale_x_discrete(expand = expansion(mult = c(0.05, 0.05)))
filename <- sprintf("Results/Main Project/Main Study-Evolution_of_topics4_04.png")
##ggsave(filename = filename, plot = plot, dpi = 300, width = 10, height = 6, bg = "white")
print(plot)