library(tidyverse)
library(tm)
library(fastNaiveBayes)
In this vignette, we’ll explore how to create dictionaries of terms associated with human-coded text using R, based on the methods described in Kush et al. (2023). That paper focused on creating a language-based measure of the strength of transactive memory systems in a group. The code in this vignette could be used in the same way to create language-based measures of other concepts or as part of an investigation into important terms in a set of text.
Before we start, ensure you have a substantial amount of text data available. For reliable dictionaries, the more text, the better. In our research, we analyzed 21,526 instant messages, categorizing each into one of 16 distinct categories.
Here, we’ll work with a simple example. We’ll use a fictitious dataset created for Kush (2023), consisting of a conversation where participants are solving a Clue-type murder mystery. I have coded this conversation into two categories, Question and Statement, to demonstrate how words can be indicative of certain categories of conversation. I will create a very simple data.frame that has two columns: our text and our labels. Our goal in this exercise is to identify, by the end of the vignette, which words occurred more often in questions than in statements.
Note: the functions look for columns named “Text” and “CodedAs” (case sensitive), so be sure to use those names in any data you feed to these functions.
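If your own data uses different column names, rename them before running the functions. For example (using the hypothetical column names message and label, which are not part of this vignette’s data):
# Hypothetical example: rename your columns to the names the functions expect
# my_data <- my_data %>% dplyr::rename(Text = message, CodedAs = label)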
# Define our sample text and coded categories
Text <- c("I remember them saying that there was blood in the library so I think the crime was done there.","Yeah, I think that's right","But, I'm pretty sure that the rope was used so the blood may not be from the victim","The rope?","I know it wasn't the candlestick so I was thinking knife or pistol if there was blood.","Wait a minute, we need to try and get our thoughts organized.","What do you mean?","I mean we shouldn't be arguing. What is that we know for sure?","I know the rope was used, trust me.","I don't remember anything about a rope. What about the blood?","I know it was the rope guys, that's what my packet thing said. I'm not disputing the blood, but I know the murder weapon was the rope.","Okay, fine. B you said something about a weapon right?","Yeah, its not the candlestick. So, I mean, if we're gonna say that the blood is irrelevant than sure,..","Right. Sure, so we're gonna say it was the rope?","Yep","Sure","I'm ready guys.","So, what do we know for sure?","I wasn't really paying attention, sorry. I'll probably be useless.","That’s okay. We can start with the weapons? I know it wasn't the knife","Why do you say that? What were all the options?","It could have been the pistol, knife, candlestick, or rope.","Um, well, I'm pretty sure it was the rope.","Oh, it was the rope?","Yeah, I remember reading that","I don't remember that in the stuff.","Well I did and you said you weren't paying attention so that's not any help.","Come on guys, no need to be mean.","But I mean, what's the harm in talking about some other options?","I know it was the rope so, we can move on.","Huh, but now I'm a bit confused because they said there was blood in the library but a rope wouldn't cause blood right?","Sure, let' s just move onto the next part.","Oh, I remember they said it wasn't the candlestick.","Let's just keep going. Where it could have been done?","They said our materials were different right? We had different things we were looking at?","Yeah, so maybe it makes sense to just say what we remember?","Sure. I remember reading that there was blood in the library and that the knife was not used.","Hmm, okay. Yeah, my stuff was different. It said there was scuffed carpet in the dining room and the rope was found on the victim.","B, what did your's say?","Something about the candlestick being locked up.","Anything else?","Some of what C said too. So I think my special part was that the candlestick isn't it.","Cool, so it sounds like we can figure out the weapon used. Awesome!","So it sounds like between my stuff and Bs that it wasn't the knife or candlestick and your stuff makes it sound like it was the rope.","Yeah, makes sense to me!","So we go with that for now?","Yep, I'll put down rope here. Sounds good!","So, who do we think did it?","Well, I think we need a few things, like weapon and place.","I think they said that all our materials were like different or something","Really, like how so?","Umm, well, mine said that the knife was locked away.","I don't remember reading about a knife at all! But I trust you. Mine said that the candle stick was locked up.","Okay cool, so mine mentioned that the rope was like, literally on the body and it looked like it was used. Seems like a safe bet.","Oh, well, if not knife or candlestick, then rope it is.","It mentions pistol here too. Anyone read anything about it?","No","Nope","Okay so, maybe that's not important.","Yep","So, we say rope and move on then?","Yeah, I'm satisfied. Works for me!","Sounds good.")
CodedAs <- c("Statement","Statement","Statement","Question","Statement","Statement","Question","Question","Statement","Question","Statement","Question","Statement","Question","Statement","Statement","Statement","Question","Statement","Statement","Question","Statement","Statement","Question","Statement","Statement","Statement","Statement","Question","Statement","Statement","Statement","Statement","Question","Question","Question","Statement","Statement","Question","Statement","Question","Statement","Statement","Statement","Statement","Question","Statement","Question","Statement","Statement","Question","Statement","Statement","Statement","Statement","Question","Statement","Statement","Statement","Statement","Question","Statement","Statement")
Categories <- c("Statement", "Question")
# Create a data frame with the text and their corresponding categories
all_text <- data.frame(Text, CodedAs)
With our dataset of text in place, I will now define the suite of unique functions developed for Kush et al. (2023). These functions form the core of our text analysis pipeline, preparing data for categorical association.
Function Overview:
removeSpecialChars: A preprocessing function that removes all characters that are not letters, digits, spaces, or apostrophes; apostrophes are kept to preserve contractions.
clean_corpus: An extensive cleaner that performs a series of text normalization steps, including punctuation removal, special character cleansing, and optional stemming, among others. Some steps are repeated because each cleaning function may introduce white space.
run_nb: Deploys a Naive Bayes classifier on the corpus to determine the probability that a given word occurs in a specific category compared to the other categories.
NgramTokenizer: Constructs n-grams up to a specified length from a corpus, allowing us to capture multi-word expressions. In this case, we will get a list of all single words, 2-word phrases, and 3-word phrases.
get_WC_ngram: Computes the frequency of single words and n-grams within each category, then outputs this information to CSV files for further analysis.
#This function removes special characters from a given text string.
#It uses the `gsub` function to replace all characters that are not alphanumeric,
#apostrophes, or spaces with an empty string. Apostrophes are left to ensure contractions remain in the text.
removeSpecialChars <- function(x) {
gsub("[^a-zA-Z0-9' ]", "", x)
}
#If there are particular words that you want to be removed from the data, you can provide a list here and then set the "excludelist" option to TRUE.
Master_exclude <- c("clue")
#This function performs various cleaning operations on a text corpus, such as removing punctuation,
#special characters, numbers, converting to lowercase, removing stopwords, stemming, and stripping whitespace.
#It uses the `tm` package for text mining and manipulation.
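#Note: the backtick form `if`(condition, yes, no) used below is simply the functional form of if/else; each cleaning step is applied only when its corresponding argument is TRUE.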
clean_corpus <- function(target_data, excludelist = FALSE, stop = FALSE, removepunc = TRUE, removespecialchar = TRUE, stem = TRUE) {
corpus <- VCorpus(VectorSource(target_data))
corpus <- `if`(removepunc, tm_map(corpus, removePunctuation, preserve_intra_word_contractions = TRUE), corpus)
corpus <- `if`(removespecialchar, tm_map(corpus, removeSpecialChars), corpus)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, tolower)
corpus <- `if`(stop, tm_map(corpus, removeWords, tm::stopwords(kind = "en")), corpus)
corpus <- `if`(excludelist, tm_map(corpus, removeWords, Master_exclude), corpus)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, PlainTextDocument)
corpus <- `if`(stem, tm_map(corpus, stemDocument), corpus)
corpus <- `if`(excludelist, tm_map(corpus, removeWords, Master_exclude), corpus)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, PlainTextDocument)
return(corpus)
}
#This function runs a Naive Bayes classifier on the data frame containing human-coded text. This is the primary function of this package
#For each category in the `Categories` vector:
run_nb <- function(data, Corpus) {
for (i in 1:length(Categories)) {
d.frame <- data
d.frame$CodedAs <- if_else(d.frame$CodedAs == Categories[i], 1, 0)
classifier <- fnb.gaussian(Corpus, d.frame$CodedAs)
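# fnb.gaussian() returns per-class summaries of each term; here, probability_table[[1]] holds the per-term means for the 0 class (all other categories) and [[2]] for the 1 class (the focal category), which become the probno and probyes columns below.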
mymat <- matrix(nrow = length(classifier$probability_table[[2]]$means), ncol = 2)
for (k in 1:dim(mymat)[1]) {
mymat[k, 1] = classifier$probability_table[[1]]$means[k]
mymat[k, 2] = classifier$probability_table[[2]]$means[k]
}
mymat <- as.data.frame(mymat)
colnames(mymat) <- c("probno", "probyes")
mymat$Word <- classifier$names
write.csv(mymat, paste0('./', Categories[i], '.csv'))
}
}
# This function takes a text input and generates n-grams (sequences of n words) from it.
# It uses the `ngrams` function from the `NLP` package to extract the n-grams
# and then concatenates them into a single string with space as the separator.
NgramTokenizer <- function(x) {
unlist(lapply(ngrams(words(x), c(1, 2, 3)), paste, collapse = " "), use.names = FALSE)
}
#This function calculates the word count for the provided human-coded text, categorized by the specified classes.
#For each category in the `Categories` vector:
get_WC_ngram <- function(data, excludelist = FALSE, stop = FALSE, removepunc = TRUE, removespecialchar = TRUE, stem = TRUE) {
for (i in 1:length(Categories)) {
d.frame <- data
d.frame$CodedAs <- if_else(d.frame$CodedAs == Categories[i], 1, 0)
d.frame <- dplyr::filter(d.frame, CodedAs == 1)
corpus_clean <- clean_corpus(d.frame$Text, excludelist = excludelist, stop = stop, removepunc = removepunc, removespecialchar = removespecialchar, stem = stem)
corpus_clean_tdm <- as.matrix(TermDocumentMatrix(corpus_clean, control = list(tokenize = NgramTokenizer)))
FreqMat <- data.frame(ST = rownames(corpus_clean_tdm),
Freq = rowSums(corpus_clean_tdm))
write.csv(FreqMat, paste0('./', Categories[i], '_WC.csv'))
}
}
Now that all the custom functions are defined, I can begin the process of creating a dictionary of terms. This is a multi-step process that follows this order: clean the data, determine the probability that each ngram occurs in each category, determine the overall occurrence of each ngram, and then combine those values with some decision rules to create a list of terms for the dictionary.
Clean data: We will clean the conversational data with the clean_corpus() function. The goal here is to remove some words and combine others into their roots. Especially for large data, this is a necessary step to reduce complexity.
In the function, there are some options to tune how we will clean the text, which we set within the arguments variable. The first is whether to exclude certain terms. For example, if I wanted to create a general dictionary for questions, I would not care about the context-specific words in this text: as this conversation is about a Clue-like task, there are references to “rope” and “knife” that are not likely to be relevant in other group conversations (or may have very different meanings). I could provide a list of the terms that I do not want to be considered (above I provided the term “clue” to the Master_exclude list). I could also choose to exclude so-called “stop words”, which are generally considered to have little meaning. The next two arguments should generally be set to TRUE and will remove numbers, symbols, and punctuation from the text. The last argument is whether to stem terms. Stemming means that the ends of words are removed so that terms can be collapsed into their roots. Combining terms into their roots is helpful because the occurrence of the root word “begin” is likely to be more meaningful than knowing how many times the terms “begins”, “beginning”, etc. occurred. For this example, I will not include an exclude list or remove stop words, but will set the other arguments to TRUE.
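To see what stemming does on its own, here is a quick illustration (not part of the pipeline) using SnowballC, the stemmer that tm’s stemDocument() relies on; all three words should reduce to the root “begin”.
SnowballC::wordStem(c("begins", "beginning", "begin"), language = "english")
#> [1] "begin" "begin" "begin"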
# Configuration of arguments for the clean_corpus function. Storing these values in the arguments vector ensures consistency when we later call the word count function.
arguments <- c(excludelist = FALSE, stop = FALSE, removespecialchar = TRUE, removepunc = TRUE, stem = TRUE)
#Executes the function using the arguments we set
clean_text <- rlang::exec(clean_corpus, target_data = all_text$Text, !!!arguments)
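For clarity, the rlang::exec() call above is equivalent to calling the cleaning function directly with the arguments spelled out:
# Equivalent direct call (shown for reference only; no need to run the cleaning twice)
# clean_text <- clean_corpus(all_text$Text, excludelist = FALSE, stop = FALSE, removepunc = TRUE, removespecialchar = TRUE, stem = TRUE)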
#Displays a line from the conversation before and after the cleaning function.
as.character(all_text[3,])[1]
#> [1] "But, I'm pretty sure that the rope was used so the blood may not be from the victim"
as.character(clean_text[[3]])
#> [1] "but i'm pretti sure that the rope was use so the blood may not be from the victim"
Notice that the text is now cleaned up, with all characters lower case, commas removed, and words like “used” stemmed.
Our next step is to take these phrases and split them into 1-, 2-, and 3-word phrases. We can then remove those that have low TF-IDF (Term Frequency-Inverse Document Frequency) scores, a statistical measure of word importance. In Kush et al. (2023), we set this cutoff to 10. In practice, if you have enough data, the cutoff will have little effect on the outcomes, but it can speed up processing when you have a lot of text.
# Generate a Document-Term Matrix (DTM) using the cleaned text with n-grams as terms, weighted by TF-IDF
clean_dtm_ngram <- DocumentTermMatrix(clean_text, control = list(tokenize = NgramTokenizer, weighting = weightTfIdf)) #throws a warning but that is okay
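If you are unsure what cutoff to use for your own data, it can help to inspect the distribution of each ngram’s summed TF-IDF score, which is the quantity that findFreqTerms() in the next step thresholds. This optional check is not part of the original pipeline; the slam package is installed automatically as a dependency of tm.
term_scores <- slam::col_sums(clean_dtm_ngram) #summed TF-IDF weight of each ngram across all messages
summary(term_scores)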
#Keep only the ngrams whose summed TF-IDF score meets the cutoff. For this vignette, we set this value to 0 to include all terms. Increasing it removes low importance ngrams. In the paper, we set this value to 10.
clean_dtm_ngram <- clean_dtm_ngram[, findFreqTerms(clean_dtm_ngram, 0)]
# Convert the n-gram DTM to a dataframe for easier manipulation and analysis
clean_corpus_df_ngram <- as.data.frame(as.matrix(clean_dtm_ngram))
# I can output the dimensions of the resulting dataframe to understand the scale of the dataset. Rows are the messages and columns represent the ngrams remaining in the data. Values represent the number of times a given ngram occurred in a message.
cat("The dimensions of the ngram dataframe are: ", dim(clean_corpus_df_ngram), "\n")
#> The dimensions of the ngram dataframe are: 63 1065
Now we have a smaller subset of terms that occur frequently enough but are also unique enough that we would want to consider them (the purpose of the TF-IDF weighting). Now we can start running the Naive Bayes process on the data to identify whether some of the words that remain in the text are more likely to be present in one coding category versus another. A csv file will be created for each category listed in the Categories variable we created earlier. This file lists each term along with the probability that it is associated with the focal coding category. Two values are calculated: probyes is the probability that the term is more likely to occur in the focal category than in all other categories, and probno is the probability that it is not. Terms with a higher ‘probyes’ than ‘probno’ are candidates for inclusion in a targeted dictionary, following the approach outlined in Kush et al. (2023).
# Execute the Naive Bayes classification process on the dataset which will evaluate the extent to which each ngram is associated with one of the codes.
run_nb(all_text, clean_corpus_df_ngram)
# After running the Naive Bayes classifier, a CSV file for each category will be generated
# Each CSV file contains terms alongside their calculated probabilities regarding the specified category
# Here we read the results for the category named 'Question' into a dataframe so that we can examine it.
Category1 <- read.csv("Question.csv")
# Display the top terms with their probabilities for the 'Question' category to inspect the initial results
head(Category1)
#> X probno probyes Word
#> 1 1 0.002206453 0 a bit
#> 2 2 0.002206453 0 a bit confus
#> 3 3 0.004633550 0 a few
#> 4 4 0.004633550 0 a few thing
#> 5 5 0.002396664 0 a knife
#> 6 6 0.002396664 0 a knife at
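As a quick sanity check (not part of the original workflow), you can preview the terms most strongly associated with questions by keeping rows where probyes exceeds probno and sorting by probyes:
Category1 %>%
  filter(probyes > probno) %>%
  arrange(desc(probyes)) %>%
  head()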
Now that we have calculated the overall probabilities that terms were associated with a coding category, we could decide that we are done. However, if a term only occurs a few times, but always in one coding category, it may have a very high “probyes” score. If we want to create a dictionary that will be useful for other situations, we may want to include only words that meet some minimum threshold of overall occurrence. So, next, we will count how many times each ngram was contained within a coding category. This allows us to set a minimum occurrence threshold that a term must meet to be included in the final dictionary. The logic, again, is that even if a word is strongly related to a coding category, if it does not occur very often then we may not want to consider it.
#This command will execute the get_WC_ngram function using the same arguments that we used when cleaning the corpus. It will write a csv for each coding category listing how often each term occurred in that category.
rlang::exec(get_WC_ngram, data = all_text, !!!arguments)
#Now, I will read in the word count list for the Question category and display the top of this file. Note that this file is smaller as it will not include a row for a term that never occurred (Freq = 0)
Category1_WC <- read.csv("Question_WC.csv")
head(Category1_WC)
#> X ST Freq
#> 1 a rope a rope 1
#> 2 a rope what a rope what 1
#> 3 a weapon a weapon 1
#> 4 a weapon right a weapon right 1
#> 5 about about 5
#> 6 about a about a 2
Now, we will merge the results from the Naive Bayes analyses and the term frequencies we just generated. In the final data frame we will have: the ngram, the probability it is in the coding category, and the number of times it occurred in that coding category. Then we can apply some decision rules to determine which terms are most related to each coding category. I will show you how to take these steps within R, but more complex conditional rules, like those we used in Kush et al. (2023), are easier to apply in Excel.
#I use the "left_join" command from dplyr to combine the two datasets merging on the ngrams.
Category1_combine <- Category1 %>%
left_join(Category1_WC, by = c("Word" = "ST")) %>%
#I will then calculate a binary variable which is 1 when probyes is greater than probno and 0 otherwise. This could be constrained to some other threshold like probyes must be twice as large as probno
mutate(MoreLikely = case_when(probyes > probno ~ 1, TRUE ~ 0),
Freq = replace_na(Freq, 0), #the word count data was empty for any ngrams that did not occur in this coding category, so we need to set those values to 0.
Freq_mean = mean(Freq),
Freq_sd = sd(Freq),
IncludeInDictionary = case_when(MoreLikely == 1 & Freq >= pmax(5, Freq_mean + Freq_sd) ~ 1, TRUE ~ 0)) %>% #The decision rule set here is that in order for a term to be included in the dictionary, it must be more likely in the focal category (MoreLikely is 1) and occur in that category at least 5 times or at least the mean+SD of the occurrences of all terms in that coding category (whichever is larger)
dplyr::select(Word, MoreLikely, Freq, IncludeInDictionary) %>% #Drops columns we no longer need
filter(IncludeInDictionary == 1) %>%
arrange(desc(Freq))
Category1_combine
#> Word MoreLikely Freq IncludeInDictionary
#> 1 what 1 8 1
#> 2 the 1 6 1
#> 3 about 1 5 1
#> 4 rope 1 5 1
#> 5 say 1 5 1
And now we see that there are some terms that are more likely to have occurred in phrases coded as “Question” than in other phrases: what, the, about, rope, say. Counting the occurrence of these words in other similar conversations could allow us to identify other phrases that are questions. If we had made some different choices (such as removing stop words) we would have a slightly different list. You would just repeat the above steps for each of your coding categories; creating a loop is worthwhile if you have more than a few, as sketched below.
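Here is a minimal sketch of how such a loop might look, assuming run_nb() and get_WC_ngram() have already written the CSV files for every category; the dictionaries object name is just for illustration, and the loop applies the same decision rule as above.
dictionaries <- lapply(Categories, function(category) {
  probs <- read.csv(paste0(category, ".csv")) #Naive Bayes probabilities for this category
  counts <- read.csv(paste0(category, "_WC.csv")) #ngram frequencies within this category
  probs %>%
    left_join(counts, by = c("Word" = "ST")) %>%
    mutate(Freq = replace_na(Freq, 0),
           MoreLikely = if_else(probyes > probno, 1, 0)) %>%
    filter(MoreLikely == 1, Freq >= pmax(5, mean(Freq) + sd(Freq))) %>%
    arrange(desc(Freq)) %>%
    dplyr::select(Word, Freq)
})
names(dictionaries) <- Categories
dictionaries$Question #inspect the dictionary terms for the Question category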
I hope you found this vignette helpful. Please see the Kush et al. (2023) paper in Small Group Research or email me jkush@umassd.edu if you have any questions!
Special thanks to Saheer Shaik for review and suggestions on this vignette.
References:
Kush, J. (2023). A practical guide to performing transcript analysis on group conversations in both LIWC and R. Group Dynamics: Theory, Research, and Practice. https://psycnet.apa.org/doi/10.1037/gdn0000204
Kush, J., Aven, B., & Argote, L. (2023). A text-based measure of transactive memory system strength. Small Group Research. https://doi.org/10.1177/10464964231182130