Map of the loops
1. Loop 1: Iterating over each row to create Document-Term
Matrices (DTMs)
for (i in 1:total_rows) {
row_df <- text[i, ]
# Check if the tokens column is available and tokenize
if (!is.list(row_df$tokens)) {
tokens_data <- tokens(row_df$tokens)
} else {
tokens_data <- row_df$tokens
}
# Create the DTM (document-feature matrix)
dtm <- dfm(tokens_data) # Convert the tokenized text to a DTM
# Create a unique name for each DTM
doc_name <- paste0("P", i, "_dtm")
dtm_list[[doc_name]] <- dtm # Store the DTM for each individual row
# Update the progress bar
setTxtProgressBar(pb, i)
}
Purpose:
- This loop processes each row of the
text data frame,
which represents one document (or participant).
Breakdown:
- row_df <- text[i, ]:
- For each iteration
i, a single row (document) is
selected from the text data frame.
- Tokenization:
- If the
tokens column isn’t already a list (i.e., if
it’s a string), it tokenizes the text using the tokens()
function from the quanteda package.
- Create DTM:
- The
dfm() function is used to create a Document-Term
Matrix (DTM) from the tokenized text. Each DTM contains word counts for
each term in that document.
- Store the DTM:
- A unique name is generated for each DTM (
P1_dtm,
P2_dtm, etc.), and the DTM is stored in the list
dtm_list.
- Progress Bar Update:
- The
setTxtProgressBar() function updates the progress
bar, showing how many rows/documents have been processed.
2. Loop 2: Calculating word frequencies for Lancaster
norms
for (dtm_name in names(dtm_list)) {
# Extract the current DTM (document)
dtm <- dtm_list[[dtm_name]]
# Convert DTM to data frame for easier manipulation
dtm_df <- convert(dtm, to = "data.frame")
# Create an empty data frame to store frequencies and first match for lanc words for the current document
word_frequencies <- data.frame(word = lanc_words)
# Add a column to track if it's the first match
word_frequencies$first_match <- 0
# Loop through each word and calculate frequency and first match
for (k in 1:nrow(word_frequencies)) {
word <- word_frequencies$word[k]
# Check if the word is present in the DTM
if (word %in% colnames(dtm_df)) {
word_freq <- sum(dtm_df[, word], na.rm = TRUE)
# If the word appears and it's the first occurrence, set first_match to 1
if (word_freq > 0) {
word_frequencies$frequency[k] <- word_freq
if (word_frequencies$first_match[k] == 0) {
word_frequencies$first_match[k] <- 1
}
} else {
word_frequencies$frequency[k] <- 0
}
} else {
word_frequencies$frequency[k] <- 0
}
}
# Store the resulting data frame for the current DTM in the list
dtm_frequencies[[dtm_name]] <- word_frequencies
# Update the progress bar
setTxtProgressBar(pb, match(dtm_name, names(dtm_list)))
}
Purpose:
- This loop calculates the frequency of each word from the Lancaster
norms in each document (DTM).
Breakdown:
- Iterating over each DTM:
- The loop iterates over each DTM stored in
dtm_list by
its name.
- For each DTM, the corresponding document is extracted and stored in
the variable
dtm.
- Convert DTM to Data Frame:
- The DTM is converted to a data frame (
dtm_df) for
easier manipulation. The resulting data frame has columns representing
words and rows representing documents.
- Prepare for Frequency Calculation:
- A new data frame
word_frequencies is created to store
the frequencies of Lancaster words and whether they were first matches
in the document.
- Inner Loop:
- for (k in 1:nrow(word_frequencies)): This loop goes
through each word in the
lanc_words list, checking if it
appears in the current document.
- Check if the word is present: If the word exists in
the document, its frequency is calculated from the DTM.
- First Match: If it’s the first occurrence of the
word, the
first_match column is set to 1.
- Store the results:
- The
word_frequencies data frame, which contains the
frequency and first match data for all Lancaster words in the document,
is stored in the list dtm_frequencies.
- Progress Bar Update:
- The progress bar is updated after each document’s word frequencies
are calculated.
3. Loop 3: Calculating Totals for Each
Document
for (dtm_name in names(dtm_frequencies)) {
# Access the individual data frame
df <- dtm_frequencies[[dtm_name]]
# Calculate the column-wise total (sum) for numeric columns
column_totals <- colSums(df[, sapply(df, is.numeric)], na.rm = TRUE)
# Add a new column for the document identifier
column_totals <- c(document = dtm_name, column_totals)
# Store the result in the list with the name of the document (dtm)
column_totals_list[[dtm_name]] <- column_totals
}
Purpose:
- This loop calculates the total sensory scores (e.g., olfactory,
gustatory, visual, etc.) for each document.
Breakdown:
- Iterating over each document’s word frequencies:
- The loop goes through each data frame in
dtm_frequencies. Each data frame corresponds to the word
frequencies for a particular document.
- Calculate Totals:
- colSums() is used to calculate the column-wise
total (i.e., sum) for each sensory variable in the document’s data
frame. This sums the frequencies for the sensory-related columns (e.g.,
olfactory, gustatory).
- Store Results:
- The totals for each document, along with the document’s identifier,
are stored in the list
column_totals_list with the document
name as the key.
4. Final Combination
# Combine all the column totals into a matrix
column_totals_matrix <- do.call(rbind, column_totals_list)
Purpose:
- This combines the results from all documents into a single matrix
using
do.call() with rbind(). Each row of the
matrix corresponds to a document, and the columns represent the totals
for each sensory variable. The matrix is then printed and saved as a CSV
file.