In traditional “Market Basket Analysis,” we ask: If a customer buys milk, will they buy bread? But can this logic be applied to literature?
As a Chinese student living and studying in Poland, I have been deeply fascinated by the richness of Polish literature. Leveraging my knowledge of Unsupervised Learning, I decided to bridge the gap between data science and digital humanities. In this project, I analyze the classic Polish novel “Quo Vadis” by Henryk Sienkiewicz. Instead of shopping baskets, I treat chunks of the text as transactions; instead of products, I treat character names and key concepts as items.
This approach demonstrates how unsupervised learning can extract narrative structures without reading the text line-by-line.
This project aims to demonstrate mastery of Unsupervised Learning tools by:

1. Data Engineering: Transforming unstructured raw text into structured transaction data using Base R.
2. Mining Association Rules: Using the Apriori algorithm to find narrative connections.
3. Benchmarking Algorithms: Comparing the efficiency of Apriori vs. Eclat on sparse text data.
4. Synthesizing Methods: Bridging association rules with Hierarchical Clustering using Jaccard dissimilarity.
Unlike standard CSV datasets, literary text is unstructured. I utilized Base R string manipulation techniques to parse the raw text of Quo Vadis (sourced from Project Gutenberg).
Methodology: I defined a “transaction” (or basket) as a chunk of 50 consecutive lines of text. This chunking yields 456 non-empty baskets (see below), a sample size large enough for support and confidence estimates to be meaningful.
# --- Step 0: Libraries ---
library(arules)   # transactions class, apriori(), eclat(), dissimilarity()
library(ggplot2)  # benchmark visualization

# --- Step 1: Data Acquisition ---
# Reading the local text file "quovadis.txt"
text_raw <- readLines("quovadis.txt", encoding = "UTF-8", warn = FALSE)

# --- Step 2: Chunking (Creating Transactions) ---
# We treat every 50 lines as one "Transaction" (Basket).
lines_per_chunk <- 50
chunk_ids <- gl(ceiling(length(text_raw) / lines_per_chunk),
                lines_per_chunk,
                length(text_raw))
chunks <- split(text_raw, chunk_ids)
# --- Step 3: Custom Cleaning & Tokenization ---
# Define standard English stopwords + Project Gutenberg metadata terms
my_stopwords <- c("the", "and", "of", "to", "in", "a", "that", "was", "he", "it",
                  "his", "with", "is", "for", "as", "had", "you", "not", "be",
                  "her", "on", "at", "by", "which", "have", "from", "but", "this",
                  "him", "she", "they", "all", "were", "my", "there", "would",
                  "what", "so", "up", "out", "if", "about", "who", "their", "when",
                  "could", "been", "very", "them", "than", "more",
                  "chapter", "project", "gutenberg")
trans_list <- list()
counter <- 1

for (i in seq_along(chunks)) {
  # Collapse the 50 lines into one string and clean
  block <- paste(chunks[[i]], collapse = " ")
  block <- tolower(block)

  # Extract words with 4+ letters using Regex
  words <- unlist(regmatches(block, gregexpr("[a-z]{4,}", block)))

  # Remove stopwords and keep unique words per chunk
  words <- words[!(words %in% my_stopwords)]
  words <- unique(words)

  # Only save non-empty transactions
  if (length(words) > 0) {
    trans_list[[counter]] <- words
    counter <- counter + 1
  }
}
# Convert to the 'transactions' class from the arules package
trans <- as(trans_list, "transactions")

# Quick Inspection
cat("Total Transactions created:", length(trans), "\n")
## Total Transactions created: 456
Before running the algorithms, we inspect the “Item Frequency”. In a supermarket context, the most frequent item might be “Milk”. In Quo Vadis, we expect the main protagonists to appear most frequently.
# Visualizing the Top 20 most frequent terms
itemFrequencyPlot(trans, topN = 20,
                  col = "skyblue",
                  border = "white",
                  main = "Top 20 Frequent Narrative Items in Quo Vadis",
                  ylab = "Relative Support")

Observation: The plot confirms that “vinicius” is the central character, followed closely by “petronius” and “lygia”. This validates that our preprocessing strategy successfully captured the narrative entities.
We apply the Apriori Algorithm. Given the sparsity of text data (thousands of unique words), we set a support threshold of \(0.05\), a minimum confidence of \(0.5\), and require at least two items per rule. With 456 transactions, a support of \(0.05\) corresponds to an absolute minimum count of 22 chunks, as reported in the output below.
# Mining Association Rules
rules <- apriori(trans, parameter = list(supp = 0.05, conf = 0.5, minlen = 2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.05 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 22
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[9549 item(s), 456 transaction(s)] done [0.01s].
## sorting and recoding items ... [749 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 done [0.53s].
## writing ... [2562832 rule(s)] done [0.14s].
## creating S4 object ... done [0.63s].
# Pruning Redundant Rules (Crucial step)
# We remove rules that add items to a simpler, more general rule
# without improving its confidence.
rules_pruned <- rules[!is.redundant(rules)]

# Sorting by Lift to find the strongest narrative links
inspect(head(sort(rules_pruned, by = "lift"), 10))
##      lhs                          rhs      support    confidence coverage   lift      count
## [1]  {house, tiber, vinicius} => {trans}  0.05043860 0.8846154  0.05701754 11.205128 23
## [2]  {house, tiber}           => {trans}  0.05043860 0.8518519  0.05921053 10.790123 23
## [3]  {tiber, time, vinicius}  => {trans}  0.05043860 0.8518519  0.05921053 10.790123 23
## [4]  {tiber, vinicius}        => {trans}  0.06798246 0.7948718  0.08552632 10.068376 31
## [5]  {tiber, time}            => {trans}  0.05263158 0.7741935  0.06798246  9.806452 24
## [6]  {lygia, tiber}           => {trans}  0.05043860 0.7666667  0.06578947  9.711111 23
## [7]  {house, trans}           => {tiber}  0.05043860 1.0000000  0.05043860  9.702128 23
## [8]  {said, trans}            => {tiber}  0.05263158 1.0000000  0.05263158  9.702128 24
## [9]  {said, tiber}            => {trans}  0.05263158 0.7500000  0.07017544  9.500000 24
## [10] {tiber}                  => {trans}  0.07675439 0.7446809  0.10307018  9.432624 35
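Pruning matters here because the raw mining step produced over 2.5 million rules. As an illustrative follow-up (not part of the main pipeline), we can check how many rules survive pruning and drill into the rules mentioning a single entity; “lygia” below is just an example item.

# Effect of redundancy pruning on the rule count
length(rules)
length(rules_pruned)

# Drill-down: strongest surviving rules that involve "lygia" (illustrative choice)
lygia_rules <- subset(rules_pruned, lhs %in% "lygia" | rhs %in% "lygia")
inspect(head(sort(lygia_rules, by = "lift"), 5))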
In USL Lecture 11, we discussed the theoretical differences between Apriori (Breadth-First Search) and Eclat (Depth-First Search / Vertical Layout). Here, I empirically compare their execution efficiency on this textual dataset.
Note: For a fair comparison, both algorithms are set to generate frequent itemsets only (no rule generation).
# Define parameters (Targeting frequent itemsets only)
params <- list(supp = 0.05, minlen = 2, target = "frequent itemsets")

# 1. Measure Apriori
time_apriori <- system.time({
  sets_apriori <- apriori(trans, parameter = params, control = list(verbose = FALSE))
})

# 2. Measure Eclat
time_eclat <- system.time({
  sets_eclat <- eclat(trans, parameter = params, control = list(verbose = FALSE))
})

# 3. Visualization
benchmark_res <- data.frame(
  Algorithm = c("Apriori", "Eclat"),
  Time_Elapsed = c(time_apriori["elapsed"], time_eclat["elapsed"])
)

ggplot(benchmark_res, aes(x = Algorithm, y = Time_Elapsed, fill = Algorithm)) +
  geom_bar(stat = "identity", width = 0.5) +
  theme_minimal() +
  labs(title = "Execution Time Comparison", y = "Time (Seconds)")

Discussion: While Eclat is often faster on dense datasets, on highly sparse narrative data like this the performance difference is marginal. This experiment demonstrates the practical trade-offs discussed in class.
USL Lecture 12 states that “Jaccard dissimilarity makes the cluster analysis available.” This allows us to group characters into “factions” based on their co-occurrence patterns.
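Concretely, for two items \(A\) and \(B\) (here, two words), the Jaccard dissimilarity computed by dissimilarity(..., method = "jaccard", which = "items") compares the sets of chunks in which each item appears:

\[
d_{J}(A, B) = 1 - \frac{|T_A \cap T_B|}{|T_A \cup T_B|},
\]

where \(T_A\) is the set of chunks containing item \(A\). Items that always co-occur have dissimilarity 0; items that never share a chunk have dissimilarity 1, which is exactly the notion of “narrative distance” we want for clustering characters.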
# 1. Filter: Select only the frequent items to keep the plot readable
sub_trans <- trans[, itemFrequency(trans) > 0.08]

# 2. Compute Dissimilarity (Jaccard Index)
d_jaccard <- dissimilarity(sub_trans, method = "jaccard", which = "items")

# 3. Hierarchical Clustering (Ward's Method)
hc <- hclust(d_jaccard, method = "ward.D2")

# 4. Dendrogram Visualization
plot(hc, main = "Dendrogram of Character Relationships",
     xlab = "Narrative Items", sub = "Method: Jaccard Dissimilarity + Ward's Clustering")
rect.hclust(hc, k = 4, border = "red")

Insight: The dendrogram reveals the social structure of the novel. We can observe distinct clusters representing different storylines (e.g., the Roman Court vs. the Christian Community).
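To read the four red boxes as explicit “factions”, we can cut the tree into the same number of groups and list the items in each cluster. This is a minimal sketch; \(k = 4\) simply mirrors the rect.hclust() call above, and the cluster contents depend on the 0.08 frequency cut-off.

# Cut the dendrogram into the 4 groups marked by rect.hclust()
factions <- cutree(hc, k = 4)

# List the narrative items belonging to each "faction"
split(names(factions), factions)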
Finally, we analyze Maximal Rules, i.e. rules that are not contained in any larger rule in the mined set, to extract the “backbone” of the story.
# Filter for maximal rules
max_rules <- rules_pruned[is.maximal(rules_pruned)]

# Inspect top rules by Lift
inspect(head(sort(max_rules, by = "lift"), 10))
##      lhs                          rhs             support    confidence coverage   lift      count
## [1]  {house, tiber, vinicius} => {trans}         0.05043860 0.8846154  0.05701754 11.205128 23
## [2]  {tiber, time, vinicius}  => {trans}         0.05043860 0.8518519  0.05921053 10.790123 23
## [3]  {lygia, tiber}           => {trans}         0.05043860 0.7666667  0.06578947  9.711111 23
## [4]  {said, trans}            => {tiber}         0.05263158 1.0000000  0.05263158  9.702128 24
## [5]  {said, tiber}            => {trans}         0.05263158 0.7500000  0.07017544  9.500000 24
## [6]  {paul}                   => {tarsus}        0.06140351 0.5600000  0.10964912  9.120000 28
## [7]  {tarsus}                 => {paul}          0.06140351 1.0000000  0.06140351  9.120000 28
## [8]  {arena, people}          => {amphitheatre}  0.05043860 0.6571429  0.07675439  6.810390 23
## [9]  {people, wild}           => {beasts}        0.05482456 0.5000000  0.10964912  5.560976 25
## [10] {amphitheatre, people}   => {arena}         0.05043860 0.5750000  0.08771930  5.462500 23
The analysis of Maximal Rules yielded striking results that go beyond simple co-occurrence. The algorithm successfully reconstructed the spatial geography and historical entities of the novel:
- The “Trans-Tiber” Discovery (Geography): Rules such as {tiber} => {trans} (Lift ~9.4) and {house, tiber, vinicius} => {trans} (Lift ~11.2) are particularly revealing. In the novel, the Christian quarter is located in “Trans-Tiber” (Trastevere). Our algorithm successfully reconstructed this geo-location, identifying that when the protagonist Vinicius visits a specific “house” near the “Tiber,” he is crossing into the “Trans” district. This captures the critical plot point where Vinicius enters the poor districts to find Lygia.
- Entity Reconstruction (Identity): The rule {paul} => {tarsus} serves as a strong validation of historical accuracy. The algorithm identified the fixed collocation of “Paul” and “Tarsus,” correctly reconstructing the full title of the biblical figure, Paul of Tarsus.
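These interpretations can be double-checked directly against the transactions. As an illustrative sketch (not part of the original pipeline), we can count the chunks in which both tokens of an entity co-occur; by the rule table above, “paul” and “tarsus” should appear together in 28 chunks.

# Verification: number of chunks containing both "paul" and "tarsus"
length(subset(trans, items %ain% c("paul", "tarsus")))

# Number of chunks containing both "tiber" and "trans"
length(subset(trans, items %ain% c("tiber", "trans")))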
This project successfully applied Unsupervised Learning to unstructured literary data. Apriori effectively recovered the main character relationships and geographic settings. Clustering provided a high-level overview of the novel’s social structure. Benchmarking showed the practical nuances of algorithm selection.
This demonstrates that the tools of Market Basket Analysis are powerful instruments for Digital Humanities and text analytics.