Afroman comment analysis: a hemp-toned, music-poster style report that keeps the class workflow intact while adding cleaner networks, a state mention heatmap, and portfolio-ready visual storytelling.
One idea that stood out to me from this week’s text analysis material is that word relationships can reveal meaning that simple word counts miss. A single frequent word can tell us what people mention often, but co-occurrence analysis shows which ideas travel together in the same comments or documents. This matters in business analytics because customers, viewers, and social media users often communicate through repeated phrases, associations, jokes, complaints, and emotional reactions. The tidy text approach is useful because it turns messy language into structured rows that can be counted, filtered, visualized, and interpreted. For this lab, I am using word co-occurrence to move beyond “what words appeared most often” and toward “what ideas appeared together in the Afroman YouTube comment sample.”
For this lab, I reused a cleaned YouTube comment dataset from my previous Afroman comment analysis. The goal is to identify which words tend to appear together in the same comments and then visualize those relationships through several views: a co-occurrence matrix, a network graph, a state mention map, an arc diagram, and a theme flow chart.
This analysis should be interpreted as exploratory. The dataset reflects the comments available through the prior collection workflow, not a complete archive of every comment ever posted on the video.
| Change | Purpose | Risk_Control |
|---|---|---|
| Professor-style custom stopwords | Keeps standard and manual stopwords in one workflow | Manual stopwords are documented and easy to change |
| Top-word co-occurrence matrix | Shows exact pair counts before relying on graph line thickness | Restricted to top words so the table stays readable |
| Degree-sized network nodes | Makes central connector words visually larger | Node size is explained in the subtitle |
| State mention heatmap | Adds a geographic heat layer based on text mentions | Clearly labeled as place mentions, not commenter geography |
| Arc diagram | Shows repeated word-pair links in a cleaner linear layout | Limited to top pairs to avoid edge clutter |
| Alluvial theme flow | Connects themes to keywords and signal strength | Uses a transparent keyword dictionary rather than hidden model labels |
# Put your copied cleaned Afroman CSV in the same folder as this RMD.
# If your file has a different name, update preferred_file below.
preferred_file <- "youtube_comments_clean_u4AiuqQpB1U.csv"
if (file.exists(preferred_file)) {
data_file <- preferred_file
} else {
csv_candidates <- list.files(pattern = "\\.csv$", ignore.case = TRUE)
afroman_candidates <- csv_candidates[
str_detect(str_to_lower(csv_candidates), "afroman|youtube|comment|clean")
]
if (length(afroman_candidates) > 0) {
data_file <- afroman_candidates[1]
} else if (length(csv_candidates) > 0) {
data_file <- csv_candidates[1]
} else {
stop("No CSV file found. Put the cleaned Afroman comment CSV in the same folder as this RMD.")
}
}
comments_raw <- read_csv(data_file, show_col_types = FALSE)
cat("Using data file:", data_file)
## Using data file: youtube_comments_clean_u4AiuqQpB1U.csv
glimpse(comments_raw)
## Rows: 1,051
## Columns: 11
## $ comment_id <chr> "UgzxqV6MPR5RLmUBmal4AaABAg", "UgxUrfoE4p2mooMGNO…
## $ author <chr> "@AP-kg9dz", "@kurtwpg", "@thedont2154", "@hidden…
## $ text <chr> "Songs clowning on corrupt cops is my new favorit…
## $ published_at <dttm> 2026-03-20 15:51:20, 2026-03-26 02:44:02, 2026-0…
## $ like_count <dbl> 34228, 4484, 11333, 233, 9066, 30273, 25758, 500,…
## $ reply_count <dbl> 146, 24, 55, 9, 111, 212, 135, 3, 63, 29, 2, 97, …
## $ video_id <chr> "u4AiuqQpB1U", "u4AiuqQpB1U", "u4AiuqQpB1U", "u4A…
## $ source_method <chr> "YouTube Data API v3 commentThreads.list", "YouTu…
## $ text_original <chr> "Songs clowning on corrupt cops is my new favorit…
## $ text_clean <chr> "songs clowning on corrupt cops is my new favorit…
## $ comment_length_words <dbl> 10, 22, 13, 9, 22, 20, 30, 11, 32, 14, 14, 17, 37…
comments_raw %>%
head(10) %>%
safe_table(caption = "Preview of Imported Afroman Comment Data")
| comment_id | author | text | published_at | like_count | reply_count | video_id | source_method | text_original | text_clean | comment_length_words |
|---|---|---|---|---|---|---|---|---|---|---|
| UgzxqV6MPR5RLmUBmal4AaABAg | @AP-kg9dz | Songs clowning on corrupt cops is my new favorite genre. | 2026-03-20 15:51:20 | 34228 | 146 | u4AiuqQpB1U | YouTube Data API v3 commentThreads.list | Songs clowning on corrupt cops is my new favorite genre. | songs clowning on corrupt cops is my new favorite genre | 10 |
| UgxUrfoE4p2mooMGNOt4AaABAg | @kurtwpg | “He’s an addict, he got fired. He went to Adams County, somehow he got hired…. | 2026-03-26 02:44:02 | 4484 | 24 | u4AiuqQpB1U | YouTube Data API v3 commentThreads.list | “He’s an addict, he got fired. He went to Adams County, somehow he got hired…. | he s an addict he got fired he went to adams county somehow he got hired migh… | 22 |
| UgyD4dfBmBtQlm0fIXx4AaABAg | @thedont2154 | They broke the unwritten rule: Never sue someone funnier than you for defama… | 2026-03-25 17:14:32 | 11333 | 55 | u4AiuqQpB1U | YouTube Data API v3 commentThreads.list | They broke the unwritten rule: Never sue someone funnier than you for defama… | they broke the unwritten rule never sue someone funnier than you for defamation | 13 |
| UgzNaeX9pvu-e3WGKpJ4AaABAg | @hiddenworldforge374 | We need afroman on the bricks and minifigs scandal | 2026-06-02 16:52:21 | 233 | 9 | u4AiuqQpB1U | YouTube Data API v3 commentThreads.list | We need afroman on the bricks and minifigs scandal | we need afroman on the bricks and minifigs scandal | 9 |
| UgwIVslwTx0H_uMp33Z4AaABAg | @DeanRockne | You know, with the benefit of 3 years of hindsight, I think it might have bee… | 2026-03-20 05:51:23 | 9066 | 111 | u4AiuqQpB1U | YouTube Data API v3 commentThreads.list | You know, with the benefit of 3 years of hindsight, I think it might have bee… | you know with the benefit of years of hindsight i think it might have been ea… | 22 |
| UgwqYnZoeBT47sgBJUt4AaABAg | @slayanddecay6009 | Remember Randy is ON record that he cannot confirm that Afroman did or did no… | 2026-03-19 18:55:30 | 30273 | 212 | u4AiuqQpB1U | YouTube Data API v3 commentThreads.list | Remember Randy is ON record that he cannot confirm that Afroman did or did no… | remember randy is on record that he cannot confirm that afroman did or did no… | 20 |
| Ugxb6YvIdGkmADL8CoN4AaABAg | @Whiston555 | Writing a song about fucking his wife and then him having to say for the reco… | 2026-03-19 06:02:40 | 25758 | 135 | u4AiuqQpB1U | YouTube Data API v3 commentThreads.list | Writing a song about fucking his wife and then him having to say for the reco… | writing a song about fucking his wife and then him having to say for the reco… | 30 |
| UgxxZQ8mAHcdq6gHvKd4AaABAg | @marciaguy10899 | I feel a civic duty to listen to all of these 😂 | 2026-03-31 11:44:59 | 500 | 3 | u4AiuqQpB1U | YouTube Data API v3 commentThreads.list | I feel a civic duty to listen to all of these 😂 | i feel a civic duty to listen to all of these | 11 |
| Ugz5djYCajTXh4TUSwV4AaABAg | @thisisntmyceiling | When I was a teenager blaring “I got high” I would’ve never thought I’d one d… | 2026-03-23 19:26:48 | 7407 | 63 | u4AiuqQpB1U | YouTube Data API v3 commentThreads.list | When I was a teenager blaring “I got high” I would’ve never thought I’d one d… | when i was a teenager blaring i got high i would ve never thought i d one day… | 32 |
| UgyXIaehjpy0OQGzJWp4AaABAg | @dusty_cowboy | Contractor here. I’ll fix that door for a slice of lemon pound cake. | 2026-03-29 15:33:35 | 3838 | 29 | u4AiuqQpB1U | YouTube Data API v3 commentThreads.list | Contractor here. I’ll fix that door for a slice of lemon pound cake. | contractor here i ll fix that door for a slice of lemon pound cake | 14 |
possible_text_columns <- c(
"text",
"comment",
"comments",
"comment_text",
"clean_comment",
"clean_text",
"text_original",
"textOriginal",
"body",
"content",
"review",
"message",
"snippet.topLevelComment.snippet.textOriginal"
)
matched_text_columns <- intersect(possible_text_columns, names(comments_raw))
if (length(matched_text_columns) > 0) {
text_column <- matched_text_columns[1]
} else {
character_columns <- names(comments_raw)[sapply(comments_raw, is.character)]
if (length(character_columns) == 0) {
stop("No character/text columns found. Please check the CSV structure.")
}
text_column <- character_columns[
which.max(
sapply(
comments_raw[character_columns],
function(x) sum(!is.na(x) & nchar(as.character(x)) > 0)
)
)
]
}
cat("Detected text column:", text_column)
## Detected text column: text
corpus <- comments_raw %>%
mutate(
doc_id = paste0("comment_", row_number()),
text = as.character(.data[[text_column]])
) %>%
filter(!is.na(text), str_squish(text) != "") %>%
mutate(
text = str_replace_all(text, "http\\S+|www\\.\\S+", " "),
text = str_replace_all(text, "&", " and "),
text = str_replace_all(text, "[\r\n\t]", " "),
text = str_squish(text)
) %>%
select(doc_id, text, everything())
corpus_summary <- tibble(
metric = c(
"CSV file used",
"Detected text column",
"Number of usable comments",
"Average comment length in characters"
),
value = c(
data_file,
text_column,
nrow(corpus),
round(mean(nchar(corpus$text)), 1)
)
)
corpus_summary %>%
safe_table(caption = "Corpus Summary")
| metric | value |
|---|---|
| CSV file used | youtube_comments_clean_u4AiuqQpB1U.csv |
| Detected text column | text |
| Number of usable comments | 1051 |
| Average comment length in characters | 77.3 |
corpus %>%
select(doc_id, text) %>%
head(10) %>%
safe_table(
caption = "First 10 Comments in the Analysis Corpus",
max_chars = 120
)
| doc_id | text |
|---|---|
| comment_1 | Songs clowning on corrupt cops is my new favorite genre. |
| comment_2 | “He’s an addict, he got fired. He went to Adams County, somehow he got hired.” might be his best line yet. |
| comment_3 | They broke the unwritten rule: Never sue someone funnier than you for defamation. |
| comment_4 | We need afroman on the bricks and minifigs scandal |
| comment_5 | You know, with the benefit of 3 years of hindsight, I think it might have been easier to fix this guy’s door. |
| comment_6 | Remember Randy is ON record that he cannot confirm that Afroman did or did not have sex with his wife |
| comment_7 | Writing a song about fucking his wife and then him having to say for the record that he wasn’t sure it didn’t happen … |
| comment_8 | I feel a civic duty to listen to all of these 😂 |
| comment_9 | When I was a teenager blaring “I got high” I would’ve never thought I’d one day look to this man as a champion for fr… |
| comment_10 | Contractor here. I’ll fix that door for a slice of lemon pound cake. |
words <- corpus %>%
mutate(
text = str_replace_all(text, "[^[:alnum:][:space:]$]", " "),
text = str_squish(text)
) %>%
unnest_tokens(word, text, token = "words")
words %>%
select(doc_id, word) %>%
head(30) %>%
safe_table(caption = "First 30 Tokens")
| doc_id | word |
|---|---|
| comment_1 | songs |
| comment_1 | clowning |
| comment_1 | on |
| comment_1 | corrupt |
| comment_1 | cops |
| comment_1 | is |
| comment_1 | my |
| comment_1 | new |
| comment_1 | favorite |
| comment_1 | genre |
| comment_2 | he |
| comment_2 | s |
| comment_2 | an |
| comment_2 | addict |
| comment_2 | he |
| comment_2 | got |
| comment_2 | fired |
| comment_2 | he |
| comment_2 | went |
| comment_2 | to |
| comment_2 | adams |
| comment_2 | county |
| comment_2 | somehow |
| comment_2 | he |
| comment_2 | got |
The manual stopwords added for this version are video
and song. These are broad media-context words for this
YouTube dataset. They are useful for understanding the setting, but they
can become generic hubs in a co-occurrence network rather than revealing
more specific themes.
data("stop_words")
custom_only_stop_words <- tibble(
word = c("video", "song"),
lexicon = "custom"
)
custom_stop_words <- bind_rows(
stop_words,
custom_only_stop_words
)
words_standard_clean <- words %>%
anti_join(stop_words, by = "word") %>%
filter(nchar(word) > 2) %>%
filter(!str_detect(word, "^[0-9]+$"))
words_clean <- words %>%
anti_join(custom_stop_words, by = "word") %>%
filter(nchar(word) > 2) %>%
filter(!str_detect(word, "^[0-9]+$"))
custom_stopword_check <- words_standard_clean %>%
count(word, sort = TRUE) %>%
filter(word %in% custom_only_stop_words$word)
custom_stopword_check %>%
safe_table(caption = "Manual Stopwords Removed from the Afroman Comment Corpus")
| word | n |
|---|---|
| song | 103 |
| video | 30 |
word_freq <- words_clean %>%
count(word, sort = TRUE)
word_freq %>%
slice_head(n = 30) %>%
safe_table(caption = "Top 30 Words After Stopword Removal")
| word | n |
|---|---|
| afroman | 225 |
| love | 87 |
| randy | 83 |
| cops | 68 |
| court | 44 |
| walters | 43 |
| shit | 40 |
| police | 38 |
| time | 38 |
| music | 31 |
| son | 31 |
| county | 27 |
| day | 27 |
| lol | 27 |
| bitch | 26 |
| banger | 24 |
| corrupt | 24 |
| don | 24 |
| people | 24 |
| god | 23 |
| head | 22 |
| judge | 22 |
| legend | 22 |
| wife | 22 |
| adams | 21 |
word_freq_plot <- word_freq %>%
slice_head(n = 20) %>%
ggplot(aes(x = reorder(word, n), y = n, fill = n)) +
geom_col(show.legend = FALSE, color = afroman_palette$earth, linewidth = 0.2) +
coord_flip() +
scale_fill_gradient(low = afroman_palette$hemp_light, high = afroman_palette$deep_green) +
labs(
title = "Top 20 Words in the Afroman YouTube Comment Corpus",
subtitle = "After standard and manual stopword removal",
x = NULL,
y = "Word Count",
caption = "MSBA 580 | Lab 5 | Text Analysis and NLP | Hemp / maryjane visual theme"
) +
theme_afroman(base_size = 13)
print(word_freq_plot)
ggsave(
filename = "visuals/lab5_afroman_top_20_words.png",
plot = word_freq_plot,
width = 10,
height = 7,
dpi = 300
)
A co-occurrence count shows how often two words appear in the same comment. This helps identify words and ideas that tend to travel together in the dataset.
To avoid repeated words inside one comment inflating the results, this version keeps only one instance of each word per comment before computing pairwise counts.
word_cooc <- words_clean %>%
distinct(doc_id, word) %>%
pairwise_count(
item = word,
feature = doc_id,
sort = TRUE,
upper = FALSE
)
word_cooc %>%
head(30) %>%
safe_table(caption = "Top 30 Word Co-Occurrence Pairs")
| item1 | item2 | n |
|---|---|---|
| randy | walters | 41 |
| son | bitch | 25 |
| randy | son | 21 |
| walters | son | 20 |
| pound | cake | 19 |
| afroman | love | 18 |
| randy | bitch | 18 |
| adams | county | 17 |
| walters | bitch | 17 |
| afroman | randy | 15 |
| lemon | pound | 14 |
| lemon | cake | 14 |
| afroman | police | 14 |
| afroman | court | 14 |
| cops | afroman | 13 |
| afroman | legend | 13 |
| afroman | time | 12 |
| county | sheriff | 11 |
| afroman | music | 11 |
| freedom | speech | 10 |
| head | stuck | 10 |
| afroman | wife | 9 |
| afroman | didn | 9 |
| randy | walter | 9 |
| afroman | house | 9 |
This section borrows the professor’s matrix logic, but applies it to the Afroman YouTube comment corpus. The matrix is useful because it shows exact pair counts instead of only showing relative edge thickness in a graph.
top_words <- words_clean %>%
count(word, sort = TRUE) %>%
slice_max(n, n = 15) %>%
pull(word)
afroman_matrix <- word_cooc %>%
filter(item1 %in% top_words, item2 %in% top_words) %>%
bind_rows(
word_cooc %>%
rename(item1 = item2, item2 = item1) %>%
filter(item1 %in% top_words, item2 %in% top_words)
) %>%
cast_sparse(item1, item2, n) %>%
as.matrix()
afroman_matrix
## walters bitch son love randy police court afroman time music shit lol
## randy 41 18 21 2 0 1 7 15 0 0 2 0
## son 20 25 0 1 21 0 2 8 0 0 2 0
## walters 0 17 20 1 41 0 3 7 0 0 0 0
## afroman 7 7 8 18 15 14 14 0 12 11 7 4
## cops 2 0 0 8 3 5 3 13 5 3 7 1
## county 1 1 1 0 2 2 0 8 2 2 1 0
## court 3 0 2 5 7 3 0 14 1 1 2 2
## love 1 0 1 0 2 2 5 18 2 4 3 0
## police 0 0 0 2 1 0 3 14 1 0 3 0
## day 1 1 2 3 1 1 2 3 1 1 2 0
## time 0 0 0 2 0 1 1 12 0 1 1 0
## shit 0 1 2 3 2 3 2 7 1 0 0 0
## music 0 0 0 4 0 0 1 11 1 0 0 1
## bitch 17 0 25 0 18 0 0 7 0 0 1 0
## lol 0 0 0 0 0 0 2 4 0 1 0 0
## day county cops
## randy 1 2 3
## son 2 1 0
## walters 1 1 2
## afroman 3 8 13
## cops 1 2 0
## county 1 0 2
## court 2 0 3
## love 3 0 8
## police 1 2 5
## day 0 1 1
## time 1 2 5
## shit 2 1 7
## music 1 2 3
## bitch 1 1 0
## lol 0 0 1
as.data.frame(afroman_matrix) %>%
rownames_to_column("word") %>%
safe_table(
caption = "Top-Word Co-Occurrence Matrix",
max_rows = 20,
max_chars = 30
)
| word | walters | bitch | son | love | randy | police | court | afroman | time | music | shit | lol | day | county | cops |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| randy | 41 | 18 | 21 | 2 | 0 | 1 | 7 | 15 | 0 | 0 | 2 | 0 | 1 | 2 | 3 |
| son | 20 | 25 | 0 | 1 | 21 | 0 | 2 | 8 | 0 | 0 | 2 | 0 | 2 | 1 | 0 |
| walters | 0 | 17 | 20 | 1 | 41 | 0 | 3 | 7 | 0 | 0 | 0 | 0 | 1 | 1 | 2 |
| afroman | 7 | 7 | 8 | 18 | 15 | 14 | 14 | 0 | 12 | 11 | 7 | 4 | 3 | 8 | 13 |
| cops | 2 | 0 | 0 | 8 | 3 | 5 | 3 | 13 | 5 | 3 | 7 | 1 | 1 | 2 | 0 |
| county | 1 | 1 | 1 | 0 | 2 | 2 | 0 | 8 | 2 | 2 | 1 | 0 | 1 | 0 | 2 |
| court | 3 | 0 | 2 | 5 | 7 | 3 | 0 | 14 | 1 | 1 | 2 | 2 | 2 | 0 | 3 |
| love | 1 | 0 | 1 | 0 | 2 | 2 | 5 | 18 | 2 | 4 | 3 | 0 | 3 | 0 | 8 |
| police | 0 | 0 | 0 | 2 | 1 | 0 | 3 | 14 | 1 | 0 | 3 | 0 | 1 | 2 | 5 |
| day | 1 | 1 | 2 | 3 | 1 | 1 | 2 | 3 | 1 | 1 | 2 | 0 | 0 | 1 | 1 |
| time | 0 | 0 | 0 | 2 | 0 | 1 | 1 | 12 | 0 | 1 | 1 | 0 | 1 | 2 | 5 |
| shit | 0 | 1 | 2 | 3 | 2 | 3 | 2 | 7 | 1 | 0 | 0 | 0 | 2 | 1 | 7 |
| music | 0 | 0 | 0 | 4 | 0 | 0 | 1 | 11 | 1 | 0 | 0 | 1 | 1 | 2 | 3 |
| bitch | 17 | 0 | 25 | 0 | 18 | 0 | 0 | 7 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
| lol | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 4 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
The baseline class syntax uses filter(n > 1) to show
pairs that appear together more than once. I changed the threshold to
filter(n > 2) so a word pair must appear together in at
least three comments before it enters the network.
# Baseline class-style graph filter:
# filter(n > 2) already removes one-off and two-off word pairs.
# To make the graph easier to read, this version then displays only the
# strongest ~50% of those remaining edges.
cooc_threshold <- 2
network_prune_fraction <- 0.50
word_cooc_baseline_filtered <- word_cooc %>%
filter(n > cooc_threshold)
target_edge_count <- max(1, ceiling(nrow(word_cooc_baseline_filtered) * network_prune_fraction))
word_cooc_filtered <- word_cooc_baseline_filtered %>%
arrange(desc(n), item1, item2) %>%
slice_head(n = target_edge_count)
network_summary <- tibble(
metric = c(
"Base filter used",
"Pairs before filtering",
"Pairs after base filter",
"Displayed graph prune target",
"Pairs displayed after pruning",
"Approximate edge reduction from base graph",
"Unique words/nodes displayed"
),
value = c(
paste0("filter(n > ", cooc_threshold, ")"),
nrow(word_cooc),
nrow(word_cooc_baseline_filtered),
paste0(round(network_prune_fraction * 100), "% strongest remaining pairs"),
nrow(word_cooc_filtered),
percent(1 - (nrow(word_cooc_filtered) / max(1, nrow(word_cooc_baseline_filtered))), accuracy = 1),
length(unique(c(word_cooc_filtered$item1, word_cooc_filtered$item2)))
)
)
network_summary %>%
safe_table(caption = "Network Pruning Summary")
| metric | value |
|---|---|
| Base filter used | filter(n > 2) |
| Pairs before filtering | 24569 |
| Pairs after base filter | 334 |
| Displayed graph prune target | 50% strongest remaining pairs |
| Pairs displayed after pruning | 167 |
| Approximate edge reduction from base graph | 50% |
| Unique words/nodes displayed | 123 |
if (nrow(word_cooc_filtered) == 0) {
stop("The filtered network has no edges. Try lowering cooc_threshold from 2 to 1.")
}
cooc_graph <- word_cooc_filtered %>%
graph_from_data_frame(directed = FALSE)
V(cooc_graph)$degree <- degree(cooc_graph)
set.seed(42)
cooc_plot <- ggraph(cooc_graph, layout = "fr") +
geom_edge_link(
aes(edge_alpha = n, edge_width = n),
color = afroman_palette$earth,
show.legend = TRUE
) +
geom_node_point(
aes(size = degree),
color = afroman_palette$gold
) +
geom_node_point(
aes(size = degree),
shape = 21,
fill = afroman_palette$hemp_light,
color = afroman_palette$deep_green,
stroke = 1.1
) +
geom_node_text(
aes(label = name),
repel = TRUE,
size = 4,
color = afroman_palette$ink,
fontface = "bold"
) +
scale_edge_width(range = c(0.4, 3.2)) +
scale_edge_alpha(range = c(0.25, 1)) +
scale_size_continuous(range = c(4, 10)) +
labs(
title = "Pruned Word Co-Occurrence Network: Afroman YouTube Comments",
subtitle = paste0(
"Showing the strongest ~", round(network_prune_fraction * 100),
"% of pairs after filter(n > ", cooc_threshold,
"); node size = number of connections"
),
caption = "MSBA 580 | Lab 5 | Exploratory analysis of API-accessible YouTube comments"
) +
theme_graph(base_family = "sans", background = afroman_palette$smoke) +
theme(
plot.background = element_rect(fill = afroman_palette$smoke, color = NA),
plot.title = element_text(face = "bold", color = afroman_palette$deep_green),
plot.subtitle = element_text(color = afroman_palette$earth),
plot.caption = element_text(color = afroman_palette$muted),
legend.background = element_rect(fill = afroman_palette$smoke, color = NA),
legend.key = element_rect(fill = afroman_palette$smoke, color = NA),
legend.title = element_text(color = afroman_palette$earth),
legend.text = element_text(color = afroman_palette$ink)
)
print(cooc_plot)
ggsave(
filename = "visuals/lab5_afroman_word_cooccurrence_network_pruned.png",
plot = cooc_plot,
width = 12,
height = 8,
dpi = 300
)
This state heatmap visualizes state names that appear in the comment text. It is not a map of verified commenter locations. It is only a text-based location-mention proxy.
state_dictionary <- tibble(
state = state.name,
state_lower = str_to_lower(state.name),
abbreviation = state.abb
) %>%
bind_rows(
tibble(
state = "District of Columbia",
state_lower = "district of columbia",
abbreviation = "DC"
)
)
comment_text_lower <- str_to_lower(corpus$text)
state_counts <- state_dictionary %>%
mutate(
pattern = paste0("\\b", str_replace_all(state_lower, " ", "\\\\s+"), "\\b"),
mentions = map_int(pattern, ~ sum(str_count(comment_text_lower, regex(.x, ignore_case = TRUE))))
) %>%
select(state, abbreviation, mentions) %>%
arrange(desc(mentions))
state_heatmap_summary <- state_counts %>%
summarize(
states_detected = sum(mentions > 0),
total_state_mentions = sum(mentions),
top_state = if_else(total_state_mentions > 0, state[which.max(mentions)], "No state names detected"),
top_state_mentions = max(mentions, na.rm = TRUE)
)
cat(
"State heatmap summary: ",
state_heatmap_summary$states_detected,
" states detected; ",
state_heatmap_summary$total_state_mentions,
" total state-name mentions. Top state: ",
state_heatmap_summary$top_state,
" (", state_heatmap_summary$top_state_mentions, ").",
sep = ""
)
## State heatmap summary: 12 states detected; 30 total state-name mentions. Top state: Ohio (14).
# Robust state tile heatmap. This avoids external map-package failures during knitting
# while still giving a map-like U.S. state view.
state_tile_layout <- tribble(
~abbreviation, ~x, ~y,
"AK", 1, 1, "ME", 12, 1,
"VT", 10, 2, "NH", 11, 2, "WA", 2, 2, "MT", 3, 2, "ND", 4, 2, "MN", 5, 2, "WI", 6, 2, "MI", 7, 2, "NY", 9, 2, "MA", 12, 2,
"OR", 2, 3, "ID", 3, 3, "SD", 4, 3, "IA", 5, 3, "IL", 6, 3, "IN", 7, 3, "OH", 8, 3, "PA", 9, 3, "NJ", 10, 3, "CT", 11, 3, "RI", 12, 3,
"CA", 2, 4, "NV", 3, 4, "WY", 4, 4, "NE", 5, 4, "MO", 6, 4, "KY", 7, 4, "WV", 8, 4, "VA", 9, 4, "MD", 10, 4, "DE", 11, 4,
"AZ", 3, 5, "UT", 4, 5, "CO", 5, 5, "KS", 6, 5, "AR", 7, 5, "TN", 8, 5, "NC", 9, 5, "SC", 10, 5, "DC", 11, 5,
"NM", 4, 6, "OK", 5, 6, "LA", 6, 6, "MS", 7, 6, "AL", 8, 6, "GA", 9, 6,
"HI", 1, 7, "TX", 5, 7, "FL", 10, 7
)
state_tile_data <- state_tile_layout %>%
left_join(state_counts, by = "abbreviation") %>%
mutate(
state = coalesce(state, abbreviation),
mentions = coalesce(mentions, 0L),
label = if_else(mentions > 0, paste0(abbreviation, "\n", mentions), abbreviation)
)
state_heatmap_plot <- ggplot(state_tile_data, aes(x = x, y = -y, fill = mentions)) +
geom_tile(color = afroman_palette$earth, linewidth = 0.75, width = 0.95, height = 0.95) +
geom_text(aes(label = label), color = afroman_palette$ink, fontface = "bold", size = 3.2, lineheight = 0.85) +
scale_fill_gradient(
low = afroman_palette$paper,
high = afroman_palette$deep_green,
name = "Mentions",
label = comma
) +
coord_equal() +
labs(
title = "State Mention Heatmap: Afroman YouTube Comments",
subtitle = "Tile-map heatmap of state names appearing in comment text; not verified commenter geography",
x = NULL,
y = NULL,
caption = "MSBA 580 | Lab 5 | State-name detection with word-boundary matching | Hemp / maryjane visual theme"
) +
theme_afroman(base_size = 13) +
theme(
axis.text = element_blank(),
axis.ticks = element_blank(),
panel.grid = element_blank(),
legend.position = "right",
plot.title = element_text(face = "bold", size = 18)
)
print(state_heatmap_plot)
ggsave(
filename = "visuals/lab5_afroman_state_mention_tile_heatmap.png",
plot = state_heatmap_plot,
width = 12,
height = 7,
dpi = 300
)
An arc diagram is a cleaner alternative to a dense force-directed network. It places words along a line and draws arcs between repeated co-occurring pairs.
arc_pair_count <- 35
arc_edges <- word_cooc %>%
slice_max(n, n = arc_pair_count, with_ties = FALSE)
if (nrow(arc_edges) >= 3) {
arc_graph <- arc_edges %>%
graph_from_data_frame(directed = FALSE)
arc_plot <- ggraph(arc_graph, layout = "linear") +
geom_edge_arc(
aes(edge_width = n, edge_alpha = n),
color = afroman_palette$earth,
strength = 0.7,
show.legend = TRUE
) +
geom_node_point(size = 4, color = afroman_palette$gold) +
geom_node_text(
aes(label = name),
angle = 45,
hjust = 0,
size = 3.5,
color = afroman_palette$ink
) +
scale_edge_width(range = c(0.3, 2.8)) +
scale_edge_alpha(range = c(0.2, 1)) +
labs(
title = "Arc Diagram: Top Afroman Comment Word Pairings",
subtitle = paste0("Top ", arc_pair_count, " co-occurring word pairs by count"),
caption = "MSBA 580 | Lab 5 | Arc height and thickness emphasize repeated pairings"
) +
theme_graph(base_family = "sans", background = afroman_palette$smoke) +
theme(
plot.background = element_rect(fill = afroman_palette$smoke, color = NA),
plot.title = element_text(face = "bold", color = afroman_palette$deep_green),
plot.subtitle = element_text(color = afroman_palette$earth),
plot.caption = element_text(color = afroman_palette$muted),
legend.background = element_rect(fill = afroman_palette$smoke, color = NA),
legend.key = element_rect(fill = afroman_palette$smoke, color = NA),
legend.title = element_text(color = afroman_palette$earth),
legend.text = element_text(color = afroman_palette$ink),
plot.margin = margin(10, 30, 40, 10)
)
print(arc_plot)
ggsave(
filename = "visuals/lab5_afroman_arc_diagram.png",
plot = arc_plot,
width = 13,
height = 7,
dpi = 300
)
} else {
cat("Not enough co-occurring word pairs to render the arc diagram.")
}
This lightweight flow chart connects manually defined comment themes to the words that triggered them and then to a simple signal-strength tier. It replaces the heavier alluvial rendering with a faster ggplot2-only version so the report can knit reliably while keeping the same business interpretation.
theme_dictionary <- tribble(
~theme, ~word,
"Police / legal", "police",
"Police / legal", "cops",
"Police / legal", "cop",
"Police / legal", "raid",
"Police / legal", "warrant",
"Police / legal", "rights",
"Police / legal", "law",
"Police / legal", "court",
"Police / legal", "judge",
"Police / legal", "sheriff",
"Humor / reaction", "lol",
"Humor / reaction", "funny",
"Humor / reaction", "hilarious",
"Humor / reaction", "laugh",
"Humor / reaction", "joke",
"Humor / reaction", "classic",
"Music / nostalgia", "music",
"Music / nostalgia", "afroman",
"Music / nostalgia", "rap",
"Music / nostalgia", "lyrics",
"Music / nostalgia", "beat",
"Music / nostalgia", "listen",
"Music / nostalgia", "memories",
"Cannabis / culture", "high",
"Cannabis / culture", "weed",
"Cannabis / culture", "smoke",
"Cannabis / culture", "smoking",
"Cannabis / culture", "marijuana",
"Cannabis / culture", "cannabis",
"Cannabis / culture", "hemp",
"Cannabis / culture", "joint",
"Cannabis / culture", "blunt",
"Cannabis / culture", "pot",
"Politics / public issue", "government",
"Politics / public issue", "political",
"Politics / public issue", "politics",
"Politics / public issue", "freedom",
"Politics / public issue", "tax",
"Politics / public issue", "corruption",
"Support / praise", "love",
"Support / praise", "respect",
"Support / praise", "support",
"Support / praise", "great",
"Support / praise", "legend",
"Support / praise", "amazing"
)
theme_flow_min_keyword_count <- 2
# Prune the keyword column for readability. Three keywords per theme keeps the
# flow chart interpretive instead of turning the middle column into a wall of text.
theme_flow_top_keywords_per_theme <- 3
theme_flow_max_keywords_total <- 18
theme_word_counts_all <- words_clean %>%
inner_join(theme_dictionary, by = "word") %>%
count(theme, word, sort = TRUE)
theme_word_counts_pruned <- theme_word_counts_all %>%
filter(n >= theme_flow_min_keyword_count)
# Fail-soft fallback: if the minimum-count filter is too strict for this sample,
# keep the strongest available dictionary hits rather than producing a blank plot.
if (nrow(theme_word_counts_pruned) == 0 && nrow(theme_word_counts_all) > 0) {
theme_word_counts_pruned <- theme_word_counts_all
}
theme_word_counts <- theme_word_counts_pruned %>%
group_by(theme) %>%
slice_max(n, n = theme_flow_top_keywords_per_theme, with_ties = FALSE) %>%
ungroup() %>%
arrange(desc(n), theme, word) %>%
slice_head(n = theme_flow_max_keywords_total) %>%
mutate(
signal_strength = case_when(
n >= quantile(n, 0.75, na.rm = TRUE) ~ "High repeated signal",
n >= quantile(n, 0.40, na.rm = TRUE) ~ "Medium repeated signal",
TRUE ~ "Lower repeated signal"
),
theme_label = str_wrap(theme, width = 16),
word_label = str_wrap(word, width = 10),
signal_label = str_wrap(signal_strength, width = 16)
)
flow_pruning_summary <- tibble(
setting = c(
"Minimum keyword count kept",
"Maximum keywords per theme",
"Maximum total keyword rows",
"Keyword rows displayed",
"Keyword rows before pruning"
),
value = c(
theme_flow_min_keyword_count,
theme_flow_top_keywords_per_theme,
theme_flow_max_keywords_total,
nrow(theme_word_counts),
nrow(theme_word_counts_all)
)
)
flow_pruning_summary %>%
safe_table(caption = "Theme Flow Pruning Settings")
| setting | value |
|---|---|
| Minimum keyword count kept | 2 |
| Maximum keywords per theme | 3 |
| Maximum total keyword rows | 18 |
| Keyword rows displayed | 18 |
| Keyword rows before pruning | 36 |
theme_word_counts %>%
select(theme, word, n, signal_strength) %>%
safe_table(caption = "Pruned Theme Dictionary Hits Used in the Flow Chart", max_rows = 30)
| theme | word | n | signal_strength |
|---|---|---|---|
| Music / nostalgia | afroman | 225 | High repeated signal |
| Support / praise | love | 87 | High repeated signal |
| Police / legal | cops | 68 | High repeated signal |
| Police / legal | court | 44 | High repeated signal |
| Police / legal | police | 38 | High repeated signal |
| Music / nostalgia | music | 31 | Medium repeated signal |
| Humor / reaction | lol | 27 | Medium repeated signal |
| Support / praise | legend | 22 | Medium repeated signal |
| Politics / public issue | freedom | 17 | Medium repeated signal |
| Humor / reaction | hilarious | 16 | Medium repeated signal |
| Humor / reaction | funny | 12 | Medium repeated signal |
| Support / praise | support | 11 | Lower repeated signal |
| Music / nostalgia | rap | 9 | Lower repeated signal |
| Cannabis / culture | weed | 5 | Lower repeated signal |
| Politics / public issue | government | 5 | Lower repeated signal |
| Politics / public issue | corruption | 4 | Lower repeated signal |
| Cannabis / culture | blunt | 2 | Lower repeated signal |
| Cannabis / culture | smoke | 2 | Lower repeated signal |
# Knit-safe replacement for the theme-flow visual.
# The alluvial and curved-flow versions were visually interesting, but both
# stalled during knitting in this environment. This version uses only a small
# heatmap matrix, which is much faster and more reliable for RPubs.
if (nrow(theme_word_counts) > 0) {
theme_matrix_data <- theme_word_counts %>%
group_by(theme) %>%
slice_max(n, n = 2, with_ties = FALSE) %>%
ungroup() %>%
arrange(theme, desc(n), word) %>%
slice_head(n = 12) %>%
mutate(
theme = str_wrap(theme, width = 18),
word = str_wrap(word, width = 12),
signal_strength = factor(
signal_strength,
levels = c("Lower repeated signal", "Medium repeated signal", "High repeated signal")
),
label = paste0(word, "\n", "n=", n)
)
theme_signal_heatmap <- ggplot(
theme_matrix_data,
aes(x = signal_strength, y = reorder(word, n), fill = n)
) +
geom_tile(color = afroman_palette$paper, linewidth = 1.1) +
geom_text(
aes(label = label),
color = afroman_palette$ink,
fontface = "bold",
size = 3.5,
lineheight = 0.9
) +
facet_wrap(~ theme, scales = "free_y", ncol = 2) +
scale_fill_gradient(
low = afroman_palette$hemp_light,
high = afroman_palette$deep_green
) +
labs(
title = "Theme Signal Heatmap: Afroman YouTube Comments",
subtitle = "Knit-safe replacement for the crowded theme flow: top 2 keywords per theme, capped at 12 rows",
x = "Repeated-signal category",
y = NULL,
fill = "Keyword\nmentions",
caption = "MSBA 580 | Lab 5 | Dictionary-based exploratory theme map | Interprets comment language, not verified audience identity"
) +
theme_afroman(base_size = 13) +
theme(
strip.text = element_text(face = "bold", color = afroman_palette$paper, size = 12),
strip.background = element_rect(fill = afroman_palette$deep_green, color = afroman_palette$earth),
axis.text.x = element_text(angle = 20, hjust = 1, face = "bold", color = afroman_palette$deep_green),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
panel.grid = element_blank(),
legend.position = "right",
plot.title = element_text(size = 20, face = "bold", color = afroman_palette$deep_green),
plot.subtitle = element_text(size = 12, color = afroman_palette$earth)
)
print(theme_signal_heatmap)
} else {
theme_signal_fallback <- ggplot() +
annotate(
"label",
x = 0,
y = 0,
label = "No theme dictionary hits were detected for the theme signal heatmap.",
fill = afroman_palette$paper,
color = afroman_palette$deep_green,
fontface = "bold",
size = 5,
label.size = 0.8
) +
xlim(-1, 1) +
ylim(-1, 1) +
labs(
title = "Theme Signal Heatmap Fallback",
subtitle = "Core lab outputs above still rendered",
caption = "MSBA 580 | Lab 5 | Fallback visual"
) +
theme_afroman(base_size = 13) +
theme(
axis.text = element_blank(),
axis.title = element_blank(),
axis.ticks = element_blank(),
panel.grid = element_blank()
)
print(theme_signal_fallback)
}
Which additional words did you add to the stopword list manually? Why does it make sense to remove those words as stopwords? Also, does the final network graph change after removing these additional stopwords? If so, briefly explain how and why.
I manually added video and song to the
stopword list. These words make sense to remove because the dataset
comes from YouTube comments about a music video, so those words are
likely to appear frequently because of the setting rather than because
they reveal a specific theme. In other words, they are context words,
but they are not the most useful words for interpreting deeper audience
reactions, humor, controversy, nostalgia, or sentiment. The final
network graph changed because removing those terms prevented broad media
words from acting as generic hub nodes. This made the graph cleaner and
helped the remaining word clusters focus more on meaningful
relationships inside the comment language.
Change n in filter(n > 1) to a
different value. What is your final value of n? Will it give you a
better result?
I changed the filter from filter(n > 1) to
filter(n > 2), so my final comparison value is
2. This means a word pair must appear together in at least
three comments before it becomes eligible for the network graph. I then
pruned the displayed network to the strongest approximate 50% of
remaining pairs so the visual is easier to read. I think this gives a
better result because it removes more one-off pairings and reduces graph
clutter while still preserving the strongest recurring relationships. A
much higher threshold might be even cleaner, but it could remove smaller
themes that are still useful in a comment dataset.
On the BlueSky post, there were two co-occurrence network graphs. Which one do you like better? Why?
I prefer the cleaner network graph with fewer nodes and stronger
visible relationships. For business communication, the best graph is not
always the one with the most information; it is usually the one that
helps the audience understand the main pattern quickly. A dense graph
can be useful during exploration, but it can become difficult to
interpret when too many weak connections are shown at the same time. For
a final report or presentation, I would choose the graph with a higher
co-occurrence threshold because it better separates signal from noise.
This is similar to what I did in this lab by raising the threshold from
filter(n > 1) to filter(n > 2).
Which word pairs surprised you? Did buy cluster
more with optimism words like dip or caution words like
wait and valuation?
In my own Afroman comment dataset, the most interesting word pairs
are the ones that reveal repeated audience reactions rather than simple
topic labels. I would look for pairs that connect emotion, humor,
identity, politics, nostalgia, or repeated phrases from the video. For
the stock-related example in the discussion question, I would interpret
buy differently depending on whether it appears closer to
optimism words like dip or caution words like
wait and valuation. If buy
clusters with dip, that suggests a more opportunistic or
optimistic investing frame. If it clusters with wait or
valuation, that suggests a more cautious decision-making
frame.
Should we treat stock tickers, $SPCX and
SpaceX, as the same token instead of separate
ones?
In most business text analysis settings, I would treat
$SPCX and SpaceX as the same token if the goal
is to measure conversation about the same company, brand, or investment
theme. Keeping them separate may split the signal across two labels even
though users may be talking about the same underlying topic. However, I
would document the decision clearly because ticker symbols and company
names can sometimes carry different meanings depending on context. For
example, a ticker may appear more often in investor discussions, while
the company name may appear more often in general news or public
discussion. A good approach would be to create a normalization step
where related tokens are recoded into one standard label.
What is the risk of drawing conclusions from a co-occurrence network built on only a small sample of comments? What is a good sample size? How would you communicate that limitation in a business report?
The main risk is that a small sample can make random or temporary word pairings look more meaningful than they really are. Co-occurrence networks are useful for exploration, but they can be sensitive to sample size, repeated phrases, unusually active commenters, bots, jokes, or event-specific language. A good sample size depends on the platform and business question, but larger samples are generally better because they make repeated patterns more stable. For a business report, I would avoid claiming that the graph represents the entire audience unless the data collection method supports that claim. I would describe the network as an exploratory map of the available comment sample and recommend validating the pattern with a larger sample, multiple time windows, or another data source.
Is there a way to analyze comments or reviews on topics that interest you in relation to a significant event? If so, how would you identify the event? Why is that event significant, and how might it influence the comments or reviews you analyze?
Yes. A strong approach would be to compare comments before and after a significant event, such as a product launch, controversy, policy announcement, patch update, legal decision, earnings report, or viral media moment. I would identify the event by using a reliable timestamp from a primary source, official announcement, news article, company post, or platform release note. The event is significant if it plausibly changes what people are paying attention to or how they feel about the topic. For example, in a policy shock analytics project, an executive order could be used as the event date, and comments or search interest before and after the order could be compared. The event might influence the language by increasing uncertainty, emotional reactions, political framing, brand mentions, or repeated questions from the public.
The co-occurrence network shows which words tend to appear together in the same Afroman YouTube comments. Thicker and darker edges represent word pairs that co-occur more often. Larger nodes represent words with more connections to other words in the filtered graph.
The matrix, network, arc diagram, map, and alluvial flow each answer a slightly different question. The matrix gives exact word-pair counts. The network shows clusters and central connector words. The state heatmap shows place references in the text. The arc diagram shows repeated word-pair links in a cleaner linear layout. The alluvial flow connects observed words to manually documented themes.
The most important limitation is that these are exploratory patterns from the available comment sample. They should be treated as directional signals rather than definitive conclusions about the full YouTube audience.
set.seed(42) so the graph is
reproducible when knitted again.video and
song.filter(n > 2) to reduce
visual clutter.visuals/ folder.Chiericato, E. (2015). Co Citations and Co Occurrences in SEO:
The Complete Guide. Web Marketing Academy.
https://webmarketing.academy/en/co-citazioni-co-occorrenze-guida-definitiva/
Silge, J., and Robinson, D. (2017). Text Mining with R: A Tidy
Approach. O’Reilly Media.
https://www.tidytextmining.com/
Robinson, D. (2021). widyr: Widen, Process, then Re-Tidy
Data. R package.
https://CRAN.R-project.org/package=widyr
Csardi, G., and Nepusz, T. (2006). The igraph software package
for complex network research. InterJournal, Complex Systems,
1695.
https://igraph.org
Pedersen, T. L. (2024). ggraph: An Implementation of Grammar of
Graphics for Graphs and Networks. R package.
https://CRAN.R-project.org/package=ggraph