Lab 5 Text Analysis Word Co-occurrence

Afroman comment analysis: a hemp-toned, music-poster style report that keeps the class workflow intact while adding cleaner networks, a state mention heatmap, and portfolio-ready visual storytelling.

Part 1: Five-Sentence Reflection

One idea that stood out to me from this week’s text analysis material is that word relationships can reveal meaning that simple word counts miss. A single frequent word can tell us what people mention often, but co-occurrence analysis shows which ideas travel together in the same comments or documents. This matters in business analytics because customers, viewers, and social media users often communicate through repeated phrases, associations, jokes, complaints, and emotional reactions. The tidy text approach is useful because it turns messy language into structured rows that can be counted, filtered, visualized, and interpreted. For this lab, I am using word co-occurrence to move beyond “what words appeared most often” and toward “what ideas appeared together in the Afroman YouTube comment sample.”

Part 2: Word Co-Occurrence Analysis

Background

For this lab, I reused a cleaned YouTube comment dataset from my previous Afroman comment analysis. The goal is to identify which words tend to appear together in the same comments and then visualize those relationships through several views: a co-occurrence matrix, a network graph, a state mention map, an arc diagram, and a theme flow chart.

This analysis should be interpreted as exploratory. The dataset reflects the comments available through the prior collection workflow, not a complete archive of every comment ever posted on the video.

Patch Notes: What This Version Adds

Patch Notes for This Lab 5 Version
Change	Purpose	Risk_Control
Professor-style custom stopwords	Keeps standard and manual stopwords in one workflow	Manual stopwords are documented and easy to change
Top-word co-occurrence matrix	Shows exact pair counts before relying on graph line thickness	Restricted to top words so the table stays readable
Degree-sized network nodes	Makes central connector words visually larger	Node size is explained in the subtitle
State mention heatmap	Adds a geographic heat layer based on text mentions	Clearly labeled as place mentions, not commenter geography
Arc diagram	Shows repeated word-pair links in a cleaner linear layout	Limited to top pairs to avoid edge clutter
Alluvial theme flow	Connects themes to keywords and signal strength	Uses a transparent keyword dictionary rather than hidden model labels

Load the Cleaned Afroman Comment Data

# Put your copied cleaned Afroman CSV in the same folder as this RMD.
# If your file has a different name, update preferred_file below.

preferred_file <- "youtube_comments_clean_u4AiuqQpB1U.csv"

if (file.exists(preferred_file)) {
  data_file <- preferred_file
} else {
  csv_candidates <- list.files(pattern = "\\.csv$", ignore.case = TRUE)
  afroman_candidates <- csv_candidates[
    str_detect(str_to_lower(csv_candidates), "afroman|youtube|comment|clean")
  ]

  if (length(afroman_candidates) > 0) {
    data_file <- afroman_candidates[1]
  } else if (length(csv_candidates) > 0) {
    data_file <- csv_candidates[1]
  } else {
    stop("No CSV file found. Put the cleaned Afroman comment CSV in the same folder as this RMD.")
  }
}

comments_raw <- read_csv(data_file, show_col_types = FALSE)

cat("Using data file:", data_file)

## Using data file: youtube_comments_clean_u4AiuqQpB1U.csv

Inspect the Imported Data

glimpse(comments_raw)

## Rows: 1,051
## Columns: 11
## $ comment_id           <chr> "UgzxqV6MPR5RLmUBmal4AaABAg", "UgxUrfoE4p2mooMGNO…
## $ author               <chr> "@AP-kg9dz", "@kurtwpg", "@thedont2154", "@hidden…
## $ text                 <chr> "Songs clowning on corrupt cops is my new favorit…
## $ published_at         <dttm> 2026-03-20 15:51:20, 2026-03-26 02:44:02, 2026-0…
## $ like_count           <dbl> 34228, 4484, 11333, 233, 9066, 30273, 25758, 500,…
## $ reply_count          <dbl> 146, 24, 55, 9, 111, 212, 135, 3, 63, 29, 2, 97, …
## $ video_id             <chr> "u4AiuqQpB1U", "u4AiuqQpB1U", "u4AiuqQpB1U", "u4A…
## $ source_method        <chr> "YouTube Data API v3 commentThreads.list", "YouTu…
## $ text_original        <chr> "Songs clowning on corrupt cops is my new favorit…
## $ text_clean           <chr> "songs clowning on corrupt cops is my new favorit…
## $ comment_length_words <dbl> 10, 22, 13, 9, 22, 20, 30, 11, 32, 14, 14, 17, 37…

comments_raw %>%
  head(10) %>%
  safe_table(caption = "Preview of Imported Afroman Comment Data")

Preview of Imported Afroman Comment Data
comment_id	author	text	published_at	like_count	reply_count	video_id	source_method	text_original	text_clean	comment_length_words
UgzxqV6MPR5RLmUBmal4AaABAg	@AP-kg9dz	Songs clowning on corrupt cops is my new favorite genre.	2026-03-20 15:51:20	34228	146	u4AiuqQpB1U	YouTube Data API v3 commentThreads.list	Songs clowning on corrupt cops is my new favorite genre.	songs clowning on corrupt cops is my new favorite genre	10
UgxUrfoE4p2mooMGNOt4AaABAg	@kurtwpg	“He’s an addict, he got fired. He went to Adams County, somehow he got hired….	2026-03-26 02:44:02	4484	24	u4AiuqQpB1U	YouTube Data API v3 commentThreads.list	“He’s an addict, he got fired. He went to Adams County, somehow he got hired….	he s an addict he got fired he went to adams county somehow he got hired migh…	22
UgyD4dfBmBtQlm0fIXx4AaABAg	@thedont2154	They broke the unwritten rule: Never sue someone funnier than you for defama…	2026-03-25 17:14:32	11333	55	u4AiuqQpB1U	YouTube Data API v3 commentThreads.list	They broke the unwritten rule: Never sue someone funnier than you for defama…	they broke the unwritten rule never sue someone funnier than you for defamation	13
UgzNaeX9pvu-e3WGKpJ4AaABAg	@hiddenworldforge374	We need afroman on the bricks and minifigs scandal	2026-06-02 16:52:21	233	9	u4AiuqQpB1U	YouTube Data API v3 commentThreads.list	We need afroman on the bricks and minifigs scandal	we need afroman on the bricks and minifigs scandal	9
UgwIVslwTx0H_uMp33Z4AaABAg	@DeanRockne	You know, with the benefit of 3 years of hindsight, I think it might have bee…	2026-03-20 05:51:23	9066	111	u4AiuqQpB1U	YouTube Data API v3 commentThreads.list	You know, with the benefit of 3 years of hindsight, I think it might have bee…	you know with the benefit of years of hindsight i think it might have been ea…	22
UgwqYnZoeBT47sgBJUt4AaABAg	@slayanddecay6009	Remember Randy is ON record that he cannot confirm that Afroman did or did no…	2026-03-19 18:55:30	30273	212	u4AiuqQpB1U	YouTube Data API v3 commentThreads.list	Remember Randy is ON record that he cannot confirm that Afroman did or did no…	remember randy is on record that he cannot confirm that afroman did or did no…	20
Ugxb6YvIdGkmADL8CoN4AaABAg	@Whiston555	Writing a song about fucking his wife and then him having to say for the reco…	2026-03-19 06:02:40	25758	135	u4AiuqQpB1U	YouTube Data API v3 commentThreads.list	Writing a song about fucking his wife and then him having to say for the reco…	writing a song about fucking his wife and then him having to say for the reco…	30
UgxxZQ8mAHcdq6gHvKd4AaABAg	@marciaguy10899	I feel a civic duty to listen to all of these 😂	2026-03-31 11:44:59	500	3	u4AiuqQpB1U	YouTube Data API v3 commentThreads.list	I feel a civic duty to listen to all of these 😂	i feel a civic duty to listen to all of these	11
Ugz5djYCajTXh4TUSwV4AaABAg	@thisisntmyceiling	When I was a teenager blaring “I got high” I would’ve never thought I’d one d…	2026-03-23 19:26:48	7407	63	u4AiuqQpB1U	YouTube Data API v3 commentThreads.list	When I was a teenager blaring “I got high” I would’ve never thought I’d one d…	when i was a teenager blaring i got high i would ve never thought i d one day…	32
UgyXIaehjpy0OQGzJWp4AaABAg	@dusty_cowboy	Contractor here. I’ll fix that door for a slice of lemon pound cake.	2026-03-29 15:33:35	3838	29	u4AiuqQpB1U	YouTube Data API v3 commentThreads.list	Contractor here. I’ll fix that door for a slice of lemon pound cake.	contractor here i ll fix that door for a slice of lemon pound cake	14

Detect the Comment Text Column

possible_text_columns <- c(
  "text",
  "comment",
  "comments",
  "comment_text",
  "clean_comment",
  "clean_text",
  "text_original",
  "textOriginal",
  "body",
  "content",
  "review",
  "message",
  "snippet.topLevelComment.snippet.textOriginal"
)

matched_text_columns <- intersect(possible_text_columns, names(comments_raw))

if (length(matched_text_columns) > 0) {
  text_column <- matched_text_columns[1]
} else {
  character_columns <- names(comments_raw)[sapply(comments_raw, is.character)]

  if (length(character_columns) == 0) {
    stop("No character/text columns found. Please check the CSV structure.")
  }

  text_column <- character_columns[
    which.max(
      sapply(
        comments_raw[character_columns],
        function(x) sum(!is.na(x) & nchar(as.character(x)) > 0)
      )
    )
  ]
}

cat("Detected text column:", text_column)

## Detected text column: text

Build the Corpus

corpus <- comments_raw %>%
  mutate(
    doc_id = paste0("comment_", row_number()),
    text = as.character(.data[[text_column]])
  ) %>%
  filter(!is.na(text), str_squish(text) != "") %>%
  mutate(
    text = str_replace_all(text, "http\\S+|www\\.\\S+", " "),
    text = str_replace_all(text, "&amp;", " and "),
    text = str_replace_all(text, "[\r\n\t]", " "),
    text = str_squish(text)
  ) %>%
  select(doc_id, text, everything())

corpus_summary <- tibble(
  metric = c(
    "CSV file used",
    "Detected text column",
    "Number of usable comments",
    "Average comment length in characters"
  ),
  value = c(
    data_file,
    text_column,
    nrow(corpus),
    round(mean(nchar(corpus$text)), 1)
  )
)

corpus_summary %>%
  safe_table(caption = "Corpus Summary")

Corpus Summary
metric	value
CSV file used	youtube_comments_clean_u4AiuqQpB1U.csv
Detected text column	text
Number of usable comments	1051
Average comment length in characters	77.3

corpus %>%
  select(doc_id, text) %>%
  head(10) %>%
  safe_table(
    caption = "First 10 Comments in the Analysis Corpus",
    max_chars = 120
  )

First 10 Comments in the Analysis Corpus
doc_id	text
comment_1	Songs clowning on corrupt cops is my new favorite genre.
comment_2	“He’s an addict, he got fired. He went to Adams County, somehow he got hired.” might be his best line yet.
comment_3	They broke the unwritten rule: Never sue someone funnier than you for defamation.
comment_4	We need afroman on the bricks and minifigs scandal
comment_5	You know, with the benefit of 3 years of hindsight, I think it might have been easier to fix this guy’s door.
comment_6	Remember Randy is ON record that he cannot confirm that Afroman did or did not have sex with his wife
comment_7	Writing a song about fucking his wife and then him having to say for the record that he wasn’t sure it didn’t happen …
comment_8	I feel a civic duty to listen to all of these 😂
comment_9	When I was a teenager blaring “I got high” I would’ve never thought I’d one day look to this man as a champion for fr…
comment_10	Contractor here. I’ll fix that door for a slice of lemon pound cake.

Tokenization and Cleaning

Step 1: Tokenize to Individual Words

words <- corpus %>%
  mutate(
    text = str_replace_all(text, "[^[:alnum:][:space:]$]", " "),
    text = str_squish(text)
  ) %>%
  unnest_tokens(word, text, token = "words")

words %>%
  select(doc_id, word) %>%
  head(30) %>%
  safe_table(caption = "First 30 Tokens")

First 30 Tokens
doc_id	word
comment_1	songs
comment_1	clowning
comment_1	on
comment_1	corrupt
comment_1	cops
comment_1	is
comment_1	my
comment_1	new
comment_1	favorite
comment_1	genre
comment_2	he
comment_2	s
comment_2	an
comment_2	addict
comment_2	he
comment_2	got
comment_2	fired
comment_2	he
comment_2	went
comment_2	to
comment_2	adams
comment_2	county
comment_2	somehow
comment_2	he
comment_2	got

Step 2: Remove Standard and Custom Stopwords

The manual stopwords added for this version are video and song. These are broad media-context words for this YouTube dataset. They are useful for understanding the setting, but they can become generic hubs in a co-occurrence network rather than revealing more specific themes.

data("stop_words")

custom_only_stop_words <- tibble(
  word = c("video", "song"),
  lexicon = "custom"
)

custom_stop_words <- bind_rows(
  stop_words,
  custom_only_stop_words
)

words_standard_clean <- words %>%
  anti_join(stop_words, by = "word") %>%
  filter(nchar(word) > 2) %>%
  filter(!str_detect(word, "^[0-9]+$"))

words_clean <- words %>%
  anti_join(custom_stop_words, by = "word") %>%
  filter(nchar(word) > 2) %>%
  filter(!str_detect(word, "^[0-9]+$"))

custom_stopword_check <- words_standard_clean %>%
  count(word, sort = TRUE) %>%
  filter(word %in% custom_only_stop_words$word)

custom_stopword_check %>%
  safe_table(caption = "Manual Stopwords Removed from the Afroman Comment Corpus")

Manual Stopwords Removed from the Afroman Comment Corpus
word	n
song	103
video	30

Step 3: Word Frequency Check

word_freq <- words_clean %>%
  count(word, sort = TRUE)

word_freq %>%
  slice_head(n = 30) %>%
  safe_table(caption = "Top 30 Words After Stopword Removal")

Top 30 Words After Stopword Removal
word	n
afroman	225
love	87
randy	83
cops	68
court	44
walters	43
shit	40
police	38
time	38
music	31
son	31
county	27
day	27
lol	27
bitch	26
banger	24
corrupt	24
don	24
people	24
god	23
head	22
judge	22
legend	22
wife	22
adams	21

word_freq_plot <- word_freq %>%
  slice_head(n = 20) %>%
  ggplot(aes(x = reorder(word, n), y = n, fill = n)) +
  geom_col(show.legend = FALSE, color = afroman_palette$earth, linewidth = 0.2) +
  coord_flip() +
  scale_fill_gradient(low = afroman_palette$hemp_light, high = afroman_palette$deep_green) +
  labs(
    title = "Top 20 Words in the Afroman YouTube Comment Corpus",
    subtitle = "After standard and manual stopword removal",
    x = NULL,
    y = "Word Count",
    caption = "MSBA 580 | Lab 5 | Text Analysis and NLP | Hemp / maryjane visual theme"
  ) +
  theme_afroman(base_size = 13)

print(word_freq_plot)

ggsave(
  filename = "visuals/lab5_afroman_top_20_words.png",
  plot = word_freq_plot,
  width = 10,
  height = 7,
  dpi = 300
)

Computing Word Co-Occurrences

A co-occurrence count shows how often two words appear in the same comment. This helps identify words and ideas that tend to travel together in the dataset.

To avoid repeated words inside one comment inflating the results, this version keeps only one instance of each word per comment before computing pairwise counts.

word_cooc <- words_clean %>%
  distinct(doc_id, word) %>%
  pairwise_count(
    item = word,
    feature = doc_id,
    sort = TRUE,
    upper = FALSE
  )

word_cooc %>%
  head(30) %>%
  safe_table(caption = "Top 30 Word Co-Occurrence Pairs")

Top 30 Word Co-Occurrence Pairs
item1	item2	n
randy	walters	41
son	bitch	25
randy	son	21
walters	son	20
pound	cake	19
afroman	love	18
randy	bitch	18
adams	county	17
walters	bitch	17
afroman	randy	15
lemon	pound	14
lemon	cake	14
afroman	police	14
afroman	court	14
cops	afroman	13
afroman	legend	13
afroman	time	12
county	sheriff	11
afroman	music	11
freedom	speech	10
head	stuck	10
afroman	wife	9
afroman	didn	9
randy	walter	9
afroman	house	9

Co-Occurrence Matrix

This section borrows the professor’s matrix logic, but applies it to the Afroman YouTube comment corpus. The matrix is useful because it shows exact pair counts instead of only showing relative edge thickness in a graph.

top_words <- words_clean %>%
  count(word, sort = TRUE) %>%
  slice_max(n, n = 15) %>%
  pull(word)

afroman_matrix <- word_cooc %>%
  filter(item1 %in% top_words, item2 %in% top_words) %>%
  bind_rows(
    word_cooc %>%
      rename(item1 = item2, item2 = item1) %>%
      filter(item1 %in% top_words, item2 %in% top_words)
  ) %>%
  cast_sparse(item1, item2, n) %>%
  as.matrix()

afroman_matrix

##         walters bitch son love randy police court afroman time music shit lol
## randy        41    18  21    2     0      1     7      15    0     0    2   0
## son          20    25   0    1    21      0     2       8    0     0    2   0
## walters       0    17  20    1    41      0     3       7    0     0    0   0
## afroman       7     7   8   18    15     14    14       0   12    11    7   4
## cops          2     0   0    8     3      5     3      13    5     3    7   1
## county        1     1   1    0     2      2     0       8    2     2    1   0
## court         3     0   2    5     7      3     0      14    1     1    2   2
## love          1     0   1    0     2      2     5      18    2     4    3   0
## police        0     0   0    2     1      0     3      14    1     0    3   0
## day           1     1   2    3     1      1     2       3    1     1    2   0
## time          0     0   0    2     0      1     1      12    0     1    1   0
## shit          0     1   2    3     2      3     2       7    1     0    0   0
## music         0     0   0    4     0      0     1      11    1     0    0   1
## bitch        17     0  25    0    18      0     0       7    0     0    1   0
## lol           0     0   0    0     0      0     2       4    0     1    0   0
##         day county cops
## randy     1      2    3
## son       2      1    0
## walters   1      1    2
## afroman   3      8   13
## cops      1      2    0
## county    1      0    2
## court     2      0    3
## love      3      0    8
## police    1      2    5
## day       0      1    1
## time      1      2    5
## shit      2      1    7
## music     1      2    3
## bitch     1      1    0
## lol       0      0    1

as.data.frame(afroman_matrix) %>%
  rownames_to_column("word") %>%
  safe_table(
    caption = "Top-Word Co-Occurrence Matrix",
    max_rows = 20,
    max_chars = 30
  )

Top-Word Co-Occurrence Matrix
word	walters	bitch	son	love	randy	police	court	afroman	time	music	shit	lol	day	county	cops
randy	41	18	21	2	0	1	7	15	0	0	2	0	1	2	3
son	20	25	0	1	21	0	2	8	0	0	2	0	2	1	0
walters	0	17	20	1	41	0	3	7	0	0	0	0	1	1	2
afroman	7	7	8	18	15	14	14	0	12	11	7	4	3	8	13
cops	2	0	0	8	3	5	3	13	5	3	7	1	1	2	0
county	1	1	1	0	2	2	0	8	2	2	1	0	1	0	2
court	3	0	2	5	7	3	0	14	1	1	2	2	2	0	3
love	1	0	1	0	2	2	5	18	2	4	3	0	3	0	8
police	0	0	0	2	1	0	3	14	1	0	3	0	1	2	5
day	1	1	2	3	1	1	2	3	1	1	2	0	0	1	1
time	0	0	0	2	0	1	1	12	0	1	1	0	1	2	5
shit	0	1	2	3	2	3	2	7	1	0	0	0	2	1	7
music	0	0	0	4	0	0	1	11	1	0	0	1	1	2	3
bitch	17	0	25	0	18	0	0	7	0	0	1	0	1	1	0
lol	0	0	0	0	0	0	2	4	0	1	0	0	0	0	1

Visualization 1: Degree-Sized Co-Occurrence Network

Threshold Choice

The baseline class syntax uses filter(n > 1) to show pairs that appear together more than once. I changed the threshold to filter(n > 2) so a word pair must appear together in at least three comments before it enters the network.

# Baseline class-style graph filter:
# filter(n > 2) already removes one-off and two-off word pairs.
# To make the graph easier to read, this version then displays only the
# strongest ~50% of those remaining edges.
cooc_threshold <- 2
network_prune_fraction <- 0.50

word_cooc_baseline_filtered <- word_cooc %>%
  filter(n > cooc_threshold)

target_edge_count <- max(1, ceiling(nrow(word_cooc_baseline_filtered) * network_prune_fraction))

word_cooc_filtered <- word_cooc_baseline_filtered %>%
  arrange(desc(n), item1, item2) %>%
  slice_head(n = target_edge_count)

network_summary <- tibble(
  metric = c(
    "Base filter used",
    "Pairs before filtering",
    "Pairs after base filter",
    "Displayed graph prune target",
    "Pairs displayed after pruning",
    "Approximate edge reduction from base graph",
    "Unique words/nodes displayed"
  ),
  value = c(
    paste0("filter(n > ", cooc_threshold, ")"),
    nrow(word_cooc),
    nrow(word_cooc_baseline_filtered),
    paste0(round(network_prune_fraction * 100), "% strongest remaining pairs"),
    nrow(word_cooc_filtered),
    percent(1 - (nrow(word_cooc_filtered) / max(1, nrow(word_cooc_baseline_filtered))), accuracy = 1),
    length(unique(c(word_cooc_filtered$item1, word_cooc_filtered$item2)))
  )
)

network_summary %>%
  safe_table(caption = "Network Pruning Summary")

Network Pruning Summary
metric	value
Base filter used	filter(n > 2)
Pairs before filtering	24569
Pairs after base filter	334
Displayed graph prune target	50% strongest remaining pairs
Pairs displayed after pruning	167
Approximate edge reduction from base graph	50%
Unique words/nodes displayed	123

if (nrow(word_cooc_filtered) == 0) {
  stop("The filtered network has no edges. Try lowering cooc_threshold from 2 to 1.")
}

cooc_graph <- word_cooc_filtered %>%
  graph_from_data_frame(directed = FALSE)

V(cooc_graph)$degree <- degree(cooc_graph)

set.seed(42)

cooc_plot <- ggraph(cooc_graph, layout = "fr") +
  geom_edge_link(
    aes(edge_alpha = n, edge_width = n),
    color = afroman_palette$earth,
    show.legend = TRUE
  ) +
  geom_node_point(
    aes(size = degree),
    color = afroman_palette$gold
  ) +
  geom_node_point(
    aes(size = degree),
    shape = 21,
    fill = afroman_palette$hemp_light,
    color = afroman_palette$deep_green,
    stroke = 1.1
  ) +
  geom_node_text(
    aes(label = name),
    repel = TRUE,
    size = 4,
    color = afroman_palette$ink,
    fontface = "bold"
  ) +
  scale_edge_width(range = c(0.4, 3.2)) +
  scale_edge_alpha(range = c(0.25, 1)) +
  scale_size_continuous(range = c(4, 10)) +
  labs(
    title = "Pruned Word Co-Occurrence Network: Afroman YouTube Comments",
    subtitle = paste0(
      "Showing the strongest ~", round(network_prune_fraction * 100),
      "% of pairs after filter(n > ", cooc_threshold,
      "); node size = number of connections"
    ),
    caption = "MSBA 580 | Lab 5 | Exploratory analysis of API-accessible YouTube comments"
  ) +
  theme_graph(base_family = "sans", background = afroman_palette$smoke) +
  theme(
    plot.background = element_rect(fill = afroman_palette$smoke, color = NA),
    plot.title = element_text(face = "bold", color = afroman_palette$deep_green),
    plot.subtitle = element_text(color = afroman_palette$earth),
    plot.caption = element_text(color = afroman_palette$muted),
    legend.background = element_rect(fill = afroman_palette$smoke, color = NA),
    legend.key = element_rect(fill = afroman_palette$smoke, color = NA),
    legend.title = element_text(color = afroman_palette$earth),
    legend.text = element_text(color = afroman_palette$ink)
  )

print(cooc_plot)

ggsave(
  filename = "visuals/lab5_afroman_word_cooccurrence_network_pruned.png",
  plot = cooc_plot,
  width = 12,
  height = 8,
  dpi = 300
)

Visualization 2: State Mention Heatmap

This state heatmap visualizes state names that appear in the comment text. It is not a map of verified commenter locations. It is only a text-based location-mention proxy.

state_dictionary <- tibble(
  state = state.name,
  state_lower = str_to_lower(state.name),
  abbreviation = state.abb
) %>%
  bind_rows(
    tibble(
      state = "District of Columbia",
      state_lower = "district of columbia",
      abbreviation = "DC"
    )
  )

comment_text_lower <- str_to_lower(corpus$text)

state_counts <- state_dictionary %>%
  mutate(
    pattern = paste0("\\b", str_replace_all(state_lower, " ", "\\\\s+"), "\\b"),
    mentions = map_int(pattern, ~ sum(str_count(comment_text_lower, regex(.x, ignore_case = TRUE))))
  ) %>%
  select(state, abbreviation, mentions) %>%
  arrange(desc(mentions))

state_heatmap_summary <- state_counts %>%
  summarize(
    states_detected = sum(mentions > 0),
    total_state_mentions = sum(mentions),
    top_state = if_else(total_state_mentions > 0, state[which.max(mentions)], "No state names detected"),
    top_state_mentions = max(mentions, na.rm = TRUE)
  )

cat(
  "State heatmap summary: ",
  state_heatmap_summary$states_detected,
  " states detected; ",
  state_heatmap_summary$total_state_mentions,
  " total state-name mentions. Top state: ",
  state_heatmap_summary$top_state,
  " (", state_heatmap_summary$top_state_mentions, ").",
  sep = ""
)

## State heatmap summary: 12 states detected; 30 total state-name mentions. Top state: Ohio (14).

# Robust state tile heatmap. This avoids external map-package failures during knitting
# while still giving a map-like U.S. state view.
state_tile_layout <- tribble(
  ~abbreviation, ~x, ~y,
  "AK", 1, 1, "ME", 12, 1,
  "VT", 10, 2, "NH", 11, 2, "WA", 2, 2, "MT", 3, 2, "ND", 4, 2, "MN", 5, 2, "WI", 6, 2, "MI", 7, 2, "NY", 9, 2, "MA", 12, 2,
  "OR", 2, 3, "ID", 3, 3, "SD", 4, 3, "IA", 5, 3, "IL", 6, 3, "IN", 7, 3, "OH", 8, 3, "PA", 9, 3, "NJ", 10, 3, "CT", 11, 3, "RI", 12, 3,
  "CA", 2, 4, "NV", 3, 4, "WY", 4, 4, "NE", 5, 4, "MO", 6, 4, "KY", 7, 4, "WV", 8, 4, "VA", 9, 4, "MD", 10, 4, "DE", 11, 4,
  "AZ", 3, 5, "UT", 4, 5, "CO", 5, 5, "KS", 6, 5, "AR", 7, 5, "TN", 8, 5, "NC", 9, 5, "SC", 10, 5, "DC", 11, 5,
  "NM", 4, 6, "OK", 5, 6, "LA", 6, 6, "MS", 7, 6, "AL", 8, 6, "GA", 9, 6,
  "HI", 1, 7, "TX", 5, 7, "FL", 10, 7
)

state_tile_data <- state_tile_layout %>%
  left_join(state_counts, by = "abbreviation") %>%
  mutate(
    state = coalesce(state, abbreviation),
    mentions = coalesce(mentions, 0L),
    label = if_else(mentions > 0, paste0(abbreviation, "\n", mentions), abbreviation)
  )

state_heatmap_plot <- ggplot(state_tile_data, aes(x = x, y = -y, fill = mentions)) +
  geom_tile(color = afroman_palette$earth, linewidth = 0.75, width = 0.95, height = 0.95) +
  geom_text(aes(label = label), color = afroman_palette$ink, fontface = "bold", size = 3.2, lineheight = 0.85) +
  scale_fill_gradient(
    low = afroman_palette$paper,
    high = afroman_palette$deep_green,
    name = "Mentions",
    label = comma
  ) +
  coord_equal() +
  labs(
    title = "State Mention Heatmap: Afroman YouTube Comments",
    subtitle = "Tile-map heatmap of state names appearing in comment text; not verified commenter geography",
    x = NULL,
    y = NULL,
    caption = "MSBA 580 | Lab 5 | State-name detection with word-boundary matching | Hemp / maryjane visual theme"
  ) +
  theme_afroman(base_size = 13) +
  theme(
    axis.text = element_blank(),
    axis.ticks = element_blank(),
    panel.grid = element_blank(),
    legend.position = "right",
    plot.title = element_text(face = "bold", size = 18)
  )

print(state_heatmap_plot)

ggsave(
  filename = "visuals/lab5_afroman_state_mention_tile_heatmap.png",
  plot = state_heatmap_plot,
  width = 12,
  height = 7,
  dpi = 300
)

Visualization 3: Arc Diagram of Top Word Pairs

An arc diagram is a cleaner alternative to a dense force-directed network. It places words along a line and draws arcs between repeated co-occurring pairs.

arc_pair_count <- 35

arc_edges <- word_cooc %>%
  slice_max(n, n = arc_pair_count, with_ties = FALSE)

if (nrow(arc_edges) >= 3) {
  arc_graph <- arc_edges %>%
    graph_from_data_frame(directed = FALSE)

  arc_plot <- ggraph(arc_graph, layout = "linear") +
    geom_edge_arc(
      aes(edge_width = n, edge_alpha = n),
      color = afroman_palette$earth,
      strength = 0.7,
      show.legend = TRUE
    ) +
    geom_node_point(size = 4, color = afroman_palette$gold) +
    geom_node_text(
      aes(label = name),
      angle = 45,
      hjust = 0,
      size = 3.5,
      color = afroman_palette$ink
    ) +
    scale_edge_width(range = c(0.3, 2.8)) +
    scale_edge_alpha(range = c(0.2, 1)) +
    labs(
      title = "Arc Diagram: Top Afroman Comment Word Pairings",
      subtitle = paste0("Top ", arc_pair_count, " co-occurring word pairs by count"),
      caption = "MSBA 580 | Lab 5 | Arc height and thickness emphasize repeated pairings"
    ) +
    theme_graph(base_family = "sans", background = afroman_palette$smoke) +
    theme(
      plot.background = element_rect(fill = afroman_palette$smoke, color = NA),
      plot.title = element_text(face = "bold", color = afroman_palette$deep_green),
      plot.subtitle = element_text(color = afroman_palette$earth),
      plot.caption = element_text(color = afroman_palette$muted),
      legend.background = element_rect(fill = afroman_palette$smoke, color = NA),
      legend.key = element_rect(fill = afroman_palette$smoke, color = NA),
      legend.title = element_text(color = afroman_palette$earth),
      legend.text = element_text(color = afroman_palette$ink),
      plot.margin = margin(10, 30, 40, 10)
    )

  print(arc_plot)

  ggsave(
    filename = "visuals/lab5_afroman_arc_diagram.png",
    plot = arc_plot,
    width = 13,
    height = 7,
    dpi = 300
  )
} else {
  cat("Not enough co-occurring word pairs to render the arc diagram.")
}

Visualization 4: Lightweight Theme Flow

This lightweight flow chart connects manually defined comment themes to the words that triggered them and then to a simple signal-strength tier. It replaces the heavier alluvial rendering with a faster ggplot2-only version so the report can knit reliably while keeping the same business interpretation.

theme_dictionary <- tribble(
  ~theme, ~word,
  "Police / legal", "police",
  "Police / legal", "cops",
  "Police / legal", "cop",
  "Police / legal", "raid",
  "Police / legal", "warrant",
  "Police / legal", "rights",
  "Police / legal", "law",
  "Police / legal", "court",
  "Police / legal", "judge",
  "Police / legal", "sheriff",
  "Humor / reaction", "lol",
  "Humor / reaction", "funny",
  "Humor / reaction", "hilarious",
  "Humor / reaction", "laugh",
  "Humor / reaction", "joke",
  "Humor / reaction", "classic",
  "Music / nostalgia", "music",
  "Music / nostalgia", "afroman",
  "Music / nostalgia", "rap",
  "Music / nostalgia", "lyrics",
  "Music / nostalgia", "beat",
  "Music / nostalgia", "listen",
  "Music / nostalgia", "memories",
  "Cannabis / culture", "high",
  "Cannabis / culture", "weed",
  "Cannabis / culture", "smoke",
  "Cannabis / culture", "smoking",
  "Cannabis / culture", "marijuana",
  "Cannabis / culture", "cannabis",
  "Cannabis / culture", "hemp",
  "Cannabis / culture", "joint",
  "Cannabis / culture", "blunt",
  "Cannabis / culture", "pot",
  "Politics / public issue", "government",
  "Politics / public issue", "political",
  "Politics / public issue", "politics",
  "Politics / public issue", "freedom",
  "Politics / public issue", "tax",
  "Politics / public issue", "corruption",
  "Support / praise", "love",
  "Support / praise", "respect",
  "Support / praise", "support",
  "Support / praise", "great",
  "Support / praise", "legend",
  "Support / praise", "amazing"
)

theme_flow_min_keyword_count <- 2
# Prune the keyword column for readability. Three keywords per theme keeps the
# flow chart interpretive instead of turning the middle column into a wall of text.
theme_flow_top_keywords_per_theme <- 3
theme_flow_max_keywords_total <- 18

theme_word_counts_all <- words_clean %>%
  inner_join(theme_dictionary, by = "word") %>%
  count(theme, word, sort = TRUE)

theme_word_counts_pruned <- theme_word_counts_all %>%
  filter(n >= theme_flow_min_keyword_count)

# Fail-soft fallback: if the minimum-count filter is too strict for this sample,
# keep the strongest available dictionary hits rather than producing a blank plot.
if (nrow(theme_word_counts_pruned) == 0 && nrow(theme_word_counts_all) > 0) {
  theme_word_counts_pruned <- theme_word_counts_all
}

theme_word_counts <- theme_word_counts_pruned %>%
  group_by(theme) %>%
  slice_max(n, n = theme_flow_top_keywords_per_theme, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(desc(n), theme, word) %>%
  slice_head(n = theme_flow_max_keywords_total) %>%
  mutate(
    signal_strength = case_when(
      n >= quantile(n, 0.75, na.rm = TRUE) ~ "High repeated signal",
      n >= quantile(n, 0.40, na.rm = TRUE) ~ "Medium repeated signal",
      TRUE ~ "Lower repeated signal"
    ),
    theme_label = str_wrap(theme, width = 16),
    word_label = str_wrap(word, width = 10),
    signal_label = str_wrap(signal_strength, width = 16)
  )

flow_pruning_summary <- tibble(
  setting = c(
    "Minimum keyword count kept",
    "Maximum keywords per theme",
    "Maximum total keyword rows",
    "Keyword rows displayed",
    "Keyword rows before pruning"
  ),
  value = c(
    theme_flow_min_keyword_count,
    theme_flow_top_keywords_per_theme,
    theme_flow_max_keywords_total,
    nrow(theme_word_counts),
    nrow(theme_word_counts_all)
  )
)

flow_pruning_summary %>%
  safe_table(caption = "Theme Flow Pruning Settings")

Theme Flow Pruning Settings
setting	value
Minimum keyword count kept	2
Maximum keywords per theme	3
Maximum total keyword rows	18
Keyword rows displayed	18
Keyword rows before pruning	36

theme_word_counts %>%
  select(theme, word, n, signal_strength) %>%
  safe_table(caption = "Pruned Theme Dictionary Hits Used in the Flow Chart", max_rows = 30)

Pruned Theme Dictionary Hits Used in the Flow Chart
theme	word	n	signal_strength
Music / nostalgia	afroman	225	High repeated signal
Support / praise	love	87	High repeated signal
Police / legal	cops	68	High repeated signal
Police / legal	court	44	High repeated signal
Police / legal	police	38	High repeated signal
Music / nostalgia	music	31	Medium repeated signal
Humor / reaction	lol	27	Medium repeated signal
Support / praise	legend	22	Medium repeated signal
Politics / public issue	freedom	17	Medium repeated signal
Humor / reaction	hilarious	16	Medium repeated signal
Humor / reaction	funny	12	Medium repeated signal
Support / praise	support	11	Lower repeated signal
Music / nostalgia	rap	9	Lower repeated signal
Cannabis / culture	weed	5	Lower repeated signal
Politics / public issue	government	5	Lower repeated signal
Politics / public issue	corruption	4	Lower repeated signal
Cannabis / culture	blunt	2	Lower repeated signal
Cannabis / culture	smoke	2	Lower repeated signal

# Knit-safe replacement for the theme-flow visual.
# The alluvial and curved-flow versions were visually interesting, but both
# stalled during knitting in this environment. This version uses only a small
# heatmap matrix, which is much faster and more reliable for RPubs.

if (nrow(theme_word_counts) > 0) {
  theme_matrix_data <- theme_word_counts %>%
    group_by(theme) %>%
    slice_max(n, n = 2, with_ties = FALSE) %>%
    ungroup() %>%
    arrange(theme, desc(n), word) %>%
    slice_head(n = 12) %>%
    mutate(
      theme = str_wrap(theme, width = 18),
      word = str_wrap(word, width = 12),
      signal_strength = factor(
        signal_strength,
        levels = c("Lower repeated signal", "Medium repeated signal", "High repeated signal")
      ),
      label = paste0(word, "\n", "n=", n)
    )

  theme_signal_heatmap <- ggplot(
    theme_matrix_data,
    aes(x = signal_strength, y = reorder(word, n), fill = n)
  ) +
    geom_tile(color = afroman_palette$paper, linewidth = 1.1) +
    geom_text(
      aes(label = label),
      color = afroman_palette$ink,
      fontface = "bold",
      size = 3.5,
      lineheight = 0.9
    ) +
    facet_wrap(~ theme, scales = "free_y", ncol = 2) +
    scale_fill_gradient(
      low = afroman_palette$hemp_light,
      high = afroman_palette$deep_green
    ) +
    labs(
      title = "Theme Signal Heatmap: Afroman YouTube Comments",
      subtitle = "Knit-safe replacement for the crowded theme flow: top 2 keywords per theme, capped at 12 rows",
      x = "Repeated-signal category",
      y = NULL,
      fill = "Keyword\nmentions",
      caption = "MSBA 580 | Lab 5 | Dictionary-based exploratory theme map | Interprets comment language, not verified audience identity"
    ) +
    theme_afroman(base_size = 13) +
    theme(
      strip.text = element_text(face = "bold", color = afroman_palette$paper, size = 12),
      strip.background = element_rect(fill = afroman_palette$deep_green, color = afroman_palette$earth),
      axis.text.x = element_text(angle = 20, hjust = 1, face = "bold", color = afroman_palette$deep_green),
      axis.text.y = element_blank(),
      axis.ticks.y = element_blank(),
      panel.grid = element_blank(),
      legend.position = "right",
      plot.title = element_text(size = 20, face = "bold", color = afroman_palette$deep_green),
      plot.subtitle = element_text(size = 12, color = afroman_palette$earth)
    )

  print(theme_signal_heatmap)
} else {
  theme_signal_fallback <- ggplot() +
    annotate(
      "label",
      x = 0,
      y = 0,
      label = "No theme dictionary hits were detected for the theme signal heatmap.",
      fill = afroman_palette$paper,
      color = afroman_palette$deep_green,
      fontface = "bold",
      size = 5,
      label.size = 0.8
    ) +
    xlim(-1, 1) +
    ylim(-1, 1) +
    labs(
      title = "Theme Signal Heatmap Fallback",
      subtitle = "Core lab outputs above still rendered",
      caption = "MSBA 580 | Lab 5 | Fallback visual"
    ) +
    theme_afroman(base_size = 13) +
    theme(
      axis.text = element_blank(),
      axis.title = element_blank(),
      axis.ticks = element_blank(),
      panel.grid = element_blank()
    )

  print(theme_signal_fallback)
}

Lab Reflection Questions

Question 1

Which additional words did you add to the stopword list manually? Why does it make sense to remove those words as stopwords? Also, does the final network graph change after removing these additional stopwords? If so, briefly explain how and why.

I manually added video and song to the stopword list. These words make sense to remove because the dataset comes from YouTube comments about a music video, so those words are likely to appear frequently because of the setting rather than because they reveal a specific theme. In other words, they are context words, but they are not the most useful words for interpreting deeper audience reactions, humor, controversy, nostalgia, or sentiment. The final network graph changed because removing those terms prevented broad media words from acting as generic hub nodes. This made the graph cleaner and helped the remaining word clusters focus more on meaningful relationships inside the comment language.

Question 2

Change n in filter(n > 1) to a different value. What is your final value of n? Will it give you a better result?

I changed the filter from filter(n > 1) to filter(n > 2), so my final comparison value is 2. This means a word pair must appear together in at least three comments before it becomes eligible for the network graph. I then pruned the displayed network to the strongest approximate 50% of remaining pairs so the visual is easier to read. I think this gives a better result because it removes more one-off pairings and reduces graph clutter while still preserving the strongest recurring relationships. A much higher threshold might be even cleaner, but it could remove smaller themes that are still useful in a comment dataset.

Question 3

On the BlueSky post, there were two co-occurrence network graphs. Which one do you like better? Why?

I prefer the cleaner network graph with fewer nodes and stronger visible relationships. For business communication, the best graph is not always the one with the most information; it is usually the one that helps the audience understand the main pattern quickly. A dense graph can be useful during exploration, but it can become difficult to interpret when too many weak connections are shown at the same time. For a final report or presentation, I would choose the graph with a higher co-occurrence threshold because it better separates signal from noise. This is similar to what I did in this lab by raising the threshold from filter(n > 1) to filter(n > 2).

Question 4

Which word pairs surprised you? Did buy cluster more with optimism words like dip or caution words like wait and valuation?

In my own Afroman comment dataset, the most interesting word pairs are the ones that reveal repeated audience reactions rather than simple topic labels. I would look for pairs that connect emotion, humor, identity, politics, nostalgia, or repeated phrases from the video. For the stock-related example in the discussion question, I would interpret buy differently depending on whether it appears closer to optimism words like dip or caution words like wait and valuation. If buy clusters with dip, that suggests a more opportunistic or optimistic investing frame. If it clusters with wait or valuation, that suggests a more cautious decision-making frame.

Question 5

Should we treat stock tickers, $SPCX and SpaceX, as the same token instead of separate ones?

In most business text analysis settings, I would treat $SPCX and SpaceX as the same token if the goal is to measure conversation about the same company, brand, or investment theme. Keeping them separate may split the signal across two labels even though users may be talking about the same underlying topic. However, I would document the decision clearly because ticker symbols and company names can sometimes carry different meanings depending on context. For example, a ticker may appear more often in investor discussions, while the company name may appear more often in general news or public discussion. A good approach would be to create a normalization step where related tokens are recoded into one standard label.

Question 6

What is the risk of drawing conclusions from a co-occurrence network built on only a small sample of comments? What is a good sample size? How would you communicate that limitation in a business report?

The main risk is that a small sample can make random or temporary word pairings look more meaningful than they really are. Co-occurrence networks are useful for exploration, but they can be sensitive to sample size, repeated phrases, unusually active commenters, bots, jokes, or event-specific language. A good sample size depends on the platform and business question, but larger samples are generally better because they make repeated patterns more stable. For a business report, I would avoid claiming that the graph represents the entire audience unless the data collection method supports that claim. I would describe the network as an exploratory map of the available comment sample and recommend validating the pattern with a larger sample, multiple time windows, or another data source.

Question 7

Is there a way to analyze comments or reviews on topics that interest you in relation to a significant event? If so, how would you identify the event? Why is that event significant, and how might it influence the comments or reviews you analyze?

Yes. A strong approach would be to compare comments before and after a significant event, such as a product launch, controversy, policy announcement, patch update, legal decision, earnings report, or viral media moment. I would identify the event by using a reliable timestamp from a primary source, official announcement, news article, company post, or platform release note. The event is significant if it plausibly changes what people are paying attention to or how they feel about the topic. For example, in a policy shock analytics project, an executive order could be used as the event date, and comments or search interest before and after the order could be compared. The event might influence the language by increasing uncertainty, emotional reactions, political framing, brand mentions, or repeated questions from the public.

Interpretation

The co-occurrence network shows which words tend to appear together in the same Afroman YouTube comments. Thicker and darker edges represent word pairs that co-occur more often. Larger nodes represent words with more connections to other words in the filtered graph.

The matrix, network, arc diagram, map, and alluvial flow each answer a slightly different question. The matrix gives exact word-pair counts. The network shows clusters and central connector words. The state heatmap shows place references in the text. The arc diagram shows repeated word-pair links in a cleaner linear layout. The alluvial flow connects observed words to manually documented themes.

The most important limitation is that these are exploratory patterns from the available comment sample. They should be treated as directional signals rather than definitive conclusions about the full YouTube audience.

Reproducibility Notes

The analysis uses a cleaned CSV file rather than live API collection.
No API key is included in this report.
The network layout uses set.seed(42) so the graph is reproducible when knitted again.
The manual stopwords used in this version are video and song.
The final network graph uses filter(n > 2) to reduce visual clutter.
The state heatmap is based on text mentions of state names, not verified commenter locations.
The alluvial flow uses a transparent manual theme dictionary.
Output graphics are saved in the visuals/ folder.

References

Chiericato, E. (2015). Co Citations and Co Occurrences in SEO: The Complete Guide. Web Marketing Academy.
https://webmarketing.academy/en/co-citazioni-co-occorrenze-guida-definitiva/

Silge, J., and Robinson, D. (2017). Text Mining with R: A Tidy Approach. O’Reilly Media.
https://www.tidytextmining.com/

Robinson, D. (2021). widyr: Widen, Process, then Re-Tidy Data. R package.
https://CRAN.R-project.org/package=widyr

Csardi, G., and Nepusz, T. (2006). The igraph software package for complex network research. InterJournal, Complex Systems, 1695.
https://igraph.org

Pedersen, T. L. (2024). ggraph: An Implementation of Grammar of Graphics for Graphs and Networks. R package.
https://CRAN.R-project.org/package=ggraph

Lab 5 - Text Analysis & NLP: Afroman Word Co-Occurrence Analysis

Patched version with matrix, network, state map, arc diagram, and alluvial flow

Inigo Montoya, guided and taught by Dr Jimmy Z the data Jedi of CSUCI

July 01, 2026