Lab 5 Text Analysis Word Co-occurrence

Afroman comment analysis: a hemp-toned, music-poster style report that keeps the class workflow intact while adding cleaner networks, a state mention heatmap, and portfolio-ready visual storytelling.

Part 1: Five-Sentence Reflection

One idea that stood out to me from this week’s text analysis material is that word relationships can reveal meaning that simple word counts miss. A single frequent word can tell us what people mention often, but co-occurrence analysis shows which ideas travel together in the same comments or documents. This matters in business analytics because customers, viewers, and social media users often communicate through repeated phrases, associations, jokes, complaints, and emotional reactions. The tidy text approach is useful because it turns messy language into structured rows that can be counted, filtered, visualized, and interpreted. For this lab, I am using word co-occurrence to move beyond “what words appeared most often” and toward “what ideas appeared together in the Afroman YouTube comment sample.”

Part 2: Word Co-Occurrence Analysis

Background

For this lab, I reused a cleaned YouTube comment dataset from my previous Afroman comment analysis. The goal is to identify which words tend to appear together in the same comments and then visualize those relationships through several views: a co-occurrence matrix, a network graph, a state mention map, an arc diagram, and a theme flow chart.

This analysis should be interpreted as exploratory. The dataset reflects the comments available through the prior collection workflow, not a complete archive of every comment ever posted on the video.

Patch Notes: What This Version Adds

Patch Notes for This Lab 5 Version
Change Purpose Risk_Control
Professor-style custom stopwords Keeps standard and manual stopwords in one workflow Manual stopwords are documented and easy to change
Top-word co-occurrence matrix Shows exact pair counts before relying on graph line thickness Restricted to top words so the table stays readable
Degree-sized network nodes Makes central connector words visually larger Node size is explained in the subtitle
State mention heatmap Adds a geographic heat layer based on text mentions Clearly labeled as place mentions, not commenter geography
Arc diagram Shows repeated word-pair links in a cleaner linear layout Limited to top pairs to avoid edge clutter
Alluvial theme flow Connects themes to keywords and signal strength Uses a transparent keyword dictionary rather than hidden model labels

Load the Cleaned Afroman Comment Data

# Put your copied cleaned Afroman CSV in the same folder as this RMD.
# If your file has a different name, update preferred_file below.

preferred_file <- "youtube_comments_clean_u4AiuqQpB1U.csv"

if (file.exists(preferred_file)) {
  data_file <- preferred_file
} else {
  csv_candidates <- list.files(pattern = "\\.csv$", ignore.case = TRUE)
  afroman_candidates <- csv_candidates[
    str_detect(str_to_lower(csv_candidates), "afroman|youtube|comment|clean")
  ]

  if (length(afroman_candidates) > 0) {
    data_file <- afroman_candidates[1]
  } else if (length(csv_candidates) > 0) {
    data_file <- csv_candidates[1]
  } else {
    stop("No CSV file found. Put the cleaned Afroman comment CSV in the same folder as this RMD.")
  }
}

comments_raw <- read_csv(data_file, show_col_types = FALSE)

cat("Using data file:", data_file)
## Using data file: youtube_comments_clean_u4AiuqQpB1U.csv

Inspect the Imported Data

glimpse(comments_raw)
## Rows: 1,051
## Columns: 11
## $ comment_id           <chr> "UgzxqV6MPR5RLmUBmal4AaABAg", "UgxUrfoE4p2mooMGNO…
## $ author               <chr> "@AP-kg9dz", "@kurtwpg", "@thedont2154", "@hidden…
## $ text                 <chr> "Songs clowning on corrupt cops is my new favorit…
## $ published_at         <dttm> 2026-03-20 15:51:20, 2026-03-26 02:44:02, 2026-0…
## $ like_count           <dbl> 34228, 4484, 11333, 233, 9066, 30273, 25758, 500,…
## $ reply_count          <dbl> 146, 24, 55, 9, 111, 212, 135, 3, 63, 29, 2, 97, …
## $ video_id             <chr> "u4AiuqQpB1U", "u4AiuqQpB1U", "u4AiuqQpB1U", "u4A…
## $ source_method        <chr> "YouTube Data API v3 commentThreads.list", "YouTu…
## $ text_original        <chr> "Songs clowning on corrupt cops is my new favorit…
## $ text_clean           <chr> "songs clowning on corrupt cops is my new favorit…
## $ comment_length_words <dbl> 10, 22, 13, 9, 22, 20, 30, 11, 32, 14, 14, 17, 37…
comments_raw %>%
  head(10) %>%
  safe_table(caption = "Preview of Imported Afroman Comment Data")
Preview of Imported Afroman Comment Data
comment_id author text published_at like_count reply_count video_id source_method text_original text_clean comment_length_words
UgzxqV6MPR5RLmUBmal4AaABAg @AP-kg9dz Songs clowning on corrupt cops is my new favorite genre. 2026-03-20 15:51:20 34228 146 u4AiuqQpB1U YouTube Data API v3 commentThreads.list Songs clowning on corrupt cops is my new favorite genre. songs clowning on corrupt cops is my new favorite genre 10
UgxUrfoE4p2mooMGNOt4AaABAg @kurtwpg “He’s an addict, he got fired. He went to Adams County, somehow he got hired…. 2026-03-26 02:44:02 4484 24 u4AiuqQpB1U YouTube Data API v3 commentThreads.list “He’s an addict, he got fired. He went to Adams County, somehow he got hired…. he s an addict he got fired he went to adams county somehow he got hired migh… 22
UgyD4dfBmBtQlm0fIXx4AaABAg @thedont2154 They broke the unwritten rule: Never sue someone funnier than you for defama… 2026-03-25 17:14:32 11333 55 u4AiuqQpB1U YouTube Data API v3 commentThreads.list They broke the unwritten rule: Never sue someone funnier than you for defama… they broke the unwritten rule never sue someone funnier than you for defamation 13
UgzNaeX9pvu-e3WGKpJ4AaABAg @hiddenworldforge374 We need afroman on the bricks and minifigs scandal 2026-06-02 16:52:21 233 9 u4AiuqQpB1U YouTube Data API v3 commentThreads.list We need afroman on the bricks and minifigs scandal we need afroman on the bricks and minifigs scandal 9
UgwIVslwTx0H_uMp33Z4AaABAg @DeanRockne You know, with the benefit of 3 years of hindsight, I think it might have bee… 2026-03-20 05:51:23 9066 111 u4AiuqQpB1U YouTube Data API v3 commentThreads.list You know, with the benefit of 3 years of hindsight, I think it might have bee… you know with the benefit of years of hindsight i think it might have been ea… 22
UgwqYnZoeBT47sgBJUt4AaABAg @slayanddecay6009 Remember Randy is ON record that he cannot confirm that Afroman did or did no… 2026-03-19 18:55:30 30273 212 u4AiuqQpB1U YouTube Data API v3 commentThreads.list Remember Randy is ON record that he cannot confirm that Afroman did or did no… remember randy is on record that he cannot confirm that afroman did or did no… 20
Ugxb6YvIdGkmADL8CoN4AaABAg @Whiston555 Writing a song about fucking his wife and then him having to say for the reco… 2026-03-19 06:02:40 25758 135 u4AiuqQpB1U YouTube Data API v3 commentThreads.list Writing a song about fucking his wife and then him having to say for the reco… writing a song about fucking his wife and then him having to say for the reco… 30
UgxxZQ8mAHcdq6gHvKd4AaABAg @marciaguy10899 I feel a civic duty to listen to all of these 😂 2026-03-31 11:44:59 500 3 u4AiuqQpB1U YouTube Data API v3 commentThreads.list I feel a civic duty to listen to all of these 😂 i feel a civic duty to listen to all of these 11
Ugz5djYCajTXh4TUSwV4AaABAg @thisisntmyceiling When I was a teenager blaring “I got high” I would’ve never thought I’d one d… 2026-03-23 19:26:48 7407 63 u4AiuqQpB1U YouTube Data API v3 commentThreads.list When I was a teenager blaring “I got high” I would’ve never thought I’d one d… when i was a teenager blaring i got high i would ve never thought i d one day… 32
UgyXIaehjpy0OQGzJWp4AaABAg @dusty_cowboy Contractor here. I’ll fix that door for a slice of lemon pound cake. 2026-03-29 15:33:35 3838 29 u4AiuqQpB1U YouTube Data API v3 commentThreads.list Contractor here. I’ll fix that door for a slice of lemon pound cake. contractor here i ll fix that door for a slice of lemon pound cake 14

Detect the Comment Text Column

possible_text_columns <- c(
  "text",
  "comment",
  "comments",
  "comment_text",
  "clean_comment",
  "clean_text",
  "text_original",
  "textOriginal",
  "body",
  "content",
  "review",
  "message",
  "snippet.topLevelComment.snippet.textOriginal"
)

matched_text_columns <- intersect(possible_text_columns, names(comments_raw))

if (length(matched_text_columns) > 0) {
  text_column <- matched_text_columns[1]
} else {
  character_columns <- names(comments_raw)[sapply(comments_raw, is.character)]

  if (length(character_columns) == 0) {
    stop("No character/text columns found. Please check the CSV structure.")
  }

  text_column <- character_columns[
    which.max(
      sapply(
        comments_raw[character_columns],
        function(x) sum(!is.na(x) & nchar(as.character(x)) > 0)
      )
    )
  ]
}

cat("Detected text column:", text_column)
## Detected text column: text

Build the Corpus

corpus <- comments_raw %>%
  mutate(
    doc_id = paste0("comment_", row_number()),
    text = as.character(.data[[text_column]])
  ) %>%
  filter(!is.na(text), str_squish(text) != "") %>%
  mutate(
    text = str_replace_all(text, "http\\S+|www\\.\\S+", " "),
    text = str_replace_all(text, "&amp;", " and "),
    text = str_replace_all(text, "[\r\n\t]", " "),
    text = str_squish(text)
  ) %>%
  select(doc_id, text, everything())

corpus_summary <- tibble(
  metric = c(
    "CSV file used",
    "Detected text column",
    "Number of usable comments",
    "Average comment length in characters"
  ),
  value = c(
    data_file,
    text_column,
    nrow(corpus),
    round(mean(nchar(corpus$text)), 1)
  )
)

corpus_summary %>%
  safe_table(caption = "Corpus Summary")
Corpus Summary
metric value
CSV file used youtube_comments_clean_u4AiuqQpB1U.csv
Detected text column text
Number of usable comments 1051
Average comment length in characters 77.3
corpus %>%
  select(doc_id, text) %>%
  head(10) %>%
  safe_table(
    caption = "First 10 Comments in the Analysis Corpus",
    max_chars = 120
  )
First 10 Comments in the Analysis Corpus
doc_id text
comment_1 Songs clowning on corrupt cops is my new favorite genre.
comment_2 “He’s an addict, he got fired. He went to Adams County, somehow he got hired.” might be his best line yet.
comment_3 They broke the unwritten rule: Never sue someone funnier than you for defamation.
comment_4 We need afroman on the bricks and minifigs scandal
comment_5 You know, with the benefit of 3 years of hindsight, I think it might have been easier to fix this guy’s door.
comment_6 Remember Randy is ON record that he cannot confirm that Afroman did or did not have sex with his wife
comment_7 Writing a song about fucking his wife and then him having to say for the record that he wasn’t sure it didn’t happen …
comment_8 I feel a civic duty to listen to all of these 😂
comment_9 When I was a teenager blaring “I got high” I would’ve never thought I’d one day look to this man as a champion for fr…
comment_10 Contractor here. I’ll fix that door for a slice of lemon pound cake.

Tokenization and Cleaning

Step 1: Tokenize to Individual Words

words <- corpus %>%
  mutate(
    text = str_replace_all(text, "[^[:alnum:][:space:]$]", " "),
    text = str_squish(text)
  ) %>%
  unnest_tokens(word, text, token = "words")

words %>%
  select(doc_id, word) %>%
  head(30) %>%
  safe_table(caption = "First 30 Tokens")
First 30 Tokens
doc_id word
comment_1 songs
comment_1 clowning
comment_1 on
comment_1 corrupt
comment_1 cops
comment_1 is
comment_1 my
comment_1 new
comment_1 favorite
comment_1 genre
comment_2 he
comment_2 s
comment_2 an
comment_2 addict
comment_2 he
comment_2 got
comment_2 fired
comment_2 he
comment_2 went
comment_2 to
comment_2 adams
comment_2 county
comment_2 somehow
comment_2 he
comment_2 got

Step 2: Remove Standard and Custom Stopwords

The manual stopwords added for this version are video and song. These are broad media-context words for this YouTube dataset. They are useful for understanding the setting, but they can become generic hubs in a co-occurrence network rather than revealing more specific themes.

data("stop_words")

custom_only_stop_words <- tibble(
  word = c("video", "song"),
  lexicon = "custom"
)

custom_stop_words <- bind_rows(
  stop_words,
  custom_only_stop_words
)

words_standard_clean <- words %>%
  anti_join(stop_words, by = "word") %>%
  filter(nchar(word) > 2) %>%
  filter(!str_detect(word, "^[0-9]+$"))

words_clean <- words %>%
  anti_join(custom_stop_words, by = "word") %>%
  filter(nchar(word) > 2) %>%
  filter(!str_detect(word, "^[0-9]+$"))

custom_stopword_check <- words_standard_clean %>%
  count(word, sort = TRUE) %>%
  filter(word %in% custom_only_stop_words$word)

custom_stopword_check %>%
  safe_table(caption = "Manual Stopwords Removed from the Afroman Comment Corpus")
Manual Stopwords Removed from the Afroman Comment Corpus
word n
song 103
video 30

Step 3: Word Frequency Check

word_freq <- words_clean %>%
  count(word, sort = TRUE)

word_freq %>%
  slice_head(n = 30) %>%
  safe_table(caption = "Top 30 Words After Stopword Removal")
Top 30 Words After Stopword Removal
word n
afroman 225
love 87
randy 83
cops 68
court 44
walters 43
shit 40
police 38
time 38
music 31
son 31
county 27
day 27
lol 27
bitch 26
banger 24
corrupt 24
don 24
people 24
god 23
head 22
judge 22
legend 22
wife 22
adams 21
word_freq_plot <- word_freq %>%
  slice_head(n = 20) %>%
  ggplot(aes(x = reorder(word, n), y = n, fill = n)) +
  geom_col(show.legend = FALSE, color = afroman_palette$earth, linewidth = 0.2) +
  coord_flip() +
  scale_fill_gradient(low = afroman_palette$hemp_light, high = afroman_palette$deep_green) +
  labs(
    title = "Top 20 Words in the Afroman YouTube Comment Corpus",
    subtitle = "After standard and manual stopword removal",
    x = NULL,
    y = "Word Count",
    caption = "MSBA 580 | Lab 5 | Text Analysis and NLP | Hemp / maryjane visual theme"
  ) +
  theme_afroman(base_size = 13)

print(word_freq_plot)

ggsave(
  filename = "visuals/lab5_afroman_top_20_words.png",
  plot = word_freq_plot,
  width = 10,
  height = 7,
  dpi = 300
)

Computing Word Co-Occurrences

A co-occurrence count shows how often two words appear in the same comment. This helps identify words and ideas that tend to travel together in the dataset.

To avoid repeated words inside one comment inflating the results, this version keeps only one instance of each word per comment before computing pairwise counts.

word_cooc <- words_clean %>%
  distinct(doc_id, word) %>%
  pairwise_count(
    item = word,
    feature = doc_id,
    sort = TRUE,
    upper = FALSE
  )

word_cooc %>%
  head(30) %>%
  safe_table(caption = "Top 30 Word Co-Occurrence Pairs")
Top 30 Word Co-Occurrence Pairs
item1 item2 n
randy walters 41
son bitch 25
randy son 21
walters son 20
pound cake 19
afroman love 18
randy bitch 18
adams county 17
walters bitch 17
afroman randy 15
lemon pound 14
lemon cake 14
afroman police 14
afroman court 14
cops afroman 13
afroman legend 13
afroman time 12
county sheriff 11
afroman music 11
freedom speech 10
head stuck 10
afroman wife 9
afroman didn 9
randy walter 9
afroman house 9

Co-Occurrence Matrix

This section borrows the professor’s matrix logic, but applies it to the Afroman YouTube comment corpus. The matrix is useful because it shows exact pair counts instead of only showing relative edge thickness in a graph.

top_words <- words_clean %>%
  count(word, sort = TRUE) %>%
  slice_max(n, n = 15) %>%
  pull(word)

afroman_matrix <- word_cooc %>%
  filter(item1 %in% top_words, item2 %in% top_words) %>%
  bind_rows(
    word_cooc %>%
      rename(item1 = item2, item2 = item1) %>%
      filter(item1 %in% top_words, item2 %in% top_words)
  ) %>%
  cast_sparse(item1, item2, n) %>%
  as.matrix()

afroman_matrix
##         walters bitch son love randy police court afroman time music shit lol
## randy        41    18  21    2     0      1     7      15    0     0    2   0
## son          20    25   0    1    21      0     2       8    0     0    2   0
## walters       0    17  20    1    41      0     3       7    0     0    0   0
## afroman       7     7   8   18    15     14    14       0   12    11    7   4
## cops          2     0   0    8     3      5     3      13    5     3    7   1
## county        1     1   1    0     2      2     0       8    2     2    1   0
## court         3     0   2    5     7      3     0      14    1     1    2   2
## love          1     0   1    0     2      2     5      18    2     4    3   0
## police        0     0   0    2     1      0     3      14    1     0    3   0
## day           1     1   2    3     1      1     2       3    1     1    2   0
## time          0     0   0    2     0      1     1      12    0     1    1   0
## shit          0     1   2    3     2      3     2       7    1     0    0   0
## music         0     0   0    4     0      0     1      11    1     0    0   1
## bitch        17     0  25    0    18      0     0       7    0     0    1   0
## lol           0     0   0    0     0      0     2       4    0     1    0   0
##         day county cops
## randy     1      2    3
## son       2      1    0
## walters   1      1    2
## afroman   3      8   13
## cops      1      2    0
## county    1      0    2
## court     2      0    3
## love      3      0    8
## police    1      2    5
## day       0      1    1
## time      1      2    5
## shit      2      1    7
## music     1      2    3
## bitch     1      1    0
## lol       0      0    1
as.data.frame(afroman_matrix) %>%
  rownames_to_column("word") %>%
  safe_table(
    caption = "Top-Word Co-Occurrence Matrix",
    max_rows = 20,
    max_chars = 30
  )
Top-Word Co-Occurrence Matrix
word walters bitch son love randy police court afroman time music shit lol day county cops
randy 41 18 21 2 0 1 7 15 0 0 2 0 1 2 3
son 20 25 0 1 21 0 2 8 0 0 2 0 2 1 0
walters 0 17 20 1 41 0 3 7 0 0 0 0 1 1 2
afroman 7 7 8 18 15 14 14 0 12 11 7 4 3 8 13
cops 2 0 0 8 3 5 3 13 5 3 7 1 1 2 0
county 1 1 1 0 2 2 0 8 2 2 1 0 1 0 2
court 3 0 2 5 7 3 0 14 1 1 2 2 2 0 3
love 1 0 1 0 2 2 5 18 2 4 3 0 3 0 8
police 0 0 0 2 1 0 3 14 1 0 3 0 1 2 5
day 1 1 2 3 1 1 2 3 1 1 2 0 0 1 1
time 0 0 0 2 0 1 1 12 0 1 1 0 1 2 5
shit 0 1 2 3 2 3 2 7 1 0 0 0 2 1 7
music 0 0 0 4 0 0 1 11 1 0 0 1 1 2 3
bitch 17 0 25 0 18 0 0 7 0 0 1 0 1 1 0
lol 0 0 0 0 0 0 2 4 0 1 0 0 0 0 1

Visualization 1: Degree-Sized Co-Occurrence Network

Threshold Choice

The baseline class syntax uses filter(n > 1) to show pairs that appear together more than once. I changed the threshold to filter(n > 2) so a word pair must appear together in at least three comments before it enters the network.

# Baseline class-style graph filter:
# filter(n > 2) already removes one-off and two-off word pairs.
# To make the graph easier to read, this version then displays only the
# strongest ~50% of those remaining edges.
cooc_threshold <- 2
network_prune_fraction <- 0.50

word_cooc_baseline_filtered <- word_cooc %>%
  filter(n > cooc_threshold)

target_edge_count <- max(1, ceiling(nrow(word_cooc_baseline_filtered) * network_prune_fraction))

word_cooc_filtered <- word_cooc_baseline_filtered %>%
  arrange(desc(n), item1, item2) %>%
  slice_head(n = target_edge_count)

network_summary <- tibble(
  metric = c(
    "Base filter used",
    "Pairs before filtering",
    "Pairs after base filter",
    "Displayed graph prune target",
    "Pairs displayed after pruning",
    "Approximate edge reduction from base graph",
    "Unique words/nodes displayed"
  ),
  value = c(
    paste0("filter(n > ", cooc_threshold, ")"),
    nrow(word_cooc),
    nrow(word_cooc_baseline_filtered),
    paste0(round(network_prune_fraction * 100), "% strongest remaining pairs"),
    nrow(word_cooc_filtered),
    percent(1 - (nrow(word_cooc_filtered) / max(1, nrow(word_cooc_baseline_filtered))), accuracy = 1),
    length(unique(c(word_cooc_filtered$item1, word_cooc_filtered$item2)))
  )
)

network_summary %>%
  safe_table(caption = "Network Pruning Summary")
Network Pruning Summary
metric value
Base filter used filter(n > 2)
Pairs before filtering 24569
Pairs after base filter 334
Displayed graph prune target 50% strongest remaining pairs
Pairs displayed after pruning 167
Approximate edge reduction from base graph 50%
Unique words/nodes displayed 123
if (nrow(word_cooc_filtered) == 0) {
  stop("The filtered network has no edges. Try lowering cooc_threshold from 2 to 1.")
}

cooc_graph <- word_cooc_filtered %>%
  graph_from_data_frame(directed = FALSE)

V(cooc_graph)$degree <- degree(cooc_graph)

set.seed(42)

cooc_plot <- ggraph(cooc_graph, layout = "fr") +
  geom_edge_link(
    aes(edge_alpha = n, edge_width = n),
    color = afroman_palette$earth,
    show.legend = TRUE
  ) +
  geom_node_point(
    aes(size = degree),
    color = afroman_palette$gold
  ) +
  geom_node_point(
    aes(size = degree),
    shape = 21,
    fill = afroman_palette$hemp_light,
    color = afroman_palette$deep_green,
    stroke = 1.1
  ) +
  geom_node_text(
    aes(label = name),
    repel = TRUE,
    size = 4,
    color = afroman_palette$ink,
    fontface = "bold"
  ) +
  scale_edge_width(range = c(0.4, 3.2)) +
  scale_edge_alpha(range = c(0.25, 1)) +
  scale_size_continuous(range = c(4, 10)) +
  labs(
    title = "Pruned Word Co-Occurrence Network: Afroman YouTube Comments",
    subtitle = paste0(
      "Showing the strongest ~", round(network_prune_fraction * 100),
      "% of pairs after filter(n > ", cooc_threshold,
      "); node size = number of connections"
    ),
    caption = "MSBA 580 | Lab 5 | Exploratory analysis of API-accessible YouTube comments"
  ) +
  theme_graph(base_family = "sans", background = afroman_palette$smoke) +
  theme(
    plot.background = element_rect(fill = afroman_palette$smoke, color = NA),
    plot.title = element_text(face = "bold", color = afroman_palette$deep_green),
    plot.subtitle = element_text(color = afroman_palette$earth),
    plot.caption = element_text(color = afroman_palette$muted),
    legend.background = element_rect(fill = afroman_palette$smoke, color = NA),
    legend.key = element_rect(fill = afroman_palette$smoke, color = NA),
    legend.title = element_text(color = afroman_palette$earth),
    legend.text = element_text(color = afroman_palette$ink)
  )

print(cooc_plot)

ggsave(
  filename = "visuals/lab5_afroman_word_cooccurrence_network_pruned.png",
  plot = cooc_plot,
  width = 12,
  height = 8,
  dpi = 300
)

Visualization 2: State Mention Heatmap

This state heatmap visualizes state names that appear in the comment text. It is not a map of verified commenter locations. It is only a text-based location-mention proxy.

state_dictionary <- tibble(
  state = state.name,
  state_lower = str_to_lower(state.name),
  abbreviation = state.abb
) %>%
  bind_rows(
    tibble(
      state = "District of Columbia",
      state_lower = "district of columbia",
      abbreviation = "DC"
    )
  )

comment_text_lower <- str_to_lower(corpus$text)

state_counts <- state_dictionary %>%
  mutate(
    pattern = paste0("\\b", str_replace_all(state_lower, " ", "\\\\s+"), "\\b"),
    mentions = map_int(pattern, ~ sum(str_count(comment_text_lower, regex(.x, ignore_case = TRUE))))
  ) %>%
  select(state, abbreviation, mentions) %>%
  arrange(desc(mentions))

state_heatmap_summary <- state_counts %>%
  summarize(
    states_detected = sum(mentions > 0),
    total_state_mentions = sum(mentions),
    top_state = if_else(total_state_mentions > 0, state[which.max(mentions)], "No state names detected"),
    top_state_mentions = max(mentions, na.rm = TRUE)
  )

cat(
  "State heatmap summary: ",
  state_heatmap_summary$states_detected,
  " states detected; ",
  state_heatmap_summary$total_state_mentions,
  " total state-name mentions. Top state: ",
  state_heatmap_summary$top_state,
  " (", state_heatmap_summary$top_state_mentions, ").",
  sep = ""
)
## State heatmap summary: 12 states detected; 30 total state-name mentions. Top state: Ohio (14).
# Robust state tile heatmap. This avoids external map-package failures during knitting
# while still giving a map-like U.S. state view.
state_tile_layout <- tribble(
  ~abbreviation, ~x, ~y,
  "AK", 1, 1, "ME", 12, 1,
  "VT", 10, 2, "NH", 11, 2, "WA", 2, 2, "MT", 3, 2, "ND", 4, 2, "MN", 5, 2, "WI", 6, 2, "MI", 7, 2, "NY", 9, 2, "MA", 12, 2,
  "OR", 2, 3, "ID", 3, 3, "SD", 4, 3, "IA", 5, 3, "IL", 6, 3, "IN", 7, 3, "OH", 8, 3, "PA", 9, 3, "NJ", 10, 3, "CT", 11, 3, "RI", 12, 3,
  "CA", 2, 4, "NV", 3, 4, "WY", 4, 4, "NE", 5, 4, "MO", 6, 4, "KY", 7, 4, "WV", 8, 4, "VA", 9, 4, "MD", 10, 4, "DE", 11, 4,
  "AZ", 3, 5, "UT", 4, 5, "CO", 5, 5, "KS", 6, 5, "AR", 7, 5, "TN", 8, 5, "NC", 9, 5, "SC", 10, 5, "DC", 11, 5,
  "NM", 4, 6, "OK", 5, 6, "LA", 6, 6, "MS", 7, 6, "AL", 8, 6, "GA", 9, 6,
  "HI", 1, 7, "TX", 5, 7, "FL", 10, 7
)

state_tile_data <- state_tile_layout %>%
  left_join(state_counts, by = "abbreviation") %>%
  mutate(
    state = coalesce(state, abbreviation),
    mentions = coalesce(mentions, 0L),
    label = if_else(mentions > 0, paste0(abbreviation, "\n", mentions), abbreviation)
  )

state_heatmap_plot <- ggplot(state_tile_data, aes(x = x, y = -y, fill = mentions)) +
  geom_tile(color = afroman_palette$earth, linewidth = 0.75, width = 0.95, height = 0.95) +
  geom_text(aes(label = label), color = afroman_palette$ink, fontface = "bold", size = 3.2, lineheight = 0.85) +
  scale_fill_gradient(
    low = afroman_palette$paper,
    high = afroman_palette$deep_green,
    name = "Mentions",
    label = comma
  ) +
  coord_equal() +
  labs(
    title = "State Mention Heatmap: Afroman YouTube Comments",
    subtitle = "Tile-map heatmap of state names appearing in comment text; not verified commenter geography",
    x = NULL,
    y = NULL,
    caption = "MSBA 580 | Lab 5 | State-name detection with word-boundary matching | Hemp / maryjane visual theme"
  ) +
  theme_afroman(base_size = 13) +
  theme(
    axis.text = element_blank(),
    axis.ticks = element_blank(),
    panel.grid = element_blank(),
    legend.position = "right",
    plot.title = element_text(face = "bold", size = 18)
  )

print(state_heatmap_plot)

ggsave(
  filename = "visuals/lab5_afroman_state_mention_tile_heatmap.png",
  plot = state_heatmap_plot,
  width = 12,
  height = 7,
  dpi = 300
)

Visualization 3: Arc Diagram of Top Word Pairs

An arc diagram is a cleaner alternative to a dense force-directed network. It places words along a line and draws arcs between repeated co-occurring pairs.

arc_pair_count <- 35

arc_edges <- word_cooc %>%
  slice_max(n, n = arc_pair_count, with_ties = FALSE)

if (nrow(arc_edges) >= 3) {
  arc_graph <- arc_edges %>%
    graph_from_data_frame(directed = FALSE)

  arc_plot <- ggraph(arc_graph, layout = "linear") +
    geom_edge_arc(
      aes(edge_width = n, edge_alpha = n),
      color = afroman_palette$earth,
      strength = 0.7,
      show.legend = TRUE
    ) +
    geom_node_point(size = 4, color = afroman_palette$gold) +
    geom_node_text(
      aes(label = name),
      angle = 45,
      hjust = 0,
      size = 3.5,
      color = afroman_palette$ink
    ) +
    scale_edge_width(range = c(0.3, 2.8)) +
    scale_edge_alpha(range = c(0.2, 1)) +
    labs(
      title = "Arc Diagram: Top Afroman Comment Word Pairings",
      subtitle = paste0("Top ", arc_pair_count, " co-occurring word pairs by count"),
      caption = "MSBA 580 | Lab 5 | Arc height and thickness emphasize repeated pairings"
    ) +
    theme_graph(base_family = "sans", background = afroman_palette$smoke) +
    theme(
      plot.background = element_rect(fill = afroman_palette$smoke, color = NA),
      plot.title = element_text(face = "bold", color = afroman_palette$deep_green),
      plot.subtitle = element_text(color = afroman_palette$earth),
      plot.caption = element_text(color = afroman_palette$muted),
      legend.background = element_rect(fill = afroman_palette$smoke, color = NA),
      legend.key = element_rect(fill = afroman_palette$smoke, color = NA),
      legend.title = element_text(color = afroman_palette$earth),
      legend.text = element_text(color = afroman_palette$ink),
      plot.margin = margin(10, 30, 40, 10)
    )

  print(arc_plot)

  ggsave(
    filename = "visuals/lab5_afroman_arc_diagram.png",
    plot = arc_plot,
    width = 13,
    height = 7,
    dpi = 300
  )
} else {
  cat("Not enough co-occurring word pairs to render the arc diagram.")
}

Visualization 4: Lightweight Theme Flow

This lightweight flow chart connects manually defined comment themes to the words that triggered them and then to a simple signal-strength tier. It replaces the heavier alluvial rendering with a faster ggplot2-only version so the report can knit reliably while keeping the same business interpretation.

theme_dictionary <- tribble(
  ~theme, ~word,
  "Police / legal", "police",
  "Police / legal", "cops",
  "Police / legal", "cop",
  "Police / legal", "raid",
  "Police / legal", "warrant",
  "Police / legal", "rights",
  "Police / legal", "law",
  "Police / legal", "court",
  "Police / legal", "judge",
  "Police / legal", "sheriff",
  "Humor / reaction", "lol",
  "Humor / reaction", "funny",
  "Humor / reaction", "hilarious",
  "Humor / reaction", "laugh",
  "Humor / reaction", "joke",
  "Humor / reaction", "classic",
  "Music / nostalgia", "music",
  "Music / nostalgia", "afroman",
  "Music / nostalgia", "rap",
  "Music / nostalgia", "lyrics",
  "Music / nostalgia", "beat",
  "Music / nostalgia", "listen",
  "Music / nostalgia", "memories",
  "Cannabis / culture", "high",
  "Cannabis / culture", "weed",
  "Cannabis / culture", "smoke",
  "Cannabis / culture", "smoking",
  "Cannabis / culture", "marijuana",
  "Cannabis / culture", "cannabis",
  "Cannabis / culture", "hemp",
  "Cannabis / culture", "joint",
  "Cannabis / culture", "blunt",
  "Cannabis / culture", "pot",
  "Politics / public issue", "government",
  "Politics / public issue", "political",
  "Politics / public issue", "politics",
  "Politics / public issue", "freedom",
  "Politics / public issue", "tax",
  "Politics / public issue", "corruption",
  "Support / praise", "love",
  "Support / praise", "respect",
  "Support / praise", "support",
  "Support / praise", "great",
  "Support / praise", "legend",
  "Support / praise", "amazing"
)

theme_flow_min_keyword_count <- 2
# Prune the keyword column for readability. Three keywords per theme keeps the
# flow chart interpretive instead of turning the middle column into a wall of text.
theme_flow_top_keywords_per_theme <- 3
theme_flow_max_keywords_total <- 18

theme_word_counts_all <- words_clean %>%
  inner_join(theme_dictionary, by = "word") %>%
  count(theme, word, sort = TRUE)

theme_word_counts_pruned <- theme_word_counts_all %>%
  filter(n >= theme_flow_min_keyword_count)

# Fail-soft fallback: if the minimum-count filter is too strict for this sample,
# keep the strongest available dictionary hits rather than producing a blank plot.
if (nrow(theme_word_counts_pruned) == 0 && nrow(theme_word_counts_all) > 0) {
  theme_word_counts_pruned <- theme_word_counts_all
}

theme_word_counts <- theme_word_counts_pruned %>%
  group_by(theme) %>%
  slice_max(n, n = theme_flow_top_keywords_per_theme, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(desc(n), theme, word) %>%
  slice_head(n = theme_flow_max_keywords_total) %>%
  mutate(
    signal_strength = case_when(
      n >= quantile(n, 0.75, na.rm = TRUE) ~ "High repeated signal",
      n >= quantile(n, 0.40, na.rm = TRUE) ~ "Medium repeated signal",
      TRUE ~ "Lower repeated signal"
    ),
    theme_label = str_wrap(theme, width = 16),
    word_label = str_wrap(word, width = 10),
    signal_label = str_wrap(signal_strength, width = 16)
  )

flow_pruning_summary <- tibble(
  setting = c(
    "Minimum keyword count kept",
    "Maximum keywords per theme",
    "Maximum total keyword rows",
    "Keyword rows displayed",
    "Keyword rows before pruning"
  ),
  value = c(
    theme_flow_min_keyword_count,
    theme_flow_top_keywords_per_theme,
    theme_flow_max_keywords_total,
    nrow(theme_word_counts),
    nrow(theme_word_counts_all)
  )
)

flow_pruning_summary %>%
  safe_table(caption = "Theme Flow Pruning Settings")
Theme Flow Pruning Settings
setting value
Minimum keyword count kept 2
Maximum keywords per theme 3
Maximum total keyword rows 18
Keyword rows displayed 18
Keyword rows before pruning 36
theme_word_counts %>%
  select(theme, word, n, signal_strength) %>%
  safe_table(caption = "Pruned Theme Dictionary Hits Used in the Flow Chart", max_rows = 30)
Pruned Theme Dictionary Hits Used in the Flow Chart
theme word n signal_strength
Music / nostalgia afroman 225 High repeated signal
Support / praise love 87 High repeated signal
Police / legal cops 68 High repeated signal
Police / legal court 44 High repeated signal
Police / legal police 38 High repeated signal
Music / nostalgia music 31 Medium repeated signal
Humor / reaction lol 27 Medium repeated signal
Support / praise legend 22 Medium repeated signal
Politics / public issue freedom 17 Medium repeated signal
Humor / reaction hilarious 16 Medium repeated signal
Humor / reaction funny 12 Medium repeated signal
Support / praise support 11 Lower repeated signal
Music / nostalgia rap 9 Lower repeated signal
Cannabis / culture weed 5 Lower repeated signal
Politics / public issue government 5 Lower repeated signal
Politics / public issue corruption 4 Lower repeated signal
Cannabis / culture blunt 2 Lower repeated signal
Cannabis / culture smoke 2 Lower repeated signal
# Knit-safe replacement for the theme-flow visual.
# The alluvial and curved-flow versions were visually interesting, but both
# stalled during knitting in this environment. This version uses only a small
# heatmap matrix, which is much faster and more reliable for RPubs.

if (nrow(theme_word_counts) > 0) {
  theme_matrix_data <- theme_word_counts %>%
    group_by(theme) %>%
    slice_max(n, n = 2, with_ties = FALSE) %>%
    ungroup() %>%
    arrange(theme, desc(n), word) %>%
    slice_head(n = 12) %>%
    mutate(
      theme = str_wrap(theme, width = 18),
      word = str_wrap(word, width = 12),
      signal_strength = factor(
        signal_strength,
        levels = c("Lower repeated signal", "Medium repeated signal", "High repeated signal")
      ),
      label = paste0(word, "\n", "n=", n)
    )

  theme_signal_heatmap <- ggplot(
    theme_matrix_data,
    aes(x = signal_strength, y = reorder(word, n), fill = n)
  ) +
    geom_tile(color = afroman_palette$paper, linewidth = 1.1) +
    geom_text(
      aes(label = label),
      color = afroman_palette$ink,
      fontface = "bold",
      size = 3.5,
      lineheight = 0.9
    ) +
    facet_wrap(~ theme, scales = "free_y", ncol = 2) +
    scale_fill_gradient(
      low = afroman_palette$hemp_light,
      high = afroman_palette$deep_green
    ) +
    labs(
      title = "Theme Signal Heatmap: Afroman YouTube Comments",
      subtitle = "Knit-safe replacement for the crowded theme flow: top 2 keywords per theme, capped at 12 rows",
      x = "Repeated-signal category",
      y = NULL,
      fill = "Keyword\nmentions",
      caption = "MSBA 580 | Lab 5 | Dictionary-based exploratory theme map | Interprets comment language, not verified audience identity"
    ) +
    theme_afroman(base_size = 13) +
    theme(
      strip.text = element_text(face = "bold", color = afroman_palette$paper, size = 12),
      strip.background = element_rect(fill = afroman_palette$deep_green, color = afroman_palette$earth),
      axis.text.x = element_text(angle = 20, hjust = 1, face = "bold", color = afroman_palette$deep_green),
      axis.text.y = element_blank(),
      axis.ticks.y = element_blank(),
      panel.grid = element_blank(),
      legend.position = "right",
      plot.title = element_text(size = 20, face = "bold", color = afroman_palette$deep_green),
      plot.subtitle = element_text(size = 12, color = afroman_palette$earth)
    )

  print(theme_signal_heatmap)
} else {
  theme_signal_fallback <- ggplot() +
    annotate(
      "label",
      x = 0,
      y = 0,
      label = "No theme dictionary hits were detected for the theme signal heatmap.",
      fill = afroman_palette$paper,
      color = afroman_palette$deep_green,
      fontface = "bold",
      size = 5,
      label.size = 0.8
    ) +
    xlim(-1, 1) +
    ylim(-1, 1) +
    labs(
      title = "Theme Signal Heatmap Fallback",
      subtitle = "Core lab outputs above still rendered",
      caption = "MSBA 580 | Lab 5 | Fallback visual"
    ) +
    theme_afroman(base_size = 13) +
    theme(
      axis.text = element_blank(),
      axis.title = element_blank(),
      axis.ticks = element_blank(),
      panel.grid = element_blank()
    )

  print(theme_signal_fallback)
}

Lab Reflection Questions

Question 1

Which additional words did you add to the stopword list manually? Why does it make sense to remove those words as stopwords? Also, does the final network graph change after removing these additional stopwords? If so, briefly explain how and why.

I manually added video and song to the stopword list. These words make sense to remove because the dataset comes from YouTube comments about a music video, so those words are likely to appear frequently because of the setting rather than because they reveal a specific theme. In other words, they are context words, but they are not the most useful words for interpreting deeper audience reactions, humor, controversy, nostalgia, or sentiment. The final network graph changed because removing those terms prevented broad media words from acting as generic hub nodes. This made the graph cleaner and helped the remaining word clusters focus more on meaningful relationships inside the comment language.

Question 2

Change n in filter(n > 1) to a different value. What is your final value of n? Will it give you a better result?

I changed the filter from filter(n > 1) to filter(n > 2), so my final comparison value is 2. This means a word pair must appear together in at least three comments before it becomes eligible for the network graph. I then pruned the displayed network to the strongest approximate 50% of remaining pairs so the visual is easier to read. I think this gives a better result because it removes more one-off pairings and reduces graph clutter while still preserving the strongest recurring relationships. A much higher threshold might be even cleaner, but it could remove smaller themes that are still useful in a comment dataset.

Question 3

On the BlueSky post, there were two co-occurrence network graphs. Which one do you like better? Why?

I prefer the cleaner network graph with fewer nodes and stronger visible relationships. For business communication, the best graph is not always the one with the most information; it is usually the one that helps the audience understand the main pattern quickly. A dense graph can be useful during exploration, but it can become difficult to interpret when too many weak connections are shown at the same time. For a final report or presentation, I would choose the graph with a higher co-occurrence threshold because it better separates signal from noise. This is similar to what I did in this lab by raising the threshold from filter(n > 1) to filter(n > 2).

Question 4

Which word pairs surprised you? Did buy cluster more with optimism words like dip or caution words like wait and valuation?

In my own Afroman comment dataset, the most interesting word pairs are the ones that reveal repeated audience reactions rather than simple topic labels. I would look for pairs that connect emotion, humor, identity, politics, nostalgia, or repeated phrases from the video. For the stock-related example in the discussion question, I would interpret buy differently depending on whether it appears closer to optimism words like dip or caution words like wait and valuation. If buy clusters with dip, that suggests a more opportunistic or optimistic investing frame. If it clusters with wait or valuation, that suggests a more cautious decision-making frame.

Question 5

Should we treat stock tickers, $SPCX and SpaceX, as the same token instead of separate ones?

In most business text analysis settings, I would treat $SPCX and SpaceX as the same token if the goal is to measure conversation about the same company, brand, or investment theme. Keeping them separate may split the signal across two labels even though users may be talking about the same underlying topic. However, I would document the decision clearly because ticker symbols and company names can sometimes carry different meanings depending on context. For example, a ticker may appear more often in investor discussions, while the company name may appear more often in general news or public discussion. A good approach would be to create a normalization step where related tokens are recoded into one standard label.

Question 6

What is the risk of drawing conclusions from a co-occurrence network built on only a small sample of comments? What is a good sample size? How would you communicate that limitation in a business report?

The main risk is that a small sample can make random or temporary word pairings look more meaningful than they really are. Co-occurrence networks are useful for exploration, but they can be sensitive to sample size, repeated phrases, unusually active commenters, bots, jokes, or event-specific language. A good sample size depends on the platform and business question, but larger samples are generally better because they make repeated patterns more stable. For a business report, I would avoid claiming that the graph represents the entire audience unless the data collection method supports that claim. I would describe the network as an exploratory map of the available comment sample and recommend validating the pattern with a larger sample, multiple time windows, or another data source.

Question 7

Is there a way to analyze comments or reviews on topics that interest you in relation to a significant event? If so, how would you identify the event? Why is that event significant, and how might it influence the comments or reviews you analyze?

Yes. A strong approach would be to compare comments before and after a significant event, such as a product launch, controversy, policy announcement, patch update, legal decision, earnings report, or viral media moment. I would identify the event by using a reliable timestamp from a primary source, official announcement, news article, company post, or platform release note. The event is significant if it plausibly changes what people are paying attention to or how they feel about the topic. For example, in a policy shock analytics project, an executive order could be used as the event date, and comments or search interest before and after the order could be compared. The event might influence the language by increasing uncertainty, emotional reactions, political framing, brand mentions, or repeated questions from the public.

Interpretation

The co-occurrence network shows which words tend to appear together in the same Afroman YouTube comments. Thicker and darker edges represent word pairs that co-occur more often. Larger nodes represent words with more connections to other words in the filtered graph.

The matrix, network, arc diagram, map, and alluvial flow each answer a slightly different question. The matrix gives exact word-pair counts. The network shows clusters and central connector words. The state heatmap shows place references in the text. The arc diagram shows repeated word-pair links in a cleaner linear layout. The alluvial flow connects observed words to manually documented themes.

The most important limitation is that these are exploratory patterns from the available comment sample. They should be treated as directional signals rather than definitive conclusions about the full YouTube audience.

Reproducibility Notes

  • The analysis uses a cleaned CSV file rather than live API collection.
  • No API key is included in this report.
  • The network layout uses set.seed(42) so the graph is reproducible when knitted again.
  • The manual stopwords used in this version are video and song.
  • The final network graph uses filter(n > 2) to reduce visual clutter.
  • The state heatmap is based on text mentions of state names, not verified commenter locations.
  • The alluvial flow uses a transparent manual theme dictionary.
  • Output graphics are saved in the visuals/ folder.

References

Chiericato, E. (2015). Co Citations and Co Occurrences in SEO: The Complete Guide. Web Marketing Academy.
https://webmarketing.academy/en/co-citazioni-co-occorrenze-guida-definitiva/

Silge, J., and Robinson, D. (2017). Text Mining with R: A Tidy Approach. O’Reilly Media.
https://www.tidytextmining.com/

Robinson, D. (2021). widyr: Widen, Process, then Re-Tidy Data. R package.
https://CRAN.R-project.org/package=widyr

Csardi, G., and Nepusz, T. (2006). The igraph software package for complex network research. InterJournal, Complex Systems, 1695.
https://igraph.org

Pedersen, T. L. (2024). ggraph: An Implementation of Grammar of Graphics for Graphs and Networks. R package.
https://CRAN.R-project.org/package=ggraph