For this assignment, I reused the YouTube comments collected in Lab 4 from the documentary “The Town That Drinks the Most Sugary Drinks in the World” by Unreported World. The documentary explores the high consumption of Coca-Cola in Chiapas, Mexico, and the resulting health concerns related to diabetes and sugary drinks.
Instead of analyzing individual word frequencies, this lab examines word co-occurrence, which identifies words that frequently appear together within the same comments. Co-occurrence analysis helps reveal relationships between concepts and provides a better understanding of the main discussion themes.
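As a small illustration of what co-occurrence counting does, the sketch below applies widyr's pairwise_count() to three made-up comments. The toy_words table is hypothetical and not part of the lab data. A pair's count is the number of comments that contain both words.

```r
library(dplyr)
library(widyr)

# Hypothetical tokens from three made-up comments (not the lab data)
toy_words <- tibble(
  id   = c(1, 1, 1, 2, 2, 3, 3),
  word = c("coca", "cola", "sugar", "coca", "cola", "sugar", "diabetes")
)

# Count how many comments (id) contain each pair of words;
# upper = FALSE keeps each unordered pair only once
toy_pairs <- toy_words %>%
  pairwise_count(word, id, sort = TRUE, upper = FALSE)

toy_pairs
```

Here coca and cola share two comments, so that pair gets n = 2, while every other pair that occurs at all appears in a single comment.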
library(tidyverse)
library(tidytext)
library(widyr)
library(igraph)
library(ggraph)
library(knitr)
comments_clean <- read_csv("youtube_comments.csv")
head(comments_clean)
## # A tibble: 6 × 4
##   authorDisplayName       textOriginal             publishedAt         likeCount
##   <chr>                   <chr>                    <dttm>                  <dbl>
## 1 @joetteconner6861       How can you blame Coca-… 2026-05-29 20:36:47         0
## 2 @chrisangflo6102        7 yrs soda free, I drin… 2026-05-30 05:17:00         0
## 3 @roznerniyesuhmives8156 since 1990s i dring pep… 2026-05-30 10:45:36         1
## 4 @td4079                 Coca Cola is gross. Now… 2026-05-30 18:10:36         0
## 5 @frankrumble6754        Coke is good.... I am 6… 2026-05-30 21:56:18         0
## 6 @saffy4352              Absolutely dumbfounded … 2026-05-30 22:56:55         0
data(stop_words)
custom_stop_words <- tibble(
  word = c("people", "drink"),
  lexicon = "custom"
)
all_stop_words <- bind_rows(stop_words, custom_stop_words)
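youtube_words <- comments_clean %>%
  mutate(id = row_number()) %>%                 # one id per comment
  select(id, textOriginal) %>%
  unnest_tokens(word, textOriginal) %>%         # one lowercase word per row
  anti_join(all_stop_words, by = "word") %>%    # drop standard and custom stop words
  filter(str_detect(word, "^[a-z]+$")) %>%      # keep purely alphabetic tokens
  filter(str_length(word) > 2)                  # drop words shorter than three letters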
head(youtube_words)
## # A tibble: 6 × 2
## id word
## <int> <chr>
## 1 1 blame
## 2 1 coca
## 3 1 cola
## 4 1 dying
## 5 1 diabetes
## 6 1 eating
youtube_pairs <- youtube_words %>%
  pairwise_count(word, id, sort = TRUE, upper = FALSE)
head(youtube_pairs, 20)
## # A tibble: 20 × 3
## item1 item2 n
## <chr> <chr> <dbl>
## 1 coca cola 20
## 2 cola coke 5
## 3 coca coke 4
## 4 drinking water 4
## 5 coca diabetes 3
## 6 cola diabetes 3
## 7 cola pepsi 3
## 8 diabetes sugar 3
## 9 sugar drinking 3
## 10 coke drinking 3
## 11 coca water 3
## 12 cola water 3
## 13 sugar water 3
## 14 coke water 3
## 15 coca shit 3
## 16 cola shit 3
## 17 cola love 3
## 18 coca pepsi 2
## 19 cola sugar 2
## 20 soda sugar 2
# Mirror the pairs so the matrix is symmetric, then convert it to a dense matrix
youtube_matrix <- youtube_pairs %>%
  bind_rows(youtube_pairs %>% rename(item1 = item2, item2 = item1)) %>%
  cast_sparse(item1, item2, n) %>%
  as.matrix()
youtube_matrix[1:10, 1:10]
##          cola coke water diabetes pepsi sugar drinking shit love mexican
## coca       20    4     3        3     2     1        1    3    2       1
## cola        0    5     3        3     3     2        1    3    3       2
## drinking    1    3     4        1     0     3        0    0    0       0
## diabetes    3    1     1        0     1     3        1    0    0       0
## sugar       2    1     3        3     1     0        3    0    0       0
## coke        5    0     3        1     1     1        3    0    0       2
## soda        0    0     2        1     0     2        0    0    0       2
## water       3    3     0        1     0     3        4    0    0       2
## mexican     2    2     2        0     0     0        0    1    1       0
## day         1    1     2        0     0     2        1    0    0       0
youtube_pairs %>%
  filter(n > 1) %>%                              # keep pairs found in at least two comments
  graph_from_data_frame(directed = FALSE) %>%    # co-occurrence has no direction
  ggraph(layout = "fr") +
  geom_edge_link(alpha = 0.3) +
  geom_node_point(color = "steelblue", size = 4) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()
The co-occurrence network reveals several discussion themes. The strongest cluster centers on Coca-Cola: coca and cola appear together in 20 comments, and coke and pepsi link to both, indicating that viewers compared soft drink brands. Another major cluster connects sugar, diabetes, water, health, and diet, suggesting that many commenters focused on the health consequences of consuming sugary drinks. The appearance of Mexico and Mexican reflects the documentary’s emphasis on Chiapas and the cultural context of beverage consumption. Overall, the network shows that viewers discussed both the public health issues and the broader social context presented in the documentary.
I manually added people and drink to the stopword list because they appeared frequently but said little about the video’s main topic. Removing these words reduced unnecessary connections and allowed more meaningful relationships, such as coca–cola, sugar–diabetes, and water–drinking, to stand out. The final network graph became easier to interpret because it focused on the most relevant themes rather than common words.
I changed the threshold from filter(n > 3) to filter(n > 1). Using a lower threshold included more word pairs in the network and produced a richer visualization. Although the graph became slightly more crowded, it revealed additional relationships among words such as sugar, diabetes, water, Mexico, and Pepsi, providing a more complete picture of the discussion.
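To make the effect of the threshold concrete, the sketch below counts how many of the 20 pairs printed by head(youtube_pairs, 20) pass each cutoff. It uses only the n values shown in that output, so it understates the full network, since further pairs with n = 2 may exist beyond the first 20 rows. The names top20_n and kept are introduced here for illustration.

```r
# n values copied from the head(youtube_pairs, 20) output earlier in this report
top20_n <- c(20, 5, 4, 4, rep(3, 13), rep(2, 3))

# Number of these pairs kept by filter(n > threshold)
thresholds <- c(3, 2, 1)
kept <- sapply(thresholds, function(t) sum(top20_n > t))
names(kept) <- paste0("n > ", thresholds)
kept
```

Among these 20 pairs, n > 3 keeps only 4 edges, n > 2 keeps 17, and n > 1 keeps all 20, which is why the lower threshold produces a denser graph.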
I preferred the second co-occurrence network because it was less cluttered and easier to read. The stronger word connections stood out more clearly, making it easier to identify the primary discussion themes without being distracted by many weak relationships.
The most interesting word pairs were coca–cola, sugar–diabetes, and drinking–water. (The pair cannot be drink–water, because drink was removed as a custom stop word.) Unlike investment-related discussions where words such as buy may cluster with dip or valuation, this dataset focused primarily on public health, nutrition, and sugary drink consumption.
Yes. If $SPCX and SpaceX refer to the same company, they should be treated as the same token. Combining them prevents the discussion from being split into separate groups and produces a more accurate co-occurrence network.
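One way to merge such variants is to rewrite them in the raw text before tokenizing. The sketch below is only an illustration under assumptions: the finance_comments table and its three comments are made up, and "SpaceX" is chosen arbitrarily as the canonical form.

```r
library(dplyr)
library(stringr)

# Hypothetical comments, made up for illustration (not from the lab data)
finance_comments <- tibble(
  id = 1:3,
  textOriginal = c(
    "Buying $SPCX on the dip",
    "SpaceX valuation looks high",
    "Is $spcx the same as SpaceX?"
  )
)

# Replace every spelling of the ticker with one canonical name, ignoring case
normalized <- finance_comments %>%
  mutate(textOriginal = str_replace_all(
    textOriginal,
    regex("\\$spcx", ignore_case = TRUE),
    "SpaceX"
  ))

normalized$textOriginal
```

After this step, tokenizing the text would yield a single token for the company, so its co-occurrence counts are no longer split across two nodes.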
A co-occurrence network based on a small number of comments may not accurately represent overall public opinion because a few comments can have a large influence on the results. Larger datasets generally produce more reliable patterns. In a business report, I would explain that the findings represent only the analyzed sample and may not reflect the opinions of all viewers.
Yes. Comments can be analyzed around significant events such as the release of a documentary, product launch, public health announcement, or major news story. In this project, the release of the documentary on sugary drink consumption in Chiapas is significant because it increased public awareness of diabetes and nutrition. Analyzing comments posted after the documentary helps identify how viewers reacted to its message and what themes emerged during the discussion.
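An event-based analysis could start by restricting the comments to a time window using the publishedAt column. The sketch below reuses the six timestamps shown in head(comments_clean). The window boundaries are hypothetical and chosen only for illustration, and the time zone is assumed to be UTC.

```r
library(dplyr)

# Timestamps copied from the head(comments_clean) output; time zone assumed to be UTC
sample_comments <- tibble(
  id = 1:6,
  publishedAt = as.POSIXct(
    c("2026-05-29 20:36:47", "2026-05-30 05:17:00", "2026-05-30 10:45:36",
      "2026-05-30 18:10:36", "2026-05-30 21:56:18", "2026-05-30 22:56:55"),
    tz = "UTC"
  )
)

# Hypothetical one-day window around an event of interest
window_start <- as.POSIXct("2026-05-30 00:00:00", tz = "UTC")
window_end   <- as.POSIXct("2026-05-31 00:00:00", tz = "UTC")

# Keep only comments posted inside the window
in_window <- sample_comments %>%
  filter(publishedAt >= window_start, publishedAt < window_end)

nrow(in_window)
```

Five of the six sample comments fall inside this window. The same filter could be applied to the full dataset before tokenizing, so that separate networks can be compared across periods.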
This project demonstrated how word co-occurrence analysis can reveal relationships between ideas that are not visible through simple word frequency analysis. The network graph highlighted the connections between Coca-Cola, sugary drinks, diabetes, and health, providing a deeper understanding of audience discussions. Overall, co-occurrence analysis is a useful text mining technique for identifying themes and understanding how people connect ideas in online conversations.