Introduction

For this assignment, I reused the YouTube comments collected in Lab 4 from the documentary “The Town That Drinks the Most Sugary Drinks in the World” by Unreported World. The documentary explores the high consumption of Coca-Cola in Chiapas, Mexico, and the resulting health concerns related to diabetes and sugary drinks.

Instead of analyzing individual word frequencies, this lab examines word co-occurrence: how often two words appear together within the same comment. Co-occurrence analysis reveals relationships between concepts that single-word counts miss, which helps identify the main discussion themes.
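To make the idea concrete, here is a minimal base R sketch of co-occurrence counting. The three toy comments are made up for illustration and are not from the dataset, and the matrix approach shown is only one way to get the counts (the lab itself uses pairwise_count() from widyr).

```r
# Toy comments (hypothetical, not taken from the YouTube dataset)
toy_comments <- c(
  "sugar causes diabetes",
  "coca cola has sugar",
  "coca cola and diabetes"
)

# Split each comment into its unique lowercase words
toy_tokens <- lapply(strsplit(tolower(toy_comments), "\\s+"), unique)
vocab <- sort(unique(unlist(toy_tokens)))

# Incidence matrix: one row per comment, one column per word (1 = word present)
incidence <- t(sapply(toy_tokens, function(words) as.integer(vocab %in% words)))
colnames(incidence) <- vocab

# crossprod() computes t(X) %*% X, so entry [i, j] is the number of
# comments that contain both word i and word j
cooc <- crossprod(incidence)
diag(cooc) <- 0  # a word paired with itself is not a co-occurrence

cooc["coca", "cola"]       # 2: both words appear in comments 2 and 3
cooc["sugar", "diabetes"]  # 1: both words appear only in comment 1
```

Because each comment contributes at most 1 per word pair, a long comment that repeats a word does not inflate the count.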

Load Packages and Data

library(tidyverse)
library(tidytext)
library(widyr)
library(igraph)
library(ggraph)
library(knitr)

comments_clean <- read_csv("youtube_comments.csv")
head(comments_clean)

## # A tibble: 6 × 4
##   authorDisplayName       textOriginal             publishedAt         likeCount
##   <chr>                   <chr>                    <dttm>                  <dbl>
## 1 @joetteconner6861       How can you blame Coca-… 2026-05-29 20:36:47         0
## 2 @chrisangflo6102        7 yrs soda free, I drin… 2026-05-30 05:17:00         0
## 3 @roznerniyesuhmives8156 since 1990s i dring pep… 2026-05-30 10:45:36         1
## 4 @td4079                 Coca Cola is gross. Now… 2026-05-30 18:10:36         0
## 5 @frankrumble6754        Coke is good.... I am 6… 2026-05-30 21:56:18         0
## 6 @saffy4352              Absolutely dumbfounded … 2026-05-30 22:56:55         0

Step 1: Tokenize and Remove Stop Words

data(stop_words)

custom_stop_words <- tibble(
  word = c("people", "drink"),
  lexicon = "custom"
)

all_stop_words <- bind_rows(stop_words, custom_stop_words)

youtube_words <- comments_clean %>%
  mutate(id = row_number()) %>%
  select(id, textOriginal) %>%
  unnest_tokens(word, textOriginal) %>%
  anti_join(all_stop_words, by = "word") %>%
  filter(str_detect(word, "^[a-z]+$")) %>%
  filter(str_length(word) > 2)

head(youtube_words)

## # A tibble: 6 × 2
##      id word    
##   <int> <chr>   
## 1     1 blame   
## 2     1 coca    
## 3     1 cola    
## 4     1 dying   
## 5     1 diabetes
## 6     1 eating

Step 2: Count Word Pairs Within Each Comment

youtube_pairs <- youtube_words %>%
  pairwise_count(word, id, sort = TRUE, upper = FALSE)

head(youtube_pairs, 20)

## # A tibble: 20 × 3
##    item1    item2        n
##    <chr>    <chr>    <dbl>
##  1 coca     cola        20
##  2 cola     coke         5
##  3 coca     coke         4
##  4 drinking water        4
##  5 coca     diabetes     3
##  6 cola     diabetes     3
##  7 cola     pepsi        3
##  8 diabetes sugar        3
##  9 sugar    drinking     3
## 10 coke     drinking     3
## 11 coca     water        3
## 12 cola     water        3
## 13 sugar    water        3
## 14 coke     water        3
## 15 coca     shit         3
## 16 cola     shit         3
## 17 cola     love         3
## 18 coca     pepsi        2
## 19 cola     sugar        2
## 20 soda     sugar        2

Step 3: Build the Co-occurrence Matrix

youtube_matrix <- youtube_pairs %>%
  bind_rows(youtube_pairs %>% rename(item1 = item2, item2 = item1)) %>%
  cast_sparse(item1, item2, n) %>%
  as.matrix()

youtube_matrix[1:10, 1:10]

##          cola coke water diabetes pepsi sugar drinking shit love mexican
## coca       20    4     3        3     2     1        1    3    2       1
## cola        0    5     3        3     3     2        1    3    3       2
## drinking    1    3     4        1     0     3        0    0    0       0
## diabetes    3    1     1        0     1     3        1    0    0       0
## sugar       2    1     3        3     1     0        3    0    0       0
## coke        5    0     3        1     1     1        3    0    0       2
## soda        0    0     2        1     0     2        0    0    0       2
## water       3    3     0        1     0     3        4    0    0       2
## mexican     2    2     2        0     0     0        0    1    1       0
## day         1    1     2        0     0     2        1    0    0       0

Step 4: Network Graph

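# The "fr" layout starts from random positions; any fixed seed makes the plot reproducible
set.seed(123)

youtube_pairs %>%
  filter(n > 1) %>%                       # keep pairs that co-occur in at least 2 comments
  graph_from_data_frame() %>%             # item1/item2 become nodes, n becomes an edge attribute
  ggraph(layout = "fr") +                 # Fruchterman-Reingold force-directed layout
  geom_edge_link(alpha = 0.3) +
  geom_node_point(color = "steelblue", size = 4) +
  geom_node_text(aes(label = name), repel = TRUE) +  # repel = TRUE reduces label overlap
  theme_void()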

Results

The co-occurrence network reveals several discussion themes. The strongest cluster centers on Coca-Cola: coca and cola co-occur in 20 comments, largely because the brand name is split into two tokens, and both are linked to coke and pepsi, suggesting that viewers compared soft drink brands. Another major cluster connects sugar, diabetes, water, health, and diet, suggesting that many commenters focused on the health consequences of consuming sugary drinks. The appearance of mexico and mexican reflects the documentary’s emphasis on Chiapas and the cultural context of beverage consumption. Overall, the network shows that viewers discussed both the public health issues and the broader social context presented in the documentary.

Question 1

I manually added people and drink to the stop word list because they appeared frequently but said little about the video’s main topic. Removing them cut uninformative connections and let more meaningful pairs, such as coca–cola, sugar–diabetes, and water–drinking, stand out. The final network graph was easier to interpret because it focused on the most relevant themes rather than on common words.

Question 2

I changed the threshold from filter(n > 3) to filter(n > 1). Using a lower threshold included more word pairs in the network and produced a richer visualization. Although the graph became slightly more crowded, it revealed additional relationships among words such as sugar, diabetes, water, Mexico, and Pepsi, providing a more complete picture of the discussion.
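The effect of the threshold can be checked with a short base R sketch. The counts below are copied from the 20 rows printed in Step 2, so the totals describe only those top 20 pairs; the full youtube_pairs table likely contains more pairs at n = 2 and n = 1. The names top_pair_counts and edges_kept are introduced here for illustration.

```r
# Counts of the 20 most frequent pairs, copied from the Step 2 output
top_pair_counts <- c(20, 5, 4, 4, rep(3, 13), rep(2, 3))

# Number of these pairs that survive each filter(n > threshold)
thresholds <- c(1, 2, 3)
edges_kept <- sapply(thresholds, function(k) sum(top_pair_counts > k))
names(edges_kept) <- paste0("n > ", thresholds)

edges_kept
```

Among these top pairs, filter(n > 3) keeps only 4 edges, while filter(n > 1) keeps all 20, which is why the lower threshold gives a fuller but more crowded graph.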

Question 3

I preferred the second co-occurrence network because it was less cluttered and easier to read. The stronger word connections stood out more clearly, making it easier to identify the primary discussion themes without being distracted by many weak relationships.

Question 4

The most interesting word pairs were coca–cola, sugar–diabetes, and drinking–water. Unlike investment-related discussions, where words such as buy may cluster with dip or valuation, this dataset focused primarily on public health, nutrition, and sugary drink consumption.

Question 5

Yes. If $SPCX and SpaceX refer to the same company, they should be treated as the same token. Combining them prevents the discussion from being split into separate groups and produces a more accurate co-occurrence network.
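One way to merge the two spellings is to normalize the text before tokenizing. The sketch below uses base R only; the example comments are hypothetical, and it assumes that $SPCX and SpaceX really do name the same company.

```r
# Hypothetical comments; "$SPCX" and "SpaceX" are assumed to be the same company
example_comments <- c(
  "Buying more $SPCX on the dip",
  "SpaceX valuation looks high",
  "spcx to the moon"
)

# Lowercase first, then map the ticker (with or without "$") to a single token.
# \\b keeps the pattern from matching inside longer words.
normalized <- gsub("\\$?\\bspcx\\b", "spacex", tolower(example_comments), perl = TRUE)

normalized
```

After this step all three comments contain the token spacex, so a later call such as unnest_tokens() followed by pairwise_count() would treat them as one node instead of two.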

Question 6

A co-occurrence network based on a small number of comments may not accurately represent overall public opinion because a few comments can have a large influence on the results. Larger datasets generally produce more reliable patterns. In a business report, I would explain that the findings represent only the analyzed sample and may not reflect the opinions of all viewers.

Question 7

Yes. Comments can be analyzed around significant events such as the release of a documentary, a product launch, a public health announcement, or a major news story. In this project, the release of the documentary on sugary drink consumption in Chiapas is the significant event because it likely raised viewers' awareness of diabetes and nutrition. Analyzing comments posted after the release helps identify how viewers reacted to its message and what themes emerged during the discussion.

Conclusion

This project demonstrated how word co-occurrence analysis can reveal relationships between ideas that are not visible through simple word frequency analysis. The network graph highlighted the connections between Coca-Cola, sugary drinks, diabetes, and health, providing a deeper understanding of audience discussions. Overall, co-occurrence analysis is a useful text mining technique for identifying themes and understanding how people connect ideas in online conversations.

Lab Exercise #5: Text Analysis & NLP - Word Co-occurrence Analysis

Leilani De La Cruz

2026-07-04