Introduction

YouTube provides a useful way to understand how audiences respond to online content, since comments often include opinions, reactions, questions, and repeated themes that can be analyzed using text mining.

For this project, I’ve chosen a video about the sugar-intake crisis in Mexico, specifically in the state of Chiapas, where people drink up to two liters of sugary drinks a day.

Video used: https://youtu.be/hqnUohxXV0I

Using the YouTube Data API and R, I collected comments from this video, cleaned the text, and performed a word frequency analysis. The results are shown using a frequency table, bar chart, and word cloud.

Loading Packages

library(tuber)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(readr)
library(tidytext)
library(stringr)
library(ggplot2)
library(wordcloud)
## Loading required package: RColorBrewer
library(RColorBrewer)
library(knitr)

YouTube OAuth
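
The authentication step is not shown in this report. As a minimal sketch only, assuming the OAuth client ID and secret are stored in two environment variables (the names YOUTUBE_CLIENT_ID and YOUTUBE_CLIENT_SECRET and the helper authenticate_youtube() are hypothetical, not taken from the project), the call to tuber's yt_oauth() might look like this:

```r
# Hypothetical helper: reads credentials from environment variables
# (for example set in .Renviron) so they never appear in the script.
authenticate_youtube <- function() {
  app_id <- Sys.getenv("YOUTUBE_CLIENT_ID")         # assumed variable name
  app_secret <- Sys.getenv("YOUTUBE_CLIENT_SECRET") # assumed variable name

  # Stop early with a clear message if either credential is missing
  if (app_id == "" || app_secret == "") {
    stop("Set YOUTUBE_CLIENT_ID and YOUTUBE_CLIENT_SECRET before authenticating.")
  }

  # Starts the OAuth flow for the YouTube Data API
  tuber::yt_oauth(app_id = app_id, app_secret = app_secret)
}
```

Calling authenticate_youtube() once per session, before any request for comments, keeps the credentials out of the knitted output.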

Video ID

video_id <- "hqnUohxXV0I"

Collect YouTube Comments

comments_raw <- get_comment_threads(
  filter = c(video_id = video_id),
  max_results = 100
)

head(comments_raw)
##      channelId                  videoId      
## [1,] "UClz-d22g9Lj7agZSbaRKuqA" "hqnUohxXV0I"
## [2,] "UClz-d22g9Lj7agZSbaRKuqA" "hqnUohxXV0I"
## [3,] "UClz-d22g9Lj7agZSbaRKuqA" "hqnUohxXV0I"
## [4,] "UClz-d22g9Lj7agZSbaRKuqA" "hqnUohxXV0I"
## [5,] "UClz-d22g9Lj7agZSbaRKuqA" "hqnUohxXV0I"
## [6,] "UClz-d22g9Lj7agZSbaRKuqA" "hqnUohxXV0I"
##      textDisplay                                                                                                                                 
## [1,] "How can you blame Coca-Cola for these people dying of diabetes? When they're eating all the other junk too. You must be working for Pepsi,"
## [2,] "7 yrs soda free, I drink a lot cofee though 😅"                                                                                            
## [3,] "since 1990s i dring pepsi & cola light without any sugar but unfortunately, my teeth gone because of this acid stuff"                      
## [4,] "Coca Cola is gross. Now if we were talking Pepsi I’d understand the obsession but Coke? 🙄"                                                
## [5,] "Coke is good.... I am 66 and I am drinking it since age 11 ....It is my water...."                                                         
## [6,] "Absolutely dumbfounded people! Wtf!"                                                                                                       
##      textOriginal                                                                                                                                
## [1,] "How can you blame Coca-Cola for these people dying of diabetes? When they're eating all the other junk too. You must be working for Pepsi,"
## [2,] "7 yrs soda free, I drink a lot cofee though 😅"                                                                                            
## [3,] "since 1990s i dring pepsi & cola light without any sugar but unfortunately, my teeth gone because of this acid stuff"                      
## [4,] "Coca Cola is gross. Now if we were talking Pepsi I’d understand the obsession but Coke? 🙄"                                                
## [5,] "Coke is good.... I am 66 and I am drinking it since age 11 ....It is my water...."                                                         
## [6,] "Absolutely dumbfounded people! Wtf!"                                                                                                       
##      authorDisplayName        
## [1,] "@joetteconner6861"      
## [2,] "@chrisangflo6102"       
## [3,] "@roznerniyesuhmives8156"
## [4,] "@td4079"                
## [5,] "@frankrumble6754"       
## [6,] "@saffy4352"             
##      authorProfileImageUrl                                                                                    
## [1,] "https://yt3.ggpht.com/ytc/AIdro_nKbtAxnYCtclhLTsb7J6Toril7WG4DZXIFKOaN66WOHzw=s48-c-k-c0x00ffffff-no-rj"
## [2,] "https://yt3.ggpht.com/ytc/AIdro_nFiYhK8DAC8sYFm4D6bsvHzFDFB2sY5AFb8uh0Xxo=s48-c-k-c0x00ffffff-no-rj"    
## [3,] "https://yt3.ggpht.com/ytc/AIdro_lxDwH7GbIAndAh9qTKZ-bI5XO1olkXMNilNKJQOHLTnQ=s48-c-k-c0x00ffffff-no-rj" 
## [4,] "https://yt3.ggpht.com/ytc/AIdro_nhEC5WxZ5Z1xriH5vEtQ40_YocciMpRl_HCECcGW8=s48-c-k-c0x00ffffff-no-rj"    
## [5,] "https://yt3.ggpht.com/ytc/AIdro_lTzCFJYKKtErfkytAwZ79T6l_WMADqkOgAJMQtPS4=s48-c-k-c0x00ffffff-no-rj"    
## [6,] "https://yt3.ggpht.com/ytc/AIdro_lon5esWt018Y3HcyQ9P4erkplZ-0bUhnRLy0FqL5I_dQ=s48-c-k-c0x00ffffff-no-rj" 
##      authorChannelUrl                                
## [1,] "http://www.youtube.com/@joetteconner6861"      
## [2,] "http://www.youtube.com/@chrisangflo6102"       
## [3,] "http://www.youtube.com/@roznerniyesuhmives8156"
## [4,] "http://www.youtube.com/@td4079"                
## [5,] "http://www.youtube.com/@frankrumble6754"       
## [6,] "http://www.youtube.com/@saffy4352"             
##      authorChannelId.value      canRate viewerRating likeCount
## [1,] "UC5ZOTswH_jmv7qhpl3knxQw" "TRUE"  "none"       "0"      
## [2,] "UClrov6-4vrk8Ki2L0eI-ElA" "TRUE"  "none"       "0"      
## [3,] "UCyFP4jEfsz_XX-z34SLttOw" "TRUE"  "none"       "1"      
## [4,] "UC4XVB21G_u2-dWWEPtX3iXw" "TRUE"  "none"       "0"      
## [5,] "UCLmUtv333t2o7d6jJxe2tnw" "TRUE"  "none"       "0"      
## [6,] "UCKNpeDq57Qjeujgpg18xxBA" "TRUE"  "none"       "0"      
##      publishedAt            updatedAt             
## [1,] "2026-05-29T20:36:47Z" "2026-05-29T20:36:47Z"
## [2,] "2026-05-30T05:17:00Z" "2026-05-30T05:17:00Z"
## [3,] "2026-05-30T10:45:36Z" "2026-05-30T10:45:36Z"
## [4,] "2026-05-30T18:10:36Z" "2026-05-30T18:10:36Z"
## [5,] "2026-05-30T21:56:18Z" "2026-05-30T21:56:18Z"
## [6,] "2026-05-30T22:56:55Z" "2026-05-30T22:56:55Z"
## attr(,"tuber_api_calls")
## [1] 1
## attr(,"tuber_quota_used")
## [1] 1
## attr(,"tuber_timestamp")
## [1] "2026-07-03 13:37:05 PDT"
## attr(,"tuber_function")
## [1] "get_comment_threads"
## attr(,"tuber_parameters")
## attr(,"tuber_parameters")$filter
##      video_id 
## "hqnUohxXV0I" 
## 
## attr(,"tuber_parameters")$part
## [1] "snippet"
## 
## attr(,"tuber_parameters")$max_results
## [1] 100
## 
## attr(,"tuber_results_found")
## [1] 100
## attr(,"tuber_response_format")
## [1] "data.frame"
## 
## --- Tuber Metadata ---
## function: get_comment_threads  api_calls: 1  results_found: 100  timestamp: 2026-07-03 13:37:05  
## (Use tuber_info() for full metadata)
glimpse(comments_raw)
##  'tuber_result' chr [1:100, 1:13] "UClz-d22g9Lj7agZSbaRKuqA" ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:13] "channelId" "videoId" "textDisplay" "textOriginal" ...
##  - attr(*, "tuber_api_calls")= num 1
##  - attr(*, "tuber_quota_used")= num 1
##  - attr(*, "tuber_timestamp")= POSIXct[1:1], format: "2026-07-03 13:37:05"
##  - attr(*, "tuber_function")= chr "get_comment_threads"
##  - attr(*, "tuber_parameters")=List of 3
##   ..$ filter     : Named chr "hqnUohxXV0I"
##   .. ..- attr(*, "names")= chr "video_id"
##   ..$ part       : chr "snippet"
##   ..$ max_results: num 100
##  - attr(*, "tuber_results_found")= int 100
##  - attr(*, "tuber_response_format")= chr "data.frame"

Clean Comment Data

comments_clean <- comments_raw %>%
  as_tibble() %>%
  select(
    authorDisplayName,
    textOriginal,
    publishedAt,
    likeCount
  )

head(comments_clean)
## # A tibble: 6 × 4
##   authorDisplayName       textOriginal                     publishedAt likeCount
##   <tbr_rslt>              <tbr_rslt>                       <tbr_rslt>  <tbr_rsl>
## 1 @joetteconner6861       How can you blame Coca-Cola for… 2026-05-29… 0        
## 2 @chrisangflo6102        7 yrs soda free, I drink a lot … 2026-05-30… 0        
## 3 @roznerniyesuhmives8156 since 1990s i dring pepsi & col… 2026-05-30… 1        
## 4 @td4079                 Coca Cola is gross. Now if we w… 2026-05-30… 0        
## 5 @frankrumble6754        Coke is good.... I am 66 and I … 2026-05-30… 0        
## 6 @saffy4352              Absolutely dumbfounded people! … 2026-05-30… 0
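
The tibble above shows every column with the type <tbr_rslt>, so dates and like counts are still stored as text. A minimal sketch of converting them to proper types follows. It uses a small stand-in tibble (demo, a hypothetical name) built from the first three rows shown above, and it assumes the timestamps always follow the ISO 8601 pattern seen in the output.

```r
library(dplyr)

# Stand-in data: first three rows of the output above, stored as text
demo <- tibble(
  publishedAt = c("2026-05-29T20:36:47Z", "2026-05-30T05:17:00Z", "2026-05-30T10:45:36Z"),
  likeCount = c("0", "0", "1")
)

demo_typed <- demo %>%
  mutate(
    # Parse the ISO 8601 timestamp as UTC date-time
    publishedAt = as.POSIXct(as.character(publishedAt),
                             format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC"),
    # Convert like counts from text to integers
    likeCount = as.integer(as.character(likeCount))
  )

glimpse(demo_typed)
```

The same mutate() step could be appended to the pipeline that builds comments_clean, which would make date-based summaries and numeric summaries possible without later conversions.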

Number of Comments Collected

nrow(comments_clean)
## [1] 100

Save Comments to CSV

write_csv(comments_clean, "youtube_comments.csv")

Summary of Data Collected

comments_clean$likeCount <- as.numeric(comments_clean$likeCount)

summary_table <- data.frame(
  metric = c("Comments analyzed", "Total comment likes", "Average likes per comment"),
  value = c(
    nrow(comments_clean),
    sum(comments_clean$likeCount, na.rm = TRUE),
    round(mean(comments_clean$likeCount, na.rm = TRUE), 1)
  )
)

kable(summary_table, caption = "Summary of Collected YouTube Comment Data")
Summary of Collected YouTube Comment Data

metric                       value
Comments analyzed              100
Total comment likes              5
Average likes per comment        0
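
The table reports an average of 0 likes per comment, although 5 likes across 100 comments works out to 0.05. Rounding to one decimal place hides this small value. As a sketch of one possible fix, the values can be formatted individually as text so the average keeps two decimals while the counts stay whole numbers. The totals below are copied from the table, and n_comments and total_likes are illustrative names.

```r
# Totals copied from the summary table above
n_comments <- 100
total_likes <- 5

# Format each value separately so the small average keeps its decimals
summary_table <- data.frame(
  metric = c("Comments analyzed", "Total comment likes", "Average likes per comment"),
  value = c(
    format(n_comments),
    format(total_likes),
    format(round(total_likes / n_comments, 2), nsmall = 2)
  )
)

summary_table
```

Because the value column is now character rather than numeric, the table-rendering step no longer has to apply one number format to all three rows.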

Clean Text for Analysis

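# Load the stop word list that ships with tidytext
data(stop_words)

words <- comments_clean %>%
  select(textOriginal) %>%
  # Split each comment into lowercase one-word tokens
  unnest_tokens(word, textOriginal) %>%
  # Drop common English stop words
  anti_join(stop_words, by = "word") %>%
  # Drop tokens made up only of digits
  filter(!str_detect(word, "^[0-9]+$")) %>%
  # Keep words with at least three characters
  filter(str_length(word) > 2)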

Word Frequency Analysis

word_freq <- words %>%
  count(word, sort = TRUE)

head(word_freq, 20)
## # A tibble: 20 × 2
##    word         n
##    <chr>    <int>
##  1 coca        28
##  2 cola        27
##  3 drink       24
##  4 water       21
##  5 coke        18
##  6 people      12
##  7 diabetes    10
##  8 drinking     9
##  9 mexican      8
## 10 sugar        8
## 11 drinks       7
## 12 mexico       7
## 13 bad          6
## 14 day          6
## 15 soda         6
## 16 diet         5
## 17 health       5
## 18 it’s         5
## 19 i’m          5
## 20 love         5
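
The contractions it’s and i’m appear in the top 20 even though stop words were removed. A likely reason is that these comments use curly apostrophes, while the entries in stop_words use straight ones, so the anti_join() finds no match. The sketch below shows one way to normalize apostrophes before tokenizing. The two demo comments are made up for illustration and are not part of the collected data.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Made-up comments that use curly apostrophes (U+2019)
demo_comments <- tibble(
  textOriginal = c("It\u2019s my water", "I\u2019m drinking water")
)

demo_words <- demo_comments %>%
  # Replace curly single quotes with a straight apostrophe
  mutate(textOriginal = str_replace_all(textOriginal, "[\u2018\u2019]", "'")) %>%
  unnest_tokens(word, textOriginal) %>%
  # The contractions now match the straight-apostrophe entries in stop_words
  anti_join(stop_words, by = "word") %>%
  filter(str_length(word) > 2)

demo_words$word
```

Applying the same mutate() step to the real comments would change the frequency table, so the counts reported here would need to be regenerated.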

Top 20 Most Common Words

kable(
  head(word_freq, 20),
  caption = "Top 20 Most Common Words in the YouTube Comments"
)
Top 20 Most Common Words in the YouTube Comments

word         n
coca        28
cola        27
drink       24
water       21
coke        18
people      12
diabetes    10
drinking     9
mexican      8
sugar        8
drinks       7
mexico       7
bad          6
day          6
soda         6
diet         5
health       5
it’s         5
i’m          5
love         5

Bar Chart

word_freq %>%
  slice_max(n, n = 15) %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Top 15 Most Common Words in YouTube Comments",
    x = "Word",
    y = "Frequency"
  ) +
  theme_minimal()

Word Cloud

set.seed(123)

wordcloud(
  words = word_freq$word,
  freq = word_freq$n,
  min.freq = 2,
  max.words = 100,
  random.order = FALSE,
  colors = brewer.pal(8, "Dark2")
)

Results and Discussion

A total of 100 YouTube comments were collected from the video.

The word frequency analysis shows that the most common words in the comments were coca, cola, drink, water, coke, people, diabetes, drinking, mexican, and sugar. These words closely reflect the video’s focus on the high consumption of sugary drinks in Chiapas, Mexico, and the resulting public health concerns.

The frequent appearance of words such as sugar, diabetes, health, and water suggests that many viewers were discussing the health risks associated with excessive soft drink consumption. At the same time, words such as mexican, mexico, and people suggest that commenters were also talking about the regional and social context of beverage consumption, although word counts alone cannot show which social, cultural, or economic factors they raised. Taken together, the comments indicate that viewers were engaging with the video’s message about nutrition, public health, and the impact of Coca-Cola on communities in southern Mexico.

Key Themes

Overall, the comments reveal four major themes:

  1. Health and diabetes: Many viewers discussed the relationship between sugary drinks, diabetes, and other long-term health problems.

  2. Coca-Cola consumption: The repeated appearance of words such as coca, cola, coke, and drink shows that Coca-Cola was the primary focus of the discussion.

  3. Water versus soda: The frequent use of the word water suggests that viewers compared drinking water with consuming sugary beverages and discussed healthier alternatives.

  4. Social and cultural factors: Words such as mexican and mexico indicate that many commenters were discussing the economic and cultural reasons why sugary drinks are so widely consumed in Chiapas.
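
To give a rough sense of how much weight each theme carries, the word counts from the frequency table can be summed by theme. This is a sketch under stated assumptions: the word-to-theme lookup (theme_lookup) is my own grouping based on the words named in the discussion, and the totals count word occurrences, not comments.

```r
library(dplyr)

# Counts copied from the top-20 frequency table
word_freq_top <- tibble(
  word = c("coca", "cola", "drink", "water", "coke", "diabetes",
           "mexican", "sugar", "mexico", "health"),
  n = c(28, 27, 24, 21, 18, 10, 8, 8, 7, 5)
)

# Hypothetical lookup assigning each word to one of the four themes
theme_lookup <- tibble(
  word = c("coca", "cola", "coke", "drink", "water",
           "diabetes", "sugar", "health", "mexican", "mexico"),
  theme = c(rep("Coca-Cola consumption", 4), "Water versus soda",
            rep("Health and diabetes", 3), rep("Social and cultural factors", 2))
)

# Sum word occurrences within each theme and sort from largest to smallest
theme_totals <- word_freq_top %>%
  inner_join(theme_lookup, by = "word") %>%
  group_by(theme) %>%
  summarise(mentions = sum(n)) %>%
  arrange(desc(mentions))

theme_totals
```

Under this grouping, Coca-Cola consumption accounts for 97 word occurrences, health and diabetes for 23, water for 21, and the social and cultural words for 15. A different assignment of words to themes would shift these totals.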

Conclusion

This project demonstrated how YouTube comments can be collected and analyzed using the YouTube Data API and R. After cleaning the comments, word frequency analysis identified the most common topics discussed by viewers. The frequency table, bar chart, and word cloud showed that conversations centered on Coca-Cola, sugary drinks, diabetes, water, and public health.

The findings support the video’s message that excessive sugary drink consumption has become a significant public health issue in Chiapas, Mexico. The comments also show that viewers were not only reacting to the documentary but were also discussing broader issues such as nutrition, health education, poverty, and access to healthier alternatives. As in this week’s reading on competitive intelligence, analyzing publicly available social media comments can provide valuable insight into public opinion and consumer perceptions. These insights can help researchers, organizations, and businesses better understand how audiences respond to important social and health issues.

References

Google Developers. (2024). YouTube Data API v3 Documentation. https://developers.google.com/youtube/v3

Silge, J., & Robinson, D. (2024). Text Mining with R. https://www.tidytextmining.com/

World Health Organization. (2023). Diabetes. https://www.who.int/news-room/fact-sheets/detail/diabetes

Channel 4 News. (2021). Unreported World: The town that drinks the most sugary drinks in the world. https://www.youtube.com/watch?v=hqnUohxXV0I