YouTube provides a useful way to understand how audiences respond to online content, since comments often include opinions, reactions, questions, and repeated themes that can be analyzed using text mining.
For this project, I chose a video about the sugar-intake crisis in Mexico, specifically in the state of Chiapas, where people drink up to two liters of sugary drinks a day.
Video used: https://youtu.be/hqnUohxXV0I
Using the YouTube Data API and R, I collected comments from this video, cleaned the text, and performed a word frequency analysis. The results are shown using a frequency table, bar chart, and word cloud.
library(tuber)
library(dplyr)
library(readr)
library(tidytext)
library(stringr)
library(ggplot2)
library(wordcloud)
library(RColorBrewer)
library(knitr)
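The collection step that follows needs an authenticated YouTube Data API session, which the report does not show. The sketch below is one possible setup, assuming OAuth client credentials stored in two environment variables; the names `YT_CLIENT_ID` and `YT_CLIENT_SECRET` are hypothetical and not taken from the original project.

```r
library(tuber)

# Hypothetical environment variable names (an assumption, not from the report).
# They could be defined in ~/.Renviron so that secrets stay out of the script.
client_id <- Sys.getenv("YT_CLIENT_ID")
client_secret <- Sys.getenv("YT_CLIENT_SECRET")

if (nzchar(client_id) && nzchar(client_secret)) {
  # Starts the OAuth flow and caches a token for later tuber calls
  yt_oauth(app_id = client_id, app_secret = client_secret)
} else {
  message("YouTube credentials not found; skipping authentication.")
}
```

Reading the credentials from the environment keeps them out of the knitted document and out of version control.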
video_id <- "hqnUohxXV0I"
comments_raw <- get_comment_threads(
  filter = c(video_id = video_id),
  max_results = 100
)
head(comments_raw)
## channelId videoId
## [1,] "UClz-d22g9Lj7agZSbaRKuqA" "hqnUohxXV0I"
## [2,] "UClz-d22g9Lj7agZSbaRKuqA" "hqnUohxXV0I"
## [3,] "UClz-d22g9Lj7agZSbaRKuqA" "hqnUohxXV0I"
## [4,] "UClz-d22g9Lj7agZSbaRKuqA" "hqnUohxXV0I"
## [5,] "UClz-d22g9Lj7agZSbaRKuqA" "hqnUohxXV0I"
## [6,] "UClz-d22g9Lj7agZSbaRKuqA" "hqnUohxXV0I"
## textDisplay
## [1,] "How can you blame Coca-Cola for these people dying of diabetes? When they're eating all the other junk too. You must be working for Pepsi,"
## [2,] "7 yrs soda free, I drink a lot cofee though 😅"
## [3,] "since 1990s i dring pepsi & cola light without any sugar but unfortunately, my teeth gone because of this acid stuff"
## [4,] "Coca Cola is gross. Now if we were talking Pepsi I’d understand the obsession but Coke? 🙄"
## [5,] "Coke is good.... I am 66 and I am drinking it since age 11 ....It is my water...."
## [6,] "Absolutely dumbfounded people! Wtf!"
## textOriginal
## [1,] "How can you blame Coca-Cola for these people dying of diabetes? When they're eating all the other junk too. You must be working for Pepsi,"
## [2,] "7 yrs soda free, I drink a lot cofee though 😅"
## [3,] "since 1990s i dring pepsi & cola light without any sugar but unfortunately, my teeth gone because of this acid stuff"
## [4,] "Coca Cola is gross. Now if we were talking Pepsi I’d understand the obsession but Coke? 🙄"
## [5,] "Coke is good.... I am 66 and I am drinking it since age 11 ....It is my water...."
## [6,] "Absolutely dumbfounded people! Wtf!"
## authorDisplayName
## [1,] "@joetteconner6861"
## [2,] "@chrisangflo6102"
## [3,] "@roznerniyesuhmives8156"
## [4,] "@td4079"
## [5,] "@frankrumble6754"
## [6,] "@saffy4352"
## authorProfileImageUrl
## [1,] "https://yt3.ggpht.com/ytc/AIdro_nKbtAxnYCtclhLTsb7J6Toril7WG4DZXIFKOaN66WOHzw=s48-c-k-c0x00ffffff-no-rj"
## [2,] "https://yt3.ggpht.com/ytc/AIdro_nFiYhK8DAC8sYFm4D6bsvHzFDFB2sY5AFb8uh0Xxo=s48-c-k-c0x00ffffff-no-rj"
## [3,] "https://yt3.ggpht.com/ytc/AIdro_lxDwH7GbIAndAh9qTKZ-bI5XO1olkXMNilNKJQOHLTnQ=s48-c-k-c0x00ffffff-no-rj"
## [4,] "https://yt3.ggpht.com/ytc/AIdro_nhEC5WxZ5Z1xriH5vEtQ40_YocciMpRl_HCECcGW8=s48-c-k-c0x00ffffff-no-rj"
## [5,] "https://yt3.ggpht.com/ytc/AIdro_lTzCFJYKKtErfkytAwZ79T6l_WMADqkOgAJMQtPS4=s48-c-k-c0x00ffffff-no-rj"
## [6,] "https://yt3.ggpht.com/ytc/AIdro_lon5esWt018Y3HcyQ9P4erkplZ-0bUhnRLy0FqL5I_dQ=s48-c-k-c0x00ffffff-no-rj"
## authorChannelUrl
## [1,] "http://www.youtube.com/@joetteconner6861"
## [2,] "http://www.youtube.com/@chrisangflo6102"
## [3,] "http://www.youtube.com/@roznerniyesuhmives8156"
## [4,] "http://www.youtube.com/@td4079"
## [5,] "http://www.youtube.com/@frankrumble6754"
## [6,] "http://www.youtube.com/@saffy4352"
## authorChannelId.value canRate viewerRating likeCount
## [1,] "UC5ZOTswH_jmv7qhpl3knxQw" "TRUE" "none" "0"
## [2,] "UClrov6-4vrk8Ki2L0eI-ElA" "TRUE" "none" "0"
## [3,] "UCyFP4jEfsz_XX-z34SLttOw" "TRUE" "none" "1"
## [4,] "UC4XVB21G_u2-dWWEPtX3iXw" "TRUE" "none" "0"
## [5,] "UCLmUtv333t2o7d6jJxe2tnw" "TRUE" "none" "0"
## [6,] "UCKNpeDq57Qjeujgpg18xxBA" "TRUE" "none" "0"
## publishedAt updatedAt
## [1,] "2026-05-29T20:36:47Z" "2026-05-29T20:36:47Z"
## [2,] "2026-05-30T05:17:00Z" "2026-05-30T05:17:00Z"
## [3,] "2026-05-30T10:45:36Z" "2026-05-30T10:45:36Z"
## [4,] "2026-05-30T18:10:36Z" "2026-05-30T18:10:36Z"
## [5,] "2026-05-30T21:56:18Z" "2026-05-30T21:56:18Z"
## [6,] "2026-05-30T22:56:55Z" "2026-05-30T22:56:55Z"
## attr(,"tuber_api_calls")
## [1] 1
## attr(,"tuber_quota_used")
## [1] 1
## attr(,"tuber_timestamp")
## [1] "2026-07-03 13:37:05 PDT"
## attr(,"tuber_function")
## [1] "get_comment_threads"
## attr(,"tuber_parameters")
## attr(,"tuber_parameters")$filter
## video_id
## "hqnUohxXV0I"
##
## attr(,"tuber_parameters")$part
## [1] "snippet"
##
## attr(,"tuber_parameters")$max_results
## [1] 100
##
## attr(,"tuber_results_found")
## [1] 100
## attr(,"tuber_response_format")
## [1] "data.frame"
##
## --- Tuber Metadata ---
## function: get_comment_threads api_calls: 1 results_found: 100 timestamp: 2026-07-03 13:37:05
## (Use tuber_info() for full metadata)
glimpse(comments_raw)
## 'tuber_result' chr [1:100, 1:13] "UClz-d22g9Lj7agZSbaRKuqA" ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:13] "channelId" "videoId" "textDisplay" "textOriginal" ...
## - attr(*, "tuber_api_calls")= num 1
## - attr(*, "tuber_quota_used")= num 1
## - attr(*, "tuber_timestamp")= POSIXct[1:1], format: "2026-07-03 13:37:05"
## - attr(*, "tuber_function")= chr "get_comment_threads"
## - attr(*, "tuber_parameters")=List of 3
## ..$ filter : Named chr "hqnUohxXV0I"
## .. ..- attr(*, "names")= chr "video_id"
## ..$ part : chr "snippet"
## ..$ max_results: num 100
## - attr(*, "tuber_results_found")= int 100
## - attr(*, "tuber_response_format")= chr "data.frame"
# Keep only the columns needed for the text analysis
comments_clean <- comments_raw %>%
  as_tibble() %>%
  select(
    authorDisplayName,
    textOriginal,
    publishedAt,
    likeCount
  )
head(comments_clean)
## # A tibble: 6 × 4
## authorDisplayName textOriginal publishedAt likeCount
## <tbr_rslt> <tbr_rslt> <tbr_rslt> <tbr_rsl>
## 1 @joetteconner6861 How can you blame Coca-Cola for… 2026-05-29… 0
## 2 @chrisangflo6102 7 yrs soda free, I drink a lot … 2026-05-30… 0
## 3 @roznerniyesuhmives8156 since 1990s i dring pepsi & col… 2026-05-30… 1
## 4 @td4079 Coca Cola is gross. Now if we w… 2026-05-30… 0
## 5 @frankrumble6754 Coke is good.... I am 66 and I … 2026-05-30… 0
## 6 @saffy4352 Absolutely dumbfounded people! … 2026-05-30… 0
nrow(comments_clean)
## [1] 100
write_csv(comments_clean, "youtube_comments.csv")
comments_clean$likeCount <- as.numeric(comments_clean$likeCount)
summary_table <- data.frame(
  metric = c("Comments analyzed", "Total comment likes", "Average likes per comment"),
  value = c(
    nrow(comments_clean),
    sum(comments_clean$likeCount, na.rm = TRUE),
    round(mean(comments_clean$likeCount, na.rm = TRUE), 1)
  )
)
kable(summary_table, caption = "Summary of Collected YouTube Comment Data")
| metric | value |
|---|---|
| Comments analyzed | 100 |
| Total comment likes | 5 |
| Average likes per comment | 0 |
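The API result stores every column as character, which is why likeCount had to be converted before it could be summed. The same idea applies to publishedAt. The sketch below is a minimal illustration on a two-row toy tibble (values copied from the output shown earlier); the object names `toy` and `typed` are made up for this example.

```r
library(dplyr)

# Toy data: two rows with values copied from the comment output above
toy <- tibble(
  publishedAt = c("2026-05-29T20:36:47Z", "2026-05-30T10:45:36Z"),
  likeCount = c("0", "1")
)

typed <- toy %>%
  mutate(
    # Parse the ISO 8601 timestamp; the trailing Z marks UTC
    publishedAt = as.POSIXct(publishedAt, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC"),
    # Convert the like count from text to an integer
    likeCount = as.integer(likeCount)
  )

typed
```

With proper types, the comments could also be sorted or grouped by date, which is not possible while the timestamps are plain text.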
data(stop_words)
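# Split comments into lowercase words, then drop stop words, pure numbers, and very short tokens
words <- comments_clean %>%
  select(textOriginal) %>%
  unnest_tokens(word, textOriginal) %>%
  anti_join(stop_words, by = "word") %>%
  filter(!str_detect(word, "^[0-9]+$")) %>%
  filter(str_length(word) > 2)
# Count how often each remaining word occurs
word_freq <- words %>%
  count(word, sort = TRUE)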
head(word_freq, 20)
## # A tibble: 20 × 2
## word n
## <chr> <int>
## 1 coca 28
## 2 cola 27
## 3 drink 24
## 4 water 21
## 5 coke 18
## 6 people 12
## 7 diabetes 10
## 8 drinking 9
## 9 mexican 8
## 10 sugar 8
## 11 drinks 7
## 12 mexico 7
## 13 bad 6
## 14 day 6
## 15 soda 6
## 16 diet 5
## 17 health 5
## 18 it’s 5
## 19 i’m 5
## 20 love 5
kable(
  head(word_freq, 20),
  caption = "Top 20 Most Common Words in the YouTube Comments"
)
| word | n |
|---|---|
| coca | 28 |
| cola | 27 |
| drink | 24 |
| water | 21 |
| coke | 18 |
| people | 12 |
| diabetes | 10 |
| drinking | 9 |
| mexican | 8 |
| sugar | 8 |
| drinks | 7 |
| mexico | 7 |
| bad | 6 |
| day | 6 |
| soda | 6 |
| diet | 5 |
| health | 5 |
| it’s | 5 |
| i’m | 5 |
| love | 5 |
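The table still contains it’s and i’m, even though stop words were removed. A likely cause is that these comments use curly apostrophes, while the `stop_words` lexicon spells contractions with straight ones, so the anti-join finds no match. The sketch below shows one possible fix on toy text (not the collected data): replace curly apostrophes before tokenizing.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Toy comments written with curly apostrophes (\u2019)
toy_comments <- tibble(
  textOriginal = c("It\u2019s my water", "I\u2019m drinking soda")
)

toy_words <- toy_comments %>%
  # Replace curly single quotes with straight ones so contractions match stop_words
  mutate(textOriginal = str_replace_all(textOriginal, "[\u2018\u2019]", "'")) %>%
  unnest_tokens(word, textOriginal) %>%
  anti_join(stop_words, by = "word")

toy_words
```

Applying the same `mutate()` step before `unnest_tokens()` in the main pipeline would be expected to remove these contractions from the frequency table, though the counts reported here were produced without it.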
word_freq %>%
  slice_max(n, n = 15) %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Top 15 Most Common Words in YouTube Comments",
    x = "Word",
    y = "Frequency"
  ) +
  theme_minimal()
set.seed(123)
wordcloud(
  words = word_freq$word,
  freq = word_freq$n,
  min.freq = 2,
  max.words = 100,
  random.order = FALSE,
  colors = brewer.pal(8, "Dark2")
)
# Results and Discussion
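A total of 100 top-level YouTube comments were collected from the video in a single API request (max_results = 100), so the sample may not cover the whole comment section. These comments received 5 likes in total, an average of 0.05 per comment, which the summary table rounds to 0.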
The word frequency analysis shows that the most common words in the comments were coca, cola, drink, water, coke, people, diabetes, drinking, mexican, and sugar. These words closely reflect the video’s focus on the high consumption of sugary drinks in Chiapas, Mexico, and the resulting public health concerns.
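Because the tokenizer splits on spaces and hyphens, the brand name Coca-Cola is counted as two separate words, coca and cola. Counting two-word sequences (bigrams) is one way to keep the name together. The sketch below uses two abridged snippets from the comments shown earlier as toy input; it is an illustration, not part of the original analysis.

```r
library(dplyr)
library(tidytext)

# Toy input: abridged snippets from two of the comments shown above
toy_comments <- tibble(
  textOriginal = c(
    "Coca Cola is gross",
    "How can you blame Coca-Cola"
  )
)

# Tokenize into overlapping two-word sequences and count them
bigram_freq <- toy_comments %>%
  unnest_tokens(bigram, textOriginal, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)

head(bigram_freq, 3)
```

Both spellings end up as the bigram "coca cola", so the brand would appear once in a bigram table rather than as two neighboring entries.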
The frequent appearance of words such as sugar, diabetes, health, and water suggests that many viewers were discussing the health risks associated with excessive soft drink consumption. Words such as mexican, mexico, and people show that commenters also referred to the country and its population, although word counts alone cannot confirm that they were discussing social, cultural, or economic factors. Taken together, the comments suggest that viewers were actively engaging with the video’s message about nutrition, public health, and the impact of Coca-Cola on communities in southern Mexico.
The word frequencies point to four major themes:
Health and diabetes: Words such as diabetes, sugar, and health suggest that many viewers discussed the relationship between sugary drinks, diabetes, and other long-term health problems.
Coca-Cola consumption: The repeated appearance of words such as coca, cola, coke, and drink shows that Coca-Cola was the primary focus of the discussion.
Water versus soda: The frequent use of the word water suggests that viewers compared drinking water with consuming sugary beverages and discussed healthier alternatives.
Social and cultural factors: Words such as mexican and mexico suggest that commenters connected the issue to Mexico specifically, which may reflect discussion of the economic and cultural reasons why sugary drinks are so widely consumed in Chiapas.
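The themes above are inferred from individual word counts. One way to check them more directly is to measure what share of comments mention at least one keyword per theme. The sketch below does this on four abridged comments from the output shown earlier; the keyword patterns in `themes` are hypothetical choices made for this example, not a validated coding scheme.

```r
library(stringr)

# Toy input: four abridged comments from the output shown earlier
toy_text <- c(
  "How can you blame Coca-Cola for these people dying of diabetes?",
  "7 yrs soda free, I drink a lot cofee though",
  "Coke is good.... It is my water....",
  "Absolutely dumbfounded people! Wtf!"
)

# Hypothetical keyword patterns, one regular expression per theme
themes <- c(
  health = "diabetes|health|sugar",
  coca_cola = "coca|cola|coke",
  water = "water"
)

# Share of comments that match each theme at least once
theme_share <- sapply(themes, function(pattern) {
  mean(str_detect(str_to_lower(toy_text), pattern))
})

theme_share
```

Counting comments rather than words avoids giving extra weight to a single long comment that repeats the same term many times.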
This project demonstrated how YouTube comments can be collected and analyzed using the YouTube Data API and R. After the comments were cleaned, word frequency analysis identified the words viewers used most often. The frequency table, bar chart, and word cloud showed that conversations centered on Coca-Cola, sugary drinks, diabetes, water, and public health.
The findings are consistent with the video’s message that excessive sugary drink consumption has become a significant public health issue in Chiapas, Mexico. The comments also suggest that viewers were not only reacting to the documentary but also discussing broader issues such as nutrition, health education, poverty, and access to healthier alternatives, although the word counts do not measure these topics directly. As in this week’s reading on competitive intelligence, analyzing publicly available social media comments can provide valuable insight into public opinion and consumer perceptions. These insights can help researchers, organizations, and businesses better understand how audiences respond to important social and health issues.
# References
Google Developers. (2024). YouTube Data API v3 Documentation. https://developers.google.com/youtube/v3
Silge, J., & Robinson, D. (2024). Text Mining with R. https://www.tidytextmining.com/
World Health Organization. (2023). Diabetes. https://www.who.int/news-room/fact-sheets/detail/diabetes
Channel 4 News. (2021). Unreported World: The town that drinks the most sugary drinks in the world. https://www.youtube.com/watch?v=hqnUohxXV0I