Introduction

YouTube comments offer a useful way to understand how audiences respond to online content. They often include opinions, reactions, questions, and recurring themes that can be analyzed with text mining.

For this project, I chose a video about the sugar-intake crisis in Mexico, specifically in the state of Chiapas, where people drink up to two liters of sugary drinks a day.

Video used: https://youtu.be/hqnUohxXV0I

Using the YouTube Data API and R, I collected comments from this video, cleaned the text, and performed a word frequency analysis. The results are shown using a frequency table, bar chart, and word cloud.

Loading Packages

library(tuber)
library(dplyr)

library(readr)
library(tidytext)
library(stringr)
library(ggplot2)
library(wordcloud)

library(RColorBrewer)
library(knitr)

YouTube OAuth
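
The authentication code is not shown in this report. The following is a minimal sketch of what this step might look like, assuming the OAuth client ID and secret are stored in environment variables. The names `YT_CLIENT_ID` and `YT_CLIENT_SECRET` are hypothetical and are not taken from the original project.

```r
# Read credentials from environment variables so they never appear in the report.
# YT_CLIENT_ID and YT_CLIENT_SECRET are hypothetical names chosen for this sketch.
client_id     <- Sys.getenv("YT_CLIENT_ID")
client_secret <- Sys.getenv("YT_CLIENT_SECRET")

# TRUE only when both values are set to non-empty strings.
has_credentials <- nzchar(client_id) && nzchar(client_secret)

if (has_credentials && requireNamespace("tuber", quietly = TRUE)) {
  # yt_oauth() starts the OAuth flow with the supplied client ID and secret.
  tuber::yt_oauth(app_id = client_id, app_secret = client_secret)
} else {
  message("Credentials not set; skipping YouTube authentication.")
}
```

Keeping the credentials outside the script means the knitted report can be shared without exposing them.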

Video ID

video_id <- "hqnUohxXV0I"

Collect YouTube Comments

comments_raw <- get_comment_threads(
  filter = c(video_id = video_id),
  max_results = 100
)

head(comments_raw)

##      channelId                  videoId      
## [1,] "UClz-d22g9Lj7agZSbaRKuqA" "hqnUohxXV0I"
## [2,] "UClz-d22g9Lj7agZSbaRKuqA" "hqnUohxXV0I"
## [3,] "UClz-d22g9Lj7agZSbaRKuqA" "hqnUohxXV0I"
## [4,] "UClz-d22g9Lj7agZSbaRKuqA" "hqnUohxXV0I"
## [5,] "UClz-d22g9Lj7agZSbaRKuqA" "hqnUohxXV0I"
## [6,] "UClz-d22g9Lj7agZSbaRKuqA" "hqnUohxXV0I"
##      textDisplay                                                                                                                                 
## [1,] "How can you blame Coca-Cola for these people dying of diabetes? When they're eating all the other junk too. You must be working for Pepsi,"
## [2,] "7 yrs soda free, I drink a lot cofee though 😅"                                                                                            
## [3,] "since 1990s i dring pepsi & cola light without any sugar but unfortunately, my teeth gone because of this acid stuff"                      
## [4,] "Coca Cola is gross. Now if we were talking Pepsi I’d understand the obsession but Coke? 🙄"                                                
## [5,] "Coke is good.... I am 66 and I am drinking it since age 11 ....It is my water...."                                                         
## [6,] "Absolutely dumbfounded people! Wtf!"                                                                                                       
##      textOriginal                                                                                                                                
## [1,] "How can you blame Coca-Cola for these people dying of diabetes? When they're eating all the other junk too. You must be working for Pepsi,"
## [2,] "7 yrs soda free, I drink a lot cofee though 😅"                                                                                            
## [3,] "since 1990s i dring pepsi & cola light without any sugar but unfortunately, my teeth gone because of this acid stuff"                      
## [4,] "Coca Cola is gross. Now if we were talking Pepsi I’d understand the obsession but Coke? 🙄"                                                
## [5,] "Coke is good.... I am 66 and I am drinking it since age 11 ....It is my water...."                                                         
## [6,] "Absolutely dumbfounded people! Wtf!"                                                                                                       
##      authorDisplayName        
## [1,] "@joetteconner6861"      
## [2,] "@chrisangflo6102"       
## [3,] "@roznerniyesuhmives8156"
## [4,] "@td4079"                
## [5,] "@frankrumble6754"       
## [6,] "@saffy4352"             
##      authorProfileImageUrl                                                                                    
## [1,] "https://yt3.ggpht.com/ytc/AIdro_nKbtAxnYCtclhLTsb7J6Toril7WG4DZXIFKOaN66WOHzw=s48-c-k-c0x00ffffff-no-rj"
## [2,] "https://yt3.ggpht.com/ytc/AIdro_nFiYhK8DAC8sYFm4D6bsvHzFDFB2sY5AFb8uh0Xxo=s48-c-k-c0x00ffffff-no-rj"    
## [3,] "https://yt3.ggpht.com/ytc/AIdro_lxDwH7GbIAndAh9qTKZ-bI5XO1olkXMNilNKJQOHLTnQ=s48-c-k-c0x00ffffff-no-rj" 
## [4,] "https://yt3.ggpht.com/ytc/AIdro_nhEC5WxZ5Z1xriH5vEtQ40_YocciMpRl_HCECcGW8=s48-c-k-c0x00ffffff-no-rj"    
## [5,] "https://yt3.ggpht.com/ytc/AIdro_lTzCFJYKKtErfkytAwZ79T6l_WMADqkOgAJMQtPS4=s48-c-k-c0x00ffffff-no-rj"    
## [6,] "https://yt3.ggpht.com/ytc/AIdro_lon5esWt018Y3HcyQ9P4erkplZ-0bUhnRLy0FqL5I_dQ=s48-c-k-c0x00ffffff-no-rj" 
##      authorChannelUrl                                
## [1,] "http://www.youtube.com/@joetteconner6861"      
## [2,] "http://www.youtube.com/@chrisangflo6102"       
## [3,] "http://www.youtube.com/@roznerniyesuhmives8156"
## [4,] "http://www.youtube.com/@td4079"                
## [5,] "http://www.youtube.com/@frankrumble6754"       
## [6,] "http://www.youtube.com/@saffy4352"             
##      authorChannelId.value      canRate viewerRating likeCount
## [1,] "UC5ZOTswH_jmv7qhpl3knxQw" "TRUE"  "none"       "0"      
## [2,] "UClrov6-4vrk8Ki2L0eI-ElA" "TRUE"  "none"       "0"      
## [3,] "UCyFP4jEfsz_XX-z34SLttOw" "TRUE"  "none"       "1"      
## [4,] "UC4XVB21G_u2-dWWEPtX3iXw" "TRUE"  "none"       "0"      
## [5,] "UCLmUtv333t2o7d6jJxe2tnw" "TRUE"  "none"       "0"      
## [6,] "UCKNpeDq57Qjeujgpg18xxBA" "TRUE"  "none"       "0"      
##      publishedAt            updatedAt             
## [1,] "2026-05-29T20:36:47Z" "2026-05-29T20:36:47Z"
## [2,] "2026-05-30T05:17:00Z" "2026-05-30T05:17:00Z"
## [3,] "2026-05-30T10:45:36Z" "2026-05-30T10:45:36Z"
## [4,] "2026-05-30T18:10:36Z" "2026-05-30T18:10:36Z"
## [5,] "2026-05-30T21:56:18Z" "2026-05-30T21:56:18Z"
## [6,] "2026-05-30T22:56:55Z" "2026-05-30T22:56:55Z"
## attr(,"tuber_api_calls")
## [1] 1
## attr(,"tuber_quota_used")
## [1] 1
## attr(,"tuber_timestamp")
## [1] "2026-07-03 13:37:05 PDT"
## attr(,"tuber_function")
## [1] "get_comment_threads"
## attr(,"tuber_parameters")
## attr(,"tuber_parameters")$filter
##      video_id 
## "hqnUohxXV0I" 
## 
## attr(,"tuber_parameters")$part
## [1] "snippet"
## 
## attr(,"tuber_parameters")$max_results
## [1] 100
## 
## attr(,"tuber_results_found")
## [1] 100
## attr(,"tuber_response_format")
## [1] "data.frame"
## 
## --- Tuber Metadata ---
## function: get_comment_threads  api_calls: 1  results_found: 100  timestamp: 2026-07-03 13:37:05  
## (Use tuber_info() for full metadata)

glimpse(comments_raw)

##  'tuber_result' chr [1:100, 1:13] "UClz-d22g9Lj7agZSbaRKuqA" ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:13] "channelId" "videoId" "textDisplay" "textOriginal" ...
##  - attr(*, "tuber_api_calls")= num 1
##  - attr(*, "tuber_quota_used")= num 1
##  - attr(*, "tuber_timestamp")= POSIXct[1:1], format: "2026-07-03 13:37:05"
##  - attr(*, "tuber_function")= chr "get_comment_threads"
##  - attr(*, "tuber_parameters")=List of 3
##   ..$ filter     : Named chr "hqnUohxXV0I"
##   .. ..- attr(*, "names")= chr "video_id"
##   ..$ part       : chr "snippet"
##   ..$ max_results: num 100
##  - attr(*, "tuber_results_found")= int 100
##  - attr(*, "tuber_response_format")= chr "data.frame"

Clean Comment Data

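# Convert the API result to a tibble and keep only the columns needed for analysis
comments_clean <- comments_raw %>%
  as_tibble() %>%
  select(
    authorDisplayName,  # commenter's handle
    textOriginal,       # raw comment text
    publishedAt,        # timestamp of the comment
    likeCount           # number of likes on the comment
  )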

head(comments_clean)

## # A tibble: 6 × 4
##   authorDisplayName       textOriginal                     publishedAt likeCount
##   <tbr_rslt>              <tbr_rslt>                       <tbr_rslt>  <tbr_rsl>
## 1 @joetteconner6861       How can you blame Coca-Cola for… 2026-05-29… 0        
## 2 @chrisangflo6102        7 yrs soda free, I drink a lot … 2026-05-30… 0        
## 3 @roznerniyesuhmives8156 since 1990s i dring pepsi & col… 2026-05-30… 1        
## 4 @td4079                 Coca Cola is gross. Now if we w… 2026-05-30… 0        
## 5 @frankrumble6754        Coke is good.... I am 66 and I … 2026-05-30… 0        
## 6 @saffy4352              Absolutely dumbfounded people! … 2026-05-30… 0
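
Every column in the output above is still stored as text, including the timestamps and like counts. The sketch below shows one way to convert them to proper types, using three rows of values copied from that output. The data frame name `sample_comments` is hypothetical and stands in for the full 100-row table.

```r
# A few values copied from the output above; the real data has 100 rows.
sample_comments <- data.frame(
  publishedAt = c("2026-05-29T20:36:47Z", "2026-05-30T05:17:00Z", "2026-05-30T10:45:36Z"),
  likeCount   = c("0", "0", "1"),
  stringsAsFactors = FALSE
)

# Parse the ISO 8601 timestamps (the trailing Z marks UTC).
sample_comments$publishedAt <- as.POSIXct(
  sample_comments$publishedAt,
  format = "%Y-%m-%dT%H:%M:%SZ",
  tz = "UTC"
)

# Convert like counts from text to integers.
sample_comments$likeCount <- as.integer(sample_comments$likeCount)

str(sample_comments)
```

With typed columns, comments can be sorted by date or filtered by like count without further conversion.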

Number of Comments Collected

nrow(comments_clean)

## [1] 100

Save Comments to CSV

write_csv(comments_clean, "youtube_comments.csv")

Summary of Data Collected

comments_clean$likeCount <- as.numeric(comments_clean$likeCount)

summary_table <- data.frame(
  metric = c("Comments analyzed", "Total comment likes", "Average likes per comment"),
  value = c(
    nrow(comments_clean),
    sum(comments_clean$likeCount, na.rm = TRUE),
    round(mean(comments_clean$likeCount, na.rm = TRUE), 1)
  )
)

kable(summary_table, caption = "Summary of Collected YouTube Comment Data")

Summary of Collected YouTube Comment Data

| metric                    | value |
|:--------------------------|------:|
| Comments analyzed         |   100 |
| Total comment likes       |     5 |
| Average likes per comment |     0 |
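
The average of 0 in the table is a rounding effect: 5 likes across 100 comments is 0.05 likes per comment, which rounding to one decimal place reduced to 0. A small sketch of the same calculation with two decimal places follows; the variable names are illustrative, and the two inputs are taken from the table above.

```r
total_likes <- 5    # "Total comment likes" from the table above
n_comments  <- 100  # "Comments analyzed" from the table above

avg_likes <- total_likes / n_comments

# Two decimal places keep the small average visible.
round(avg_likes, 2)
```

Rounding to two decimals reports 0.05 rather than 0, which better reflects that a few comments did receive likes.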

Clean Text for Analysis

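# Load the tidytext stop-word lexicon
data(stop_words)

words <- comments_clean %>%
  select(textOriginal) %>%
  unnest_tokens(word, textOriginal) %>%        # one lowercase word per row
  anti_join(stop_words, by = "word") %>%       # drop common English stop words
  filter(!str_detect(word, "^[0-9]+$")) %>%    # drop tokens made only of digits
  filter(str_length(word) > 2)                 # drop words shorter than three characters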
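
The word counts later in this report still list the contractions it’s and i’m, most likely because the comments use curly apostrophes while the stop-word list uses straight ones, so the two never match. A minimal sketch of one possible fix is shown below. The helper `normalize_apostrophes()` is hypothetical and not part of the original analysis.

```r
# Replace curly single quotes (U+2018, U+2019) with a straight apostrophe.
normalize_apostrophes <- function(x) {
  gsub("[\u2018\u2019]", "'", x)
}

# Example with the two contractions that reached the top-20 list.
normalize_apostrophes(c("it\u2019s", "i\u2019m"))
```

Applied to `textOriginal` before `unnest_tokens()`, this would let `anti_join()` match contractions such as "it's" against the stop-word list.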

Word Frequency Analysis

word_freq <- words %>%
  count(word, sort = TRUE)

head(word_freq, 20)

## # A tibble: 20 × 2
##    word         n
##    <chr>    <int>
##  1 coca        28
##  2 cola        27
##  3 drink       24
##  4 water       21
##  5 coke        18
##  6 people      12
##  7 diabetes    10
##  8 drinking     9
##  9 mexican      8
## 10 sugar        8
## 11 drinks       7
## 12 mexico       7
## 13 bad          6
## 14 day          6
## 15 soda         6
## 16 diet         5
## 17 health       5
## 18 it’s         5
## 19 i’m          5
## 20 love         5
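
Several entries in this table are forms of the same word: drink, drinking, and drinks are counted separately. The sketch below shows one way to combine such variants, using counts copied from the table above. The lookup `base_form` is a hypothetical hand-made mapping, not a stemming method used in the original analysis.

```r
# Counts copied from the frequency table above.
word_freq_sample <- data.frame(
  word = c("drink", "drinking", "drinks", "coca", "cola", "coke"),
  n    = c(24, 9, 7, 28, 27, 18),
  stringsAsFactors = FALSE
)

# Hypothetical lookup that maps each variant to one base form.
base_form <- c(drinking = "drink", drinks = "drink")

# Replace a word with its base form when one is defined; otherwise keep it.
mapped <- base_form[word_freq_sample$word]
word_freq_sample$word <- ifelse(is.na(mapped), word_freq_sample$word, unname(mapped))

# Re-total the counts per base form and sort from most to least frequent.
combined <- aggregate(n ~ word, data = word_freq_sample, FUN = sum)
combined <- combined[order(-combined$n), ]
combined
```

Under this mapping the three forms of drink total 40, which would place it ahead of coca (28) as the most frequent term. The brand tokens are left separate here because a single mention of "Coca Cola" already contributes to both coca and cola.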

Top 20 Most Common Words

kable(
  head(word_freq, 20),
  caption = "Top 20 Most Common Words in the YouTube Comments"
)

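Top 20 Most Common Words in the YouTube Comments

| word     |  n |
|:---------|---:|
| coca     | 28 |
| cola     | 27 |
| drink    | 24 |
| water    | 21 |
| coke     | 18 |
| people   | 12 |
| diabetes | 10 |
| drinking |  9 |
| mexican  |  8 |
| sugar    |  8 |
| drinks   |  7 |
| mexico   |  7 |
| bad      |  6 |
| day      |  6 |
| soda     |  6 |
| diet     |  5 |
| health   |  5 |
| it’s     |  5 |
| i’m      |  5 |
| love     |  5 |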

Bar Chart

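word_freq %>%
  slice_max(n, n = 15, with_ties = FALSE) %>%  # exactly 15 rows, even if counts tie at the cutoff
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +                               # horizontal bars keep word labels readable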
  labs(
    title = "Top 15 Most Common Words in YouTube Comments",
    x = "Word",
    y = "Frequency"
  ) +
  theme_minimal()

Word Cloud

set.seed(123)

wordcloud(
  words = word_freq$word,
  freq = word_freq$n,
  min.freq = 2,
  max.words = 100,
  random.order = FALSE,
  colors = brewer.pal(8, "Dark2")
)

Results and Discussion

A total of 100 YouTube comments were collected from the video.

The word frequency analysis shows that the most common words in the comments were coca, cola, drink, water, coke, people, diabetes, drinking, mexican, and sugar. These words closely reflect the video’s focus on the high consumption of sugary drinks in Chiapas, Mexico, and the resulting public health concerns.

The frequent appearance of words such as sugar, diabetes, health, and water suggests that many viewers were discussing the health risks associated with excessive soft drink consumption. Words such as mexican, mexico, and people show that commenters also tied the issue to the country and its population, although word counts alone cannot confirm which social, cultural, or economic factors they had in mind. Overall, the comments indicate that viewers engaged with the video’s message about nutrition, public health, and the impact of Coca-Cola on communities in southern Mexico.
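
Word totals do not show how many separate comments raise a topic, since one comment can repeat a word several times. The sketch below counts comments that mention at least one health-related term. It uses only the six comments printed by `head(comments_raw)` earlier in the report, so it illustrates the method and is not a result for the full set of 100. The names `sample_text` and `health_terms` are hypothetical.

```r
# The six comments printed by head(comments_raw) earlier (emoji omitted).
sample_text <- c(
  "How can you blame Coca-Cola for these people dying of diabetes? When they're eating all the other junk too. You must be working for Pepsi,",
  "7 yrs soda free, I drink a lot cofee though",
  "since 1990s i dring pepsi & cola light without any sugar but unfortunately, my teeth gone because of this acid stuff",
  "Coca Cola is gross. Now if we were talking Pepsi I\u2019d understand the obsession but Coke?",
  "Coke is good.... I am 66 and I am drinking it since age 11 ....It is my water....",
  "Absolutely dumbfounded people! Wtf!"
)

# Health-related terms named in the discussion above.
health_terms <- c("sugar", "diabetes", "health", "water")
pattern <- paste(health_terms, collapse = "|")

# TRUE for each comment that mentions at least one term, ignoring case.
mentions_health <- grepl(pattern, sample_text, ignore.case = TRUE)

sum(mentions_health)   # number of comments with a health term
mean(mentions_health)  # share of comments with a health term
```

In this small sample, three of the six comments contain one of the terms. Running the same check on `comments_clean$textOriginal` would give the share for all collected comments.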

Lab Exercise #4: YouTube Comment Word Frequency Analysis

Leilani De La Cruz

2026-07-03