For this assignment, I collected and analyzed YouTube comments related to CoreWeave, the AI cloud infrastructure company. This continues a thread from my earlier labs (the NewsAPI sentiment analysis and the Bluesky exercise), which tracked public discussion of CoreWeave’s role in the AI infrastructure buildout.
My data source is the comment section of a Big Technology Podcast episode on YouTube that features a debate about CoreWeave (video ID: m1uh7Ka6868). I chose this video because it includes substantive discussion of CoreWeave’s business model and of the broader debate over how AI infrastructure is financed, which makes its comment section a good source of public opinion on the topic.
I collected the comments through the YouTube Data API v3, using the tuber package in R with OAuth authentication. The steps I used are shown below:
library(tuber)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
library(tidytext)
library(ggplot2)
library(wordcloud2)
comments <- read.csv("coreweave_comments_raw.csv", stringsAsFactors = FALSE)
nrow(comments)
## [1] 119
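The CSV loaded above was saved from an earlier collection step that is not shown in this report. As a rough sketch only, and not the exact script I ran, that step could look like the following with tuber. The helper name `collect_comments` and the credential arguments are hypothetical; `yt_oauth()` and `get_all_comments()` are tuber functions.

```r
# Hypothetical helper: a sketch of how the raw CSV could be produced.
# The function name and its arguments are illustrative assumptions.
collect_comments <- function(video_id, out_file, app_id, app_secret) {
  # Start the OAuth flow (opens a browser the first time, then caches a token).
  tuber::yt_oauth(app_id = app_id, app_secret = app_secret)

  # Download the comments for the video as a data frame.
  comments <- tuber::get_all_comments(video_id = video_id)

  # Save the raw data so later steps can start from the CSV.
  write.csv(comments, out_file, row.names = FALSE)
  comments
}

# Example call (credentials are placeholders, not real values):
# collect_comments("m1uh7Ka6868", "coreweave_comments_raw.csv",
#                  app_id = "YOUR_CLIENT_ID", app_secret = "YOUR_CLIENT_SECRET")
```

Wrapping the calls in a function keeps the API request from running every time the report is knitted; the analysis itself only needs the saved CSV.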
Scraping the video produced 119 comments from 76 unique commenters, posted between January and June 2026, with an average of about 2.19 likes per comment. After removing stopwords and words shorter than three characters, the data contained 761 unique tokens. The top three were “interview,” “coreweave,” and “providers,” followed closely by financially themed terms such as “company,” “money,” “cloud,” “data,” “gpus,” “chips,” and “bubble.” I think this means that viewers are engaged less with the podcast format itself and more with the underlying financial debate about CoreWeave’s business model.
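# Keep only the comment text
texts <- comments %>% select(textOriginal)
# Lowercase, replace everything except letters and whitespace, collapse spaces
clean_text <- texts %>%
  mutate(text = str_to_lower(textOriginal)) %>%
  mutate(text = str_replace_all(text, "[^a-z\\s]", " ")) %>%
  mutate(text = str_squish(text))
data("stop_words")
# Split each comment into one word per row
tokens <- clean_text %>%
  select(text) %>%
  unnest_tokens(word, text)
# Drop stopwords and words of one or two characters
tokens_clean <- tokens %>%
  anti_join(stop_words, by = "word") %>%
  filter(str_length(word) > 2)
# Count how often each remaining word appears
word_counts <- tokens_clean %>%
  count(word, sort = TRUE)
head(word_counts, 20)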
## word n
## 1 interview 32
## 2 coreweave 20
## 3 providers 14
## 4 company 13
## 5 money 11
## 6 alex 10
## 7 guys 10
## 8 cloud 9
## 9 data 9
## 10 don 9
## 11 questions 9
## 12 gpus 8
## 13 tier 8
## 14 bubble 7
## 15 build 7
## 16 chips 7
## 17 gpu 7
## 18 michael 7
## 19 amazing 6
## 20 building 6
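The table includes “don,” which is most likely what remains of “don't” after the cleaning step turns the apostrophe into a space and the leftover “t” is dropped by the length filter. It also counts “gpu” and “gpus” separately. A possible alternative, shown here on toy sentences rather than the real comments, keeps apostrophes so the stopword list can remove contractions and merges the plural by hand. The toy text and the choice to merge only “gpus” are assumptions for illustration.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Toy comments standing in for comments$textOriginal (illustrative only).
toy <- tibble(textOriginal = c("I don't trust GPU debt",
                               "GPUs don\u2019t last"))

counts_alt <- toy %>%
  mutate(text = str_to_lower(textOriginal)) %>%
  # Normalize curly apostrophes to straight ones.
  mutate(text = str_replace_all(text, "\u2019", "'")) %>%
  # Keep apostrophes so contractions such as "don't" stay in one piece.
  mutate(text = str_replace_all(text, "[^a-z'\\s]", " ")) %>%
  mutate(text = str_squish(text)) %>%
  unnest_tokens(word, text) %>%
  # Hand-written merge of one plural form (an assumption, extend as needed).
  mutate(word = if_else(word == "gpus", "gpu", word)) %>%
  # The stopword list contains "don't", so the contraction is removed here.
  anti_join(stop_words, by = "word") %>%
  filter(str_length(word) > 2) %>%
  count(word, sort = TRUE)

print(counts_alt)
```

I did not rerun the analysis this way, so the counts reported in this lab are from the original cleaning step.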
# Bar chart of the 20 most frequent words
word_counts %>%
  # with_ties = FALSE keeps exactly 20 rows even when counts are tied
  slice_max(n, n = 20, with_ties = FALSE) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = n, y = word, fill = n)) +
  geom_col(show.legend = FALSE) +
  scale_fill_gradient(low = "#a1d99b", high = "#006d2c") +
  labs(
    title = "Top 20 Most Frequent Words in CoreWeave YouTube Comments",
    subtitle = "Source: Big Technology Podcast - CoreWeave Debate",
    x = "Frequency",
    y = NULL
  ) +
  theme_minimal(base_size = 13)
# Word cloud of words that appear at least twice
wc_data <- word_counts %>%
  filter(n >= 2) %>%
  slice_max(n, n = 100)
wordcloud2(
  data = wc_data,
  size = 0.6,
  color = "random-dark",
  backgroundColor = "white"
)
cat("=== Dataset Summary ===\n")
## === Dataset Summary ===
cat("Total comments collected: ", nrow(comments), "\n")
## Total comments collected: 119
cat("Unique commenters: ", n_distinct(comments$authorDisplayName), "\n")
## Unique commenters: 76
cat("Avg. likes per comment: ", round(mean(comments$likeCount, na.rm = TRUE), 2), "\n")
## Avg. likes per comment: 2.19
cat("Total unique tokens: ", nrow(word_counts), "\n")
## Total unique tokens: 761
cat("Top 3 words: ", paste(head(word_counts$word, 3), collapse = ", "), "\n")
## Top 3 words: interview, coreweave, providers
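The summary above does not print the date range of the comments. A minimal sketch of how that could be checked is below, assuming the raw CSV has a `publishedAt` column with ISO 8601 timestamps (the format the YouTube API returns). The three timestamps are toy values for illustration, not rows from my data.

```r
# Toy stand-in for the raw comments; the `publishedAt` column name and
# format are assumptions about the CSV, and the values are made up.
toy_comments <- data.frame(
  publishedAt = c("2026-01-15T10:30:00Z",
                  "2026-03-02T08:00:00Z",
                  "2026-06-20T23:59:59Z"),
  stringsAsFactors = FALSE
)

# Parse the timestamps as UTC date-times.
published <- as.POSIXct(toy_comments$publishedAt,
                        format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")

# Earliest and latest posting dates (UTC).
date_range <- range(as.Date(published, tz = "UTC"))

cat("First comment: ", format(date_range[1]), "\n")
cat("Last comment:  ", format(date_range[2]), "\n")
```

On the real data, `toy_comments` would be replaced with the `comments` data frame.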
The word frequency results for comments on the CoreWeave debate video suggest that audience engagement centered heavily on financial mechanics rather than on the podcast content itself. Beyond expected terms like “coreweave,” “interview,” and “providers,” high-frequency words such as “money,” “gpus,” “chips,” “cloud,” “bubble,” and “crypto” point to a viewer preoccupation with how CoreWeave is financed rather than what it builds. This is a real, ongoing debate in the financial media: CoreWeave has used Nvidia GPUs as collateral for debt, a practice described as central to assessing whether AI infrastructure spending reflects durable demand or speculative excess (Quartz, 2026). The presence of “bubble” and “crypto” among the frequent terms is notable, since CoreWeave originated as a cryptocurrency mining operation before moving to AI compute, and commenters appear to be drawing a connection between that history and skepticism about the durability of the current AI infrastructure boom. Overall, after looking at the comment section, I came to the conclusion that it serves less as a discussion of the podcast’s content and more as a place to argue over whether GPU debt financing represents a true infrastructure investment or a repeat of past financial bubbles.
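One rough way to check this interpretation is to group the top-20 words into themes and compare their totals. The sketch below uses the counts from the top-20 table in this lab. The theme assignments are my own hypothetical grouping, and a different grouping would give different shares, so this is a sanity check rather than a formal result.

```r
library(dplyr)

# Counts copied from the top-20 table in this report.
top20 <- tibble(
  word = c("interview", "coreweave", "providers", "company", "money",
           "alex", "guys", "cloud", "data", "don", "questions", "gpus",
           "tier", "bubble", "build", "chips", "gpu", "michael",
           "amazing", "building"),
  n = c(32, 20, 14, 13, 11, 10, 10, 9, 9, 9, 9, 8, 8, 7, 7, 7, 7, 7, 6, 6)
)

# Hypothetical theme lists (an assumption, not a validated coding scheme).
finance_hw <- c("company", "money", "cloud", "data",
                "gpus", "gpu", "chips", "bubble")
podcast <- c("interview", "questions")

theme_totals <- top20 %>%
  mutate(theme = case_when(
    word %in% finance_hw ~ "finance / infrastructure",
    word %in% podcast ~ "podcast / interview",
    TRUE ~ "other"
  )) %>%
  group_by(theme) %>%
  summarise(total = sum(n), .groups = "drop") %>%
  # Share of all top-20 word occurrences that fall in each theme.
  mutate(share = total / sum(total))

print(theme_totals)
```

Words I could not place confidently, such as names and “coreweave” itself, are left in “other” so they do not tilt the comparison either way.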
Quartz. (2026, May 8). GPU-collateralized debt explained: AI financing risks. https://qz.com/gpu-collateralized-debt-ai-neocloud-coreweave-financing-risks-050526