Video Link: https://www.youtube.com/watch?v=KCZJ6Ttsp-A
Video Title: How To Invest in ETFs | Ultimate Guide
For my video, I chose an industry topic that I started getting into this year: investing in the stock market, and more specifically in ETFs. I’ve used this video and many others to help me decide which ETFs are worth investing in, so I’m curious to see what this YouTube scraping exercise reveals about it.
For my code, I added both a word frequency chart and a word cloud, and together they give some insight into what viewers thought of the video. The most common words in the word frequency chart were video, market, investing, etf, portfolio, and stocks. These all relate closely to the topic of the video, which suggests that many commenters are engaging with its content. The word cloud shows many of the same words along with a wider vocabulary, including positive words such as informative, amazing, and easy that reflect the video’s strong delivery of ETF guidance. Although these positive comments speak to the quality of the video, I also noticed words that seem too broad or too random to have much to do with it. I suspect that short comments such as “Nice video”, “Amazing video”, or “Informative video” are what push a generic word like video to the top of the frequency chart. Bot accounts are another concern: they can post “fake” comments that inflate the comment count and skew the results. Judging from these visualizations alone, though, there do not seem to be many bots commenting, because most of the frequent words relate to the general themes of the video, such as investing, growth, and portfolios.
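One way to test the suspicion about generic praise is to count how many comments consist of nothing but a short compliment. The sketch below runs on a made-up character vector rather than the scraped data. The pattern and the helper name is_generic_praise are illustrative assumptions, not part of the original analysis.

```r
# Toy comments standing in for the scraped text (not real data)
toy_comments <- c(
  "Nice video",
  "Amazing video!",
  "How should I split my portfolio across two ETFs?",
  "informative video",
  "Great breakdown of expense ratios"
)

# Flag comments that are only an adjective followed by "video"
is_generic_praise <- function(x) {
  grepl("^(nice|amazing|great|informative) video[.!]*$", trimws(tolower(x)))
}

generic <- is_generic_praise(toy_comments)
sum(generic)   # number of generic praise comments
mean(generic)  # share of all comments
```

A high share on the real comments would support the idea that video tops the chart because of short compliments rather than discussion.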
By the end of this tutorial, you will be able to:

1. Register a project in Google Cloud Console and enable the YouTube Data API v3.
2. Create OAuth 2.0 credentials and authorize R to access YouTube data on your behalf.
3. Use the tuber package to pull comments from any public YouTube video.
4. Clean the scraped data with dplyr and export it to a CSV file.
5. Recognize and fix the most common errors students run into during this process.

# Prerequisites

- R and RStudio installed
- A Google account (a personal Gmail account works fine)
- Packages: tuber, dplyr, readr (the word analysis in Part 4 also loads tidytext, wordcloud, RColorBrewer, and ggplot2)
- A YouTube video URL you want to scrape comments from (we’ll use a real example below)

---

# Part 1 — Get API Access from Google Cloud Console

Before R can talk to YouTube, you need to tell Google that your project is allowed to request data. This happens in three steps inside the Google Cloud Console.

## Step 1.1 — Create a project and enable the YouTube Data API v3

1. Sign in to console.cloud.google.com and create a new project (or pick an existing one) using the project selector at the top of the page.
2. Go to APIs & Services → Library.
3. Search for YouTube Data API v3 and open it.
4. Click Enable.
Figure: Enabling the YouTube Data API v3 for your project.
## Step 1.2 — Create an OAuth 2.0 Client ID

The tuber package authenticates as you, not as an anonymous script, so you need an OAuth client rather than a plain API key.

1. Go to APIs & Services → Credentials.
2. Click + CREATE CREDENTIALS → OAuth client ID.
Figure: Creating credentials: choose OAuth client ID.
3. For Application type, choose Web application and give it any name (e.g., “MSBA 580 YouTube Scraper”).
4. Under Authorized redirect URIs, click + ADD URI and enter exactly:

   http://localhost:1410/

   Why this exact URL? tuber authenticates through the httr package, which spins up a temporary local web server on port 1410 to catch Google’s response. If this URI doesn’t match exactly (including the trailing slash), authentication will fail.
5. Click Create. Google will show you a Client ID and Client Secret. Copy both somewhere safe; you’ll paste them into R in Part 2.

## Step 1.3 — Add yourself as a test user

New OAuth apps start in Testing mode, which means only approved accounts can authenticate. If you skip this step, you’ll hit a 403: access_denied error the first time you try to log in from R.

1. Go to APIs & Services → OAuth consent screen.
2. Scroll to Test users and click + ADD USERS.
3. Enter the Gmail address you’ll use to authenticate (your own account is fine) and click Save.

Figure: Adding yourself as a test user under the OAuth consent screen.
---
# Part 2 — Connect R to the API
## Step 2.1 — Load the package
``` r
library(tuber)
library(dplyr)
library(readr)
```
## Step 2.2 — Authenticate with yt_oauth()
Paste the Client ID and Client Secret you copied in Step 1.2 below.

> Never commit real credentials to GitHub or share them in a script you hand in. Treat them like a password: store them in a separate, untracked file (e.g., an .Renviron file) for real projects.
library(tuber)
app_id <- "YOUTUBE_APP_ID"
app_secret <- "YOUTUBE_APP_SECRET"
yt_oauth(app_id, app_secret)
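As a minimal sketch of the .Renviron advice, the credentials could be read from environment variables instead of being typed into the script. The variable names YOUTUBE_APP_ID and YOUTUBE_APP_SECRET and the helper read_yt_credentials() are assumptions chosen for illustration. The .Renviron file would contain one NAME=value line for each of them.

```r
# Read OAuth credentials from environment variables instead of hard-coding them.
# The variable names below are assumptions; use whatever names you put in .Renviron.
read_yt_credentials <- function() {
  app_id     <- Sys.getenv("YOUTUBE_APP_ID", unset = "")
  app_secret <- Sys.getenv("YOUTUBE_APP_SECRET", unset = "")
  if (!nzchar(app_id) || !nzchar(app_secret)) {
    stop("Set YOUTUBE_APP_ID and YOUTUBE_APP_SECRET in .Renviron, then restart R.")
  }
  list(app_id = app_id, app_secret = app_secret)
}

# Usage (needs the tuber package and opens a browser, so it is left commented out):
# creds <- read_yt_credentials()
# tuber::yt_oauth(creds$app_id, creds$app_secret)
```

Failing early with a clear message is a design choice here: an empty string passed to yt_oauth() would otherwise surface later as a less obvious authentication error.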
When you run yt_oauth(), R will ask:
Use a local file ('.httr-oauth'), to cache OAuth access credentials
between R sessions?
1: Yes
2: No
Choose 1: Yes so you don’t have to log in again every session.

## Step 2.3 — Approve access in your browser

A browser window will open automatically. Because the app is still in Testing mode, you’ll see a warning screen first:
Figure: This warning is expected for apps in Testing mode; click Continue.
Click Continue, then Allow on the next screen that lists what the app can access. Your browser tab will then say “Authentication complete. Please close this page and return to R.” and your R console will print Authentication complete.

---

# Part 3 — Scrape Comments from a Video Step by Step

## Step 3.1 — Get the video ID

The video ID is the part of the URL after v=. For example, the code below demonstrates how to scrape comments from a YouTube video about investing in ETFs step by step.

    https://www.youtube.com/watch?v=KCZJ6Ttsp-A
                                    └────┬────┘
                                      video_id
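Rather than copying the ID by hand, it can be pulled out of the URL with a regular expression. This is a small base R sketch. The helper name extract_video_id() is hypothetical, and it only handles watch?v= style URLs, not youtu.be short links.

```r
# Extract the value of the v= query parameter from a YouTube watch URL.
# Returns NA when the URL has no v= parameter.
extract_video_id <- function(url) {
  m <- regmatches(url, regexec("[?&]v=([^&#]+)", url))[[1]]
  if (length(m) < 2) NA_character_ else m[2]
}

extract_video_id("https://www.youtube.com/watch?v=KCZJ6Ttsp-A")
# [1] "KCZJ6Ttsp-A"

# Extra parameters after the ID are ignored
extract_video_id("https://www.youtube.com/watch?v=KCZJ6Ttsp-A&t=30s")
# [1] "KCZJ6Ttsp-A"
```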
## Step 3.2 — Scrape the comments
video_id <- "KCZJ6Ttsp-A" # "How To Invest in ETFs | Ultimate Guide"
comments_raw <- get_all_comments(video_id = video_id)
head(comments_raw)
## authorDisplayName
## 1 @JoshuaMayo
## 2 @honeypotqueens9865
## 3 @Dee-rc2lt
## 4 @devonpruitt9456
## 5 @devonpruitt9456
## 6 @dahirukabiru672
## authorProfileImageUrl
## 1 https://yt3.ggpht.com/xKDMsCpOCBE-trjD1xQzNaiR6_xAUKVzHv4OQ_XHrV6jbHR2G5_NHK5d3P_wkWWaSMsqiBgdfA=s48-c-k-c0x00ffffff-no-rj
## 2 https://yt3.ggpht.com/B5hFbAfoCnwej2iVnBh2IEaZZGNNm92i8uvj3LBOItFRPCocoWeo9NawBCvc-eNlvwj7XKtv1Q=s48-c-k-c0x00ffffff-no-rj
## 3 https://yt3.ggpht.com/ytc/AIdro_n6Mc61AwG8RuqiyQtG_ureSrBRiRvFrgCLR1KjMr4=s48-c-k-c0x00ffffff-no-rj
## 4 https://yt3.ggpht.com/ytc/AIdro_m_h2e2XEszk7BVOmbnFWUv1lRV3qfyKH0bTdMDr3I=s48-c-k-c0x00ffffff-no-rj
## 5 https://yt3.ggpht.com/ytc/AIdro_m_h2e2XEszk7BVOmbnFWUv1lRV3qfyKH0bTdMDr3I=s48-c-k-c0x00ffffff-no-rj
## 6 https://yt3.ggpht.com/F6CEnnNFXADq27uyZ0hgiZg-tuKYnamxJ_fi6SsrwjQ48BsGeaS4UP8bRsgUqkuAzpWmI1TTIw=s48-c-k-c0x00ffffff-no-rj
## authorChannelUrl authorChannelId.value
## 1 http://www.youtube.com/@JoshuaMayo UCJZ7zr9a6AT6STkyOFztgiQ
## 2 http://www.youtube.com/@honeypotqueens9865 UCt60aQ7XDsuIS8D4YSEfpMg
## 3 http://www.youtube.com/@Dee-rc2lt UC6VB3xX_6tkDwx_zMYxWy_Q
## 4 http://www.youtube.com/@devonpruitt9456 UCQ7kiaSi7o3HPQKDYn7N4Lg
## 5 http://www.youtube.com/@devonpruitt9456 UCQ7kiaSi7o3HPQKDYn7N4Lg
## 6 http://www.youtube.com/@dahirukabiru672 UCG5T5Fb4lEflBLbjJJyt_gg
## videoId
## 1 KCZJ6Ttsp-A
## 2 KCZJ6Ttsp-A
## 3 KCZJ6Ttsp-A
## 4 KCZJ6Ttsp-A
## 5 KCZJ6Ttsp-A
## 6 KCZJ6Ttsp-A
## textDisplay
## 1 A monster of an ETF guide! Let me know if there are other videos you'd guys like to see. 👍
## 2 Which CDs 💿 would you recommend with the highest dividends and compound interest
## 3 Proper portfolio allocation
## 4 @Dee-rc2lt this!!!!
## 5 Brother , I am proud of you. I am a new financial advisor in this field . You probably have your thoughts about my profession lol. However , watching your videos help me refresh my knowledge and mirror the way you convey financial principles in a clear and concise manner. I am inspired to create my own platform . Rarely do I subscribe to specific people , but you caught my interest . Blessings upon you. Please continue to share!! <br><br>Request : What are your thoughts about portfolio allocation across multiple accounts (ex. 401K + IRA).
## 6 Hey.... You can get connected to Mrs Anna with this number here 👆she is always online
## textOriginal
## 1 A monster of an ETF guide! Let me know if there are other videos you'd guys like to see. 👍
## 2 Which CDs 💿 would you recommend with the highest dividends and compound interest
## 3 Proper portfolio allocation
## 4 @Dee-rc2lt this!!!!
## 5 Brother , I am proud of you. I am a new financial advisor in this field . You probably have your thoughts about my profession lol. However , watching your videos help me refresh my knowledge and mirror the way you convey financial principles in a clear and concise manner. I am inspired to create my own platform . Rarely do I subscribe to specific people , but you caught my interest . Blessings upon you. Please continue to share!! \n\nRequest : What are your thoughts about portfolio allocation across multiple accounts (ex. 401K + IRA).
## 6 Hey.... You can get connected to Mrs Anna with this number here 👆she is always online
## canRate viewerRating likeCount publishedAt updatedAt
## 1 TRUE none 460 2022-02-05T00:09:39Z 2022-02-05T00:09:39Z
## 2 TRUE none 10 2022-02-05T03:10:04Z 2022-02-05T03:10:04Z
## 3 TRUE none 9 2022-02-05T14:41:11Z 2022-02-05T14:41:11Z
## 4 TRUE none 1 2022-02-06T10:53:06Z 2022-02-06T10:53:06Z
## 5 TRUE none 8 2022-02-06T10:54:07Z 2022-02-06T10:54:07Z
## 6 TRUE none 0 2022-02-07T01:59:33Z 2022-02-07T01:59:33Z
## id moderationStatus
## 1 Ugw8K1L11NUIgrm8Qop4AaABAg <NA>
## 2 Ugw8K1L11NUIgrm8Qop4AaABAg.9Y2IzT9_3519Y2cctUVmwA <NA>
## 3 Ugw8K1L11NUIgrm8Qop4AaABAg.9Y2IzT9_3519Y3rijJrazd <NA>
## 4 Ugw8K1L11NUIgrm8Qop4AaABAg.9Y2IzT9_3519Y61Q6M2Lwa <NA>
## 5 Ugw8K1L11NUIgrm8Qop4AaABAg.9Y2IzT9_3519Y61XYIgVk- <NA>
## 6 Ugw8K1L11NUIgrm8Qop4AaABAg.9Y2IzT9_3519Y7e973jW4o <NA>
## parentId
## 1 <NA>
## 2 Ugw8K1L11NUIgrm8Qop4AaABAg
## 3 Ugw8K1L11NUIgrm8Qop4AaABAg
## 4 Ugw8K1L11NUIgrm8Qop4AaABAg
## 5 Ugw8K1L11NUIgrm8Qop4AaABAg
## 6 Ugw8K1L11NUIgrm8Qop4AaABAg
##
## --- Tuber Metadata ---
## function: get_all_comments api_calls: 47 results_found: 1654 timestamp: 2026-06-29 17:22:38
## (Use tuber_info() for full metadata)
Figure: Scraping YouTube comments.
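In the printed output above, the first row has parentId set to NA while the following rows carry the ID of the comment they answer. Assuming that pattern holds for the whole scrape, top-level comments and replies can be separated as sketched below. The data frame is a toy with made-up IDs, not the real result.

```r
# Toy data frame mimicking two columns of the scraped output (IDs are made up)
toy <- data.frame(
  id       = c("A", "A.1", "A.2", "B"),
  parentId = c(NA, "A", "A", NA),
  stringsAsFactors = FALSE
)

# A comment is a reply when it points at a parent comment
toy$is_reply <- !is.na(toy$parentId)

top_level <- toy[!toy$is_reply, ]
replies   <- toy[toy$is_reply, ]

# Number of replies under each top-level comment
reply_counts <- table(replies$parentId)
reply_counts
```

Keeping the two groups apart matters for engagement questions, since a long reply thread under one comment is a different signal from many independent comments.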
get_all_comments() returns a data frame with one row per top-level comment and reply. Depending on the video’s popularity, the scrape can take anywhere from a few seconds to a few minutes.

## Step 3.3 — Lifesaver trick: recovering output you forgot to save

It happens to everyone: you run get_all_comments(video_id = "...") directly in the console without assigning it to anything, and the scrape (which can take a while) finishes, but the result wasn’t saved anywhere. As long as you haven’t run anything else in the console since, R keeps the most recent top-level result in .Last.value. Because the API enforces a usage quota, you do not want to run the same scrape again and again, so assign .Last.value to a variable right away (see the screenshots below).
comments1 <- .Last.value
Figures: Scraping YouTube comments (two screenshots of saving the scraping results with .Last.value).
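A complementary safeguard is to write the scrape to disk the first time it succeeds and load that file on later runs. The helper cached_scrape() below is hypothetical (it is not a tuber function), and the demo uses a stand-in function instead of a real API call so it runs without credentials.

```r
# Run `scrape` only if no cached copy exists at `path`; otherwise load the cache.
# cached_scrape() is a hypothetical helper, not part of tuber.
cached_scrape <- function(path, scrape) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- scrape()
  saveRDS(result, path)
  result
}

# Demo with a stand-in for the real API call
calls <- 0
fake_scrape <- function() {
  calls <<- calls + 1
  data.frame(id = c("a", "b"), textOriginal = c("first", "second"))
}

cache_file <- tempfile(fileext = ".rds")
first  <- cached_scrape(cache_file, fake_scrape)  # runs fake_scrape()
second <- cached_scrape(cache_file, fake_scrape)  # reads the .rds file instead
```

With the real scrape, the second argument would be something like function() get_all_comments(video_id = video_id) and the path a file inside your project folder.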
Recovering the result this way is much faster than re-scraping the video from scratch.

---

# Part 4 — Clean the Data with dplyr and the pipe operator (%>% or |>)

## Step 4.1 — Always check the real column names first

Don’t guess at column names: tuber’s output doesn’t always match what you’d expect. For example, the unique comment identifier is stored in a column called id, not comment_id. Run this first:
# Inspect the scrape itself. comments1 only holds the comments if
# .Last.value was captured immediately after get_all_comments() finished.
head(comments_raw)
glimpse(comments_raw)
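The same check can be made explicit before any pipeline runs. The sketch below, in base R, reports which expected columns are absent. The helper name missing_columns() is an assumption, and the toy data frame only borrows a few of the column names shown in the head() output earlier.

```r
# Return the required column names that are absent from a data frame
missing_columns <- function(df, required) {
  setdiff(required, names(df))
}

# Toy data frame using a few of tuber's column names
toy <- data.frame(id = "x1", textOriginal = "hello", likeCount = 3)

missing_columns(toy, c("id", "textOriginal"))          # character(0): safe to select
missing_columns(toy, c("comment_id", "textOriginal"))  # "comment_id": select() would fail
```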
Once you know the real column names, run the scrape through a dplyr pipeline that keeps only the columns you need, drops missing and duplicate comment text, renames textOriginal to text, and adds a sequential comment_id:
comments_df <- comments_raw
names(comments_df)
## [1] "authorDisplayName" "authorProfileImageUrl" "authorChannelUrl"
## [4] "authorChannelId.value" "videoId" "textDisplay"
## [7] "textOriginal" "canRate" "viewerRating"
## [10] "likeCount" "publishedAt" "updatedAt"
## [13] "id" "moderationStatus" "parentId"
comments_df <- comments_df %>%
  select(authorDisplayName, textOriginal, publishedAt, likeCount) %>%
  filter(!is.na(textOriginal)) %>%
  distinct(textOriginal, .keep_all = TRUE) %>%
  rename(text = textOriginal) %>%
  mutate(comment_id = row_number())
head(comments_df)
## authorDisplayName
## 1 @JoshuaMayo
## 2 @honeypotqueens9865
## 3 @Dee-rc2lt
## 4 @devonpruitt9456
## 5 @devonpruitt9456
## 6 @dahirukabiru672
## text
## 1 A monster of an ETF guide! Let me know if there are other videos you'd guys like to see. 👍
## 2 Which CDs 💿 would you recommend with the highest dividends and compound interest
## 3 Proper portfolio allocation
## 4 @Dee-rc2lt this!!!!
## 5 Brother , I am proud of you. I am a new financial advisor in this field . You probably have your thoughts about my profession lol. However , watching your videos help me refresh my knowledge and mirror the way you convey financial principles in a clear and concise manner. I am inspired to create my own platform . Rarely do I subscribe to specific people , but you caught my interest . Blessings upon you. Please continue to share!! \n\nRequest : What are your thoughts about portfolio allocation across multiple accounts (ex. 401K + IRA).
## 6 Hey.... You can get connected to Mrs Anna with this number here 👆she is always online
## publishedAt likeCount comment_id
## 1 2022-02-05T00:09:39Z 460 1
## 2 2022-02-05T03:10:04Z 10 2
## 3 2022-02-05T14:41:11Z 9 3
## 4 2022-02-06T10:53:06Z 1 4
## 5 2022-02-06T10:54:07Z 8 5
## 6 2022-02-07T01:59:33Z 0 6
##
## --- Tuber Metadata ---
## function: get_all_comments api_calls: 47 results_found: 1654 timestamp: 2026-06-29 17:22:38
## (Use tuber_info() for full metadata)
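It is worth seeing what deduplicating on the comment text does, because it also drops identical comments written by different people. The toy data below is invented for illustration. Only the filter() and distinct() steps mirror the pipeline above.

```r
library(dplyr)

# Invented comments: two different authors post the same short compliment
toy <- data.frame(
  authorDisplayName = c("@a", "@b", "@c", "@c"),
  textOriginal      = c("Nice video", "Nice video", "What about bonds?", NA),
  stringsAsFactors  = FALSE
)

deduped <- toy %>%
  filter(!is.na(textOriginal)) %>%                 # drops the NA row
  distinct(textOriginal, .keep_all = TRUE)         # keeps the first "Nice video" only

deduped
```

Only the rows from @a and @c remain, so word counts for short generic phrases come out lower than the number of times they were actually posted. Deduplicating on id instead (before select() drops that column) would keep both compliments.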
library(tidytext)
comments_words <- comments_df %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)
head(comments_words, 20)
## word n
## 1 video 238
## 2 market 184
## 3 investing 178
## 4 etfs 164
## 5 etf 160
## 6 portfolio 142
## 7 financial 123
## 8 money 123
## 9 stocks 123
## 10 investment 117
## 11 advisor 100
## 12 time 96
## 13 invest 93
## 14 i’m 77
## 15 stock 75
## 16 videos 72
## 17 lot 67
## 18 buy 65
## 19 trading 61
## 20 term 60
##
## --- Tuber Metadata ---
## function: get_all_comments api_calls: 47 results_found: 1654 timestamp: 2026-06-29 17:22:38
## (Use tuber_info() for full metadata)
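The token i’m in row 14 is a stop word that survived the anti_join(), most likely because the comment uses a typographic apostrophe while the stop_words lexicon spells contractions with a plain one. A possible fix, sketched in base R with a hypothetical helper name, is to normalize apostrophes before tokenizing.

```r
# Replace typographic apostrophes (U+2018, U+2019) with the ASCII apostrophe
normalize_apostrophes <- function(x) {
  gsub("\u2018|\u2019", "'", x)
}

normalize_apostrophes("I\u2019m new to ETFs")
# [1] "I'm new to ETFs"
```

In the pipeline this would be one extra step before unnest_tokens(), for example mutate(text = normalize_apostrophes(text)). The word counts shown above were produced without it.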
library(wordcloud)
library(RColorBrewer)
set.seed(123)
wordcloud(
  words = comments_words$word,
  freq = comments_words$n,
  max.words = 100,
  random.order = FALSE,
  colors = brewer.pal(8, "Dark2")
)
library(ggplot2)
comments_words %>%
  slice_max(n, n = 15) %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Top 15 Most Common Words",
    x = "Word",
    y = "Frequency"
  )
Figure: Scraping YouTube comments (screenshot of the error raised when comment_id is not found).
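## Let’s have a discussion

| Function | What it does |
|---|---|
| select(...) | Keeps only the columns you actually need for analysis |
| filter(!is.na(textOriginal)) | Drops rows that have no comment text |
| distinct(textOriginal, .keep_all = TRUE) | Removes rows whose comment text repeats, keeping the first occurrence and all other columns. Use distinct(id, .keep_all = TRUE) before select() if you only want to remove rows the API returned twice |
| rename(text = textOriginal) | Gives the text column a shorter name |
| mutate(comment_id = row_number()) | Adds a sequential identifier for each remaining comment |
| as_tibble() | Optional and not used in the pipeline above: converts the result into a tibble so dplyr verbs behave predictably |

---

# Part 5 — Save Your Data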
write_csv(comments_df, "comments1.csv")
getwd()
## [1] "C:/Users/Darre/OneDrive - CSUCI/Documents"
Once saved, you can reload it anytime without re-scraping:
comments_clean <- read_csv("comments1.csv")
Figure: The final scraped and cleaned comments, opened in Excel.
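Before relying on the saved file, a quick round trip confirms that text containing commas and quotes comes back unchanged. This sketch writes an invented two-row data frame to a temporary folder, so it does not touch comments1.csv. The file name comments_demo.csv is an assumption.

```r
library(readr)

# Invented rows with punctuation that CSV writers have to escape
toy <- data.frame(
  text      = c("Great guide", "Clear, concise, and \"beginner friendly\""),
  likeCount = c(3, 0),
  stringsAsFactors = FALSE
)

path <- file.path(tempdir(), "comments_demo.csv")
write_csv(toy, path)
reloaded <- suppressMessages(read_csv(path))

# Same number of rows and identical text after the round trip
nrow(reloaded) == nrow(toy)
identical(reloaded$text, toy$text)
```

Note that write_csv() with a bare file name saves into the folder reported by getwd(), which is why the tutorial prints it.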
---

# Troubleshooting Field Guide

| What you see | Likely cause | Fix |
|---|---|---|
| Error 403: access_denied, “…has not completed the Google verification process” | Your account isn’t on the test-user list yet | OAuth consent screen → Test users → + ADD USERS → enter your Gmail address (Step 1.3) |
| “Google hasn’t verified this app” | Normal: your app is in Testing publishing status | Click Continue (only do this for apps you created yourself) |
| Error in distinct(): Must use existing variables. x comment_id not found | The real column is named id, not comment_id | Run glimpse(comments_raw) to confirm actual column names before selecting |
| Browser never redirects back to R / hangs at Waiting for authentication in browser... | The Authorized redirect URI doesn’t match | Confirm it’s exactly http://localhost:1410/ in your OAuth client settings (Step 1.2) |
| Lost your scraped data after forgetting to assign it | Result wasn’t saved to a variable | Recover with comments_raw <- .Last.value, but only if nothing else ran in the console since (Step 3.3) |

---

# Summary

In this tutorial you:
- Enabled the YouTube Data API v3 and created OAuth 2.0 credentials in Google Cloud Console
- Authenticated R against your Google account using tuber::yt_oauth()
- Scraped every comment from a YouTube video with get_all_comments()
- Cleaned the result with a dplyr pipeline and exported it to CSV
- Learned how to recognize and fix the most common errors in this workflow

Next steps: try running vader or tidytext sentiment analysis on comments_clean$text (the column was renamed from textOriginal in Part 4), or compare comment sentiment and engagement (likeCount) across two competing brands’ videos.

# References

- Sysoev, J. (tuber package documentation). tuber: Access to YouTube via the API
- Google Developers. YouTube Data API v3 Reference
- Wickham, H., et al. dplyr: A Grammar of Data Manipulation

---

Jimmy Zhenning Xu, Ph.D. | github.com/utjimmyx