1 Lab purpose

Goal: This report collects and analyzes public YouTube comments from a video related to public reaction, legal conflict, and online audience response. The assignment requirement is to collect at least 100 comments or posts, clean the text, perform word frequency analysis, create at least one visualization, and interpret the findings.

This polished build goes beyond the minimum requirement by adding several audience analytics views:

Word frequency: What words appear most often?
Phrase analysis: What repeated two-word phrases reveal the main narrative frames?
Theme coding: How can key words be grouped into interpretable audience themes?
Timeline analysis: When did comment activity appear strongest?
Engagement concentration: Are likes spread evenly, or do a few comments dominate attention?
Comment length versus engagement: Do short or long comments attract more likes?
Calendar heatmap: What time/day patterns appear in the collected comments?
Location references in text: Do commenters explicitly write place names that can be counted as a cautious text-reference proxy?

Graphics export: All report graphics are automatically saved as PNG files in the lab4_graphics/ folder with descriptive names. This makes the visual outputs reusable for a slide deck, portfolio page, README, or final-project documentation.

1. Watch the source.
Use the sticky video for context before reading the audience signals.

2. Scan the language.
Words, phrases, and themes show what the crowd kept returning to.

3. Follow the attention.
Engagement charts show which comments carried the most visible reaction.

Project process map: from source video to audience signals

This visual roadmap frames the workflow before the report moves into the charts: collect the comments, clean the signal, visualize repeated language, follow engagement, and tell the story carefully.

[Figure: Infographic process map for the Afroman YouTube comment analysis project]

Selected video URL: https://www.youtube.com/watch?v=u4AiuqQpB1U
Selected video ID: u4AiuqQpB1U
Primary method: YouTube Data API v3 using an API-key workflow, with local CSV caching for reproducibility.
Current safe workflow: This version is designed to knit from the saved CSV after the API key has been removed.

2 Part 1 - Five-sentence reading reflection


Brooks (2026) was useful because it connects social media analytics to a practical managerial problem: how quickly organizations should respond when negative public reaction begins spreading online. The reading shows that user-generated posts can be treated as more than casual conversation because they can reveal the timing, intensity, and persistence of public backlash. I found the distinction between negative and non-negative post behavior especially important because it suggests that public communication strategies should not assume all social media attention behaves the same way. For my final Executive Order Signal Impact Dashboard, this reading supports the idea that online discussion volume and word choice can serve as early signals of public awareness, confusion, disagreement, or support. The reading also highlights an important limitation for my own work: conclusions from one platform or one event should be presented carefully and treated as evidence from a specific data source rather than universal proof.

3 Part 2 - Data collection and text analysis

3.1 1. Packages and reproducibility setup

# ================================================================
# PACKAGE SETUP
# ================================================================
# This chunk installs missing packages and loads everything needed.
# The package list is intentionally limited to common packages that work
# reliably in Posit Cloud / RStudio Cloud.

required_packages <- c(
  "dplyr",
  "forcats",
  "ggplot2",
  "httr",
  "jsonlite",
  "knitr",
  "lubridate",
  "readr",
  "scales",
  "stringr",
  "tibble",
  "tidyr",
  "tidytext"
)

missing_packages <- required_packages[!(required_packages %in% rownames(installed.packages()))]

if (length(missing_packages) > 0) {
  install.packages(missing_packages)
}

invisible(lapply(required_packages, library, character.only = TRUE))

3.2 2. Credential notes

Do not publish your real API key. This build is intended to knit from saved CSV files after collection is complete. If the CSV files are missing and you need to recollect data, set the API key in the Console or temporarily in the chunk below, then remove it before publishing.

Run this once in the Console, not in the published report, if you need to recollect data:

# Replace the placeholder with your real YouTube Data API key.
# This key often starts with AIza...
# Do not publish the real key to RPubs, GitHub, Canvas comments, or screenshots.

Sys.setenv(YOUTUBE_API_KEY = "PASTE_YOUR_YOUTUBE_API_KEY_HERE")
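
Before recollecting, it can help to confirm that the key is visible to the R session without printing it in full. The following is a minimal base R sketch, not part of the original workflow: the helper name `mask_secret()` and the demo key are hypothetical and exist only for illustration.

```r
# Hypothetical helper: show only the first few characters of a secret.
mask_secret <- function(secret, visible = 4) {
  if (!nzchar(secret)) {
    return("<not set>")
  }
  shown <- substr(secret, 1, visible)
  hidden <- strrep("*", max(nchar(secret) - visible, 0))
  paste0(shown, hidden)
}

# Fake key used only for illustration; never paste a real key into a report.
demo_key <- "AIzaEXAMPLEONLY"
masked_demo <- mask_secret(demo_key)
masked_empty <- mask_secret("")

print(masked_demo)
print(masked_empty)

# In a live session the same check would read the environment variable:
# mask_secret(Sys.getenv("YOUTUBE_API_KEY"))
```

If the masked value prints as `<not set>`, the environment variable is empty and the report can only knit from the saved CSV files.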

Credential decoder:

Credential	Usually looks like	Used in this main workflow?	Where it goes
YouTube Data API key	Starts with `AIza...`	Yes, only if recollecting	`Sys.setenv(YOUTUBE_API_KEY = "...")`
OAuth Client ID / App ID	Ends with `.apps.googleusercontent.com`	No, appendix only	`app_id <- "..."`
OAuth Client Secret / App Secret	Often starts with `GOCSPX-...`	No, appendix only	`app_secret <- "..."`
Google Cloud project name / ID	Project label or slug	No	Does not go in R

3.3 3. Project settings

# ================================================================
# PROJECT SETTINGS
# ================================================================
# Store reusable settings in one place. For a future project, change only
# these values and rerun the same workflow.

video_id <- "u4AiuqQpB1U"
video_url <- paste0("https://www.youtube.com/watch?v=", video_id)

# The assignment requires 100+ comments where available. This high cap was used
# during the collection pass. The API may return fewer comments than the public
# YouTube page displays because of endpoint limits, unavailable comments, replies,
# spam filtering, or pagination behavior.
max_comments_to_collect <- 25000

# During the full collection pass, the API initially returned 1,105 accessible
# comment records. After text cleaning and deduplication, the analysis file
# retained the final cleaned count reported below.
raw_comments_initial_collection_count <- 1105

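raw_comments_file <- paste0("youtube_comments_raw_", video_id, ".csv")
clean_comments_file <- paste0("youtube_comments_clean_", video_id, ".csv")

# Folder for exported report graphics. save_report_plot() writes into this
# folder, so it is defined and created here before any plot is saved.
graphics_dir <- "lab4_graphics"
dir.create(graphics_dir, showWarnings = FALSE, recursive = TRUE)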

3.4 4. Helper functions for collection and analysis

# ================================================================
# HELPER FUNCTIONS
# ================================================================

# get_col() safely extracts a column from a data frame.
# This avoids hard crashes if the API returns a slightly different structure.
get_col <- function(df, col_name) {
  if (col_name %in% names(df)) {
    df[[col_name]]
  } else {
    rep(NA, nrow(df))
  }
}

# get_youtube_video_metadata() collects basic video information.
# This is useful for documenting the source in the report.
get_youtube_video_metadata <- function(video_id, api_key) {
  response <- httr::GET(
    "https://www.googleapis.com/youtube/v3/videos",
    query = list(
      part = "snippet,statistics",
      id = video_id,
      key = api_key
    )
  )

  if (httr::status_code(response) != 200) {
    warning("Video metadata request failed. The report can continue if comments are available.")
    return(tibble::tibble())
  }

  parsed <- jsonlite::fromJSON(
    httr::content(response, as = "text", encoding = "UTF-8"),
    flatten = TRUE
  )

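  # length() is used instead of nrow(): an empty "items" array parses to an
  # empty list, where nrow() returns NULL and the condition would error.
  if (!"items" %in% names(parsed) || length(parsed$items) == 0) {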
    return(tibble::tibble())
  }

  items <- tibble::as_tibble(parsed$items)

  tibble::tibble(
    video_id = video_id,
    title = get_col(items, "snippet.title"),
    channel_title = get_col(items, "snippet.channelTitle"),
    published_at = get_col(items, "snippet.publishedAt"),
    view_count = suppressWarnings(as.numeric(get_col(items, "statistics.viewCount"))),
    comment_count_visible_on_youtube = suppressWarnings(as.numeric(get_col(items, "statistics.commentCount")))
  )
}

# get_youtube_comments_api_key() collects top-level public comments from a YouTube video.
# It uses pagination through nextPageToken until it reaches max_comments or the API ends.
# Note: commentThreads captures top-level comment threads and reply counts. It may not
# return every visible nested reply on YouTube.
get_youtube_comments_api_key <- function(video_id, api_key, max_comments = 25000) {
  if (nchar(api_key) == 0) {
    stop("API key is missing. Set Sys.setenv(YOUTUBE_API_KEY = 'your_key_here') or provide saved CSV files.")
  }

  base_url <- "https://www.googleapis.com/youtube/v3/commentThreads"
  all_comments <- list()
  page_token <- NULL

  repeat {
    query_params <- list(
      part = "snippet",
      videoId = video_id,
      key = api_key,
      maxResults = 100,
      textFormat = "plainText",
      order = "relevance"
    )

    if (!is.null(page_token)) {
      query_params$pageToken <- page_token
    }

    response <- httr::GET(base_url, query = query_params)

    if (httr::status_code(response) != 200) {
      error_text <- httr::content(response, as = "text", encoding = "UTF-8")
      stop(paste("YouTube API request failed:", httr::status_code(response), error_text))
    }

    parsed <- jsonlite::fromJSON(
      httr::content(response, as = "text", encoding = "UTF-8"),
      flatten = TRUE
    )

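    # length() is used instead of nrow(): an empty "items" array parses to an
    # empty list, where nrow() returns NULL and the condition would error.
    if (!"items" %in% names(parsed) || length(parsed$items) == 0) {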
      break
    }

    items <- tibble::as_tibble(parsed$items)

    page_comments <- tibble::tibble(
      comment_id = get_col(items, "id"),
      author = get_col(items, "snippet.topLevelComment.snippet.authorDisplayName"),
      text = get_col(items, "snippet.topLevelComment.snippet.textDisplay"),
      published_at = get_col(items, "snippet.topLevelComment.snippet.publishedAt"),
      like_count = suppressWarnings(as.numeric(get_col(items, "snippet.topLevelComment.snippet.likeCount"))),
      reply_count = suppressWarnings(as.numeric(get_col(items, "snippet.totalReplyCount"))),
      video_id = video_id,
      source_method = "YouTube Data API v3 commentThreads.list"
    )

    all_comments[[length(all_comments) + 1]] <- page_comments
    # Sum the page sizes instead of re-binding every page on each iteration.
    current_total <- sum(vapply(all_comments, nrow, integer(1)))

    if (current_total >= max_comments || is.null(parsed$nextPageToken)) {
      break
    }

    page_token <- parsed$nextPageToken
    Sys.sleep(0.25)
  }

  dplyr::bind_rows(all_comments) |>
    dplyr::distinct(comment_id, .keep_all = TRUE) |>
    dplyr::slice_head(n = max_comments)
}

# clean_comment_data() standardizes the raw YouTube data.
clean_comment_data <- function(comments_raw) {
  # Handle both raw API files and already-cleaned cache files.
  # Some files have text_original already; raw files usually only have text.
  comments_raw <- tibble::as_tibble(comments_raw)

  if (!"text_original" %in% names(comments_raw)) {
    comments_raw <- comments_raw |>
      dplyr::mutate(text_original = as.character(text))
  }

  if (!"reply_count" %in% names(comments_raw)) {
    comments_raw <- comments_raw |>
      dplyr::mutate(reply_count = 0)
  }

  comments_raw |>
    dplyr::mutate(
      text_original = as.character(text_original),
      text_clean = text_original |>
        stringr::str_to_lower() |>
        stringr::str_replace_all("http[s]?://\\S+", " ") |>
        stringr::str_replace_all("www\\.\\S+", " ") |>
        stringr::str_replace_all("[^a-z\\s]", " ") |>
        stringr::str_squish(),
      published_at = suppressWarnings(lubridate::ymd_hms(published_at)),
      like_count = suppressWarnings(as.numeric(like_count)),
      reply_count = suppressWarnings(as.numeric(reply_count)),
      comment_length_words = stringr::str_count(text_clean, "\\S+"),
      total_engagement = dplyr::coalesce(like_count, 0) + dplyr::coalesce(reply_count, 0)
    ) |>
    dplyr::filter(!is.na(text_clean), text_clean != "") |>
    dplyr::distinct(comment_id, .keep_all = TRUE)
}

# rolling_mean_right() calculates a simple rolling average without adding another package.
rolling_mean_right <- function(x, window = 7) {
  stats::filter(x, rep(1 / window, window), sides = 1)
}

# gini_coefficient() measures concentration. 0 = perfectly equal; values near 1 =
# highly concentrated. With this formula the maximum for n values is (n - 1) / n.
gini_coefficient <- function(x) {
  x <- x[!is.na(x)]
  x <- x[x >= 0]
  if (length(x) == 0 || sum(x) == 0) {
    return(NA_real_)
  }
  x <- sort(x)
  n <- length(x)
  (2 * sum(seq_len(n) * x)) / (n * sum(x)) - (n + 1) / n
}

# Reusable Afroman-inspired leafy palette for page and graph consistency.
hemp_green <- "#1F6B3A"
hemp_green_dark <- "#12351F"
hemp_sage <- "#70A83B"
hemp_light <- "#DDE7BE"
hemp_cream <- "#F7F1DD"
hemp_parchment <- "#EFE7C8"
hemp_soil <- "#6B4A2D"
hemp_gold <- "#D6A11F"
hemp_mint <- "#A7C957"
hemp_gray <- "#5F665A"

# A reusable visual theme keeps charts consistent.
theme_lab <- function() {
  ggplot2::theme_minimal(base_size = 12) +
    ggplot2::theme(
      plot.background = ggplot2::element_rect(fill = hemp_cream, color = NA),
      panel.background = ggplot2::element_rect(fill = hemp_cream, color = NA),
      plot.title = ggplot2::element_text(face = "bold", size = 15, color = hemp_green_dark),
      plot.subtitle = ggplot2::element_text(size = 11, color = hemp_soil),
      axis.title = ggplot2::element_text(face = "bold", color = hemp_green_dark),
      axis.text = ggplot2::element_text(color = "#2D3328"),
      panel.grid.major = ggplot2::element_line(color = "#E5E0C4"),
      panel.grid.minor = ggplot2::element_blank(),
      legend.position = "bottom",
      legend.title = ggplot2::element_text(face = "bold", color = hemp_green_dark),
      plot.caption = ggplot2::element_text(color = hemp_gray)
    )
}

# save_report_plot() writes each polished ggplot to the graphics folder with a clear filename.
# The plot still appears in the knitted report because the object is printed after saving.
save_report_plot <- function(plot_object, filename, width = 11, height = 6.5) {
  ggplot2::ggsave(
    filename = file.path(graphics_dir, filename),
    plot = plot_object,
    width = width,
    height = height,
    dpi = 300,
    bg = hemp_cream
  )
  invisible(plot_object)
}
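
The concentration helper is easier to trust after checking it on inputs with a known answer. The sketch below repeats the report's `gini_coefficient()` formula so it runs on its own and applies it to two toy like-count vectors. The vectors `likes_equal` and `likes_one_winner` are made up for illustration and are not drawn from the dataset.

```r
# Same formula as the report's gini_coefficient() helper, repeated here so the
# sketch runs on its own.
gini_coefficient <- function(x) {
  x <- x[!is.na(x)]
  x <- x[x >= 0]
  if (length(x) == 0 || sum(x) == 0) {
    return(NA_real_)
  }
  x <- sort(x)
  n <- length(x)
  (2 * sum(seq_len(n) * x)) / (n * sum(x)) - (n + 1) / n
}

# Toy like counts (made up for illustration).
likes_equal <- c(5, 5, 5, 5)        # every comment gets the same likes
likes_one_winner <- c(0, 0, 0, 20)  # one comment gets everything

gini_equal <- gini_coefficient(likes_equal)
gini_one_winner <- gini_coefficient(likes_one_winner)

print(gini_equal)        # 0
print(gini_one_winner)   # 0.75
```

With four values the most concentrated case scores 0.75, which is (n - 1) / n. The ceiling only approaches 1 as the number of comments grows, so scores from samples of very different sizes should be compared with some care.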

3.5 5. Collect or load YouTube comments

Current reproducible mode: If youtube_comments_clean_u4AiuqQpB1U.csv exists, this report loads that cleaned file and does not call the API. This makes the report safe to knit after the API key is removed.

# ================================================================
# LOAD CLEAN CSV, LOAD RAW CSV, OR COLLECT FROM API
# ================================================================
# Priority order:
# 1. Load saved clean CSV if available.
# 2. Else load raw CSV and clean it.
# 3. Else collect from the API if an API key is available.

api_key <- Sys.getenv("YOUTUBE_API_KEY")

if (file.exists(clean_comments_file)) {
  comments_clean <- readr::read_csv(clean_comments_file, show_col_types = FALSE) |>
    clean_comment_data()
  collection_status <- paste("Loaded existing clean CSV:", clean_comments_file)

} else if (file.exists(raw_comments_file)) {
  comments_raw <- readr::read_csv(raw_comments_file, show_col_types = FALSE)
  comments_clean <- clean_comment_data(comments_raw)
  readr::write_csv(comments_clean, clean_comments_file)
  collection_status <- paste("Loaded raw CSV, cleaned it, and saved:", clean_comments_file)

} else {
  comments_raw <- get_youtube_comments_api_key(
    video_id = video_id,
    api_key = api_key,
    max_comments = max_comments_to_collect
  )
  readr::write_csv(comments_raw, raw_comments_file)
  comments_clean <- clean_comment_data(comments_raw)
  readr::write_csv(comments_clean, clean_comments_file)
  collection_status <- paste("Collected from API and saved:", raw_comments_file, "and", clean_comments_file)
}

collection_status

## [1] "Loaded existing clean CSV: youtube_comments_clean_u4AiuqQpB1U.csv"

dplyr::glimpse(comments_clean)

## Rows: 1,051
## Columns: 12
## $ comment_id           <chr> "UgzxqV6MPR5RLmUBmal4AaABAg", "UgxUrfoE4p2mooMGNO…
## $ author               <chr> "@AP-kg9dz", "@kurtwpg", "@thedont2154", "@hidden…
## $ text                 <chr> "Songs clowning on corrupt cops is my new favorit…
## $ published_at         <dttm> 2026-03-20 15:51:20, 2026-03-26 02:44:02, 2026-0…
## $ like_count           <dbl> 34228, 4484, 11333, 233, 9066, 30273, 25758, 500,…
## $ reply_count          <dbl> 146, 24, 55, 9, 111, 212, 135, 3, 63, 29, 2, 97, …
## $ video_id             <chr> "u4AiuqQpB1U", "u4AiuqQpB1U", "u4AiuqQpB1U", "u4A…
## $ source_method        <chr> "YouTube Data API v3 commentThreads.list", "YouTu…
## $ text_original        <chr> "Songs clowning on corrupt cops is my new favorit…
## $ text_clean           <chr> "songs clowning on corrupt cops is my new favorit…
## $ comment_length_words <int> 10, 22, 13, 9, 22, 20, 30, 11, 32, 14, 14, 17, 37…
## $ total_engagement     <dbl> 34374, 4508, 11388, 242, 9177, 30485, 25893, 503,…
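
The API branch above relies on token-based pagination: each response may carry a `nextPageToken`, and the loop stops when the token is missing or the cap is reached. The sketch below shows that loop shape offline. It assumes a fake three-page data source; `mock_pages`, `fetch_mock_page()`, and `collect_all()` are hypothetical names, and no network request is made.

```r
# Fake "API": three pages keyed by token. Made-up data, no network access.
mock_pages <- list(
  start = list(items = c("c1", "c2"), nextPageToken = "p2"),
  p2    = list(items = c("c3", "c4"), nextPageToken = "p3"),
  p3    = list(items = c("c5"),       nextPageToken = NULL)
)

# Hypothetical fetcher: returns the page for a token (or the first page).
fetch_mock_page <- function(page_token = NULL) {
  key <- if (is.null(page_token)) "start" else page_token
  mock_pages[[key]]
}

# Follow nextPageToken until it is missing or the cap is reached.
collect_all <- function(max_items = 100) {
  collected <- list()
  page_token <- NULL
  total <- 0

  repeat {
    page <- fetch_mock_page(page_token)
    if (length(page$items) == 0) break

    collected[[length(collected) + 1]] <- page$items
    total <- total + length(page$items)

    if (total >= max_items || is.null(page$nextPageToken)) break
    page_token <- page$nextPageToken
  }

  # A page can overshoot the cap, so trim at the end.
  head(unlist(collected), max_items)
}

all_ids <- collect_all()
capped_ids <- collect_all(max_items = 3)

print(all_ids)
print(capped_ids)
```

The cap is checked after a whole page is stored, which is why the final trim is needed: a page of 100 results can push the total past the requested maximum.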

4 Source and dataset summary

Reader note: On desktop, the table of contents is shortened and scrollable so the source video can stay open beneath it with breathing room, rather than covering the report text. On narrower screens, the video returns to the normal report flow.

4.1 6. Source summary

# Metadata is optional. If no API key is present, the report still proceeds.

if (nchar(api_key) > 0) {
  video_metadata <- get_youtube_video_metadata(video_id, api_key)
} else {
  video_metadata <- tibble::tibble()
}

source_summary_table <- tibble::tibble(
  Field = c(
    "Video ID",
    "Video URL",
    "Collection method",
    "Primary data type",
    "API metadata status",
    "Reproducible cache file"
  ),
  Value = c(
    video_id,
    paste0("<a href='", video_url, "'>", video_url, "</a>"),
    "YouTube Data API v3 commentThreads.list using API-key collection, then saved CSV cache",
    "Public top-level YouTube comment records returned by the API",
    ifelse(nrow(video_metadata) > 0, "Available in this knit", "Unavailable in this knit (no API key was set or the metadata request failed)"),
    clean_comments_file
  )
)

knitr::kable(
  source_summary_table,
  caption = "Source summary for the YouTube comment analysis",
  escape = FALSE,
  align = c("l", "l")
)

Source summary for the YouTube comment analysis
Field	Value
Video ID	u4AiuqQpB1U
Video URL	https://www.youtube.com/watch?v=u4AiuqQpB1U
Collection method	YouTube Data API v3 commentThreads.list using API-key collection, then saved CSV cache
Primary data type	Public top-level YouTube comment records returned by the API
API metadata status	Available in this knit
Reproducible cache file	youtube_comments_clean_u4AiuqQpB1U.csv

if (nrow(video_metadata) > 0) {
  video_metadata_formatted <- video_metadata |>
    dplyr::mutate(
      view_count = scales::comma(view_count),
      comment_count_visible_on_youtube = scales::comma(comment_count_visible_on_youtube)
    )

  knitr::kable(
    video_metadata_formatted,
    caption = "Optional YouTube video metadata from the API",
    align = "l"
  )
}

Optional YouTube video metadata from the API
video_id	title	channel_title	published_at	view_count	comment_count_visible_on_youtube
u4AiuqQpB1U	RANDY WALTERS IS A SON OF A BITCH	ogafroman	2026-03-16T12:45:54Z	3,614,030	22,464
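
Timestamps such as `published_at` arrive as ISO 8601 strings in UTC, and the timeline and heatmap views depend on splitting them into date, weekday, and hour. The report parses them with `lubridate::ymd_hms()`. The sketch below is a base R stand-in that shows the same idea on the video's own publish time, with variable names chosen only for illustration.

```r
# Publish time as shown in the metadata table (UTC, ISO 8601).
published_at <- "2026-03-16T12:45:54Z"

# Parse the string as a UTC date-time.
parsed_time <- as.POSIXct(published_at, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")

# POSIXlt exposes the calendar parts used by timeline and heatmap charts.
parts <- as.POSIXlt(parsed_time, tz = "UTC")
comment_hour <- parts$hour   # hour of day, 0-23
comment_wday <- parts$wday   # 0 = Sunday, 1 = Monday, ..., 6 = Saturday
comment_date <- as.Date(parsed_time, tz = "UTC")

print(comment_date)
print(comment_hour)
print(comment_wday)
```

All parts here are in UTC. Hour-of-day patterns would shift if the timestamps were converted to a local time zone, so any time-of-day reading should state which zone it uses.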

# Dataset summary for documentation.
# This vertical two-column format is easier to read in HTML than one wide row.
# It also clarifies the difference between the initial API return and the final
# cleaned analysis dataset.

comments_analyzed <- nrow(comments_clean)
unique_authors_count <- dplyr::n_distinct(comments_clean$author)
first_comment_value <- min(comments_clean$published_at, na.rm = TRUE)
latest_comment_value <- max(comments_clean$published_at, na.rm = TRUE)
total_likes_value <- sum(comments_clean$like_count, na.rm = TRUE)
total_replies_value <- sum(comments_clean$reply_count, na.rm = TRUE)
median_length_value <- median(comments_clean$comment_length_words, na.rm = TRUE)
average_length_value <- mean(comments_clean$comment_length_words, na.rm = TRUE)

raw_cache_count <- if (file.exists(raw_comments_file)) {
  suppressWarnings(nrow(readr::read_csv(raw_comments_file, show_col_types = FALSE)))
} else {
  NA_integer_
}

raw_collection_display <- if (!is.na(raw_cache_count)) {
  scales::comma(raw_cache_count)
} else {
  paste0(scales::comma(raw_comments_initial_collection_count), " during initial API collection")
}

comment_summary <- tibble::tibble(
  Metric = c(
    "Records initially returned by API collection",
    "Comments analyzed after cleaning",
    "Unique authors in cleaned dataset",
    "First collected comment date",
    "Latest collected comment date",
    "Total likes on analyzed comments",
    "Total replies to analyzed comments",
    "Median comment length",
    "Average comment length"
  ),
  Value = c(
    raw_collection_display,
    scales::comma(comments_analyzed),
    scales::comma(unique_authors_count),
    format(first_comment_value, "%Y-%m-%d %H:%M:%S"),
    format(latest_comment_value, "%Y-%m-%d %H:%M:%S"),
    scales::comma(total_likes_value),
    scales::comma(total_replies_value),
    paste0(round(median_length_value, 1), " words"),
    paste0(round(average_length_value, 1), " words")
  )
)

cat(
  '<div class="stat-grid">',
  paste0('<div class="stat-card"><div class="stat-value">', scales::comma(comments_analyzed), '</div><div class="stat-label">cleaned comments</div></div>'),
  paste0('<div class="stat-card"><div class="stat-value">', scales::comma(unique_authors_count), '</div><div class="stat-label">unique authors</div></div>'),
  paste0('<div class="stat-card"><div class="stat-value">', scales::comma(total_likes_value), '</div><div class="stat-label">likes captured</div></div>'),
  paste0('<div class="stat-card"><div class="stat-value">', scales::comma(total_replies_value), '</div><div class="stat-label">replies captured</div></div>'),
  paste0('<div class="stat-card"><div class="stat-value">', format(first_comment_value, "%b %d"), ' - ', format(latest_comment_value, "%b %d"), '</div><div class="stat-label">comment window</div></div>'),
  '</div>'
)

Cleaned comments: 1,051
Unique authors: 997
Likes captured: 464,878
Replies captured: 3,660
Comment window: Mar 16 - Jun 22

knitr::kable(
  comment_summary,
  caption = "Summary of collected and analyzed YouTube comments",
  align = c("l", "r")
)

Summary of collected and analyzed YouTube comments
Metric	Value
Records initially returned by API collection	1,105
Comments analyzed after cleaning	1,051
Unique authors in cleaned dataset	997
First collected comment date	2026-03-16 12:46:46
Latest collected comment date	2026-06-22 16:17:53
Total likes on analyzed comments	464,878
Total replies to analyzed comments	3,660
Median comment length	10 words
Average comment length	14.6 words

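Plain-language reading: The initial API collection returned 1,105 records and the cleaned analysis file keeps 1,051, a difference of 54 rows. Cleaning drops rows whose text is empty once everything except the letters a-z is removed (for example, comments made up only of emoji, digits, or other characters outside a-z) and rows with a duplicate comment ID, and it standardizes the remaining text. The cleaned dataset is the version used for word frequency, phrase analysis, engagement analysis, and all visualizations below.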
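
The gap between raw and cleaned row counts can be reproduced on a toy table. The sketch below is a simplified base R stand-in for the report's cleaning step, not the report's own function: the data frame `raw_toy` and the helper `clean_text_simple()` are made up for illustration.

```r
# Toy raw records: one row is digits and punctuation only, one ID is duplicated.
raw_toy <- data.frame(
  comment_id = c("a1", "a2", "a3", "a3", "a4"),
  text = c("Great song!", "12345 !!!", "Love it", "Love it", "So good"),
  stringsAsFactors = FALSE
)

# Simplified cleaning: lowercase, keep only a-z and spaces, collapse spaces.
clean_text_simple <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z ]", " ", x)
  trimws(gsub(" +", " ", x))
}

raw_toy$text_clean <- clean_text_simple(raw_toy$text)

# Drop rows that are empty after cleaning, then drop duplicate comment IDs.
kept <- raw_toy[raw_toy$text_clean != "", ]
kept <- kept[!duplicated(kept$comment_id), ]

rows_dropped <- nrow(raw_toy) - nrow(kept)

print(kept$comment_id)
print(rows_dropped)
```

Five toy rows shrink to three: one row has no letters left after cleaning and one repeats an existing comment ID.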

4.2 6.1 Example of a cleaned comment

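The table below shows one real comment from the dataset before and after cleaning. The cleaning step lowercases the text, removes links, replaces every character other than a-z and whitespace (punctuation, digits, emoji) with a space, and collapses repeated spaces so the comment can be tokenized into words and phrases. Because apostrophes become spaces, contractions split into fragments: "I’ve" becomes "i ve", which is why the example counts 16 words after cleaning even though the original has 15.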

# ================================================================
# CLEANED COMMENT EXAMPLE
# ================================================================
# Choose a readable real example from the dataset. The selection avoids very
# short comments and extremely long comments so the before/after cleaning
# example is useful for the reader.

example_comment <- comments_clean |>
  dplyr::filter(comment_length_words >= 8, comment_length_words <= 35) |>
  dplyr::arrange(dplyr::desc(like_count)) |>
  dplyr::slice(1) |>
  dplyr::transmute(
    `Original comment preview` = stringr::str_trunc(text_original, 220),
    `Cleaned text used for analysis` = stringr::str_trunc(text_clean, 220),
    `Words after cleaning` = comment_length_words,
    `Likes` = scales::comma(like_count)
  )

knitr::kable(
  example_comment,
  caption = "Example of one real comment before and after text cleaning"
)

Example of one real comment before and after text cleaning
Original comment preview	Cleaned text used for analysis	Words after cleaning	Likes
These diss tracks against individual cops are the most gangster shit I’ve ever seen lmao	these diss tracks against individual cops are the most gangster shit i ve ever seen lmao	16	56,587
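
The before-and-after row above can be reproduced step by step. The sketch below mirrors the report's regular expressions in base R (with `perl = TRUE`) so it runs without extra packages. The function name `clean_text_base()` and the sample sentence are assumptions made for illustration.

```r
# Base R stand-in for the stringr pipeline in clean_comment_data().
clean_text_base <- function(x) {
  x <- tolower(x)
  x <- gsub("http[s]?://\\S+", " ", x, perl = TRUE)  # drop http/https links
  x <- gsub("www\\.\\S+", " ", x, perl = TRUE)       # drop bare www links
  x <- gsub("[^a-z\\s]", " ", x, perl = TRUE)        # non-letters become spaces
  trimws(gsub("\\s+", " ", x, perl = TRUE))          # collapse and trim spaces
}

# Made-up sample with a contraction, punctuation, a link, and digits.
sample_comment <- "I've never seen this before!! https://example.com 100%"

cleaned <- clean_text_base(sample_comment)
word_total <- lengths(strsplit(cleaned, " ", fixed = TRUE))

print(cleaned)      # "i ve never seen this before"
print(word_total)   # 6
```

The contraction "I've" turns into two tokens, "i" and "ve". Short fragments like these are later removed by the stop-word and minimum-length filters, but longer fragments such as "don" (from "don't") can survive into the word counts.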

5 Text preparation

5.1 7. Tokenize words and remove stop words

# ================================================================
# WORD TOKENIZATION
# ================================================================
# Tokenization converts each comment into individual words.
# Stop words are common words such as "the", "and", and "is" that usually
# do not help identify themes.

custom_stop_words <- tibble::tibble(
  word = c(
    "youtube", "video", "watch", "channel", "comment", "comments",
    "just", "like", "really", "get", "got", "can", "will", "one",
    "people", "thing", "things", "way", "make", "makes", "say", "said",
    "u", "im", "dont", "didnt", "doesnt", "cant", "youre", "ive",
    "thats", "theyre", "hes", "shes", "wasnt", "isnt", "arent",
    "would", "could", "should", "also", "even", "still", "know"
  ),
  lexicon = "custom"
)

all_stop_words <- dplyr::bind_rows(tidytext::stop_words, custom_stop_words)

comment_words <- comments_clean |>
  dplyr::select(comment_id, text_clean) |>
  tidytext::unnest_tokens(word, text_clean) |>
  dplyr::anti_join(all_stop_words, by = "word") |>
  dplyr::filter(stringr::str_detect(word, "^[a-z]+$")) |>
  dplyr::filter(nchar(word) > 2)

word_counts <- comment_words |>
  dplyr::count(word, sort = TRUE)

knitr::kable(head(word_counts, 25), caption = "Top 25 most frequent meaningful words")

Top 25 most frequent meaningful words
word	n
afroman	225
song	103
love	87
randy	83
cops	68
court	44
walters	43
shit	40
police	38
time	38
music	31
son	31
county	27
day	27
lol	27
bitch	26
banger	24
corrupt	24
don	24
god	23
head	22
judge	22
legend	22
wife	22
adams	21
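
The counting logic behind this table is small enough to trace by hand. The sketch below is a base R stand-in for the `tidytext` steps (split into words, drop stop words, drop very short tokens, count). The three toy comments and the tiny stop-word list are made up for illustration.

```r
# Toy comments, already lowercased and stripped of punctuation.
toy_comments <- c(
  "afroman made another banger",
  "the song is a banger",
  "love the song"
)

# Tiny made-up stop-word list standing in for tidytext::stop_words.
toy_stop_words <- c("the", "is", "a", "made", "another")

# One token per word, across all comments.
tokens <- unlist(strsplit(toy_comments, " ", fixed = TRUE))

# Remove stop words, then tokens of two characters or fewer.
tokens <- tokens[!tokens %in% toy_stop_words]
tokens <- tokens[nchar(tokens) > 2]

# Count and rank the remaining words.
toy_counts <- sort(table(tokens), decreasing = TRUE)

print(toy_counts)
```

Counts are per appearance, not per comment, so a word repeated several times inside one comment raises its rank just as much as the same word spread across several comments.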

5.2 8. Tokenize two-word phrases

# ================================================================
# BIGRAM TOKENIZATION
# ================================================================
# Bigrams are two-word phrases. They are useful because a single word can be
# ambiguous, while a phrase often carries context.

comment_bigrams <- comments_clean |>
  dplyr::select(comment_id, text_clean) |>
  tidytext::unnest_tokens(bigram, text_clean, token = "ngrams", n = 2) |>
  tidyr::separate(bigram, into = c("word1", "word2"), sep = " ") |>
  dplyr::filter(
    !word1 %in% all_stop_words$word,
    !word2 %in% all_stop_words$word,
    stringr::str_detect(word1, "^[a-z]+$"),
    stringr::str_detect(word2, "^[a-z]+$")
  ) |>
  tidyr::unite(bigram, word1, word2, sep = " ")

bigram_counts <- comment_bigrams |>
  dplyr::count(bigram, sort = TRUE)

knitr::kable(head(bigram_counts, 20), caption = "Top 20 most frequent two-word phrases")

Top 20 most frequent two-word phrases
bigram	n
randy walters	39
pound cake	19
adams county	18
lemon pound	14
god bless	11
randy walter	9
county sheriff	8
diss track	6
diss tracks	6
reckless ben	6
corrupt cops	5
hell yeah	5
love afroman	5
streisand effect	5
bless america	4
crooked cops	4
hip hop	4
law enforcement	4
police officers	4
rent free	4
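
A bigram list is built by pairing each word with the word that follows it, then discarding pairs that contain a stop word. The sketch below shows that in base R as a stand-in for `unnest_tokens(..., token = "ngrams", n = 2)`. The helper `make_bigrams()`, the toy sentence, and the two-word stop list are assumptions for illustration.

```r
# Hypothetical helper: pair every word with its right-hand neighbor.
make_bigrams <- function(text) {
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  if (length(words) < 2) {
    return(character(0))
  }
  paste(head(words, -1), tail(words, -1))
}

toy_text <- "randy walters is in adams county"
toy_stop_words <- c("is", "in")  # made-up mini stop list

all_bigrams <- make_bigrams(toy_text)

# Split each pair back into its two words for filtering.
first_word <- sub(" .*$", "", all_bigrams)
second_word <- sub("^.* ", "", all_bigrams)

# Keep a pair only if neither word is a stop word.
keep <- !(first_word %in% toy_stop_words) & !(second_word %in% toy_stop_words)
kept_bigrams <- all_bigrams[keep]

print(all_bigrams)
print(kept_bigrams)
```

Because a pair is dropped when either word is a stop word, phrases built around function words never reach the table. No stemming is applied either, which is why "randy walters" and "randy walter" are counted as separate phrases above.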

6 Visualization guide

How to read this section: The charts move from simple description to richer interpretation. Word frequency identifies repeated vocabulary. Bigrams reveal repeated phrases and narrative frames. Theme coding groups individual words into audience frames. Engagement charts show which comments captured attention. Timeline and heatmap views show when the discussion appeared most active. The location-mention proxy checks whether commenters explicitly wrote state or country names, while warning that this is not true viewer geography.

Strong signal: words, phrases, engagement
Exploratory signal: timing and length patterns
Cautious proxy: place names mentioned in text

7 Signal 1: The Words That Kept Coming Back

7.1 Goal

This chart answers: What words appeared most often after cleaning the comments?

A lollipop chart is a polished alternative to a basic bar chart. The dot marks each word’s count, and the line makes it easier to compare terms across the ranked list.

# ================================================================
# VISUALIZATION 1: TOP TERMS LOLLIPOP CHART
# ================================================================

top_terms <- word_counts |>
  # with_ties = FALSE keeps exactly 25 rows even when counts tie at the cutoff.
  dplyr::slice_max(n, n = 25, with_ties = FALSE) |>
  dplyr::arrange(n)

plot_top_terms <- ggplot2::ggplot(top_terms, ggplot2::aes(x = n, y = forcats::fct_reorder(word, n))) +
  ggplot2::geom_segment(
    ggplot2::aes(x = 0, xend = n, y = forcats::fct_reorder(word, n), yend = forcats::fct_reorder(word, n)),
    linewidth = 0.9,
    color = hemp_light
  ) +
  ggplot2::geom_point(size = 3.4, color = hemp_green_dark) +
  ggplot2::labs(
    title = "Most Frequent Meaningful Words in the YouTube Comment Sample",
    subtitle = "Top 25 terms after lowercasing, punctuation cleanup, and stop-word removal",
    x = "Number of appearances",
    y = "Word",
    caption = "Source: YouTube Data API v3 commentThreads.list; analysis uses saved CSV cache."
  ) +
  theme_lab()

save_report_plot(plot_top_terms, "01_top_terms_lollipop.png")
plot_top_terms

7.2 How to interpret it

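The highest-ranked words show the most repeated vocabulary in the comment section. Repeated words do not prove agreement or sentiment by themselves, but they identify the main objects of attention. In this sample, the top terms cluster around a few of those objects: names (afroman 225, randy 83, walters 43), music (song 103, music 31, banger 24), and police or courts (cops 68, court 44, police 38, judge 22). The entry "don" is most likely the front half of "don't", split off when the cleaning step turned apostrophes into spaces, so it should not be read as a name.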

8 Signal 2: Repeated Phrases and Public Framing

8.1 Goal

This chart answers: What repeated two-word phrases reveal the clearest audience frames?

Single words can lose context. For example, a name, institution, or legal term may be more meaningful as a phrase than as an isolated word. Bigram analysis helps reveal repeated phrases such as names, institutions, legal ideas, or recurring jokes.

# ================================================================
# VISUALIZATION 2: BIGRAM PHRASE CHART
# ================================================================

top_bigrams <- bigram_counts |>
  # with_ties = FALSE keeps exactly 20 rows even when counts tie at the cutoff.
  dplyr::slice_max(n, n = 20, with_ties = FALSE) |>
  dplyr::arrange(n)

plot_bigrams <- ggplot2::ggplot(top_bigrams, ggplot2::aes(x = n, y = forcats::fct_reorder(bigram, n))) +
  ggplot2::geom_col(fill = hemp_sage) +
  ggplot2::labs(
    title = "Most Common Two-Word Phrases in the Comments",
    subtitle = "Bigrams reveal repeated names, institutions, jokes, and legal frames",
    x = "Number of appearances",
    y = "Two-word phrase",
    caption = "Bigrams created from cleaned YouTube comment text."
  ) +
  theme_lab()

save_report_plot(plot_bigrams, "02_bigram_phrase_chart.png")
plot_bigrams

8.2 How to interpret it

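Phrases are useful because they show how viewers connect ideas. If the chart is dominated by names, the discussion is person-centered. If it is dominated by institutional phrases, the discussion is system-centered. If legal phrases appear often, the audience is framing the video as a rights, court, or accountability issue. If music or humor phrases appear often, viewers may be interpreting the content as entertainment, protest, or both. In this sample the list is led by a name, "randy walters" (39), followed by institutional phrases such as "adams county" (18) and "county sheriff" (8). The overlapping pair "lemon pound" (14) and "pound cake" (19) points to a recurring three-word phrase, "lemon pound cake". Spelling variants are counted separately, so "randy walters" alone understates how often the name appears ("randy walter" adds 9 more).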

9 Signal 3: Turning Word Counts into Audience Frames

9.1 Goal

This chart answers: Can repeated words be grouped into broader audience themes?

This is not a machine-learning topic model. It is an interpretable, manually defined theme dictionary that groups important words into categories. The purpose is to translate word counts into plain-language audience frames.
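
The dictionary approach amounts to a lookup join followed by a sum per theme. The sketch below shows that shape in base R with `merge()` and `aggregate()` as stand-ins for the `dplyr` verbs. The word counts and the four-word dictionary are made up for illustration and are not the report's figures.

```r
# Made-up word counts; "pizza" has no theme on purpose.
toy_word_counts <- data.frame(
  word = c("court", "judge", "song", "banger", "pizza"),
  n = c(4, 2, 10, 3, 1),
  stringsAsFactors = FALSE
)

# Made-up mini dictionary mapping words to themes.
toy_dictionary <- data.frame(
  word = c("court", "judge", "song", "banger"),
  theme = c("Law / court", "Law / court", "Music / humor", "Music / humor"),
  stringsAsFactors = FALSE
)

# Inner join: only words that appear in the dictionary survive.
matched <- merge(toy_word_counts, toy_dictionary, by = "word")

# Sum word counts within each theme.
toy_theme_counts <- aggregate(n ~ theme, data = matched, FUN = sum)

# Words the dictionary does not cover are silently left out of every theme.
unmatched_words <- setdiff(toy_word_counts$word, toy_dictionary$word)

print(toy_theme_counts)
print(unmatched_words)
```

Theme totals depend entirely on which words the dictionary lists. A theme with more dictionary entries has more chances to collect mentions, so totals are best read as directional rather than as shares of the whole conversation.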

# ================================================================
# THEME DICTIONARY
# ================================================================
# These categories are manually defined for interpretability.
# They should be read as directional audience frames, not objective topics.

theme_dictionary <- tibble::tribble(
  ~word, ~theme,
  "afroman", "Artist / identity",
  "afro", "Artist / identity",
  "artist", "Artist / identity",
  "rapper", "Artist / identity",
  "man", "Artist / identity",

  "court", "Law / court",
  "case", "Law / court",
  "judge", "Law / court",
  "sue", "Law / court",
  "sued", "Law / court",
  "lawsuit", "Law / court",
  "defamation", "Law / court",
  "amendment", "Law / court",
  "speech", "Law / court",
  "legal", "Law / court",
  "law", "Law / court",

  "cops", "Police / corruption",
  "police", "Police / corruption",
  "sheriff", "Police / corruption",
  "corrupt", "Police / corruption",
  "raid", "Police / corruption",
  "county", "Police / corruption",
  "officer", "Police / corruption",
  "officers", "Police / corruption",

  "song", "Music / humor",
  "songs", "Music / humor",
  "music", "Music / humor",
  "banger", "Music / humor",
  "diss", "Music / humor",
  "track", "Music / humor",
  "funny", "Music / humor",
  "laugh", "Music / humor",
  "comedy", "Music / humor",

  "love", "Support / praise",
  "thank", "Support / praise",
  "thanks", "Support / praise",
  "legend", "Support / praise",
  "bless", "Support / praise",
  "respect", "Support / praise",
  "favorite", "Support / praise",
  "best", "Support / praise",

  "freedom", "Rights / accountability",
  "rights", "Rights / accountability",
  "accountability", "Rights / accountability",
  "justice", "Rights / accountability",
  "wrong", "Rights / accountability",
  "truth", "Rights / accountability"
)

theme_counts <- word_counts |>
  dplyr::inner_join(theme_dictionary, by = "word") |>
  dplyr::group_by(theme) |>
  dplyr::summarise(theme_mentions = sum(n), .groups = "drop") |>
  dplyr::arrange(dplyr::desc(theme_mentions))

knitr::kable(theme_counts, caption = "Theme counts from manually grouped high-meaning terms")

Theme counts from manually grouped high-meaning terms
theme	theme_mentions
Artist / identity	254
Music / humor	224
Police / corruption	205
Law / court	145
Support / praise	137
Rights / accountability	38

# ================================================================
# VISUALIZATION 3: THEME-CODED AUDIENCE FRAMES
# ================================================================

plot_theme_bars <- ggplot2::ggplot(theme_counts, ggplot2::aes(x = theme_mentions, y = forcats::fct_reorder(theme, theme_mentions))) +
  ggplot2::geom_col(fill = hemp_soil) +
  ggplot2::labs(
    title = "Theme-Level View of the Comment Conversation",
    subtitle = "Selected high-meaning words grouped into interpretable audience frames",
    x = "Total word mentions",
    y = "Theme",
    caption = "Themes are manually defined for interpretation and should be read as directional, not exhaustive."
  ) +
  theme_lab()

save_report_plot(plot_theme_bars, "03_theme_level_audience_frames.png")
plot_theme_bars

9.2 How to interpret it

The theme chart turns many individual word counts into a smaller number of audience frames. If Law / court and Police / corruption dominate, viewers are interpreting the video through legal accountability and institutional trust. If Music / humor is strong, viewers are also responding to the entertainment format. If Support / praise is strong, the comment section is not only discussing the event but also expressing approval toward the creator or message.
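One way to compare the frames is to convert the reported totals into shares. The sketch below is a hedged illustration that re-enters the totals from the theme table by hand; the object names `theme_totals`, `theme_share`, and `institutional_share` are assumptions for this example and are not part of the report's pipeline.

```r
# Theme totals copied from the theme table reported in this section.
theme_totals <- c(
  "Artist / identity" = 254,
  "Music / humor" = 224,
  "Police / corruption" = 205,
  "Law / court" = 145,
  "Support / praise" = 137,
  "Rights / accountability" = 38
)

# Share of all dictionary-matched mentions held by each theme.
theme_share <- theme_totals / sum(theme_totals)
round(100 * theme_share, 1)

# Combined weight of the two institution-focused themes.
institutional_share <- sum(theme_share[c("Law / court", "Police / corruption")])
institutional_share
```

Read this way, no single frame holds a majority of the 1,003 matched mentions, and Law / court plus Police / corruption together (350 mentions) outweigh Artist / identity alone (254 mentions).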

10 Signal 4: When the Conversation Spiked

10.1 Goal

This chart answers: When were the collected comments posted, and did activity spike or fade?

The bars show daily comment count. The line shows a simple 7-day rolling average, which smooths the chart so the overall trend is easier to read.
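The report's `rolling_mean_right()` helper is defined earlier and is not reproduced here. As a hedged illustration of what a right-aligned rolling mean does, the sketch below uses a hypothetical stand-in named `rolling_mean_right_sketch()`; the real helper may differ in details such as how it treats the first few days.

```r
# Hypothetical stand-in for a right-aligned rolling mean.
# Each value is the mean of the current observation and the (window - 1)
# observations before it; positions without a full window are NA.
rolling_mean_right_sketch <- function(x, window = 7) {
  n <- length(x)
  out <- rep(NA_real_, n)
  if (n < window) {
    return(out)
  }
  # A cumulative sum lets each window total be computed by subtraction.
  running_total <- cumsum(c(0, x))
  idx <- window:n
  out[idx] <- (running_total[idx + 1] - running_total[idx + 1 - window]) / window
  out
}

# Toy daily counts: a burst on the first days followed by a long tail.
toy_daily_counts <- c(120, 90, 60, 30, 10, 5, 5, 2, 0, 1)
rolling_mean_right_sketch(toy_daily_counts, window = 7)
```

Because the window looks backward only, the smoothed line lags a sudden spike slightly, which is worth remembering when reading the timing of peaks off the gold line.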

# ================================================================
# VISUALIZATION 4: COMMENT ACTIVITY OVER TIME
# ================================================================

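comments_by_day <- comments_clean |>
  dplyr::mutate(comment_date = as.Date(published_at)) |>
  dplyr::filter(!is.na(comment_date)) |>
  dplyr::count(comment_date, name = "comments_posted") |>
  # Fill calendar days with no comments as 0 so a 7-row window spans 7 real days.
  tidyr::complete(
    comment_date = seq(min(comment_date), max(comment_date), by = "day"),
    fill = list(comments_posted = 0L)
  ) |>
  dplyr::arrange(comment_date) |>
  dplyr::mutate(rolling_7_day_average = as.numeric(rolling_mean_right(comments_posted, window = 7)))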

plot_comment_activity <- ggplot2::ggplot(comments_by_day, ggplot2::aes(x = comment_date, y = comments_posted)) +
  ggplot2::geom_col(fill = hemp_light) +
  ggplot2::geom_line(ggplot2::aes(y = rolling_7_day_average), linewidth = 1.2, color = hemp_gold, na.rm = TRUE) +
  ggplot2::labs(
    title = "Comment Activity Over Time",
    subtitle = "Daily comment counts with a 7-day rolling average",
    x = "Comment date",
    y = "Number of collected comments",
    caption = "Bars show daily counts; gold line shows smoothed 7-day average."
  ) +
  theme_lab()

save_report_plot(plot_comment_activity, "04_comment_activity_over_time.png")
plot_comment_activity

10.2 How to interpret it

A sharp spike suggests a burst of audience attention, often linked to a video upload, news event, repost, or related controversy. A long tail suggests that the video continued receiving comments after the initial attention window. Multiple spikes can indicate that public attention was reactivated over time.

11 Signal 5: The Comments That Carried the Room

11.1 Goal

This chart answers: Are likes spread evenly across comments, or concentrated among a small number of highly visible comments?

This is an advanced view because it shifts from word frequency to attention concentration. It helps show whether the average comment or the highly liked comments are more likely to shape the visible conversation.

# ================================================================
# ENGAGEMENT CONCENTRATION PREP
# ================================================================

engagement_ranked <- comments_clean |>
  dplyr::mutate(total_engagement = dplyr::coalesce(like_count, 0) + dplyr::coalesce(reply_count, 0)) |>
  dplyr::arrange(dplyr::desc(total_engagement)) |>
  dplyr::mutate(
    comment_rank = dplyr::row_number(),
    cumulative_engagement = cumsum(total_engagement),
    total_engagement_all = sum(total_engagement, na.rm = TRUE),
    cumulative_engagement_share = cumulative_engagement / total_engagement_all,
    comment_share = comment_rank / dplyr::n()
  )

# Use the ranked table, where total_engagement is defined in this chunk.
engagement_gini <- gini_coefficient(engagement_ranked$total_engagement)
share_top_1_percent <- engagement_ranked |>
  dplyr::filter(comment_share <= 0.01) |>
  dplyr::summarise(value = max(cumulative_engagement_share, na.rm = TRUE)) |>
  dplyr::pull(value)
share_top_10_percent <- engagement_ranked |>
  dplyr::filter(comment_share <= 0.10) |>
  dplyr::summarise(value = max(cumulative_engagement_share, na.rm = TRUE)) |>
  dplyr::pull(value)

engagement_summary <- tibble::tibble(
  metric = c("Engagement Gini coefficient", "Share of engagement captured by top 1% of comments", "Share of engagement captured by top 10% of comments"),
  value = c(
    round(engagement_gini, 3),
    scales::percent(share_top_1_percent, accuracy = 0.1),
    scales::percent(share_top_10_percent, accuracy = 0.1)
  )
)

knitr::kable(engagement_summary, caption = "Engagement concentration summary")

Engagement concentration summary
metric	value
Engagement Gini coefficient	0.958
Share of engagement captured by top 1% of comments	52.2%
Share of engagement captured by top 10% of comments	95.8%

# ================================================================
# VISUALIZATION 5: ENGAGEMENT CONCENTRATION CURVE
# ================================================================

plot_engagement_concentration <- ggplot2::ggplot(engagement_ranked, ggplot2::aes(x = comment_share, y = cumulative_engagement_share)) +
  ggplot2::geom_line(linewidth = 1.2, color = hemp_green_dark) +
  ggplot2::geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = hemp_gray) +
  ggplot2::scale_x_continuous(labels = scales::percent_format()) +
  ggplot2::scale_y_continuous(labels = scales::percent_format()) +
  ggplot2::labs(
    title = "Engagement Concentration Across Comments",
    subtitle = "The curve shows whether a small share of comments captured most likes and replies",
    x = "Share of comments, ranked from most engaged to least engaged",
    y = "Cumulative share of total engagement",
    caption = "Dashed line = perfectly even distribution. Curved line above it = concentrated engagement."
  ) +
  theme_lab()

save_report_plot(plot_engagement_concentration, "05_engagement_concentration_curve.png")
plot_engagement_concentration

11.2 How to interpret it

The dashed diagonal line represents a perfectly even world where 10% of comments would receive 10% of engagement, 50% of comments would receive 50% of engagement, and so on. If the actual curve rises above that dashed line, then engagement is concentrated among the most visible comments. In this sample the concentration is steep: the top 1% of comments captured 52.2% of engagement, the top 10% captured 95.8%, and the Gini coefficient was 0.958. This matters because social media perception is often shaped by the comments that receive the most likes and replies, not by the average comment.
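The `gini_coefficient()` helper used above is defined earlier in the report and is not shown in this section. For readers who want to see how such a measure can be computed, here is a hedged base-R sketch using the standard sorted-rank formula; `gini_sketch()` and `top_share_sketch()` are hypothetical names, and the report's own helper may handle edge cases differently.

```r
# Hypothetical Gini sketch: 0 means perfectly even, values near 1 mean
# that almost all engagement sits on a few comments.
gini_sketch <- function(x) {
  x <- sort(x[!is.na(x)])
  n <- length(x)
  total <- sum(x)
  if (n == 0 || total == 0) {
    return(NA_real_)
  }
  (2 * sum(seq_len(n) * x)) / (n * total) - (n + 1) / n
}

# Hypothetical helper: share of total engagement held by the top fraction
# of comments, ranked from most engaged to least engaged.
top_share_sketch <- function(x, top_fraction) {
  x <- sort(x[!is.na(x)], decreasing = TRUE)
  k <- floor(length(x) * top_fraction)
  if (k < 1) {
    return(NA_real_)
  }
  sum(x[seq_len(k)]) / sum(x)
}

# Toy data: ten equally liked comments versus one comment holding everything.
even_engagement <- rep(5, 10)
skewed_engagement <- c(rep(0, 9), 100)

gini_sketch(even_engagement)
gini_sketch(skewed_engagement)
top_share_sketch(skewed_engagement, 0.10)
```

The even toy case sits on the dashed diagonal, while the skewed case bends sharply away from it, which is the same pattern the concentration curve shows at a larger scale.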

12 Signal 6: Does Comment Style Affect Attention?

12.1 Goal

This chart answers: Do shorter or longer comments appear to attract more engagement?

The y-axis uses a log scale because engagement counts (likes plus replies) are usually highly skewed. Without that scaling, a few extremely popular comments would flatten most of the chart.
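A short sketch shows why the chart plots engagement plus 1 rather than raw engagement. The values below are illustrative (the largest matches the highest total engagement in the report's top-comments table), and the object names are assumptions for this example.

```r
# Illustrative engagement values, including a comment with zero engagement.
engagement <- c(0, 1, 9, 99, 56835)

# The logarithm of zero is undefined (R returns -Inf), so a zero-engagement
# comment could not be placed on a log axis.
raw_log <- log10(engagement)

# Adding 1 first keeps every comment on the chart: zero engagement maps to 0.
shifted_log <- log10(engagement + 1)

raw_log
shifted_log
```

The shift barely changes large values but rescues the many comments with no likes or replies, which would otherwise be dropped from the plot.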

# ================================================================
# VISUALIZATION 6: COMMENT LENGTH VS ENGAGEMENT
# ================================================================

comments_length_engagement <- comments_clean |>
  dplyr::mutate(
    total_engagement = dplyr::coalesce(like_count, 0) + dplyr::coalesce(reply_count, 0),
    engagement_plus_one = total_engagement + 1
  ) |>
  dplyr::filter(!is.na(comment_length_words), comment_length_words > 0)

plot_length_vs_engagement <- ggplot2::ggplot(comments_length_engagement, ggplot2::aes(x = comment_length_words, y = engagement_plus_one)) +
  ggplot2::geom_point(alpha = 0.45, color = hemp_green_dark) +
  ggplot2::geom_smooth(method = "loess", se = FALSE, color = hemp_gold, linewidth = 1.1) +
  ggplot2::scale_y_log10(labels = scales::comma_format()) +
  ggplot2::labs(
    title = "Comment Length vs. Engagement",
    subtitle = "Testing whether short punchy comments or longer comments attracted more attention",
    x = "Comment length in words",
    y = "Total engagement plus 1, log scale",
    caption = "Total engagement = likes + replies. Log scale improves readability for skewed engagement data."
  ) +
  theme_lab()

save_report_plot(plot_length_vs_engagement, "06_comment_length_vs_engagement.png")
plot_length_vs_engagement

12.2 How to interpret it

Each point is one comment. Points higher on the chart received more likes and replies. The smooth trend line shows the general relationship between comment length and engagement. If the line rises, longer comments tended to receive more engagement; if it falls, shorter comments tended to perform better; if it is mostly flat, length alone does not explain engagement.

13 Signal 7: Time Patterns in the Crowd

13.1 Goal

This chart answers: When were comments posted by day of week and hour of day?

This is a non-standard visualization for a short lab because it treats comments as behavioral time-stamped events, not just text. It can reveal whether activity clusters around certain hours or days.

# ================================================================
# VISUALIZATION 7: DAY-OF-WEEK BY HOUR HEATMAP
# ================================================================

comment_time_heatmap <- comments_clean |>
  dplyr::filter(!is.na(published_at)) |>
  dplyr::mutate(
    day_of_week = lubridate::wday(published_at, label = TRUE, abbr = TRUE, week_start = 1),
    hour_of_day = lubridate::hour(published_at)
  ) |>
  dplyr::count(day_of_week, hour_of_day, name = "comments_posted")

plot_calendar_heatmap <- ggplot2::ggplot(comment_time_heatmap, ggplot2::aes(x = hour_of_day, y = day_of_week, fill = comments_posted)) +
  ggplot2::geom_tile(color = hemp_cream) +
  ggplot2::scale_x_continuous(breaks = seq(0, 23, by = 3)) +
  ggplot2::scale_fill_gradient(low = "#F3F1DE", high = hemp_green_dark, labels = scales::comma_format()) +
  ggplot2::labs(
    title = "When Did Viewers Comment?",
    subtitle = "Heatmap of collected comments by day of week and hour of day",
    x = "Hour of day, UTC timestamp from YouTube API",
    y = "Day of week",
    fill = "Comments",
    caption = "Timestamps use YouTube API time values and may not match each viewer's local time zone."
  ) +
  theme_lab()

save_report_plot(plot_calendar_heatmap, "07_comment_timing_heatmap.png")
plot_calendar_heatmap

13.2 How to interpret it

Darker cells represent more comments. If the heatmap has clear dark bands, comments are clustered during specific hours or days. If activity is more evenly distributed, discussion was less time-concentrated. Because the timestamps are in UTC rather than each viewer's local time zone, this chart should be read as a UTC posting-time pattern rather than a precise audience schedule.
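To see how much the hour can shift, the sketch below converts one timestamp into a few local zones using base R. It assumes the timestamp is in UTC, as the chart's axis label states; the three zones are arbitrary examples, not inferred viewer locations.

```r
# One comment timestamp, assumed to be in UTC.
utc_time <- as.POSIXct("2026-03-18 17:29:56", tz = "UTC")

# Arbitrary example zones; the data contains no viewer time zone.
local_zones <- c("America/New_York", "Europe/Berlin", "Australia/Sydney")

# Hour of day the same instant would show on a local clock in each zone.
local_hours <- vapply(
  local_zones,
  function(zone) as.integer(format(utc_time, "%H", tz = zone)),
  integer(1)
)
local_hours
```

The same comment lands in a different heatmap column depending on the zone, which is why the hourly bands describe UTC posting time and not a shared audience routine.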

14 Signal 8: Places Mentioned, Not Places Measured

14.1 Goal

This section answers: Which places are named inside the comments themselves?

The YouTube comment data used in this report does not provide a reliable viewer state or country field. The official YouTube commentThread resource documents fields such as thread ID, video/channel IDs, top-level comment details, reply count, public visibility, and a limited replies object, but it does not provide commenter geography. Therefore, this section is only a location-reference-in-text proxy: it counts places that commenters wrote inside comments. It should not be interpreted as where commenters are actually from.

# ================================================================
# LOCATION-MENTION DICTIONARY
# ================================================================
# This is a cautious proxy analysis. It only counts explicit state/country
# names written in the text. It does not infer hidden viewer location.

us_state_names <- tibble::tibble(
  location = c(
    "alabama", "alaska", "arizona", "arkansas", "california", "colorado", "connecticut",
    "delaware", "florida", "georgia", "hawaii", "idaho", "illinois", "indiana", "iowa",
    "kansas", "kentucky", "louisiana", "maine", "maryland", "massachusetts", "michigan",
    "minnesota", "mississippi", "missouri", "montana", "nebraska", "nevada",
    "new hampshire", "new jersey", "new mexico", "new york", "north carolina",
    "north dakota", "ohio", "oklahoma", "oregon", "pennsylvania", "rhode island",
    "south carolina", "south dakota", "tennessee", "texas", "utah", "vermont",
    "virginia", "washington", "west virginia", "wisconsin", "wyoming"
  ),
  location_type = "U.S. state name"
)

country_names <- tibble::tibble(
  location = c(
    "america", "united states", "usa", "canada", "mexico", "england", "ireland", "scotland",
    "wales", "france", "germany", "italy", "spain", "australia", "new zealand", "brazil",
    "india", "china", "japan", "ukraine", "russia", "poland", "netherlands", "sweden",
    "norway", "denmark", "finland", "south africa"
  ),
  location_type = "Country or country reference"
)

location_dictionary <- dplyr::bind_rows(us_state_names, country_names) |>
  dplyr::distinct(location, location_type)

extract_location_mentions <- function(data, dictionary) {
  location_rows <- lapply(seq_len(nrow(dictionary)), function(i) {
    location_value <- dictionary$location[i]
    location_type_value <- dictionary$location_type[i]
    pattern_value <- paste0("\\b", stringr::str_replace_all(location_value, " ", "\\\\s+"), "\\b")

    data |>
      dplyr::filter(stringr::str_detect(text_clean, stringr::regex(pattern_value, ignore_case = TRUE))) |>
      dplyr::transmute(
        comment_id = comment_id,
        location = stringr::str_to_title(location_value),
        location_type = location_type_value
      )
  })

  dplyr::bind_rows(location_rows) |>
    dplyr::distinct(comment_id, location, .keep_all = TRUE)
}

location_mentions <- extract_location_mentions(comments_clean, location_dictionary)

location_counts <- location_mentions |>
  dplyr::count(location_type, location, sort = TRUE)

if (nrow(location_counts) > 0) {
  knitr::kable(
    location_counts |>
      dplyr::slice_head(n = 20),
    caption = "Top explicit location references found inside comment text",
    align = c("l", "l", "r")
  )
} else {
  knitr::kable(
    tibble::tibble(
      Result = "No state or country names from the dictionary were detected in the cleaned comments.",
      Interpretation = "This does not mean viewers came from no regions; it only means commenters did not explicitly write recognizable place names in the sampled text."
    ),
    caption = "Location-mention proxy result"
  )
}

Top explicit location references found inside comment text
location_type	location	n
Country or country reference	America	18
U.S. state name	Ohio	7
Country or country reference	Germany	5
Country or country reference	Usa	5
Country or country reference	France	4
Country or country reference	South Africa	4
Country or country reference	Ireland	3
U.S. state name	Mississippi	3
U.S. state name	Utah	3
Country or country reference	Australia	2
Country or country reference	Canada	2
U.S. state name	California	2
Country or country reference	Brazil	1
Country or country reference	Denmark	1
Country or country reference	Mexico	1
Country or country reference	Poland	1
U.S. state name	Arkansas	1
U.S. state name	Delaware	1
U.S. state name	Missouri	1
U.S. state name	Nebraska	1

# ================================================================
# VISUALIZATION 8: LOCATION-MENTION PROXY CHART
# ================================================================

if (nrow(location_counts) > 0) {
  top_location_counts <- location_counts |>
    dplyr::slice_max(n, n = min(15, nrow(location_counts))) |>
    dplyr::arrange(n)

  plot_location_mentions <- ggplot2::ggplot(
    top_location_counts,
    ggplot2::aes(x = n, y = forcats::fct_reorder(location, n), fill = location_type)
  ) +
    ggplot2::geom_col() +
    ggplot2::scale_fill_manual(values = c(hemp_green_dark, hemp_gold, hemp_sage)) +
    ggplot2::labs(
      title = "Locations Named Inside Comment Text",
      subtitle = "This counts written location references, not where commenters are from",
      x = "Number of comment mentions",
      y = "Mentioned location",
      fill = "Location type",
      caption = "This is a text-reference proxy using a small state/country dictionary; it is not viewer geography."
    ) +
    theme_lab()

  save_report_plot(plot_location_mentions, "08_location_mentions_proxy.png")
  plot_location_mentions
} else {
  plot_location_mentions <- ggplot2::ggplot() +
    ggplot2::annotate(
      "text",
      x = 0,
      y = 0,
      label = "No explicit state or country mentions were detected.\nYouTube comments do not provide viewer geography in this dataset.",
      size = 5,
      color = hemp_green_dark
    ) +
    ggplot2::xlim(-1, 1) +
    ggplot2::ylim(-1, 1) +
    ggplot2::labs(
      title = "Location References in Comment Text",
      subtitle = "No detected state or country names in sampled comment text",
      x = NULL,
      y = NULL,
      caption = "This is not evidence of where viewers are located; it only reflects text mentions."
    ) +
    theme_lab() +
    ggplot2::theme(axis.text = ggplot2::element_blank(), panel.grid = ggplot2::element_blank())

  save_report_plot(plot_location_mentions, "08_location_mentions_proxy.png")
  plot_location_mentions
}

14.2 How to interpret it

If location names appear, this chart shows places that were explicitly written inside comments, such as states, countries, or broad national references. This can be useful for identifying rhetorical references like “in America,” “from Texas,” or “here in Canada,” but it is not evidence of where commenters live or where viewers are located. A proper regional audience analysis would require data that YouTube comments do not expose in this API response, such as viewer analytics from the channel owner or another source with location metadata.
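The word-boundary matching used for location names has useful strengths and one overlap to keep in mind. The sketch below mirrors the pattern-building idea in base R; `build_location_pattern()` and `mentions_location()` are hypothetical helpers for illustration, not the report's `extract_location_mentions()` function.

```r
# Hypothetical helper: wrap a place name in word boundaries and allow any
# run of whitespace between the words of a multi-word name.
build_location_pattern <- function(location) {
  paste0("\\b", gsub(" ", "\\\\s+", location), "\\b")
}

# Hypothetical helper: does the text contain the place name as whole words?
mentions_location <- function(text, location) {
  grepl(build_location_pattern(location), text, ignore.case = TRUE, perl = TRUE)
}

# Word boundaries stop "india" from matching inside "indiana".
india_in_indiana <- mentions_location("greetings from indiana", "india")

# Flexible whitespace and case-insensitivity catch loosely typed names.
new_york_loose <- mentions_location("love from New  York", "new york")

# Overlap caveat: a comment naming "west virginia" also matches "virginia".
virginia_overlap <- mentions_location("west virginia stand up", "virginia")

c(india_in_indiana, new_york_loose, virginia_overlap)
```

The last case suggests that counts for names contained in longer names (and ambiguous names such as Georgia or Washington) should be read with extra caution.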

15 Signal 9: The Comments People Lifted Up

15.1 Goal

This table answers: Which comments actually captured the most visible attention?

Word frequency describes the overall conversation, but highly engaged comments often shape what later viewers see first. This table helps connect quantitative engagement metrics back to readable audience language.

# ================================================================
# TOP ENGAGED COMMENTS TABLE
# ================================================================

top_engaged_comments <- comments_clean |>
  dplyr::mutate(
    total_engagement = dplyr::coalesce(like_count, 0) + dplyr::coalesce(reply_count, 0),
    comment_preview = stringr::str_trunc(text_original, width = 150)
  ) |>
  dplyr::arrange(dplyr::desc(total_engagement)) |>
  dplyr::select(author, comment_preview, like_count, reply_count, total_engagement, published_at) |>
  dplyr::slice_head(n = 10)

knitr::kable(top_engaged_comments, caption = "Top 10 comments by total engagement")

Top 10 comments by total engagement
author	comment_preview	like_count	reply_count	total_engagement	published_at
@mulgwisin	These diss tracks against individual cops are the most gangster shit I’ve ever seen lmao	56587	248	56835	2026-03-18 17:29:56
@AP-kg9dz	Songs clowning on corrupt cops is my new favorite genre.	34228	146	34374	2026-03-20 15:51:20
@slayanddecay6009	Remember Randy is ON record that he cannot confirm that Afroman did or did not have sex with his wife	30273	212	30485	2026-03-19 18:55:30
@Whiston555	Writing a song about fucking his wife and then him having to say for the record that he wasn’t sure it didn’t happen is an all time move.	25758	135	25893	2026-03-19 06:02:40
@Somename1010	“Make memes of your enemies until they cry then make memes of them crying” Sun Tzu probably	22652	97	22749	2026-03-20 05:36:19
@kikusui8881	AFROMAN SUPERBOWL HALFTIME SHOW	18962	153	19115	2026-03-21 06:05:05
@clikzip	Afroman making the whole second half of his career off of that raid 😂	14165	88	14253	2026-03-16 13:23:32
@lawrencium2652	He wore the suit in court today 😂	14106	50	14156	2026-03-16 21:23:41
@prod.mordihi4772	It’s a rappers dreams to have their dis track played in court in front of the person they are dissing and win 😂🎉	13772	53	13825	2026-03-19 19:01:21
@CornyAir	I hope he wins his case against these corrupt mf cops. Edit: HE WON, LET’S GOO FREE SPEECH	12490	192	12682	2026-03-16 12:50:58

15.2 How to interpret it

This table should be read as a visibility check. These comments are not necessarily representative of the average commenter, but they are the comments that received the most measurable attention in the sample. In a real marketing or public-affairs dashboard, these high-engagement comments would be useful for identifying repeated slogans, jokes, accusations, or support statements that may shape broader audience perception.

16 Integrated interpretation of findings

16.1 Main interpretation

This analysis used 1,051 cleaned YouTube comments from 997 unique authors, covering comments from March 16, 2026 through June 22, 2026. The most frequent cleaned terms were afroman, song, love, randy, cops, court, walters, and shit, which suggests that the audience conversation centered on a mix of people, institutions, legal conflict, humor, and accountability. The phrase-level view sharpened that interpretation: the most common bigrams included randy walters, pound cake, adams county, lemon pound, and god bless, showing that many viewers repeated recognizable names, concepts, or story frames rather than only using disconnected single words.

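The theme-coded analysis suggests that the strongest manually grouped frame was Artist / identity (254 mentions), followed by Music / humor (224) and Police / corruption (205). This does not mean the entire comment section held one unified opinion, but it does show that repeated vocabulary clustered around a recognizable audience frame. Because the Artist / identity group includes broad words such as man, its lead may partly reflect general vocabulary rather than discussion of the artist. For a business analytics or public-affairs use case, this is useful because it translates raw comments into a practical question: what story is the audience collectively building around the event?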

The engagement visuals add a second layer of interpretation. The concentration curve and top-engaged-comments table show that attention is not only about how many comments exist, but also about which comments become highly visible through likes and replies. This matters because social media conversations are often shaped by a small number of highly engaged comments, much like a few loud signal fires lighting up the map before the rest of the army knows where the battle is.

16.2 Limitations

This analysis should be interpreted as an exploratory, platform-specific signal rather than a complete measurement of public opinion. The YouTube Data API may not return every visible comment, every nested reply, or comments filtered by YouTube moderation systems. The analysis also uses word frequency and manually defined theme groups, which are useful for discovering repeated language but do not fully capture sarcasm, irony, disagreement, or the intent behind a comment. For a larger final project, this workflow should be combined with other sources such as Google Trends, Bluesky posts, news mentions, or additional YouTube videos.

17 Screenshots or code snippets showing data collection

The data collection process is documented through the code chunks above. For the final submitted report, screenshots can be added showing the Google Cloud API setup, the successful comment collection output, or the saved CSV file.

18 Appendix A - Optional OAuth / tuber workflow

This section is included for alignment with the course tutorial. It is not the main workflow used in this polished report because Posit Cloud can have trouble with local browser redirect authentication.

18.1 Credential placement for OAuth

OAuth item	R object
OAuth Client ID / App ID	`app_id`
OAuth Client Secret / App Secret	`app_secret`
API key	Not used in this OAuth chunk

# Optional OAuth path using tuber.
# Do not publish real Client ID or Client Secret.
# Use this only if OAuth authentication is working in your R environment.

library(tuber)

app_id <- "YOUR_CLIENT_ID.apps.googleusercontent.com"
app_secret <- "YOUR_CLIENT_SECRET"

tuber::yt_oauth(app_id, app_secret)
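One common way to keep credentials out of a published report is to read them from environment variables instead of typing them into the script. The sketch below is a hedged example: `read_required_env()` and the variable names `YOUTUBE_CLIENT_ID` and `YOUTUBE_CLIENT_SECRET` are assumptions for illustration, not names used elsewhere in this report.

```r
# Hypothetical helper: fetch a required environment variable and fail with a
# clear message if it is missing, so an empty secret is never passed along.
read_required_env <- function(name) {
  value <- Sys.getenv(name, unset = "")
  if (!nzchar(value)) {
    stop("Environment variable ", name, " is not set.", call. = FALSE)
  }
  value
}

# Example wiring, commented out so the sketch runs without real credentials.
# The variable names are assumptions; set them in .Renviron or the session.
# app_id <- read_required_env("YOUTUBE_CLIENT_ID")
# app_secret <- read_required_env("YOUTUBE_CLIENT_SECRET")
# tuber::yt_oauth(app_id, app_secret)
```

With this pattern the rendered report shows only the variable names, never the secret values themselves.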

# Optional tuber scrape after successful OAuth.
# This can return top-level comments and replies depending on video settings and API availability.

comments_raw_tuber <- tuber::get_all_comments(video_id = video_id)

comments_tuber_clean <- comments_raw_tuber |>
  tibble::as_tibble() |>
  dplyr::distinct(id, .keep_all = TRUE) |>
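  dplyr::transmute(
    comment_id = id,
    author = authorDisplayName,
    text = textOriginal,
    published_at = publishedAt,
    # Coerce to numeric in case the wrapper returns counts as text.
    like_count = suppressWarnings(as.numeric(likeCount)),
    # Reply counts are not mapped in this optional path.
    reply_count = NA_real_,
    # .env$ makes it explicit that video_id is the project setting, not a column.
    video_id = .env$video_id,
    source_method = "tuber::get_all_comments"
  ) |>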
  clean_comment_data()

readr::write_csv(comments_tuber_clean, paste0("youtube_comments_tuber_", video_id, ".csv"))

19 References and base requirements map

The table below is included to make the report easier to review. It is not intended as a grading claim; it simply points to where the core requested elements appear in the report.

Base assignment element	Where it appears in this report	Notes
Five-sentence reflection on one Week 4 reading	Part 1 - Five-sentence reading reflection	Reflection is based on Brooks (2026).
Use YouTube or Bluesky API data	Part 2 - Data collection and text analysis and Source summary	This report uses the YouTube Data API v3 comment workflow with saved CSV caching.
Gather at least 100 comments/posts, or as many as available	Source and dataset summary	The cleaned dataset contains more than 100 comments.
Clean and prepare text data	Text preparation and Example of a cleaned comment	Cleaning includes lowercasing, removing links/punctuation, trimming whitespace, and removing stop words for token analysis.
Perform word frequency / common-term analysis	Tokenize words and remove stop words and Signal 1	Word counts are shown as a table and lollipop chart.
Include at least one text-analysis visualization	Visualization guide through Signal 9	The report includes multiple visualizations, including word frequency, bigrams, themes, timing, engagement, and location references in text.
Briefly discuss key themes or insights	Integrated interpretation of findings	The interpretation connects word counts, phrases, themes, and engagement patterns.
Use external sources to support the argument	References below	References include course readings, YouTube API documentation, and text-mining resources.
Show data collection process through code snippets or screenshots	Credential notes, Collect or load YouTube comments, and Appendix A	Code is folded by default but available to expand in the HTML report.

19.1 References

Brooks, T. L. (2026). Measuring the Diffusion Speed of Negative Sentiment on Social Media: Insights for Companies Responding to Consumer Backlash.

Campan, A., & Holtke, N. (2024). Beyond Twitter: Exploring Alternative API Sources for Social Media Analytics.

Google Developers. (n.d.). YouTube Data API v3 Reference. https://developers.google.com/youtube/v3

Google Developers. (n.d.). YouTube Embedded Players and Player Parameters. https://developers.google.com/youtube/player_parameters

Silge, J., & Robinson, D. (2017). Text Mining with R: A Tidy Approach. O’Reilly Media. https://www.tidytextmining.com/

Sood, G. (n.d.). tuber: Access YouTube from R. https://soodoku.github.io/tuber/

Wickham, H., François, R., Henry, L., Müller, K., & Vaughan, D. (n.d.). dplyr: A Grammar of Data Manipulation. https://dplyr.tidyverse.org/

Xu, J. Z. (2026). Scraping Comments from a SpaceX YouTube Video Using R: A Step-by-Step Tutorial Using the tuber Package and the YouTube Data API v3.

Lab 4 - YouTube Comment Word Frequency Analysis

Is Randy Walters a Son of a Bitch (according to Randy Walters is a Son of a Bitch YouTube comments analysis)

Shep feat. Prof Jimmy Basecode Remix Collab ’26

2026-06-24
