1 Lab purpose

Goal: This report collects and analyzes public YouTube comments from a video related to public reaction, legal conflict, and online audience response. The assignment requirement is to collect at least 100 comments or posts, clean the text, perform word frequency analysis, create at least one visualization, and interpret the findings.

This polished build goes beyond the minimum requirement by adding several audience analytics views:

  • Word frequency: What words appear most often?
  • Phrase analysis: What repeated two-word phrases reveal the main narrative frames?
  • Theme coding: How can key words be grouped into interpretable audience themes?
  • Timeline analysis: When did comment activity appear strongest?
  • Engagement concentration: Are likes spread evenly, or do a few comments dominate attention?
  • Comment length versus engagement: Do short or long comments attract more likes?
  • Calendar heatmap: What time/day patterns appear in the collected comments?
  • Location references in text: Do commenters explicitly write place names, and can those mentions be counted as a cautious proxy (references in text, not viewer geography)?

Graphics export: All report graphics are automatically saved as PNG files in the lab4_graphics/ folder with descriptive names. This makes the visual outputs reusable for a slide deck, portfolio page, README, or final-project documentation.

1. Watch the source.
Use the sticky video for context before reading the audience signals.
2. Scan the language.
Words, phrases, and themes show what the crowd kept returning to.
3. Follow the attention.
Engagement charts show which comments carried the most visible reaction.

Project process map: from source video to audience signals

This visual roadmap frames the workflow before the report moves into the charts: collect the comments, clean the signal, visualize repeated language, follow engagement, and tell the story carefully.

[Figure: Infographic process map for the Afroman YouTube comment analysis project]

Selected video URL: https://www.youtube.com/watch?v=u4AiuqQpB1U
Selected video ID: u4AiuqQpB1U
Primary method: YouTube Data API v3 using an API-key workflow, with local CSV caching for reproducibility.
Current safe workflow: This version is designed to knit from the saved CSV after the API key has been removed.

2 Part 1 - Five-sentence reading reflection


Brooks (2026) was useful because it connects social media analytics to a practical managerial problem: how quickly organizations should respond when negative public reaction begins spreading online. The reading shows that user-generated posts can be treated as more than casual conversation because they can reveal the timing, intensity, and persistence of public backlash. I found the distinction between negative and non-negative post behavior especially important because it suggests that public communication strategies should not assume all social media attention behaves the same way. For my final Executive Order Signal Impact Dashboard, this reading supports the idea that online discussion volume and word choice can serve as early signals of public awareness, confusion, disagreement, or support. The reading also highlights an important limitation for my own work: conclusions from one platform or one event should be presented carefully and treated as evidence from a specific data source rather than universal proof.

3 Part 2 - Data collection and text analysis

3.1 1. Packages and reproducibility setup

# ================================================================
# PACKAGE SETUP
# ================================================================
# This chunk installs missing packages and loads everything needed.
# The package list is intentionally limited to common packages that work
# reliably in Posit Cloud / RStudio Cloud.

required_packages <- c(
  "dplyr",
  "forcats",
  "ggplot2",
  "httr",
  "jsonlite",
  "knitr",
  "lubridate",
  "readr",
  "scales",
  "stringr",
  "tibble",
  "tidyr",
  "tidytext"
)

missing_packages <- required_packages[!(required_packages %in% rownames(installed.packages()))]

if (length(missing_packages) > 0) {
  install.packages(missing_packages)
}

invisible(lapply(required_packages, library, character.only = TRUE))

3.2 2. Credential notes

Do not publish your real API key. This build is intended to knit from saved CSV files after collection is complete. If the CSV files are missing and you need to recollect data, set the API key in the Console or temporarily in the chunk below, then remove it before publishing.

Run this once in the Console, not in the published report, if you need to recollect data:

# Replace the placeholder with your real YouTube Data API key.
# This key often starts with AIza...
# Do not publish the real key to RPubs, GitHub, Canvas comments, or screenshots.

Sys.setenv(YOUTUBE_API_KEY = "PASTE_YOUR_YOUTUBE_API_KEY_HERE")
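
Before recollecting, it can help to confirm that a key is set without echoing it into the knitted output. The sketch below is one possible check. The helper name api_key_status() is hypothetical and not part of the original workflow, and it assumes the key lives in the YOUTUBE_API_KEY environment variable used above.

```r
# Hypothetical helper (not in the original report): describe the state of the
# API key without ever printing the key itself.
api_key_status <- function(var = "YOUTUBE_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    return("missing")
  }
  # The placeholder string from the chunk above is not a usable key.
  if (identical(key, "PASTE_YOUR_YOUTUBE_API_KEY_HERE")) {
    return("placeholder")
  }
  "set"
}

api_key_status()
```

A result of "missing" or "placeholder" still allows the report to knit from the saved CSV files, but a fresh API collection would be expected to fail.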

Credential decoder:

| Credential | Usually looks like | Used in this main workflow? | Where it goes |
|---|---|---|---|
| YouTube Data API key | Starts with AIza... | Yes, only if recollecting | Sys.setenv(YOUTUBE_API_KEY = "...") |
| OAuth Client ID / App ID | Ends with .apps.googleusercontent.com | No, appendix only | app_id <- "..." |
| OAuth Client Secret / App Secret | Often starts with GOCSPX-... | No, appendix only | app_secret <- "..." |
| Google Cloud project name / ID | Project label or slug | No | Does not go in R |

3.3 3. Project settings

# ================================================================
# PROJECT SETTINGS
# ================================================================
# Store reusable settings in one place. For a future project, change only
# these values and rerun the same workflow.

video_id <- "u4AiuqQpB1U"
video_url <- paste0("https://www.youtube.com/watch?v=", video_id)

# The assignment requires 100+ comments where available. This high cap was used
# during the collection pass. The API may return fewer comments than the public
# YouTube page displays because of endpoint limits, unavailable comments, replies,
# spam filtering, or pagination behavior.
max_comments_to_collect <- 25000

# During the full collection pass, the API initially returned 1,105 accessible
# comment records. After text cleaning and deduplication, the analysis file
# retained the final cleaned count reported below.
raw_comments_initial_collection_count <- 1105

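raw_comments_file <- paste0("youtube_comments_raw_", video_id, ".csv")
clean_comments_file <- paste0("youtube_comments_clean_", video_id, ".csv")

# Folder for exported PNG graphics. save_report_plot() expects graphics_dir
# to be defined and the folder to exist before any chart is saved.
graphics_dir <- "lab4_graphics"
dir.create(graphics_dir, showWarnings = FALSE, recursive = TRUE)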

3.4 4. Helper functions for collection and analysis

# ================================================================
# HELPER FUNCTIONS
# ================================================================

# get_col() safely extracts a column from a data frame.
# This avoids hard crashes if the API returns a slightly different structure.
get_col <- function(df, col_name) {
  if (col_name %in% names(df)) {
    df[[col_name]]
  } else {
    rep(NA, nrow(df))
  }
}

# get_youtube_video_metadata() collects basic video information.
# This is useful for documenting the source in the report.
get_youtube_video_metadata <- function(video_id, api_key) {
  response <- httr::GET(
    "https://www.googleapis.com/youtube/v3/videos",
    query = list(
      part = "snippet,statistics",
      id = video_id,
      key = api_key
    )
  )

  if (httr::status_code(response) != 200) {
    warning("Video metadata request failed. The report can continue if comments are available.")
    return(tibble::tibble())
  }

  parsed <- jsonlite::fromJSON(
    httr::content(response, as = "text", encoding = "UTF-8"),
    flatten = TRUE
  )

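  # NROW() returns 0 for the empty list() that fromJSON gives when no items come back,
  # whereas nrow() would return NULL and break the condition.
  if (!"items" %in% names(parsed) || NROW(parsed$items) == 0) {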
    return(tibble::tibble())
  }

  items <- tibble::as_tibble(parsed$items)

  tibble::tibble(
    video_id = video_id,
    title = get_col(items, "snippet.title"),
    channel_title = get_col(items, "snippet.channelTitle"),
    published_at = get_col(items, "snippet.publishedAt"),
    view_count = suppressWarnings(as.numeric(get_col(items, "statistics.viewCount"))),
    comment_count_visible_on_youtube = suppressWarnings(as.numeric(get_col(items, "statistics.commentCount")))
  )
}

# get_youtube_comments_api_key() collects top-level public comments from a YouTube video.
# It uses pagination through nextPageToken until it reaches max_comments or the API ends.
# Note: commentThreads captures top-level comment threads and reply counts. It may not
# return every visible nested reply on YouTube.
get_youtube_comments_api_key <- function(video_id, api_key, max_comments = 25000) {
  if (nchar(api_key) == 0) {
    stop("API key is missing. Set Sys.setenv(YOUTUBE_API_KEY = 'your_key_here') or provide saved CSV files.")
  }

  base_url <- "https://www.googleapis.com/youtube/v3/commentThreads"
  all_comments <- list()
  page_token <- NULL

  repeat {
    query_params <- list(
      part = "snippet",
      videoId = video_id,
      key = api_key,
      maxResults = 100,
      textFormat = "plainText",
      order = "relevance"
    )

    if (!is.null(page_token)) {
      query_params$pageToken <- page_token
    }

    response <- httr::GET(base_url, query = query_params)

    if (httr::status_code(response) != 200) {
      error_text <- httr::content(response, as = "text", encoding = "UTF-8")
      stop(paste("YouTube API request failed:", httr::status_code(response), error_text))
    }

    parsed <- jsonlite::fromJSON(
      httr::content(response, as = "text", encoding = "UTF-8"),
      flatten = TRUE
    )

    # NROW() returns 0 for an empty list(), so an empty page ends the loop cleanly.
    if (!"items" %in% names(parsed) || NROW(parsed$items) == 0) {
      break
    }

    items <- tibble::as_tibble(parsed$items)

    page_comments <- tibble::tibble(
      comment_id = get_col(items, "id"),
      author = get_col(items, "snippet.topLevelComment.snippet.authorDisplayName"),
      text = get_col(items, "snippet.topLevelComment.snippet.textDisplay"),
      published_at = get_col(items, "snippet.topLevelComment.snippet.publishedAt"),
      like_count = suppressWarnings(as.numeric(get_col(items, "snippet.topLevelComment.snippet.likeCount"))),
      reply_count = suppressWarnings(as.numeric(get_col(items, "snippet.totalReplyCount"))),
      video_id = video_id,
      source_method = "YouTube Data API v3 commentThreads.list"
    )

    all_comments[[length(all_comments) + 1]] <- page_comments
    current_total <- nrow(dplyr::bind_rows(all_comments))

    if (current_total >= max_comments || is.null(parsed$nextPageToken)) {
      break
    }

    page_token <- parsed$nextPageToken
    Sys.sleep(0.25)
  }

  dplyr::bind_rows(all_comments) |>
    dplyr::distinct(comment_id, .keep_all = TRUE) |>
    dplyr::slice_head(n = max_comments)
}

# clean_comment_data() standardizes the raw YouTube data.
clean_comment_data <- function(comments_raw) {
  # Handle both raw API files and already-cleaned cache files.
  # Some files have text_original already; raw files usually only have text.
  comments_raw <- tibble::as_tibble(comments_raw)

  if (!"text_original" %in% names(comments_raw)) {
    comments_raw <- comments_raw |>
      dplyr::mutate(text_original = as.character(text))
  }

  if (!"reply_count" %in% names(comments_raw)) {
    comments_raw <- comments_raw |>
      dplyr::mutate(reply_count = 0)
  }

  comments_raw |>
    dplyr::mutate(
      text_original = as.character(text_original),
      text_clean = text_original |>
        stringr::str_to_lower() |>
        stringr::str_replace_all("http[s]?://\\S+", " ") |>
        stringr::str_replace_all("www\\.\\S+", " ") |>
        stringr::str_replace_all("[^a-z\\s]", " ") |>
        stringr::str_squish(),
      published_at = suppressWarnings(lubridate::ymd_hms(published_at)),
      like_count = suppressWarnings(as.numeric(like_count)),
      reply_count = suppressWarnings(as.numeric(reply_count)),
      comment_length_words = stringr::str_count(text_clean, "\\S+"),
      total_engagement = dplyr::coalesce(like_count, 0) + dplyr::coalesce(reply_count, 0)
    ) |>
    dplyr::filter(!is.na(text_clean), text_clean != "") |>
    dplyr::distinct(comment_id, .keep_all = TRUE)
}

# rolling_mean_right() calculates a simple rolling average without adding another package.
rolling_mean_right <- function(x, window = 7) {
  stats::filter(x, rep(1 / window, window), sides = 1)
}

# gini_coefficient() measures concentration. 0 = perfectly equal; values approach 1
# as the total concentrates in a few items (the maximum for n values is (n - 1) / n).
gini_coefficient <- function(x) {
  x <- x[!is.na(x)]
  x <- x[x >= 0]
  if (length(x) == 0 || sum(x) == 0) {
    return(NA_real_)
  }
  x <- sort(x)
  n <- length(x)
  (2 * sum(seq_len(n) * x)) / (n * sum(x)) - (n + 1) / n
}

# Reusable Afroman-inspired leafy palette for page and graph consistency.
hemp_green <- "#1F6B3A"
hemp_green_dark <- "#12351F"
hemp_sage <- "#70A83B"
hemp_light <- "#DDE7BE"
hemp_cream <- "#F7F1DD"
hemp_parchment <- "#EFE7C8"
hemp_soil <- "#6B4A2D"
hemp_gold <- "#D6A11F"
hemp_mint <- "#A7C957"
hemp_gray <- "#5F665A"

# A reusable visual theme keeps charts consistent.
theme_lab <- function() {
  ggplot2::theme_minimal(base_size = 12) +
    ggplot2::theme(
      plot.background = ggplot2::element_rect(fill = hemp_cream, color = NA),
      panel.background = ggplot2::element_rect(fill = hemp_cream, color = NA),
      plot.title = ggplot2::element_text(face = "bold", size = 15, color = hemp_green_dark),
      plot.subtitle = ggplot2::element_text(size = 11, color = hemp_soil),
      axis.title = ggplot2::element_text(face = "bold", color = hemp_green_dark),
      axis.text = ggplot2::element_text(color = "#2D3328"),
      panel.grid.major = ggplot2::element_line(color = "#E5E0C4"),
      panel.grid.minor = ggplot2::element_blank(),
      legend.position = "bottom",
      legend.title = ggplot2::element_text(face = "bold", color = hemp_green_dark),
      plot.caption = ggplot2::element_text(color = hemp_gray)
    )
}

# save_report_plot() writes each polished ggplot to the graphics folder with a clear filename.
# The plot still appears in the knitted report because the object is printed after saving.
save_report_plot <- function(plot_object, filename, width = 11, height = 6.5) {
  ggplot2::ggsave(
    filename = file.path(graphics_dir, filename),
    plot = plot_object,
    width = width,
    height = height,
    dpi = 300,
    bg = hemp_cream
  )
  invisible(plot_object)
}
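
To make the concentration measure concrete, here is a small worked example of the Gini helper at its two extremes. The function body is repeated from the chunk above so the example runs on its own, and the like counts are hypothetical values chosen only for illustration.

```r
# Same formula as gini_coefficient() in the helper chunk, repeated here so the
# example is self-contained.
gini_coefficient <- function(x) {
  x <- x[!is.na(x)]
  x <- x[x >= 0]
  if (length(x) == 0 || sum(x) == 0) {
    return(NA_real_)
  }
  x <- sort(x)
  n <- length(x)
  (2 * sum(seq_len(n) * x)) / (n * sum(x)) - (n + 1) / n
}

# Hypothetical like counts, not taken from the dataset.
equal_likes <- c(10, 10, 10, 10)       # every comment gets the same likes
concentrated_likes <- c(0, 0, 0, 40)   # one comment gets all the likes

gini_coefficient(equal_likes)          # 0
gini_coefficient(concentrated_likes)   # 0.75, the maximum for n = 4
```

With only four values the coefficient cannot reach 1, so small samples are best compared against the (n - 1) / n ceiling and not against 1.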

3.5 5. Collect or load YouTube comments

Current reproducible mode: If youtube_comments_clean_u4AiuqQpB1U.csv exists, this report loads that cleaned file and does not call the API. This makes the report safe to knit after the API key is removed.

# ================================================================
# LOAD CLEAN CSV, LOAD RAW CSV, OR COLLECT FROM API
# ================================================================
# Priority order:
# 1. Load saved clean CSV if available.
# 2. Else load raw CSV and clean it.
# 3. Else collect from the API if an API key is available.

api_key <- Sys.getenv("YOUTUBE_API_KEY")

if (file.exists(clean_comments_file)) {
  comments_clean <- readr::read_csv(clean_comments_file, show_col_types = FALSE) |>
    clean_comment_data()
  collection_status <- paste("Loaded existing clean CSV:", clean_comments_file)

} else if (file.exists(raw_comments_file)) {
  comments_raw <- readr::read_csv(raw_comments_file, show_col_types = FALSE)
  comments_clean <- clean_comment_data(comments_raw)
  readr::write_csv(comments_clean, clean_comments_file)
  collection_status <- paste("Loaded raw CSV, cleaned it, and saved:", clean_comments_file)

} else {
  comments_raw <- get_youtube_comments_api_key(
    video_id = video_id,
    api_key = api_key,
    max_comments = max_comments_to_collect
  )
  readr::write_csv(comments_raw, raw_comments_file)
  comments_clean <- clean_comment_data(comments_raw)
  readr::write_csv(comments_clean, clean_comments_file)
  collection_status <- paste("Collected from API and saved:", raw_comments_file, "and", clean_comments_file)
}

collection_status
## [1] "Loaded existing clean CSV: youtube_comments_clean_u4AiuqQpB1U.csv"
dplyr::glimpse(comments_clean)
## Rows: 1,051
## Columns: 12
## $ comment_id           <chr> "UgzxqV6MPR5RLmUBmal4AaABAg", "UgxUrfoE4p2mooMGNO…
## $ author               <chr> "@AP-kg9dz", "@kurtwpg", "@thedont2154", "@hidden…
## $ text                 <chr> "Songs clowning on corrupt cops is my new favorit…
## $ published_at         <dttm> 2026-03-20 15:51:20, 2026-03-26 02:44:02, 2026-0…
## $ like_count           <dbl> 34228, 4484, 11333, 233, 9066, 30273, 25758, 500,…
## $ reply_count          <dbl> 146, 24, 55, 9, 111, 212, 135, 3, 63, 29, 2, 97, …
## $ video_id             <chr> "u4AiuqQpB1U", "u4AiuqQpB1U", "u4AiuqQpB1U", "u4A…
## $ source_method        <chr> "YouTube Data API v3 commentThreads.list", "YouTu…
## $ text_original        <chr> "Songs clowning on corrupt cops is my new favorit…
## $ text_clean           <chr> "songs clowning on corrupt cops is my new favorit…
## $ comment_length_words <int> 10, 22, 13, 9, 22, 20, 30, 11, 32, 14, 14, 17, 37…
## $ total_engagement     <dbl> 34374, 4508, 11388, 242, 9177, 30485, 25893, 503,…
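
The pagination pattern inside get_youtube_comments_api_key() can be tried offline. The sketch below swaps the real API call for a mock pager, so fake_fetch_page() and collect_all_pages() are hypothetical names used only for illustration. The stop conditions mirror the ones in the helper: an empty page, a missing nextPageToken, or the comment cap.

```r
# Hypothetical stand-in for one API call: returns a page of ids plus a token
# for the next page, or NULL when no pages remain.
fake_fetch_page <- function(page_token = NULL, page_size = 100, total = 250) {
  start <- if (is.null(page_token)) 1 else as.integer(page_token)
  end <- min(start + page_size - 1, total)
  list(
    items = data.frame(comment_id = sprintf("c%03d", start:end)),
    nextPageToken = if (end < total) as.character(end + 1) else NULL
  )
}

# Keep requesting pages until the pager runs out or the cap is reached.
collect_all_pages <- function(fetch_page, max_comments = 25000) {
  pages <- list()
  collected <- 0
  page_token <- NULL

  repeat {
    page <- fetch_page(page_token)
    if (NROW(page$items) == 0) break

    pages[[length(pages) + 1]] <- page$items
    collected <- collected + nrow(page$items)

    if (collected >= max_comments || is.null(page$nextPageToken)) break
    page_token <- page$nextPageToken
  }

  if (length(pages) == 0) {
    return(data.frame(comment_id = character(0)))
  }
  out <- do.call(rbind, pages)
  out <- out[!duplicated(out$comment_id), , drop = FALSE]
  head(out, max_comments)
}

nrow(collect_all_pages(fake_fetch_page))                      # 250
nrow(collect_all_pages(fake_fetch_page, max_comments = 150))  # 150
```

Tracking a running count avoids re-binding every page on each iteration, which is a small efficiency choice and does not change the result.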

4 Source and dataset summary

Reader note: On desktop, the table of contents is shortened and scrollable so the source video can stay open beneath it with breathing room, rather than covering the report text. On narrower screens, the video returns to the normal report flow.

4.1 6. Source summary

# Metadata is optional. If no API key is present, the report still proceeds.

if (nchar(api_key) > 0) {
  video_metadata <- get_youtube_video_metadata(video_id, api_key)
} else {
  video_metadata <- tibble::tibble()
}

source_summary_table <- tibble::tibble(
  Field = c(
    "Video ID",
    "Video URL",
    "Collection method",
    "Primary data type",
    "API metadata status",
    "Reproducible cache file"
  ),
  Value = c(
    video_id,
    paste0("<a href='", video_url, "'>", video_url, "</a>"),
    "YouTube Data API v3 commentThreads.list using API-key collection, then saved CSV cache",
    "Public top-level YouTube comment records returned by the API",
    ifelse(nrow(video_metadata) > 0, "Available in this knit", "Unavailable in this knit because no API key was used"),
    clean_comments_file
  )
)

knitr::kable(
  source_summary_table,
  caption = "Source summary for the YouTube comment analysis",
  escape = FALSE,
  align = c("l", "l")
)
Source summary for the YouTube comment analysis

| Field | Value |
|---|---|
| Video ID | u4AiuqQpB1U |
| Video URL | https://www.youtube.com/watch?v=u4AiuqQpB1U |
| Collection method | YouTube Data API v3 commentThreads.list using API-key collection, then saved CSV cache |
| Primary data type | Public top-level YouTube comment records returned by the API |
| API metadata status | Available in this knit |
| Reproducible cache file | youtube_comments_clean_u4AiuqQpB1U.csv |

if (nrow(video_metadata) > 0) {
  video_metadata_formatted <- video_metadata |>
    dplyr::mutate(
      view_count = scales::comma(view_count),
      comment_count_visible_on_youtube = scales::comma(comment_count_visible_on_youtube)
    )

  knitr::kable(
    video_metadata_formatted,
    caption = "Optional YouTube video metadata from the API",
    align = "l"
  )
}
Optional YouTube video metadata from the API

| video_id | title | channel_title | published_at | view_count | comment_count_visible_on_youtube |
|---|---|---|---|---|---|
| u4AiuqQpB1U | RANDY WALTERS IS A SON OF A BITCH | ogafroman | 2026-03-16T12:45:54Z | 3,614,030 | 22,464 |

# Dataset summary for documentation.
# This vertical two-column format is easier to read in HTML than one wide row.
# It also clarifies the difference between the initial API return and the final
# cleaned analysis dataset.

comments_analyzed <- nrow(comments_clean)
unique_authors_count <- dplyr::n_distinct(comments_clean$author)
first_comment_value <- min(comments_clean$published_at, na.rm = TRUE)
latest_comment_value <- max(comments_clean$published_at, na.rm = TRUE)
total_likes_value <- sum(comments_clean$like_count, na.rm = TRUE)
total_replies_value <- sum(comments_clean$reply_count, na.rm = TRUE)
median_length_value <- median(comments_clean$comment_length_words, na.rm = TRUE)
average_length_value <- mean(comments_clean$comment_length_words, na.rm = TRUE)

raw_cache_count <- if (file.exists(raw_comments_file)) {
  suppressWarnings(nrow(readr::read_csv(raw_comments_file, show_col_types = FALSE)))
} else {
  NA_integer_
}

raw_collection_display <- if (!is.na(raw_cache_count)) {
  scales::comma(raw_cache_count)
} else {
  paste0(scales::comma(raw_comments_initial_collection_count), " during initial API collection")
}

comment_summary <- tibble::tibble(
  Metric = c(
    "Records initially returned by API collection",
    "Comments analyzed after cleaning",
    "Unique authors in cleaned dataset",
    "First collected comment date",
    "Latest collected comment date",
    "Total likes on analyzed comments",
    "Total replies to analyzed comments",
    "Median comment length",
    "Average comment length"
  ),
  Value = c(
    raw_collection_display,
    scales::comma(comments_analyzed),
    scales::comma(unique_authors_count),
    format(first_comment_value, "%Y-%m-%d %H:%M:%S"),
    format(latest_comment_value, "%Y-%m-%d %H:%M:%S"),
    scales::comma(total_likes_value),
    scales::comma(total_replies_value),
    paste0(round(median_length_value, 1), " words"),
    paste0(round(average_length_value, 1), " words")
  )
)

cat(
  '<div class="stat-grid">',
  paste0('<div class="stat-card"><div class="stat-value">', scales::comma(comments_analyzed), '</div><div class="stat-label">cleaned comments</div></div>'),
  paste0('<div class="stat-card"><div class="stat-value">', scales::comma(unique_authors_count), '</div><div class="stat-label">unique authors</div></div>'),
  paste0('<div class="stat-card"><div class="stat-value">', scales::comma(total_likes_value), '</div><div class="stat-label">likes captured</div></div>'),
  paste0('<div class="stat-card"><div class="stat-value">', scales::comma(total_replies_value), '</div><div class="stat-label">replies captured</div></div>'),
  paste0('<div class="stat-card"><div class="stat-value">', format(first_comment_value, "%b %d"), ' - ', format(latest_comment_value, "%b %d"), '</div><div class="stat-label">comment window</div></div>'),
  '</div>'
)
  • 1,051 cleaned comments
  • 997 unique authors
  • 464,878 likes captured
  • 3,660 replies captured
  • Mar 16 - Jun 22 comment window

knitr::kable(
  comment_summary,
  caption = "Summary of collected and analyzed YouTube comments",
  align = c("l", "r")
)
Summary of collected and analyzed YouTube comments

| Metric | Value |
|---|---|
| Records initially returned by API collection | 1,105 |
| Comments analyzed after cleaning | 1,051 |
| Unique authors in cleaned dataset | 997 |
| First collected comment date | 2026-03-16 12:46:46 |
| Latest collected comment date | 2026-06-22 16:17:53 |
| Total likes on analyzed comments | 464,878 |
| Total replies to analyzed comments | 3,660 |
| Median comment length | 10 words |
| Average comment length | 14.6 words |

Plain-language reading: The initial API collection returned more records than the final cleaned analysis file because cleaning removes unusable or duplicate rows and standardizes the text. The cleaned dataset is the version used for word frequency, phrase analysis, engagement analysis, and all visualizations below.
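
The gap between the raw and cleaned counts can be made explicit with a few lines of arithmetic. This is a minimal sketch that hard-codes the figures from the summary table above. The variable names are illustrative and not part of the original chunks.

```r
# Counts copied from the summary table above.
raw_n <- 1105          # records initially returned by the API
clean_n <- 1051        # comments analyzed after cleaning
total_likes <- 464878
total_replies <- 3660

removed_n <- raw_n - clean_n
removed_pct <- round(100 * removed_n / raw_n, 1)
likes_per_comment <- round(total_likes / clean_n, 1)
replies_per_comment <- round(total_replies / clean_n, 1)

removed_n            # 54
removed_pct          # 4.9
likes_per_comment    # 442.3
replies_per_comment  # 3.5
```

The mean of roughly 442 likes per comment should be read with care, because averages are sensitive to a few very highly liked comments.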

4.2 6.1 Example of a cleaned comment

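The table below shows one real comment from the dataset before and after cleaning. The cleaning step lowercases the text, removes links, replaces every character other than the letters a-z (punctuation, digits, emoji, and accented letters) with a space, and collapses repeated whitespace so the comment can be tokenized into words and phrases. One side effect is that contractions split at the apostrophe, so "I've" becomes "i ve".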

# ================================================================
# CLEANED COMMENT EXAMPLE
# ================================================================
# Choose a readable real example from the dataset. The selection avoids very
# short comments and extremely long comments so the before/after cleaning
# example is useful for the reader.

example_comment <- comments_clean |>
  dplyr::filter(comment_length_words >= 8, comment_length_words <= 35) |>
  dplyr::arrange(dplyr::desc(like_count)) |>
  dplyr::slice(1) |>
  dplyr::transmute(
    `Original comment preview` = stringr::str_trunc(text_original, 220),
    `Cleaned text used for analysis` = stringr::str_trunc(text_clean, 220),
    `Words after cleaning` = comment_length_words,
    `Likes` = scales::comma(like_count)
  )

knitr::kable(
  example_comment,
  caption = "Example of one real comment before and after text cleaning"
)
Example of one real comment before and after text cleaning

| Original comment preview | Cleaned text used for analysis | Words after cleaning | Likes |
|---|---|---|---|
| These diss tracks against individual cops are the most gangster shit I’ve ever seen lmao | these diss tracks against individual cops are the most gangster shit i ve ever seen lmao | 16 | 56,587 |
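
The same cleaning steps can be tried on a toy string to see each effect in isolation. The sketch below is a base R stand-in for the stringr calls in clean_comment_data(), not the report's own function. Both clean_text_base() and the sample text are hypothetical.

```r
# Base R stand-in for the stringr steps inside clean_comment_data().
clean_text_base <- function(x) {
  x <- tolower(x)
  x <- gsub("http[s]?://\\S+", " ", x, perl = TRUE)  # drop links
  x <- gsub("www\\.\\S+", " ", x, perl = TRUE)       # drop bare www links
  x <- gsub("[^a-z\\s]", " ", x, perl = TRUE)        # keep only a-z and whitespace
  trimws(gsub("\\s+", " ", x, perl = TRUE))          # collapse repeated spaces
}

# Hypothetical comment text written for this example.
toy_comment <- "I've seen it 3 TIMES!! https://example.com/x LOL"
toy_clean <- clean_text_base(toy_comment)

toy_clean
# "i ve seen it times lol"
lengths(strsplit(toy_clean, " "))
# 6
```

The digit and the link disappear entirely, while the contraction survives as two separate tokens, which matches the "i ve" seen in the real example.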

5 Text preparation

5.1 7. Tokenize words and remove stop words

# ================================================================
# WORD TOKENIZATION
# ================================================================
# Tokenization converts each comment into individual words.
# Stop words are common words such as "the", "and", and "is" that usually
# do not help identify themes.

custom_stop_words <- tibble::tibble(
  word = c(
    "youtube", "video", "watch", "channel", "comment", "comments",
    "just", "like", "really", "get", "got", "can", "will", "one",
    "people", "thing", "things", "way", "make", "makes", "say", "said",
    "u", "im", "dont", "didnt", "doesnt", "cant", "youre", "ive",
    "thats", "theyre", "hes", "shes", "wasnt", "isnt", "arent",
    "would", "could", "should", "also", "even", "still", "know"
  ),
  lexicon = "custom"
)

all_stop_words <- dplyr::bind_rows(tidytext::stop_words, custom_stop_words)

comment_words <- comments_clean |>
  dplyr::select(comment_id, text_clean) |>
  tidytext::unnest_tokens(word, text_clean) |>
  dplyr::anti_join(all_stop_words, by = "word") |>
  dplyr::filter(stringr::str_detect(word, "^[a-z]+$")) |>
  dplyr::filter(nchar(word) > 2)

word_counts <- comment_words |>
  dplyr::count(word, sort = TRUE)

knitr::kable(head(word_counts, 25), caption = "Top 25 most frequent meaningful words")
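Top 25 most frequent meaningful words

| word | n |
|---|---|
| afroman | 225 |
| song | 103 |
| love | 87 |
| randy | 83 |
| cops | 68 |
| court | 44 |
| walters | 43 |
| shit | 40 |
| police | 38 |
| time | 38 |
| music | 31 |
| son | 31 |
| county | 27 |
| day | 27 |
| lol | 27 |
| bitch | 26 |
| banger | 24 |
| corrupt | 24 |
| don | 24 |
| god | 23 |
| head | 22 |
| judge | 22 |
| legend | 22 |
| wife | 22 |
| adams | 21 |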
5.2 8. Tokenize two-word phrases

# ================================================================
# BIGRAM TOKENIZATION
# ================================================================
# Bigrams are two-word phrases. They are useful because a single word can be
# ambiguous, while a phrase often carries context.

comment_bigrams <- comments_clean |>
  dplyr::select(comment_id, text_clean) |>
  tidytext::unnest_tokens(bigram, text_clean, token = "ngrams", n = 2) |>
  tidyr::separate(bigram, into = c("word1", "word2"), sep = " ") |>
  dplyr::filter(
    !word1 %in% all_stop_words$word,
    !word2 %in% all_stop_words$word,
    stringr::str_detect(word1, "^[a-z]+$"),
    stringr::str_detect(word2, "^[a-z]+$")
  ) |>
  tidyr::unite(bigram, word1, word2, sep = " ")

bigram_counts <- comment_bigrams |>
  dplyr::count(bigram, sort = TRUE)

knitr::kable(head(bigram_counts, 20), caption = "Top 20 most frequent two-word phrases")
Top 20 most frequent two-word phrases

| bigram | n |
|---|---|
| randy walters | 39 |
| pound cake | 19 |
| adams county | 18 |
| lemon pound | 14 |
| god bless | 11 |
| randy walter | 9 |
| county sheriff | 8 |
| diss track | 6 |
| diss tracks | 6 |
| reckless ben | 6 |
| corrupt cops | 5 |
| hell yeah | 5 |
| love afroman | 5 |
| streisand effect | 5 |
| bless america | 4 |
| crooked cops | 4 |
| hip hop | 4 |
| law enforcement | 4 |
| police officers | 4 |
| rent free | 4 |
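
Bigrams overlap, which is likely why both "lemon pound" and "pound cake" appear in the table: a three-word phrase produces two adjacent pairs. The base R sketch below shows the mechanism on a made-up sentence. make_bigrams() and drop_stop_bigrams() are hypothetical helpers, not the tidytext functions used in the report.

```r
# Build every adjacent word pair from one cleaned comment.
make_bigrams <- function(text_clean) {
  words <- strsplit(text_clean, "\\s+")[[1]]
  if (length(words) < 2) {
    return(character(0))
  }
  paste(head(words, -1), tail(words, -1))
}

# Drop any pair in which either word is a stop word.
drop_stop_bigrams <- function(bigrams, stop_words) {
  parts <- strsplit(bigrams, " ")
  keep <- vapply(parts, function(p) !any(p %in% stop_words), logical(1))
  bigrams[keep]
}

# Hypothetical cleaned comment and stop list.
toy_bigrams <- make_bigrams("that lemon pound cake is great")
toy_bigrams
# "that lemon" "lemon pound" "pound cake" "cake is" "is great"

drop_stop_bigrams(toy_bigrams, stop_words = c("that", "is"))
# "lemon pound" "pound cake"
```

Because one phrase feeds two rows, bigram counts for overlapping pairs should not be added together as if they were independent mentions.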

6 Visualization guide

How to read this section: The charts move from simple description to richer interpretation. Word frequency identifies repeated vocabulary. Bigrams reveal repeated phrases and narrative frames. Theme coding groups individual words into audience frames. Engagement charts show which comments captured attention. Timeline and heatmap views show when the discussion appeared most active. The location-mention proxy checks whether commenters explicitly wrote state or country names, while warning that this is not true viewer geography.

  • Strong signal: words, phrases, engagement
  • Exploratory signal: timing and length patterns
  • Cautious proxy: place names mentioned in text

7 Signal 1: The Words That Kept Coming Back

7.1 Goal

This chart answers: What words appeared most often after cleaning the comments?

A lollipop chart is a polished alternative to a basic bar chart. The dot marks each word’s count, and the line makes it easier to compare terms across the ranked list.

# ================================================================
# VISUALIZATION 1: TOP TERMS LOLLIPOP CHART
# ================================================================

top_terms <- word_counts |>
  # with_ties = FALSE keeps exactly 25 rows even when counts tie at the cutoff.
  dplyr::slice_max(n, n = 25, with_ties = FALSE) |>
  dplyr::arrange(n)

plot_top_terms <- ggplot2::ggplot(top_terms, ggplot2::aes(x = n, y = forcats::fct_reorder(word, n))) +
  ggplot2::geom_segment(
    ggplot2::aes(x = 0, xend = n, y = forcats::fct_reorder(word, n), yend = forcats::fct_reorder(word, n)),
    linewidth = 0.9,
    color = hemp_light
  ) +
  ggplot2::geom_point(size = 3.4, color = hemp_green_dark) +
  ggplot2::labs(
    title = "Most Frequent Meaningful Words in the YouTube Comment Sample",
    subtitle = "Top 25 terms after lowercasing, punctuation cleanup, and stop-word removal",
    x = "Number of appearances",
    y = "Word",
    caption = "Source: YouTube Data API v3 commentThreads.list; analysis uses saved CSV cache."
  ) +
  theme_lab()

save_report_plot(plot_top_terms, "01_top_terms_lollipop.png")
plot_top_terms

7.2 How to interpret it

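The highest-ranked words show the most repeated vocabulary in the comment section. Repeated words do not prove agreement or sentiment by themselves, but they identify the main objects of attention. In this sample, the top terms point to the artist ("afroman"), a named individual ("randy", "walters"), police and courts ("cops", "police", "court", "judge"), and music and praise ("song", "love", "banger", "legend"). One entry needs caution: "don" is most likely a fragment of "don't", because the cleaning step splits contractions at the apostrophe.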
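
Raw counts are easier to compare across datasets when expressed as a rate. The sketch below converts the five largest counts from the word table into appearances per 100 analyzed comments. It is a rough scale only, since a word can appear more than once in a single comment.

```r
# Counts copied from the top of the word-frequency table.
top_words <- c(afroman = 225, song = 103, love = 87, randy = 83, cops = 68)
comments_analyzed <- 1051

appearances_per_100 <- round(100 * top_words / comments_analyzed, 1)
appearances_per_100
# afroman 21.4, song 9.8, love 8.3, randy 7.9, cops 6.5
```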

8 Signal 2: Repeated Phrases and Public Framing

8.1 Goal

This chart answers: What repeated two-word phrases reveal the clearest audience frames?

Single words can lose context. For example, a name, institution, or legal term may be more meaningful as a phrase than as an isolated word. Bigram analysis helps reveal repeated phrases such as names, institutions, legal ideas, or recurring jokes.

# ================================================================
# VISUALIZATION 2: BIGRAM PHRASE CHART
# ================================================================

top_bigrams <- bigram_counts |>
  # with_ties = FALSE keeps exactly 20 rows; several phrases tie at low counts.
  dplyr::slice_max(n, n = 20, with_ties = FALSE) |>
  dplyr::arrange(n)

plot_bigrams <- ggplot2::ggplot(top_bigrams, ggplot2::aes(x = n, y = forcats::fct_reorder(bigram, n))) +
  ggplot2::geom_col(fill = hemp_sage) +
  ggplot2::labs(
    title = "Most Common Two-Word Phrases in the Comments",
    subtitle = "Bigrams reveal repeated names, institutions, jokes, and legal frames",
    x = "Number of appearances",
    y = "Two-word phrase",
    caption = "Bigrams created from cleaned YouTube comment text."
  ) +
  theme_lab()

save_report_plot(plot_bigrams, "02_bigram_phrase_chart.png")
plot_bigrams

8.2 How to interpret it

Phrases are useful because they show how viewers connect ideas. If the chart is dominated by names, the discussion is person-centered. If it is dominated by institutional phrases, the discussion is system-centered. If legal phrases appear often, the audience is framing the video as a rights, court, or accountability issue. If music or humor phrases appear often, viewers may be interpreting the content as entertainment, protest, or both.
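
Spelling variants can understate how dominant a phrase is. The bigram table lists "randy walters" and "randy walter" separately, and likewise "diss track" and "diss tracks". The sketch below merges them with a manual mapping. The mapping is a hypothetical editorial choice made for this example, not a step in the original analysis.

```r
# Counts copied from the bigram table.
bigram_counts_toy <- c(
  "randy walters" = 39, "randy walter" = 9,
  "diss track" = 6, "diss tracks" = 6,
  "pound cake" = 19
)

# Hypothetical manual mapping from variant spelling to canonical phrase.
variant_map <- c("randy walter" = "randy walters", "diss tracks" = "diss track")

canonical <- names(bigram_counts_toy)
is_variant <- canonical %in% names(variant_map)
canonical[is_variant] <- variant_map[canonical[is_variant]]

# Sum counts within each canonical phrase, largest first.
merged <- sort(
  vapply(split(unname(bigram_counts_toy), canonical), sum, numeric(1)),
  decreasing = TRUE
)
merged
# randy walters 48, pound cake 19, diss track 12
```

After merging, the name phrase rises from 39 to 48 appearances, which supports reading this comment section as person-centered.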

9 Signal 3: Turning Word Counts into Audience Frames

9.1 Goal

This chart answers: Can repeated words be grouped into broader audience themes?

This is not a machine-learning topic model. It is an interpretable, manually defined theme dictionary that groups important words into categories. The purpose is to translate word counts into plain-language audience frames.
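
The lookup-and-sum idea can be shown on a handful of rows first. The sketch below uses base R merge() and aggregate() as stand-ins for the dplyr join and summary. The counts are copied from the top-25 word table, and the dictionary is a five-word excerpt, so the totals are partial illustrations and not the report's theme results.

```r
# Word counts copied from the top-25 table; "time" has no theme and should
# drop out of the join.
toy_counts <- data.frame(
  word = c("court", "judge", "cops", "police", "song", "time"),
  n = c(44, 22, 68, 38, 103, 38)
)

# Small excerpt of the manual theme dictionary.
toy_dictionary <- data.frame(
  word = c("court", "judge", "cops", "police", "song"),
  theme = c("Law / court", "Law / court", "Police / corruption",
            "Police / corruption", "Music / humor")
)

# merge() keeps only words present in both tables (an inner join).
themed <- merge(toy_counts, toy_dictionary, by = "word")

# Sum the word counts within each theme and show the largest theme first.
theme_totals <- aggregate(n ~ theme, data = themed, FUN = sum)
theme_totals[order(-theme_totals$n), ]
```

Words without a dictionary entry simply vanish from the result, so theme totals always depend on how complete the manual dictionary is.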

# ================================================================
# THEME DICTIONARY
# ================================================================
# These categories are manually defined for interpretability.
# They should be read as directional audience frames, not objective topics.

theme_dictionary <- tibble::tribble(
  ~word, ~theme,
  "afroman", "Artist / identity",
  "afro", "Artist / identity",
  "artist", "Artist / identity",
  "rapper", "Artist / identity",
  "man", "Artist / identity",

  "court", "Law / court",
  "case", "Law / court",
  "judge", "Law / court",
  "sue", "Law / court",
  "sued", "Law / court",
  "lawsuit", "Law / court",
  "defamation", "Law / court",
  "amendment", "Law / court",
  "speech", "Law / court",
  "legal", "Law / court",
  "law", "Law / court",

  "cops", "Police / corruption",
  "police", "Police / corruption",
  "sheriff", "Police / corruption",
  "corrupt", "Police / corruption",
  "raid", "Police / corruption",
  "county", "Police / corruption",
  "officer", "Police / corruption",
  "officers", "Police / corruption",

  "song", "Music / humor",
  "songs", "Music / humor",
  "music", "Music / humor",
  "banger", "Music / humor",
  "diss", "Music / humor",
  "track", "Music / humor",
  "funny", "Music / humor",
  "laugh", "Music / humor",
  "comedy", "Music / humor",

  "love", "Support / praise",
  "thank", "Support / praise",
  "thanks", "Support / praise",
  "legend", "Support / praise",
  "bless", "Support / praise",
  "respect", "Support / praise",
  "favorite", "Support / praise",
  "best", "Support / praise",

  "freedom", "Rights / accountability",
  "rights", "Rights / accountability",
  "accountability", "Rights / accountability",
  "justice", "Rights / accountability",
  "wrong", "Rights / accountability",
  "truth", "Rights / accountability"
)

theme_counts <- word_counts |>
  dplyr::inner_join(theme_dictionary, by = "word") |>
  dplyr::group_by(theme) |>
  dplyr::summarise(theme_mentions = sum(n), .groups = "drop") |>
  dplyr::arrange(dplyr::desc(theme_mentions))

knitr::kable(theme_counts, caption = "Theme counts from manually grouped high-meaning terms")

Theme counts from manually grouped high-meaning terms

| theme | theme_mentions |
|:--|--:|
| Artist / identity | 254 |
| Music / humor | 224 |
| Police / corruption | 205 |
| Law / court | 145 |
| Support / praise | 137 |
| Rights / accountability | 38 |

# ================================================================
# VISUALIZATION 3: THEME-CODED AUDIENCE FRAMES
# ================================================================

plot_theme_bars <- ggplot2::ggplot(theme_counts, ggplot2::aes(x = theme_mentions, y = forcats::fct_reorder(theme, theme_mentions))) +
  ggplot2::geom_col(fill = hemp_soil) +
  ggplot2::labs(
    title = "Theme-Level View of the Comment Conversation",
    subtitle = "Selected high-meaning words grouped into interpretable audience frames",
    x = "Total word mentions",
    y = "Theme",
    caption = "Themes are manually defined for interpretation and should be read as directional, not exhaustive."
  ) +
  theme_lab()

save_report_plot(plot_theme_bars, "03_theme_level_audience_frames.png")
plot_theme_bars

9.2 How to interpret it

The theme chart turns many individual word counts into a smaller number of audience frames. If Law / court and Police / corruption dominate, viewers are interpreting the video through legal accountability and institutional trust. If Music / humor is strong, viewers are also responding to the entertainment format. If Support / praise is strong, the comment section is not only discussing the event but also expressing approval toward the creator or message.
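Raw totals can be easier to compare as shares. The sketch below is only a reading aid, not part of the original workflow: it copies the six theme totals reported in this section and converts them to percentages of all dictionary-matched mentions. The object names theme_totals and theme_share_pct are illustrative.

```r
# Theme totals copied from the theme-count table in this section.
theme_totals <- c(
  "Artist / identity" = 254,
  "Music / humor" = 224,
  "Police / corruption" = 205,
  "Law / court" = 145,
  "Support / praise" = 137,
  "Rights / accountability" = 38
)

# Share of all dictionary-matched mentions, as a percentage rounded to one decimal.
theme_share_pct <- round(100 * theme_totals / sum(theme_totals), 1)
print(theme_share_pct)
```

Read this way, Artist / identity accounts for roughly a quarter of the matched mentions. The shares describe only words that appear in the manual dictionary, so they are directional rather than a measure of the whole comment section.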

10 Signal 4: When the Conversation Spiked

10.1 Goal

This chart answers: When were the collected comments posted, and did activity spike or fade?

The bars show daily comment count. The line shows a simple 7-day rolling average, which smooths the chart so the overall trend is easier to read.
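The rolling_mean_right() helper is defined earlier in the report and is not reproduced in this section. For readers who want to see the idea, a right-aligned rolling mean can be sketched in base R as follows. The name rolling_mean_right_sketch is a hypothetical stand-in, and the real helper may treat the first few days differently.

```r
# Hypothetical stand-in: mean of the current value and the (window - 1) values before it.
# Positions without a full window are left as NA.
rolling_mean_right_sketch <- function(x, window = 7) {
  n <- length(x)
  out <- rep(NA_real_, n)
  if (n < window) return(out)
  running_total <- c(0, cumsum(x))
  idx <- window:n
  # Sum over each window = difference of two running totals
  out[idx] <- (running_total[idx + 1] - running_total[idx - window + 1]) / window
  out
}

# Toy daily counts, not real data
toy_daily_counts <- 1:10
toy_rolling <- rolling_mean_right_sketch(toy_daily_counts, window = 7)
print(toy_rolling)
```

A window counts rows, not calendar days, so the smoothed line only represents a true 7-day average when every day in the date range has a row, including days with zero comments.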

# ================================================================
# VISUALIZATION 4: COMMENT ACTIVITY OVER TIME
# ================================================================

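comments_by_day <- comments_clean |>
  dplyr::mutate(comment_date = as.Date(published_at)) |>
  dplyr::filter(!is.na(comment_date)) |>
  dplyr::count(comment_date, name = "comments_posted") |>
  # Add calendar days with zero comments so a 7-row window spans 7 actual days
  # (assumes the tidyr package is installed)
  tidyr::complete(
    comment_date = seq(min(comment_date), max(comment_date), by = "day"),
    fill = list(comments_posted = 0L)
  ) |>
  dplyr::arrange(comment_date) |>
  dplyr::mutate(rolling_7_day_average = as.numeric(rolling_mean_right(comments_posted, window = 7)))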

plot_comment_activity <- ggplot2::ggplot(comments_by_day, ggplot2::aes(x = comment_date, y = comments_posted)) +
  ggplot2::geom_col(fill = hemp_light) +
  ggplot2::geom_line(ggplot2::aes(y = rolling_7_day_average), linewidth = 1.2, color = hemp_gold, na.rm = TRUE) +
  ggplot2::labs(
    title = "Comment Activity Over Time",
    subtitle = "Daily comment counts with a 7-day rolling average",
    x = "Comment date",
    y = "Number of collected comments",
    caption = "Bars show daily counts; gold line shows smoothed 7-day average."
  ) +
  theme_lab()

save_report_plot(plot_comment_activity, "04_comment_activity_over_time.png")
plot_comment_activity

10.2 How to interpret it

A sharp spike suggests a burst of audience attention, often linked to a video upload, news event, repost, or related controversy. A long tail suggests that the video continued receiving comments after the initial attention window. Multiple spikes can indicate that public attention was reactivated over time.

11 Signal 5: The Comments That Carried the Room

11.1 Goal

This chart answers: Are likes spread evenly across comments, or concentrated among a small number of highly visible comments?

This is an advanced view because it shifts from word frequency to attention concentration. It helps show whether the average comment or the highly liked comments are more likely to shape the visible conversation.

# ================================================================
# ENGAGEMENT CONCENTRATION PREP
# ================================================================

engagement_ranked <- comments_clean |>
  dplyr::mutate(total_engagement = dplyr::coalesce(like_count, 0) + dplyr::coalesce(reply_count, 0)) |>
  dplyr::arrange(dplyr::desc(total_engagement)) |>
  dplyr::mutate(
    comment_rank = dplyr::row_number(),
    cumulative_engagement = cumsum(total_engagement),
    total_engagement_all = sum(total_engagement, na.rm = TRUE),
    cumulative_engagement_share = cumulative_engagement / total_engagement_all,
    comment_share = comment_rank / dplyr::n()
  )

# Use the engagement column computed in engagement_ranked so the Gini matches the ranked data
engagement_gini <- gini_coefficient(engagement_ranked$total_engagement)
share_top_1_percent <- engagement_ranked |>
  dplyr::filter(comment_share <= 0.01) |>
  dplyr::summarise(value = max(cumulative_engagement_share, na.rm = TRUE)) |>
  dplyr::pull(value)
share_top_10_percent <- engagement_ranked |>
  dplyr::filter(comment_share <= 0.10) |>
  dplyr::summarise(value = max(cumulative_engagement_share, na.rm = TRUE)) |>
  dplyr::pull(value)

engagement_summary <- tibble::tibble(
  metric = c("Engagement Gini coefficient", "Share of engagement captured by top 1% of comments", "Share of engagement captured by top 10% of comments"),
  value = c(
    round(engagement_gini, 3),
    scales::percent(share_top_1_percent, accuracy = 0.1),
    scales::percent(share_top_10_percent, accuracy = 0.1)
  )
)

knitr::kable(engagement_summary, caption = "Engagement concentration summary")

Engagement concentration summary

| metric | value |
|:--|--:|
| Engagement Gini coefficient | 0.958 |
| Share of engagement captured by top 1% of comments | 52.2% |
| Share of engagement captured by top 10% of comments | 95.8% |

# ================================================================
# VISUALIZATION 5: ENGAGEMENT CONCENTRATION CURVE
# ================================================================

plot_engagement_concentration <- ggplot2::ggplot(engagement_ranked, ggplot2::aes(x = comment_share, y = cumulative_engagement_share)) +
  ggplot2::geom_line(linewidth = 1.2, color = hemp_green_dark) +
  ggplot2::geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = hemp_gray) +
  ggplot2::scale_x_continuous(labels = scales::percent_format()) +
  ggplot2::scale_y_continuous(labels = scales::percent_format()) +
  ggplot2::labs(
    title = "Engagement Concentration Across Comments",
    subtitle = "The curve shows whether a small share of comments captured most likes and replies",
    x = "Share of comments, ranked from most engaged to least engaged",
    y = "Cumulative share of total engagement",
    caption = "Dashed line = perfectly even distribution. Curved line above it = concentrated engagement."
  ) +
  theme_lab()

save_report_plot(plot_engagement_concentration, "05_engagement_concentration_curve.png")
plot_engagement_concentration

11.2 How to interpret it

The dashed diagonal line represents a perfectly even world where 10% of comments would receive 10% of engagement, 50% of comments would receive 50% of engagement, and so on. If the actual curve rises above that dashed line, then engagement is concentrated among the most visible comments. In this sample the concentration is extreme: the top 1% of comments captured 52.2% of all likes and replies, the top 10% captured 95.8%, and the Gini coefficient was 0.958. This matters because social media perception is often shaped by the comments that receive the most likes and replies, not by the average comment.
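The gini_coefficient() helper used in this section is defined elsewhere in the report. The sketch below shows one common way to compute a Gini coefficient and a top-share figure in base R. The names gini_sketch and top_share_sketch are hypothetical, and the report's helper may use a different formula or handle missing values differently.

```r
# Hypothetical stand-in: Gini coefficient from values sorted in ascending order.
# 0 means perfectly even; values near 1 mean a few items hold almost everything.
gini_sketch <- function(x) {
  x <- sort(x[!is.na(x)])
  n <- length(x)
  if (n == 0 || sum(x) == 0) return(NA_real_)
  2 * sum(seq_len(n) * x) / (n * sum(x)) - (n + 1) / n
}

# Hypothetical stand-in: share of the total held by the top fraction of items.
top_share_sketch <- function(x, top_fraction) {
  x <- sort(x[!is.na(x)], decreasing = TRUE)
  k <- floor(length(x) * top_fraction)
  if (k < 1 || sum(x) == 0) return(NA_real_)
  sum(x[seq_len(k)]) / sum(x)
}

# Toy engagement vectors, not real data
even_engagement <- rep(5, 10)
skewed_engagement <- c(rep(0, 9), 100)

print(gini_sketch(even_engagement))
print(gini_sketch(skewed_engagement))
print(top_share_sketch(skewed_engagement, 0.10))
```

The top-share function rounds the cutoff down, which mirrors the comment_share <= 0.01 and comment_share <= 0.10 filters in the preparation code above.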

12 Signal 6: Does Comment Style Affect Attention?

12.1 Goal

This chart answers: Do shorter or longer comments appear to attract more engagement?

The y-axis uses a log scale because comment likes are usually highly skewed. Without that scaling, a few extremely popular comments would flatten most of the chart.
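The chart plots engagement plus 1 rather than raw engagement. A short base-R sketch, using made-up values plus the highest engagement total from this report (56,835), shows why: zero cannot be placed on a log axis, while zero plus one can.

```r
# Toy engagement values; 56835 is the top comment's total from this report.
toy_engagement <- c(0, 0, 1, 3, 12, 250, 56835)

# log10(0) is -Inf, so zero-engagement comments could not be drawn on a log axis.
print(log10(toy_engagement))

# Adding 1 first maps zero engagement to 0 and keeps every comment plottable.
print(round(log10(toy_engagement + 1), 2))
```

The shift matters most for low-engagement comments and is negligible for highly liked ones, which is why it is a common choice for skewed count data.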

# ================================================================
# VISUALIZATION 6: COMMENT LENGTH VS ENGAGEMENT
# ================================================================

comments_length_engagement <- comments_clean |>
  dplyr::mutate(
    total_engagement = dplyr::coalesce(like_count, 0) + dplyr::coalesce(reply_count, 0),
    engagement_plus_one = total_engagement + 1
  ) |>
  dplyr::filter(!is.na(comment_length_words), comment_length_words > 0)

plot_length_vs_engagement <- ggplot2::ggplot(comments_length_engagement, ggplot2::aes(x = comment_length_words, y = engagement_plus_one)) +
  ggplot2::geom_point(alpha = 0.45, color = hemp_green_dark) +
  ggplot2::geom_smooth(method = "loess", se = FALSE, color = hemp_gold, linewidth = 1.1) +
  ggplot2::scale_y_log10(labels = scales::comma_format()) +
  ggplot2::labs(
    title = "Comment Length vs. Engagement",
    subtitle = "Testing whether short punchy comments or longer comments attracted more attention",
    x = "Comment length in words",
    y = "Total engagement plus 1, log scale",
    caption = "Total engagement = likes + replies. Log scale improves readability for skewed engagement data."
  ) +
  theme_lab()

save_report_plot(plot_length_vs_engagement, "06_comment_length_vs_engagement.png")
plot_length_vs_engagement

12.2 How to interpret it

Each point is one comment. Points higher on the chart received more likes and replies. The smooth trend line shows the general relationship between comment length and engagement. If the line rises, longer comments tended to receive more engagement; if it falls, shorter comments tended to perform better; if it is mostly flat, length alone does not explain engagement.

13 Signal 7: Time Patterns in the Crowd

13.1 Goal

This chart answers: When were comments posted by day of week and hour of day?

This is a non-standard visualization for a short lab because it treats comments as behavioral time-stamped events, not just text. It can reveal whether activity clusters around certain hours or days.

# ================================================================
# VISUALIZATION 7: DAY-OF-WEEK BY HOUR HEATMAP
# ================================================================

comment_time_heatmap <- comments_clean |>
  dplyr::filter(!is.na(published_at)) |>
  dplyr::mutate(
    day_of_week = lubridate::wday(published_at, label = TRUE, abbr = TRUE, week_start = 1),
    hour_of_day = lubridate::hour(published_at)
  ) |>
  dplyr::count(day_of_week, hour_of_day, name = "comments_posted")

plot_calendar_heatmap <- ggplot2::ggplot(comment_time_heatmap, ggplot2::aes(x = hour_of_day, y = day_of_week, fill = comments_posted)) +
  ggplot2::geom_tile(color = hemp_cream) +
  ggplot2::scale_x_continuous(breaks = seq(0, 23, by = 3)) +
  ggplot2::scale_fill_gradient(low = "#F3F1DE", high = hemp_green_dark, labels = scales::comma_format()) +
  ggplot2::labs(
    title = "When Did Viewers Comment?",
    subtitle = "Heatmap of collected comments by day of week and hour of day",
    x = "Hour of day, UTC timestamp from YouTube API",
    y = "Day of week",
    fill = "Comments",
    caption = "Timestamps use YouTube API time values and may not match each viewer's local time zone."
  ) +
  theme_lab()

save_report_plot(plot_calendar_heatmap, "07_comment_timing_heatmap.png")
plot_calendar_heatmap

13.2 How to interpret it

Darker cells represent more comments. If the heatmap has clear dark bands, comments are clustered during specific hours or days. If activity is more evenly distributed, discussion was less time-concentrated. Because the YouTube API timestamps are treated as UTC rather than each viewer's local time zone, this chart should be interpreted as a UTC-based posting pattern rather than a precise audience schedule.
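To make the time-zone caveat concrete, the sketch below takes one timestamp from this report's top-comments table and shows how the same instant reads in a few zones. It assumes the stored value is UTC, as the chart's axis label states. The list of zones is hypothetical and says nothing about where commenters actually are.

```r
# One timestamp from the top-comments table, parsed as UTC (assumed, per the axis label).
utc_time <- as.POSIXct("2026-03-18 17:29:56", tz = "UTC")

# Hypothetical viewer time zones; format() re-expresses the same instant in each zone.
zones <- c("UTC", "America/New_York", "Europe/Berlin", "Australia/Sydney")
local_times <- vapply(
  zones,
  function(z) format(utc_time, "%Y-%m-%d %H:%M", tz = z),
  character(1)
)
print(local_times)
```

The same comment lands in a different hour in each zone, and in some zones on a different calendar day, so a dark cell in the heatmap does not correspond to one shared local time for the audience.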

14 Signal 8: Places Mentioned, Not Places Measured

14.1 Goal

This section answers: Which places are named inside the comments themselves?

The YouTube comment data used in this report does not provide a reliable viewer state or country field. The official YouTube commentThread resource documents fields such as thread ID, video/channel IDs, top-level comment details, reply count, public visibility, and a limited replies object, but it does not provide commenter geography. Therefore, this section is only a location-reference-in-text proxy: it counts places that commenters wrote inside comments. It should not be interpreted as where commenters are actually from.

# ================================================================
# LOCATION-MENTION DICTIONARY
# ================================================================
# This is a cautious proxy analysis. It only counts explicit state/country
# names written in the text. It does not infer hidden viewer location.

us_state_names <- tibble::tibble(
  location = c(
    "alabama", "alaska", "arizona", "arkansas", "california", "colorado", "connecticut",
    "delaware", "florida", "georgia", "hawaii", "idaho", "illinois", "indiana", "iowa",
    "kansas", "kentucky", "louisiana", "maine", "maryland", "massachusetts", "michigan",
    "minnesota", "mississippi", "missouri", "montana", "nebraska", "nevada",
    "new hampshire", "new jersey", "new mexico", "new york", "north carolina",
    "north dakota", "ohio", "oklahoma", "oregon", "pennsylvania", "rhode island",
    "south carolina", "south dakota", "tennessee", "texas", "utah", "vermont",
    "virginia", "washington", "west virginia", "wisconsin", "wyoming"
  ),
  location_type = "U.S. state name"
)

country_names <- tibble::tibble(
  location = c(
    "america", "united states", "usa", "canada", "mexico", "england", "ireland", "scotland",
    "wales", "france", "germany", "italy", "spain", "australia", "new zealand", "brazil",
    "india", "china", "japan", "ukraine", "russia", "poland", "netherlands", "sweden",
    "norway", "denmark", "finland", "south africa"
  ),
  location_type = "Country or country reference"
)

location_dictionary <- dplyr::bind_rows(us_state_names, country_names) |>
  dplyr::distinct(location, location_type)

extract_location_mentions <- function(data, dictionary) {
  location_rows <- lapply(seq_len(nrow(dictionary)), function(i) {
    location_value <- dictionary$location[i]
    location_type_value <- dictionary$location_type[i]
    pattern_value <- paste0("\\b", stringr::str_replace_all(location_value, " ", "\\\\s+"), "\\b")

    data |>
      dplyr::filter(stringr::str_detect(text_clean, stringr::regex(pattern_value, ignore_case = TRUE))) |>
      dplyr::transmute(
        comment_id = comment_id,
        location = stringr::str_to_title(location_value),
        location_type = location_type_value
      )
  })

  dplyr::bind_rows(location_rows) |>
    dplyr::distinct(comment_id, location, .keep_all = TRUE)
}

location_mentions <- extract_location_mentions(comments_clean, location_dictionary)

location_counts <- location_mentions |>
  dplyr::count(location_type, location, sort = TRUE)

if (nrow(location_counts) > 0) {
  knitr::kable(
    location_counts |>
      dplyr::slice_head(n = 20),
    caption = "Top explicit location references found inside comment text",
    align = c("l", "l", "r")
  )
} else {
  knitr::kable(
    tibble::tibble(
      Result = "No state or country names from the dictionary were detected in the cleaned comments.",
      Interpretation = "This does not mean viewers came from no regions; it only means commenters did not explicitly write recognizable place names in the sampled text."
    ),
    caption = "Location-mention proxy result"
  )
}

Top explicit location references found inside comment text

| location_type | location | n |
|:--|:--|--:|
| Country or country reference | America | 18 |
| U.S. state name | Ohio | 7 |
| Country or country reference | Germany | 5 |
| Country or country reference | Usa | 5 |
| Country or country reference | France | 4 |
| Country or country reference | South Africa | 4 |
| Country or country reference | Ireland | 3 |
| U.S. state name | Mississippi | 3 |
| U.S. state name | Utah | 3 |
| Country or country reference | Australia | 2 |
| Country or country reference | Canada | 2 |
| U.S. state name | California | 2 |
| Country or country reference | Brazil | 1 |
| Country or country reference | Denmark | 1 |
| Country or country reference | Mexico | 1 |
| Country or country reference | Poland | 1 |
| U.S. state name | Arkansas | 1 |
| U.S. state name | Delaware | 1 |
| U.S. state name | Missouri | 1 |
| U.S. state name | Nebraska | 1 |

# ================================================================
# VISUALIZATION 8: LOCATION-MENTION PROXY CHART
# ================================================================

if (nrow(location_counts) > 0) {
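  top_location_counts <- location_counts |>
    # with_ties = FALSE keeps the chart at 15 bars even when many locations tie at n = 1
    dplyr::slice_max(n, n = min(15, nrow(location_counts)), with_ties = FALSE) |>
    dplyr::arrange(n)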

  plot_location_mentions <- ggplot2::ggplot(
    top_location_counts,
    ggplot2::aes(x = n, y = forcats::fct_reorder(location, n), fill = location_type)
  ) +
    ggplot2::geom_col() +
    ggplot2::scale_fill_manual(values = c(hemp_green_dark, hemp_gold, hemp_sage)) +
    ggplot2::labs(
      title = "Locations Named Inside Comment Text",
      subtitle = "This counts written location references, not where commenters are from",
      x = "Number of comment mentions",
      y = "Mentioned location",
      fill = "Location type",
      caption = "This is a text-reference proxy using a small state/country dictionary; it is not viewer geography."
    ) +
    theme_lab()

  save_report_plot(plot_location_mentions, "08_location_mentions_proxy.png")
  plot_location_mentions
} else {
  plot_location_mentions <- ggplot2::ggplot() +
    ggplot2::annotate(
      "text",
      x = 0,
      y = 0,
      label = "No explicit state or country mentions were detected.\nYouTube comments do not provide viewer geography in this dataset.",
      size = 5,
      color = hemp_green_dark
    ) +
    ggplot2::xlim(-1, 1) +
    ggplot2::ylim(-1, 1) +
    ggplot2::labs(
      title = "Location References in Comment Text",
      subtitle = "No detected state or country names in sampled comment text",
      x = NULL,
      y = NULL,
      caption = "This is not evidence of where viewers are located; it only reflects text mentions."
    ) +
    theme_lab() +
    ggplot2::theme(axis.text = ggplot2::element_blank(), panel.grid = ggplot2::element_blank())

  save_report_plot(plot_location_mentions, "08_location_mentions_proxy.png")
  plot_location_mentions
}

14.2 How to interpret it

If location names appear, this chart shows places that were explicitly written inside comments, such as states, countries, or broad national references. This can be useful for identifying rhetorical references like “in America,” “from Texas,” or “here in Canada,” but it is not evidence of where commenters live or where viewers are located. A proper regional audience analysis would require data that YouTube comments do not expose in this API response, such as viewer analytics from the channel owner or another source with location metadata.
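The matching step relies on word boundaries so that short entries such as "usa" do not match inside longer words. The base-R sketch below illustrates the same idea as the stringr code above. The function names build_location_pattern and mentions_location and the toy sentences are hypothetical.

```r
# Hypothetical helper: turn a dictionary entry into a whole-word regular expression.
# Spaces inside multi-word names are allowed to match any run of whitespace.
build_location_pattern <- function(location) {
  parts <- strsplit(location, " ", fixed = TRUE)[[1]]
  paste0("\\b", paste(parts, collapse = "\\s+"), "\\b")
}

# Hypothetical helper: TRUE where the text contains the location as a whole word or phrase.
mentions_location <- function(text, location) {
  grepl(build_location_pattern(location), text, ignore.case = TRUE, perl = TRUE)
}

# Toy sentences, not real comments
toy_text <- c(
  "Greetings from New   York!",
  "common usage of the word",
  "Adams County, Ohio.",
  "only in America"
)

print(mentions_location(toy_text, "new york"))
print(mentions_location(toy_text, "usa"))
print(mentions_location(toy_text, "ohio"))
```

Word boundaries prevent "usa" from matching "usage", but they cannot resolve ambiguous names: an entry such as "georgia" or "washington" still matches regardless of whether the commenter meant a state, a country, a city, or a person. The "Usa" label in the results table comes from passing dictionary entries through stringr::str_to_title(), and "America" and "Usa" are counted as separate entries.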

15 Signal 9: The Comments People Lifted Up

15.1 Goal

This table answers: Which comments actually captured the most visible attention?

Word frequency describes the overall conversation, but highly engaged comments often shape what later viewers see first. This table helps connect quantitative engagement metrics back to readable audience language.

# ================================================================
# TOP ENGAGED COMMENTS TABLE
# ================================================================

top_engaged_comments <- comments_clean |>
  dplyr::mutate(
    total_engagement = dplyr::coalesce(like_count, 0) + dplyr::coalesce(reply_count, 0),
    comment_preview = stringr::str_trunc(text_original, width = 150)
  ) |>
  dplyr::arrange(dplyr::desc(total_engagement)) |>
  dplyr::select(author, comment_preview, like_count, reply_count, total_engagement, published_at) |>
  dplyr::slice_head(n = 10)

knitr::kable(top_engaged_comments, caption = "Top 10 comments by total engagement")

Top 10 comments by total engagement

| author | comment_preview | like_count | reply_count | total_engagement | published_at |
|:--|:--|--:|--:|--:|:--|
| @mulgwisin | These diss tracks against individual cops are the most gangster shit I’ve ever seen lmao | 56587 | 248 | 56835 | 2026-03-18 17:29:56 |
| @AP-kg9dz | Songs clowning on corrupt cops is my new favorite genre. | 34228 | 146 | 34374 | 2026-03-20 15:51:20 |
| @slayanddecay6009 | Remember Randy is ON record that he cannot confirm that Afroman did or did not have sex with his wife | 30273 | 212 | 30485 | 2026-03-19 18:55:30 |
| @Whiston555 | Writing a song about fucking his wife and then him having to say for the record that he wasn’t sure it didn’t happen is an all time move. | 25758 | 135 | 25893 | 2026-03-19 06:02:40 |
| @Somename1010 | “Make memes of your enemies until they cry then make memes of them crying” Sun Tzu probably | 22652 | 97 | 22749 | 2026-03-20 05:36:19 |
| @kikusui8881 | AFROMAN SUPERBOWL HALFTIME SHOW | 18962 | 153 | 19115 | 2026-03-21 06:05:05 |
| @clikzip | Afroman making the whole second half of his career off of that raid 😂 | 14165 | 88 | 14253 | 2026-03-16 13:23:32 |
| @lawrencium2652 | He wore the suit in court today 😂 | 14106 | 50 | 14156 | 2026-03-16 21:23:41 |
| @prod.mordihi4772 | It’s a rappers dreams to have their dis track played in court in front of the person they are dissing and win 😂🎉 | 13772 | 53 | 13825 | 2026-03-19 19:01:21 |
| @CornyAir | I hope he wins his case against these corrupt mf cops. Edit: HE WON, LET’S GOO FREE SPEECH | 12490 | 192 | 12682 | 2026-03-16 12:50:58 |

15.2 How to interpret it

This table should be read as a visibility check. These comments are not necessarily representative of the average commenter, but they are the comments that received the most measurable attention in the sample. In a real marketing or public-affairs dashboard, these high-engagement comments would be useful for identifying repeated slogans, jokes, accusations, or support statements that may shape broader audience perception.

16 Integrated interpretation of findings

16.1 Main interpretation

This analysis used 1,051 cleaned YouTube comments from 997 unique authors, covering comments from March 16, 2026 through June 22, 2026. The most frequent cleaned terms were afroman, song, love, randy, cops, court, walters, and shit, which suggests that the audience conversation centered on a mix of people, institutions, legal conflict, humor, and accountability. The phrase-level view sharpened that interpretation: the most common bigrams included randy walters, pound cake, adams county, lemon pound, and god bless, showing that many viewers repeated recognizable names, concepts, or story frames rather than only using disconnected single words.

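The theme-coded analysis suggests that the strongest manually grouped frame was Artist / identity (254 mentions), followed by Music / humor (224) and Police / corruption (205). This does not mean the entire comment section held one unified opinion, but it does show that repeated vocabulary clustered around recognizable audience frames. Because theme totals depend on which words the dictionary includes (for example, the generic word man counts toward Artist / identity), the ranking should be read as directional. For a business analytics or public-affairs use case, this is useful because it translates raw comments into a practical question: what story is the audience collectively building around the event?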

The engagement visuals add a second layer of interpretation. The concentration curve and top-engaged-comments table show that attention is not only about how many comments exist, but also about which comments become highly visible through likes and replies. This matters because social media conversations are often shaped by a small number of highly engaged comments, much like a few loud signal fires lighting up the map before the rest of the army knows where the battle is.

16.2 Limitations

This analysis should be interpreted as an exploratory, platform-specific signal rather than a complete measurement of public opinion. The YouTube Data API may not return every visible comment, every nested reply, or comments filtered by YouTube moderation systems. The analysis also uses word frequency and manually defined theme groups, which are useful for discovering repeated language but do not fully capture sarcasm, irony, disagreement, or the intent behind a comment. For a larger final project, this workflow should be combined with other sources such as Google Trends, Bluesky posts, news mentions, or additional YouTube videos.

17 Screenshots or code snippets showing data collection

The data collection process is documented through the code chunks above. For the final submitted report, screenshots can be added showing the Google Cloud API setup, the successful comment collection output, or the saved CSV file.

18 Appendix A - Optional OAuth / tuber workflow

This section is included for alignment with the course tutorial. It is not the main workflow used in this polished report because Posit Cloud can have trouble with local browser redirect authentication.

18.1 Credential placement for OAuth

| OAuth item | R object |
|:--|:--|
| OAuth Client ID / App ID | app_id |
| OAuth Client Secret / App Secret | app_secret |
| API key | Not used in this OAuth chunk |

# Optional OAuth path using tuber.
# Do not publish real Client ID or Client Secret.
# Use this only if OAuth authentication is working in your R environment.

library(tuber)

app_id <- "YOUR_CLIENT_ID.apps.googleusercontent.com"
app_secret <- "YOUR_CLIENT_SECRET"

tuber::yt_oauth(app_id, app_secret)

# Optional tuber scrape after successful OAuth.
# This can return top-level comments and replies depending on video settings and API availability.

comments_raw_tuber <- tuber::get_all_comments(video_id = video_id)

comments_tuber_clean <- comments_raw_tuber |>
  tibble::as_tibble() |>
  dplyr::distinct(id, .keep_all = TRUE) |>
  dplyr::transmute(
    comment_id = id,
    author = authorDisplayName,
    text = textOriginal,
    published_at = publishedAt,
    like_count = likeCount,
    reply_count = NA_real_,
    video_id = video_id,
    source_method = "tuber::get_all_comments"
  ) |>
  clean_comment_data()

readr::write_csv(comments_tuber_clean, paste0("youtube_comments_tuber_", video_id, ".csv"))
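The OAuth chunk above uses placeholder strings for the client ID and secret. One way to keep real values out of a published report is to read them from environment variables. The sketch below is an assumption-laden illustration, not the workflow this report used: the variable name YT_CLIENT_ID and the helper read_credential are hypothetical.

```r
# Hypothetical helper: read a credential from an environment variable and fail loudly if missing.
# In practice the variable would be set in .Renviron or the hosting platform's secret store.
read_credential <- function(name) {
  value <- Sys.getenv(name, unset = "")
  if (!nzchar(value)) {
    stop("Environment variable ", name, " is not set.", call. = FALSE)
  }
  value
}

# Demo only: a placeholder value so the sketch runs end to end.
Sys.setenv(YT_CLIENT_ID = "YOUR_CLIENT_ID.apps.googleusercontent.com")
app_id <- read_credential("YT_CLIENT_ID")
print(app_id)
```

With this pattern the rendered report shows only the variable name, and a missing credential stops the run with a clear message instead of sending an empty string to the API.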

19 References and base requirements map

The table below is included to make the report easier to review. It is not intended as a grading claim; it simply points to where the core requested elements appear in the report.

| Base assignment element | Where it appears in this report | Notes |
|:--|:--|:--|
| Five-sentence reflection on one Week 4 reading | Part 1 - Five-sentence reading reflection | Reflection is based on Brooks (2026). |
| Use YouTube or Bluesky API data | Part 2 - Data collection and text analysis and Source summary | This report uses the YouTube Data API v3 comment workflow with saved CSV caching. |
| Gather at least 100 comments/posts, or as many as available | Source and dataset summary | The cleaned dataset contains more than 100 comments. |
| Clean and prepare text data | Text preparation and Example of a cleaned comment | Cleaning includes lowercasing, removing links/punctuation, trimming whitespace, and removing stop words for token analysis. |
| Perform word frequency / common-term analysis | Tokenize words and remove stop words and Signal 1 | Word counts are shown as a table and lollipop chart. |
| Include at least one text-analysis visualization | Visualization guide through Signal 9 | The report includes multiple visualizations, including word frequency, bigrams, themes, timing, engagement, and location references in text. |
| Briefly discuss key themes or insights | Integrated interpretation of findings | The interpretation connects word counts, phrases, themes, and engagement patterns. |
| Use external sources to support the argument | References below | References include course readings, YouTube API documentation, and text-mining resources. |
| Show data collection process through code snippets or screenshots | Credential notes, Collect or load YouTube comments, and Appendix A | Code is folded by default but available to expand in the HTML report. |

19.1 References

Brooks, T. L. (2026). Measuring the Diffusion Speed of Negative Sentiment on Social Media: Insights for Companies Responding to Consumer Backlash.

Campan, A., & Holtke, N. (2024). Beyond Twitter: Exploring Alternative API Sources for Social Media Analytics.

Google Developers. (n.d.). YouTube Data API v3 Reference. https://developers.google.com/youtube/v3

Google Developers. (n.d.). YouTube Embedded Players and Player Parameters. https://developers.google.com/youtube/player_parameters

Silge, J., & Robinson, D. (2017). Text Mining with R: A Tidy Approach. O’Reilly Media. https://www.tidytextmining.com/

Sood, G. (n.d.). tuber: Access YouTube from R. https://soodoku.github.io/tuber/

Wickham, H., François, R., Henry, L., Müller, K., & Vaughan, D. (n.d.). dplyr: A Grammar of Data Manipulation. https://dplyr.tidyverse.org/

Xu, J. Z. (2026). Scraping Comments from a SpaceX YouTube Video Using R: A Step-by-Step Tutorial Using the tuber Package and the YouTube Data API v3.