Major Assignment 3

1. Describe in one sentence what you aim to examine using user-generated text data and sentiment analysis.

I aim to examine how sentiment toward Drake in Reddit posts changed between March 2024 and March 2025.

For your reference: The “Drake vs. Kendrick Lamar” Feud

  • The Context: Drake and Kendrick Lamar are two of the most commercially successful and critically acclaimed figures in modern hip-hop. For over a decade, they have maintained a subtle rivalry regarding who claims the title of the “greatest rapper” of their generation.

  • The Incident (May 2024): The tension escalated into a direct conflict in early 2024. In May, Kendrick Lamar released a series of “diss tracks” (songs intended to insult a rival), most notably the global hit “Not Like Us.”

  • The Climax (February 2025): The rivalry reached its definitive peak when Kendrick Lamar headlined the Super Bowl LIX Halftime Show. This performance was widely interpreted by the public and media as his ultimate “victory lap,” effectively cementing his dominance over Drake in the cultural narrative.

2. Search Reddit threads using a keyword of your choice.

Load Packages

packages <- c("RedditExtractoR", "anytime", "magrittr", "httr", "tidytext", "tidyverse", "igraph", "ggraph", "wordcloud2", "textdata", "here", "ggdark", "syuzhet", "sentimentr", "lubridate")

# Install packages not yet installed
installed_packages <- packages %in% rownames(installed.packages())
if (any(!installed_packages)) {
  install.packages(packages[!installed_packages])
}

# Load packages
invisible(lapply(packages, library, character.only = TRUE))

Search Reddit threads

Keyword

# # using keyword
# drake_1 <- find_thread_urls(keywords = "drake",
#                               sort_by = 'relevance',
#                               period = 'all') %>%
#   drop_na()
# 
# rownames(drake_1) <- NULL
# 
# drake_1 <- drake_1 %>%
#   mutate(
#     date_utc = as.Date(date_utc)  # Convert PostDate to Date value
#   ) %>%
#   filter(
#     date_utc >= as.Date("2024-03-01"), # Filter
#     date_utc <= as.Date("2025-03-31")
#   )

Subreddit

# # using subreddit
# #hiphopheads: biggest hiphop subreddit
# drake_2 <- find_thread_urls(keywords= "drake", 
#                             subreddit = "hiphopheads", 
#                             sort_by = 'relevance', 
#                             period = 'all') %>% 
#   drop_na()
# 
# rownames(drake_2) <- NULL
# 
# drake_2 <- drake_2 %>%
#   mutate(
#     date_utc = as.Date(date_utc)  # Convert PostDate to Date value
#   ) %>%
#   filter(
#     date_utc >= as.Date("2024-03-01"), # Filter
#     date_utc <= as.Date("2025-03-31")
#   )
# 
# # using a second subreddit
# # rap: another hip-hop subreddit
# drake_3 <- find_thread_urls(keywords= "drake",
#                             subreddit = "rap", 
#                             sort_by = 'relevance', 
#                             period = 'all') %>% 
#   drop_na()
# 
# rownames(drake_3) <- NULL
# 
# drake_3 <- drake_3 %>%
#   mutate(
#     date_utc = as.Date(date_utc)  # Convert PostDate to Date value
#   ) %>%
#   filter(
#     date_utc >= as.Date("2024-03-01"), # Filter
#     date_utc <= as.Date("2025-03-31")
#   )

3. Clean your text data and then tokenize it.

Clean Data

# #Merge Data
# drake_total <- bind_rows(drake_1, drake_2, drake_3)
# 
# #Drop Duplicates
# drake_total <- drake_total %>% distinct()
# 
# # Sanitize text
# drake_total %<>% 
#   mutate(across(
#     where(is.character),
#     ~ .x %>%
#         str_replace_all("\\|", "/") %>%   # replace vertical bars
#         str_replace_all("\\n", " ") %>%   # replace newlines
#         str_squish()                      # clean up extra spaces
#   ))

Download and reload data

# Save merged data as CSV
# write.csv(drake_total, "drake_total.csv", row.names = FALSE)

# Reload the saved CSV
drake_total <- read.csv("drake_total.csv", stringsAsFactors = FALSE)

Tokenization

# Word tokenization
words <- drake_total %>% 
  unnest_tokens(output = word, input = text, token = "words")

4. Generate a word cloud that illustrates the frequency of words excluding your keyword.

# Word tokenization, dropping the search keyword itself
words <- drake_total %>% 
  unnest_tokens(output = word, input = text, token = "words") %>%
  filter(word != "drake")
words %>% 
  count(word, sort = TRUE) %>% 
  wordcloud2()

The cloud is dominated by stop words, so I remove them and draw the word cloud again.

# Regex that matches URL-like strings and the HTML entities &amp;, &lt;, &gt;
replace_reg <- "http[s]?://[A-Za-z\\d/\\.]+|&amp;|&lt;|&gt;"

words_clean <- words %>% 
  # drop stop words
  anti_join(stop_words, by = "word") %>% 
  # keep only tokens that contain at least one lowercase letter
  filter(str_detect(word, "[a-z]")) %>%
  filter(!str_detect(word, replace_reg))

# Check the number of rows after removal of the stop words. There should be fewer words now
print(
  glue::glue("Before: {nrow(words)}, After: {nrow(words_clean)}")
)
## Before: 17559, After: 6642
words_clean %>% 
  count(word, sort = TRUE) %>% 
  wordcloud2(size = 0.8, shuffle = FALSE)

Interestingly, “Kendrick” is the most frequently mentioned word in data collected with the keyword “Drake”.
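
The filter in step 4 drops only the exact token "drake", so possessive or plural forms can still appear in the cloud. As a minimal sketch of one way to catch them, the base-R example below uses a made-up token vector and a hypothetical pattern (`keyword_pattern`); neither is part of the analysis above.

```r
# Toy tokens (illustrative only); "\u2019" is the curly apostrophe.
tokens <- c("drake", "drake's", "drake\u2019s", "drakes", "kendrick", "draken")

# Hypothetical pattern: "drake", optionally followed by an apostrophe and/or "s".
keyword_pattern <- "^drake('|\u2019)?s?$"

# Keep every token that is NOT a variant of the keyword.
kept <- tokens[!grepl(keyword_pattern, tokens)]
print(kept)
```

In the pipeline this would correspond to something like `filter(!str_detect(word, keyword_pattern))` in place of `filter(word != "drake")`, which would change the word counts reported here.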

5. Conduct a tri-gram analysis.

Extract tri-grams from your text data.

# Get tri-grams.
drake_trigram <- drake_total %>%
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  select(text) %>%
  unnest_tokens(output = paired_words,
                input = text,
                token = "ngrams",
                n = 3)
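
One caveat about `replace_reg`: its URL part stops at the first character outside letters, digits, `/` and `.`, so a URL containing a hyphen is only partly removed. Fragments such as "express.com entertainment music" in the frequency table are consistent with this. The sketch below compares the pattern with a hypothetical broader one (`replace_reg_wide`, my own assumption) on a made-up string, using base R only.

```r
# Pattern copied from the analysis above.
replace_reg <- "http[s]?://[A-Za-z\\d/\\.]+|&amp;|&lt;|&gt;"

# Hypothetical alternative: consume every non-space character after the scheme.
replace_reg_wide <- "https?://\\S+|&amp;|&lt;|&gt;"

# Made-up example text containing a hyphenated URL and an HTML entity.
x <- "streams up https://www.the-express.com/entertainment/music/137028/kendrick-lamar &amp; more"

narrow <- gsub(replace_reg, "", x, perl = TRUE)       # stops at the first "-"
wide   <- gsub(replace_reg_wide, "", x, perl = TRUE)  # removes the whole URL

print(narrow)
print(wide)
```

The broader pattern also swallows punctuation attached to the end of a URL (for example a closing parenthesis), so it trades one kind of residue for another. Switching patterns would change the tri-gram counts reported in this section.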

Remove tri-grams containing stop words or non-alphabetic terms.

#separate the paired words into three columns
words_ngram_pair <- drake_trigram %>%
  separate(paired_words, c("word1", "word2", "word3"), sep = " ")

# filter rows where there are stop words under word 1 column, word 2 column, word 3 column
words_ngram_pair_filtered <- words_ngram_pair %>%
  # drop stop words
  filter(!word1 %in% stop_words$word & !word2 %in% stop_words$word & !word3 %in% stop_words$word) %>% 
  # keep only tri-grams in which every word contains at least one lowercase letter
  filter(str_detect(word1, "[a-z]") & str_detect(word2, "[a-z]") & str_detect(word3, "[a-z]"))

# Filter out tri-grams containing words that are not encoded in ASCII
# To see what's ASCII, google 'ASCII table'
library(stringi)
words_ngram_pair_filtered %<>% 
  filter(stri_enc_isascii(word1) & stri_enc_isascii(word2) & stri_enc_isascii(word3))

Present the frequency of tri-grams in a table.

# Sort the new tri-gram (n=3) counts:
words_counts <- words_ngram_pair_filtered %>%
  count(word1, word2, word3) %>%
  arrange(desc(n))

head(words_counts, 20) %>% 
  knitr::kable()
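| word1 | word2 | word3 | n |
|:---|:---|:---|---:|
| drake | partynextdoor | prod | 11 |
| prod | dj | lewis | 5 |
| prod | kid | masterpiece | 5 |
| certified | lover | boy | 3 |
| drake | partynextdoor | feat | 3 |
| partynextdoor | prod | dj | 3 |
| partynextdoor | prod | kid | 3 |
| wah | gwan | delilah | 3 |
| 10m | sales | us.html | 2 |
| 46t | y9orgxgtsfhfmnh0bdarlw | edit | 2 |
| dark | lane | demo | 2 |
| die | hard | drake | 2 |
| dj | lewis | noel | 2 |
| dr | dre | ft | 2 |
| drake | prod | kid | 2 |
| drake’s | front | door | 2 |
| dre | ft | kendrick | 2 |
| express.com | entertainment | music | 2 |
| ft | future | drake | 2 |
| future | ft | kendrick | 2 |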

Discuss any noteworthy tri-grams you come across.

I had to look up the context for several of these, and a few tri-grams relate directly to events in the feud.

1. future ft kendrick (Frequency: 2)

Context: This phrase likely refers to Kendrick Lamar’s feature on Future and Metro Boomin’s track “Like That,” released in March 2024.

Discussion: This tri-gram seems to mark the catalyst of the feud. In his verse, Kendrick Lamar rejected the idea of a “Big 3” (Drake, J. Cole, Kendrick) and famously claimed, “It’s just big me,” which openly ignited the conflict between the two artists.

2. drake’s front door (Frequency: 2)

Context: This phrase likely refers to the shooting of a security guard outside Drake’s Toronto residence (The Embassy) in May 2024.

Discussion: This tri-gram highlights the escalation of the feud from music to real-world violence. It suggests public concern that the lyrical battle had crossed into dangerous territory, shifting part of the online discourse from entertainment to safety.

3. wah gwan delilah (Frequency: 3)

Context: This likely refers to “Wah Gwan Delilah,” a parody remix of “Hey There Delilah” by Snowd4y featuring Drake, delivered in a heavy Jamaican Patois accent.

Discussion: Released in the aftermath of the main diss battle, this track caused confusion and ridicule among listeners. The appearance of this tri-gram suggests that some posters questioned Drake’s seriousness or state of mind following his perceived loss in the feud, viewing the release as either bizarre behavior or an attempt to “troll” the audience.

4. drake partynextdoor prod (Frequency: 11), drake partynextdoor feat (Frequency: 3)

The most frequent tri-gram overall is drake partynextdoor prod (Frequency: 11), and drake partynextdoor feat also appears three times. This prominence is likely due to the release of their collaborative album in February 2025. The neighboring tri-grams built around “prod” (prod dj lewis, prod kid masterpiece) suggest that many of these counts come from track listings with production credits rather than from discussion. PartyNextDoor is often described as one of the few associates who remained loyal to Drake throughout the controversy, so their continued partnership and the timing of this release were a notable talking point, in contrast to the departure of other collaborators.

6. Perform a sentiment analysis on your text data using a dictionary method that accommodates negations.

#Merge Title and Text
drake_sentiment <- drake_total %>%
  mutate(
    title = replace_na(title, ""),
    text  = replace_na(text, ""),
    title_text = str_c(title, text, sep = ". ")
  )

#Separate Sentences
drake_sentiment_sts <- drake_sentiment %>%
  mutate(title_text_split = get_sentences(title_text))

#Sentiment Score
drake_sentiment_scores <- drake_sentiment_sts %$%
  sentiment_by(title_text_split)
drake_sentiment$sentiment_dict <- drake_sentiment_scores %>% pull(ave_sentiment)
drake_sentiment$word_count     <- drake_sentiment_scores %>% pull(word_count)

drake_sentiment %>%
  select(title_text, sentiment_dict, word_count) %>%
  head()
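
To illustrate what “accommodating negations” means in a dictionary method, here is a toy base-R sketch. The lexicon, the negator list, the two-word look-back window, and the function name `score_sentence` are all my own assumptions for illustration. It does not reproduce sentimentr’s actual algorithm, which uses a much larger lexicon and more kinds of valence shifters; it only loosely mirrors the idea of flipping polarity after a negator and scaling by the square root of the word count.

```r
# Toy polarity lexicon and negator list (illustrative only, not sentimentr's).
lexicon  <- c(good = 1, great = 1, like = 0.5, bad = -1, terrible = -1)
negators <- c("not", "no", "never")

score_sentence <- function(sentence, window = 2) {
  # Lowercase, replace punctuation with spaces, split on whitespace.
  tokens <- strsplit(tolower(gsub("[^A-Za-z' ]", " ", sentence)), "\\s+")[[1]]
  tokens <- tokens[tokens != ""]
  if (length(tokens) == 0) return(0)

  total <- 0
  for (i in seq_along(tokens)) {
    polarity <- lexicon[tokens[i]]
    if (is.na(polarity)) next  # token is not in the lexicon
    # Look back up to `window` tokens for a negator and flip the sign if found.
    context <- if (i > 1) tokens[max(1, i - window):(i - 1)] else character(0)
    if (any(context %in% negators)) polarity <- -polarity
    total <- total + polarity
  }
  # Scale by sqrt(word count) so that long sentences are not favoured.
  unname(total) / sqrt(length(tokens))
}

print(score_sentence("This is good"))
print(score_sentence("This is not good"))
print(score_sentence("Not Like Us"))
```

Under these toy rules the song title “Not Like Us” receives a negative score purely because “not” precedes “like”, even though a title carries no opinion. A similar effect may be at work in the real scores.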

7. Display 10 sample texts alongside their sentiment scores and evaluate the credibility of the sentiment analysis outcomes.

examples_sentiment <- bind_rows(
  drake_sentiment %>%
    arrange(sentiment_dict) %>%
    slice_head(n = 5) %>%
    mutate(sentiment_type = "most_negative"),
  
  drake_sentiment %>%
    arrange(desc(sentiment_dict)) %>%
    slice_head(n = 5) %>%
    mutate(sentiment_type = "most_positive")
)

examples_sentiment %>%
  select(sentiment_type, sentiment_dict, title_text) %>%
  knitr::kable()
| sentiment_type | sentiment_dict | title_text |
|:---|---:|:---|
| most_negative | -0.7100469 | Universal Music Group Responds to Drake Legal Filing Over Not Like Us: Offensive & Untrue. |
| most_negative | -0.6250000 | Tupac Shakurs Estate Threatens to Sue Drake Over Diss Track Featuring AI-Generated Tupac Voice. |
| most_negative | -0.6147009 | Drake claims label should have refused to release Kendrick Lamars Not Like Us. |
| most_negative | -0.5809475 | Yuno Miles Says He Regrets Kendrick & Drake Diss: The Past Two Days Felt Terrible. |
| most_negative | -0.5773503 | Drake officially loses. |
| most_positive | 0.5964787 | What’s your favorite Kendrick song and your favorite Drake song?. With this beef going on a lot of people are picking sides, but they’re both great artist in their own regard and lane so what’s your favorite song from each? Mine for Kendrick is Poe Man’s Dreams and from Drake It’s Redemption. |
| most_positive | 0.5964787 | What’s your favorite Kendrick song and your favorite Drake song?. With this beef going on a lot of people are picking sides, but they’re both great artist in their own regard and lane so what’s your favorite song from each? Mine for Kendrick is Poe Man’s Dreams and from Drake It’s Redemption. |
| most_positive | 0.5659970 | Kendrick Lamars music streams increase by almost 50% while Drakes drops amid beef. https://www.the-express.com/entertainment/music/137028/kendrick-lamar-music-streams-increase-drake-beef |
| most_positive | 0.4157609 | [FRESH VIDEO] DRAKE - NOKIA (Official Music Video). |
| most_positive | 0.3897114 | [Highlight] Drake Maye delivers a perfect throw for first career touchdown pass. |

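I do not think this result is fully credible. For example, “Universal Music Group Responds to Drake Legal Filing Over Not Like Us: Offensive & Untrue.” is ranked as the most negative text, but the score seems to be driven by the song title “Not Like Us” and the quoted words “Offensive” and “Untrue” rather than by the poster’s attitude. In my view, this is a neutral news headline, not a negative statement. The positive side has similar problems: the headline about Kendrick’s streams rising while Drake’s drop is scored as positive although it is bad news for Drake, and the “Drake Maye” post is about an NFL quarterback, not the rapper.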

8. Discuss intriguing insights derived from the sentiment analysis, supporting your observations with at least THREE plots.

1. Sentiment change over the period

drake_sentiment_clean <- drake_sentiment %>%
  filter(!is.na(sentiment_dict),
         !is.na(date_utc)) %>%
  mutate(
    date_utc = as.Date(date_utc)
  )

# Monthly Average
sentiment_monthly <- drake_sentiment_clean %>%
  group_by(month = floor_date(date_utc, "month")) %>%
  summarise(
    mean_sentiment = mean(sentiment_dict),
    n_posts        = n(),
    .groups = "drop"
  )

# Significant Date
event_dates <- tibble(
  event_date = as.Date(c("2024-03-22", "2024-05-04", "2025-02-09")),
  event_label = c("Like That release",
                  "Not Like Us release",
                  "Super Bowl LIX (Kendrick)")
)

ggplot(sentiment_monthly, aes(x = month, y = mean_sentiment)) +
  geom_line() +
  geom_point() +
  geom_vline(data = event_dates,
             aes(xintercept = as.numeric(event_date)),
             linetype = "dashed") +
  geom_text(data = event_dates,
            aes(x = event_date,
                y = max(sentiment_monthly$mean_sentiment, na.rm = TRUE),
                label = event_label),
            angle = 90, vjust = -0.5, hjust = 1, size = 3) +
  labs(
    title = "Monthly Average Sentiment of Drake-related Reddit Posts",
    x = "Month",
    y = "Average sentiment score"
  ) +
  theme_minimal()

Correlation with Key Events: The monthly averages move with some of the key events in the feud. The sharp decline in April 2024 follows the release of “Like That” (March 22), which is consistent with a negative reaction from the community as the feud ignited. Because each monthly mean may rest on a small number of posts, this timing is suggestive rather than conclusive.

Volatility and “Diss Fatigue”: The graph is highly volatile, with erratic swings between May and October 2024. This suggests that sentiment reacted to individual news cycles and track releases rather than following a steady trend.

The Pre-Super Bowl Low: The lowest point occurs in January 2025, just before the Super Bowl. This dip may reflect the community’s negative anticipation of Kendrick Lamar’s “victory lap” performance or dissatisfaction with Drake’s activities during that period (such as the build-up to the release of his collaborative album). Headlines about Drake’s legal dispute with Universal Music Group, which are among the most negative texts in the sample above, may also pull the average down. The plot alone cannot separate these explanations.
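
The monthly summary above computes `n_posts`, but the plot does not show it, so it is hard to tell how much each monthly mean can be trusted. As a minimal sketch of how the mean could be reported with its post count and an approximate 95% interval, the base-R example below uses simulated data (`toy`), not my Reddit data. The column names mirror those above, and the interval assumes roughly independent posts.

```r
set.seed(1)

# Simulated stand-in for drake_sentiment_clean (illustrative only).
toy <- data.frame(
  date_utc       = as.Date("2024-03-01") + sample(0:120, 60, replace = TRUE),
  sentiment_dict = rnorm(60, mean = 0, sd = 0.2)
)

# First day of each post's month, used as the grouping key.
toy$month <- as.Date(format(toy$date_utc, "%Y-%m-01"))

# Mean, post count, and standard error per month.
monthly <- aggregate(
  sentiment_dict ~ month, data = toy,
  FUN = function(v) c(mean = mean(v), n = length(v), se = sd(v) / sqrt(length(v)))
)
monthly <- do.call(data.frame, monthly)  # flatten the matrix column
names(monthly) <- c("month", "mean_sentiment", "n_posts", "se")

# Approximate 95% interval around each monthly mean.
monthly$lower <- monthly$mean_sentiment - 1.96 * monthly$se
monthly$upper <- monthly$mean_sentiment + 1.96 * monthly$se

print(monthly)
```

On the real data, a month with a wide interval or very few posts would be a reason to read its peak or dip cautiously. The `lower` and `upper` columns could be drawn with `geom_errorbar()` on the existing plot.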

2. Distribution of sentiment by feud phase

drake_phase <- drake_sentiment_clean %>%
  mutate(
    phase = case_when(
      date_utc >= as.Date("2024-03-01") & date_utc <= as.Date("2024-04-30") ~ "Phase 1: Early feud",
      date_utc >= as.Date("2024-05-01") & date_utc <= as.Date("2025-01-31") ~ "Phase 2: Diss war",
      date_utc >= as.Date("2025-02-01") & date_utc <= as.Date("2025-03-31") ~ "Phase 3: Post Super Bowl",
      TRUE ~ NA_character_
    )
  ) %>%
  filter(!is.na(phase))

ggplot(drake_phase, aes(x = phase, y = sentiment_dict)) +
  geom_violin(trim = FALSE, alpha = 0.5) +
  geom_boxplot(width = 0.15, outlier.size = 0.7) +
  labs(
    title = "Sentiment Distribution by Feud Phase",
    x = "Phase",
    y = "Sentiment score"
  ) +
  theme_minimal()

Consistency of the Median: Despite the intensity of the rivalry, the median sentiment (the bold horizontal line within each box plot) stays near zero across all three phases. This indicates that while the “loudest” posts may have been extreme, the typical post in the sample remained relatively balanced or neutral.

Polarization in Phase 1: The “Early feud” phase shows a slightly more stretched distribution than the others. This may reflect higher polarization as fans debated the start of the conflict, producing a mix of defensive praise for Drake and sharp criticism.

Stabilization in Phase 3: In the “Post Super Bowl” phase (which in my coding starts on February 1, a few days before the game), the violin appears slightly more condensed. This suggests that as the narrative of the feud settled, with the public largely viewing Kendrick as the victor, the emotional intensity of the Reddit threads cooled down, leading to fewer extreme sentiment scores.
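
The differences in spread described above are read off the violin shapes by eye. As a small sketch of how they could be checked numerically, the base-R example below computes the count, median, interquartile range, and range per phase on simulated data (`toy`), not my Reddit data. The phase labels are shortened placeholders.

```r
set.seed(42)

# Simulated stand-in for drake_phase (illustrative only).
toy <- data.frame(
  phase          = rep(c("Phase 1", "Phase 2", "Phase 3"), times = c(20, 50, 30)),
  sentiment_dict = rnorm(100, mean = 0, sd = 0.2)
)

# One row of summary statistics per phase.
spread <- do.call(rbind, lapply(split(toy$sentiment_dict, toy$phase), function(v) {
  data.frame(n = length(v), median = median(v), iqr = IQR(v),
             min = min(v), max = max(v))
}))

print(spread)
```

A clearly larger `iqr` for Phase 1 on the real data would support the polarization reading. Similar values across phases would suggest that the visual differences mostly reflect the unequal number of posts per phase.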

3. Posts mentioning Kendrick vs. posts not mentioning Kendrick

drake_kendrick <- drake_sentiment_clean %>%
  mutate(
    has_kendrick = if_else(
      str_detect(str_to_lower(title_text), "kendrick"),
      "Kendrick mentioned",
      "Kendrick not mentioned"
    )
  )

ggplot(drake_kendrick, aes(x = has_kendrick, y = sentiment_dict)) +
  geom_violin(trim = FALSE, alpha = 0.5) +
  geom_boxplot(width = 0.15, outlier.size = 0.7) +
  labs(
    title = "Sentiment When Kendrick is Mentioned vs Not Mentioned",
    x = "",
    y = "Sentiment score"
  ) +
  theme_minimal()

Unexpected Similarity: Visually, the distributions for posts that mention Kendrick Lamar and posts that do not are very similar. This is counter-intuitive; one might expect threads discussing a bitter rival to be noticeably more negative.

The “Spectacle” Factor: The fact that sentiment does not drop noticeably when Kendrick is mentioned suggests that Reddit users may view the feud as entertainment. Words associated with high-profile rap battles (e.g., “legendary,” “classic,” “winner”) are often coded as positive in sentiment dictionaries, which might counterbalance the negative context of the conflict.

Drake’s Baseline Sentiment: This comparison implies that Drake’s sentiment on Reddit is not solely defined by his rivalry with Kendrick. The “Kendrick not mentioned” group still contains a wide range of negative scores, indicating that posters have critiques of Drake’s music or behavior independent of the feud itself.
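
The similarity between the two groups is judged visually above. One way to check it more formally would be a Wilcoxon rank-sum test, which compares two groups without assuming normally distributed scores. The base-R sketch below runs it on simulated data (`toy`), not my Reddit data, and the choice of this particular test is my own assumption.

```r
set.seed(7)

# Simulated stand-in for drake_kendrick (illustrative only).
toy <- data.frame(
  has_kendrick   = rep(c("Kendrick mentioned", "Kendrick not mentioned"), each = 40),
  sentiment_dict = c(rnorm(40, mean = 0, sd = 0.2), rnorm(40, mean = 0, sd = 0.2))
)

# Two-sided Wilcoxon rank-sum test of sentiment between the two groups.
test <- wilcox.test(sentiment_dict ~ has_kendrick, data = toy)

print(test$p.value)
```

On the real data, a large p-value would be consistent with the “unexpected similarity” reading, although it would not prove the groups are identical, especially with a small sample.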