Major 3 Reddit Data Analysis

Zihan Weng

2024-11-25

packages <- c("RedditExtractoR", "anytime", "magrittr", "httr", "tidytext", "tidyverse", 
              "igraph", "ggraph", "wordcloud2", "textdata", "sentimentr", "lubridate", 
              "ggdark", "stringi", "knitr")

# Install packages not yet installed
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
  install.packages(packages[!installed_packages])
}

# Load packages
invisible(lapply(packages, library, character.only = TRUE))

Step 1. Introduction

Describe in one sentence what you aim to examine using user-generated text data and sentiment analysis.

Examine gamers’ sentiment toward the game “Silksong” across all available Reddit threads.

Step 2. Reddit Data

Search Reddit threads using a keyword of your choice.

As discussed in class, searching by keyword or subreddit on Reddit yields a maximum of 250 results per query. To gather sufficient data for analysis, I searched for the keyword across multiple subreddits and combined the results.

# search for subreddits
subreddit_list <- RedditExtractoR::find_subreddits("Silksong")
## parsing URLs on page 1...
subreddit_list[1:30,c('subreddit','title','subscribers')] %>% knitr::kable()
| | subreddit | title | subscribers |
|---|---|---|---|
| 4viev2 | Silksong | Silksong | 70880 |
| 9cp9hk | Silksong_R34 | Silksong_R34 | 34 |
| 5j1d11 | SilksongRefuge | SilksongRefuge | 66 |
| 35gpe | HollowKnight | Hollow Knight | 865148 |
| 3em9zp | IsSilksongOut | IsSilksongOut | 1997 |
| 5ichn6 | silksong_warzone | silksong_warzone | 82 |
| j958e | HollowKnightMemes | Git Gud! | 230041 |
| oh0nj | HollowKnightArt | HK Art | 33838 |
| 8k5z4a | SilksongSpeedrun | SilksongSpeedrun | 77 |
| 7j4pmp | FakeSilksong | FakeSilksong | 47 |
| 320zt8 | SilksongIsntReal | SilksongIsntReal | 1753 |
| 4xauj7 | hollowknightdaily | hollowknightdaily | 2063 |
| 6lpz1m | SilksongCult | SilksongCult | 181 |
| 5hurjx | RefugeOfSilksong | RefugeOfSilksong | 579 |
| 2uzei | TwoBestFriendsPlay | The Hypest Subreddit on the Internet | 108756 |
| wjzyd | HollowKnightSilksong | Hollow Knight: Silksong | 0 |
| 4d09fo | Memes_O_Silksong | Memes_O_Silksong | 38 |
| 4bnfht | PantheonOfFusions | PantheonOfFusions | 836 |
| 2vath | Trophies | Trophies | 164724 |
| 5rorcd | RefugeOfSilksongMemes | RefugeOfSilksongMemes | 46 |
| 3h47q | NintendoSwitch | Nintendo Switch - News, Updates, and Information | 7433023 |
| 2tvxq | metroidvania | Metroidvania: guided non-linearity; utility-based exploration | 97778 |
| 5in4mc | silksongwar | silksongwar | 31 |
| 997q70 | Silksongforsanepeople | Silksongforsanepeople | 293 |
| 2qh03 | gaming | r/gaming | 44400726 |
| wm01b | SilkSongMemes | SSMEMES | 0 |
| 5hoq65 | silksongisout | silksongisout | 241 |
| 2qhwp | Games | Quality Gaming Content and Discussion – /r/Games | 3339711 |
| 122hf1 | Eldenring | r/EldenRing | 3803003 |
| 5ll3nj | silksongisreal | silksongisreal | 391 |
# Fetch threads using the keyword
threads_1 <- find_thread_urls(keywords = "Silksong", subreddit = 'Silksong', sort_by = 'relevance', period = 'all') %>% 
  drop_na()

rownames(threads_1) <- NULL

# Fetch threads from two other subreddits
threads_2 <- find_thread_urls(keywords = "Silksong", subreddit = 'HollowKnight', sort_by = 'relevance', period = 'all') %>% 
  drop_na()
rownames(threads_2) <- NULL

threads_3 <- find_thread_urls(keywords = "Silksong", subreddit = 'Games', sort_by = 'relevance', period = 'all') %>% 
  drop_na()
rownames(threads_3) <- NULL

# Combine threads from all three searches and remove duplicates
threads_combined <- bind_rows(threads_1, threads_2, threads_3) %>% 
  distinct(url, .keep_all = TRUE)

# Save the combined threads as a CSV file to avoid knitting error with RedditExtractoR package
write.csv(threads_combined, "threads_combined.csv", row.names = FALSE)
# Load the previously saved data
threads_combined <- read.csv("threads_combined.csv")
print(paste("Number of threads retrieved:", nrow(threads_combined)))
## [1] "Number of threads retrieved: 602"

Following this process, I compiled the threads matching my chosen keyword, “Silksong,” and saved them to a CSV file for later use. I tried several spellings, such as “Silk Song” and “SilkSong”, and kept “Silksong” because it gave the best results while still returning enough data. The final dataset contains 602 threads from three subreddits combined. I set the time period to “all” and sorted the results by “relevance” because the game is relatively new and I wanted to avoid restrictions that might limit the results.
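The three near-identical searches above could also be written as a loop over subreddit names. The sketch below is one possible way to do that, not the method used for the results in this report. `fetch_threads()` and `combine_threads()` are hypothetical helper names (they are not part of RedditExtractoR), and the toy data frames only stand in for real query results so the de-duplication step can be checked offline.

```r
library(dplyr)
library(tidyr)

# Hypothetical wrapper around the same search call used above.
# It needs network access, so it is defined here but not called.
fetch_threads <- function(subreddit, keyword = "Silksong") {
  RedditExtractoR::find_thread_urls(keywords = keyword, subreddit = subreddit,
                                    sort_by = "relevance", period = "all") %>%
    drop_na()
}

# Hypothetical helper: stack per-subreddit results and keep one row per URL.
combine_threads <- function(thread_list) {
  bind_rows(thread_list) %>%
    distinct(url, .keep_all = TRUE)
}

# Toy stand-ins for two query results that share one thread ("u2").
toy_a <- tibble(url = c("u1", "u2"), title = c("first", "second"))
toy_b <- tibble(url = c("u2", "u3"), title = c("second again", "third"))
toy_combined <- combine_threads(list(toy_a, toy_b))
print(toy_combined)  # three rows; the first occurrence of "u2" is kept

# Real usage (not run here):
# threads_combined <- lapply(c("Silksong", "HollowKnight", "Games"), fetch_threads) %>%
#   combine_threads()
```

Because `distinct()` keeps the first occurrence of each URL, the order of the subreddits decides which copy of a cross-posted thread survives.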

Step 3. Data Cleaning

Clean your text data and then tokenize it.

# Combine title and text for analysis
threads_combined <- threads_combined %>%
  mutate(title = replace_na(title, ""),
         text = replace_na(text, ""),
         title_text = str_c(title, text, sep = ". "))

# Load stop words
data("stop_words")

# Regular expression matching URLs and HTML entities to remove
replace_reg <- "http[s]?://[A-Za-z\\d/\\.]+|&amp;|&lt;|&gt;"

# Clean text
threads_clean <- threads_combined %>% 
  mutate(title_text = str_replace_all(title_text, replace_reg, ""))

# tokenize
words <- threads_clean %>% 
  unnest_tokens(output = word, input = title_text, token = 'words') %>% 
  # Remove stop words
  anti_join(stop_words, by = "word") %>% 
  # Remove non-alphabetic strings
  filter(str_detect(word, "^[a-z]+$"))

We now have a cleaned dataset, threads_clean, and its tokenized counterpart, words, both ready for the analyses that follow.
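To see what the cleaning pattern does on a concrete string, the sketch below applies `replace_reg` to a made-up example. The sample text and the alternative pattern `broader_reg` are assumptions added for illustration. The original character class has no hyphen, `?`, or `=`, so a URL containing those characters is only partly removed.

```r
library(stringr)

# Pattern used in this report
replace_reg <- "http[s]?://[A-Za-z\\d/\\.]+|&amp;|&lt;|&gt;"
# Assumed alternative: drop everything up to the next whitespace
broader_reg <- "https?://\\S+|&amp;|&lt;|&gt;"

# Made-up example text
sample_text <- "See https://my-site.com/x?y=1 now"

cleaned_original <- str_replace_all(sample_text, replace_reg, "")
cleaned_broader  <- str_squish(str_replace_all(sample_text, broader_reg, ""))

print(cleaned_original)  # "See -site.com/x?y=1 now"
print(cleaned_broader)   # "See now"
```

Leftover URL fragments such as "site" or "com" would still pass the alphabetic filter after tokenization, so the broader pattern may be worth considering if such tokens show up in the word counts.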

Step 4. Word Cloud

Generate a word cloud that illustrates the frequency of words except your keyword.

# Remove the keyword and its variants ("silk", "song", "silksong")
words <- words %>% filter(!word %in% c('Silk', 'Song', 'silk', 'song', 'silksong'))

# Create a custom color palette
n <- 20
h <- runif(n, 0, 1) # Any color
s <- runif(n, 0.6, 1) # Vivid
v <- runif(n, 0.3, 0.7) # Neither too dark nor too bright

df_hsv <- data.frame(h = h, s = s, v = v)
pal <- apply(df_hsv, 1, function(x) hsv(x['h'], x['s'], x['v']))
pal <- c(pal, rep("grey", 10000))

# Generate the word cloud
words %>% 
  count(word, sort = TRUE) %>% 
  wordcloud2(color = pal, 
             minRotation = 0, 
             maxRotation = 0, 
             ellipticity = 0.8)

The word cloud shows that the most frequent words include “game”, “hollow”, “knight”, “team”, “cherry”, “day”, “wait”, “trailer”, “releasing”, “announce”, “news”, and “website”. Most of these refer to the game’s developer (Team Cherry), its predecessor (Hollow Knight), or the wait for its release.

Step 5. Tri-gram Analysis

Conduct a tri-gram analysis.

Extract tri-grams from your text data. Remove tri-grams containing stop words or non-alphabetic terms. Present the frequency of tri-grams in a table. Discuss any noteworthy tri-grams you come across. If no meaningful tri-grams are found, you may analyze bi-grams as well. However, you still need to show results of the tri-grams.

# 1. Extract tri-grams from the text data
trigrams <- threads_clean %>%
  mutate(title_text = str_replace_all(title_text, replace_reg, "")) %>%
  select(title_text) %>%
  unnest_tokens(output = trigram, input = title_text, token = "ngrams", n = 3)

# Separate trigrams into individual words
trigrams_separated <- trigrams %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ")

# 2. Filter out trigrams containing stop words or non-alphabetic terms
trigrams_filtered <- trigrams_separated %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word,
         !word3 %in% stop_words$word) %>%
  filter(str_detect(word1, "^[a-z]+$"),
         str_detect(word2, "^[a-z]+$"),
         str_detect(word3, "^[a-z]+$")) %>%
  filter(!word1 %in% c('Silk', 'Song', 'silk', 'song', 'silksong'),
         !word2 %in% c('Silk', 'Song', 'silk', 'song', 'silksong'),
         !word3 %in% c('Silk', 'Song', 'silk', 'song', 'silksong'))

# Remove trigrams with non-ASCII characters
trigrams_filtered <- trigrams_filtered %>%
  filter(stri_enc_isascii(word1) & stri_enc_isascii(word2) & stri_enc_isascii(word3))

# 3. Count the frequency of trigrams
trigram_counts <- trigrams_filtered %>%
  count(word1, word2, word3, sort = TRUE)

# Present the frequency of trigrams in a table
trigram_counts %>% head(25) %>% knitr::kable()
| word1 | word2 | word3 | n |
|---|---|---|---|
| drawing | hollow | knight | 96 |
| poorly | drawing | hollow | 95 |
| tba | official | website | 48 |
| official | twitter | page | 38 |
| tba | announce | trailer | 29 |
| picture | link | trailer | 20 |
| link | trailer | link | 18 |
| steam | page | link | 15 |
| release | date | official | 13 |
| kinda | funny | games | 12 |
| date | official | website | 10 |
| summer | game | fest | 8 |
| tba | official | twitter | 8 |
| tba | missed | spring | 7 |
| indie | games | coming | 6 |
| gameplay | trailer | coming | 5 |
| xbox | games | showcase | 5 |
| action | platformer | multiplayer | 4 |
| bethesda | games | showcase | 4 |
| discord | live | updates | 4 |
| nintendo | indie | world | 4 |
| xbox | bethesda | games | 4 |
| date | announce | trailer | 3 |
| elder | scrolls | online | 3 |
| gameplay | trailer | releasing | 3 |

I also ran a bi-gram analysis to see whether it yields more meaningful results; it follows the same process as the tri-gram analysis.

bigrams <- threads_clean %>%
  mutate(title_text = str_replace_all(title_text, replace_reg, "")) %>%
  select(title_text) %>%
  unnest_tokens(output = bigram, input = title_text, token = "ngrams", n = 2)

bigrams_separated <- bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")

bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  filter(str_detect(word1, "^[a-z]+$"),
         str_detect(word2, "^[a-z]+$")) %>%
  filter(!word1 %in% c('Silk', 'Song', 'silk', 'song', 'silksong'),
         !word2 %in% c('Silk', 'Song', 'silk', 'song', 'silksong'))

bigrams_filtered <- bigrams_filtered %>%
  filter(stri_enc_isascii(word1) & stri_enc_isascii(word2))

bigram_counts <- bigrams_filtered %>%
  count(word1, word2, sort = TRUE)

bigram_counts %>% head(25) %>% knitr::kable()
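| word1 | word2 | n |
|---|---|---|
| hollow | knight | 185 |
| release | date | 103 |
| official | website | 102 |
| drawing | hollow | 96 |
| poorly | drawing | 95 |
| announce | trailer | 82 |
| team | cherry | 57 |
| tba | official | 56 |
| official | twitter | 44 |
| twitter | page | 38 |
| tba | announce | 29 |
| nintendo | direct | 27 |
| nintendo | switch | 27 |
| steam | page | 24 |
| tba | missed | 21 |
| link | trailer | 20 |
| picture | link | 20 |
| release | window | 18 |
| trailer | link | 18 |
| page | link | 16 |
| date | official | 13 |
| game | awards | 13 |
| gameplay | trailer | 13 |
| funny | games | 12 |
| kinda | funny | 12 |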

Both analyses produced meaningful results:
1. Both tables show Silksong’s strong association with its predecessor, “Hollow Knight.” Players are likely discussing comparisons, expectations, or continuity between the two games, although about half of the “hollow knight” bi-grams (96 of 185) come from the recurring “poorly drawing hollow knight” posts.
2. The developer of the game, Team Cherry, is a central focus of discussion (“team cherry” appears 57 times), reflecting player interest in its development progress, communication, or announcements.
3. Fans are clearly awaiting updates and news (“release date”, “release window”, “tba”), which reflects the community’s emotional investment in Silksong’s development and launch. As background, the game was announced five years ago, has been delayed since, and has not yet been released.
4. The community discusses the game’s trailers extensively (“announce trailer”, “gameplay trailer”).
5. Many threads mention the game’s or the developer’s social media, official website, and pages on gaming platforms (“official website”, “twitter page”, “steam page”).
6. Some discussion of in-game content is visible, for example “action platformer multiplayer.”
Overall, the bi-gram and tri-gram analyses tell a consistent story: the community keeps comparing Silksong with Hollow Knight, follows the developer closely, and focuses on trailers and release news.
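The bi-gram and tri-gram steps above differ only in `n`, so they could be folded into one function. The sketch below is a possible refactoring under that assumption, not the code that produced the tables. `count_ngrams()` is a hypothetical helper name, the toy texts are made up, and the n-gram is kept as a single string column rather than split into `word1`, `word2`, and so on.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Hypothetical helper: count n-grams whose words are all lowercase alphabetic
# and are neither stop words nor one of `drop_words`.
count_ngrams <- function(texts, n, drop_words = c("silk", "song", "silksong")) {
  banned <- c(tidytext::stop_words$word, drop_words)
  ngrams <- tibble(text = texts) %>%
    unnest_tokens(output = ngram, input = text, token = "ngrams", n = n) %>%
    filter(!is.na(ngram))  # texts shorter than n words yield NA
  parts <- str_split(ngrams$ngram, " ")
  keep <- vapply(parts, function(w) {
    all(str_detect(w, "^[a-z]+$")) && !any(w %in% banned)
  }, logical(1))
  ngrams[keep, ] %>% count(ngram, sort = TRUE)
}

# Made-up example texts
toy_texts <- c("Poorly drawing Hollow Knight",
               "poorly drawing hollow knight today",
               "The game is out")

toy_bigrams  <- count_ngrams(toy_texts, n = 2)
toy_trigrams <- count_ngrams(toy_texts, n = 3)
print(toy_bigrams)
print(toy_trigrams)
```

With the real data this would be called as `count_ngrams(threads_clean$title_text, n = 3)`. The pattern `^[a-z]+$` already excludes non-ASCII tokens, so a separate ASCII check is not needed in this version.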

Step 6. Sentiment Analysis

Perform a sentiment analysis on your text data using a dictionary method that accommodates negations.

You are welcome to apply a deep learning-based model to enrich your analysis, but employing the dictionary method is imperative.

# Perform sentiment analysis using the sentimentr package
sentiment_scores <- sentiment_by(threads_clean$title_text)

# Add sentiment scores to the threads data frame
threads_clean$sentiment_scores <- sentiment_scores$ave_sentiment
threads_clean$word_count <- sentiment_scores$word_count
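
The assignment asks for a dictionary method that accommodates negations. The short check below shows how `sentiment_by()` treats a negator. The two sentences are made up for illustration and are not taken from the dataset.

```r
library(sentimentr)

# Made-up pair that differs only by the negator "not"
toy_texts <- c("I am happy with this game.",
               "I am not happy with this game.")

# Split into sentences first, then score each text
toy_scores <- sentiment_by(get_sentences(toy_texts))
print(toy_scores$ave_sentiment)  # first score positive, second negative
```

A plain word-count dictionary would score both sentences the same, since both contain “happy”; sentimentr flips the sign when a negator precedes the polarized word.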

Step 7. Sentiment Analysis Outcomes

Display 10 sample texts alongside their sentiment scores and evaluate the credibility of the sentiment analysis outcomes.

I selected the 5 most negative and the 5 most positive texts as the 10 samples and show them alongside their sentiment scores to evaluate the credibility of the sentiment analysis outcomes.

# Select the 10 samples
threads_sentiment <- threads_clean %>%
  select(title_text, sentiment_scores)

threads_sentiment_sorted <- threads_sentiment %>%
  arrange(sentiment_scores)

# 5 most negative texts
top_negative <- threads_sentiment_sorted %>% head(5)

# 5 most positive texts
top_positive <- threads_sentiment_sorted %>% tail(5)

# Combine the samples
sample_texts <- bind_rows(top_negative, top_positive)

# Display the sample texts and their sentiment scores
sample_texts %>% knitr::kable()
| title_text | sentiment_scores |
|---|---|
| Steam Scream: The Revenge Official Trailer. | -0.8164966 |
| Day 716 of poorly drawing hollow knight inconsistently because I’m lazy and dumb and going insane until silksong comes out. | -0.6653056 |
| Silksong was disappointing . | -0.5773503 |
| I will no longer remain silent, this bullying tactic to keep me quiet is laughable.. | -0.5163978 |
| Silksong fans have no hope. | -0.4919350 |
| To the shock of maybe about 2 people Silksong won the Most Anticipated Game Award on the Unity Awards thingy. | 0.5735393 |
| Game Awards Creator Geoff Keighley Looks Ahead to Decembers Ceremony: Theres Not a Frontrunner. | 0.6250000 |
| NEW SILKSONG TEASER ART. | 0.7000000 |
| Totally real new SilkSong character trust (bonus fan art of definitely new character). This is real totally, trust! | 1.0547937 |
| Helpful Silksong guide for New Fans!. | 1.1226828 |
# or in separate tables if needed
top_negative %>% select(title_text, sentiment_scores) %>% knitr::kable()
| title_text | sentiment_scores |
|---|---|
| Steam Scream: The Revenge Official Trailer. | -0.8164966 |
| Day 716 of poorly drawing hollow knight inconsistently because I’m lazy and dumb and going insane until silksong comes out. | -0.6653056 |
| Silksong was disappointing . | -0.5773503 |
| I will no longer remain silent, this bullying tactic to keep me quiet is laughable.. | -0.5163978 |
| Silksong fans have no hope. | -0.4919350 |
top_positive %>% select(title_text, sentiment_scores) %>% knitr::kable()
| | title_text | sentiment_scores |
|---|---|---|
| 598 | To the shock of maybe about 2 people Silksong won the Most Anticipated Game Award on the Unity Awards thingy. | 0.5735393 |
| 599 | Game Awards Creator Geoff Keighley Looks Ahead to Decembers Ceremony: Theres Not a Frontrunner. | 0.6250000 |
| 600 | NEW SILKSONG TEASER ART. | 0.7000000 |
| 601 | Totally real new SilkSong character trust (bonus fan art of definitely new character). This is real totally, trust! | 1.0547937 |
| 602 | Helpful Silksong guide for New Fans!. | 1.1226828 |

To evaluate the credibility of the sentiment analysis, I manually checked whether each score matches the content of its text. The results in the table(s) above are reasonably accurate: scores generally align with the tone of the text, particularly for negative statements such as “Silksong was disappointing.” However, “Day 716 of poorly drawing hollow knight…”, which is closer to a neutral statement, received a strongly negative score because of its wording. Some of the positive samples are sarcastic or exaggerated, so the method likely scores them as more positive than intended. For factual statements like “Game Awards Creator Geoff Keighley…,” the positive score overstates the actual emotional tone. Overall, the sentiment analysis is credible for straightforward emotional expressions but struggles with sarcasm and with neutral, factual statements.
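
One way to check why a particular text received its score is to list the dictionary words that sentimentr matched in it. The sketch below does this for the “Day 716” sample. It is an additional diagnostic added for illustration, not part of the original analysis, and the text is retyped with a straight apostrophe.

```r
library(sentimentr)

# Sample text from the table above (apostrophe normalized)
sample_text <- paste(
  "Day 716 of poorly drawing hollow knight inconsistently because",
  "I'm lazy and dumb and going insane until silksong comes out."
)

# List the polarized words the dictionary found in each sentence
matched_terms <- extract_sentiment_terms(get_sentences(sample_text))
print(matched_terms$negative)
print(matched_terms$positive)
```

If most matched words are self-deprecating adjectives rather than opinions about the game, that supports reading the strongly negative score as an artifact of wording.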

Step 8. Insights and Conclusions

Discuss intriguing insights derived from the sentiment analysis, supporting your observations with at least two plots.

# Plot 1: Distribution of sentiment scores
# We can see if the majority of the posts are positive or negative.
ggplot(threads_clean, aes(x = sentiment_scores)) +
  geom_histogram(binwidth = 0.1, fill = 'steelblue', color = 'black') +
  xlab('Sentiment Score') +
  ylab('Count') +
  ggtitle('Distribution of Sentiment Scores')

# Plot 2: Average sentiment over time (by month)
# Shows how the average sentiment has changed over time since the game's announcement

# Parse the UTC date string, then floor it to the first day of its month
threads_clean$date <- as.POSIXct(threads_clean$date_utc, tz = "UTC")
threads_clean <- threads_clean %>%
  mutate(month = floor_date(date, 'month'))

sentiment_by_month <- threads_clean %>%
  group_by(month) %>%
  summarise(avg_sentiment = mean(sentiment_scores, na.rm = TRUE))

ggplot(sentiment_by_month, aes(x = month, y = avg_sentiment)) +
  geom_line(color = 'steelblue') +
  xlab('Month') +
  ylab('Average Sentiment') +
  ggtitle('Average Sentiment Over Time')

# Plot 3: Sentiment by Day of Week
threads_clean <- threads_clean %>%
  mutate(day_of_week = wday(date, label = TRUE))

sentiment_by_day <- threads_clean %>%
  group_by(day_of_week) %>%
  summarise(avg_sentiment = mean(sentiment_scores, na.rm = TRUE))

ggplot(sentiment_by_day, aes(x = day_of_week, y = avg_sentiment)) +
  geom_bar(stat = 'identity', fill = 'steelblue') +
  xlab('Day of Week') +
  ylab('Average Sentiment') +
  ggtitle('Average Sentiment by Day of Week')

From the sentiment analysis, we can derive several intriguing insights:

  • Overall Neutral to (Slightly) Positive Sentiment: The first plot shows that most sentiment scores cluster around neutral (0), with a slight skew toward the positive side. Discussions about the game are therefore fairly balanced, mixing excitement and criticism but leaning positive. The spread of negative scores shows that criticism and frustration exist, but such posts are fewer than neutral and positive ones. Most of the community appears to hold a hopeful or neutral stance, which is common in discussions of long-awaited games like Silksong. The negative outliers point to specific pain points, such as the extremely long wait for news, updates, and the release, or unmet expectations.

  • Fluctuating Sentiment Over Time: The game was announced in 2019, so the discussion starts in that year. The second plot shows how sentiment has fluctuated since then. Sentiment trended upward early on, dropped sharply around specific periods (e.g., early 2020 and mid-2021), and then recovered. These drops might correlate with missed announcements, delays, critical feedback from fans or media, or major gaming events. Sentiment improves in later years and remains mostly stable (mostly above 0 in recent years), which suggests renewed excitement, possibly due to teaser releases, gameplay showcases, award nominations, or published trailers. The graph reflects the ebb and flow of community engagement, with peaks likely tied to exciting announcements and troughs reflecting frustration over delays or a lack of updates.

  • Day of the Week Variation: Average sentiment on weekdays is slightly positive, while sentiment on weekends hovers between neutral and slightly positive. This may be because major updates or announcements tend to occur during the workweek, whereas on weekends fans discuss the game more casually in their free time.
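
A monthly average can swing sharply when a month contains only a handful of posts, so it can help to report the post count next to each mean before reading too much into a peak or a drop. The sketch below shows one way to do that on made-up data. The toy values are assumptions, and `date_utc` is assumed to be a “YYYY-MM-DD” string as in the saved CSV.

```r
library(dplyr)
library(lubridate)

# Made-up stand-in for threads_clean
toy_threads <- tibble(
  date_utc = c("2019-02-14", "2019-02-20", "2019-03-01", "2022-06-12"),
  sentiment_scores = c(0.4, 0.2, -0.5, 0.1)
)

# Monthly mean together with the number of posts behind it
monthly_summary <- toy_threads %>%
  mutate(month = floor_date(as.Date(date_utc), "month")) %>%
  group_by(month) %>%
  summarise(n_posts = n(),
            avg_sentiment = mean(sentiment_scores, na.rm = TRUE),
            .groups = "drop")
print(monthly_summary)
```

In this toy data the March 2019 value rests on a single post, so its negative mean says little about the community as a whole. The same check applies to any sharp drop in the real plot.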

This analysis gave us valuable insight into how the gaming community has perceived Silksong since its announcement. While waiting for the release, the community has shown a generally positive and hopeful sentiment, punctuated by periods of disappointment likely tied to delays or a lack of communication on social media.