Major 3 Reddit Data Analysis

Zihan Weng

2024-11-25

packages <- c("RedditExtractoR", "anytime", "magrittr", "httr", "tidytext", "tidyverse", 
              "igraph", "ggraph", "wordcloud2", "textdata", "sentimentr", "lubridate", 
              "ggdark", "stringi", "knitr")

# Install packages not yet installed
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
  install.packages(packages[!installed_packages])
}

# Load packages
invisible(lapply(packages, library, character.only = TRUE))

Step 1. Introduction

Describe in one sentence what you aim to examine using user-generated text data and sentiment analysis.

Examine gamers’ sentiment toward the game “Silksong” across all time using Reddit threads.

Step 2. Reddit Data

Search Reddit threads using a keyword of your choice.

As discussed in class, searching by keyword or subreddit on Reddit yields a maximum of 250 results per query. To gather sufficient data for analysis, I searched for the keyword across multiple subreddits and combined the results.

# search for subreddits
subreddit_list <- RedditExtractoR::find_subreddits("Silksong")

## parsing URLs on page 1...

subreddit_list[1:30,c('subreddit','title','subscribers')] %>% knitr::kable()

	subreddit	title	subscribers
4viev2	Silksong	Silksong	70880
9cp9hk	Silksong_R34	Silksong_R34	34
5j1d11	SilksongRefuge	SilksongRefuge	66
35gpe	HollowKnight	Hollow Knight	865148
3em9zp	IsSilksongOut	IsSilksongOut	1997
5ichn6	silksong_warzone	silksong_warzone	82
j958e	HollowKnightMemes	Git Gud!	230041
oh0nj	HollowKnightArt	HK Art	33838
8k5z4a	SilksongSpeedrun	SilksongSpeedrun	77
7j4pmp	FakeSilksong	FakeSilksong	47
320zt8	SilksongIsntReal	SilksongIsntReal	1753
4xauj7	hollowknightdaily	hollowknightdaily	2063
6lpz1m	SilksongCult	SilksongCult	181
5hurjx	RefugeOfSilksong	RefugeOfSilksong	579
2uzei	TwoBestFriendsPlay	The Hypest Subreddit on the Internet	108756
wjzyd	HollowKnightSilksong	Hollow Knight: Silksong	0
4d09fo	Memes_O_Silksong	Memes_O_Silksong	38
4bnfht	PantheonOfFusions	PantheonOfFusions	836
2vath	Trophies	Trophies	164724
5rorcd	RefugeOfSilksongMemes	RefugeOfSilksongMemes	46
3h47q	NintendoSwitch	Nintendo Switch - News, Updates, and Information	7433023
2tvxq	metroidvania	Metroidvania: guided non-linearity; utility-based exploration	97778
5in4mc	silksongwar	silksongwar	31
997q70	Silksongforsanepeople	Silksongforsanepeople	293
2qh03	gaming	r/gaming	44400726
wm01b	SilkSongMemes	SSMEMES	0
5hoq65	silksongisout	silksongisout	241
2qhwp	Games	Quality Gaming Content and Discussion – /r/Games	3339711
122hf1	Eldenring	r/EldenRing	3803003
5ll3nj	silksongisreal	silksongisreal	391

# Fetch threads using the keyword
threads_1 <- find_thread_urls(keywords = "Silksong", subreddit = 'Silksong', sort_by = 'relevance', period = 'all') %>% 
  drop_na()

rownames(threads_1) <- NULL

# Fetch threads from two more subreddits
threads_2 <- find_thread_urls(keywords = "Silksong", subreddit = 'HollowKnight', sort_by = 'relevance', period = 'all') %>% 
  drop_na()
rownames(threads_2) <- NULL

threads_3 <- find_thread_urls(keywords = "Silksong", subreddit = 'Games', sort_by = 'relevance', period = 'all') %>% 
  drop_na()
rownames(threads_3) <- NULL

# Combine threads from all three searches and remove duplicates
threads_combined <- bind_rows(threads_1, threads_2, threads_3) %>% 
  distinct(url, .keep_all = TRUE)

# Save the combined threads as a CSV file to avoid knitting error with RedditExtractoR package
write.csv(threads_combined, "threads_combined.csv", row.names = FALSE)

# Load the previously saved data
threads_combined <- read.csv("threads_combined.csv")
print(paste("Number of threads retrieved:", nrow(threads_combined)))

## [1] "Number of threads retrieved: 602"

Following this process, I compiled the threads matching my chosen keyword, “Silksong,” and saved them to a CSV file for later use. I tried several spellings and formats, such as “Silk Song” and “SilkSong”, and kept this one because it gave the best results while still returning enough data. The final dataset contains 602 entries from three subreddits combined. I set the time period to “all” and sorted the results by “relevance” to get a comprehensive dataset, as this game is relatively new and I wanted to avoid restrictions that might limit the results.

Step 3. Data Cleaning

Clean your text data and then tokenize it.

# Combine title and text for analysis
threads_combined <- threads_combined %>%
  mutate(title = replace_na(title, ""),
         text = replace_na(text, ""),
         title_text = str_c(title, text, sep = ". "))

# Load stop words
data("stop_words")

# Define regular pattern to remove URLs and HTML entities
replace_reg <- "http[s]?://[A-Za-z\\d/\\.]+|&amp;|&lt;|&gt;"

# Clean text
threads_clean <- threads_combined %>% 
  mutate(title_text = str_replace_all(title_text, replace_reg, ""))

# tokenize
words <- threads_clean %>% 
  unnest_tokens(output = word, input = title_text, token = 'words') %>% 
  # Remove stop words
  anti_join(stop_words, by = "word") %>% 
  # Remove non-alphabetic strings
  filter(str_detect(word, "^[a-z]+$"))

Now we have the cleaned text in threads_clean and the tokenized words in words, both ready for further analysis.
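To see what the cleaning pattern does and does not remove, here is a minimal sketch on a made-up string. It uses base R's gsub with perl = TRUE as a stand-in for str_replace_all so it runs without extra packages. The broader pattern broad_reg and the helper strip_pattern are my own assumptions for comparison, not part of the analysis above.

```r
# Pattern from the cleaning step above
replace_reg <- "http[s]?://[A-Za-z\\d/\\.]+|&amp;|&lt;|&gt;"
# Assumed broader alternative: treat everything up to the next whitespace as URL
broad_reg <- "https?://\\S+|&amp;|&lt;|&gt;"

# Made-up sample text whose URL contains a query string
sample_text <- "Trailer: https://example.com/watch?v=abc_123 &amp; more"

# Hypothetical helper: remove matches, then collapse leftover whitespace
strip_pattern <- function(x, pattern) {
  cleaned <- gsub(pattern, "", x, perl = TRUE)
  trimws(gsub("\\s+", " ", cleaned))
}

narrow <- strip_pattern(sample_text, replace_reg)
broad <- strip_pattern(sample_text, broad_reg)
print(c(narrow = narrow, broad = broad))
```

The original pattern stops at the first character outside letters, digits, "/" and ".", so URL fragments after "?", "-" or "_" can survive as residue. Most such residue is later dropped by the alphabetic-only token filter, but it may still reach the sentiment step, which works on the full text.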

Step 4. Word Cloud

Generate a word cloud that illustrates the frequency of words except your keyword.

# Remove the keyword and its variants ("silk", "song", "silksong")
words <- words %>% filter(!word %in% c('Silk', 'Song', 'silk', 'song', 'silksong'))

# Create a custom color palette
n <- 20
h <- runif(n, 0, 1) # Any color
s <- runif(n, 0.6, 1) # Vivid
v <- runif(n, 0.3, 0.7) # Neither too dark nor too bright

df_hsv <- data.frame(h = h, s = s, v = v)
pal <- apply(df_hsv, 1, function(x) hsv(x['h'], x['s'], x['v']))
pal <- c(pal, rep("grey", 10000))

# Generate the word cloud
words %>% 
  count(word, sort = TRUE) %>% 
  wordcloud2(color = pal, 
             minRotation = 0, 
             maxRotation = 0, 
             ellipticity = 0.8)

From the word cloud above, the most frequent words include “game”, “hollow”, “knight”, “team”, “cherry”, “day”, “wait”, “trailer”, “releasing”, “announce”, “news”, and “website”. Most of these refer to the game’s developer (Team Cherry), its predecessor (Hollow Knight), or the wait for its release.

Step 5. Tri-gram Analysis

Conduct a tri-gram analysis.

Extract tri-grams from your text data. Remove tri-grams containing stop words or non-alphabetic terms. Present the frequency of tri-grams in a table. Discuss any noteworthy tri-grams you come across. If no meaningful tri-grams are found, you may analyze bi-grams as well. However, you still need to show results of the tri-grams.

# 1. Extract tri-grams from the text data
trigrams <- threads_clean %>%
  mutate(title_text = str_replace_all(title_text, replace_reg, "")) %>%
  select(title_text) %>%
  unnest_tokens(output = trigram, input = title_text, token = "ngrams", n = 3)

# Separate trigrams into individual words
trigrams_separated <- trigrams %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ")

# 2. Filter out trigrams containing stop words or non-alphabetic terms
trigrams_filtered <- trigrams_separated %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word,
         !word3 %in% stop_words$word) %>%
  filter(str_detect(word1, "^[a-z]+$"),
         str_detect(word2, "^[a-z]+$"),
         str_detect(word3, "^[a-z]+$")) %>%
  filter(!word1 %in% c('Silk', 'Song', 'silk', 'song', 'silksong'),
         !word2 %in% c('Silk', 'Song', 'silk', 'song', 'silksong'),
         !word3 %in% c('Silk', 'Song', 'silk', 'song', 'silksong'))

# Remove trigrams with non-ASCII characters
trigrams_filtered <- trigrams_filtered %>%
  filter(stri_enc_isascii(word1) & stri_enc_isascii(word2) & stri_enc_isascii(word3))

# 3. Count the frequency of trigrams
trigram_counts <- trigrams_filtered %>%
  count(word1, word2, word3, sort = TRUE)

# Present the frequency of trigrams in a table
trigram_counts %>% head(25) %>% knitr::kable()

word1	word2	word3	n
drawing	hollow	knight	96
poorly	drawing	hollow	95
tba	official	website	48
official	twitter	page	38
tba	announce	trailer	29
picture	link	trailer	20
link	trailer	link	18
steam	page	link	15
release	date	official	13
kinda	funny	games	12
date	official	website	10
summer	game	fest	8
tba	official	twitter	8
tba	missed	spring	7
indie	games	coming	6
gameplay	trailer	coming	5
xbox	games	showcase	5
action	platformer	multiplayer	4
bethesda	games	showcase	4
discord	live	updates	4
nintendo	indie	world	4
xbox	bethesda	games	4
date	announce	trailer	3
elder	scrolls	online	3
gameplay	trailer	releasing	3

I also ran a bi-gram analysis, following the same process as the tri-gram analysis, to see whether it yields more meaningful results.

bigrams <- threads_clean %>%
  mutate(title_text = str_replace_all(title_text, replace_reg, "")) %>%
  select(title_text) %>%
  unnest_tokens(output = bigram, input = title_text, token = "ngrams", n = 2)

bigrams_separated <- bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")

bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  filter(str_detect(word1, "^[a-z]+$"),
         str_detect(word2, "^[a-z]+$")) %>%
  filter(!word1 %in% c('Silk', 'Song', 'silk', 'song', 'silksong'),
         !word2 %in% c('Silk', 'Song', 'silk', 'song', 'silksong'))

bigrams_filtered <- bigrams_filtered %>%
  filter(stri_enc_isascii(word1) & stri_enc_isascii(word2))

bigram_counts <- bigrams_filtered %>%
  count(word1, word2, sort = TRUE)

bigram_counts %>% head(25) %>% knitr::kable()

word1	word2	n
hollow	knight	185
release	date	103
official	website	102
drawing	hollow	96
poorly	drawing	95
announce	trailer	82
team	cherry	57
tba	official	56
official	twitter	44
twitter	page	38
tba	announce	29
nintendo	direct	27
nintendo	switch	27
steam	page	24
tba	missed	21
link	trailer	20
picture	link	20
release	window	18
trailer	link	18
page	link	16
date	official	13
game	awards	13
gameplay	trailer	13
funny	games	12
kinda	funny	12

Both analyses gave meaningful results:
1. Both tables show Silksong’s strong association with its predecessor, “Hollow Knight.” Players are likely discussing comparisons, expectations, or continuity between the two games.
2. The developer, Team Cherry, is a central focus of discussion, reflecting player interest in its development progress, communication, and announcements.
3. Fans are clearly awaiting updates and news about the game, which reflects the community’s emotional investment in Silksong’s development and launch (background: the game was announced five years ago, has been delayed since, and has not been released yet).
4. The community discusses the game’s trailers extensively.
5. Many posts mention the game’s (or developer’s) social media, official website, and pages on gaming platforms.
6. Discussions of in-game content are also present.
Overall, the bi-gram and tri-gram analyses tell a consistent story: the community keeps comparing “Silksong” with “Hollow Knight”, follows the developer closely, and discusses the game’s content and trailers.
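The nearly equal counts of “poorly drawing hollow” (95) and “drawing hollow knight” (96) are what overlapping three-word windows over one recurring post title would produce. Here is a minimal base R sketch, assuming a shortened version of the “Day 716 of poorly drawing hollow knight…” title and a one-word stand-in stop list (the real analysis uses tidytext's stop_words and unnest_tokens):

```r
# Shortened, lowercased version of the recurring post title (assumption)
title <- "day 716 of poorly drawing hollow knight"
tokens <- strsplit(title, "\\s+")[[1]]

# Slide a window of 3 tokens over the title; one row per tri-gram
n <- 3
starts <- seq_len(length(tokens) - n + 1)
trigrams <- t(sapply(starts, function(i) tokens[i:(i + n - 1)]))

# Stand-in stop list for illustration only
toy_stop_words <- c("of")

# Keep tri-grams whose words are all alphabetic and not stop words
keep <- apply(trigrams, 1, function(w) {
  all(grepl("^[a-z]+$", w)) && !any(w %in% toy_stop_words)
})
kept <- apply(trigrams[keep, , drop = FALSE], 1, paste, collapse = " ")
print(kept)
```

Each occurrence of the title contributes both surviving tri-grams, so a daily post series can dominate the top of the frequency table on its own.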

Step 6. Sentiment Analysis

Perform a sentiment analysis on your text data using a dictionary method that accommodates negations.

You are welcome to apply a deep learning-based model to enrich your analysis, but employing the dictionary method is imperative.

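# Perform sentiment analysis using the sentimentr package
# Splitting into sentences explicitly avoids re-running sentence detection inside sentiment_by
sentiment_scores <- sentiment_by(get_sentences(threads_clean$title_text))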

# Add sentiment scores to the threads data frame
threads_clean$sentiment_scores <- sentiment_scores$ave_sentiment
threads_clean$word_count <- sentiment_scores$word_count
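To illustrate what accommodating negations means in practice, the sketch below scores two toy sentences (my own, not from the Reddit data) that differ only by a negator. It assumes sentimentr is installed with its default lexicon, in which "good" is a positive word.

```r
library(sentimentr)

# Toy sentences (not from the Reddit data) that differ only by a negator
toy_text <- c("This game is good.", "This game is not good.")

# Split into sentences, then average sentiment per input element
toy_scores <- sentiment_by(get_sentences(toy_text))$ave_sentiment

# The negator is expected to flip the sign of the second score
print(toy_scores)
```

A plain word-count dictionary would score both sentences as positive, since both contain "good"; the sign flip is what the negation handling adds.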

Step 7. Sentiment Analysis Outcomes

Display 10 sample texts alongside their sentiment scores and evaluate the credibility of the sentiment analysis outcomes.

I will use the 5 most negative and 5 most positive texts as the 10 samples, alongside their sentiment scores, to evaluate the credibility of the sentiment analysis outcomes.

# Select the 10 samples
threads_sentiment <- threads_clean %>%
  select(title_text, sentiment_scores)

threads_sentiment_sorted <- threads_sentiment %>%
  arrange(sentiment_scores)

# 5 most negative texts
top_negative <- threads_sentiment_sorted %>% head(5)

# 5 most positive texts
top_positive <- threads_sentiment_sorted %>% tail(5)

# Combine the samples
sample_texts <- bind_rows(top_negative, top_positive)

# Display the sample texts and their sentiment scores
sample_texts %>% knitr::kable()

title_text	sentiment_scores
Steam Scream: The Revenge Official Trailer.	-0.8164966
Day 716 of poorly drawing hollow knight inconsistently because I’m lazy and dumb and going insane until silksong comes out.	-0.6653056
Silksong was disappointing .	-0.5773503
I will no longer remain silent, this bullying tactic to keep me quiet is laughable..	-0.5163978
Silksong fans have no hope.	-0.4919350
To the shock of maybe about 2 people Silksong won the Most Anticipated Game Award on the Unity Awards thingy.	0.5735393
Game Awards Creator Geoff Keighley Looks Ahead to Decembers Ceremony: Theres Not a Frontrunner.	0.6250000
NEW SILKSONG TEASER ART.	0.7000000
Totally real new SilkSong character trust (bonus fan art of definitely new character). This is real totally, trust!	1.0547937
Helpful Silksong guide for New Fans!.	1.1226828

# or in separate tables if needed
top_negative %>% select(title_text, sentiment_scores) %>% knitr::kable()

title_text	sentiment_scores
Steam Scream: The Revenge Official Trailer.	-0.8164966
Day 716 of poorly drawing hollow knight inconsistently because I’m lazy and dumb and going insane until silksong comes out.	-0.6653056
Silksong was disappointing .	-0.5773503
I will no longer remain silent, this bullying tactic to keep me quiet is laughable..	-0.5163978
Silksong fans have no hope.	-0.4919350

top_positive %>% select(title_text, sentiment_scores) %>% knitr::kable()

	title_text	sentiment_scores
598	To the shock of maybe about 2 people Silksong won the Most Anticipated Game Award on the Unity Awards thingy.	0.5735393
599	Game Awards Creator Geoff Keighley Looks Ahead to Decembers Ceremony: Theres Not a Frontrunner.	0.6250000
600	NEW SILKSONG TEASER ART.	0.7000000
601	Totally real new SilkSong character trust (bonus fan art of definitely new character). This is real totally, trust!	1.0547937
602	Helpful Silksong guide for New Fans!.	1.1226828

To evaluate the credibility of the sentiment analysis outcomes, I manually checked whether the sentiment scores align with the text content. Based on the table(s) above, the results are reasonably accurate: scores generally match the tone and emotional content of the text, particularly for negative statements. However, “Day 716 of poorly drawing hollow knight…”, a fairly neutral statement, was scored as very negative because of its wording. Some of the positive samples also contain sarcasm or exaggeration, which the analysis may have scored as more positive than intended. Additionally, for factual statements like “Game Awards Creator Geoff Keighley…,” the positive score likely overstates the actual emotional tone. Overall, I think the sentiment analysis is credible for straightforward emotional expressions but may struggle with sarcasm and neutral factual statements.
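One way to sanity-check individual scores is to reproduce them by hand. sentimentr divides a sentence's summed word polarity by the square root of its word count, so short titles with one or two lexicon hits receive large scores. The sketch below assumes (not verified against the lexicon here) that "disappointing", "scream", and "revenge" each carry a polarity of -1 with no valence shifters nearby; sentence_score is a hypothetical helper name.

```r
# Hypothetical helper: summed polarity divided by sqrt(word count)
sentence_score <- function(polarities, n_words) {
  sum(polarities) / sqrt(n_words)
}

# "Silksong was disappointing" -> 3 words, one assumed hit of -1
score_disappointing <- sentence_score(-1, 3)

# "Steam Scream: The Revenge Official Trailer" -> 6 words, two assumed hits of -1
score_scream <- sentence_score(c(-1, -1), 6)

print(round(c(score_disappointing, score_scream), 7))
```

Under these assumptions the arithmetic matches the reported scores of -0.5773503 and -0.8164966. This would explain how a factual trailer title can top the negative list: two negatively scored words in a six-word title are enough.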

Step 8. Insights and Conclusions

Discuss intriguing insights derived from the sentiment analysis, supporting your observations with at least two plots.

# Plot 1: Distribution of sentiment scores
# We can see if the majority of the posts are positive or negative.
ggplot(threads_clean, aes(x = sentiment_scores)) +
  geom_histogram(binwidth = 0.1, fill = 'steelblue', color = 'black') +
  xlab('Sentiment Score') +
  ylab('Count') +
  ggtitle('Distribution of Sentiment Scores')

# Plot 2: Average sentiment over time (by month)
# Shows how the average sentiment has changed over time since the game's announcement

# Parse the date (explicitly in UTC, matching date_utc) before flooring to the month
threads_clean$date <- as.POSIXct(threads_clean$date_utc, tz = "UTC")
threads_clean <- threads_clean %>%
  mutate(month = floor_date(date, 'month'))

sentiment_by_month <- threads_clean %>%
  group_by(month) %>%
  summarise(avg_sentiment = mean(sentiment_scores, na.rm = TRUE))

ggplot(sentiment_by_month, aes(x = month, y = avg_sentiment)) +
  geom_line(color = 'steelblue') +
  xlab('Month') +
  ylab('Average Sentiment') +
  ggtitle('Average Sentiment Over Time')

# Plot 3: Sentiment by Day of Week
threads_clean <- threads_clean %>%
  mutate(day_of_week = wday(date, label = TRUE))

sentiment_by_day <- threads_clean %>%
  group_by(day_of_week) %>%
  summarise(avg_sentiment = mean(sentiment_scores, na.rm = TRUE))

ggplot(sentiment_by_day, aes(x = day_of_week, y = avg_sentiment)) +
  geom_col(fill = 'steelblue') +
  xlab('Day of Week') +
  ylab('Average Sentiment') +
  ggtitle('Average Sentiment by Day of Week')

From the sentiment analysis, we can derive several intriguing insights:

Overall Neutral to (Slightly) Positive Sentiment: The first plot shows that most sentiment scores cluster around neutral (0), with a slight skew toward the positive side. Discussions about the game therefore mix excitement and criticism but lean positive overall. The spread of negative scores points to criticism and frustration, but these posts are fewer than the neutral and positive ones. Most of the community appears to hold a hopeful or neutral stance, which is common in discussions of long-awaited games like Silksong. The negative outliers, however, suggest specific pain points, such as the extremely long wait for news, updates, and the release, or unmet expectations.
Fluctuating Sentiment Over Time: The game was announced in 2019, so the discussion starts then. The second plot shows how sentiment has fluctuated from that point to today. Sentiment trended upward early on, with sharp drops around specific periods (e.g., early 2020 and mid-2021) followed by recovery. These drops might correlate with missed announcements, delays, critical feedback from fans or media, or major gaming events. Sentiment improves in later years and remains mostly stable (mostly above 0 in recent years), which suggests renewed excitement, possibly due to teaser releases, gameplay showcases, award nominations, or published trailers. The graph reflects the ebb and flow of community engagement, with peaks likely tied to exciting announcements and troughs reflecting frustration over delays or a lack of updates.
Day of the Week Variation: Average sentiment on weekdays is slightly positive, while weekend sentiment hovers between neutral and slightly positive. This may be because fans discuss the game casually or engage with content in their free time, while major updates and announcements tend to land during the workweek.
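When reading the monthly plot, it may help to keep the number of posts behind each average in view, since a month with only one or two posts can move the line sharply. Here is a minimal sketch with made-up data, written in base R so it runs without the packages above; in the pipeline itself this would roughly correspond to adding a count column inside summarise().

```r
# Made-up posts: dates and scores are illustrative only
toy <- data.frame(
  date_utc = c("2020-01-03", "2020-01-20", "2020-02-11",
               "2021-06-05", "2021-06-18", "2021-06-30"),
  sentiment_scores = c(0.2, -0.4, 0.1, 0.3, 0.0, 0.3)
)

# Reduce each date to a year-month label
toy$month <- format(as.Date(toy$date_utc), "%Y-%m")

# Average sentiment and number of posts per month
monthly <- data.frame(
  month = sort(unique(toy$month)),
  avg_sentiment = as.numeric(tapply(toy$sentiment_scores, toy$month, mean)),
  n_posts = as.integer(tapply(toy$sentiment_scores, toy$month, length))
)
print(monthly)
```

Months with very few posts could then be flagged or dropped before interpreting a dip as a shift in community mood.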

Through this analysis, we gained valuable insights into the gaming community’s perception of Silksong since its announcement. While waiting for the release, the community shows a generally positive and hopeful sentiment, punctuated by periods of disappointment likely tied to delays or a lack of communication on social media.