packages <- c("RedditExtractoR", "anytime", "magrittr", "httr", "tidytext", "tidyverse",
"igraph", "ggraph", "wordcloud2", "textdata", "sentimentr", "lubridate",
"ggdark", "stringi", "knitr")
# Install packages not yet installed
installed_packages <- packages %in% rownames(installed.packages())
if (any(!installed_packages)) {
  install.packages(packages[!installed_packages])
}
# Load packages
invisible(lapply(packages, library, character.only = TRUE))
Step 1. Introduction
Describe in one sentence what you aim to examine using user-generated text data and sentiment analysis.
Examine gamers’ sentiment toward the game “Silksong” across all time using Reddit threads.
Step 2. Reddit Data
Search Reddit threads using a keyword of your choice.
As discussed in class, searching by keyword or subreddit on Reddit yields a maximum of 250 results per query. To gather sufficient data for analysis, I searched for the keyword across multiple subreddits and combined the results.
# search for subreddits
subreddit_list <- RedditExtractoR::find_subreddits("Silksong")
## parsing URLs on page 1...
subreddit_list[1:30,c('subreddit','title','subscribers')] %>% knitr::kable()
| | subreddit | title | subscribers |
|---|---|---|---|
| 4viev2 | Silksong | Silksong | 70880 |
| 9cp9hk | Silksong_R34 | Silksong_R34 | 34 |
| 5j1d11 | SilksongRefuge | SilksongRefuge | 66 |
| 35gpe | HollowKnight | Hollow Knight | 865148 |
| 3em9zp | IsSilksongOut | IsSilksongOut | 1997 |
| 5ichn6 | silksong_warzone | silksong_warzone | 82 |
| j958e | HollowKnightMemes | Git Gud! | 230041 |
| oh0nj | HollowKnightArt | HK Art | 33838 |
| 8k5z4a | SilksongSpeedrun | SilksongSpeedrun | 77 |
| 7j4pmp | FakeSilksong | FakeSilksong | 47 |
| 320zt8 | SilksongIsntReal | SilksongIsntReal | 1753 |
| 4xauj7 | hollowknightdaily | hollowknightdaily | 2063 |
| 6lpz1m | SilksongCult | SilksongCult | 181 |
| 5hurjx | RefugeOfSilksong | RefugeOfSilksong | 579 |
| 2uzei | TwoBestFriendsPlay | The Hypest Subreddit on the Internet | 108756 |
| wjzyd | HollowKnightSilksong | Hollow Knight: Silksong | 0 |
| 4d09fo | Memes_O_Silksong | Memes_O_Silksong | 38 |
| 4bnfht | PantheonOfFusions | PantheonOfFusions | 836 |
| 2vath | Trophies | Trophies | 164724 |
| 5rorcd | RefugeOfSilksongMemes | RefugeOfSilksongMemes | 46 |
| 3h47q | NintendoSwitch | Nintendo Switch - News, Updates, and Information | 7433023 |
| 2tvxq | metroidvania | Metroidvania: guided non-linearity; utility-based exploration | 97778 |
| 5in4mc | silksongwar | silksongwar | 31 |
| 997q70 | Silksongforsanepeople | Silksongforsanepeople | 293 |
| 2qh03 | gaming | r/gaming | 44400726 |
| wm01b | SilkSongMemes | SSMEMES | 0 |
| 5hoq65 | silksongisout | silksongisout | 241 |
| 2qhwp | Games | Quality Gaming Content and Discussion – /r/Games | 3339711 |
| 122hf1 | Eldenring | r/EldenRing | 3803003 |
| 5ll3nj | silksongisreal | silksongisreal | 391 |
# Fetch threads using the keyword
threads_1 <- find_thread_urls(keywords = "Silksong", subreddit = 'Silksong', sort_by = 'relevance', period = 'all') %>%
drop_na()
rownames(threads_1) <- NULL
# Fetch threads from two more subreddits
threads_2 <- find_thread_urls(keywords = "Silksong", subreddit = 'HollowKnight', sort_by = 'relevance', period = 'all') %>%
drop_na()
rownames(threads_2) <- NULL
threads_3 <- find_thread_urls(keywords = "Silksong", subreddit = 'Games', sort_by = 'relevance', period = 'all') %>%
drop_na()
rownames(threads_3) <- NULL
# Combine threads from all three searches and remove duplicates
threads_combined <- bind_rows(threads_1, threads_2, threads_3) %>%
distinct(url, .keep_all = TRUE)
# Save the combined threads as a CSV file to avoid knitting error with RedditExtractoR package
write.csv(threads_combined, "threads_combined.csv", row.names = FALSE)
# Load the previously saved data
threads_combined <- read.csv("threads_combined.csv")
print(paste("Number of threads retrieved:", nrow(threads_combined)))
## [1] "Number of threads retrieved: 602"
Following this process, I compiled the threads matching my chosen keyword, “Silksong,” and saved them to a CSV file for reuse. I also tried other spellings, such as “Silk Song” and “SilkSong”, and kept “Silksong” because it returned the most relevant results while still providing enough data. The final dataset contains 602 unique threads from the three subreddits combined. I set the time period to “all” and sorted the results by “relevance” to keep the dataset as comprehensive as possible: the game is relatively new, and I wanted to avoid restrictions that might limit the results.
Step 3. Data Cleaning
Clean your text data and then tokenize it.
# Combine title and text for analysis
threads_combined <- threads_combined %>%
mutate(title = replace_na(title, ""),
text = replace_na(text, ""),
title_text = str_c(title, text, sep = ". "))
# Load stop words
data("stop_words")
# Define a regex pattern to remove URLs and HTML entities
replace_reg <- "http[s]?://[A-Za-z\\d/\\.]+|&amp;|&lt;|&gt;"
# Clean text
threads_clean <- threads_combined %>%
mutate(title_text = str_replace_all(title_text, replace_reg, ""))
# tokenize
words <- threads_clean %>%
unnest_tokens(output = word, input = title_text, token = 'words') %>%
# Remove stop words
anti_join(stop_words, by = "word") %>%
# Remove non-alphabetic strings
filter(str_detect(word, "^[a-z]+$"))
Now we have the cleaned dataset threads_clean and the tokenized dataset
words, both of which are ready for further analysis.
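To make the cleaning step concrete, here is a minimal base-R sketch applied to a made-up string. It assumes the pattern is meant to strip HTML entities such as `&amp;`, and it uses `gsub()` and `strsplit()` as stand-ins for `str_replace_all()` and `unnest_tokens()`, so it only approximates the pipeline above. The example string and URL are hypothetical.

```r
# Assumed pattern: URLs plus the HTML entities &amp;, &lt;, &gt;
replace_reg <- "http[s]?://[A-Za-z\\d/\\.]+|&amp;|&lt;|&gt;"

# Hypothetical post text (not taken from the dataset)
raw <- "Trailer &amp; news https://example.com/a1 &lt;3"

# Remove URLs and entities (perl = TRUE so that \\d works inside the class)
cleaned <- gsub(replace_reg, "", raw, perl = TRUE)

# Rough tokenization: lowercase, then split on anything that is not a letter or digit
tokens <- strsplit(tolower(cleaned), "[^a-z0-9]+")[[1]]
tokens <- tokens[tokens != ""]

# Keep purely alphabetic tokens, mirroring the "^[a-z]+$" filter
alpha_tokens <- tokens[grepl("^[a-z]+$", tokens)]
print(alpha_tokens)
```

Note that the URL character class contains no `-`, `_`, `?`, or `=`, so a URL that includes those characters is removed only up to the first one; the remainder stays in the text and is tokenized.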
Step 4. Word Cloud
Generate a word cloud that illustrates the frequency of words except your keyword.
# Remove the keyword and its variants ("silk", "song", "silksong")
words <- words %>% filter(!word %in% c('Silk', 'Song', 'silk', 'song', 'silksong'))
# Create a custom color palette
n <- 20
h <- runif(n, 0, 1) # Any color
s <- runif(n, 0.6, 1) # Vivid
v <- runif(n, 0.3, 0.7) # Neither too dark nor too bright
df_hsv <- data.frame(h = h, s = s, v = v)
pal <- apply(df_hsv, 1, function(x) hsv(x['h'], x['s'], x['v']))
pal <- c(pal, rep("grey", 10000))
# Generate the word cloud
words %>%
count(word, sort = TRUE) %>%
wordcloud2(color = pal,
minRotation = 0,
maxRotation = 0,
ellipticity = 0.8)
The word cloud shows that the most frequent words include “game”, “hollow”, “knight”, “team”, “cherry”, “day”, “wait”, “trailer”, “releasing”, “announce”, “news”, and “website”. Most of them refer to the game’s developer/publisher, its predecessor, or the wait for its release.
Step 5. Tri-gram Analysis
Conduct a tri-gram analysis.
Extract tri-grams from your text data. Remove tri-grams containing stop words or non-alphabetic terms. Present the frequency of tri-grams in a table. Discuss any noteworthy tri-grams you come across. If no meaningful tri-grams are found, you may analyze bi-grams as well. However, you still need to show results of the tri-grams.
# 1. Extract tri-grams from the text data
trigrams <- threads_clean %>%
mutate(title_text = str_replace_all(title_text, replace_reg, "")) %>%
select(title_text) %>%
unnest_tokens(output = trigram, input = title_text, token = "ngrams", n = 3)
# Separate trigrams into individual words
trigrams_separated <- trigrams %>%
separate(trigram, c("word1", "word2", "word3"), sep = " ")
# 2. Filter out trigrams containing stop words or non-alphabetic terms
trigrams_filtered <- trigrams_separated %>%
filter(!word1 %in% stop_words$word,
!word2 %in% stop_words$word,
!word3 %in% stop_words$word) %>%
filter(str_detect(word1, "^[a-z]+$"),
str_detect(word2, "^[a-z]+$"),
str_detect(word3, "^[a-z]+$")) %>%
filter(!word1 %in% c('Silk', 'Song', 'silk', 'song', 'silksong'),
!word2 %in% c('Silk', 'Song', 'silk', 'song', 'silksong'),
!word3 %in% c('Silk', 'Song', 'silk', 'song', 'silksong'))
# Remove trigrams with non-ASCII characters
trigrams_filtered <- trigrams_filtered %>%
filter(stri_enc_isascii(word1) & stri_enc_isascii(word2) & stri_enc_isascii(word3))
# 3. Count the frequency of trigrams
trigram_counts <- trigrams_filtered %>%
count(word1, word2, word3, sort = TRUE)
# Present the frequency of trigrams in a table
trigram_counts %>% head(25) %>% knitr::kable()
| word1 | word2 | word3 | n |
|---|---|---|---|
| drawing | hollow | knight | 96 |
| poorly | drawing | hollow | 95 |
| tba | official | website | 48 |
| official | | page | 38 |
| tba | announce | trailer | 29 |
| picture | link | trailer | 20 |
| link | trailer | link | 18 |
| steam | page | link | 15 |
| release | date | official | 13 |
| kinda | funny | games | 12 |
| date | official | website | 10 |
| summer | game | fest | 8 |
| tba | official | | 8 |
| tba | missed | spring | 7 |
| indie | games | coming | 6 |
| gameplay | trailer | coming | 5 |
| xbox | games | showcase | 5 |
| action | platformer | multiplayer | 4 |
| bethesda | games | showcase | 4 |
| discord | live | updates | 4 |
| nintendo | indie | world | 4 |
| xbox | bethesda | games | 4 |
| date | announce | trailer | 3 |
| elder | scrolls | online | 3 |
| gameplay | trailer | releasing | 3 |
I also ran a bi-gram analysis to see whether it yields more meaningful results. It follows the same process as the tri-gram analysis.
bigrams <- threads_clean %>%
mutate(title_text = str_replace_all(title_text, replace_reg, "")) %>%
select(title_text) %>%
unnest_tokens(output = bigram, input = title_text, token = "ngrams", n = 2)
bigrams_separated <- bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ")
bigrams_filtered <- bigrams_separated %>%
filter(!word1 %in% stop_words$word,
!word2 %in% stop_words$word) %>%
filter(str_detect(word1, "^[a-z]+$"),
str_detect(word2, "^[a-z]+$")) %>%
filter(!word1 %in% c('Silk', 'Song', 'silk', 'song', 'silksong'),
!word2 %in% c('Silk', 'Song', 'silk', 'song', 'silksong'))
bigrams_filtered <- bigrams_filtered %>%
filter(stri_enc_isascii(word1) & stri_enc_isascii(word2))
bigram_counts <- bigrams_filtered %>%
count(word1, word2, sort = TRUE)
bigram_counts %>% head(25) %>% knitr::kable()
| word1 | word2 | n |
|---|---|---|
| hollow | knight | 185 |
| release | date | 103 |
| official | website | 102 |
| drawing | hollow | 96 |
| poorly | drawing | 95 |
| announce | trailer | 82 |
| team | cherry | 57 |
| tba | official | 56 |
| official | | 44 |
| | page | 38 |
| tba | announce | 29 |
| nintendo | direct | 27 |
| nintendo | switch | 27 |
| steam | page | 24 |
| tba | missed | 21 |
| link | trailer | 20 |
| picture | link | 20 |
| release | window | 18 |
| trailer | link | 18 |
| page | link | 16 |
| date | official | 13 |
| game | awards | 13 |
| gameplay | trailer | 13 |
| funny | games | 12 |
| kinda | funny | 12 |
Both analyses produced meaningful results:
1. Both tables show Silksong’s strong association with its predecessor,
“Hollow Knight.” Players are likely discussing comparisons,
expectations, or continuity between the two games. The top tri-grams
(“poorly drawing hollow knight”) appear to come from a recurring daily
post series rather than from many independent discussions.
2. The developer, Team Cherry, is a central focus of discussion,
reflecting player interest in its development progress, communication,
and announcements.
3. Fans are clearly eager for updates and news (“release date”,
“release window”), which reflects the community’s emotional investment
in Silksong’s development and launch (background: the game was
announced 5 years ago, has been delayed since, and has not been
released yet).
4. The community discusses the game’s trailers extensively.
5. Many threads mention the game’s (or developer’s) social media,
websites, and pages on different gaming platforms.
6. Some discussion of in-game content is also evident.
Overall, the bi-gram and tri-gram analyses tell a consistent story: the
community keeps comparing “Silksong” with “Hollow Knight”, follows the
developer closely, and focuses on the game’s content and trailers.
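The effect of the stop-word and alphabetic filters on tri-grams can be illustrated with a small base-R sketch. The helper `make_ngrams()`, the tiny stop-word list, and the second title are hypothetical; the actual analysis above uses `unnest_tokens()` and the full `stop_words` table.

```r
# Hypothetical helper: split a text into lowercase tokens and build n-grams
make_ngrams <- function(text, n) {
  tokens <- strsplit(gsub("[^a-z0-9 ]", " ", tolower(text)), " +")[[1]]
  tokens <- tokens[tokens != ""]
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Toy stop-word list (an assumption; far smaller than tidytext's stop_words)
toy_stop_words <- c("of", "the", "a", "until")

# Two made-up titles in the style of the recurring post series
titles <- c("Day 716 of poorly drawing Hollow Knight",
            "Day 717 of poorly drawing Hollow Knight")

trigrams <- unlist(lapply(titles, make_ngrams, n = 3))

# Keep tri-grams whose words are all alphabetic and not stop words
keep <- vapply(strsplit(trigrams, " "), function(w) {
  all(grepl("^[a-z]+$", w)) && !any(w %in% toy_stop_words)
}, logical(1))

counts <- sort(table(trigrams[keep]), decreasing = TRUE)
print(counts)
```

Each title contributes five raw tri-grams, but only “poorly drawing hollow” and “drawing hollow knight” survive the filters, which is consistent with those two tri-grams topping the table with nearly identical counts.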
Step 6. Sentiment Analysis
Perform a sentiment analysis on your text data using a dictionary method that accommodates negations.
You are welcome to apply a deep learning-based model to enrich your analysis, but employing the dictionary method is imperative.
# Perform sentiment analysis using the sentimentr package
sentiment_scores <- sentiment_by(threads_clean$title_text)
# Add sentiment scores to the threads data frame
threads_clean$sentiment_scores <- sentiment_scores$ave_sentiment
threads_clean$word_count <- sentiment_scores$word_count
Step 7. Sentiment Analysis Outcomes
Display 10 sample texts alongside their sentiment scores and evaluate the credibility of the sentiment analysis outcomes.
I select the 5 most negative and the 5 most positive texts as the 10 samples and use their sentiment scores to evaluate the credibility of the sentiment analysis outcomes.
# Select the 10 samples
threads_sentiment <- threads_clean %>%
select(title_text, sentiment_scores)
threads_sentiment_sorted <- threads_sentiment %>%
arrange(sentiment_scores)
# 5 most negative texts
top_negative <- threads_sentiment_sorted %>% head(5)
# 5 most positive texts
top_positive <- threads_sentiment_sorted %>% tail(5)
# Combine the samples
sample_texts <- bind_rows(top_negative, top_positive)
# Display the sample texts and their sentiment scores
sample_texts %>% knitr::kable()
| title_text | sentiment_scores |
|---|---|
| Steam Scream: The Revenge Official Trailer. | -0.8164966 |
| Day 716 of poorly drawing hollow knight inconsistently because I’m lazy and dumb and going insane until silksong comes out. | -0.6653056 |
| Silksong was disappointing . | -0.5773503 |
| I will no longer remain silent, this bullying tactic to keep me quiet is laughable.. | -0.5163978 |
| Silksong fans have no hope. | -0.4919350 |
| To the shock of maybe about 2 people Silksong won the Most Anticipated Game Award on the Unity Awards thingy. | 0.5735393 |
| Game Awards Creator Geoff Keighley Looks Ahead to Decembers Ceremony: Theres Not a Frontrunner. | 0.6250000 |
| NEW SILKSONG TEASER ART. | 0.7000000 |
| Totally real new SilkSong character trust (bonus fan art of definitely new character). This is real totally, trust! | 1.0547937 |
| Helpful Silksong guide for New Fans!. | 1.1226828 |
# or in separate tables if needed
top_negative %>% select(title_text, sentiment_scores) %>% knitr::kable()
| title_text | sentiment_scores |
|---|---|
| Steam Scream: The Revenge Official Trailer. | -0.8164966 |
| Day 716 of poorly drawing hollow knight inconsistently because I’m lazy and dumb and going insane until silksong comes out. | -0.6653056 |
| Silksong was disappointing . | -0.5773503 |
| I will no longer remain silent, this bullying tactic to keep me quiet is laughable.. | -0.5163978 |
| Silksong fans have no hope. | -0.4919350 |
top_positive %>% select(title_text, sentiment_scores) %>% knitr::kable()
| | title_text | sentiment_scores |
|---|---|---|
| 598 | To the shock of maybe about 2 people Silksong won the Most Anticipated Game Award on the Unity Awards thingy. | 0.5735393 |
| 599 | Game Awards Creator Geoff Keighley Looks Ahead to Decembers Ceremony: Theres Not a Frontrunner. | 0.6250000 |
| 600 | NEW SILKSONG TEASER ART. | 0.7000000 |
| 601 | Totally real new SilkSong character trust (bonus fan art of definitely new character). This is real totally, trust! | 1.0547937 |
| 602 | Helpful Silksong guide for New Fans!. | 1.1226828 |
To evaluate the credibility of these outcomes, I manually checked whether each sentiment score matches the content of its text. Based on the tables above, the results are reasonably accurate: the scores generally align with the tone and emotional content of the text, particularly for explicitly negative statements such as “Silksong was disappointing.” There are exceptions, however. “Day 716 of poorly drawing hollow knight…” is a fairly neutral statement, yet it received a very negative score because of its wording (“poorly”, “lazy”, “dumb”, “insane”). Some of the positive samples contain sarcasm or exaggeration, so the analysis may rate them as more positive than intended. Finally, for factual statements such as “Game Awards Creator Geoff Keighley…,” the positive score may overstate the actual emotional tone. Overall, I think the sentiment analysis is credible for straightforward emotional expressions but may struggle with sarcasm and with neutral, factual statements.
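The following toy scorer illustrates why short, emotionally loaded titles receive large scores and how a negator flips polarity. It is a deliberately simplified sketch, not sentimentr's actual algorithm, which uses a much larger lexicon and also weights amplifiers and de-amplifiers. The lexicon weights, the negator list, and the function `toy_sentiment()` are assumptions made for illustration; the only shared idea is dividing the summed polarity by the square root of the word count.

```r
# Hypothetical polarity lexicon and negators (illustrative values only)
lexicon  <- c(disappointing = -1, hope = 1, helpful = 1, dumb = -1)
negators <- c("no", "not", "never")

toy_sentiment <- function(text) {
  tokens <- strsplit(tolower(gsub("[^A-Za-z ]", " ", text)), " +")[[1]]
  tokens <- tokens[tokens != ""]
  score <- 0
  for (i in seq_along(tokens)) {
    w <- lexicon[tokens[i]]
    if (is.na(w)) next                      # word is not in the lexicon
    # Flip the polarity when the preceding token is a negator
    if (i > 1 && tokens[i - 1] %in% negators) w <- -w
    score <- score + w
  }
  # Normalize by the square root of the number of words
  unname(score) / sqrt(length(tokens))
}

toy_sentiment("Silksong was disappointing .")  # about -0.577, i.e. -1 / sqrt(3)
toy_sentiment("Silksong fans have no hope.")   # negative: "no" flips "hope"
```

The first result is consistent with the -0.5773503 reported in the table for the same title. Because the normalization uses the word count, a single polar word in a three-word title dominates the score, while the same word in a long post would be diluted.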
Step 8. Insights and Conclusions
Discuss intriguing insights derived from the sentiment analysis, supporting your observations with at least two plots.
# Plot 1: Distribution of sentiment scores
# We can see if the majority of the posts are positive or negative.
ggplot(threads_clean, aes(x = sentiment_scores)) +
geom_histogram(binwidth = 0.1, fill = 'steelblue', color = 'black') +
xlab('Sentiment Score') +
ylab('Count') +
ggtitle('Distribution of Sentiment Scores')
# Plot 2: Average sentiment over time (by month)
# Shows how the average sentiment has changed over time since the game's announcement
# Parse the UTC date string into a date-time (explicit time zone avoids local-time shifts)
threads_clean$date <- as.POSIXct(threads_clean$date_utc, tz = "UTC")
threads_clean <- threads_clean %>%
mutate(month = floor_date(date, 'month'))
sentiment_by_month <- threads_clean %>%
group_by(month) %>%
summarise(avg_sentiment = mean(sentiment_scores, na.rm = TRUE))
ggplot(sentiment_by_month, aes(x = month, y = avg_sentiment)) +
geom_line(color = 'steelblue') +
xlab('Month') +
ylab('Average Sentiment') +
ggtitle('Average Sentiment Over Time')
# Plot 3: Sentiment by Day of Week
threads_clean <- threads_clean %>%
mutate(day_of_week = wday(date, label = TRUE))
sentiment_by_day <- threads_clean %>%
group_by(day_of_week) %>%
summarise(avg_sentiment = mean(sentiment_scores, na.rm = TRUE))
ggplot(sentiment_by_day, aes(x = day_of_week, y = avg_sentiment)) +
geom_bar(stat = 'identity', fill = 'steelblue') +
xlab('Day of Week') +
ylab('Average Sentiment') +
ggtitle('Average Sentiment by Day of Week')
From the sentiment analysis, we can derive several intriguing
insights:
Overall Neutral to (Slightly) Positive Sentiment: The first plot shows that most sentiment scores cluster around neutral (0), with a slight skew toward the positive side. Discussions about the game therefore mix excitement and criticism but lean positive overall. The spread of negative scores shows that criticism and frustration exist, but such posts are fewer than neutral and positive ones. Most of the community appears to maintain a hopeful or neutral stance, which is common in discussions of long-awaited games like Silksong. The negative outliers point to specific pain points, such as the extremely long wait for news, updates, and the release itself, or unmet expectations.
Fluctuating Sentiment Over Time: Silksong was announced in 2019, so the earliest threads date from that year. The second plot shows how average sentiment has fluctuated since then. It trended upward early on, dropped sharply around specific periods (e.g., early 2020 and mid-2021), and then recovered. These drops might correlate with missed announcements, delays, critical feedback from fans or media, or major gaming events. Sentiment improves in later years and remains fairly stable (mostly above 0 in recent years), which suggests renewed excitement, possibly due to teaser releases, gameplay showcases, award nominations, or published trailers. The graph reflects the ebb and flow of community engagement, with peaks likely tied to exciting announcements and troughs reflecting frustration over delays or a lack of updates.
Day of the Week Variation: Average sentiment during weekdays is slightly positive, while sentiment during weekends tends to hover around neutral to slightly positive. This may stem from fans casually discussing the game or engaging with content in their free time, while major updates or announcements tend to occur during the workweek.
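When reading the monthly and day-of-week averages, it helps to check how many threads each average is based on, since a mean computed from one or two threads can swing sharply. The sketch below shows one way to report the count next to the mean using base R. The data frame `toy` and its values are made up for illustration; only the column names mirror `threads_clean`.

```r
# Hypothetical data with the same column names as threads_clean
toy <- data.frame(
  date_utc = c("2019-02-14", "2019-02-20", "2019-03-01", "2022-06-12"),
  sentiment_scores = c(0.4, 0.2, -0.1, 0.6)
)

# Label each thread with its year-month
toy$month <- format(as.Date(toy$date_utc), "%Y-%m")

# Mean sentiment and number of threads per month
avg <- tapply(toy$sentiment_scores, toy$month, mean)
n   <- tapply(toy$sentiment_scores, toy$month, length)

monthly <- data.frame(
  month         = names(avg),
  avg_sentiment = as.numeric(avg),
  n_threads     = as.integer(n)
)
print(monthly)
```

Months with a very small `n_threads` could then be flagged or interpreted with caution before attributing a peak or trough to a specific event.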
This analysis provided valuable insight into how the gaming community has perceived Silksong since its announcement. Throughout the wait for the game’s release, sentiment has been generally positive and hopeful, punctuated by periods of disappointment likely tied to delays or a lack of communication on social media.