In this project I attempt a light analysis of public opinion towards gaming, using text data to derive meaningful insights.
Social media is a popular source for text mining nowadays, so I opted for a dataset of comments from the subreddit “r/gaming”, posted in May 2018. The dataset was created by Jonathan Hung and shared via Kaggle.
library(tidyverse)   # data wrangling (dplyr, tidyr) and plotting (ggplot2)
library(lubridate)   # working with dates and times
library(tidytext)    # tidy text mining: unnest_tokens(), get_sentiments(), bind_tf_idf()
library(tm)          # corpus creation and cleaning
library(qdap)        # text cleaning helpers such as bracketX()
library(slam)        # sparse matrix utilities for tm objects
library(wordcloud)   # wordcloud() and comparison.cloud()
library(viridisLite) # colour palettes (plasma)
library(textclean)   # replace_contraction(), replace_internet_slang()
library(topicmodels) # topic modelling
library(reshape2)    # acast() for the comparison cloud
load("cleaned_subreddit_gaming.Rda")  # loads the cleaned_subreddit_gaming dataframe
This is the codebook of the variables as given by Jonathan Hung:
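Since only a few of these variables are needed below, we can also take a quick look at them directly (a minimal sketch; only the comment id X1, the post title, and the comment body are used in this analysis):
# Peeking at the variables used later on
cleaned_subreddit_gaming %>%
  select(X1, title, body) %>%
  glimpse()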
Before the data can be used for further analysis, it needs to be cleaned. Most Reddit comments are unstructured and noisy: since this is informal communication, we can expect typos, slang, and unwanted content such as URLs, stopwords, and emojis.
The next steps attempt to create a corpus with as little noise as possible. This is done by (a short illustration of some of these helpers follows the list):

1. removing URLs, unicode, all text within brackets, stopwords, and punctuation,
2. replacing contractions and slang,
3. transforming all text to lower case,
4. stripping whitespace, and
5. stemming words.
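To make the cleaning steps concrete, here is what three of the helper functions do when applied to a made-up comment (a sketch; the example string is invented for illustration):
# A toy comment containing a contraction, a bracketed link, and some slang
x <- "I can't wait [spoiler](https://i.imgur.com/example.jpg) lol"
replace_contraction(x)     # expands contractions such as "can't"
bracketX(x)                # removes text inside brackets (square, round, curly, and angle by default)
replace_internet_slang(x)  # expands slang, e.g. "lol" becomes "laughing out loud"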
csg <- cleaned_subreddit_gaming %>%
#Selecting variables
select(X1, body) %>%
#Excluding rows where the comment was "[deleted]"
filter(body != "[deleted]")
# Renaming the columns to doc_id and text, as required by DataframeSource()
colnames(csg)[1] <- "doc_id"
colnames(csg)[2] <- "text"
# Creating a DataframeSource
df_source<-DataframeSource(csg)
# Converting df_source to a corpus
df_corpus<-VCorpus(df_source)
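A quick sanity check that the corpus was built as expected (a sketch):
# Number of documents in the corpus and the raw content of the first comment
length(df_corpus)
content(df_corpus[[1]])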
#Creating a function to "clean" the corpus
clean_corpus_f <- function(corpus) {
# Removing URLs
corpus <- tm_map(corpus, content_transformer(function(x) gsub("(f|ht)tp(s?)://\\S+", "", x, perl=T)))
# Removing unicode
corpus <- tm_map(corpus, content_transformer(function(x) gsub("\\<U[^\\>]*\\>", "", x)))
# Replacing contractions
corpus <- tm_map(corpus, content_transformer(replace_contraction))
# Removing all text within brackets
corpus <- tm_map(corpus, content_transformer(bracketX))
# Expanding the abbreviation "OP" to "Original Poster"
corpus <- tm_map(corpus, content_transformer(function(x) gsub("OP", "Original Poster", x)))
# Transforming to lower case
corpus <- tm_map(corpus, content_transformer(tolower))
# Removing stopwords
corpus <- tm_map(corpus, removeWords, c(stopwords("en")))
# Stripping whitespace
corpus <- tm_map(corpus, stripWhitespace)
#Stemming words
corpus <- tm_map(corpus, stemDocument)
# Removing additional domain-specific stopwords
corpus <- tm_map(corpus, removeWords, c("game", "play"))
# Replacing slang
corpus <- tm_map(corpus, content_transformer(replace_internet_slang))
# Removing punctuation
corpus <- tm_map(corpus, removePunctuation)
return(corpus)
}
# Applying the function to the df_corpus
clean_corp <- clean_corpus_f(df_corpus)
As we can see in the examples given below, most of the text has been cleaned, but not all of it.
# Printing out cleaned up comment with id: 118
clean_corp[[118]][1]
## $content
## [1] "didnappen"
# Printing out comment 118 in original form
csg$text[118]
## [1] "didn'appen"
# Printing out cleaned up comment with id: 3339
clean_corp[[3339]][1]
## $content
## [1] ""
# Printing out comment 3339 in original form
csg$text[3339]
## [1] "[Yeah](https://i.imgur.com/detNnJv.jpg)"
# Printing out cleaned up comment with id: 19566
clean_corp[[19566]][1]
## $content
## [1] "imag far realiti laughing out loud"
# Printing out comment 19566 in original form
csg$text[19566]
## [1] "That image is not far from reality lol"
# Printing out cleaned up comment with id: 114748
clean_corp[[114748]][1]
## $content
## [1] "hot damn asian wild everi time think start git good see someth like "
# Printing out comment 114748 in original form
csg$text[114748]
## [1] "Hot damn. Asians be wild. Every time I think I'm starting to \"git good\" i see something like this."
# Printing out cleaned up comment with id: 3001
clean_corp[[3001]][1]
## $content
## [1] "see u r saying mayb ill give shit sale"
# Printing out comment 3001 in original form
csg$text[3001]
## [1] "I See what u r saying.. Maybe ill give a shit when its on sale.."
We will use this “cleaned” corpus to create a wordcloud of the most common words used.
color_pal <- plasma(n = 10)
wordcloud(clean_corp, max.words = 100, random.order = FALSE, colors = color_pal)
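The frequencies behind the cloud can also be tabulated explicitly through a term-document matrix (a quick sketch; this is where the slam package helps with summing a large sparse matrix):
# Ten most frequent stems in the cleaned corpus
tdm <- TermDocumentMatrix(clean_corp)
head(sort(slam::row_sums(tdm), decreasing = TRUE), 10)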
The first step of the sentiment analysis is tokenization: splitting each comment into individual words.
tidy_csg <- cleaned_subreddit_gaming %>%
#Selecting variables
select(title, body) %>%
#Excluding rows where the comment was "[deleted]"
filter(body != "[deleted]") %>%
# Grouping by the titles
group_by(title) %>%
# Transforming the non-tidy text data to tidy text data
unnest_tokens(word, body) %>%
# Removing stopwords
anti_join(stop_words, by="word") %>%
ungroup()
Now that the dataframe is tokenized into individual words, we can begin our sentiment analysis. The “nrc” lexicon categorizes words as positive or negative, as well as by type of sentiment (anger, anticipation, disgust, fear, joy, sadness, surprise, and trust).
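Before joining, it can be useful to glance at the lexicon itself; note that this counts lexicon entries rather than comment words (a sketch; depending on the tidytext version, the first call to get_sentiments("nrc") may ask to download the lexicon via the textdata package):
# The ten NRC categories and the number of lexicon entries in each
get_sentiments("nrc") %>%
  count(sentiment)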
Let’s see how many words belong to each sentiment.
sentiments_csg_nrc <- tidy_csg %>%
# Implementing sentiment analysis with the NRC lexicon
inner_join(get_sentiments("nrc"), by="word")
sentiments_csg_nrc %>%
# Finding how many words each sentiment has
count(sentiment, sort = TRUE)
## # A tibble: 10 x 2
## sentiment n
## <chr> <int>
## 1 positive 188885
## 2 negative 160985
## 3 trust 105785
## 4 anticipation 100426
## 5 fear 88013
## 6 joy 83926
## 7 anger 80273
## 8 sadness 69898
## 9 disgust 56626
## 10 surprise 39559
A graphical representation of the top 5 words for each sentiment is also interesting to look at.
sentiments_csg_nrc %>%
# Counting by word and sentiment
count(word,sentiment) %>%
# Grouping by sentiment
group_by(sentiment) %>%
# Taking the top 5 words for each sentiment
top_n(5) %>%
ungroup() %>%
# Reordering word by n: converts word to a factor so the bars are plotted in order of frequency rather than alphabetically
mutate(word = reorder(word, n)) %>%
# Setting up the plot with aes()
ggplot(aes(word, n, fill=sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ sentiment, scales = "free") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
coord_flip()
Another approach would be to only look at the positive and negative sentiments. That can be done with the use of the “bing” lexicon. We can see the most common words for both positive and negative sentiments as well as the contribution of some words to each sentiment.
#Bing Lexicon
sentiments_csg_bing <- tidy_csg %>%
# Implementing sentiment analysis with the bing lexicon
inner_join(get_sentiments("bing"), by="word")
# Wordcloud of positive vs negative words
sentiments_csg_bing %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(max.words = 100)
# Contribution to each sentiment
bcsg <- sentiments_csg_bing %>%
count(word, sentiment)%>%
filter(n > 1500) %>%
mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
mutate(word = reorder(word, n))
ggplot(bcsg, mapping = aes(x = word, y = n, fill = sentiment)) +
geom_bar(alpha = 0.8, stat = "identity") +
labs(y = "Contribution to sentiment", x = NULL) +
coord_flip()
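Beyond these word-level views, the bing lexicon also allows a very rough overall tally of positive versus negative words across all comments (a sketch, not a formal sentiment score):
# Overall counts of positive and negative words and their difference
sentiments_csg_bing %>%
  count(sentiment) %>%
  spread(sentiment, n) %>%
  mutate(net = positive - negative)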
Term Frequency - Inverse Document Frequency (tf-idf) is a technique used to weight the frequency of words appearing in our comments. The idea of tf-idf is to find the words that are important for the content of each document by decreasing the weight of commonly used words and increasing the weight of words that appear rarely. Let's look at some of the words with the highest tf-idf, as well as some of the most common words.
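To make the weighting concrete, the quantity produced by bind_tf_idf() (used just below) can be reproduced by hand: tf is a word's share of all words under a given post title, and idf is the natural logarithm of the number of titles divided by the number of titles containing the word. A sketch of this manual computation:
# Manual tf-idf, which should match the tf_idf column from bind_tf_idf(word, title, n)
n_titles <- n_distinct(tidy_csg$title)
manual_tf_idf <- tidy_csg %>%
  count(title, word) %>%
  group_by(title) %>%
  mutate(tf = n / sum(n)) %>%                          # term frequency within a post
  group_by(word) %>%
  mutate(idf = log(n_titles / n_distinct(title))) %>%  # rarer across posts -> higher weight
  ungroup() %>%
  mutate(tf_idf = tf * idf)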
#Creating tf-idf
csg_tf_idf<- tidy_csg %>%
count(title, word, sort=TRUE)%>%
bind_tf_idf(word, title, n)
#high tf_idf
csg_tf_idf%>%
arrange(desc(tf_idf))
## # A tibble: 1,178,445 x 6
## title word n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 <U+0627><U+0644><U+0633><U+0627><U+062D><U+0631> <U+0627><U+0644><U+0627><U+0645><U+0631><U+064A><U+0643><U+064A> <U+0646><U+064A><U+062A> <U+0648><U+0648><U+0644><U+0641> <U+0642><U+0635><U+0647> ni~ nightwolf 1 1 10.0 10.0
## 2 <U+0627><U+0644><U+0645><U+0644><U+0643><U+0647> <U+0633><U+0646><U+062F><U+064A><U+0644> , <U+0642><U+0635><U+0647> sindel <U+0641><U+064A> <U+0645><U+0648>~ <U+0633><U+0646><U+062F><U+064A><U+0644> 1 1 10.0 10.0
## 3 100% headshot rate.. ripacog 1 1 10.0 10.0
## 4 Anybody got a modded GTA V acco~ bwahahahahaha~ 1 1 10.0 10.0
## 5 Anyone know any subs that post ~ videogameartw~ 1 1 10.0 10.0
## 6 Anyone want to play fortnite ad~ fergusonyanne~ 1 1 10.0 10.0
## 7 Assetto Corsa vs Real Life | Re~ rb5 1 1 10.0 10.0
## 8 Banner and logo airtasker 1 1 10.0 10.0
## 9 Battlefield V trailer photoshop~ ouef 1 1 10.0 10.0
## 10 Can't Stop Dancin' [Serious] ceace 1 1 10.0 10.0
## # ... with 1,178,435 more rows
#high n
csg_tf_idf%>%
arrange(desc(n))
## # A tibble: 1,178,445 x 6
## title word n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 Fallout 76 Official Teaser Trailer fallo~ 933 0.0531 3.10 0.164
## 2 <U+FFFD><U+FFFD> fallo~ 612 0.0972 3.10 0.301
## 3 Fallout 76 Official Teaser Trailer game 604 0.0344 0.921 0.0317
## 4 Unsettling Glitch nope 504 0.706 4.31 3.05
## 5 R.I.P TotalBiscuit rip 494 0.0203 3.99 0.0809
## 6 R.I.P TotalBiscuit rest 491 0.0201 3.63 0.0732
## 7 Heres a Wheres Waldo I think youll ~ waldo 417 0.0327 7.46 0.244
## 8 R.I.P TotalBiscuit john 415 0.0170 4.92 0.0838
## 9 Heres a Wheres Waldo I think youll ~ found 402 0.0315 3.15 0.0992
## 10 R.I.P TotalBiscuit tb 397 0.0163 6.21 0.101
## # ... with 1,178,435 more rows