About this project.

In this project I attempt a light exploratory analysis of public opinion towards gaming, using text data to derive meaningful insights.
Social media is a popular source for text mining nowadays, so I opted for a dataset of posts and comments from the subreddit r/gaming, collected in May 2018. The dataset was created by Jonathan Hung and shared on Kaggle.

Loading Data & packages.

# Data wrangling, dates and tidy text tools
library(tidyverse)
library(lubridate)
library(tidytext)
# Text mining infrastructure, text cleaning and sparse matrices
library(tm)
library(qdap)
library(slam)
# Wordclouds and colour palettes
library(wordcloud)
library(viridisLite)
# Contraction/slang replacement, topic models and data reshaping
library(textclean)
library(topicmodels)
library(reshape2)
# Loading the dataset (saved locally as an .Rda file)
load("cleaned_subreddit_gaming.Rda")

This is the codebook of the variables as given by Jonathan Hung (a quick structural check of the loaded data frame follows the list):

  1. Index
  2. body: Comment made on a post
  3. score_hidden: Is the score visible or is it “[score hidden]”?
  4. created_utc_x: When was the comment posted
  5. score_x: What is the score of the comment
  6. controversiality: Is the comment both heavily upvoted AND downvoted
  7. gilded_x: How many times has the comment been gilded
  8. is_moderator_comments: Is the comment posted by a moderator
  9. created_utc_y: When was the OP posted
  10. domain: What is the domain of the OP? e.g. is the OP on reddit or imgur
  11. num_comments: How many comments were made on the OP
  12. score_y: What is the score of the OP
  13. title: What is the title of the OP
  14. gilded_y: How many times was the OP gilded
  15. stickied: Is the OP stickied
  16. over_18: Is the OP rated nsfw
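
To check that the loaded data frame matches this codebook, a quick structural check can be run. This is only a sketch; it assumes the object is named cleaned_subreddit_gaming, as in the load() call above.

# Quick look at the columns and their types (should match the codebook)
glimpse(cleaned_subreddit_gaming)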

Cleaning the Data.

Before the data can be used for further analysis, they need to be cleaned. Most of the Reddit comments are likely to be unstructured and noisy: since this is informal communication, we can expect plenty of typos, slang, and unwanted content such as URLs, stopwords, emojis, etc.
The next steps attempt to create a corpus with as little noise as possible. That is done by:

  1. removing URLs, unicode escapes, all text within brackets, stopwords, and punctuation,
  2. replacing contractions and slang,
  3. transforming all text to lower case,
  4. stripping whitespace, and
  5. stemming words.

csg <- cleaned_subreddit_gaming %>%
  # Selecting the index and comment text
  select(X1, body) %>%
  # Excluding rows where the comment was "[deleted]"
  filter(body != "[deleted]")

# Renaming columns to the names DataframeSource requires (doc_id, text)
colnames(csg) <- c("doc_id", "text")

# Creating a DataframeSource
df_source <- DataframeSource(csg)

# Converting df_source to a corpus
df_corpus <- VCorpus(df_source)

#Creating a function to "clean" the corpus
clean_corpus_f <- function(corpus) {
  # Removing URLs
  corpus <- tm_map(corpus, content_transformer(function(x) gsub("(f|ht)tp(s?)://\\S+", "", x, perl=T)))
  # Removing unicode escape tokens such as <U+xxxx>
  corpus <- tm_map(corpus, content_transformer(function(x) gsub("<U\\+[^>]+>", "", x)))
  # Replacing contractions
  corpus <- tm_map(corpus, content_transformer(replace_contraction))
  # Removing all text within brackets
  corpus <- tm_map(corpus, content_transformer(bracketX))
  # Expanding the abbreviation "OP" to "Original Poster"
  corpus <- tm_map(corpus, content_transformer(function(x) gsub("OP", "Original Poster", x)))
  # Transforming to lower case
  corpus <- tm_map(corpus, content_transformer(tolower))
  # Removing stopwords
  corpus <- tm_map(corpus, removeWords, c(stopwords("en")))
  # Stripping whitespace
  corpus <- tm_map(corpus, stripWhitespace)
  #Stemming words
  corpus <- tm_map(corpus, stemDocument)
  # Removing additional domain-specific stopwords ("game", "play")
  corpus <- tm_map(corpus, removeWords, c("game", "play"))
  # Replacing slang
  corpus <- tm_map(corpus, content_transformer(replace_internet_slang))
  # Removing punctuation
  corpus <- tm_map(corpus, removePunctuation)
  return(corpus)
}

# Applying the function to the df_corpus
clean_corp <- clean_corpus_f(df_corpus)

Examples of the “cleaned” comments.

As we can see in the examples given below, most of the text has been cleaned, but not all of it.

# Printing out cleaned up comment with id: 118
clean_corp[[118]][1]
## $content
## [1] "didnappen"
# Printing out comment 118 in original form
csg$text[118]
## [1] "didn'appen"
# Printing out cleaned up comment with id: 3339 
clean_corp[[3339]][1]
## $content
## [1] ""
# Printing out comment 3339 in original form
csg$text[3339]
## [1] "[Yeah](https://i.imgur.com/detNnJv.jpg)"
# Printing out cleaned up comment with id: 19566 
clean_corp[[19566]][1]
## $content
## [1] "imag far realiti laughing out loud"
# Printing out comment 19566 in original form
csg$text[19566]
## [1] "That image is not far from reality lol"
# Printing out cleaned up comment with id: 114748 
clean_corp[[114748]][1]
## $content
## [1] "hot damn asian wild everi time think start git good see someth like "
# Printing out comment 114748 in original form
csg$text[114748]
## [1] "Hot damn.  Asians be wild.  Every time I think I'm starting to \"git good\" i see something like this."
# Printing out cleaned up comment with id: 3001 
clean_corp[[3001]][1]
## $content
## [1] "see u r saying mayb ill give shit sale"
# Printing out comment 3001 in original form
csg$text[3001]
## [1] "I See what u r saying.. Maybe ill give a shit when its on sale.."

Wordcloud of r/gaming posts and comments.

We will use this “cleaned” corpus to create a wordcloud of the most common words used.

color_pal <- plasma(n = 10)
wordcloud(clean_corp, max.words = 100, random.order = FALSE, colors = color_pal)
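
The wordcloud() call above computes the term frequencies internally when it is given a corpus. If the frequencies themselves are needed (for example to inspect or reuse them), they can be computed explicitly; the sketch below is one way to do it with TermDocumentMatrix() and row_sums() from the slam package, which keeps the matrix in sparse form.

# Computing term frequencies explicitly from the cleaned corpus (sketch)
tdm <- TermDocumentMatrix(clean_corp)
term_freq <- sort(slam::row_sums(tdm), decreasing = TRUE)

# The same wordcloud, built from the explicit frequencies
wordcloud(names(term_freq), term_freq, max.words = 100,
          random.order = FALSE, colors = color_pal)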

Sentiment Analysis of r/gaming posts and comments.

The first step is tokenization, i.e. splitting each comment body into individual words.

tidy_csg <- cleaned_subreddit_gaming %>%
  #Selecting variables 
  select(title, body) %>%
  #Excluding rows where the comment was "[deleted]"
  filter(body != "[deleted]") %>%
  # Grouping by the titles 
  group_by(title) %>%
  # Transforming the non-tidy text data to tidy text data
  unnest_tokens(word, body) %>%
  # Removing stopwords
  anti_join(stop_words, by="word") %>%
  ungroup()
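
Before applying any sentiment lexicon, a quick look at the most frequent remaining tokens is a useful sanity check on the tokenization and stop-word removal; this is just a sketch using the tidy_csg data frame built above.

# Most frequent tokens after stop-word removal (sketch)
tidy_csg %>%
  count(word, sort = TRUE) %>%
  head(10)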

NRC lexicon.

Now that the data frame is tokenized into individual words, we can begin our sentiment analysis. The “nrc” lexicon categorizes words as positive or negative, as well as by type of sentiment (anger, anticipation, disgust, fear, joy, sadness, surprise, and trust).
Let’s see how many words belong to each sentiment.

sentiments_csg_nrc <- tidy_csg %>% 
    # Implementing sentiment analysis with the NRC lexicon
    inner_join(get_sentiments("nrc"), by="word")

sentiments_csg_nrc %>%
    # Finding how many words each sentiment has
    count(sentiment, sort = TRUE)
## # A tibble: 10 x 2
##    sentiment         n
##    <chr>         <int>
##  1 positive     188885
##  2 negative     160985
##  3 trust        105785
##  4 anticipation 100426
##  5 fear          88013
##  6 joy           83926
##  7 anger         80273
##  8 sadness       69898
##  9 disgust       56626
## 10 surprise      39559
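
For a rough sense of scale, these counts can also be expressed as shares of all NRC-matched words; this is a minimal sketch using the sentiments_csg_nrc object created above.

# Share of each sentiment among all matched words (sketch)
sentiments_csg_nrc %>%
  count(sentiment, sort = TRUE) %>%
  mutate(share = n / sum(n))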

A graphical representation of the top 5 words for each sentiment is also interesting to look at.

sentiments_csg_nrc %>%
    # Counting by word and sentiment
    count(word,sentiment) %>%
    # Grouping by sentiment
    group_by(sentiment) %>%
    # Taking the top 5 words for each sentiment
    top_n(5) %>%
    ungroup() %>%
    # Reordering word by n so that words are plotted in order of frequency rather than alphabetically
    mutate(word = reorder(word, n)) %>%
    # Setting up the plot with aes()
    ggplot(aes(word, n, fill=sentiment)) +
    geom_col(show.legend = FALSE) +
    facet_wrap(~ sentiment, scales = "free") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    coord_flip()

Bing lexicon.

Another approach is to look only at positive and negative sentiment, which can be done with the “bing” lexicon. We can see the most common words for each sentiment, as well as how much individual words contribute to each one.

#Bing Lexicon
sentiments_csg_bing <- tidy_csg %>% 
    # Implementing sentiment analysis with the bing lexicon
    inner_join(get_sentiments("bing"), by="word")
    
# Wordcloud of positive vs negative words
sentiments_csg_bing %>%
    count(word, sentiment, sort = TRUE) %>%
    acast(word ~ sentiment, value.var = "n", fill = 0) %>% 
    comparison.cloud(max.words = 100)

# Contribution to each sentiment
bcsg <- sentiments_csg_bing %>%
    count(word, sentiment)%>%
    filter(n > 1500) %>%
    mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
    mutate(word = reorder(word, n))

ggplot(bcsg, mapping = aes(x = word, y = n, fill = sentiment)) +
    geom_bar(alpha = 0.8, stat = "identity") +
    labs(y = "Contribution to sentiment", x = NULL) +
    coord_flip()

Exploring Term Frequency - Inverse Document Frequency (tf-idf).

Term Frequency - Inverse Document Frequency (tf-idf) is a technique used to weight the frequency of words appearing in our comments. The idea of tf-idf is to find the words that are important to the content of each document (here, each post title together with its comments forms one document) by decreasing the weight of commonly used words and increasing the weight of words that are used rarely. Let's look at some of the words with high tf-idf and some of the most common words.

# Creating tf-idf
csg_tf_idf <- tidy_csg %>%
  count(title, word, sort = TRUE) %>%
  bind_tf_idf(word, title, n)
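
For intuition, bind_tf_idf() computes tf as a word's share of all words in its document and idf as the natural log of the number of documents divided by the number of documents containing the word. The sketch below recomputes these by hand as a sanity check; it only assumes the csg_tf_idf object created above, and uses "fallout" (which appears in the output further down) as an example term.

# Recomputing tf-idf manually to make the weighting explicit (sketch)
n_titles <- n_distinct(csg_tf_idf$title)

manual_check <- csg_tf_idf %>%
  group_by(title) %>%
  mutate(tf_manual = n / sum(n)) %>%                          # term frequency within a document
  group_by(word) %>%
  mutate(idf_manual = log(n_titles / n_distinct(title))) %>%  # inverse document frequency
  ungroup() %>%
  mutate(tf_idf_manual = tf_manual * idf_manual)

# Values from bind_tf_idf() and the manual computation should agree
manual_check %>%
  filter(word == "fallout") %>%   # example term; any word works
  select(title, word, tf, tf_manual, idf, idf_manual, tf_idf, tf_idf_manual)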

# Words with the highest tf-idf
csg_tf_idf %>%
  arrange(desc(tf_idf))
## # A tibble: 1,178,445 x 6
##    title                            word               n    tf   idf tf_idf
##    <chr>                            <chr>          <int> <dbl> <dbl>  <dbl>
##  1 <U+0627><U+0644><U+0633><U+0627><U+062D><U+0631> <U+0627><U+0644><U+0627><U+0645><U+0631><U+064A><U+0643><U+064A> <U+0646><U+064A><U+062A> <U+0648><U+0648><U+0644><U+0641> <U+0642><U+0635><U+0647> ni~ nightwolf          1     1  10.0   10.0
##  2 <U+0627><U+0644><U+0645><U+0644><U+0643><U+0647> <U+0633><U+0646><U+062F><U+064A><U+0644> , <U+0642><U+0635><U+0647> sindel <U+0641><U+064A> <U+0645><U+0648>~ <U+0633><U+0646><U+062F><U+064A><U+0644>              1     1  10.0   10.0
##  3 100% headshot rate..             ripacog            1     1  10.0   10.0
##  4 Anybody got a modded GTA V acco~ bwahahahahaha~     1     1  10.0   10.0
##  5 Anyone know any subs that post ~ videogameartw~     1     1  10.0   10.0
##  6 Anyone want to play fortnite ad~ fergusonyanne~     1     1  10.0   10.0
##  7 Assetto Corsa vs Real Life | Re~ rb5                1     1  10.0   10.0
##  8 Banner and logo                  airtasker          1     1  10.0   10.0
##  9 Battlefield V trailer photoshop~ ouef               1     1  10.0   10.0
## 10 Can't Stop Dancin' [Serious]     ceace              1     1  10.0   10.0
## # ... with 1,178,435 more rows
# Most common words (highest n)
csg_tf_idf %>%
  arrange(desc(n))
## # A tibble: 1,178,445 x 6
##    title                                   word       n     tf   idf tf_idf
##    <chr>                                   <chr>  <int>  <dbl> <dbl>  <dbl>
##  1 Fallout 76 – Official Teaser Trailer    fallo~   933 0.0531 3.10  0.164 
##  2 <U+FFFD><U+FFFD>                                      fallo~   612 0.0972 3.10  0.301 
##  3 Fallout 76 – Official Teaser Trailer    game     604 0.0344 0.921 0.0317
##  4 Unsettling Glitch                       nope     504 0.706  4.31  3.05  
##  5 R.I.P TotalBiscuit                      rip      494 0.0203 3.99  0.0809
##  6 R.I.P TotalBiscuit                      rest     491 0.0201 3.63  0.0732
##  7 Here’s a Where’s Waldo I think you’ll ~ waldo    417 0.0327 7.46  0.244 
##  8 R.I.P TotalBiscuit                      john     415 0.0170 4.92  0.0838
##  9 Here’s a Where’s Waldo I think you’ll ~ found    402 0.0315 3.15  0.0992
## 10 R.I.P TotalBiscuit                      tb       397 0.0163 6.21  0.101 
## # ... with 1,178,435 more rows