Background

I was curious about the Internet’s attitudes toward the newest Nexus specialist: Probius! I scraped reddit comments and tweets posted from the day Probius was announced (March 3rd) to the present (March 21st). Let’s get started!

Setup

First, we load the required packages.

suppressPackageStartupMessages({
  library(RedditExtractoR)
  library(twitteR)
  library(tm)
  library(ggplot2)
  library(dplyr)
  library(wordcloud)
  library(RWeka)
  library(topicmodels)
  library(sentiment)
  library(data.table)
})

And then we load the tweets and reddit comments. Note that the text was scraped periodically (for instance, Twitter’s API limits you to tweets within the past 7 days).

load("probius_twitter.Rda")
load("probius_reddit.Rda")

Twitter

We’ll start by removing numbers and URLs from tweets.

tweets_corpus<-Corpus(VectorSource(probius_twitter$text))

removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
tweets_corpus<-tm_map(tweets_corpus, content_transformer(removeURL))
tweets_corpus<-tm_map(tweets_corpus, removeNumbers)

corpusCopy<-tweets_corpus #keep this for later

corpusCopy<-tm_map(corpusCopy, PlainTextDocument)

Next, we remove every character that is not a letter, digit, whitespace, or apostrophe (apostrophes are kept because some stopwords contain them) and convert the words to lowercase.

allButApost <- function(x) gsub("[^[:alnum:][:space:]']", "", x)
tweets_corpus<-tm_map(tweets_corpus, content_transformer(allButApost))
tweets_corpus <- tm_map(tweets_corpus, tolower)
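
To see what these cleaning steps do to a single tweet, here is a small base-R sketch. The sample tweet and its URL are made up, and the digit-stripping gsub is only a stand-in I am assuming behaves like removeNumbers; removeURL and allButApost are copied from above.

```r
# Same helper definitions as in the post
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
allButApost <- function(x) gsub("[^[:alnum:][:space:]']", "", x)

# Hypothetical tweet, invented for illustration
sample_tweet <- "Can't wait for #Probius!! https://example.com/trailer 2017"

no_url <- removeURL(sample_tweet)               # drop the link
no_digits <- gsub("[[:digit:]]+", "", no_url)   # stand-in for removeNumbers
no_punct <- allButApost(no_digits)              # keep letters, spaces, apostrophes
cleaned_tweet <- trimws(tolower(no_punct))      # lowercase and trim the ends

print(cleaned_tweet)
# "can't wait for probius"
```

The apostrophe in "can't" survives, which is what lets the stopword list match it in the next step.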

For the next step, we remove all stopwords within the SMART dictionary (this is more exhaustive than the tm package default “english”). I also created my own “Heroes of the Storm”-centric stopwords to prevent commonly used HotS speak from disproportionately affecting the results.

tweets_corpus<-tm_map(tweets_corpus, removeWords, stopwords("SMART"))

#Heroes Stop Words
heroes_stop_words<-c("probius", "heroes", "storm", "video", "youtube", "hero", "blizzheroes", 
                     "trailer", "character", "spotlight", "ptr", "newest", "heroesofthestorm", "meet", "hgc", "play",
                     "hots", "nexus", "blizzard", "blizzhero", "youtub", "patch", "joins", "probe", "protoss", "starcraft",
                     "development")

tweets_corpus<-tm_map(tweets_corpus, removeWords, heroes_stop_words)

tweets_corpus<-tm_map(tweets_corpus, removePunctuation)
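
As a rough illustration of whole-word stopword removal, the base-R sketch below strips a few custom words from a made-up sentence. The helper remove_words and the three-word list are hypothetical; this approximates the idea behind removeWords and is not its actual implementation.

```r
# Hypothetical helper: delete whole-word matches of any word in `words`
remove_words <- function(x, words) {
  pattern <- sprintf("\\b(%s)\\b", paste(words, collapse = "|"))
  gsub(pattern, "", x, perl = TRUE)
}

# A short stand-in for heroes_stop_words
toy_stop_words <- c("probius", "heroes", "storm")

raw_text <- "probius looks so fun in heroes of the storm"
stripped <- remove_words(raw_text, toy_stop_words)

# Removing words leaves gaps, so collapse repeated whitespace
stripped <- gsub("[[:space:]]+", " ", trimws(stripped))

print(stripped)
# "looks so fun in of the"
```

The word boundaries (\\b) matter: without them, removing "hero" would also chop the front off "heroes".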

Finally, we stem the tweets (for example, running and runs both become run).

tweets_corpus<-tm_map(tweets_corpus, stemDocument)
tweets_corpus<-tm_map(tweets_corpus, PlainTextDocument)

Term Document Matrix

We then convert the tweets to a term document matrix to determine the number of tweets each term is used in, as well as the total frequency.

tdm<-TermDocumentMatrix(tweets_corpus, control = list(wordLengths = c(1, Inf)))

term_freq<-rowSums(as.matrix(tdm))
term_freq<-subset(term_freq, term_freq>=20)
tdm_df<-data.frame(term=names(term_freq), freq=term_freq)
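
For intuition, here is a minimal base-R sketch of what a term document matrix holds, built by hand from three made-up documents. The names docs, tdm_toy, total_freq and doc_freq are mine; the real matrix above comes from tm and is stored sparsely.

```r
# Three tiny invented "tweets", already cleaned
docs <- c("fun cute probe", "cute pylons", "fun fun pylons")

tokens <- strsplit(docs, " ", fixed = TRUE)
terms <- sort(unique(unlist(tokens)))

# One column per document, one row per term, cells are counts
tdm_toy <- vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = terms))),
  integer(length(terms))
)
dimnames(tdm_toy) <- list(terms, paste0("doc", seq_along(docs)))

total_freq <- rowSums(tdm_toy)      # how often each term occurs overall
doc_freq <- rowSums(tdm_toy > 0)    # how many documents contain each term

print(tdm_toy)
print(total_freq)
print(doc_freq)
```

Here "fun" has a total frequency of 3 but appears in only 2 documents, which is the distinction between the two measures mentioned above.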

We then sort from most-frequent to least-frequent and plot.

tdm_df<- arrange(tdm_df, desc(freq))

ggplot(data=tdm_df[1:10,]) +
  geom_bar(aes(x=reorder(term, freq), y=freq), stat="identity", fill="light blue") +
  xlab("Terms") +
  ylab("Count") +
  coord_flip() +
  theme(axis.text=element_text(size=10)) +
  ggtitle("Words Describing Probius From Twitter") +
  theme(plot.title = element_text(hjust = 0.5))

Twitter loves Probius, using words such as fun, cute, and excited. Many people also appear to be looking forward to receiving their ranked stag mount.

Bigrams

We now examine the top word pairs.

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

bigram = TermDocumentMatrix(tweets_corpus, control = list(tokenize = BigramTokenizer))

bi_freq = sort(rowSums(as.matrix(bigram)),decreasing = TRUE)
bi_freq.df = data.frame(word=names(bi_freq), freq=bi_freq)

ggplot(data=bi_freq.df[1:10,]) +
  geom_bar(aes(x=reorder(word, freq), y=freq), stat="identity", fill="light blue") +
  xlab("Terms") +
  ylab("Count") +
  coord_flip() +
  theme(axis.text=element_text(size=10)) +
  ggtitle("Word Pairs Describing Probius From Twitter") +
  theme(plot.title = element_text(hjust = 0.5))

Most of the pairs express the desire for the (elemental) stag mount, in addition to excitement for Probius. Note that the frequent use of mount is likely due to users referring to the Probius patch.
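
If you are wondering what an n-gram tokenizer actually produces, the base-R sketch below slides a window of n words across a string. The function make_ngrams is a hypothetical stand-in written for illustration; it is not how RWeka's NGramTokenizer is implemented.

```r
# Hypothetical helper: return every run of n consecutive words in `text`
make_ngrams <- function(text, n) {
  words <- strsplit(text, "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(
    starts,
    function(i) paste(words[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

toy_bigrams <- make_ngrams("epic elemental stag mount", 2)
toy_trigrams <- make_ngrams("epic elemental stag mount", 3)

print(toy_bigrams)
# "epic elemental" "elemental stag" "stag mount"
print(toy_trigrams)
# "epic elemental stag" "elemental stag mount"
```

Because the windows overlap, one popular phrase shows up as several high-count pairs or triplets at once.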

Trigrams

Now we move on to word triplets.

TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

trigram = TermDocumentMatrix(tweets_corpus, control = list(tokenize = TrigramTokenizer))

tri_freq = sort(rowSums(as.matrix(trigram)),decreasing = TRUE)
tri_freq.df = data.frame(word=names(tri_freq), freq=tri_freq)
head(tri_freq.df, 20)

##                                                    word freq
## elemental stag mount               elemental stag mount   32
## epic elemental stag                 epic elemental stag   32
## announcement super fun           announcement super fun   18
## excited hands announcement   excited hands announcement   18
## hands announcement super       hands announcement super   18
## super fun sounds                       super fun sounds   18
## fun sounds love                         fun sounds love   17
## construct additional pylons construct additional pylons   16
## news heroesofstorm read         news heroesofstorm read   10
## small deadly latest                 small deadly latest    7
## breakdown game mount               breakdown game mount    6
## enters ranked season               enters ranked season    6
## free rotation slots                 free rotation slots    6
## increases free rotation         increases free rotation    6
## ranked season kicks                 ranked season kicks    6
## roster arrives week                 roster arrives week    6
## rotation slots lands               rotation slots lands    6
## animations fishing bobbers   animations fishing bobbers    5
## bobbers tweets dlc                   bobbers tweets dlc    5
## builds pylons ya                       builds pylons ya    5

ggplot(data=tri_freq.df[1:10,]) +
  geom_bar(aes(x=reorder(word, freq), y=freq), stat="identity", fill="light blue") +
  xlab("Terms") +
  ylab("Count") +
  coord_flip() +
  theme(axis.text=element_text(size=10)) +
  ggtitle("Word Triplets Describing Probius From Twitter") +
  theme(plot.title = element_text(hjust = 0.5))

Elemental stag mentions are here, too. So is eagerness to play Probius. And of course, it wouldn’t be a Protoss hero without the signature Construct additional pylons.

Topic Modelling

Here, we try to measure the different topics Twitter is discussing using topic modelling.

First, we get a Document Term Matrix. Then we apply LDA (latent Dirichlet allocation) to extract the top 3 terms in every topic.

dtm<-as.DocumentTermMatrix(tdm)
rowTotals <- apply(dtm , 1, sum) 
dtm_corrected   <- dtm[rowTotals> 0, ] 
lda<-LDA(dtm_corrected, control=list(seed=123) ,k=2) 
term<-terms(lda, 3)
(term<-apply(term, MARGIN=2, paste, collapse=", "))

##                 Topic 1                 Topic 2 
##      "game, news, stag" "excited, pylons, wait"
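
The rowTotals filter is there because, after cleaning, some tweets have no terms left, and LDA() in topicmodels rejects a matrix that contains all-zero rows. The base-R sketch below shows the same filter on a small made-up dense matrix; dtm_toy and its counts are invented for illustration.

```r
# Toy document-term matrix: rows are documents, columns are terms
dtm_toy <- matrix(
  c(1, 0, 2,
    0, 0, 0,   # doc2 lost all of its words during cleaning
    0, 3, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:3), c("game", "news", "stag"))
)

row_totals <- rowSums(dtm_toy)

# Keep only documents with at least one remaining term
dtm_toy_corrected <- dtm_toy[row_totals > 0, , drop = FALSE]

print(rownames(dtm_toy_corrected))
# "doc1" "doc3"
```

One side effect worth remembering: the filtered matrix no longer lines up row for row with the original data frame of tweets.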

Note that there is some overlap between the topics. I found that having 2 topics gave the most mutually exclusive, collectively exhaustive results.

Here are the best topic names I could come up with based on the terms:

Topic 1: Patch Reactions

Topic 2: Probius Reactions

Sentiment

In this section, we examine the positivity/negativity Twitter expresses toward Probius.

First we give each tweet a polarity (positive/negative/neutral).

new_twitter_df<-data.frame(text = sapply(corpusCopy, paste, collapse = " "), stringsAsFactors = FALSE)
sentiments<-sentiment(new_twitter_df$text)
table(sentiments$polarity)

## 
## negative  neutral positive 
##      114     1422      468

Then we score that polarity (1 for positive, 0 for neutral, -1 for negative).

sentiments$score<-0
sentiments$score[sentiments$polarity=="positive"]<- 1
sentiments$score[sentiments$polarity=="negative"]<- -1
sentiments$date <- as.IDate(probius_twitter$created)
result <- aggregate(score ~ date, data = sentiments, sum)
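
Here is the scoring and daily aggregation on a small made-up data frame, in case the steps above are easier to follow with visible numbers. The data frame toy and the lookup vector score_map are hypothetical; the lookup is just a compact alternative to the three assignments above.

```r
# Invented polarity labels for five tweets over two days
toy <- data.frame(
  polarity = c("positive", "neutral", "negative", "positive", "positive"),
  date = as.Date(c("2017-03-03", "2017-03-03", "2017-03-03",
                   "2017-03-04", "2017-03-04")),
  stringsAsFactors = FALSE
)

# Map each label to its score via a named lookup vector
score_map <- c(positive = 1, neutral = 0, negative = -1)
toy$score <- unname(score_map[toy$polarity])

# Net sentiment per day: positives and negatives cancel out
daily <- aggregate(score ~ date, data = toy, sum)

print(daily)
# March 3rd nets to 0 (1 + 0 - 1), March 4th nets to 2
```

A daily total of 0 can mean either "all neutral" or "evenly split", so the per-polarity counts plotted later are a useful complement.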

And finally, we plot.

ggplot(sentiments, aes(x=polarity)) +
  geom_bar(aes(y=..count.., fill=polarity)) +geom_text(stat='count',aes(label=..count..),vjust=-0.2)+
  scale_fill_brewer(palette="BrBG") +
  labs(x="Polarity", y="Number of Tweets") +
  ggtitle("Twitter Sentiment Analysis of Probius")+
  theme(plot.title = element_text(hjust = 0.5))

Of the tweets that could be classified as positive or negative, most are positive (468 versus 114). Let’s see what it looks like over time.

# named sentiment_by_date so it does not mask base R's aggregate()
sentiment_by_date<-sentiments %>% group_by(polarity, date) %>% dplyr::summarize(count=n())

ggplot(sentiment_by_date, aes(date, count)) +
  geom_line(aes(group=polarity, color=polarity), size=1) +
  geom_point(aes(group=polarity, color=polarity), size=2) +
  theme(text = element_text(size=18), axis.text.x = element_text(angle=90, vjust=1)) +
  ggtitle("Twitter Sentiment of Probius Over Time") +
  theme(plot.title = element_text(hjust = 0.5))

Sentiment stays mostly positive over time. The largest number of tweets comes from Probius’ announcement, with a couple of spikes near his PTR release and general release. Overall, things seem pretty positive on Twitter!

Reddit

Now we switch to reddit comments. Will the sentiment and common words be the same? Let’s find out!

We follow the same steps as we did for the tweets.

Top Words

reddit_corpus<-Corpus(VectorSource(probius_reddit$comment))

reddit_corpus<-tm_map(reddit_corpus, content_transformer(removeURL))
reddit_corpus<-tm_map(reddit_corpus, removeNumbers)

corpusCopy<-reddit_corpus #keep this for later

corpusCopy<-tm_map(corpusCopy, PlainTextDocument)

reddit_corpus<-tm_map(reddit_corpus, content_transformer(allButApost))
reddit_corpus <- tm_map(reddit_corpus, tolower)

reddit_corpus<-tm_map(reddit_corpus, removeWords, stopwords("SMART"))

#Heroes Stop Words
reddit_corpus<-tm_map(reddit_corpus, removeWords, heroes_stop_words)

reddit_corpus<-tm_map(reddit_corpus, removePunctuation)

reddit_corpus<-tm_map(reddit_corpus, stemDocument)
reddit_corpus<-tm_map(reddit_corpus, PlainTextDocument)

tdm_reddit<-TermDocumentMatrix(reddit_corpus, control = list(wordLengths = c(1, Inf)))

term_freq_reddit<-rowSums(as.matrix(tdm_reddit))
term_freq_reddit<-subset(term_freq_reddit, term_freq_reddit>=20)
tdm_df_reddit<-data.frame(term=names(term_freq_reddit), freq=term_freq_reddit)

tdm_df_reddit<- arrange(tdm_df_reddit, desc(freq))

ggplot(data=tdm_df_reddit[1:10,]) +
  geom_bar(aes(x=reorder(term, freq), y=freq), stat="identity", fill="light blue") +
  xlab("Terms") +
  ylab("Count") +
  coord_flip() +
  theme(axis.text=element_text(size=10)) +
  ggtitle("Words Describing Probius From Reddit") +
  theme(plot.title = element_text(hjust = 0.5))

Compared with Twitter (which talked about Probius’ appearance), reddit looks to be talking more about game mechanics and Probius’ role in the meta.

Note that I couldn’t do a bigram or trigram analysis due to computational limitations.

Topic Modelling

dtm_reddit<-as.DocumentTermMatrix(tdm_reddit)
rowTotals <- apply(dtm_reddit , 1, sum) 
dtm_reddit_corrected   <- dtm_reddit[rowTotals> 0, ] 
lda<-LDA(dtm_reddit_corrected, control=list(seed=123) ,k=5) 
term<-terms(lda, 5)
(term<-apply(term, MARGIN=2, paste, collapse=", "))

##                                  Topic 1 
##      "damage, talent, armor, good, tank" 
##                                  Topic 2 
##       "skin, sc, game, skins, abilities" 
##                                  Topic 3 
##        "people, game, win, level, games" 
##                                  Topic 4 
## "pylon, damage, cannon, pylons, cannons" 
##                                  Topic 5 
##           "team, wow, time, clear, good"

The reddit comments contain more interesting topics than the tweets. Here are my names for them:

Topic 1: Warrior Changes

Topic 2: Cosmetics (skins and mounts)

Topic 3: Meta Changes

Topic 4: Probius Mechanics

Topic 5: Quick Match Changes

Sentiments

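This is the same process as before, with one difference: the polarity here is computed on the fully cleaned reddit_corpus rather than on corpusCopy, so stopwords and punctuation have already been removed from the text being scored.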

new_reddit_df<-data.frame(text = sapply(reddit_corpus, paste, collapse = " "), stringsAsFactors = FALSE)
sentiments_reddit<-sentiment(new_reddit_df$text)
table(sentiments_reddit$polarity)

## 
## negative  neutral positive 
##      690     7613      411

sentiments_reddit$score<-0
sentiments_reddit$score[sentiments_reddit$polarity=="positive"]<- 1
sentiments_reddit$score[sentiments_reddit$polarity=="negative"]<- -1
sentiments_reddit$date <- as.IDate(probius_reddit$date)
result <- aggregate(score ~ date, data = sentiments_reddit, sum)

ggplot(sentiments_reddit, aes(x=polarity)) +
  geom_bar(aes(y=..count.., fill=polarity)) +geom_text(stat='count',aes(label=..count..),vjust=-0.5)+
  scale_fill_brewer(palette="BrBG") +
  labs(x="Polarity", y="Number of Comments") +
  ggtitle("Reddit Sentiment Analysis of Probius")+
  theme(plot.title = element_text(hjust = 0.5))

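Reddit appears to be more negative than Twitter: negative comments outnumber positive ones (690 to 411), whereas positive tweets outnumbered negative ones (468 to 114). Perhaps this is because reddit allows far longer posts than Twitter, so users are able to articulate their feelings in more detail.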

For example, a user might say: “I don’t like Probius because his abilities don’t reach far enough to be useful for zoning” on reddit. It is also entirely possible that reddit as a website is overall more negative than Twitter.
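
Because the two datasets differ a lot in size, it helps to compare shares rather than raw counts. The sketch below only re-uses the polarity counts printed in the two tables above; the variable names are mine.

```r
# Polarity counts copied from the table() outputs above
twitter <- c(negative = 114, neutral = 1422, positive = 468)
reddit <- c(negative = 690, neutral = 7613, positive = 411)

# Percentage of all texts in each polarity class
twitter_pct <- round(100 * prop.table(twitter), 1)
reddit_pct <- round(100 * prop.table(reddit), 1)

# Share of positive texts among those classified as positive or negative
twitter_pos_share <- twitter[["positive"]] / (twitter[["positive"]] + twitter[["negative"]])
reddit_pos_share <- reddit[["positive"]] / (reddit[["positive"]] + reddit[["negative"]])

print(twitter_pct)                  # negative 5.7, neutral 71.0, positive 23.4
print(reddit_pct)                   # negative 7.9, neutral 87.4, positive 4.7
print(round(twitter_pos_share, 2))  # 0.8
print(round(reddit_pos_share, 2))   # 0.37
```

Both sources are mostly neutral, so the difference between them rests on a fairly small classified subset.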

Finally, we plot the sentiment over time.

aggregate_reddit<-sentiments_reddit %>% group_by(polarity, date) %>% dplyr::summarize(count=n())

ggplot(aggregate_reddit, aes(date, count)) +
  geom_line(aes(group=polarity, color=polarity), size=1) +
  geom_point(aes(group=polarity, color=polarity), size=2) +
  theme(text = element_text(size=18), axis.text.x = element_text(angle=90, vjust=1)) +
  ggtitle("Reddit Sentiment Analysis of Probius Over Time") +
  theme(plot.title = element_text(hjust = 0.5))

Compared to Twitter, we can see that reddit is generally more negative toward our pristine probe. Reddit also appears to have far more comments on Probius after his release compared to Twitter.

Analysis

Twitter users appear to be much more receptive to the Probius announcement than reddit users, but the tweets tail off over time.

Reddit users appear to be more negative toward Probius; however, the discussion has peaks and valleys over time. Reddit users also talk much more technically (meta, mechanics, etc.).

Probing Probius Positivity

hots-data-guy
