I was curious about the Internet’s attitude toward the newest Nexus specialist: Probius! I scraped reddit comments and tweets from the day Probius was announced (March 3) through the time of writing (March 21). Let’s get started!
First, we load the required packages.
suppressPackageStartupMessages({
library(RedditExtractoR)
library(twitteR)
library(tm)
library(ggplot2)
library(dplyr)
library(wordcloud)
library(RWeka)
library(topicmodels)
library(sentiment)
library(data.table)
})
And then we load the tweets and reddit comments. Note that the text had to be scraped periodically over this window: Twitter’s search API, for instance, only returns tweets from the past 7 days.
load("probius_twitter.Rda")
load("probius_reddit.Rda")
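Since the scraping happened in batches, the batches have to be stitched together at some point. Here is a minimal sketch of one way to do that, not necessarily how the saved files were built. The two small data frames are made-up stand-ins, and the `id` column is an assumption about what the scraped tweet table contains.

```r
# Hypothetical stand-ins for two scrapes taken a few days apart.
# The rows are invented; `id` is an assumed unique tweet identifier.
scrape_week1 <- data.frame(
  id = c("101", "102", "103"),
  text = c("Probius looks fun", "So cute!", "Construct additional pylons"),
  created = as.POSIXct(c("2017-03-03 18:00:00", "2017-03-04 09:30:00",
                         "2017-03-08 12:00:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)
scrape_week2 <- data.frame(
  id = c("103", "104"),
  text = c("Construct additional pylons", "Excited for the PTR"),
  created = as.POSIXct(c("2017-03-08 12:00:00", "2017-03-10 20:15:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Stack the scrapes, keep the first copy of each tweet id, and sort by time.
combined <- rbind(scrape_week1, scrape_week2)
combined <- combined[!duplicated(combined$id), ]
combined <- combined[order(combined$created), ]
nrow(combined)
```

Deduplicating on an identifier rather than on the text keeps retweets and repeated phrases that are genuinely separate tweets.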
We’ll start by removing numbers and URLs from tweets.
tweets_corpus<-Corpus(VectorSource(probius_twitter$text))
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
tweets_corpus<-tm_map(tweets_corpus, content_transformer(removeURL))
tweets_corpus<-tm_map(tweets_corpus, removeNumbers)
corpusCopy<-tweets_corpus #keep this for later
corpusCopy<-tm_map(corpusCopy, PlainTextDocument)
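To see what these two steps do to a single tweet, here is a small base R sketch. The `removeURL` pattern is the one from above; `dropNumbers` is a hypothetical helper that only approximates `removeNumbers`, and the sample tweet and its link are made up.

```r
# Same URL pattern as in the post.
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
# Hypothetical stand-in for tm's removeNumbers (assumed to strip digit runs).
dropNumbers <- function(x) gsub("[[:digit:]]+", "", x)

sample_tweet <- "Only 3 days left! https://example.com/probius123 #hots"

no_url <- removeURL(sample_tweet)
no_url       # the link is gone, leaving a double space behind

no_numbers <- dropNumbers(no_url)
no_numbers   # the digit is gone as well
```

The leftover double spaces should be harmless here, because the later tokenization splits on whitespace.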
Next, we strip every character that is not a letter, digit, whitespace, or apostrophe (apostrophes are kept because some stopwords contain them) and convert the words to lowercase.
allButApost <- function(x) gsub("[^[:alnum:][:space:]']", "", x)
tweets_corpus<-tm_map(tweets_corpus, content_transformer(allButApost))
tweets_corpus <- tm_map(tweets_corpus, content_transformer(tolower)) #wrap base functions so the corpus structure is preserved
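Here is a quick sketch of that character filter on a single made-up tweet, using the same pattern as `allButApost` above.

```r
# Keep letters, digits, whitespace, and straight apostrophes; drop the rest.
allButApost <- function(x) gsub("[^[:alnum:][:space:]']", "", x)

sample_text <- "Can't wait for #Probius... he's SO cute!!"
cleaned <- tolower(allButApost(sample_text))
cleaned
```

One thing to watch for: the pattern only protects the straight apostrophe. A tweet typed with a curly apostrophe would most likely lose it at this step, and the resulting word would then no longer match an apostrophe-containing stopword.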
For the next step, we remove all stopwords within the SMART dictionary (this is more exhaustive than the tm package default “english”). I also created my own “Heroes of the Storm”-centric stopwords to prevent commonly used HotS speak from disproportionately affecting the results.
tweets_corpus<-tm_map(tweets_corpus, removeWords, stopwords("SMART"))
#Heroes Stop Words
heroes_stop_words<-c("probius", "heroes", "storm", "video", "youtube", "hero", "blizzheroes",
"trailer", "character", "spotlight", "ptr", "newest", "heroesofthestorm", "meet", "hgc", "play",
"hots", "nexus", "blizzard", "blizzhero", "youtub", "patch", "joins", "probe", "protoss", "starcraft",
"development")
tweets_corpus<-tm_map(tweets_corpus, removeWords, heroes_stop_words)
tweets_corpus<-tm_map(tweets_corpus, removePunctuation)
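For intuition, stop-word removal can be sketched in base R as a whole-word regular expression. `remove_words` below is a hypothetical helper written for illustration; I believe `removeWords` works along similar lines, but treat that as an assumption.

```r
# Hypothetical helper: delete whole-word matches of any stop word.
remove_words <- function(x, words) {
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  gsub(pattern, "", x, perl = TRUE)
}

# A few entries from the custom list above.
custom_stops <- c("probius", "heroes", "storm", "hots")

remove_words("probius is the cutest hero in heroes of the storm", custom_stops)
```

Because the match is anchored on word boundaries, removing "heroes" leaves "hero" alone, which is why both forms appear in the custom list.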
Finally, we stem the tweets (for example, running and runs both become run).
tweets_corpus<-tm_map(tweets_corpus, stemDocument)
tweets_corpus<-tm_map(tweets_corpus, PlainTextDocument)
We then convert the tweets to a term-document matrix, sum each term’s total frequency across all tweets, and keep the terms that appear at least 20 times.
tdm<-TermDocumentMatrix(tweets_corpus, control = list(wordLengths = c(1, Inf)))
term_freq<-rowSums(as.matrix(tdm))
term_freq<-subset(term_freq, term_freq>=20)
tdm_df<-data.frame(term=names(term_freq), freq=term_freq)
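Total frequency and the number of tweets a term appears in are two different counts, so here is a small base R sketch of both on three made-up, already cleaned documents.

```r
# Three toy "tweets" after cleaning (invented for illustration).
docs <- c("fun cute probe", "cute excited", "fun fun mount")
tokens <- strsplit(docs, "[[:space:]]+")

# Total frequency: every occurrence counts.
total_freq <- table(unlist(tokens))

# Document frequency: each term counts at most once per document.
doc_freq <- table(unlist(lapply(tokens, unique)))

total_freq[["fun"]]
doc_freq[["fun"]]
```

On the real matrix, `rowSums(as.matrix(tdm))` gives the first count, and `rowSums(as.matrix(tdm) > 0)` should give the second.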
We then sort from most-frequent to least-frequent and plot.
tdm_df<- arrange(tdm_df, desc(freq))
ggplot(data=tdm_df[1:10,])+
geom_bar(aes(x=reorder(term, freq), y=freq), stat="identity", fill="light blue")+
xlab("Terms")+ylab("Count")+coord_flip()+
theme(axis.text=element_text(size=10))+
ggtitle("Words Describing Probius From Twitter")+
theme(plot.title = element_text(hjust = 0.5))
Twitter loves Probius, using words such as fun, cute, and excited. Many people also appear to be looking forward to receiving their ranked stag mount.
We now examine the top word pairs.
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bigram = TermDocumentMatrix(tweets_corpus, control = list(tokenize = BigramTokenizer))
bi_freq = sort(rowSums(as.matrix(bigram)),decreasing = TRUE)
bi_freq.df = data.frame(word=names(bi_freq), freq=bi_freq)
ggplot(data=bi_freq.df[1:10,])+
geom_bar(aes(x=reorder(word, freq), y=freq), stat="identity", fill="light blue")+
xlab("Terms")+ylab("Count")+coord_flip()+
theme(axis.text=element_text(size=10))+
ggtitle("Word Pairs Describing Probius From Twitter")+
theme(plot.title = element_text(hjust = 0.5))
Most of the pairs express the desire for the (elemental) stag mount, in addition to excitement for Probius. Note that mount likely appears so often because users are talking about the Probius patch as a whole.
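If it helps to see what an n-gram tokenizer produces, here is a minimal base R sketch. `make_ngrams` is a hypothetical helper written for illustration; it is not the RWeka tokenizer used above.

```r
# Hypothetical helper: slide a window of n words across a cleaned string.
make_ngrams <- function(text, n = 2) {
  words <- strsplit(text, "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]            # drop empty tokens from stray spaces
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

make_ngrams("epic elemental stag mount", 2)
make_ngrams("epic elemental stag mount", 3)
```

A document with w words yields w - n + 1 n-grams, so a popular phrase such as a mount name shows up in several overlapping pairs and triplets at once.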
Now we move on to word triplets.
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
trigram = TermDocumentMatrix(tweets_corpus, control = list(tokenize = TrigramTokenizer))
tri_freq = sort(rowSums(as.matrix(trigram)),decreasing = TRUE)
tri_freq.df = data.frame(word=names(tri_freq), freq=tri_freq)
head(tri_freq.df, 20)
##                                                    word freq
## elemental stag mount               elemental stag mount   32
## epic elemental stag                 epic elemental stag   32
## announcement super fun           announcement super fun   18
## excited hands announcement   excited hands announcement   18
## hands announcement super       hands announcement super   18
## super fun sounds                       super fun sounds   18
## fun sounds love                         fun sounds love   17
## construct additional pylons construct additional pylons   16
## news heroesofstorm read         news heroesofstorm read   10
## small deadly latest                 small deadly latest    7
## breakdown game mount               breakdown game mount    6
## enters ranked season               enters ranked season    6
## free rotation slots                 free rotation slots    6
## increases free rotation         increases free rotation    6
## ranked season kicks                 ranked season kicks    6
## roster arrives week                 roster arrives week    6
## rotation slots lands               rotation slots lands    6
## animations fishing bobbers   animations fishing bobbers    5
## bobbers tweets dlc                   bobbers tweets dlc    5
## builds pylons ya                       builds pylons ya    5
ggplot(data=tri_freq.df[1:10,])+
geom_bar(aes(x=reorder(word, freq), y=freq), stat="identity", fill="light blue")+
xlab("Terms")+ylab("Count")+coord_flip()+
theme(axis.text=element_text(size=10))+
ggtitle("Word Triplets Describing Probius From Twitter")+
theme(plot.title = element_text(hjust = 0.5))
Elemental stag mentions are here, too. So is eagerness to play Probius. And of course, it wouldn’t be a Protoss hero without the signature Construct additional pylons.
Here, we try to identify the different topics Twitter is discussing using topic modelling.
First, we get a Document Term Matrix. Then we fit an LDA model and extract the top 3 terms for each topic.
dtm<-as.DocumentTermMatrix(tdm)
rowTotals <- apply(dtm , 1, sum)
dtm_corrected <- dtm[rowTotals> 0, ]
lda<-LDA(dtm_corrected, control=list(seed=123) ,k=2)
term<-terms(lda, 3)
(term<-apply(term, MARGIN=2, paste, collapse=", "))
##            Topic 1                 Topic 2
## "game, news, stag" "excited, pylons, wait"
Note that there is some overlap between the categories. I found that 2 topics gave the most mutually exclusive, collectively exhaustive results.
Here are the best topic names I could come up with based on the terms:
Topic 1: Patch Reactions
Topic 2: Probius Reactions
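One detail in the LDA code worth spelling out is the `rowTotals > 0` filter. A tweet can end up with no terms at all once stopwords and custom words are removed, and LDA needs every document to contain at least one term. Here is a toy base R sketch of that filter on a made-up count matrix.

```r
# Toy document-term counts: rows are documents, columns are terms (invented).
dtm_toy <- matrix(
  c(2, 0, 1,
    0, 0, 0,   # every term in this document was removed during cleaning
    0, 3, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("excited", "pylons", "stag"))
)

row_totals <- rowSums(dtm_toy)
dtm_kept <- dtm_toy[row_totals > 0, , drop = FALSE]
rownames(dtm_kept)
```

After filtering, row positions no longer line up with the original tweets, so it is worth keeping `which(row_totals > 0)` around if you want to map topics back to specific tweets.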
In this section, we examine the positivity/negativity Twitter expresses toward Probius.
First we give each tweet a polarity (positive/negative/neutral).
new_twitter_df<-data.frame(text = sapply(corpusCopy, paste, collapse = " "), stringsAsFactors = FALSE)
sentiments<-sentiment(new_twitter_df$text)
table(sentiments$polarity)
##
## negative  neutral positive
##      114     1422      468
Then we score that polarity (1 for positive, 0 for neutral, -1 for negative) and sum the scores by day.
sentiments$score<-0
sentiments$score[sentiments$polarity=="positive"]<- 1
sentiments$score[sentiments$polarity=="negative"]<- -1
sentiments$date <- as.IDate(probius_twitter$created)
result <- aggregate(score ~ date, data = sentiments, sum)
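The scoring and daily sum can be sketched on a few made-up rows in base R. The lookup vector `score_map` is just an alternative way to write the three assignments above, and `as.Date` stands in for `as.IDate`.

```r
# Invented polarity labels for five tweets across two days.
sent_toy <- data.frame(
  polarity = c("positive", "neutral", "negative", "positive", "positive"),
  date = as.Date(c("2017-03-03", "2017-03-03", "2017-03-03",
                   "2017-03-04", "2017-03-04")),
  stringsAsFactors = FALSE
)

# Map each label to its score with a named lookup vector.
score_map <- c(positive = 1, neutral = 0, negative = -1)
sent_toy$score <- unname(score_map[sent_toy$polarity])

# Net sentiment per day: positives minus negatives.
daily <- aggregate(score ~ date, data = sent_toy, sum)
daily
```

A net score hides volume (one positive and one negative tweet cancel out), which is probably why the plots below use per-polarity counts instead.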
And finally, we plot.
ggplot(sentiments, aes(x=polarity)) +
geom_bar(aes(y=..count.., fill=polarity)) +geom_text(stat='count',aes(label=..count..),vjust=-0.2)+
scale_fill_brewer(palette="BrBG") +
labs(x="Polarity", y="Number of Tweets") +
ggtitle("Twitter Sentiment Analysis of Probius")+
theme(plot.title = element_text(hjust = 0.5))
Of the tweets that could be classified as positive or negative, most are positive (468 of 582, or about 80%). Let’s see what it looks like over time.
aggregate_twitter<-sentiments %>% group_by(polarity, date) %>% dplyr::summarize(count=n()) #named to avoid masking base aggregate()
ggplot(aggregate_twitter, aes(date, count)) + geom_line(aes(group=polarity, color=polarity), size=1) +
geom_point(aes(group=polarity, color=polarity), size=2) +
theme(text = element_text(size=18), axis.text.x = element_text(angle=90, vjust=1))+ggtitle("Twitter Sentiment of Probius Over Time")+theme(plot.title = element_text(hjust = 0.5))
The sentiment stays mostly positive over time. The largest number of tweets comes from around Probius’ announcement, with a couple of spikes near his PTR release and general release. Overall, things seem pretty positive on Twitter!
Now we switch to reddit comments. Will the sentiment and common words be the same? Let’s find out!
We follow the same cleaning steps as for the tweets.
reddit_corpus<-Corpus(VectorSource(probius_reddit$comment))
reddit_corpus<-tm_map(reddit_corpus, content_transformer(removeURL))
reddit_corpus<-tm_map(reddit_corpus, removeNumbers)
corpusCopy<-reddit_corpus #keep this for later
corpusCopy<-tm_map(corpusCopy, PlainTextDocument)
reddit_corpus<-tm_map(reddit_corpus, content_transformer(allButApost))
reddit_corpus <- tm_map(reddit_corpus, content_transformer(tolower)) #wrap base functions so the corpus structure is preserved
reddit_corpus<-tm_map(reddit_corpus, removeWords, stopwords("SMART"))
#Heroes Stop Words
reddit_corpus<-tm_map(reddit_corpus, removeWords, heroes_stop_words)
reddit_corpus<-tm_map(reddit_corpus, removePunctuation)
reddit_corpus<-tm_map(reddit_corpus, stemDocument)
reddit_corpus<-tm_map(reddit_corpus, PlainTextDocument)
tdm_reddit<-TermDocumentMatrix(reddit_corpus, control = list(wordLengths = c(1, Inf)))
term_freq_reddit<-rowSums(as.matrix(tdm_reddit))
term_freq_reddit<-subset(term_freq_reddit, term_freq_reddit>=20)
tdm_df_reddit<-data.frame(term=names(term_freq_reddit), freq=term_freq_reddit)
tdm_df_reddit<- arrange(tdm_df_reddit, desc(freq))
ggplot(data=tdm_df_reddit[1:10,])+
geom_bar(aes(x=reorder(term, freq), y=freq), stat="identity", fill="light blue")+
xlab("Terms")+ylab("Count")+coord_flip()+
theme(axis.text=element_text(size=10))+
ggtitle("Words Describing Probius From Reddit")+
theme(plot.title = element_text(hjust = 0.5))
Compared with Twitter (which talked about Probius’ appearance), reddit looks to be talking more about game mechanics and Probius’ role in the meta.
Note that I couldn’t do a bigram or trigram analysis due to computational limitations.
dtm_reddit<-as.DocumentTermMatrix(tdm_reddit)
rowTotals <- apply(dtm_reddit , 1, sum)
dtm_reddit_corrected <- dtm_reddit[rowTotals> 0, ]
lda<-LDA(dtm_reddit_corrected, control=list(seed=123) ,k=5)
term<-terms(lda, 5)
(term<-apply(term, MARGIN=2, paste, collapse=", "))
## Topic 1
## "damage, talent, armor, good, tank"
## Topic 2
## "skin, sc, game, skins, abilities"
## Topic 3
## "people, game, win, level, games"
## Topic 4
## "pylon, damage, cannon, pylons, cannons"
## Topic 5
## "team, wow, time, clear, good"
The reddit comments cover a wider range of topics than the tweets. Here are the best names I could come up with:
Topic 1: Warrior Changes
Topic 2: Cosmetics (skins and mounts)
Topic 3: Meta Changes
Topic 4: Probius Mechanics
Topic 5: Quick Match Changes
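This is the same process as before, with one caveat: the code below scores the cleaned and stemmed reddit_corpus, whereas the Twitter analysis scored corpusCopy (with only URLs and numbers removed), so the two sets of counts may not be strictly comparable.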
new_reddit_df<-data.frame(text = sapply(reddit_corpus, paste, collapse = " "), stringsAsFactors = FALSE)
sentiments_reddit<-sentiment(new_reddit_df$text)
table(sentiments_reddit$polarity)
##
## negative  neutral positive
##      690     7613      411
sentiments_reddit$score<-0
sentiments_reddit$score[sentiments_reddit$polarity=="positive"]<- 1
sentiments_reddit$score[sentiments_reddit$polarity=="negative"]<- -1
sentiments_reddit$date <- as.IDate(probius_reddit$date)
result <- aggregate(score ~ date, data = sentiments_reddit, sum)
ggplot(sentiments_reddit, aes(x=polarity)) +
geom_bar(aes(y=..count.., fill=polarity)) +geom_text(stat='count',aes(label=..count..),vjust=-0.5)+
scale_fill_brewer(palette="BrBG") +
labs(x="Polarity", y="Number of Comments") +
ggtitle("Reddit Sentiment Analysis of Probius")+
theme(plot.title = element_text(hjust = 0.5))
Reddit appears to be more negative than Twitter. Perhaps this is because reddit’s character limit is far higher than Twitter’s, so users are able to articulate their feelings in more detail.
For example, a reddit user might say: “I don’t like Probius because his abilities don’t reach far enough to be useful for zoning”. It is also entirely possible that reddit as a website is overall more negative than Twitter.
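The raw counts are hard to compare directly because there are over four times as many reddit comments as tweets, so here is a short base R sketch that turns the two polarity tables above into percentages. The counts are copied from the output above; nothing else is assumed.

```r
# Polarity counts copied from the two table() outputs above.
counts <- rbind(
  Twitter = c(negative = 114, neutral = 1422, positive = 468),
  Reddit  = c(negative = 690, neutral = 7613, positive = 411)
)

# Percentage of each polarity within each platform.
shares <- round(100 * prop.table(counts, margin = 1), 1)
shares

# Positive items per negative item.
pos_to_neg <- counts[, "positive"] / counts[, "negative"]
round(pos_to_neg, 1)
```

By this measure, roughly 23% of tweets are positive versus under 5% of reddit comments, and Twitter has about four positive tweets for every negative one while reddit has fewer positive comments than negative ones.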
Finally, we plot the sentiment over time.
aggregate_reddit<-sentiments_reddit %>% group_by(polarity, date) %>% dplyr::summarize(count=n())
ggplot(aggregate_reddit, aes(date, count)) + geom_line(aes(group=polarity, color=polarity), size=1) +
geom_point(aes(group=polarity, color=polarity), size=2) +
theme(text = element_text(size=18), axis.text.x = element_text(angle=90, vjust=1))+ggtitle("Reddit Sentiment Analysis of Probius Over Time")+theme(plot.title = element_text(hjust = 0.5))
Compared to Twitter, we can see that reddit is generally more negative toward our pristine probe. Reddit also appears to have far more comments on Probius after his release compared to Twitter.
Twitter users appear to be much more receptive to the Probius announcement, but tweets appear to tail off over time.
Reddit users appear to be more negative toward Probius; however, the volume of discussion rises and falls over time rather than tailing off. Reddit users also talk in much more technical terms (meta, mechanics, etc.).