library(twitteR)
library(tm)
library(topicmodels)
library(sentiment)
library(ggplot2)
library(wordcloud)
library(data.table)

Collecting Tweets

First, we need to collect some tweets from @BlizzHeroes for analysis. We’ve requested up to 3200 (the maximum amount). By default, this includes replies but excludes retweets. To prevent downloading different tweets every time this code is run, I saved the initial batch of tweets as an RDS file.

#tweets<-userTimeline("BlizzHeroes", n=3200)
tweets<-readRDS("tweets.rds")
length(tweets)
[1] 366

366!? But we requested 3200!

This is because the Twitter API limits us in how far back we can go. It appears we’ve only received tweets since October 2016.

Cleaning

First, we want to convert our tweets to a data frame.

tweets_df<-twListToDF(tweets)

Next, we transform tweets into a corpus of tweets. We also remove numbers and URLs from the tweets to make the data more digestible.

tweets_corpus <- Corpus(VectorSource(tweets_df$text))
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
tweets_corpus<-tm_map(tweets_corpus, content_transformer(removeURL))
tweets_corpus<-tm_map(tweets_corpus, removeNumbers)
corpusCopy<-tweets_corpus #keep this for later
corpusCopy<-tm_map(corpusCopy, PlainTextDocument)

Now we are in a bind. If we use the removePunctuation function in the tm package, R will remove apostrophes. However, our stopwords dictionary contains words with apostrophes.

I wrote a regex function that removes every character except letters, digits, whitespace, and apostrophes, and then converted the text to lowercase.

allButApost <- function(x) gsub("[^[:alnum:][:space:]']", "", x)
tweets_corpus<-tm_map(tweets_corpus, content_transformer(allButApost))
tweets_corpus <- tm_map(tweets_corpus, content_transformer(tolower)) #wrap the base function so tm keeps the document structure

Then, I removed stopwords found in the “SMART” dictionary. Stopwords are common words that won’t be useful in our analysis (think: a, the, an, very, etc). Then I removed any residual punctuation.

tweets_corpus<-tm_map(tweets_corpus, removeWords, stopwords("SMART"))
tweets_corpus<-tm_map(tweets_corpus, removePunctuation)

Stemming

In text processing, the same word often shows up in different forms. For example, pwn could also appear as pwning or pwned, but the meaning of the word is the same. Without stemming, the algorithm would count pwn, pwning, and pwned as three separate words, instead of just “pwn.”

We fix this via stemming–removing extra letters at the end of certain words.

tweets_corpus<-tm_map(tweets_corpus, stemDocument)
tweets_corpus<-tm_map(tweets_corpus, PlainTextDocument)

Term Document Matrix

Remember our corpus of tweets? We are going to expand upon that by tallying every word in every tweet. Imagine a table with individual tweet numbers as columns and every word from every tweet as rows.

Each cell holds the number of times a word appears in a tweet. Because the vocabulary is large (every word from every tweet), each tweet contains only a small fraction of it. As such, most cells are 0 (indicating the word does not appear in that specific tweet).

The Wikipedia page provides a nice example.

Note that Wikipedia demonstrates a Document Term Matrix, whereas here we are using a Term Document Matrix. The TDM will be the transpose of what is shown on Wikipedia.

tdm<-TermDocumentMatrix(tweets_corpus, control = list(wordLengths = c(1, Inf)))
term_freq<-rowSums(as.matrix(tdm))
term_freq<-subset(term_freq, term_freq>=20)
tdm_df<-data.frame(term=names(term_freq), freq=term_freq)

We’ll plot the most frequent words, defined above as those that appear at least 20 times.

ggplot(data=tdm_df)+geom_bar(aes(x=term, y=freq), stat="identity")+xlab("Terms")+ylab("Count")+coord_flip()+theme(axis.text=element_text(size=10))

It looks like ‘Heroes’ is the most popular word. No surprise there. Other words, such as ‘live’ and ‘tune’, might be linked to promoting e-sports and streamers.

WordClouds

Everyone’s favorite.

Here, I show words appearing at least 6 times.

tdm_matrix<-as.matrix(tdm)
word_freq<-sort(rowSums(tdm_matrix), decreasing=T)
pal<-brewer.pal(8, "Dark2")
wordcloud(words=names(word_freq), freq=word_freq, min.freq=6, random.order = F, colors=pal)

@BlizzHeroes loves calling “Heroes” to the “Nexus” and mentioning Twitch streamers. We also encounter numerous Hero names, likely from when they were released or received a rework.

Associations

We can find words associated with common words. Let’s look at the hero Lucio and see what we can find!

findAssocs(tdm, "lucio", 0.2)
$lucio
        arrives heroespowerhour      highlights     impressions    intothenexus           talks            test 
           0.50            0.50            0.50            0.50            0.50            0.50            0.47 
         public           realm            dive         itncast             amp           murky        reworked 
           0.40            0.40            0.35            0.35            0.29            0.28            0.28 
            feb 
           0.24 

Since this was scraped a week after his release, most words associated with Lucio indicate his release buzz (arrives, highlights, impressions). Other notable words include Murky and reworked since everyone’s (least?) favorite Murloc made a comeback to the Nexus after some changes.

Topic modeling

What does @BlizzHeroes tweet about on a higher level? We can use topic modeling to discover words that commonly appear together and group them into a predetermined number of topics.

After playing around with the number of topics, I settled on 5, which gave the least overlap.

dtm<-as.DocumentTermMatrix(tdm)
rowTotals <- apply(dtm , 1, sum) 
dtm_corrected   <- dtm[rowTotals> 0, ] 
lda<-LDA(dtm_corrected, control=list(seed=123) ,k=5) 
term<-terms(lda, 5)
(term<-apply(term, MARGIN=2, paste, collapse=", "))
                                 Topic 1                                  Topic 2 
    "heroes, love, lunar, region, enjoy"      "hero, tune, league, matches, live" 
                                 Topic 3                                  Topic 4 
"play, kendricswissh, gt, team, valeera"   "patch, notes, stream, great, balance" 
                                 Topic 5 
 "nexus, rewards, hgc, hencyheccu, week" 

The algorithm isn’t going to assign our topics names. Here’s my attempt.

Topic 1 appears to be promoting events (Lunar New Year, Valentine’s Day).

Topic 2 is e-sports promotion.

Topic 3 looks to be related to plays, possibly pro teams?

Topic 4 is related to talking about updates to the game.

Topic 5 is possibly promoting streamers and rewards.

Note there is some overlap between topics.

So @BlizzHeroes mainly concerns: promotional events, promoting e-sports, sharing wicked plays, informing about updates, and promoting streamers.

Sentiment Analysis

What is the overall attitude of @BlizzHeroes? Here sentiment analysis saves the day. It classifies every tweet as “negative”, “neutral”, or “positive” based on the number of positive and negative words it contains.

new_twitter_df<-data.frame(text = sapply(corpusCopy, paste, collapse = " "), stringsAsFactors = FALSE)
sentiments<-sentiment(new_twitter_df$text)
table(sentiments$polarity)

negative  neutral positive 
       4      171      191 
sentiments$score<-0
sentiments$score[sentiments$polarity=="positive"]<- 1
sentiments$score[sentiments$polarity=="negative"]<- -1
sentiments$date <- as.IDate(tweets_df$created)
result <- aggregate(score ~ date, data = sentiments, sum)

Wow! Just over half of @BlizzHeroes’ tweets (191 of 366) are flagged as positive, and only 4 as negative. Let’s investigate what those negative tweets were.

sentiments[sentiments$polarity=="negative",]$text
[1]  @AronDark Sorry to hear that! We re putting out a fix for the issue tomorrow!                                               
[2]  @iakona We re investigating this issue, sorry for the poor experience!                                                      
[3]  @JKesselring Hi! Sorry for the poor experience. We re looking into it! Did you happen to just recently install the game?    
[4]  @daniellovejr Sorry you had a bad experience! We re talking about ways to make it less punishing when teammates leave games.
366 Levels:   ...Changes are a few months off, but we are actively looking at Murky. From our Balance Q&A:…  ...

Most of the negative sentiment comes from apologizing for game issues and bugs.

Visualizing Sentiment

Let’s examine the sentiment over time.

ggplot(data=result) + geom_smooth(aes(x=date, y=score))+xlab("Month")+ylab("Score")+ggtitle("@BlizzHeroes sentiment over time")+theme(plot.title = element_text(hjust = 0.5))

Did Blizzard get a case of the holiday blues? Their sentiment decreases around December. Most likely this is due to employees (social media managers) taking a break (and sending fewer positively flagged tweets).

Finally, we visualize the sentiments.

ggplot(sentiments, aes(x=polarity)) +
  geom_bar(aes(y=..count.., fill=polarity)) +
  scale_fill_brewer(palette="RdGy") +
  labs(x="Polarity", y="Number of Tweets") +
  ggtitle("Sentiment Analysis of @BlizzHeroes")+
  theme(plot.title = element_text(hjust = 0.5))

Overall, @BlizzHeroes is a very positive Twitter user. Great job, @BlizzHeroes!

---
title: '@BlizzHeroes Analysis'
author: "hots_data_guy"
output: html_notebook
---


```{r, message=FALSE, warning=FALSE}
library(twitteR)
library(tm)
library(topicmodels)
library(sentiment)
library(ggplot2)
library(wordcloud)
library(data.table)
```


## Collecting Tweets

First, we need to collect some tweets from \@BlizzHeroes for analysis. We've requested up to 3200 (the maximum amount). By default, this includes replies but excludes retweets. To prevent downloading different tweets every time this code is run, I saved the initial batch of tweets as an RDS file.
```{r}
#tweets<-userTimeline("BlizzHeroes", n=3200)
tweets<-readRDS("tweets.rds")
length(tweets)
```

366!? But we requested 3200! 

This is because the Twitter API limits us in how far back we can go. It appears we've only received tweets since October 2016.  
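
The save-once, read-afterwards idea from the start of this section can be sketched in base R. This is only an illustration: `fetch_tweets` and `load_cached` are hypothetical helpers of mine, the fake "tweets" are plain strings, and the cache goes to a temporary file rather than `tweets.rds`.

```r
# Hypothetical stand-in for the real download; userTimeline() needs API credentials.
fetch_tweets <- function() c("first tweet", "second tweet")

# Read the cached copy if it exists; otherwise download once and save it.
load_cached <- function(path, fetch) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- fetch()
  saveRDS(result, path)
  result
}

cache_path <- tempfile(fileext = ".rds")  # illustrative path, not tweets.rds
first_run  <- load_cached(cache_path, fetch_tweets)
# The second call must not hit the "network": the fetch function would raise an error.
second_run <- load_cached(cache_path, function() stop("should not download again"))
identical(first_run, second_run)
```

Because the second call never runs its fetch function, every later run of the notebook would see exactly the same batch of tweets.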

## Cleaning

First, we want to convert our tweets to a data frame.

```{r}
tweets_df<-twListToDF(tweets)
```


Next, we transform tweets into a corpus of tweets. We also remove numbers and URLs from the tweets to make the data more digestible.

```{r}
tweets_corpus <- Corpus(VectorSource(tweets_df$text))
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
tweets_corpus<-tm_map(tweets_corpus, content_transformer(removeURL))
tweets_corpus<-tm_map(tweets_corpus, removeNumbers)

corpusCopy<-tweets_corpus #keep this for later
corpusCopy<-tm_map(corpusCopy, PlainTextDocument)
```
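
To see what the URL and number removal do in isolation, here is a small base R sketch on a made-up tweet. `removeDigits` is my own stand-in for what `removeNumbers` does to a single string, and the sample text and URL are invented.

```r
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
# Assumed base R equivalent of tm's removeNumbers for one string
removeDigits <- function(x) gsub("[[:digit:]]+", "", x)

sample_tweet <- "Patch 2 notes are live: https://example.com/notes see you soon"
cleaned <- removeDigits(removeURL(sample_tweet))
cleaned
```

Note that both steps leave doubled spaces behind. That is harmless here, since the term document matrix built later splits on whitespace anyway.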


Now we are in a bind. If we use the `removePunctuation` function in the tm package, R will remove apostrophes. However, our stopwords dictionary contains words with apostrophes. 

I wrote a regex function that removes every character except letters, digits, whitespace, and apostrophes, and then converted the text to lowercase.
```{r}
allButApost <- function(x) gsub("[^[:alnum:][:space:]']", "", x)
tweets_corpus<-tm_map(tweets_corpus, content_transformer(allButApost))
tweets_corpus <- tm_map(tweets_corpus, content_transformer(tolower)) #wrap the base function so tm keeps the document structure
```
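
Here is what that regex does to a single made-up tweet, outside of the corpus. The sample text is invented for illustration.

```r
allButApost <- function(x) gsub("[^[:alnum:][:space:]']", "", x)

sample_tweet <- "Don't miss it, @BlizzHeroes fans! #HGC"
kept <- tolower(allButApost(sample_tweet))
kept
```

The comma, `@`, `!`, and `#` are dropped, while the apostrophe in "don't" survives so it can still match the stopword list in the next step.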


Then, I removed stopwords found in the "SMART" dictionary. Stopwords are common words that won't be useful in our analysis (think: a, the, an, very, etc). Then I removed any residual punctuation.
```{r}
tweets_corpus<-tm_map(tweets_corpus, removeWords, stopwords("SMART"))
tweets_corpus<-tm_map(tweets_corpus, removePunctuation)
```
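
The reason for keeping apostrophes until now can be shown with a rough base R sketch of whole-word stopword removal. The five-word `stop_words` list and the `drop_words` helper are hypothetical; the real SMART list is far longer, and I am not claiming this is how `removeWords` is implemented.

```r
stop_words <- c("a", "the", "an", "very", "don't")  # hypothetical mini list, not SMART

# Remove whole-word matches only, then tidy up the leftover spaces
drop_words <- function(x, words) {
  pattern <- sprintf("\\b(%s)\\b", paste(words, collapse = "|"))
  stripped <- gsub(pattern, "", x, perl = TRUE)
  trimws(gsub("[[:space:]]+", " ", stripped))
}

with_apostrophe    <- drop_words("don't feed the very hungry murky", stop_words)
without_apostrophe <- drop_words("dont feed the very hungry murky", stop_words)
with_apostrophe
without_apostrophe
```

If the apostrophe had been stripped first, "dont" would no longer match the stopword "don't" and would stay in the text.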


## Stemming

In text processing, the same word often shows up in different forms. For example, pwn could also appear as pwning or pwned, but the meaning of the word is the same. Without stemming, the algorithm would count pwn, pwning, and pwned as three separate words, instead of just "pwn."

We fix this via stemming--removing extra letters at the end of certain words. 

```{r, message=FALSE}
tweets_corpus<-tm_map(tweets_corpus, stemDocument)
tweets_corpus<-tm_map(tweets_corpus, PlainTextDocument)
```
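
As a toy illustration of the idea, the sketch below strips a few common suffixes with one regex. `toy_stem` is a deliberately crude helper of my own; `stemDocument` applies a real stemming algorithm with many more rules.

```r
# Crude suffix stripper for illustration only, not a real stemming algorithm
toy_stem <- function(words) sub("(ing|ed|s)$", "", words)

words <- c("pwn", "pwning", "pwned", "pwns")
stems <- toy_stem(words)
table(stems)
```

All four forms collapse to the single stem "pwn", so they would be counted together.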

## Term Document Matrix

Remember our corpus of tweets? We are going to expand upon that by tallying every word in every tweet. Imagine a table with individual tweet numbers as columns and every word from every tweet as rows.

Each cell holds the number of times a word appears in a tweet. Because the vocabulary is large (**every** word from **every** tweet), each tweet contains only a small fraction of it. As such, most cells are 0 (indicating the word does not appear in that specific tweet).

The Wikipedia page provides a nice [example](https://en.wikipedia.org/wiki/Document-term_matrix). 

Note that Wikipedia demonstrates a Document Term Matrix, whereas here we are using a Term Document Matrix. The TDM will be the transpose of what is shown on Wikipedia.

```{r}
tdm<-TermDocumentMatrix(tweets_corpus, control = list(wordLengths = c(1, Inf)))

term_freq<-rowSums(as.matrix(tdm))
term_freq<-subset(term_freq, term_freq>=20)
tdm_df<-data.frame(term=names(term_freq), freq=term_freq)
```
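
To make the layout concrete, here is a tiny term document matrix built by hand in base R from three invented "tweets". The names `docs`, `toy_tdm`, and `toy_freq` are mine and the counts have nothing to do with the real data.

```r
docs <- c("heroes nexus heroes", "nexus live", "tune live stream")
tokens <- strsplit(docs, " ", fixed = TRUE)
terms <- sort(unique(unlist(tokens)))

# rows = terms, columns = documents, cells = how often the term occurs
toy_tdm <- vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = terms))),
  integer(length(terms))
)
dimnames(toy_tdm) <- list(terms, paste0("doc", seq_along(docs)))
toy_tdm

# Total count per term, the same idea as rowSums(as.matrix(tdm)) above
toy_freq <- rowSums(toy_tdm)
toy_freq
```

Even with three documents, several of the cells are already zero. Transposing with `t(toy_tdm)` gives the document term layout shown on Wikipedia.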

We'll plot the most frequent words, defined above as those that appear at least 20 times.
```{r}
ggplot(data=tdm_df)+geom_bar(aes(x=term, y=freq), stat="identity")+xlab("Terms")+ylab("Count")+coord_flip()+theme(axis.text=element_text(size=10))
```

It looks like 'Heroes' is the most popular word. No surprise there. Other words, such as 'live' and 'tune', might be linked to promoting e-sports and streamers.

## WordClouds

Everyone's *favorite*. 

Here, I show words appearing at least 6 times.
```{r}
tdm_matrix<-as.matrix(tdm)

word_freq<-sort(rowSums(tdm_matrix), decreasing=T)
pal<-brewer.pal(8, "Dark2")

wordcloud(words=names(word_freq), freq=word_freq, min.freq=6, random.order = F, colors=pal)
```
\@BlizzHeroes loves calling "Heroes" to the "Nexus" and mentioning Twitch streamers. We also encounter numerous Hero names, likely from when they were released or received a rework.  


## Associations

We can find words associated with common words. Let's look at the hero `Lucio` and see what we can find!

```{r}
findAssocs(tdm, "lucio", 0.2)
```

Since this was scraped a week after his release, most words associated with Lucio indicate his release buzz (`arrives`, `highlights`, `impressions`). Other notable words include `Murky` and `reworked` since everyone's (least?) favorite Murloc made a comeback to the Nexus after some changes.
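
As far as I understand it, the numbers reported by `findAssocs` are correlations between a term's counts across documents and every other term's counts, cut off at the given lower limit. A rough base R sketch on invented counts (the `toy_tdm` matrix below is made up, not taken from the real tweets):

```r
# rows = terms, columns = tweets (toy counts, not the real data)
toy_tdm <- rbind(
  lucio   = c(1, 1, 0, 0, 0, 0),
  arrives = c(1, 1, 0, 0, 0, 0),
  murky   = c(1, 0, 1, 0, 0, 0),
  nexus   = c(0, 0, 1, 1, 1, 1)
)

# Correlate every term with "lucio" across tweets
assoc <- cor(t(toy_tdm))["lucio", ]
assoc <- sort(assoc[names(assoc) != "lucio"], decreasing = TRUE)

# Keep only terms at or above the 0.2 limit used above
strong <- round(assoc[assoc >= 0.2], 2)
strong
```

In this toy example, a term that always appears alongside "lucio" gets a high score, a term that sometimes does gets a modest one, and a term that never does falls below the limit and is dropped.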

## Topic modeling

What does \@BlizzHeroes tweet about on a higher level? We can use topic modeling to discover words that commonly appear together and group them into a predetermined number of topics.

After playing around with the number of topics, I settled on 5, which gave the least overlap.

```{r}
dtm<-as.DocumentTermMatrix(tdm)
rowTotals <- apply(dtm , 1, sum) 
dtm_corrected   <- dtm[rowTotals> 0, ] 

lda<-LDA(dtm_corrected, control=list(seed=123) ,k=5) 
term<-terms(lda, 5)
(term<-apply(term, MARGIN=2, paste, collapse=", "))
```
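
The `rowTotals` filter in the chunk above is there because `LDA` refuses a matrix in which some document has no terms left, which can happen when a tweet consisted only of stopwords, numbers, and URLs. A minimal base R sketch of that filter on an invented matrix (`toy_dtm` and its counts are made up):

```r
# rows = tweets, columns = terms
toy_dtm <- rbind(
  tweet1 = c(heroes = 2, nexus = 1, live = 0),
  tweet2 = c(heroes = 0, nexus = 0, live = 0),  # nothing survived cleaning
  tweet3 = c(heroes = 0, nexus = 1, live = 1)
)

row_totals <- rowSums(toy_dtm)
toy_dtm_corrected <- toy_dtm[row_totals > 0, , drop = FALSE]
rownames(toy_dtm_corrected)
```

One thing to keep in mind: after this filter, row numbers no longer line up one-to-one with the original tweets, so any later join back to `tweets_df` would need the retained row names.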

The algorithm isn't going to assign our topics names. Here's my attempt.

Topic 1 appears to be promoting events (Lunar New Year, Valentine's Day).

Topic 2 is e-sports promotion.

Topic 3 looks to be related to plays, possibly pro teams? 

Topic 4 is related to talking about updates to the game.

Topic 5 is possibly promoting streamers and rewards.

Note there is some overlap between topics.

So \@BlizzHeroes mainly concerns: promotional events, promoting e-sports, sharing wicked plays, informing about updates, and promoting streamers.

## Sentiment Analysis

What is the overall attitude of \@BlizzHeroes? Here sentiment analysis saves the day. It classifies every tweet as "negative", "neutral", or "positive" based on the number of positive and negative words it contains.
```{r}
new_twitter_df<-data.frame(text = sapply(corpusCopy, paste, collapse = " "), stringsAsFactors = FALSE)

sentiments<-sentiment(new_twitter_df$text)
table(sentiments$polarity)

sentiments$score<-0
sentiments$score[sentiments$polarity=="positive"]<- 1
sentiments$score[sentiments$polarity=="negative"]<- -1
sentiments$date <- as.IDate(tweets_df$created)

result <- aggregate(score ~ date, data = sentiments, sum)
```
Wow! Just over half of \@BlizzHeroes' tweets (191 of 366) are flagged as positive, and only 4 as negative. Let's investigate what those negative tweets were.

```{r}
sentiments[sentiments$polarity=="negative",]$text
```

Most of the negative sentiment comes from apologizing for game issues and bugs.
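
To show the general idea of classifying by word counts, here is a toy classifier in base R. This is not the method used by the `sentiment` package, whose internals I have not checked; the two three-word lexicons, the `classify_polarity_toy` helper, and the sample tweets are all made up for illustration.

```r
positive_words <- c("love", "great", "enjoy")  # hypothetical lexicon
negative_words <- c("sorry", "poor", "bad")    # hypothetical lexicon

# Count positive minus negative words and label the tweet by the sign
classify_polarity_toy <- function(text) {
  letters_only <- tolower(gsub("[^[:alpha:][:space:]]", "", text))
  words <- strsplit(letters_only, "[[:space:]]+")[[1]]
  balance <- sum(words %in% positive_words) - sum(words %in% negative_words)
  if (balance > 0) "positive" else if (balance < 0) "negative" else "neutral"
}

toy_tweets <- c(
  "We love this great patch!",
  "Sorry for the poor experience.",
  "Patch notes are live."
)
polarity <- vapply(toy_tweets, classify_polarity_toy, character(1), USE.NAMES = FALSE)
polarity
```

A rule like this would explain why apology replies land in the negative bucket: words such as "sorry" and "poor" count against the tweet even though the reply itself is polite.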

## Visualizing Sentiment

Let's examine the sentiment over time.

```{r}
ggplot(data=result) + geom_smooth(aes(x=date, y=score))+xlab("Month")+ylab("Score")+ggtitle("@BlizzHeroes sentiment over time")+theme(plot.title = element_text(hjust = 0.5))
```

Did Blizzard get a case of the holiday blues? Their sentiment decreases around December. Most likely this is due to employees (social media managers) taking a break (and sending fewer positively flagged tweets).
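
The daily score behind this plot is a plain sum of +1, 0, and -1 per tweet, so it depends on how many tweets were sent as well as on their tone. A small base R sketch on invented data (the `toy` data frame and `score_map` are mine) compares the sum with a per-day mean:

```r
toy <- data.frame(
  date = as.Date(c("2016-12-01", "2016-12-01", "2016-12-02", "2016-12-02", "2016-12-02")),
  polarity = c("positive", "positive", "negative", "neutral", "positive"),
  stringsAsFactors = FALSE
)

# Map each label to a numeric score, as in the sentiment chunk
score_map <- c(negative = -1, neutral = 0, positive = 1)
toy$score <- unname(score_map[toy$polarity])

daily_sum  <- aggregate(score ~ date, data = toy, FUN = sum)
daily_mean <- aggregate(score ~ date, data = toy, FUN = mean)
daily_sum
daily_mean
```

If the December dip really comes from fewer tweets rather than a change in tone, a mean-based version of the plot might look flatter. That is only a suggestion for a follow-up check, not something I have run on the real data.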


Finally, we visualize the sentiments.

```{r}
ggplot(sentiments, aes(x=polarity)) +
  geom_bar(aes(y=..count.., fill=polarity)) +
  scale_fill_brewer(palette="RdGy") +
  labs(x="Polarity", y="Number of Tweets") +
  ggtitle("Sentiment Analysis of @BlizzHeroes")+
  theme(plot.title = element_text(hjust = 0.5))
```

Overall, \@BlizzHeroes is a very positive Twitter user. Great job, \@BlizzHeroes!
