PokemonGo has been a widely popular augmented reality game since its launch. I want to look at recent tweets about PokemonGo and find the words people use most often about it on Twitter and the most popular Pokemon characters, and see whether we can pick out people’s sentiments and some meaningful topics. I used R for this fun project, along with several packages such as tm, sentiment, topicmodels, wordcloud and Rgraphviz. So let’s get started.

  1. Getting the data

Download the 1000 most recent tweets containing the word ‘#pokemongo’ using the Twitter API and the twitteR package, and create a data frame.

library(twitteR)
setup_twitter_oauth(consumer_key= 'myKey', consumer_secret= 'secret', access_token='myToken', access_secret='mySecret')
ntweets <- 1000 # number of tweets to extract
tweets <- searchTwitter(searchString = "pokemongo", n = ntweets, lang = "en", resultType = 'recent') # most recent tweets that contain the search term
tweets.df <- twListToDF(tweets) # create a data frame
head(tweets.df$text, 5)
## [1] "A wild Seadra appeared! It will be 31 meters from LDS Hospital until 1:32 PM. #PokemonGO #SLC https://t.co/EuRldr93N2"                       
## [2] "RT @NintendoNYC: The #PokemonGO Plus accessory will be avail. for purchase on 9/16 @ #NintendoNYC while supplies last, one per guest. https…"
## [3] "RT @ScufGaming: When your phone dies and you can't play #PokemonGo: https://t.co/01Gdy2wE2J"                                                 
## [4] "We're luring the best. Come catch tons of #Pokemon at North East Mall today from 3-5 PM. #PokemonGO https://t.co/XNoNTo28gu"                 
## [5] "Wen u pray pokomin go evry dae  #PokemonGO #Trump2016 https://t.co/66aDYngBd2"
  2. Cleaning the tweets

This is the lengthiest step of the process. We first expand contractions, so that ‘wouldn’t’ becomes ‘would not’. Then we remove URLs, mentions beginning with ‘@’, the retweet abbreviation ‘RT’, and any non-alphabetic characters such as digits, ‘#’ and emoticons, and convert the text to lower case.

library(magrittr)
library(stringr)
library(dplyr)
# expand common contractions
contrct.substitutes <- data.frame(cont = c("n't", "'ll", "'ve", "'d", "'s", "'m", "'re", "'em"), expan = c(" not", " will", " have", " had", " is", " am", " are", " them"), stringsAsFactors = FALSE)
# apply each substitution in turn to every tweet
for (i in seq_len(nrow(contrct.substitutes))) {
  tweets.df$text <- gsub(contrct.substitutes$cont[i], contrct.substitutes$expan[i], tweets.df$text, ignore.case = TRUE)
}

tweets.df$text <- tweets.df$text %>%
  str_replace_all("[^[:graph:]]", " ") %>%          # Remove all nongraphical characters 
  str_replace_all("http[^[:space:]]*", " ") %>%     # Remove URLs
  str_replace_all("@\\S+", " ") %>%                 # Remove mentions
  str_replace_all("\\b[Rr][Tt]\\b", " ") %>%        # Remove the standalone token RT
  str_replace_all("[^[:alpha:][:space:]]+", "") %>% # Remove digits, punctuation and other non-letter characters
  tolower()                                         # Convert to lowercase (tolower is vectorized)

After the above process the tweets look like this.

head(tweets.df$text, 5)
## [1] "a wild seadra appeared it will be  meters from lds hospital until  pm pokemongo slc  "                           
## [2] "    the pokemongo plus accessory will be avail for purchase on   nintendonyc while supplies last one per guest  "
## [3] "    when your phone dies and you ca not play pokemongo  "                                                        
## [4] "we are luring the best come catch tons of pokemon at north east mall today from  pm pokemongo  "                 
## [5] "wen u pray pokomin go evry dae  pokemongo trump  "
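
The third cleaned tweet shows a side effect of the generic ‘n't’ rule: ‘can't’ becomes ‘ca not’. One possible refinement, sketched here in base R, is to handle a few irregular contractions before the generic rule. The helper name expand_contractions and the list of irregular forms are my own assumptions, not part of the pipeline above.

```r
# Hypothetical helper: expand irregular contractions first, then the generic "n't".
expand_contractions <- function(x) {
  # irregular forms (assumed additions, not in the original substitution table)
  x <- gsub("\\bcan't\\b", "can not", x, ignore.case = TRUE)
  x <- gsub("\\bwon't\\b", "will not", x, ignore.case = TRUE)
  x <- gsub("\\bshan't\\b", "shall not", x, ignore.case = TRUE)
  # generic rule, same as in the substitution table above
  gsub("n't", " not", x, ignore.case = TRUE)
}

examples <- c("you can't play", "it won't load", "I wouldn't mind")
expanded <- expand_contractions(examples)
print(expanded)
```

Real tweets may also use a curly apostrophe, which neither this sketch nor the table above handles.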

We can see that the tweets look much cleaner, although some extra whitespace has appeared. We will take care of that soon enough. We continue the cleaning process using the ‘tm’ package and SnowballC which we need for stemming.

library(tm)
library(SnowballC) # helps with stemming the document

# build a corpus, and specify the source to be character vectors
tweetsCorpus <- Corpus(VectorSource(tweets.df$text))
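
If the leftover whitespace is a concern, it could also be collapsed on the character vector before the corpus is built. A minimal base R sketch follows; the helper name squish is made up for illustration.

```r
# Hypothetical helper: collapse runs of whitespace and trim both ends.
squish <- function(x) trimws(gsub("[[:space:]]+", " ", x))

raw <- "    when your phone dies and you ca not play pokemongo  "
tidy <- squish(raw)
print(tidy)
```

The tm package also ships a stripWhitespace transformation that can be passed to tm_map for the same purpose.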

Now we remove English stop words as well as ‘pokemon’ and ‘pokemongo’. We also perform stemming and stem completion. Stemming reduces inflected forms such as ‘luring’, ‘lures’ and ‘lured’ to a common stem. Stem completion then maps that stem back to a full word such as ‘lure’ or ‘lured’, depending on which form occurs most often in the tweets corpus.

# remove stopwords
myStopwords <- c(stopwords('english'), "pokemongo", "pokemon", "pokemon go")
tweetsCorpus <- tm_map(tweetsCorpus, removeWords, myStopwords, lazy = T)

dict.Corpus <- tweetsCorpus

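# stemming and stem completion
# loop over the tweets actually in the corpus (the search may return fewer than ntweets)
for (i in seq_len(length(tweetsCorpus))) {
  tweetsCorpus[[i]]$content <- stemDocument(tweetsCorpus[[i]]$content)
}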

tweetsCorpus.tokenized <- lapply(tweetsCorpus, scan_tokenizer) # gives a list of tweets, each tweet a character vector
corpus.stemcomplete <- lapply(tweetsCorpus.tokenized, stemCompletion, dict.Corpus) # list of character vectors
cleanedtweets.Vector <- sapply(corpus.stemcomplete, paste, collapse =" ")

tweetsCorpus <- Corpus(VectorSource(cleanedtweets.Vector))
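
To see stemming and stem completion in isolation, here is a small sketch on three words. It assumes the tm and SnowballC packages are installed. The toy dictionary is made up, and I have left out expected output because the exact stems depend on the stemmer.

```r
library(tm)        # stemCompletion()
library(SnowballC) # wordStem()

words <- c("luring", "lures", "lured")

# reduce each word to its stem with the English Snowball stemmer
stems <- wordStem(words, language = "english")

# toy dictionary (invented for illustration) to complete the stems against
dictionary <- c("lure", "lure", "lured", "luring")
completed <- stemCompletion(stems, dictionary, type = "prevalent")

print(stems)
print(completed)
```

In the pipeline above the dictionary is the unstemmed corpus dict.Corpus rather than a hand-written vector.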

Let’s look at the final cleaned tweets which we will use for our analysis.

## [[1]]wild seadra appeared will meters lds hospital pm slc
## 
## [[2]]plus accessory will avail purchase nintendonyc supplies last one per
## guest
## 
## [[3]]phone dies ca play
## 
## [[4]]luring best come catch tons north east mall today pm
## 
## [[5]]wen u pray pokomin go evry dae trump
  3. Word Frequency Analysis

The first thing we want to look at is the most frequently used words. We use the term-document matrix from the tm package, together with the ggplot2 and wordcloud packages, to do this.
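
As a reminder of what a term-document matrix holds, here is a rough base R sketch on three made-up tweets: one row per term, one column per document, and each cell counts how often the term occurs in that document. The object names (docs, tdm.toy, freq.toy) are invented for this illustration.

```r
# three invented, already-cleaned tweets
docs <- c("wild seadra appeared", "wild pikachu appeared", "catch pikachu")

tokens <- strsplit(docs, " ", fixed = TRUE)
terms <- sort(unique(unlist(tokens)))

# count each term per document; rows are terms, columns are documents
tdm.toy <- sapply(tokens, function(tok) tabulate(match(tok, terms), nbins = length(terms)))
dimnames(tdm.toy) <- list(Terms = terms, Docs = paste0("tweet", seq_along(docs)))

# overall term frequency is the row sum, as in the real analysis
freq.toy <- sort(rowSums(tdm.toy), decreasing = TRUE)

print(tdm.toy)
print(freq.toy)
```

The real matrix is stored in sparse form by tm, which is why the analysis converts it with as.matrix before calling rowSums.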

library(wordcloud)
library(data.table)
library(ggplot2)
# Term-document matrix (terms in rows, tweets in columns); keep words of any length
tdm <- TermDocumentMatrix(tweetsCorpus, control = list(wordLengths = c(1, Inf)))

# create a data table with words and freq in decreasing order
freq <- rowSums(as.matrix(tdm))
word <-names(freq)
names(freq) <- NULL
unigm.dt <- data.table(word, freq) %>%
  arrange(desc(freq))

# Word cloud of the most frequently occurring words in the tweets
wordcloud(unigm.dt$word, unigm.dt$freq, min.freq = 10, random.order = FALSE, colors = brewer.pal(6, "Dark2"))

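# Popular Pokemon
# pokemonChar must be a lower-case character vector of Pokemon names, because the terms in tdm are lower case
# (the line below assumes the file holds one name per line)
# pokemonChar <- tolower(read.csv("~/pokemongoCharacterList.txt", stringsAsFactors = FALSE, header = FALSE)[[1]])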

idx <- which(dimnames(tdm)$Terms %in% pokemonChar)
tdm.pokemonChar <- tdm[idx, ]
pokemonFreq <- rowSums(as.matrix(tdm[idx,]))
pokemonNames <- names(pokemonFreq)
names(pokemonFreq) <- NULL
pokemons <- data.frame(pokemonNames, pokemonFreq)  %>%
  arrange(desc(pokemonFreq))

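# top 20 names, ordered by frequency; x holds the names, so the labels follow the aesthetics
ggplot(head(pokemons, 20), aes(reorder(pokemonNames, pokemonFreq), pokemonFreq)) + geom_bar(stat = "identity") + coord_flip() + xlab("PokeName") + ylab("Name Frequency")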

If we look at the word cloud of words that occur at least 10 times in the tweets, we find common Pokemon vocabulary such as ‘lure’, ‘spawned’, ‘catch’, ‘gym’, ‘egg’, ‘evolve’ and ‘cp’. The most frequently used words are ‘wild’, ‘appeared’, ‘will’, ‘pm’, ‘pokecoins’ and ‘need’. The most popular Pokemon is ‘Pikachu’ (of course), followed by ‘Rattata’ and ‘Tentacruel’.

  4. Associations and Topic Modeling

library(graph)
library(Rgraphviz)
library(topicmodels)

We can look at word correlations to find the words most strongly associated with a particular word. For instance, let’s try ‘lure’. We see expected words like ‘modules’, ‘rewards’, ‘types’, ‘different’, ‘popular’ and ‘idea’. After that we see an interesting network graph of associated terms, and some topics based on how likely the words are to appear together in a tweet.

findAssocs(tdm, "lure", .2)
## $lure
##          modules          rewards            types           unique 
##             0.84             0.84             0.84             0.84 
##        different              add             give             idea 
##             0.78             0.74             0.70             0.70 
##          niantic           demand           ottawa          popular 
##             0.59             0.45             0.45             0.45 
##            party          weekend               ed         sullivan 
##             0.36             0.36             0.31             0.31 
##          tossing             back          another lateshowlinecrew 
##             0.31             0.25             0.22             0.22 
##         thinking 
##             0.22
# Network of Terms
# head() avoids NA entries when fewer than 25 terms reach the frequency cutoff
plot(tdm, head(findFreqTerms(tdm, lowfreq = 20), 25), corThreshold = .05, weighting = TRUE)
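
The association scores reported by findAssocs are correlations between the rows of the term-document matrix, keeping only terms at or above the given limit (0.2 above). A rough base R sketch of the same idea on a made-up matrix follows; the counts are invented and are not taken from the tweets.

```r
# toy term-document matrix: rows are terms, columns are five invented tweets
m <- rbind(
  lure    = c(1, 1, 0, 0, 1),
  modules = c(1, 1, 0, 0, 1),
  gym     = c(0, 1, 1, 1, 0)
)

# correlate every term with every other term across documents
term.cor <- cor(t(m))

# associations of "lure" with the remaining terms, rounded to two digits
assoc <- term.cor["lure", ]
assoc <- round(assoc[names(assoc) != "lure"], 2)
print(assoc)

# keep only terms at or above a chosen limit, as findAssocs does
print(assoc[assoc >= 0.2])
```

Here ‘modules’ occurs in exactly the same tweets as ‘lure’, so it correlates perfectly, while ‘gym’ mostly occurs in the other tweets and drops out.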

#### Topic Modeling

dtm <- as.DocumentTermMatrix(tdm)
lda <- LDA(dtm, 5)
term <- terms(lda, 6)
term <- apply(term, MARGIN = 2, paste, collapse = ", ")

topics <- topics(lda) # 1st topic identified for every document (tweet)
topics <- data.frame(date=as.IDate(tweets.df$created), topic=topics)
suppressWarnings(
  qplot(date, ..count.., data=topics, geom="density", fill=term[topic], position="stack")
)
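
The call topics(lda) keeps only the single most likely topic for each tweet. A minimal sketch of that step on a made-up matrix of per-tweet topic probabilities is shown here; the numbers are invented for illustration and are not output from the model above.

```r
# invented topic probabilities: one row per tweet, one column per topic
gamma.toy <- rbind(
  tweet1 = c(0.70, 0.20, 0.10),
  tweet2 = c(0.15, 0.25, 0.60),
  tweet3 = c(0.30, 0.50, 0.20)
)

# pick the column (topic) with the highest probability in each row
top.topic <- apply(gamma.toy, 1, which.max)
print(top.topic)
```

For a fitted model, the corresponding probabilities should be available from posterior(lda)$topics in the topicmodels package.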

  5. Sentiment Analysis

Using the Sentiment140 package (loaded as sentiment), let us examine the polarity of the tweets. We see that tweets about PokemonGo are pretty much neutral. I guess people are more interested in catching Pokemon than in talking about their feelings.

library(sentiment)
sentiments <- sentiment(tweets.df$text)
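# count tweets per polarity class (named so that it does not mask base::t)
polarity.counts <- table(sentiments$polarity)
ggplot(as.data.frame(polarity.counts), aes(Var1, Freq)) + geom_bar(stat = 'identity', fill = 'tomato', color = 'black') + xlab("Polarity") + ylab("Tweets Count")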
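
To put a number on ‘pretty much neutral’, the counts can be turned into percentage shares. A small sketch on a made-up polarity vector follows; the labels and their mix are invented and are not the results of this analysis.

```r
# invented polarity labels standing in for sentiments$polarity
polarity.toy <- c("neutral", "neutral", "positive", "neutral",
                  "negative", "neutral", "positive", "neutral")

counts <- table(polarity.toy)
shares <- round(100 * prop.table(counts), 1) # percentage of tweets per class
print(shares)
```

Applied to the real table of polarity counts, the same prop.table call would give the share of neutral tweets directly.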