PokemonGo has been a widely popular augmented reality game since its launch. I want to look at recent tweets about PokemonGo and find out the words most frequently used about it on Twitter, the most popular Pokemon characters, and whether we can figure out people’s sentiments and some meaningful topics. I used R for this fun project, along with several packages such as twitteR, tm, sentiment, topicmodels, wordcloud and Rgraphviz. So let’s get started.
library(twitteR)
setup_twitter_oauth(consumer_key= 'myKey', consumer_secret= 'secret', access_token='myToken', access_secret='mySecret')
ntweets <- 1000 # number of tweets to extract
tweets <- searchTwitter(searchString= "pokemongo", n=ntweets, lang="en", resultType = 'recent') # most recent tweets that contain the search term
tweets.df <- twListToDF(tweets) # create a data frame
head(tweets.df$text, 5)
## [1] "A wild Seadra appeared! It will be 31 meters from LDS Hospital until 1:32 PM. #PokemonGO #SLC https://t.co/EuRldr93N2"
## [2] "RT @NintendoNYC: The #PokemonGO Plus accessory will be avail. for purchase on 9/16 @ #NintendoNYC while supplies last, one per guest. https…"
## [3] "RT @ScufGaming: When your phone dies and you can't play #PokemonGo: https://t.co/01Gdy2wE2J"
## [4] "We're luring the best. Come catch tons of #Pokemon at North East Mall today from 3-5 PM. #PokemonGO https://t.co/XNoNTo28gu"
## [5] "Wen u pray pokomin go evry dae #PokemonGO #Trump2016 https://t.co/66aDYngBd2"
This is the lengthiest step of the process. We first expand contractions, so that ‘wouldn’t’ becomes ‘would not’. Then we remove URLs, mentions beginning with '@', the retweet abbreviation ‘RT’, and any numbers, non-English characters, and symbols such as ‘#’ or emoticons, and convert the text to lower case.
library(magrittr)
library(stringr)
library(dplyr)
# expand common contractions
contrct.substitutes <- data.frame(cont = c("n't", "'ll", "'ve", "'d", "'s", "'m", "'re", "'em"), expan = c(" not", " will", " have", " had", " is", " am", " are", " them"), stringsAsFactors = FALSE)
# note: "n't" turns "can't" into "ca not", and "'s" also expands possessives
for (i in seq_len(nrow(contrct.substitutes))) {
  tweets.df$text <- gsub(contrct.substitutes$cont[i], contrct.substitutes$expan[i], tweets.df$text, ignore.case = TRUE)
}
tweets.df$text <- tweets.df$text %>%
  str_replace_all("[^[:graph:]]", " ") %>% # Replace all non-graphical characters with a space
  str_replace_all("http[^[:space:]]*", " ") %>% # Remove URLs
  str_replace_all("@\\S+", " ") %>% # Remove mentions
  str_replace_all("\\b[Rr][Tt]\\b", " ") %>% # Remove the standalone token RT
  str_replace_all("[^[:alpha:][:space:]]", "") %>% # Remove everything except letters and whitespace
  tolower() # Convert to lowercase
After the above process the tweets look like this.
head(tweets.df$text, 5)
## [1] "a wild seadra appeared it will be meters from lds hospital until pm pokemongo slc "
## [2] " the pokemongo plus accessory will be avail for purchase on nintendonyc while supplies last one per guest "
## [3] " when your phone dies and you ca not play pokemongo "
## [4] "we are luring the best come catch tons of pokemon at north east mall today from pm pokemongo "
## [5] "wen u pray pokomin go evry dae pokemongo trump "
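To make the individual cleaning steps easier to follow, here is a minimal base-R sketch that applies the same kind of substitutions to a single tweet. The helper name `clean_tweet` is made up for this illustration, it only expands the "n't" contraction, and it also squeezes the extra whitespace, which the pipeline above leaves for later.

```r
# Hypothetical helper, not part of the original pipeline: cleans one or more tweets.
clean_tweet <- function(x) {
  x <- gsub("n't", " not", x, ignore.case = TRUE)    # expand one contraction type
  x <- gsub("http[^[:space:]]*", " ", x)             # remove URLs
  x <- gsub("@[^[:space:]]+", " ", x)                # remove mentions
  x <- gsub("\\b[Rr][Tt]\\b", " ", x, perl = TRUE)   # remove the standalone token RT
  x <- gsub("[^[:alpha:][:space:]]", "", x)          # keep only letters and whitespace
  x <- tolower(x)                                    # convert to lowercase
  gsub("[[:space:]]+", " ", trimws(x))               # squeeze repeated whitespace
}

example <- "RT @ScufGaming: When your phone dies and you can't play #PokemonGo: https://t.co/01Gdy2wE2J"
cleaned <- clean_tweet(example)
print(cleaned)
```

Note that ‘can’t’ still comes out as ‘ca not’, just as in the output above. Handling ‘can’t’ as a special case before the generic "n't" rule would be one way around that.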
We can see that the tweets look much cleaner, although some extra whitespace has appeared. We will take care of that soon enough. We continue the cleaning process using the ‘tm’ package and SnowballC which we need for stemming.
library(tm)
library(SnowballC) # helps with stemming the document
# build a corpus, and specify the source to be character vectors
tweetsCorpus <- Corpus(VectorSource(tweets.df$text))
Now we remove English stop words, along with ‘pokemon’ and ‘pokemongo’. We also perform stemming and stem completion. Stemming reduces words such as ‘luring’, ‘lures’ and ‘lured’ to a common stem such as ‘lur’. Stem completion then replaces the stem with ‘lure’, ‘lured’, etc., depending on which form occurs most often in the tweets corpus.
# remove stopwords
myStopwords <- c(stopwords('english'), "pokemongo", "pokemon", "pokemon go")
tweetsCorpus <- tm_map(tweetsCorpus, removeWords, myStopwords, lazy = T)
dict.Corpus <- tweetsCorpus
# stemming and stem completion words
for (i in seq_along(tweetsCorpus)){ # the search may return fewer than ntweets tweets
  tweetsCorpus[[i]]$content = stemDocument(tweetsCorpus[[i]]$content)
}
tweetsCorpus.tokenized <- lapply(tweetsCorpus, scan_tokenizer) # gives a list of tweets, each tweet a character vector
corpus.stemcomplete <- lapply(tweetsCorpus.tokenized, stemCompletion, dict.Corpus) # list of character vectors
cleanedtweets.Vector <- sapply(corpus.stemcomplete, paste, collapse =" ")
tweetsCorpus <- Corpus(VectorSource(cleanedtweets.Vector))
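The idea behind stem completion can be illustrated with a toy base-R sketch. This is not how tm implements `stemCompletion`; it is only an assumption-laden illustration of "pick the most frequent dictionary word that starts with the stem", and both the function name `complete_stem` and the dictionary are made up.

```r
# Toy stem completion: return the most frequent dictionary word starting with the stem.
complete_stem <- function(stem, dictionary) {
  candidates <- dictionary[startsWith(dictionary, stem)]
  if (length(candidates) == 0) return(stem)  # nothing to complete with, keep the stem
  names(which.max(table(candidates)))        # most frequent candidate
}

# Made-up dictionary of unstemmed words, with repeats standing in for frequency.
dictionary <- c("lure", "lured", "lure", "luring", "catch", "catching", "catch")

completed <- sapply(c("lur", "catch", "pikachu"), complete_stem, dictionary = dictionary)
print(completed)
```

Here the stem ‘lur’ is completed to ‘lure’ because that form occurs most often in the toy dictionary, while ‘pikachu’ has no match and is returned unchanged.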
Let’s look at the final cleaned tweets which we will use for our analysis.
## [[1]]wild seadra appeared will meters lds hospital pm slc
##
## [[2]]plus accessory will avail purchase nintendonyc supplies last one per
## guest
##
## [[3]]phone dies ca play
##
## [[4]]luring best come catch tons north east mall today pm
##
## [[5]]wen u pray pokomin go evry dae trump
The first thing we want to look at is the most frequently used words. We use the term-document matrix from the tm package, together with the ggplot2 and wordcloud packages, to do this.
library(wordcloud)
library(data.table)
library(ggplot2)
# Term-document matrix (terms in rows, tweets in columns)
tdm <- TermDocumentMatrix(tweetsCorpus, control = list(wordLengths = c(1, Inf)))
# create a data table with words and freq in decreasing order
freq <- rowSums(as.matrix(tdm))
word <- names(freq)
names(freq) <- NULL
unigm.dt <- data.table(word, freq) %>%
  arrange(desc(freq))
# Word cloud of the most frequently occurring words in the tweets
wordcloud(unigm.dt$word, unigm.dt$freq, min.freq = 10, random.order = F, colors= brewer.pal(6, "Dark2"))
# Popular Pokemon
# pokemonChar is a character vector of PokemonGo character names
#pokemonChar <- read.csv("~/pokemongoCharacterList.txt", stringsAsFactors = F, header = F)[[1]] # first column as a character vector
idx <- which(dimnames(tdm)$Terms %in% pokemonChar)
tdm.pokemonChar <- tdm[idx, ]
pokemonFreq <- rowSums(as.matrix(tdm[idx,]))
pokemonNames <- names(pokemonFreq)
names(pokemonFreq) <- NULL
pokemons <- data.frame(pokemonNames, pokemonFreq) %>%
  arrange(desc(pokemonFreq))
ggplot(pokemons[1:20,], aes(pokemonNames, pokemonFreq)) + geom_bar(stat = "identity") + coord_flip() + xlab("PokeName") + ylab("Name Frequency")
If we look at the word cloud of words that occur at least 10 times in the tweets, we can find common Pokemon vocabulary such as ‘lure’, ‘spawned’, ‘catch’, ‘gym’, ‘egg’, ‘evolve’, ‘cp’, etc. The most frequently used words are ‘wild’, ‘appeared’, ‘will’, ‘pm’, ‘pokecoins’ and ‘need’. The most popular Pokemon is ‘Pikachu’ (of course), followed by ‘Rattata’ and ‘Tentacruel’.
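The counting behind the word cloud and the bar chart can be sketched in base R without tm. The tweets and the character list below are made up for illustration, and the names `word.freq` and `pokemon.freq` are hypothetical.

```r
# Made-up cleaned tweets standing in for the real corpus.
tweets <- c("wild seadra appeared", "wild pikachu appeared", "catch pikachu today")

# Split every tweet on whitespace and count how often each word occurs.
tokens <- unlist(strsplit(tweets, "[[:space:]]+"))
freq <- sort(table(tokens), decreasing = TRUE)
word.freq <- data.frame(word = names(freq), freq = as.integer(freq), stringsAsFactors = FALSE)

# Hypothetical character list; keep only the words that are Pokemon names.
pokemonChar <- c("pikachu", "seadra", "rattata")
pokemon.freq <- word.freq[word.freq$word %in% pokemonChar, ]
print(pokemon.freq)
```

In this toy data ‘pikachu’ occurs twice and ‘seadra’ once, while ‘rattata’ never appears and so drops out of the result.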
library(graph)
library(Rgraphviz)
library(topicmodels)
We can look at word correlations to find the words most strongly associated with a particular word. For instance, let’s try ‘lure’. We see expected words like ‘modules’, ‘rewards’, ‘types’, ‘different’, ‘popular’ and ‘idea’. After that we see an interesting graph of the network of associated terms, followed by some topics, which are based on how likely the words are to appear together in a tweet.
findAssocs(tdm, "lure", .2)
## $lure
##          modules          rewards            types           unique
##             0.84             0.84             0.84             0.84
##        different              add             give             idea
##             0.78             0.74             0.70             0.70
##          niantic           demand           ottawa          popular
##             0.59             0.45             0.45             0.45
##            party          weekend               ed         sullivan
##             0.36             0.36             0.31             0.31
##          tossing             back          another lateshowlinecrew
##             0.31             0.25             0.22             0.22
##         thinking
##             0.22
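The association scores above are correlations between term occurrence vectors across tweets. A minimal base-R sketch of that idea, on a made-up term-document matrix, looks like this (the matrix `tdm.toy` and its counts are invented, and 0.2 mirrors the lower limit used above):

```r
# Made-up term-document matrix: rows are terms, columns are five tweets.
tdm.toy <- rbind(
  lure    = c(1, 1, 0, 0, 1),
  modules = c(1, 1, 0, 0, 1),
  gym     = c(0, 1, 1, 0, 0),
  egg     = c(0, 0, 1, 1, 0)
)

# Correlate every term with 'lure' across the tweets, then drop 'lure' itself.
assoc <- cor(t(tdm.toy))["lure", ]
assoc <- assoc[names(assoc) != "lure"]

# Keep only the terms whose correlation reaches the lower limit.
strong <- sort(assoc[assoc >= 0.2], decreasing = TRUE)
print(round(strong, 2))
```

Only ‘modules’ survives the cut here, because it appears in exactly the same tweets as ‘lure’; ‘gym’ and ‘egg’ are negatively correlated with it.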
# Network of Terms
plot(tdm, findFreqTerms(tdm, lowfreq = 20)[1:25], corThreshold = .05, weighting = T)
#### Topic Modeling
dtm <- as.DocumentTermMatrix(tdm)
lda <- LDA(dtm, 5)
term <- terms(lda, 6)
term <- apply(term, MARGIN = 2, paste, collapse = ", ")
topics <- topics(lda) # 1st topic identified for every document (tweet)
topics <- data.frame(date=as.IDate(tweets.df$created), topic=topics)
suppressWarnings(
qplot(date, ..count.., data=topics, geom="density", fill=term[topic], position="stack")
)
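One practical caveat: as far as I know, `LDA()` refuses a document-term matrix that contains a row with no terms, which can happen when cleaning removes every word of a tweet. A small base-R sketch of filtering out such rows, while keeping the dates aligned with the remaining tweets, is shown below. The matrix and dates are made up, and a plain matrix stands in for the real `dtm` object.

```r
# Toy document-term counts (made up): rows are tweets, columns are terms.
dtm.toy <- rbind(
  tweet1 = c(lure = 1, gym = 0, egg = 2),
  tweet2 = c(lure = 0, gym = 0, egg = 0),  # everything was removed during cleaning
  tweet3 = c(lure = 0, gym = 3, egg = 0)
)
created <- as.Date(c("2016-09-10", "2016-09-10", "2016-09-11"))  # hypothetical dates

# Keep only tweets with at least one remaining term, and subset the dates the same way.
keep <- rowSums(dtm.toy) > 0
dtm.nonempty <- dtm.toy[keep, , drop = FALSE]
created.nonempty <- created[keep]

print(rownames(dtm.nonempty))
```

Subsetting the dates with the same `keep` vector matters because the topic assigned to each tweet is later paired with its creation date.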
Using the Sentiment140 package (loaded as ‘sentiment’), let us examine the polarity of the tweets. We see that tweets about PokemonGo are pretty much neutral. I guess people are more interested in catching Pokemon than in talking about their feelings.
library(sentiment)
sentiments <- sentiment(tweets.df$text)
polarity.counts <- table(sentiments$polarity) # named so that base::t() is not masked
ggplot(as.data.frame(polarity.counts), aes(Var1, Freq)) + geom_bar(stat = 'identity', fill = 'tomato', color = 'black') + xlab("Polarity") + ylab("Tweets Count")
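To put a number on "pretty much neutral", the polarity counts can be turned into percentage shares. The sketch below uses a made-up polarity vector rather than the real `sentiments$polarity` column, so the figures are purely illustrative.

```r
# Made-up polarity labels standing in for sentiments$polarity.
polarity <- c("neutral", "neutral", "positive", "neutral",
              "negative", "neutral", "neutral", "positive")

# Share of each polarity class, as a percentage of all tweets.
share <- round(100 * prop.table(table(polarity)), 1)
print(share)
```

With the real data, the same two lines applied to `sentiments$polarity` would show what share of the tweets falls into each class.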