Wed Nov 25 15:30:24 2015
Below is a brief foretaste, or prototype, of Twitter topic modelling. I only conduct a graphical analysis here; I save the full-fledged algorithmic implementation for another day.
Here are all the packages and options necessary to reproduce this analysis.
# Packages for the Twitter API, text mining, stemming, and word clouds
suppressMessages(library(twitteR))
suppressMessages(library(bit64))
suppressMessages(library(tm))
suppressMessages(library(RCurl))
suppressMessages(library(SnowballC))
suppressMessages(library(wordcloud))

# Avoid scientific notation, silence warnings, and cache chunks so the
# Twitter API is not queried on every knit
options("scipen"=100, "digits"=4)
options(warn=-1)
knitr::opts_chunk$set(cache=TRUE)
We are going to pull tweets about the US president to see what people are saying about him at the moment. We will pull the 3,000 most recent tweets. The US president's Twitter handle is @POTUS.
# Authenticate with the Twitter API; consumerKey, consumerSecret, accessToken,
# and accessSecret hold my credentials and are defined outside this document
setup_twitter_oauth(consumer_key = consumerKey, consumer_secret = consumerSecret,
                    access_token = accessToken, access_secret = accessSecret)

# Pull the 3000 most recent English-language tweets mentioning @POTUS
twitts <- searchTwitter("@POTUS", n=3000, lang = "en", resultType = "recent")
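Before cleaning anything, it is worth checking how many tweets actually came back, since the API often returns fewer than requested. A minimal sketch using twitteR's twListToDF() helper to get a data-frame view of the raw statuses:
# How many tweets did we actually get, and what do the raw texts look like?
length(twitts)
twitts.df <- twListToDF(twitts)
head(twitts.df$text)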
Now that we have the tweets about the US president, they need to be cleaned and transformed before they can be analyzed. The functions below achieve that goal. Further explanation of what each of these steps does can be given upon request.
# Extract the raw text from each status object
twitts.txt <- sapply(twitts, function(x) x$getText())

# Helper to strip URLs from a tweet
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)

# Build a corpus and clean it: fix the encoding, lower-case the text, then
# drop punctuation, URLs, stop words, a few filler terms, numbers, and
# extra whitespace
clean.twitts <- Corpus(VectorSource(twitts.txt))
clean.twitts <- tm_map(clean.twitts, content_transformer(function(x) iconv(x, to='UTF-8', sub='byte')),
                       mc.cores=8)
clean.twitts <- tm_map(clean.twitts, content_transformer(tolower), mc.cores=8)
clean.twitts <- tm_map(clean.twitts, removePunctuation, mc.cores=8)
clean.twitts <- tm_map(clean.twitts, content_transformer(removeURL), mc.cores=8)
clean.twitts <- tm_map(clean.twitts, function(x) removeWords(x, stopwords("english")), mc.cores=8)
clean.twitts <- tm_map(clean.twitts, function(x) removeWords(x, c("potus","via","amp","will")), mc.cores=8)
clean.twitts <- tm_map(clean.twitts, function(x) removeNumbers(x), mc.cores=8)
clean.twitts <- tm_map(clean.twitts, function(x) stripWhitespace(x), mc.cores=8)
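Before plotting, a quick spot check on a few cleaned documents helps confirm the transformations behaved as expected (this assumes the documents are still plain tm text documents after the steps above):
# Print the first few cleaned tweets
for (i in 1:3) print(as.character(clean.twitts[[i]]))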
Now that we have cleaned our Twitter data, we can graphically display the most common words people on Twitter use in relation to the US president. As one can see, "white house" is the most common term related to @POTUS, although one could argue that I should have removed it before constructing the word cloud. The second term is "morningjoe", one of the most popular morning news shows in the US. We also see terms like ISIS, medal of freedom, turkey, etc. Tomorrow is Thanksgiving Day in the US, so all of these make sense.
wordcloud(clean.twitts, scale = c(2,0.1), colors = rainbow(50), min.freq = 7)
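The word cloud is purely visual. To put numbers behind it, one can tabulate term frequencies from a term-document matrix; a minimal sketch (the frequency threshold of 30 is an arbitrary choice):
# Count how often each term appears across the tweets and list the most frequent ones
tdm <- TermDocumentMatrix(clean.twitts)
term.freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
head(term.freq, 20)                 # top 20 terms and their counts
findFreqTerms(tdm, lowfreq = 30)    # terms appearing at least 30 times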
In Twitter parlance, this is also called a user timeline. We are going to pull up to 3,000 tweets by the US president himself to see the most popular topics he tweets about. A note of caution: user timelines belonging to protected users may only be requested when the authenticated user either "owns" the timeline or is an approved follower of the owner.
# Pull up to 3000 tweets from the @POTUS timeline itself
twitts <- userTimeline("@POTUS", n=3000)
twitts.txt <- sapply(twitts, function(x) x$getText())

# Helpers: strip URLs, and drop short words (three letters or fewer).
# removeShortWords is defined in anticipation of the algorithmic step
# described at the end but is not applied in this pipeline.
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
removeShortWords <- function(x) gsub("\\b[[:alpha:]]{1,3}\\b", "", x)

# Same cleaning pipeline as before, with a longer list of filler terms to drop
clean.twitts <- Corpus(VectorSource(twitts.txt))
clean.twitts <- tm_map(clean.twitts, content_transformer(function(x) iconv(x, to='UTF-8', sub='byte')), mc.cores=8)
clean.twitts <- tm_map(clean.twitts, content_transformer(tolower), mc.cores=8)
clean.twitts <- tm_map(clean.twitts, removePunctuation, mc.cores=8)
clean.twitts <- tm_map(clean.twitts, content_transformer(removeURL), mc.cores=8)
clean.twitts <- tm_map(clean.twitts, function(x) removeWords(x, stopwords("english")), mc.cores=8)
clean.twitts <- tm_map(clean.twitts, function(x) removeWords(x, c("potus","via","amp","will","today","every","can","lets","thats","just","years","year","weve","ever","people")), mc.cores=8)
clean.twitts <- tm_map(clean.twitts, function(x) removeNumbers(x), mc.cores=8)
clean.twitts <- tm_map(clean.twitts, function(x) stripWhitespace(x), mc.cores=8)
Based on the word cloud below, one can clearly see that the US president's most frequent term is "american". His interests include climate, health, refugees, change, congress, and so on.
wordcloud(clean.twitts, scale = c(2,0.1), colors = rainbow(50), min.freq = 2)
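To probe that reading a bit further, one could build a term-document matrix from the president's own tweets and look at which terms tend to co-occur with "american". A small sketch using tm's findAssocs(); the 0.2 correlation threshold is an arbitrary choice, and it assumes the term "american" survives the cleaning above:
# Which terms are most associated with "american" in the president's tweets?
tdm.potus <- TermDocumentMatrix(clean.twitts)
findFreqTerms(tdm.potus, lowfreq = 5)       # terms used at least 5 times
findAssocs(tdm.potus, "american", 0.2)      # terms correlated with "american"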
We have merely scratched the surface of what can be done with topic modeling. The next, and probably lengthier, step would be to take an algorithmic approach to this analysis. The data pre-processing pipeline would change to some degree because words would be stemmed, short words removed, and so on. What we would get at the end is a ranked, quantifiable metric of the events or interests associated with a given person's name. I hope this gives you a taste of what we can achieve; in fact, I have built similar applications in the past.
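As a foretaste of that algorithmic step, here is a minimal sketch of fitting an LDA topic model to the cleaned timeline corpus. It uses the topicmodels package, which is not used above and is therefore an assumption; the number of topics k and the seed are arbitrary choices.
# Stem the cleaned tweets, drop short words via the document-term matrix,
# and fit an LDA model; the result is a ranked list of terms per topic
suppressMessages(library(topicmodels))
stemmed <- tm_map(clean.twitts, stemDocument)
dtm <- DocumentTermMatrix(stemmed, control = list(wordLengths = c(4, Inf)))
dtm <- dtm[rowSums(as.matrix(dtm)) > 0, ]     # LDA needs documents with at least one term
lda.fit <- LDA(dtm, k = 5, control = list(seed = 1234))
terms(lda.fit, 10)    # top 10 terms for each of the 5 topics
topics(lda.fit, 1)    # most likely topic for each tweet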