Wed Nov 25 15:30:24 2015
Below is a brief foretaste, or prototype, of Twitter topic modelling. I only conduct a graphical analysis here; I save the full-fledged algorithmic implementation for another day.
Here are all the packages and options necessary to reproduce this analysis.
# Packages for the Twitter API, text mining, stemming, and word clouds
suppressMessages(library(twitteR))
suppressMessages(library(bit64))
suppressMessages(library(tm))
suppressMessages(library(RCurl))
suppressMessages(library(SnowballC))
suppressMessages(library(wordcloud))

# Avoid scientific notation, silence warnings, and cache chunks so the
# Twitter API is not queried on every knit
options("scipen"=100, "digits"=4)
options(warn=-1)
knitr::opts_chunk$set(cache=TRUE)
We are going to pull tweets about the US president to see what people are saying about him at the moment. We will pull the 3,000 most recent tweets. The US president's Twitter handle is @POTUS.
# Authenticate with the Twitter API; consumerKey, consumerSecret, accessToken,
# and accessSecret hold my credentials and are defined outside this document
setup_twitter_oauth(consumer_key = consumerKey, consumer_secret = consumerSecret,
                    access_token = accessToken, access_secret = accessSecret)

# Pull the 3000 most recent English-language tweets mentioning @POTUS
twitts <- searchTwitter("@POTUS", n=3000, lang = "en", resultType = "recent")
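Before cleaning anything, it is worth checking how many tweets actually came back, since the API often returns fewer than requested. A minimal sketch using twitteR's twListToDF() helper to get a data-frame view of the raw statuses:
# How many tweets did we actually get, and what do the raw texts look like?
length(twitts)
twitts.df <- twListToDF(twitts)
head(twitts.df$text)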
Now that we have the tweets about the US president, they need to be cleaned and transformed before they can be analyzed. The functions below achieve that goal. Further explanation of what each of these steps does can be given upon request.
# Extract the raw text from each status object
twitts.txt <- sapply(twitts, function(x) x$getText())

# Helper to strip URLs from a tweet
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)

# Build a corpus and clean it: fix the encoding, lower-case the text, then
# drop punctuation, URLs, stop words, a few filler terms, numbers, and
# extra whitespace
clean.twitts <- Corpus(VectorSource(twitts.txt))
clean.twitts <- tm_map(clean.twitts, content_transformer(function(x) iconv(x, to='UTF-8', sub='byte')),
                       mc.cores=8)
clean.twitts <- tm_map(clean.twitts, content_transformer(tolower), mc.cores=8)
clean.twitts <- tm_map(clean.twitts, removePunctuation, mc.cores=8)
clean.twitts <- tm_map(clean.twitts, content_transformer(removeURL), mc.cores=8)
clean.twitts <- tm_map(clean.twitts, function(x) removeWords(x, stopwords("english")), mc.cores=8)
clean.twitts <- tm_map(clean.twitts, function(x) removeWords(x, c("potus","via","amp","will")), mc.cores=8)
clean.twitts <- tm_map(clean.twitts, function(x) removeNumbers(x), mc.cores=8)
clean.twitts <- tm_map(clean.twitts, function(x) stripWhitespace(x), mc.cores=8)
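Before plotting, a quick spot check on a few cleaned documents helps confirm the transformations behaved as expected (this assumes the documents are still plain tm text documents after the steps above):
# Print the first few cleaned tweets
for (i in 1:3) print(as.character(clean.twitts[[i]]))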
Now that we have cleaned our Twitter data, we can graphically display the most common words people on Twitter use in relation to the US president. As one can see, "white house" is the most common term related to @POTUS, although one could argue that I should have removed it before constructing the word cloud. The second term is "morningjoe", one of the most popular morning news shows in the US. We also see terms like ISIS, medal of freedom, turkey, etc. Tomorrow is Thanksgiving Day in the US, so all of these make sense.
wordcloud(clean.twitts, scale = c(2,0.1), colors = rainbow(50), min.freq = 7)
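The word cloud is purely visual. To put numbers behind it, one can tabulate term frequencies from a term-document matrix; a minimal sketch (the frequency threshold of 30 is an arbitrary choice):
# Count how often each term appears across the tweets and list the most frequent ones
tdm <- TermDocumentMatrix(clean.twitts)
term.freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
head(term.freq, 20)                 # top 20 terms and their counts
findFreqTerms(tdm, lowfreq = 30)    # terms appearing at least 30 times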
In Twitter parlance, this is also called a user timeline. We are going to pull up to 3,000 tweets by the US president himself to see the most popular topics he tweets about. A note of caution: user timelines belonging to protected users may only be requested when the authenticated user either "owns" the timeline or is an approved follower of the owner.
# Pull up to 3000 tweets from the @POTUS timeline itself
twitts <- userTimeline("@POTUS", n=3000)
twitts.txt <- sapply(twitts, function(x) x$getText())

# Helpers: strip URLs, and drop short words (three letters or fewer).
# removeShortWords is defined in anticipation of the algorithmic step
# described at the end but is not applied in this pipeline.
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
removeShortWords <- function(x) gsub("\\b[[:alpha:]]{1,3}\\b", "", x)

# Same cleaning pipeline as before, with a longer list of filler terms to drop
clean.twitts <- Corpus(VectorSource(twitts.txt))
clean.twitts <- tm_map(clean.twitts, content_transformer(function(x) iconv(x, to='UTF-8', sub='byte')), mc.cores=8)
clean.twitts <- tm_map(clean.twitts, content_transformer(tolower), mc.cores=8)
clean.twitts <- tm_map(clean.twitts, removePunctuation, mc.cores=8)
clean.twitts <- tm_map(clean.twitts, content_transformer(removeURL), mc.cores=8)
clean.twitts <- tm_map(clean.twitts, function(x) removeWords(x, stopwords("english")), mc.cores=8)
clean.twitts <- tm_map(clean.twitts, function(x) removeWords(x, c("potus","via","amp","will","today","every","can","lets","thats","just","years","year","weve","ever","people")), mc.cores=8)
clean.twitts <- tm_map(clean.twitts, function(x) removeNumbers(x), mc.cores=8)
clean.twitts <- tm_map(clean.twitts, function(x) stripWhitespace(x), mc.cores=8)
Based on the word cloud below, one can clearly see that the US president's most frequent term is "american". His interests include climate, health, refugees, change, congress, and so on.
wordcloud(clean.twitts, scale = c(2,0.1), colors = rainbow(50), min.freq = 2)
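To probe that reading a bit further, one could build a term-document matrix from the president's own tweets and look at which terms tend to co-occur with "american". A small sketch using tm's findAssocs(); the 0.2 correlation threshold is an arbitrary choice, and it assumes the term "american" survives the cleaning above:
# Which terms are most associated with "american" in the president's tweets?
tdm.potus <- TermDocumentMatrix(clean.twitts)
findFreqTerms(tdm.potus, lowfreq = 5)       # terms used at least 5 times
findAssocs(tdm.potus, "american", 0.2)      # terms correlated with "american"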
We have merely scratched the surface of what can be done with topic modeling. The next, and probably lengthier, step would be to take an algorithmic approach to this analysis. The data pre-processing pipeline would change to some degree because words would be stemmed, short words removed, and so on. What we would get at the end is a ranked, quantifiable metric of the events or interests associated with a given person's name. I hope this gives you a taste of what we can achieve; in fact, I have built similar applications in the past.
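As a foretaste of that algorithmic step, here is a minimal sketch of fitting an LDA topic model to the cleaned timeline corpus. It uses the topicmodels package, which is not used above and is therefore an assumption; the number of topics k and the seed are arbitrary choices.
# Stem the cleaned tweets, drop short words via the document-term matrix,
# and fit an LDA model; the result is a ranked list of terms per topic
suppressMessages(library(topicmodels))
stemmed <- tm_map(clean.twitts, stemDocument)
dtm <- DocumentTermMatrix(stemmed, control = list(wordLengths = c(4, Inf)))
dtm <- dtm[rowSums(as.matrix(dtm)) > 0, ]     # LDA needs documents with at least one term
lda.fit <- LDA(dtm, k = 5, control = list(seed = 1234))
terms(lda.fit, 10)    # top 10 terms for each of the 5 topics
topics(lda.fit, 1)    # most likely topic for each tweet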