The goal of this demo is to showcase how R can be used for retrieving and analysing texts. You are not expected to (fully) understand the commands used here; they will be explained in more detail in later sessions. So just have fun, play around, and remember: you can’t break anything!
First we need to install a series of packages:
install.packages(c("devtools", "rjson", "bit64", "httr", 'base64enc', 'plyr'))
install.packages('RTextTools')
library(devtools)
install_github("geoffjentry/twitteR")
install_github("kasperwelbers/corpus-tools")
Now we can access Twitter through its API (fill in your own API keys here):
library(twitteR)
token = '...'
token_secret = '...'
consumer_key = "..."
consumer_secret = "..."
options(httr_oauth_cache=TRUE)
setup_twitter_oauth(consumer_key, consumer_secret, token, token_secret)
Let’s download some tweets about #migrants and #refugees:
library(plyr)
migrants_tweets <- ldply(searchTwitter("#migrants", n=1000, lang="en"), as.data.frame)
refugees_tweets <- ldply(searchTwitter("#refugees", n=1000, lang="en"), as.data.frame)
save(refugees_tweets, migrants_tweets, file="tweets.rda")
What do these tweets look like? You can use View() to open a data frame for viewing:
View(migrants_tweets)
Other useful commands for inspecting data are summary(), head() and tail().
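As a quick illustration of these inspection commands, here is a minimal sketch on a toy data frame. The data frame `toy_tweets` and its values are made up for this example; the real tweet data frames have many more columns.

```r
# Toy stand-in for a tweets data frame (hypothetical values, not real tweets)
toy_tweets <- data.frame(
  text = c("first example tweet", "second example tweet", "third one"),
  retweetCount = c(0, 5, 2),
  stringsAsFactors = FALSE
)

head(toy_tweets, 2)               # first two rows
tail(toy_tweets, 1)               # last row
summary(toy_tweets$retweetCount)  # minimum, quartiles, mean and maximum
dim(toy_tweets)                   # number of rows and columns
```

The same calls work unchanged on migrants_tweets or refugees_tweets.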
So, can we create a word cloud? Step one: create a document-term matrix:
load("tweets.rda")
library(RTextTools)
dtm_migrants = create_matrix(migrants_tweets$text)
dtm_migrants
## <<DocumentTermMatrix (documents: 1000, terms: 2304)>>
## Non-/sparse entries: 11272/2292728
## Sparsity : 100%
## Maximal term length: 48
## Weighting : term frequency (tf)
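To see what such a matrix holds, here is a minimal base-R sketch that builds a tiny document-term matrix by hand. It is not how create_matrix works internally (that function also applies preprocessing such as stopword removal); the three toy documents and the names `docs`, `vocab` and `toy_dtm` are assumptions made for illustration.

```r
# Three toy "tweets" (made up for this example)
docs <- c("refugees welcome here", "migrants and refugees", "welcome welcome")

# Split each document into lower-case words
tokens <- strsplit(tolower(docs), "[^a-z]+")

# The vocabulary is the sorted set of all distinct words
vocab <- sort(unique(unlist(tokens)))

# Count how often each vocabulary word occurs in one document
count_terms <- function(words) as.integer(table(factor(words, levels = vocab)))

# One row per document, one column per term, cells hold the counts
toy_dtm <- t(vapply(tokens, count_terms, integer(length(vocab))))
dimnames(toy_dtm) <- list(paste0("doc", seq_along(docs)), vocab)

toy_dtm           # the full matrix
colSums(toy_dtm)  # total frequency of each term across documents
```

Most cells are zero even in this toy case, which is why the real matrix above is stored in a sparse format.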
So now we have a document-term matrix that contains the frequency of each word in each tweet. We can create a word cloud from this matrix using the dtm.wordcloud function:
library(corpustools)
dtm.wordcloud(dtm_migrants)
Nice, but not very informative, because ‘migrants’ and ‘refugees’ themselves are included. Let’s remove them from the matrix:
dtm_migrants = dtm_migrants[, !(colnames(dtm_migrants) %in% c('migrants', 'refugees'))]
dtm_migrants
## <<DocumentTermMatrix (documents: 1000, terms: 2302)>>
## Non-/sparse entries: 9926/2292074
## Sparsity : 100%
## Maximal term length: 48
## Weighting : term frequency (tf)
As you can see, the number of terms in dtm_migrants is now 2302 rather than 2304. Let’s try to make a word cloud again:
dtm.wordcloud(dtm_migrants)
Better! Now let’s do the same for refugees:
dtm_refugees = create_matrix(refugees_tweets$text)
dtm_refugees = dtm_refugees[, !(colnames(dtm_refugees) %in% c('migrants', 'refugees'))]
dtm.wordcloud(dtm_refugees)
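The column filtering used for both matrices can be sketched on a small ordinary matrix. This is only an illustration in base R: the matrix `toy` and its counts are made up, and `drop = FALSE` is added here because a plain matrix would otherwise collapse to a vector when a single column remains.

```r
# Toy term counts: two documents, three terms (hypothetical values)
toy <- matrix(c(1, 0, 2,
                0, 3, 1),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("doc1", "doc2"),
                              c("border", "migrants", "refugees")))

drop_terms <- c("migrants", "refugees")

# %in% gives TRUE for column names in drop_terms; ! flips it to "keep"
keep <- !(colnames(toy) %in% drop_terms)

# Select only the columns we want to keep
filtered <- toy[, keep, drop = FALSE]
colnames(filtered)
```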
Specific word choices are one of the primary means of framing a discussion. So let’s see which words are more frequent in the refugees discussion than in the migrants discussion:
cmp = corpora.compare(dtm_refugees, dtm_migrants)
with(cmp[cmp$over>1,], dtm.wordcloud(terms=term, freqs = chi))
And the opposite, words that are more frequent in the migrants discussion:
with(cmp[cmp$over<1,], dtm.wordcloud(terms=term, freqs = chi))
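To give a feel for what the over and chi columns express, here is a simplified base-R sketch of one common way to compare term frequencies between two corpora. It is an assumption about the general idea, not the actual corpora.compare implementation; the term counts, the smoothing value and all variable names are made up for illustration.

```r
# Hypothetical term counts in two small corpora
freq_x <- c(asylum = 30, border = 10, crisis = 20)  # e.g. the refugees tweets
freq_y <- c(asylum = 5,  border = 25, crisis = 20)  # e.g. the migrants tweets

# Relative frequency of each term within its own corpus
rel_x <- freq_x / sum(freq_x)
rel_y <- freq_y / sum(freq_y)

# Over-representation ratio: > 1 means relatively more frequent in x.
# The small smoothing constant (assumed here) avoids division by zero.
smooth <- 0.001
over <- (rel_x + smooth) / (rel_y + smooth)

# Chi-squared statistic of the 2x2 table (term vs. all other words, x vs. y)
x_other <- sum(freq_x) - freq_x
y_other <- sum(freq_y) - freq_y
n <- freq_x + freq_y + x_other + y_other
chi <- n * (freq_x * y_other - freq_y * x_other)^2 /
  ((freq_x + freq_y) * (x_other + y_other) *
   (freq_x + x_other) * (freq_y + y_other))

data.frame(term = names(freq_x), over = over, chi = chi, row.names = NULL)
```

In this sketch the ratio tells you the direction of the difference and the chi-squared value tells you how strong it is, which is why the word clouds above filter on over and size the words by chi.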