This document searches for tweets containing the word “weather”. The query focuses on Illinois, the state with the most reports submitted through the mPING app. Common words in those tweets are then identified and shown in a wordcloud.

To start, we will need to import some packages:

library("twitteR") # To get twitter data.
## Loading required package: ROAuth
## Loading required package: RCurl
## Loading required package: bitops
## Loading required package: digest
## Loading required package: rjson
library("tm") # For text Mining
## Loading required package: NLP
library("wordcloud") # To build the wordcloud
## Loading required package: RColorBrewer
library("RColorBrewer") # To get palettes for drawing nice plots.

We also need to load the authentication file saved in the working directory.

load("twitterauthentication.Rdata")
registerTwitterOAuth(cred)
## [1] TRUE
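For reference, the cred object stored in that file can be created once with ROAuth and saved for reuse. A minimal sketch, assuming the typical twitteR/ROAuth workflow (the consumer key and secret are placeholders that you would replace with your own Twitter application credentials):

library("ROAuth")
cred <- OAuthFactory$new(consumerKey    = "YOUR_CONSUMER_KEY",    # placeholder
                         consumerSecret = "YOUR_CONSUMER_SECRET", # placeholder
                         requestURL     = "https://api.twitter.com/oauth/request_token",
                         accessURL      = "https://api.twitter.com/oauth/access_token",
                         authURL        = "https://api.twitter.com/oauth/authorize")
cred$handshake() # interactive step: visit the printed URL and enter the PIN
save(cred, file = "twitterauthentication.Rdata")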

With the following single line we get tweets in English containing the word “weather”, restricted by the geocode argument to a 100 km radius around a point in central Illinois.

weather <- searchTwitter("weather", n=2000, lang="en", since='2014-06-01', until='2014-07-31', geocode="39.739262,-89.504089,100km")
## Warning: 2000 tweets were requested but the API can only return 914

Taking a look at the tweets (note the warning above: the Search API only covers recent tweets, roughly the past week, which is why fewer than the 2000 requested were returned):

head(weather)
## [[1]]
## [1] "MWMattingly: \"Sharks are the ass hat of the sea &amp; tornadoes are the hummers of the weather channel\" - B&amp;T  #Sharknado2TheSecondOne"
## 
## [[2]]
## [1] "TylerBusch1: I'm ready for hoodie weather "
## 
## [[3]]
## [1] "ServicePdfs: #file #download Space Weather http://t.co/28QQfsJHgu"
## 
## [[4]]
## [1] "ServicePdfs: #download Space Weather @ServicePdfs"
## 
## [[5]]
## [1] "OzarkKent: I'm earning #mPOINTS Rewards in The Weather Channel. http://t.co/dxKP5mIW6o"
## 
## [[6]]
## [1] "jaimeborko25: Cool Summer: Jet Stream Dip Sends Temps Plunging and Beats Records http://t.co/eJsFUH0d0u"

Next, the tweets are converted to a data frame and stored in a CSV file.

weather.df <- do.call(rbind, lapply(weather, as.data.frame))
write.csv(weather.df, "/home/msuarez/Documents/UCSB/2014/Summer/R/weather.csv")
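The same list-to-data-frame conversion can also be done with twitteR's twListToDF() helper:

weather.df <- twListToDF(weather) # equivalent one-liner from the twitteR package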

From here on, we will start to play around with the data.

Extracting the text from the tweets into a vector:

weather_list <- sapply(weather, function(x) x$getText())

Cleaning up the text (lower case, then removing mentions, punctuation, and URLs):

weather_list <- tolower(weather_list)                  # convert to lower case
weather_list <- gsub("@\\w+", "", weather_list)        # remove @mentions
weather_list <- gsub("[[:punct:]]", "", weather_list)  # remove punctuation (this collapses URLs into single word-like tokens)
weather_list <- gsub("http\\w+", "", weather_list)     # remove the collapsed http... tokens, i.e. the URLs

Constructing the lexical Corpus:

weather_corpus <- Corpus(VectorSource(weather_list))

Applying some transformations to the corpus (stripping whitespace, removing numbers and stopwords) and then constructing the term-document matrix:

weather_corpus <- tm_map(weather_corpus, stripWhitespace)
weather_corpus <- tm_map(weather_corpus, removeNumbers)
myStopwords <- c(stopwords('english'), "http", "weather", "rt") # standard English stopwords plus the search term and other Twitter noise

weather_corpus <- tm_map(weather_corpus, removeWords, myStopwords)

tdm = TermDocumentMatrix(weather_corpus)
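Before plotting, tm's findFreqTerms() gives a quick look at the most common terms; for example (the frequency threshold of 20 is arbitrary):

findFreqTerms(tdm, lowfreq = 20) # terms appearing at least 20 times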

Converting the term-document matrix into a regular matrix:

m = as.matrix(tdm)

Getting word counts in decreasing order:

word_freqs = sort(rowSums(m), decreasing=TRUE) 

Creating a data frame with words and their frequencies:

dm = data.frame(word=names(word_freqs), freq=word_freqs)
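A quick check of the top of the frequency table:

head(dm, 10) # the ten most frequent words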

Plotting the wordcloud:

wordcloud(dm$word, dm$freq, scale=c(3,.4), max.words=200, random.order=FALSE,
          rot.per=.15, colors=brewer.pal(6, "Dark2"), font = 1, family = "serif")

(Wordcloud of the most frequent words in the “weather” tweets.)
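To keep a copy of the figure, the same wordcloud call can be wrapped in a graphics device; a sketch (the file name and size are arbitrary):

png("weather_wordcloud.png", width = 800, height = 800) # open a PNG device
wordcloud(dm$word, dm$freq, scale=c(3,.4), max.words=200, random.order=FALSE,
          rot.per=.15, colors=brewer.pal(6, "Dark2"))
dev.off() # write the file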