Configuring twitter

Pulling data from Twitter directly into R is quite straightforward to set up. For R, one needs the twitteR package from CRAN. A twitter account can be set up the usual way and a new app can be created at https://apps.twitter.com/app/new. In the “Callback URL” box, one can paste http://127.0.0.1:1410, which avoids an error that comes up later in the process. Here’s another tutorial that describes how to set up a twitter app. It is rather old and the twitteR::getTwitterOAuth() function is deprecated in favour of the twitteR::setup_twitter_oauth() function, but I found the screenshots from the tutorial helpful. Here’s a more recent tutorial. Next, in RStudio, one should run the following code:

library(twitteR)
setup_twitter_oauth("consumer_key", "consumer_secret", "access_token", "access_key")
# select 1 to cache the OAuth credentials in a local file; this step throws a (browser authentication) error if the "Callback URL" field was left blank in the twitter app dashboard (on the twitter website).

where consumer_key and consumer_secret can be obtained from the “Keys and Access Tokens” tab on the twitter application dashboard https://apps.twitter.com/app/. Once the above command is run, it should have successfully set up access to the Twitter API. More information can be found at https://dev.twitter.com/overview/api.
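
To sanity-check that authentication works, one can issue a tiny query; the search term and variable name below are purely illustrative.

# quick sanity check: fetch a handful of tweets mentioning "coffee"
# (the term is arbitrary; any search string works once authentication succeeds)
test_tweets = twitteR::searchTwitter("coffee", n = 5)
length(test_tweets) # at most 5 tweets are returned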

Using twitteR To Pull Data

Now that we have twitter set up, we can begin querying and modelling twitter data. One useful function is twitteR::getCurRateLimitInfo(), which reports the rate limits on the API, i.e., it tells us how many times we can query the twitter API per unit time for a given task (e.g. sending a DM, searching tweets, downloading the statuses for a given user, etc.). Consulting the documentation, one observes that twitter allows 15 or 180 queries in a 15-minute time span; whether the limit is 15 or 180 depends on which operation is being performed through the API. Since this tutorial restricts itself to analyzing the text of tweets containing a given term, we primarily use the API to search for that term and pull the corresponding tweets. Consulting the official documentation, we see that the limit for tweet search queries is 180. For example, if we pull 10 tweets containing the term coffee, it counts as 1 query and 179 queries remain for the rest of that 15-minute period.
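
To make the arithmetic concrete, here is a quick back-of-the-envelope calculation based on the limits quoted above.

# back-of-the-envelope: maximum tweets retrievable via search in one 15-minute window
queries_per_window = 180 # documented limit for /search/tweets
tweets_per_query = 100   # documented cap on tweets returned per search query
queries_per_window * tweets_per_query # 18000 tweets per 15-minute window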

After performing a query, we can run twitteR::getCurRateLimitInfo() to see the remaining number of queries and the exact time (in GMT) at which the rate limit will be reset. This gives a large amount of information. If one is interested in only a particular task, it can be passed to the function as a string, for example “search” (twitteR::getCurRateLimitInfo(resources = "search")) which gives us how many search queries remain out of 180. More information is given in the relevant help file for the function. Running the function gives the following output:

suppressPackageStartupMessages(library(magrittr)) # for using the (pipe) %>% operator
# returns a data.frame
twitteR::getCurRateLimitInfo() %>% head(., n = 10)
##                   resource limit remaining               reset
## 1              /lists/list    15        15 2016-09-18 11:43:21
## 2       /lists/memberships    15        15 2016-09-18 11:43:21
## 3  /lists/subscribers/show    15        15 2016-09-18 11:43:21
## 4           /lists/members   180       180 2016-09-18 11:43:21
## 5     /lists/subscriptions    15        15 2016-09-18 11:43:21
## 6              /lists/show    15        15 2016-09-18 11:43:21
## 7        /lists/ownerships    15        15 2016-09-18 11:43:21
## 8       /lists/subscribers   180       180 2016-09-18 11:43:21
## 9      /lists/members/show    15        15 2016-09-18 11:43:21
## 10         /lists/statuses   180       180 2016-09-18 11:43:21

and for “search”

twitteR::getCurRateLimitInfo("search") #time in GMT
##         resource limit remaining               reset
## 1 /search/tweets   180       180 2016-09-18 11:43:21

The main function for performing a query and pulling the corresponding tweets is searchTwitter(). Its arguments can be found on the corresponding help page by running ?searchTwitter. The first argument is searchString, which takes the term(s) we wish to query; examples would be “coffee” or “#scala”. More information about specifying search terms is found in the official API docs. Other arguments that can be passed to searchTwitter() include the location, the language of the tweet and the number of tweets to fetch, among others; the full list is in the relevant help file. Consulting the API documentation on the web, it appears that the number of tweets per query is capped at 100. The documentation also lists errors that can sometimes occur when this function is called.
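
As a minimal sketch (the search term, variable name and argument values here are purely illustrative):

# fetch up to 100 English-language tweets mentioning "coffee"
coffee_tweets = searchTwitter("coffee", n = 100, lang = "en")
length(coffee_tweets)        # number of tweets actually returned (can be fewer than n)
coffee_tweets[[1]]$getText() # text of the first returned status object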

For this demonstration, the twitteR::search_twitter_and_store() function is used, since it fetches the tweets for a given search string and stores them in an SQL database. By default, the function pulls in 5000 tweets. As the next code chunk shows, we first have to register a database, i.e., open a connection to it; if the database doesn’t exist, a database with that name is created in the project directory. Then, twitteR::search_twitter_and_store() is run to find tweets containing either “Dengue” or “dengue”. The table_name argument takes the name of the table where the tweets should be stored. The rest of the arguments passed to the function are the same as those for searchTwitter().

# getting the data
register_sqlite_backend("dengue_tweets_db") # if db doesn't exist, it will be created.
## Loading required namespace: RSQLite

The previous code chunk creates a connection to the database. The next code chunk downloads the tweets from twitter and writes them to the database.

dengue_raw = search_twitter_and_store("dengue+Dengue", table_name = "dengue_tweets_db", lang = "en")

In the above code chunk, we query twitter and store the results in a table called dengue_tweets_db inside the database registered earlier. (We can run twitteR::getCurRateLimitInfo(resources = "search") to see how many queries remain after the function call in the previous code chunk.) Beware: the query can take some time to run. The above code chunk used up 50 queries, since it pulled 50 * 100 (tweets per query) = 5000 tweets. (Note: I ran into some issues when trying to set the number of tweets in the function call manually, so I left it at the default of 5000.)

# loading the downloaded tweets
dengue_tweets_db = load_tweets_db(table_name = "dengue_tweets_db") %>% twListToDF()

The above chunk loads the downloaded tweets from the database where they were stored. The twitteR::twListToDF() function converts the downloaded tweets and corresponding metadata into a data frame for ease of analysis. We can have a look at the contents of this data frame.

dplyr::glimpse(dengue_tweets_db)
## Observations: 5,000
## Variables: 16
## $ text          <chr> "RT @DaminiNath: 284 new cases of #dengue report...
## $ favorited     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ favoriteCount <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, ...
## $ replyToSN     <chr> NA, NA, NA, "ArvindKejriwal", NA, NA, NA, NA, "Y...
## $ created       <dttm> 2016-09-05 11:56:35, 2016-09-05 11:52:24, 2016-...
## $ truncated     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ replyToSID    <chr> NA, NA, NA, NA, NA, NA, NA, NA, "772761793747791...
## $ id            <chr> "772765365185064960", "772764316147683328", "772...
## $ replyToUID    <chr> NA, NA, NA, "405427035", NA, NA, NA, NA, "206592...
## $ statusSource  <chr> "<a href=\"http://twitter.com/download/android\"...
## $ screenName    <chr> "manju57943813", "AgGuillergan", "khanimambobar"...
## $ retweetCount  <dbl> 21, 0, 8, 0, 0, 3, 10, 2, 1, 0, 3, 1, 0, 1, 3, 0...
## $ isRetweet     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ retweeted     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ longitude     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ latitude      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...

The twitter data frame contains a lot of information. For the text analyses conducted below, the first column is the most relevant, as it contains the text of the tweets. The other columns could also be useful: for example, using the retweet count to measure the impact of a tweet, checking whether a tweet was retweeted, or obtaining the spatial coordinates of the entity behind a particular tweet. For the time being, we restrict ourselves to the text of the tweets (stored in the variable dengue_tweets), which is processed in the next section.

dengue_tweets = dengue_tweets_db$text
head(dengue_tweets)
## [1] "RT @DaminiNath: 284 new cases of #dengue reported in the past week. This year so far Delhi has registered 771 cases, as per municipal corpo…"  
## [2] "Is it DENGUE or what?"                                                                                                                         
## [3] "RT @Mozziebites: I'll try to tweet a few things during the mosquito/arbovirus meeting so keep an eye on #MCAA2016! #zika #dengue https://t.…"  
## [4] "@ArvindKejriwal sir dengue &amp; chikunguya is spreading in delhi like anything u must call a meeting 4 action\nRegards\nShailendra"           
## [5] "Top story: Science &amp; Research H5N1: Brazil: Recife no longer in epidemics of de… https://t.co/TbGsmNgj1s, see more https://t.co/RGUABS93Wx"
## [6] "RT @HospitalsApollo: #Dengue is a #mosquito borne infection that causes typical flu-like illness. Read on: https://t.co/ZOZnBzz23a https://…"

Preprocessing Tweets

We apply some common functions to process the text data. The main packages in R for text mining/analysis are qdap and tm. Both have functions that can be used to clean up text (remove punctuation, hyperlinks and numbers, convert numeric values into words, etc.).
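
A couple of one-off illustrations of the kind of helpers these packages provide (the qdap example assumes the package is installed; exact output formats may vary slightly):

# tm helpers operate directly on character vectors
tm::removePunctuation("284 new cases of #dengue, reported!") # drops punctuation
tm::removeNumbers("284 new cases")                           # drops digits
# qdap can convert numeric values into words
qdap::replace_number("284 new cases")                        # e.g. "two hundred eighty four new cases"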

For preprocessing/cleaning tweets, we can write a cleaning function that applies the cleaning operations to any dataset. This is coded below and described after the code chunk.

suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(tm))
suppressPackageStartupMessages(library(qdap))

removeEmoji = function(vec) {
  # define emoji unicode or unicode for other related symbols
  # https://stackoverflow.com/questions/24672834/how-do-i-remove-emoji-from-string
  # emoticons
  vec = gsub(pattern = "[\U0001F600-\U0001F64F]", "", x = vec)
  # symbols and pics
  vec = gsub(pattern = "[\U0001F300-\U0001F5FF]", "", x = vec)
  # dingbats (U+2700-U+27BF)
  vec = gsub(pattern = "[\U00002700-\U000027BF]", "", x = vec)
  return(vec)
}

# processing the tweets
clean_tweets = function(corpus) {
  # removeEmoji has to be defined
  corpus = tm_map(corpus, content_transformer(removeEmoji))
  corpus = tm_map(corpus, stripWhitespace)
  corpus = tm_map(corpus, removePunctuation)
  corpus = tm_map(corpus, removeNumbers)
  corpus = tm_map(corpus, content_transformer(tolower))
  corpus = tm_map(corpus, removeWords, c("Dengue", "dengue"))
  # removing most common english stop words
  # stopwords("en")
  corpus = tm_map(corpus, removeWords, stopwords("en"))
  # stemming the document (processing the words to have the same root)
  corpus = tm_map(corpus, stemDocument)
  return(corpus)
}

The removeEmoji function uses regular expressions to search for the Unicode characters corresponding to emoji and other symbols in the tweets and remove them. Of course, removing these would be counterproductive if sentiment analysis were the goal. Cleaning emoji and symbols from tweets was painful because of all the errors encountered; details are given in the comments in the accompanying R script. The solution was to convert the tweets from ASCII to UTF-8, which did away with the errors, and any incorrectly parsed characters were then removed by the removeEmoji function.

Most of the cleaning functions applied to the corpus are self-explanatory. The removeWords function with the argument stopwords("en") removes the most common stopwords in English. The stemDocument function performs word stemming, which truncates words sharing a common root and replaces them with a single stem. For example (taken from the linked Wiki article), “fishing”, “fished” and “fisher” are replaced by their root word: fish. It should be noted that the word stem need not necessarily be a word on its own. More examples are found on the linked Wikipedia page.
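
A quick way to see the stemmer in action is to apply it directly to a character vector; note that the exact stems depend on the Porter algorithm that tm uses (via the SnowballC package), so not every related word is guaranteed to collapse to the same stem.

# stemming raw strings directly (stemDocument also has a character method)
tm::stemDocument(c("fishing", "fished", "fishes", "running", "runs"))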

# top 30 common stopwords in english and german
stopwords("en") %>% head(30)
##  [1] "i"          "me"         "my"         "myself"     "we"        
##  [6] "our"        "ours"       "ourselves"  "you"        "your"      
## [11] "yours"      "yourself"   "yourselves" "he"         "him"       
## [16] "his"        "himself"    "she"        "her"        "hers"      
## [21] "herself"    "it"         "its"        "itself"     "they"      
## [26] "them"       "their"      "theirs"     "themselves" "what"
stopwords("de") %>% head(30)
##  [1] "aber"    "alle"    "allem"   "allen"   "aller"   "alles"   "als"    
##  [8] "also"    "am"      "an"      "ander"   "andere"  "anderem" "anderen"
## [15] "anderer" "anderes" "anderm"  "andern"  "anderr"  "anders"  "auch"   
## [22] "auf"     "aus"     "bei"     "bin"     "bis"     "bist"    "da"     
## [29] "damit"   "dann"

This removes the most common words. However, there might be other words that appear in almost all the documents but are not stop words. Handling these (weighting terms) is considered later on. The tweets are stored in a corpus and then cleaned/processed. A corpus is a structured set of texts.

dengue_corpus = dengue_tweets %>% 
  iconv(., "ASCII", "UTF-8", sub = "") %>% # converts tweets to UTF-8
  VectorSource() %>% # tells R to consider each tweet as a document
  VCorpus() # V := volatile, i.e., the corpus is held in memory (as opposed to a PCorpus, which is stored outside of memory)

# clean the corpus
dengue_corpus_clean = clean_tweets(dengue_corpus)
dengue_corpus_clean
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 5000

This shows that our corpus has 5000 documents (i.e., tweets). Next up, various visualizations are created and the contents of the tweets are explored.

Visualizing Tweets

From the corpus, we can create two types of matrices: the DocumentTermMatrix (DTM) and the TermDocumentMatrix (TDM). The DTM has one row per document and one column per word. The TDM is the transpose of the DTM: the documents become the columns and the words the rows.

# creating DocumentTermMatrix from the dengue corpus
dengue_dtm = DocumentTermMatrix(dengue_corpus_clean)
dengue_dtm
## <<DocumentTermMatrix (documents: 5000, terms: 6014)>>
## Non-/sparse entries: 48788/30021212
## Sparsity           : 100%
## Maximal term length: 40
## Weighting          : term frequency (tf)

From the output, we see that the DTM is highly sparse (100% after rounding) and has only 48788 non-sparse (i.e. non-zero) entries out of roughly 30 million! Each column corresponds to a word that occurs in at least one document, and the rows with non-zero entries in that column indicate the documents in which the word occurs. A small submatrix is printed below.

inspect(dengue_dtm[1:10,1:10])
## <<DocumentTermMatrix (documents: 10, terms: 10)>>
## Non-/sparse entries: 0/100
## Sparsity           : 100%
## Maximal term length: 13
## Weighting          : term frequency (tf)
## 
##     Terms
## Docs aad aajtak aakashgauttam aakashsarcasm aam aamaadmi aamaadmiparti
##   1    0      0             0             0   0        0             0
##   2    0      0             0             0   0        0             0
##   3    0      0             0             0   0        0             0
##   4    0      0             0             0   0        0             0
##   5    0      0             0             0   0        0             0
##   6    0      0             0             0   0        0             0
##   7    0      0             0             0   0        0             0
##   8    0      0             0             0   0        0             0
##   9    0      0             0             0   0        0             0
##   10   0      0             0             0   0        0             0
##     Terms
## Docs aamadamiparti aap aapdelhidiari
##   1              0   0             0
##   2              0   0             0
##   3              0   0             0
##   4              0   0             0
##   5              0   0             0
##   6              0   0             0
##   7              0   0             0
##   8              0   0             0
##   9              0   0             0
##   10             0   0             0

All the entries in this submatrix are 0 as a result of the sparsity. The term/word frequencies are obtained from the column sums (i.e. summing the entries in each column) of the DocumentTermMatrix, whereas in the TermDocumentMatrix the term frequencies are obtained by summing across rows instead of columns. These are shown below.

# top n words
freq = dengue_dtm %>% as.matrix() %>% 
  colSums() %>% sort(., decreasing = TRUE) 
freq %>% head(50)
##         zika        delhi        fever          amp       vaccin 
##          971          854          775          592          589 
##         case     mosquito  chikungunya       clinic   theeasyway 
##          579          471          449          415          413 
##    hdfcergog       polici        cover       diseas       suffer 
##          412          401          347          341          338 
##        relat         govt       hospit   dispensari         high 
##          334          328          322          313          290 
##          get        death         make       direct          run 
##          287          284          279          250          246 
##          aap        today        admit microcephali       brazil 
##          243          241          233          231          230 
##      prevent         seen      dhesimd          new        https 
##          229          228          225          224          223 
##       amount        studi          zik       infect        claim 
##          217          215          215          213          208 
##         need          can         area          may          via 
##          203          202          198          198          191 
## arvindkejriw       httpst        fight      convert         call 
##          188          188          185          181          173

Some of these words convey useful information: a lot of the tweets mention Delhi, Zika, fever, mosquito, chikungunya, etc. Speaking to my family back home (in Delhi) confirmed that Dengue and Chikungunya cases are indeed rather high. Arvind Kejriwal is the current chief minister of Delhi and is mentioned in some of the tweets as a result. However, since this corpus of tweets is only a fraction of the tweets captured at a specific minute of a specific day, it may not provide a complete global picture of Dengue. Furthermore, some words are quite common but may not have anything to do with Dengue: http/https (tweets were not stripped of hyperlinks), via, may, run, amount, etc. We can also find frequent terms using the tm::findFreqTerms() function, which takes minimum and maximum frequency bounds as arguments.

findFreqTerms(dengue_dtm, 100)
##  [1] "aamaadmiparti"    "aap"              "admit"           
##  [4] "aed"              "amount"           "amp"             
##  [7] "area"             "arvindkejriw"     "bill"            
## [10] "blood"            "blooddonorsin"    "brazil"          
## [13] "breed"            "busi"             "call"            
## [16] "can"              "care"             "case"            
## [19] "caus"             "check"            "chikungunya"     
## [22] "claim"            "claus"            "clinic"          
## [25] "convert"          "copi"             "cover"           
## [28] "death"            "delhi"            "dhesimd"         
## [31] "direct"           "diseas"           "dispensari"      
## [34] "face"             "famili"           "fever"           
## [37] "fight"            "get"              "gianyar"         
## [40] "govt"             "hdfcergog"        "health"          
## [43] "helppreventdengu" "high"             "hospit"          
## [46] "https"            "httpst"           "httpstco"        
## [49] "increas"          "indiabtl"         "infect"          
## [52] "like"             "low"              "make"            
## [55] "malaria"          "may"              "medi"            
## [58] "microcephali"     "mohfwindia"       "mosquito"        
## [61] "need"             "new"              "patient"         
## [64] "platelet"         "pls"              "polici"          
## [67] "prevent"          "read"             "relat"           
## [70] "rise"             "run"              "satyendarjain"   
## [73] "seen"             "sep"              "sever"           
## [76] "spread"           "studi"            "submit"          
## [79] "suffer"           "theeasyway"       "today"           
## [82] "use"              "vaccin"           "via"             
## [85] "virus"            "will"             "zik"             
## [88] "zika"

This displays the terms that appear at least 100 times, although it does not list the corresponding frequencies.
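
If the frequencies themselves are wanted alongside the terms, one simple alternative is to filter the freq vector computed earlier:

# terms appearing at least 100 times, together with their counts
freq[freq >= 100] %>% head(10)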

A popular visualization tool is the wordcloud. It is displayed below.

library(wordcloud)
wordcloud(names(freq), freq, max.words = 150)

For those from Delhi, a lot of the terms make sense, which may not be the case for those outside Delhi. Examples: arvindkejriw, jpnadda (Union minister for health), aap, mohali, satyendarjain, cmodilli, aamaadmiparti, narendramodi, mohfwindia, swasthabharat, etc.

The wordcloud() function takes several arguments, with colors and max.words being the most important ones. We can use the colour palettes available in the R package RColorBrewer, which can be viewed by running RColorBrewer::display.brewer.all(). This is shown below:

suppressPackageStartupMessages(library(RColorBrewer))
RColorBrewer::display.brewer.all()

We replot the wordcloud with two different colour palettes, which can be accessed through the brewer.pal() function.

pal1 = brewer.pal(9, "GnBu")[-(1:3)]
wordcloud(names(freq), freq, max.words = 100, colors = pal1)
## Warning in wordcloud(names(freq), freq, max.words = 100, colors = pal1):
## chikungunya could not be fit on page. It will not be plotted.

pal2 = brewer.pal(9, "OrRd")[-(1:3)]
wordcloud(names(freq), freq, max.words = 50, colors = pal2)

pal3 = brewer.pal(8, "Dark2")
wordcloud(names(freq), freq, max.words = 50, colors = pal3)

We can plot the term frequencies as a bar chart too, although personally I find the wordcloud more visually appealing as well as more interpretable by a broader audience. Nevertheless, a bar chart is shown below.

# top 20 terms
n_words = 20

freq_df = data.frame(name = names(freq), freq = freq, row.names = 1:length(freq)) %>%
  dplyr::mutate(name = as.character(name)) %>% # coercing factor to character
  dplyr::slice(1:n_words) %>%
  dplyr::arrange(desc(freq))

ggplot(data = freq_df, aes(x = reorder(name, freq), y = freq)) + 
  geom_bar(stat = "identity") + theme_bw() + coord_flip() + 
  ylab("Word Count") + xlab("Term")

So far, we have only analyzed single words and their frequencies. A limitation of this approach is that single words may not convey as much information as phrases, or more formally, n-grams, where n is the number of tokens in the phrase. For example, a bigram consists of 2 words: if the unigrams show a high occurrence of the term “fever”, then the bigrams may show “high fever”, which conveys more information than the unigram. Similarly, a trigram (n = 3) may contain the phrase “very high fever”. Thus, creating a wordcloud of bigrams or trigrams may provide more information about the object of interest.

Bigrams

Creating \(n\)-grams is rather straightforward. This is accomplished by using the NGramTokenizer() function from the R package RWeka. We create the following two functions for using Weka to create 2- and 3-gram tokens.

# gives information about the Weka_control options for a given Weka function
RWeka::WOW("NGramTokenizer")
## -max <int>
##         The max size of the Ngram (default = 3).
##  Number of arguments: 1.
## -min <int>
##         The min size of the Ngram (default = 1).
##  Number of arguments: 1.
## -delimiters <value>
##         The delimiters to use (default ' \r\n\t.,;:'"()?!').
##  Number of arguments: 1.
## trying word clouds with 2- and 3-grams
## use RWeka::NGramTokenizer
bigram_token = function(x) RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 2))
trigram_token = function(x) RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 3, max = 3))

The Weka_control() argument takes a list of further options, with min and max being the most important ones since they specify the n in the \(n\)-gram.
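
As a quick check (assuming RWeka is installed), the tokenizer can be applied directly to a toy string to see the overlapping phrases it produces:

# the bigram tokenizer on a toy sentence; returns the overlapping 2-word phrases
bigram_token("dengue fever cases rise in delhi")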

# throwing some error; needs to be fixed
bigram_tdm = TermDocumentMatrix(dengue_corpus_clean, control = list(tokenize = bigram_token))
load("bigram_tdm.RData") # run in the main script and saved there

bigram_tdm
## <<TermDocumentMatrix (terms: 13459, documents: 5000)>>
## Non-/sparse entries: 48809/67246191
## Sparsity           : 100%
## Maximal term length: 46
## Weighting          : term frequency (tf)
tdm_matrix = as.matrix(bigram_tdm)
bigram_freq = rowSums(tdm_matrix) %>% sort(., decreasing = TRUE)
bigram_freq %>% head(50)
## hdfcergog theeasyway         fever clinic         polici cover 
##                  403                  391                  301 
##     dispensari fever            seen high           rt dhesimd 
##                  235                  226                  225 
##   relat microcephali          admit today         amount death 
##                  217                  216                  216 
##         brazil admit         direct relat          high amount 
##                  216                  216                  216 
##  microcephali brazil           today seen          zika direct 
##                  216                  216                  216 
##            death zik         dhesimd zika             govt run 
##                  215                  215                  171 
##       run dispensari             aap make           new vaccin 
##                  171                  149                  140 
##      make dispensari           vaccin may           delhi govt 
##                  129                  123                  108 
##        rt mohfwindia              cm busi             delhi cm 
##                  107                  103                  103 
##        famili polici           get famili            bill copi 
##                  101                  101                  100 
##           care relat           check medi         claim polici 
##                  100                  100                  100 
##         claus polici          copi submit            cover get 
##                  100                  100                  100 
##             get bill           medi claim          polici care 
##                  100                  100                  100 
##           read claus         submit claim     theeasyway check 
##                  100                  100                  100 
##       theeasyway get    theeasyway polici      theeasyway read 
##                  100                  100                  100 
##          rt indiabtl             call sep       mosquito breed 
##                   99                   97                   97 
##             pls call         convert govt 
##                   97                   96

The top bigram is “hdfcergog theeasyway”, which does not appear to have much to do with dengue; it is perhaps related to some general insurance offered by HDFC ergo. Googling “rt dhesimd” leads to the twitter profile of an MD working on Zika (“rt” stands for retweet). The top 20 bigrams are visualized as a wordcloud below.

wordcloud(names(bigram_freq), bigram_freq, max.words = 20, col = pal3, scale = c(1,2))

Trigrams

Similar to the bigram subsection, we can construct and plot the corresponding wordclouds for the trigrams as well. This is done using the following code.

# throwing some error
trigram_tdm = TermDocumentMatrix(dengue_corpus_clean, control = list(tokenize = trigram_token))
load("trigram_tdm.RData")

trigram_tdm
## <<TermDocumentMatrix (terms: 13432, documents: 5000)>>
## Non-/sparse entries: 43861/67116139
## Sparsity           : 100%
## Maximal term length: 59
## Weighting          : term frequency (tf)
# top n phrases
trigram_tdm %>% as.matrix() %>% rowSums() %>% sort(., decreasing = TRUE) %>% head(10)
##   dispensari fever clinic          admit today seen 
##                       230                       216 
##        brazil admit today direct relat microcephali 
##                       216                       216 
##         high amount death microcephali brazil admit 
##                       216                       216 
## relat microcephali brazil          seen high amount 
##                       216                       216 
##           today seen high         zika direct relat 
##                       216                       216
# wordcloud
trigram_tdm %>% as.matrix() %>% rowSums() %>% sort(., decreasing = TRUE) %>%
  wordcloud(names(.), ., max.words = 20, col = pal3, scale = c(1,2))

In this case, the bigrams and trigrams do not seem to be particularly informative compared to the unigrams.

Weighting Terms

There are various methods to assess the importance of terms in a corpus. One such method is TF-IDF, which stands for Term Frequency - Inverse Document Frequency; the details are given in the Wikipedia article on the topic. So far, the approach we have employed has been based on the term frequency (TF) alone: it simply counts the number of times a term appears in each document in the corpus and then sums these counts across the documents.

The TF is multiplied by a factor, called the inverse document frequency (IDF), which takes into account how common a word is across documents. Mathematically, the IDF of a term is the log of the total number of documents in the corpus divided by the number of documents that contain that term. This gives less weight to terms that appear in most of the documents.
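
In symbols (following the standard definition; tm’s weightTfIdf may differ in minor details such as the logarithm base and document-length normalisation, see its help page):

\[
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log \frac{N}{\lvert \{ d \in D : t \in d \} \rvert},
\]

where \(N\) is the total number of documents in the corpus \(D\) and \(\mathrm{tf}(t, d)\) is the number of times term \(t\) appears in document \(d\).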

It is simple to take this weighting into account by passing the control = list(weighting = weightTfIdf) argument to the TermDocumentMatrix() function. A wordcloud of unigrams is created using this weighting scheme.

uni_tdm_tfidf = TermDocumentMatrix(dengue_corpus_clean, control = list(weighting = weightTfIdf))
## Warning in weighting(x): empty document(s): 2 1824
uni_tdm_tfidf_freq = uni_tdm_tfidf %>% as.matrix() %>% 
  rowSums() %>% sort(., decreasing = TRUE) 

# with TFIDF
uni_tdm_tfidf_freq %>% head(10)
##      fever       zika     vaccin      delhi theeasyway  hdfcergog 
##   223.5757   223.3324   212.7682   211.8991   207.6044   207.3561 
##     polici       case      cover        get 
##   205.3529   181.6139   177.2426   157.3579
# only TF
freq %>% head(10)
##        zika       delhi       fever         amp      vaccin        case 
##         971         854         775         592         589         579 
##    mosquito chikungunya      clinic  theeasyway 
##         471         449         415         413
# bar chart
data.frame(name = names(uni_tdm_tfidf_freq), freq = uni_tdm_tfidf_freq) %>%
  dplyr::slice(1:n_words) %>%
  ggplot(data = ., aes(x = reorder(name, freq), y = freq)) + 
  geom_bar(stat = "identity") + theme_bw() + coord_flip() + 
  ylab("Word Count") + xlab("Term") + ggtitle("Barplot for Terms with TFIDF Weighting")

# unigram wordcloud with weighting
par(mfrow = c(1,2))
wordcloud(names(freq), freq, max.words = 50, col = pal3, scale = c(1,2))
## Warning in wordcloud(names(freq), freq, max.words = 50, col = pal3, scale =
## c(1, : prevent could not be fit on page. It will not be plotted.
title("Weight: TF")
uni_tdm_tfidf_freq %>% wordcloud(names(.), ., max.words = 50, col = pal3, scale = c(1,2))
title(main = "Weight: TFIDF")

We can see that the top 10 terms have changed between the two weighting schemes. In particular, mosquito drops out of the top 10 words under the new weighting scheme. This is understandable, since mosquito contributes little additional information for someone searching for dengue: dengue is caused by mosquitoes. This weighting can be useful for downweighting frequent but uninformative terms when one does not know in advance exactly which terms these are; if we did know them, we could simply add them to the removeWords() call in clean_tweets().

This concludes the first part of this article. A follow up article will contain different visualizations, sentiment analysis, correlations with different but similar terms (for example comparing Dengue and Zika), using a larger corpus of tweets, etc.

The static version will be posted on github, and the dynamic version of this article will be hosted on www.shinyapps.io.