Pulling data from Twitter directly into R is quite straightforward to set up. On the R side, one needs the twitteR package from CRAN. A twitter account can be set up the usual way and a new app can be created by going here: https://apps.twitter.com/app/new. In the “Callback URL” box, one can paste http://127.0.0.1:1410, which avoids an error that comes up later in the process. Here’s another tutorial that describes how to set up a twitter app. It is rather old and the twitteR::getTwitterOAuth() function is deprecated in favour of the twitteR::setup_twitter_oauth() function, but I found the screenshots from that tutorial helpful. Here’s a more recent tutorial. Next, in RStudio, one should run the following code:
library(twitteR)
setup_twitter_oauth("consumer_key", "consumer_secret", "access_token", "access_secret")
# select 1 to cache the OAuth credentials in a local file; browser authentication throws an error if the "Callback URL" field is left blank in the twitter app dashboard (on the twitter website).
where consumer_key and consumer_secret can be obtained from the “Keys and Access Tokens” tab on the twitter application dashboard https://apps.twitter.com/app/. Once the above command runs successfully, access to the Twitter API is set up. More information can be found at https://dev.twitter.com/overview/api.
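A lightweight API call can be used to confirm that the handshake worked; here is one such check (a sketch) using the rate-limit endpoint that is discussed in more detail in the next section:
# should return a small data frame rather than an authentication error
twitteR::getCurRateLimitInfo(resources = "search")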
twitteR To Pull Data
Now that we have twitter set up, we can begin querying and modelling twitter data. One useful function is twitteR::getCurRateLimitInfo(), which reports the rate limits on the API, i.e., it tells us how many times we can query the twitter API per unit time for a given task (e.g. sending a DM, searching tweets, downloading the status for a given user, etc.). Consulting the documentation, one observes that twitter allows 15 or 180 queries in a 15-minute window; whether the limit is 15 or 180 depends on which operation is being performed through the API. Since this tutorial restricts itself to analyzing the text of tweets for a given term, we primarily use the API to search for a given term and pull the corresponding tweets. Consulting the official documentation, we see that the limit for search queries on tweets is 180. For example, if we pull 10 tweets containing the term coffee, it counts as 1 query and 179 queries remain for that 15-minute window.
After performing a query, we can run twitteR::getCurRateLimitInfo() to see the remaining number of queries and the exact time (in GMT) at which the rate limit will be reset. This returns a large amount of information; if one is interested in only a particular task, it can be passed to the function as a string, for example “search” (twitteR::getCurRateLimitInfo(resources = "search")), which tells us how many of the 180 search queries remain. More information is given in the relevant help file for the function. Running the function gives the following output:
suppressPackageStartupMessages(library(magrittr)) # for using the (pipe) %>% operator
# returns a data.frame
twitteR::getCurRateLimitInfo() %>% head(., n = 10)
##                   resource limit remaining               reset
##  1             /lists/list    15        15 2016-09-18 11:43:21
##  2      /lists/memberships    15        15 2016-09-18 11:43:21
##  3 /lists/subscribers/show    15        15 2016-09-18 11:43:21
##  4          /lists/members   180       180 2016-09-18 11:43:21
##  5    /lists/subscriptions    15        15 2016-09-18 11:43:21
##  6             /lists/show    15        15 2016-09-18 11:43:21
##  7       /lists/ownerships    15        15 2016-09-18 11:43:21
##  8      /lists/subscribers   180       180 2016-09-18 11:43:21
##  9     /lists/members/show    15        15 2016-09-18 11:43:21
## 10         /lists/statuses   180       180 2016-09-18 11:43:21
and for “search”
twitteR::getCurRateLimitInfo("search") #time in GMT
##         resource limit remaining               reset
## 1 /search/tweets   180       180 2016-09-18 11:43:21
The main function for performing a query and pulling the corresponding tweets is searchTwitter(). Its arguments can be found on the corresponding help page by running ?searchTwitter. The first argument is searchString, which takes the term(s) we wish to query; examples would be “coffee” or “#scala”. More information about specifying search terms is found in the official API docs. Other arguments that can be passed to searchTwitter() include the location, the language of the tweets and the number of tweets to fetch; the full list can be found in the relevant help file. Consulting the API documentation on the web, it appears that the number of tweets per query is capped at 100. The documentation also lists errors that can sometimes occur when this function is called.
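To make this concrete, a minimal direct search might look as follows (a sketch; coffee_tweets and coffee_df are just illustrative names, and n is kept at the per-query cap of 100):
# search for 100 recent English-language tweets containing "coffee"
coffee_tweets = searchTwitter("coffee", n = 100, lang = "en")
coffee_df = twListToDF(coffee_tweets) # convert the list of status objects to a data frame
dim(coffee_df)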
For this demonstration, the twitteR::search_twitter_and_store() function is used, since it fetches the tweets for a given search string and stores them in an SQL database; by default it pulls in 5000 tweets. The next code chunk shows that we first have to register a database, i.e., open a connection to it. If the database doesn’t exist, one with the given name is created in the project directory. Then, twitteR::search_twitter_and_store() is run to find tweets containing either “Dengue” or “dengue”. The table_name argument takes the name of the table where the tweets should be stored; the rest of the arguments passed to the function are the same as for searchTwitter().
# getting the data
register_sqlite_backend("dengue_tweets_db") # if db doesn't exist, it will be created.
## Loading required namespace: RSQLite
The previous code chunk creates a connection to the database. The next code chunk downloads the tweets from twitter and writes them to the database.
dengue_raw = search_twitter_and_store("dengue+Dengue", table_name = "dengue_tweets_db", lang = "en")
In the above code chunk, we create a database called dengue_tweets_db and store the results of the query in it. (We can run twitteR::getCurRateLimitInfo(resources = "search") to see how many queries remain after the function call in the previous code chunk.) Beware: the query can take some time to run. The above code chunk used up 50 queries, since it pulled 50 * 100 (tweets per query) = 5000 tweets. (Note: I ran into some issues when trying to set the number of tweets manually in the function call, so I left it at the default of 5000.)
# loading the downloaded tweets
dengue_tweets_db = load_tweets_db(table_name = "dengue_tweets_db") %>% twListToDF()
The above chunk loads the downloaded tweets from the database where they were stored. The twitteR::twListToDF() function converts the downloaded tweets and corresponding metadata into a data frame for ease of analysis. We can have a look at the contents of this data frame.
dplyr::glimpse(dengue_tweets_db)
## Observations: 5,000
## Variables: 16
## $ text <chr> "RT @DaminiNath: 284 new cases of #dengue report...
## $ favorited <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ favoriteCount <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, ...
## $ replyToSN <chr> NA, NA, NA, "ArvindKejriwal", NA, NA, NA, NA, "Y...
## $ created <dttm> 2016-09-05 11:56:35, 2016-09-05 11:52:24, 2016-...
## $ truncated <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ replyToSID <chr> NA, NA, NA, NA, NA, NA, NA, NA, "772761793747791...
## $ id <chr> "772765365185064960", "772764316147683328", "772...
## $ replyToUID <chr> NA, NA, NA, "405427035", NA, NA, NA, NA, "206592...
## $ statusSource <chr> "<a href=\"http://twitter.com/download/android\"...
## $ screenName <chr> "manju57943813", "AgGuillergan", "khanimambobar"...
## $ retweetCount <dbl> 21, 0, 8, 0, 0, 3, 10, 2, 1, 0, 3, 1, 0, 1, 3, 0...
## $ isRetweet <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ retweeted <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ longitude <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ latitude <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
The twitter data frame contains a lot of information. For the text analyses conducted below, the first column is probably the most relevant, as it contains the text of the tweets. The other columns can also be useful: for example, using the retweet count to measure the impact of a tweet, checking whether a tweet was retweeted, or obtaining the spatial coordinates of the entity behind a particular tweet (a brief example follows the tweet preview below). For the time being, we restrict ourselves to the text of the tweets (stored in the variable dengue_tweets), which is processed in the next section.
dengue_tweets = dengue_tweets_db$text
head(dengue_tweets)
## [1] "RT @DaminiNath: 284 new cases of #dengue reported in the past week. This year so far Delhi has registered 771 cases, as per municipal corpo…"
## [2] "Is it DENGUE or what?"
## [3] "RT @Mozziebites: I'll try to tweet a few things during the mosquito/arbovirus meeting so keep an eye on #MCAA2016! #zika #dengue https://t.…"
## [4] "@ArvindKejriwal sir dengue & chikunguya is spreading in delhi like anything u must call a meeting 4 action\nRegards\nShailendra"
## [5] "Top story: Science & Research H5N1: Brazil: Recife no longer in epidemics of de… https://t.co/TbGsmNgj1s, see more https://t.co/RGUABS93Wx"
## [6] "RT @HospitalsApollo: #Dengue is a #mosquito borne infection that causes typical flu-like illness. Read on: https://t.co/ZOZnBzz23a https://…"
We apply some common functions to process the text data. The main packages in R for text mining/analysis are qdap and tm. Both have functions that can be used to clean up text (remove punctuation, hyperlinks and numbers, convert numeric values into words, etc.).
For preprocessing/cleaning the tweets, we can write a cleaning function that applies these operations to any dataset. This is coded below and described following the code chunk.
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(tm))
suppressPackageStartupMessages(library(qdap))
removeEmoji = function(vec) {
  # unicode ranges for emojis and other related symbols, taken from
  # https://stackoverflow.com/questions/24672834/how-do-i-remove-emoji-from-string
  # emoticons
  vec = gsub(pattern = "[\U0001F600-\U0001F64F]", "", x = vec)
  # symbols and pictographs
  vec = gsub(pattern = "[\U0001F300-\U0001F5FF]", "", x = vec)
  # dingbats (the block is U+2700-U+27BF)
  vec = gsub(pattern = "[\U00002700-\U000027BF]", "", x = vec)
  return(vec)
}
# processing the tweets
clean_tweets = function(corpus) {
  # removeEmoji has to be defined
  corpus = tm_map(corpus, content_transformer(removeEmoji))
  corpus = tm_map(corpus, stripWhitespace)
  corpus = tm_map(corpus, removePunctuation)
  corpus = tm_map(corpus, removeNumbers)
  corpus = tm_map(corpus, content_transformer(tolower))
  corpus = tm_map(corpus, removeWords, c("Dengue", "dengue"))
  # removing most common english stop words
  # stopwords("en")
  corpus = tm_map(corpus, removeWords, stopwords("en"))
  # stemming the document (processing the words to have the same root)
  corpus = tm_map(corpus, stemDocument)
  return(corpus)
}
The removeEmoji function uses regular expressions to search for the Unicode characters corresponding to emojis and other symbols in the tweets. Of course, removing these would be counterproductive if sentiment analysis were the goal. Cleaning emojis and symbols from tweets was painful because of the errors encountered; details are given in the comments in the accompanying R script. The fix was converting the tweets from ASCII to UTF-8, which did away with the errors, and any incorrectly parsed characters were then removed by the removeEmoji function.
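A tiny check of removeEmoji() on a made-up string (a sketch; \U0001F622 is a crying-face emoji, which falls in the first range above, and the example assumes the locale handles UTF-8):
removeEmoji("dengue fever is awful \U0001F622 stay safe")
# expected: "dengue fever is awful  stay safe" (the leftover double space is
# collapsed later by stripWhitespace() inside clean_tweets())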
Most of the cleaning functions applied to the corpus are self-explanatory. The removeWords function with the argument stopwords("en") removes the most common stopwords in English. The stemDocument function performs word stemming, which truncates words sharing a common root and replaces them with a single stem. For example (taken from the linked Wiki article), “fishing”, “fished” and “fisher” are replaced by their root word: fish. It should be noted that the word stem need not be a word on its own. More examples are found on the linked wikipedia page.
# top 30 common stopwords in english and german
stopwords("en") %>% head(30)
## [1] "i" "me" "my" "myself" "we"
## [6] "our" "ours" "ourselves" "you" "your"
## [11] "yours" "yourself" "yourselves" "he" "him"
## [16] "his" "himself" "she" "her" "hers"
## [21] "herself" "it" "its" "itself" "they"
## [26] "them" "their" "theirs" "themselves" "what"
stopwords("de") %>% head(30)
## [1] "aber" "alle" "allem" "allen" "aller" "alles" "als"
## [8] "also" "am" "an" "ander" "andere" "anderem" "anderen"
## [15] "anderer" "anderes" "anderm" "andern" "anderr" "anders" "auch"
## [22] "auf" "aus" "bei" "bin" "bis" "bist" "da"
## [29] "damit" "dann"
This removes the most common words. However, there might be other words that appear in almost all the documents but are not stop words. Handling these (weighting terms) is considered later on. The tweets are stored in a corpus and then cleaned/processed. A corpus is a structured set of texts.
dengue_corpus = dengue_tweets %>%
  iconv(., "ASCII", "UTF-8", sub = "") %>% # converts tweets to UTF-8
  VectorSource() %>% # tells R to consider each tweet as a document
  VCorpus() # V := volatile, which tells R to store the corpus in memory (oppo: PCorpus)
# clean the corpus
dengue_corpus_clean = clean_tweets(dengue_corpus)
dengue_corpus_clean
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 5000
This shows that our corpus has 5000 documents (i.e., tweets). Next up, various visualizations are created and the contents of the tweets are explored.
From the corpus, we can create two types of matrices: the DocumentTermMatrix (DTM) and the TermDocumentMatrix (TDM). The DTM has one row per document and one column per word. The TDM is the transpose of the DTM: the documents become the columns and the words the rows.
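As a small illustration of the two layouts (a sketch with two made-up “documents”; toy_corpus is just an illustrative name):
toy_corpus = VCorpus(VectorSource(c("dengue fever delhi", "dengue vaccine news")))
dim(DocumentTermMatrix(toy_corpus)) # documents x terms
dim(TermDocumentMatrix(toy_corpus)) # terms x documents, i.e., the transpose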
# creating DocumentTermMatrix from the dengue corpus
dengue_dtm = DocumentTermMatrix(dengue_corpus_clean)
dengue_dtm
## <<DocumentTermMatrix (documents: 5000, terms: 6014)>>
## Non-/sparse entries: 48788/30021212
## Sparsity : 100%
## Maximal term length: 40
## Weighting : term frequency (tf)
From the output, we see that the DTM is highly sparse (100% after rounding) and has only 48788 non-sparse (i.e., non-zero) entries out of roughly 30 million! Each column corresponds to a word that occurs in at least one document, and the row indices of the non-zero entries in that column indicate the documents in which the word occurs. A small submatrix is printed below.
inspect(dengue_dtm[1:10,1:10])
## <<DocumentTermMatrix (documents: 10, terms: 10)>>
## Non-/sparse entries: 0/100
## Sparsity : 100%
## Maximal term length: 13
## Weighting : term frequency (tf)
##
## Terms
## Docs aad aajtak aakashgauttam aakashsarcasm aam aamaadmi aamaadmiparti
## 1 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0
## 7 0 0 0 0 0 0 0
## 8 0 0 0 0 0 0 0
## 9 0 0 0 0 0 0 0
## 10 0 0 0 0 0 0 0
## Terms
## Docs aamadamiparti aap aapdelhidiari
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## 7 0 0 0
## 8 0 0 0
## 9 0 0 0
## 10 0 0 0
All the entries in this submatrix are 0 as a result of the sparsity. The term/word frequencies are obtained by taking the column sums of the DocumentTermMatrix (i.e., summing the entries in each column), whereas for the TermDocumentMatrix the term frequencies are obtained by summing across rows instead of columns. These are shown below.
# top n words
freq = dengue_dtm %>% as.matrix() %>%
colSums() %>% sort(., decreasing = TRUE)
freq %>% head(50)
## zika delhi fever amp vaccin
## 971 854 775 592 589
## case mosquito chikungunya clinic theeasyway
## 579 471 449 415 413
## hdfcergog polici cover diseas suffer
## 412 401 347 341 338
## relat govt hospit dispensari high
## 334 328 322 313 290
## get death make direct run
## 287 284 279 250 246
## aap today admit microcephali brazil
## 243 241 233 231 230
## prevent seen dhesimd new https
## 229 228 225 224 223
## amount studi zik infect claim
## 217 215 215 213 208
## need can area may via
## 203 202 198 198 191
## arvindkejriw httpst fight convert call
## 188 188 185 181 173
Some of these words convey useful information: a lot of the tweets mention Delhi, Zika, fever, mosquito, chikungunya, etc. Speaking to my family back home (in Delhi), I can confirm that Dengue and Chikungunya cases are indeed rather high. Arvind Kejriwal is the current chief minister of Delhi and is mentioned in some of the tweets as a result. However, since this corpus of tweets is only a fraction of the tweets captured at a specific minute of a specific day, it may not provide a complete global picture about Dengue. Furthermore, some words are quite common but may not have anything to do with Dengue: http/https (tweets were not stripped of hyperlinks; a quick fix is sketched at the end of this subsection), via, may, run, amount, etc. We can also find frequent terms using the tm::findFreqTerms() function, which takes minimum and (optionally) maximum frequencies.
findFreqTerms(dengue_dtm, 100)
## [1] "aamaadmiparti" "aap" "admit"
## [4] "aed" "amount" "amp"
## [7] "area" "arvindkejriw" "bill"
## [10] "blood" "blooddonorsin" "brazil"
## [13] "breed" "busi" "call"
## [16] "can" "care" "case"
## [19] "caus" "check" "chikungunya"
## [22] "claim" "claus" "clinic"
## [25] "convert" "copi" "cover"
## [28] "death" "delhi" "dhesimd"
## [31] "direct" "diseas" "dispensari"
## [34] "face" "famili" "fever"
## [37] "fight" "get" "gianyar"
## [40] "govt" "hdfcergog" "health"
## [43] "helppreventdengu" "high" "hospit"
## [46] "https" "httpst" "httpstco"
## [49] "increas" "indiabtl" "infect"
## [52] "like" "low" "make"
## [55] "malaria" "may" "medi"
## [58] "microcephali" "mohfwindia" "mosquito"
## [61] "need" "new" "patient"
## [64] "platelet" "pls" "polici"
## [67] "prevent" "read" "relat"
## [70] "rise" "run" "satyendarjain"
## [73] "seen" "sep" "sever"
## [76] "spread" "studi" "submit"
## [79] "suffer" "theeasyway" "today"
## [82] "use" "vaccin" "via"
## [85] "virus" "will" "zik"
## [88] "zika"
It displays the terms that appear at least 100 times, although it does not list the corresponding frequencies.
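If the counts are wanted alongside the terms, the freq vector computed earlier already holds them; subsetting it gives the same set of terms together with their frequencies (a sketch):
# terms appearing at least 100 times, with their counts
freq[freq >= 100] %>% head(10)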
A popular visualization tool is the wordcloud. It is displayed below.
library(wordcloud)
wordcloud(names(freq), freq, max.words = 150)
For those from Delhi, a lot of the terms make sense, which may not be the case for those outside Delhi. Examples: arvindkejriw, jpnadda (Union minister for health), aap, mohali, satyendarjain, cmodilli, aamaadmiparti, narendramodi, mohfwindia, swasthabharat, and so on.
The wordcloud() function takes several arguments, with colors and max.words being among the most important ones. We can use the colour palettes available in the R package RColorBrewer, which can be viewed with RColorBrewer::display.brewer.all(). This is shown below:
suppressPackageStartupMessages(library(RColorBrewer))
RColorBrewer::display.brewer.all()
We replot the wordcloud with a few different colour palettes, which can be accessed through the brewer.pal() function.
pal1 = brewer.pal(9, "GnBu")[-(1:3)]
wordcloud(names(freq), freq, max.words = 100, colors = pal1)
## Warning in wordcloud(names(freq), freq, max.words = 100, colors = pal1):
## chikungunya could not be fit on page. It will not be plotted.
pal2 = brewer.pal(9, "OrRd")[-(1:3)]
wordcloud(names(freq), freq, max.words = 50, colors = pal2)
pal3 = brewer.pal(8, "Dark2")
wordcloud(names(freq), freq, max.words = 50, colors = pal3)
We can plot the term frequencies as a bar chart too, although personally I find the wordcloud more visually appealing as well as more interpretable by a broader audience. Nevertheless, a bar chart is shown below.
# top 20 terms
n_words = 20
freq_df = data.frame(name = names(freq), freq = freq, row.names = 1:length(freq)) %>%
  dplyr::mutate(name = as.character(name)) %>% # coercing factor to character
  dplyr::slice(1:n_words) %>%
  dplyr::arrange(dplyr::desc(freq))
ggplot(data = freq_df, aes(x = reorder(name, freq), y = freq)) +
geom_bar(stat = "identity") + theme_bw() + coord_flip() +
ylab("Word Count") + xlab("Term")
So far, we have only analyzed single words and their frequencies. A limitation of this approach is that single words may not convey as much information as phrases or, more formally, n-grams, where n is the number of tokens in the phrase. For example, a bigram consists of 2 words: if the unigrams show a high occurrence of the term “fever”, then the bigrams may contain “high fever”, which conveys more information than the unigram alone. Similarly, a trigram (n = 3) may contain the phrase “very high fever”. Thus, creating a wordcloud of bigrams or trigrams may provide more information about the object of interest.
Creating \(n\)-grams is rather straightforward. This is accomplished by using the NGramTokenizer() function from the R package RWeka. We create the following two functions for using Weka to create 2- and 3-gram tokens.
# gives information about the Weka_control options for a given Weka function
RWeka::WOW("NGramTokenizer")
## -max <int>
## The max size of the Ngram (default = 3).
## Number of arguments: 1.
## -min <int>
## The min size of the Ngram (default = 1).
## Number of arguments: 1.
## -delimiters <value>
## The delimiters to use (default ' \r\n\t.,;:'"()?!').
## Number of arguments: 1.
## trying word clouds with 2- and 3-grams
## use RWeka::NGramTokenizer
bigram_token = function(x) RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 2))
trigram_token = function(x) RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 3, max = 3))
The Weka_control() function takes a list of further arguments, with min and max being the most important ones here: they specify the n in the \(n\)-gram.
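As a quick sanity check, the tokenizers can be called directly on a plain character string (a sketch; the sentence is made up):
bigram_token("dengue fever cases rise in delhi")
# should return the consecutive two-word phrases:
# "dengue fever" "fever cases" "cases rise" "rise in" "in delhi"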
# throwing some error; needs to be fixed
bigram_tdm = TermDocumentMatrix(dengue_corpus_clean, control = list(tokenize = bigram_token))
load("bigram_tdm.RData") # run in the main script and saved there
bigram_tdm
## <<TermDocumentMatrix (terms: 13459, documents: 5000)>>
## Non-/sparse entries: 48809/67246191
## Sparsity : 100%
## Maximal term length: 46
## Weighting : term frequency (tf)
tdm_matrix = as.matrix(bigram_tdm)
bigram_freq = rowSums(tdm_matrix) %>% sort(., decreasing = TRUE)
bigram_freq %>% head(50)
##  hdfcergog theeasyway          fever clinic          polici cover
##                   403                   391                   301
##      dispensari fever             seen high            rt dhesimd
##                   235                   226                   225
##    relat microcephali           admit today          amount death
##                   217                   216                   216
##          brazil admit          direct relat           high amount
##                   216                   216                   216
##   microcephali brazil            today seen           zika direct
##                   216                   216                   216
##             death zik          dhesimd zika              govt run
##                   215                   215                   171
##        run dispensari              aap make            new vaccin
##                   171                   149                   140
##       make dispensari            vaccin may            delhi govt
##                   129                   123                   108
##         rt mohfwindia               cm busi              delhi cm
##                   107                   103                   103
##         famili polici            get famili             bill copi
##                   101                   101                   100
##            care relat            check medi          claim polici
##                   100                   100                   100
##          claus polici           copi submit             cover get
##                   100                   100                   100
##              get bill            medi claim           polici care
##                   100                   100                   100
##            read claus          submit claim      theeasyway check
##                   100                   100                   100
##        theeasyway get     theeasyway polici       theeasyway read
##                   100                   100                   100
##           rt indiabtl              call sep        mosquito breed
##                    99                    97                    97
##              pls call          convert govt
##                    97                    96
The top bigram is “hdfcergog theeasyway”, which does not appear to have much to do with dengue; perhaps it relates to a general insurance product offered by HDFC Ergo. Googling “rt dhesimd” turns up the following twitter profile of an MD working on Zika, and “rt” stands for retweet. The top 20 bigrams are visualized as a wordcloud below.
wordcloud(names(bigram_freq), bigram_freq, max.words = 20, col = pal3, scale = c(1,2))
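Since “rt” and retweeted text dominate several of these phrases, it may be worth rebuilding the corpus with retweets dropped. From the glimpse output earlier, the isRetweet flag appears to be FALSE even for tweets whose text begins with “RT @”, so a simple text filter is sketched instead (dengue_tweets_no_rt is just an illustrative name):
# keep only tweets that do not start with the manual "RT @user:" prefix
dengue_tweets_no_rt = dengue_tweets[!grepl("^RT @", dengue_tweets)]
length(dengue_tweets_no_rt) # how many non-"RT" tweets remain
# the corpus / TDM steps above could then be rerun on this subset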
Similar to the bigram subsection, we can construct and plot the corresponding wordclouds for the trigrams as well. This is done using the following code.
# throwing some error
trigram_tdm = TermDocumentMatrix(dengue_corpus_clean, control = list(tokenize = trigram_token))
load("trigram_tdm.RData")
trigram_tdm
## <<TermDocumentMatrix (terms: 13432, documents: 5000)>>
## Non-/sparse entries: 43861/67116139
## Sparsity : 100%
## Maximal term length: 59
## Weighting : term frequency (tf)
# top n phrases
trigram_tdm %>% as.matrix() %>% rowSums() %>% sort(., decreasing = TRUE) %>% head(10)
##   dispensari fever clinic          admit today seen
##                        230                       216
##        brazil admit today direct relat microcephali
##                        216                       216
##         high amount death microcephali brazil admit
##                        216                       216
## relat microcephali brazil          seen high amount
##                        216                       216
##           today seen high         zika direct relat
##                        216                       216
# wordcloud
trigram_tdm %>% as.matrix() %>% rowSums() %>% sort(., decreasing = TRUE) %>%
wordcloud(names(.), ., max.words = 20, col = pal3, scale = c(1,2))
In this case, the bigrams and trigrams do not seem to be particularly informative compared to the unigrams.
There are various methods to assess the importance of terms in a corpus. One such method is TF-IDF, which stands for Term Frequency - Inverse Document Frequency; the details are given in the Wikipedia article on the topic. So far, our approach has been based on term frequency (TF) alone: it simply counts the number of times a term appears in each document and then sums these counts across the documents in the corpus.
The TF is multiplied by a factor, called the inverse-document frequency (IDF), which takes into account how common a word is across documents. Mathematically, it is the log of the ratio of the total number of documents in the corpus divided by the number of documents that contain that term. Ideally, this should give less weight to the most common terms across the documents.
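In its simplest form, the weight for a term \(t\) in document \(d\) is \(\mathrm{tf}(t,d) \times \log(N/n_t)\), where \(N\) is the number of documents and \(n_t\) the number of documents containing \(t\); tm’s weightTfIdf implements a normalized variant of this (see ?weightTfIdf for the exact definition). A hand-rolled IDF factor for a single term, as a rough sanity check (a sketch; “mosquito” is just an example term):
N   = nrow(dengue_dtm)                             # number of documents (tweets)
n_t = sum(as.matrix(dengue_dtm)[, "mosquito"] > 0) # documents containing the term
log(N / n_t)                                       # the IDF factor (natural log here)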
It is simple to take this weighting into account by passing control = list(weighting = weightTfIdf) to the TermDocumentMatrix() function. A wordcloud of unigrams is created below using this weighting scheme.
uni_tdm_tfidf = TermDocumentMatrix(dengue_corpus_clean, control = list(weighting = weightTfIdf))
## Warning in weighting(x): empty document(s): 2 1824
uni_tdm_tfidf_freq = uni_tdm_tfidf %>% as.matrix() %>%
rowSums() %>% sort(., decreasing = TRUE)
# with TFIDF
uni_tdm_tfidf_freq %>% head(10)
## fever zika vaccin delhi theeasyway hdfcergog
## 223.5757 223.3324 212.7682 211.8991 207.6044 207.3561
## polici case cover get
## 205.3529 181.6139 177.2426 157.3579
# only TF
freq %>% head(10)
## zika delhi fever amp vaccin case
## 971 854 775 592 589 579
## mosquito chikungunya clinic theeasyway
## 471 449 415 413
# bar chart
data.frame(name = names(uni_tdm_tfidf_freq), freq = uni_tdm_tfidf_freq) %>%
  dplyr::slice(1:n_words) %>%
  ggplot(data = ., aes(x = reorder(name, freq), y = freq)) +
  geom_bar(stat = "identity") + theme_bw() + coord_flip() +
  ylab("Word Count") + xlab("Term") + ggtitle("Barplot for Terms with TFIDF Weighting")
# unigram wordcloud with weighting
par(mfrow = c(1,2))
wordcloud(names(freq), freq, max.words = 50, col = pal3, scale = c(1,2))
## Warning in wordcloud(names(freq), freq, max.words = 50, col = pal3, scale =
## c(1, : prevent could not be fit on page. It will not be plotted.
title("Weight: TF")
uni_tdm_tfidf_freq %>% wordcloud(names(.), ., max.words = 50, col = pal3, scale = c(1,2))
title(main = "Weight: TFIDF")
We can see that the top 10 terms have changed between the two weighting schemes. In this case, mosquito drops out of the top 10 words under the new weighting scheme. This is understandable, since mosquito does not contribute additional information for someone searching for dengue: dengue is caused by mosquitoes. This kind of downweighting is useful for frequent, uninformative terms when we do not know in advance exactly which terms these are; if we did know them, we would simply add them to the removeWords() call in clean_tweets().
This concludes the first part of this article. A follow up article will contain different visualizations, sentiment analysis, correlations with different but similar terms (for example comparing Dengue and Zika), using a larger corpus of tweets, etc.
The static version will be posted on github, and the dynamic version of this article will be hosted on www.shinyapps.io.