Ch11. Text Classification

Learning Objectives

  1. Understand the task of text classification and learn about its applications

  2. Learn about a basic automated way of text classification

  3. Practice the lexicon-based analysis for sentiment classification of COVID-19 Tweets

The Logic of Text Classification

Text classification is the task of assigning one or more predefined categories to each text. The key word here is “predefined”. A typical example is sentiment classification of online reviews, where a text classifier automatically assigns each review to one of the predefined categories, such as positive or negative. In other words, we already know which categories the texts are to be sorted into; what text classification adds is an algorithm that learns how to sort them. In doing so, the classifier identifies linguistic features that are associated with each category: words such as happy and awesome are associated with positive sentiment, whereas words such as poor and shit are associated with negative sentiment.

Text Classification vs. Text Clustering

Text clustering is the task of grouping texts into categories that are not known a priori. Given a set of texts, a clustering system identifies that certain texts are more similar to one another than to others and assigns them to the same cluster, but it does not label that cluster. The number of clusters that a collection of texts will be split into is often not known in advance either. That is why text clustering is called unsupervised machine learning. Text classification, by contrast, is supervised machine learning: the machine is told in advance, through labeled examples, which categories the texts can be assigned to.
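The unsupervised side of this contrast can be sketched with kmeans() from base R. The small word-count matrix below is invented for illustration; the point is only that the algorithm returns cluster numbers, not category names.

```r
# Toy document-term counts (invented): rows are texts, columns are words
dtm <- matrix(c(3, 2, 0, 0,
                2, 3, 0, 1,
                0, 0, 3, 2,
                0, 1, 2, 3),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("text", 1:4),
                              c("happy", "thank", "sad", "poor")))

# Ask for two clusters; no labels are supplied at any point
set.seed(1)
clusters <- kmeans(dtm, centers = 2, nstart = 10)$cluster
clusters
```

text1 and text2 end up in one cluster and text3 and text4 in the other, but which group is called 1 and which is called 2 is arbitrary. Deciding that one cluster means "positive" is left to the analyst, whereas a classifier is given that label from the start.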

History of Text Classification

In the early days, texts were classified manually by “domain experts” who were familiar with the topics of the texts being classified. For example, given our collection of tweets about COVID-19, we could carefully read all the tweets and manually assign each tweet one or more sentiment categories. This approach tends to be highly accurate, particularly when the data set is relatively small and the team of annotators is also small, which limits inconsistency among annotators. But it has a critical limitation: it does not scale when the number of documents to be classified is very large. Imagine how long it would take to read 1,000,000 tweets to classify their sentiment.

The next step in the history of text classification was rule-based systems, which used queries consisting of combinations of words to determine the category of a text. For instance, a rule could state that any tweet containing the words safe and thank belongs to the texts expressing positive sentiment. The accuracy of such systems is also high, but they suffer from a scalability (applicability) issue because building and maintaining the rules is an expensive process.
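A rule of this kind can be sketched in a few lines of base R. The tweets and the helper name classify_by_rule() are made up for illustration, and the single rule stands in for what would be a much larger hand-written rule set in a real system.

```r
# Toy tweets (invented for illustration)
tweets <- c("Stay safe everyone and thank you nurses",
            "Thank you for nothing",
            "Lockdown again, so tired of this")

# Hypothetical rule: a tweet is "positive" only if it contains both
# "safe" and "thank"; anything else is left unclassified.
classify_by_rule <- function(text) {
  words <- strsplit(tolower(text), "[^a-z]+")   # split each tweet into lowercase words
  vapply(words, function(w) {
    if (all(c("safe", "thank") %in% w)) "positive" else "unclassified"
  }, character(1))
}

rule_labels <- classify_by_rule(tweets)
rule_labels
```

Only the first tweet satisfies the rule. The second tweet shows why such rules are costly to maintain: every new wording or exception needs another hand-written rule.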

After this, machine learning came into the picture, and supervised machine learning became the standard approach to text classification. Supervised machine learning offers numerous algorithms for automated classification, ranging from Naive Bayes to decision trees, random forests, and support vector machines (SVMs). These systems come at the cost of annotated (labeled) data, which is required to train the supervised algorithms. For example, there should be a set of tweets that are already classified as positive or negative. From these data, the machine learns patterns of linguistic features, which it then applies to unseen (unlabeled) tweets that need to be assigned either positive or negative sentiment.
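To make the idea of learning from labeled tweets concrete, here is a minimal hand-rolled Naive Bayes sketch in base R. The four training tweets, the helper names tokenize() and predict_label(), and the use of add-one smoothing are all assumptions made for illustration; in practice one would use an established package and far more data.

```r
# Labeled training tweets (invented for illustration)
train <- data.frame(
  text  = c("thank you stay safe", "so happy and grateful",
            "this is awful and sad", "poor response so sad"),
  label = c("positive", "positive", "negative", "negative"),
  stringsAsFactors = FALSE
)

# Split a text into lowercase words
tokenize <- function(x) {
  words <- strsplit(tolower(x), "[^a-z]+")[[1]]
  words[nzchar(words)]
}

vocab  <- unique(unlist(lapply(train$text, tokenize)))
labels <- unique(train$label)

# Word counts per label: one row per vocabulary word, one column per label
counts <- sapply(labels, function(l) {
  words <- unlist(lapply(train$text[train$label == l], tokenize))
  table(factor(words, levels = vocab))
})
dimnames(counts) <- list(vocab, labels)

# Multinomial Naive Bayes with add-one (Laplace) smoothing
log_lik   <- log(sweep(counts + 1, 2, colSums(counts + 1), "/"))
log_prior <- sapply(labels, function(l) log(mean(train$label == l)))

# Score an unseen text under each label and return the best-scoring label
predict_label <- function(text) {
  words  <- tokenize(text)
  words  <- words[words %in% vocab]   # ignore words never seen in training
  scores <- log_prior + colSums(log_lik[words, , drop = FALSE])
  names(which.max(scores))
}

predict_label("stay safe and happy")
predict_label("so sad and awful")
```

The first call returns "positive" and the second "negative". Nothing in the code states which words are positive; the association is learned entirely from the labeled examples.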

R Practice for Tweet Sentiment Classification

Lexicon-based Sentiment Analysis II

Last week, we did some lexicon-based sentiment analysis. This approach assumes that the contextual sentiment orientation of text is the sum of the sentiment orientation of emotional words in each text. So, to analyze the sentiment of a text, we consider the sentiment content of the text as the sum of the sentiment content of the individual words. We did so by adding up the individual sentiment scores for each word in the text that is matched with the words in sentiment lexicons.
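The "sum of word scores" assumption can be written out directly. The scores in toy_scores below are hypothetical and are not taken from any published lexicon; text_score() is an illustrative helper name.

```r
# Hypothetical word scores (not from a real lexicon)
toy_scores <- c(happy = 1, safe = 1, thank = 1, sad = -1, poor = -1)

# Sentiment of a text = sum of the scores of its words that appear in the lexicon
text_score <- function(text) {
  words   <- strsplit(tolower(text), "[^a-z]+")[[1]]
  matched <- toy_scores[words[words %in% names(toy_scores)]]
  sum(matched)
}

text_score("Holiday makes me happy but this song is sad")  # happy (+1) and sad (-1)
text_score("Thank you, stay safe")                         # thank (+1) and safe (+1)
text_score("not happy")                                    # only happy is matched
```

The three calls return 0, 2, and 1. The last result shows a known limit of this assumption: because each word is scored on its own, a negation such as "not" is ignored.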

Lexicon-based sentiment analysis begins with annotating words in text with a type of emotion or its intensity score in sentiment lexicons. Among a number of sentiment lexicons that provide lists of positive and negative words that can be used for evaluating the opinion or emotion in text, we use 1) Bing, 2) NRC, and 3) NRC-EIL, which are available in the textdata package.

  1. lexicon_bing() returns the Bing lexicon, one of the most popular general-purpose English sentiment lexicons, which categorizes 6,787 words in a binary fashion as either positive or negative.

  2. lexicon_nrc() returns the NRC lexicon, which is also a general purpose English sentiment lexicon. This lexicon labels 13,901 words with 10 possible categories of sentiments or emotions: “positive”, “negative”, “anger”, “anticipation”, “disgust”, “fear”, “joy”, “sadness”, “surprise”, and “trust”.

  3. lexicon_nrc_eil() returns the NRC Emotion Intensity Lexicon (NRC-EIL), which is a list of 5,814 English words and their associations with four basic emotions (anger, fear, sadness, and joy). And for a given word and emotion X, the assigned score ranges from 0 to 1. A score of 1 means that the word conveys the highest amount of emotion X. A score of 0 means that the word conveys the lowest amount of emotion X.
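As a rough sketch of how intensity scores can be used, the example below matches the words of one toy sentence against a tiny lexicon laid out with the same three columns that lexicon_nrc_eil() returns (term, score, AffectDimension). The two anger scores match the lexicon_nrc_eil() listing in this chapter; the fear and joy rows, and the sentence itself, are invented.

```r
library(dplyr)

# Toy lexicon in the three-column layout of lexicon_nrc_eil().
# "afraid" and "relieved" carry made-up scores for illustration only.
toy_eil <- tibble(
  term            = c("outraged", "furious", "afraid", "relieved"),
  score           = c(0.964, 0.929, 0.60, 0.50),
  AffectDimension = c("anger", "anger", "fear", "joy")
)

# One toy text, already split into one word per row
toy_words <- tibble(word = c("i", "am", "outraged", "and", "furious", "but", "relieved"))

# Keep the words found in the lexicon and add up their intensity per emotion
emotion_intensity <- toy_words %>%
  inner_join(toy_eil, by = c("word" = "term")) %>%
  group_by(AffectDimension) %>%
  summarise(n_words = n(), total_intensity = sum(score))

emotion_intensity
```

Unlike the binary Bing categories, the result keeps track of how strongly each emotion is expressed: here anger dominates with two matched words, while joy has a single, weaker match.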

Text Mining Workflow

Using the tidytext and textdata packages, we can do the lexicon-based sentiment analysis on our tweet data in a tidy format, that is, a format in which each row holds a single word from a tweet.

A flowchart of a typical text analysis that uses tidytext for sentiment analysis by Julia Silge

Understanding unnest_tokens() and inner_join()

To perform lexicon-based sentiment analysis, we need to have our data in a tidy format. We’ve already learned how to use unnest_tokens() to convert tweets stored in a csv file into a tidy data format with one word per row. Once the tweets are in this format, we are ready for lexicon-based sentiment analysis with inner_join().

library(dplyr)
library(tidytext)

# tibble() replaces the deprecated data_frame()
text <- tibble(id = c(1,2,3,4,5,6,7,8,9), 
               word = c("holiday","makes","me","happy","but","this","song","is","sad"))
text
## # A tibble: 9 x 2
##      id word   
##   <dbl> <chr>  
## 1     1 holiday
## 2     2 makes  
## 3     3 me     
## 4     4 happy  
## 5     5 but    
## 6     6 this   
## 7     7 song   
## 8     8 is     
## 9     9 sad
lexicon <- tibble(word = c("happy","sad","holiday","funeral"), 
                  sentiment = c("positive","negative","positive","negative"))
lexicon
## # A tibble: 4 x 2
##   word    sentiment
##   <chr>   <chr>    
## 1 happy   positive 
## 2 sad     negative 
## 3 holiday positive 
## 4 funeral negative
# keep only the rows of text whose word also appears in lexicon
inner_join(text, lexicon, by = "word")
## # A tibble: 3 x 3
##      id word    sentiment
##   <dbl> <chr>   <chr>    
## 1     1 holiday positive 
## 2     4 happy   positive 
## 3     9 sad     negative
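Putting the two functions together, the sketch below starts from raw text rather than from pre-split words. The two tweets and the object names (toy_tweets, toy_lexicon, tweet_sentiment) are invented for illustration; the lexicon repeats the four hand-made entries used above.

```r
library(dplyr)
library(tidytext)

# Two toy tweets (invented), one row per tweet
toy_tweets <- tibble(
  tweet_id = c(1, 2),
  text     = c("Holiday makes me happy", "This song is sad, so sad")
)

# Small hand-made lexicon with one word per row
toy_lexicon <- tibble(
  word      = c("happy", "sad", "holiday", "funeral"),
  sentiment = c("positive", "negative", "positive", "negative")
)

tweet_sentiment <- toy_tweets %>%
  unnest_tokens(word, text) %>%             # one lowercased word per row
  inner_join(toy_lexicon, by = "word") %>%  # keep only words found in the lexicon
  count(tweet_id, sentiment)                # matched words per tweet and category

tweet_sentiment
```

Each tweet ends up with a count of matched words per sentiment category (two positive words for the first tweet, two negative for the second). Words that are not in the lexicon, such as "makes" or "song", simply drop out at the join.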

So, what we need for sentiment analysis is a sentiment lexicon in a tidy data format. Using the textdata package, we can get the tidy data format for each lexicon with one emotional word (unigram) per row.

Bing Lexicon

library(textdata)
lexicon_bing()
## # A tibble: 6,787 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # ... with 6,777 more rows
lexicon_bing() %>% 
  count(sentiment)
## # A tibble: 2 x 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   4782
## 2 positive   2005

NRC Lexicon

lexicon_nrc()
## # A tibble: 13,901 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # ... with 13,891 more rows
lexicon_nrc() %>% 
  count(sentiment)
## # A tibble: 10 x 2
##    sentiment        n
##    <chr>        <int>
##  1 anger         1247
##  2 anticipation   839
##  3 disgust       1058
##  4 fear          1476
##  5 joy            689
##  6 negative      3324
##  7 positive      2312
##  8 sadness       1191
##  9 surprise       534
## 10 trust         1231

NRC-EIL

lexicon_nrc_eil()
## # A tibble: 5,814 x 3
##    term       score AffectDimension
##    <chr>      <dbl> <chr>          
##  1 outraged   0.964 anger          
##  2 brutality  0.959 anger          
##  3 hatred     0.953 anger          
##  4 hateful    0.94  anger          
##  5 terrorize  0.939 anger          
##  6 infuriated 0.938 anger          
##  7 violently  0.938 anger          
##  8 furious    0.929 anger          
##  9 enraged    0.927 anger          
## 10 furiously  0.927 anger          
## # ... with 5,804 more rows
lexicon_nrc_eil() %>% 
  count(AffectDimension)
## # A tibble: 4 x 2
##   AffectDimension     n
##   <chr>           <int>
## 1 anger            1483
## 2 fear             1765
## 3 joy              1268
## 4 sadness          1298

Before the sentiment analysis, let’s first describe the tweet data.

Let’s visualize the frequency of tweets over time.

  • First, we need to sum up tweets by a certain unit of time
  • For example, how frequently was the COVID-19 issue mentioned on Twitter each hour?
  • To draw this time-series graph, tweets need to be aggregated by the hour in which they were posted
  • The function floor_date() allows us to do this; setting the unit to “hour” works well for tracking the hourly change in tweets
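A quick way to see what floor_date() does is to apply it to a handful of timestamps. The three timestamps below are made up for illustration.

```r
library(lubridate)

# Three toy posting times (invented)
stamps <- ymd_hms(c("2020-04-23 16:51:11",
                    "2020-04-23 16:05:00",
                    "2020-04-23 15:59:59"))

# Round each timestamp down to the start of its hour
hours <- floor_date(stamps, unit = "hour")
hours

# Counting the rounded values gives the number of tweets per hour
table(format(hours, "%Y-%m-%d %H:%M"))
```

The first two timestamps fall into the 16:00 bin and the third into the 15:00 bin, so the table reports two tweets for 16:00 and one for 15:00.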

First, we will aggregate tweets into hour-long units of time and count the tweets in each hour.

### Collected tweets including "#covid19" or "#covid-19" or "#coronavirus" on April 23rd.

load("covid_tweets_423.RData")
covid_tweets
## # A tibble: 18,224 x 9
##    user_id status_id created_at          screen_name text  lang  country    lat
##    <chr>   <chr>     <dttm>              <chr>       <chr> <chr> <chr>    <dbl>
##  1 479491~ 12533658~ 2020-04-23 16:51:11 Vegastechh~ "@ev~ en    United~  36.0 
##  2 169480~ 12533658~ 2020-04-23 16:51:11 Coachjmorr~ "Ple~ en    United~  36.9 
##  3 215583~ 12533658~ 2020-04-23 16:51:09 KOROGLU_BA~ "@Ay~ tr    Azerba~  40.2 
##  4 744597~ 12533657~ 2020-04-23 16:51:05 FoodFocusSA "Pre~ en    South ~ -26.1 
##  5 155877~ 12533657~ 2020-04-23 16:51:01 opcionsecu~ "#AT~ es    Ecuador  -1.67
##  6 998960~ 12533657~ 2020-04-23 16:51:01 amystones4  "Tha~ en    United~  53.7 
##  7 102768~ 12533657~ 2020-04-23 16:51:00 COTACYT     "Men~ es    Mexico   23.7 
##  8 247382~ 12533657~ 2020-04-23 16:50:54 bkracing123 "The~ en    United~  53.9 
##  9 175662~ 12533657~ 2020-04-23 16:50:51 AnnStrahm   "Thi~ en    United~  37.5 
## 10 226707~ 12533656~ 2020-04-23 16:50:42 JLeonRojas  "INF~ es    Chile   -35.5 
## # ... with 18,214 more rows, and 1 more variable: lng <dbl>
library(lubridate)
covid_tweets_hours  <- covid_tweets %>% 
  mutate(hour = floor_date(created_at, unit="hour")) %>% 
  count(hour)

covid_tweets_hours
## # A tibble: 20 x 2
##    hour                    n
##    <dttm>              <int>
##  1 2020-04-22 21:00:00   986
##  2 2020-04-22 22:00:00   986
##  3 2020-04-22 23:00:00   847
##  4 2020-04-23 00:00:00   846
##  5 2020-04-23 01:00:00   836
##  6 2020-04-23 02:00:00   955
##  7 2020-04-23 03:00:00   928
##  8 2020-04-23 04:00:00   692
##  9 2020-04-23 05:00:00   682
## 10 2020-04-23 06:00:00   657
## 11 2020-04-23 07:00:00   716
## 12 2020-04-23 08:00:00   651
## 13 2020-04-23 09:00:00   662
## 14 2020-04-23 10:00:00   721
## 15 2020-04-23 11:00:00   863
## 16 2020-04-23 12:00:00  1091
## 17 2020-04-23 13:00:00  1155
## 18 2020-04-23 14:00:00  1191
## 19 2020-04-23 15:00:00  1430
## 20 2020-04-23 16:00:00  1329

Now, we are ready to visualize the time-series data counting tweets by hour.

library(ggplot2)
covid_tweets_hours %>%
  ggplot(aes(x=hour, y=n)) +
  geom_line() +
  theme_bw() +
  labs(x = NULL, y = "Hourly Sum",
       title = "Tracing the rhythm of hashtagging about COVID-19 on Twitter",
       subtitle = "Tweets (N=18,224) were aggregated in 1-hour intervals. Retweets were excluded.")

Visualization of tweeting frequency by languages

covid_tweets_lang <- covid_tweets %>% 
  count(lang, sort = TRUE)
covid_tweets_lang
## # A tibble: 55 x 2
##    lang      n
##    <chr> <int>
##  1 en    10115
##  2 es     2666
##  3 in     1485
##  4 pt      932
##  5 und     842
##  6 fr      527
##  7 hi      289
##  8 it      227
##  9 de      141
## 10 ja      138
## # ... with 45 more rows
#install.packages("ISOcodes")
library(ISOcodes)
ISO_639_2 %>% 
  dplyr::select(lang = Alpha_2, Name) %>% 
  head(10)   # preview the first 10 of the table's 487 rows
##    lang                   Name
## 1    aa                   Afar
## 2    ab              Abkhazian
## 3  <NA>               Achinese
## 4  <NA>                  Acoli
## 5  <NA>                Adangme
## 6  <NA>         Adyghe; Adygei
## 7  <NA> Afro-Asiatic languages
## 8  <NA>               Afrihili
## 9    af              Afrikaans
## 10 <NA>                   Ainu
## 166                                                                          Guarani
## 167                                                Swiss German; Alemannic; Alsatian
## 168                                                                         Gujarati
## 169                                                                         Gwich'in
## 170                                                                            Haida
## 171                                                          Haitian; Haitian Creole
## 172                                                                            Hausa
## 173                                                                         Hawaiian
## 174                                                                           Hebrew
## 175                                                                           Herero
## 176                                                                       Hiligaynon
## 177                                    Himachali languages; Western Pahari languages
## 178                                                                            Hindi
## 179                                                                          Hittite
## 180                                                                      Hmong; Mong
## 181                                                                        Hiri Motu
## 182                                                                         Croatian
## 183                                                                    Upper Sorbian
## 184                                                                        Hungarian
## 185                                                                             Hupa
## 186                                                                             Iban
## 187                                                                             Igbo
## 188                                                                        Icelandic
## 189                                                                              Ido
## 190                                                                Sichuan Yi; Nuosu
## 191                                                                    Ijo languages
## 192                                                                        Inuktitut
## 193                                                          Interlingue; Occidental
## 194                                                                            Iloko
## 195                       Interlingua (International Auxiliary Language Association)
## 196                                                                  Indic languages
## 197                                                                       Indonesian
## 198                                                          Indo-European languages
## 199                                                                           Ingush
## 200                                                                          Inupiaq
## 201                                                                Iranian languages
## 202                                                              Iroquoian languages
## 203                                                                          Italian
## 204                                                                         Javanese
## 205                                                                           Lojban
## 206                                                                         Japanese
## 207                                                                    Judeo-Persian
## 208                                                                     Judeo-Arabic
## 209                                                                      Kara-Kalpak
## 210                                                                           Kabyle
## 211                                                                  Kachin; Jingpho
## 212                                                         Kalaallisut; Greenlandic
## 213                                                                            Kamba
## 214                                                                          Kannada
## 215                                                                  Karen languages
## 216                                                                         Kashmiri
## 217                                                                           Kanuri
## 218                                                                             Kawi
## 219                                                                           Kazakh
## 220                                                                        Kabardian
## 221                                                                            Khasi
## 222                                                                Khoisan languages
## 223                                                                    Central Khmer
## 224                                                                 Khotanese; Sakan
## 225                                                                   Kikuyu; Gikuyu
## 226                                                                      Kinyarwanda
## 227                                                                  Kirghiz; Kyrgyz
## 228                                                                         Kimbundu
## 229                                                                          Konkani
## 230                                                                             Komi
## 231                                                                            Kongo
## 232                                                                           Korean
## 233                                                                         Kosraean
## 234                                                                           Kpelle
## 235                                                                  Karachay-Balkar
## 236                                                                         Karelian
## 237                                                                    Kru languages
## 238                                                                           Kurukh
## 239                                                               Kuanyama; Kwanyama
## 240                                                                            Kumyk
## 241                                                                          Kurdish
## 242                                                                          Kutenai
## 243                                                                           Ladino
## 244                                                                           Lahnda
## 245                                                                            Lamba
## 246                                                                              Lao
## 247                                                                            Latin
## 248                                                                          Latvian
## 249                                                                         Lezghian
## 250                                                 Limburgan; Limburger; Limburgish
## 251                                                                          Lingala
## 252                                                                       Lithuanian
## 253                                                                            Mongo
## 254                                                                             Lozi
## 255                                                     Luxembourgish; Letzeburgesch
## 256                                                                       Luba-Lulua
## 257                                                                     Luba-Katanga
## 258                                                                            Ganda
## 259                                                                          Luiseno
## 260                                                                            Lunda
## 261                                                         Luo (Kenya and Tanzania)
## 262                                                                           Lushai
## 263                                                                       Macedonian
## 264                                                                         Madurese
## 265                                                                           Magahi
## 266                                                                      Marshallese
## 267                                                                         Maithili
## 268                                                                          Makasar
## 269                                                                        Malayalam
## 270                                                                         Mandingo
## 271                                                                            Maori
## 272                                                           Austronesian languages
## 273                                                                          Marathi
## 274                                                                            Masai
## 275                                                                            Malay
## 276                                                                           Moksha
## 277                                                                           Mandar
## 278                                                                            Mende
## 279                                                         Irish, Middle (900-1200)
## 280                                                                  Mi'kmaq; Micmac
## 281                                                                      Minangkabau
## 282                                                                Uncoded languages
## 283                                                              Mon-Khmer languages
## 284                                                                         Malagasy
## 285                                                                          Maltese
## 286                                                                           Manchu
## 287                                                                         Manipuri
## 288                                                                 Manobo languages
## 289                                                                           Mohawk
## 290                                                                        Mongolian
## 291                                                                            Mossi
## 292                                                               Multiple languages
## 293                                                                  Munda languages
## 294                                                                            Creek
## 295                                                                        Mirandese
## 296                                                                          Marwari
## 297                                                                  Mayan languages
## 298                                                                            Erzya
## 299                                                                Nahuatl languages
## 300                                                  North American Indian languages
## 301                                                                       Neapolitan
## 302                                                                            Nauru
## 303                                                                   Navajo; Navaho
## 304                                                    Ndebele, South; South Ndebele
## 305                                                    Ndebele, North; North Ndebele
## 306                                                                           Ndonga
## 307                                   Low German; Low Saxon; German, Low; Saxon, Low
## 308                                                                           Nepali
## 309                                                              Nepal Bhasa; Newari
## 310                                                                             Nias
## 311                                                      Niger-Kordofanian languages
## 312                                                                           Niuean
## 313                                            Norwegian Nynorsk; Nynorsk, Norwegian
## 314                                              Bokmal, Norwegian; Norwegian Bokmal
## 315                                                                            Nogai
## 316                                                                       Norse, Old
## 317                                                                        Norwegian
## 318                                                                             N'Ko
## 319                                                     Pedi; Sepedi; Northern Sotho
## 320                                                                 Nubian languages
## 321                              Classical Newari; Old Newari; Classical Nepal Bhasa
## 322                                                          Chichewa; Chewa; Nyanja
## 323                                                                         Nyamwezi
## 324                                                                         Nyankole
## 325                                                                            Nyoro
## 326                                                                            Nzima
## 327                                                              Occitan (post 1500)
## 328                                                                           Ojibwa
## 329                                                                            Oriya
## 330                                                                            Oromo
## 331                                                                            Osage
## 332                                                                Ossetian; Ossetic
## 333                                                     Turkish, Ottoman (1500-1928)
## 334                                                                Otomian languages
## 335                                                                 Papuan languages
## 336                                                                       Pangasinan
## 337                                                                          Pahlavi
## 338                                                            Pampanga; Kapampangan
## 339                                                                 Panjabi; Punjabi
## 340                                                                       Papiamento
## 341                                                                          Palauan
## 342                                                   Persian, Old (ca.600-400 B.C.)
## 343                                                                          Persian
## 344                                                             Philippine languages
## 345                                                                       Phoenician
## 346                                                                             Pali
## 347                                                                           Polish
## 348                                                                        Pohnpeian
## 349                                                                       Portuguese
## 350                                                                Prakrit languages
## 351                                 Provencal, Old (to 1500); Occitan, Old (to 1500)
## 352                                                                   Pushto; Pashto
## 354                                                                          Quechua
## 355                                                                       Rajasthani
## 356                                                                          Rapanui
## 357                                                   Rarotongan; Cook Islands Maori
## 358                                                                Romance languages
## 359                                                                          Romansh
## 360                                                                           Romany
## 361                                                    Romanian; Moldavian; Moldovan
## 362                                                                            Rundi
## 363                                            Aromanian; Arumanian; Macedo-Romanian
## 364                                                                          Russian
## 365                                                                          Sandawe
## 366                                                                            Sango
## 367                                                                            Yakut
## 368                                                  South American Indian languages
## 369                                                               Salishan languages
## 370                                                                Samaritan Aramaic
## 371                                                                         Sanskrit
## 372                                                                            Sasak
## 373                                                                          Santali
## 374                                                                         Sicilian
## 375                                                                            Scots
## 376                                                                           Selkup
## 377                                                                Semitic languages
## 378                                                              Irish, Old (to 900)
## 379                                                                   Sign Languages
## 380                                                                             Shan
## 381                                                                           Sidamo
## 382                                                               Sinhala; Sinhalese
## 383                                                                 Siouan languages
## 384                                                           Sino-Tibetan languages
## 385                                                                 Slavic languages
## 386                                                                           Slovak
## 387                                                                        Slovenian
## 388                                                                    Southern Sami
## 389                                                                    Northern Sami
## 390                                                                   Sami languages
## 391                                                                        Lule Sami
## 392                                                                       Inari Sami
## 393                                                                           Samoan
## 394                                                                       Skolt Sami
## 395                                                                            Shona
## 396                                                                           Sindhi
## 397                                                                          Soninke
## 398                                                                          Sogdian
## 399                                                                           Somali
## 400                                                                Songhai languages
## 401                                                                  Sotho, Southern
## 402                                                               Spanish; Castilian
## 403                                                                        Sardinian
## 404                                                                     Sranan Tongo
## 405                                                                          Serbian
## 406                                                                            Serer
## 407                                                           Nilo-Saharan languages
## 408                                                                            Swati
## 409                                                                           Sukuma
## 410                                                                        Sundanese
## 411                                                                             Susu
## 412                                                                         Sumerian
## 413                                                                          Swahili
## 414                                                                          Swedish
## 415                                                                 Classical Syriac
## 416                                                                           Syriac
## 417                                                                         Tahitian
## 418                                                                    Tai languages
## 419                                                                            Tamil
## 420                                                                            Tatar
## 421                                                                           Telugu
## 422                                                                            Timne
## 423                                                                           Tereno
## 424                                                                            Tetum
## 425                                                                            Tajik
## 426                                                                          Tagalog
## 427                                                                             Thai
## 428                                                                          Tibetan
## 429                                                                            Tigre
## 430                                                                         Tigrinya
## 431                                                                              Tiv
## 432                                                                          Tokelau
## 433                                                            Klingon; tlhIngan-Hol
## 434                                                                          Tlingit
## 435                                                                         Tamashek
## 436                                                                    Tonga (Nyasa)
## 437                                                            Tonga (Tonga Islands)
## 438                                                                        Tok Pisin
## 439                                                                        Tsimshian
## 440                                                                           Tswana
## 441                                                                           Tsonga
## 442                                                                          Turkmen
## 443                                                                          Tumbuka
## 444                                                                   Tupi languages
## 445                                                                          Turkish
## 446                                                                 Altaic languages
## 447                                                                           Tuvalu
## 448                                                                              Twi
## 449                                                                         Tuvinian
## 450                                                                           Udmurt
## 451                                                                         Ugaritic
## 452                                                                   Uighur; Uyghur
## 453                                                                        Ukrainian
## 454                                                                          Umbundu
## 455                                                                     Undetermined
## 456                                                                             Urdu
## 457                                                                            Uzbek
## 458                                                                              Vai
## 459                                                                            Venda
## 460                                                                       Vietnamese
## 461                                                                          Volapuk
## 462                                                                            Votic
## 463                                                               Wakashan languages
## 464                                                               Wolaitta; Wolaytta
## 465                                                                            Waray
## 466                                                                            Washo
## 467                                                                            Welsh
## 468                                                                Sorbian languages
## 469                                                                          Walloon
## 470                                                                            Wolof
## 471                                                                    Kalmyk; Oirat
## 472                                                                            Xhosa
## 473                                                                              Yao
## 474                                                                           Yapese
## 475                                                                          Yiddish
## 476                                                                           Yoruba
## 477                                                                  Yupik languages
## 478                                                                          Zapotec
## 479                                                Blissymbols; Blissymbolics; Bliss
## 480                                                                           Zenaga
## 481                                                      Standard Moroccan Tamazight
## 482                                                                   Zhuang; Chuang
## 483                                                                  Zande languages
## 484                                                                             Zulu
## 485                                                                             Zuni
## 486                                            No linguistic content; Not applicable
## 487                                   Zaza; Dimili; Dimli; Kirdki; Kirmanjki; Zazaki
covid_tweets_lang %>% inner_join(ISO_639_2 %>% dplyr::select(lang = Alpha_2, Name))
## # A tibble: 52 x 3
##    lang      n Name              
##    <chr> <int> <chr>             
##  1 en    10115 English           
##  2 es     2666 Spanish; Castilian
##  3 pt      932 Portuguese        
##  4 fr      527 French            
##  5 hi      289 Hindi             
##  6 it      227 Italian           
##  7 de      141 German            
##  8 ja      138 Japanese          
##  9 tl      113 Tagalog           
## 10 tr      111 Turkish           
## # ... with 42 more rows
covid_tweets_lang %>% 
  inner_join(ISO_639_2 %>% dplyr::select(lang = Alpha_2, Name)) %>% 
  mutate(language = reorder(Name, n)) %>% 
  #filter(n > 4) %>% 
  ggplot(aes(x=language, y=n)) + 
  geom_col() +
  coord_flip()
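
The inner join above keeps only the language codes that have a matching entry in the lookup table; any other code disappears from the plot without a warning. The following is a minimal sketch of how one might check for such codes with `anti_join()`. The objects `toy_lang` and `toy_iso` and all their values are made up for illustration and merely stand in for `covid_tweets_lang` and `ISO_639_2`.

```r
library(dplyr)

# Toy stand-ins for covid_tweets_lang and ISO_639_2; all values are made up
toy_lang <- tibble(lang = c("en", "es", "xx"), n = c(10L, 5L, 3L))
toy_iso  <- tibble(Alpha_2 = c("en", "es", "fr"),
                   Name    = c("English", "Spanish; Castilian", "French"))

# Rename the key column so that both tables share the column "lang"
lookup <- toy_iso %>% select(lang = Alpha_2, Name)

# Rows kept by the inner join (naming the key makes the join explicit)
matched <- toy_lang %>% inner_join(lookup, by = "lang")

# Rows the inner join drops: codes with no entry in the lookup table
unmatched <- toy_lang %>% anti_join(lookup, by = "lang")

matched
unmatched
```

If `unmatched` is not empty for the real data, those tweets are excluded from the language plot, which may or may not be what we want.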

Visualization of tweeting frequency by country

covid_tweets_country <- covid_tweets %>% 
  count(country, sort=T)
covid_tweets_country
## # A tibble: 163 x 2
##    country            n
##    <chr>          <int>
##  1 United States   4998
##  2 India           1527
##  3 Indonesia       1324
##  4 United Kingdom  1311
##  5 Brazil           942
##  6 Canada           658
##  7 Spain            640
##  8 Mexico           567
##  9 Nigeria          552
## 10 Colombia         381
## # ... with 153 more rows
covid_tweets_country %>% 
  mutate(country = reorder(country, n)) %>% 
  top_n(20,n) %>% 
  ggplot(aes(x=country, y=n)) + 
  geom_col() +
  coord_flip()
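
In recent versions of dplyr, `top_n()` is superseded by `slice_max()`. As a hedged alternative to the call above, the sketch below selects the most frequent rows with `slice_max()`. The tibble `toy_country` and its values are made up for illustration.

```r
library(dplyr)

# Toy country counts; the values are made up for illustration
toy_country <- tibble(country = c("A", "B", "C", "D"),
                      n       = c(40L, 30L, 20L, 10L))

# Keep the two rows with the largest n (ties, if any, are all kept by default)
top2 <- toy_country %>% slice_max(n, n = 2)
top2
```

With the chapter's data, the corresponding call would be `slice_max(n, n = 20)` in place of `top_n(20,n)`.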

Visualization of tweeting frequency by geo-location: We can also plot the COVID-19 tweets on a world map, using the longitude (lng) and latitude (lat) recorded for each tweet.

covid_tweets_geo <- covid_tweets %>% 
  group_by(lng,lat,country) %>% 
  summarise(sum = n()) %>% 
  # keep only locations where both coordinates are present
  filter(!is.na(lng) & !is.na(lat))
covid_tweets_geo
## # A tibble: 7,130 x 4
## # Groups:   lng, lat [7,129]
##      lng   lat country         sum
##    <dbl> <dbl> <chr>         <int>
##  1 -167.  23.7 United States     1
##  2 -162.  60.8 United States     1
##  3 -159.  22.2 United States     1
##  4 -159.  22.0 United States     1
##  5 -158.  21.4 United States     1
##  6 -158.  21.3 United States     1
##  7 -158.  21.4 United States     1
##  8 -158.  21.3 United States     1
##  9 -158.  21.3 United States     1
## 10 -158.  21.3 United States     1
## # ... with 7,120 more rows
#install.packages("maps")
#install.packages("viridis")
#install.packages("rnaturalearth")
library(maps)
library(viridis)
## Loading required package: viridisLite
library(rnaturalearth)

# World map
world_map <- map_data("world")

# Plot lat and lng points onto map
ggplot(data=world_map, aes(x=long, y=lat)) +
  geom_polygon(aes(group=group, fill=region)) +
  geom_point(data = covid_tweets_geo, aes(x=lng, y=lat, size=sum),
             colour="tomato", alpha=0.5) +
  xlab("Longitude") + ylab("Latitude") +
  ggtitle("Twitter Map on #COVID-19") +
  scale_fill_viridis_d()+
  theme_void()+
  theme(legend.position = "none")

# Mapping Europe
some.eu.countries <- c(
  "Portugal", "Spain", "France", "Switzerland", "Germany",
  "Austria", "Belgium", "UK", "United Kingdom","Netherlands", "The Netherlands",
  "Denmark", "Poland", "Italy", 
  "Croatia", "Slovenia", "Hungary", "Slovakia",
  "Czech Republic"
)

some.eu.maps <- map_data("world", region = some.eu.countries)

ggplot(data=some.eu.maps, aes(x=long, y=lat)) +
  geom_polygon(aes(group=group, fill=region)) +
  geom_point(data=covid_tweets_geo %>% filter(country %in% some.eu.countries), 
             aes(x=lng, y=lat, size=sum),
             colour="tomato", alpha=0.5) +
  ggtitle("Twitter Map on #COVID-19") +
  scale_fill_viridis_d()+
  theme_void()+
  theme(legend.position = "none")

# Mapping US
usmap <- map_data("state")
ggplot(data=usmap, aes(x=long, y=lat)) +
  geom_polygon(aes(group=group, fill=region)) +
  geom_point(data=covid_tweets_geo %>% 
               filter(country == "United States") %>% 
               filter(sum > 19), 
             aes(x=lng, y=lat, size=sum),
             colour="tomato", alpha=0.5) +
  ggtitle("Twitter Map on #COVID-19") +
  scale_fill_viridis_d()+
  theme_void()+
  theme(legend.position = "none")

Let’s preprocess tweet text

We know how to convert our tweet data into a tidy text data format.

library(stringr)
library(stopwords)
covid_tweets # This dataset contains 18,224 tweets about COVID-19 including geo-location information.
## # A tibble: 18,224 x 9
##    user_id status_id created_at          screen_name text  lang  country    lat
##    <chr>   <chr>     <dttm>              <chr>       <chr> <chr> <chr>    <dbl>
##  1 479491~ 12533658~ 2020-04-23 16:51:11 Vegastechh~ "@ev~ en    United~  36.0 
##  2 169480~ 12533658~ 2020-04-23 16:51:11 Coachjmorr~ "Ple~ en    United~  36.9 
##  3 215583~ 12533658~ 2020-04-23 16:51:09 KOROGLU_BA~ "@Ay~ tr    Azerba~  40.2 
##  4 744597~ 12533657~ 2020-04-23 16:51:05 FoodFocusSA "Pre~ en    South ~ -26.1 
##  5 155877~ 12533657~ 2020-04-23 16:51:01 opcionsecu~ "#AT~ es    Ecuador  -1.67
##  6 998960~ 12533657~ 2020-04-23 16:51:01 amystones4  "Tha~ en    United~  53.7 
##  7 102768~ 12533657~ 2020-04-23 16:51:00 COTACYT     "Men~ es    Mexico   23.7 
##  8 247382~ 12533657~ 2020-04-23 16:50:54 bkracing123 "The~ en    United~  53.9 
##  9 175662~ 12533657~ 2020-04-23 16:50:51 AnnStrahm   "Thi~ en    United~  37.5 
## 10 226707~ 12533656~ 2020-04-23 16:50:42 JLeonRojas  "INF~ es    Chile   -35.5 
## # ... with 18,214 more rows, and 1 more variable: lng <dbl>
covid_tweets_tidy <- covid_tweets %>% 
  filter(lang == "en") %>% # Selecting tweets only written in English
  mutate(hour = floor_date(created_at, unit="hour")) %>% # Creating a variable to aggregate tweets into the hour-long unit of time
  mutate(text = str_replace_all(text, "[#@]?[^[:ascii:]]+", " ")) %>% # Removing non-ASCII characters 
  mutate(text = str_replace_all(text, "&amp;|&lt;|&gt;|&quot;|RT", " ")) %>% # Removing HTML entities and the retweet marker "RT"
  unnest_tweets(word, text) %>% # Splitting text into words by unnest_tweets 
  filter(!word %in% stopwords()) %>% # Removing words matched by any element in stopwords() vector
  filter(str_detect(word, "[a-z]")) # Keeping only words that contain at least one alphabetical letter
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
covid_tweets_tidy
## # A tibble: 167,641 x 10
##    user_id status_id created_at          screen_name lang  country   lat   lng
##    <chr>   <chr>     <dttm>              <chr>       <chr> <chr>   <dbl> <dbl>
##  1 479491~ 12533658~ 2020-04-23 16:51:11 Vegastechh~ en    United~  36.0 -115.
##  2 479491~ 12533658~ 2020-04-23 16:51:11 Vegastechh~ en    United~  36.0 -115.
##  3 479491~ 12533658~ 2020-04-23 16:51:11 Vegastechh~ en    United~  36.0 -115.
##  4 479491~ 12533658~ 2020-04-23 16:51:11 Vegastechh~ en    United~  36.0 -115.
##  5 479491~ 12533658~ 2020-04-23 16:51:11 Vegastechh~ en    United~  36.0 -115.
##  6 479491~ 12533658~ 2020-04-23 16:51:11 Vegastechh~ en    United~  36.0 -115.
##  7 479491~ 12533658~ 2020-04-23 16:51:11 Vegastechh~ en    United~  36.0 -115.
##  8 479491~ 12533658~ 2020-04-23 16:51:11 Vegastechh~ en    United~  36.0 -115.
##  9 479491~ 12533658~ 2020-04-23 16:51:11 Vegastechh~ en    United~  36.0 -115.
## 10 479491~ 12533658~ 2020-04-23 16:51:11 Vegastechh~ en    United~  36.0 -115.
## # ... with 167,631 more rows, and 2 more variables: hour <dttm>, word <chr>
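
The cleaning step above can be tried on a single string before running it on the whole data set. The following is a minimal sketch, assuming the stringr package is installed; the sample tweet is made up for illustration, and the stricter pattern with `\\b` word boundaries is a suggested variant rather than the pattern used in this chapter. It shows that the plain pattern `RT` also removes those two letters inside longer uppercase words.

```r
library(stringr)

# Hypothetical sample tweet (not taken from covid_tweets)
raw <- "RT @who: ALERT Stay safe &amp; wash hands"

# Pattern used in the chapter: HTML entities and any occurrence of "RT"
loose <- str_replace_all(raw, "&amp;|&lt;|&gt;|&quot;|RT", " ")

# Assumed variant: \\b restricts "RT" to a standalone token
strict <- str_replace_all(raw, "&amp;|&lt;|&gt;|&quot;|\\bRT\\b", " ")

str_detect(loose, "ALERT")   # FALSE: the "RT" inside "ALERT" was removed
str_detect(strict, "ALERT")  # TRUE: "ALERT" is left intact
```

Whether the stricter pattern matters depends on the data; for tweets with many uppercase words it may be worth checking.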

Once the text is in the tidy format, we are ready to do sentiment analysis by joining it with a sentiment lexicon using inner_join().

covid_tweets_tidy %>% 
  inner_join(lexicon_bing()) # Joining with the bing lexicon
## Joining, by = "word"
## # A tibble: 15,936 x 11
##    user_id status_id created_at          screen_name lang  country   lat     lng
##    <chr>   <chr>     <dttm>              <chr>       <chr> <chr>   <dbl>   <dbl>
##  1 479491~ 12533658~ 2020-04-23 16:51:11 Vegastechh~ en    United~  36.0 -115.  
##  2 479491~ 12533658~ 2020-04-23 16:51:11 Vegastechh~ en    United~  36.0 -115.  
##  3 169480~ 12533658~ 2020-04-23 16:51:11 Coachjmorr~ en    United~  36.9  -81.1 
##  4 998960~ 12533657~ 2020-04-23 16:51:01 amystones4  en    United~  53.7   -1.65
##  5 998960~ 12533657~ 2020-04-23 16:51:01 amystones4  en    United~  53.7   -1.65
##  6 247382~ 12533657~ 2020-04-23 16:50:54 bkracing123 en    United~  53.9   -1.21
##  7 247382~ 12533657~ 2020-04-23 16:50:54 bkracing123 en    United~  53.9   -1.21
##  8 175662~ 12533657~ 2020-04-23 16:50:51 AnnStrahm   en    United~  37.5 -121.  
##  9 314986~ 12533656~ 2020-04-23 16:50:39 dande_hema~ en    India    15.9   80.8 
## 10 314986~ 12533656~ 2020-04-23 16:50:39 dande_hema~ en    India    15.9   80.8 
## # ... with 15,926 more rows, and 3 more variables: hour <dttm>, word <chr>,
## #   sentiment <chr>
covid_tweets_tidy %>% 
  inner_join(lexicon_nrc())
## Joining, by = "word"
## # A tibble: 57,217 x 11
##    user_id status_id created_at          screen_name lang  country   lat    lng
##    <chr>   <chr>     <dttm>              <chr>       <chr> <chr>   <dbl>  <dbl>
##  1 479491~ 12533658~ 2020-04-23 16:51:11 Vegastechh~ en    United~  36.0 -115. 
##  2 479491~ 12533658~ 2020-04-23 16:51:11 Vegastechh~ en    United~  36.0 -115. 
##  3 479491~ 12533658~ 2020-04-23 16:51:11 Vegastechh~ en    United~  36.0 -115. 
##  4 479491~ 12533658~ 2020-04-23 16:51:11 Vegastechh~ en    United~  36.0 -115. 
##  5 479491~ 12533658~ 2020-04-23 16:51:11 Vegastechh~ en    United~  36.0 -115. 
##  6 479491~ 12533658~ 2020-04-23 16:51:11 Vegastechh~ en    United~  36.0 -115. 
##  7 479491~ 12533658~ 2020-04-23 16:51:11 Vegastechh~ en    United~  36.0 -115. 
##  8 479491~ 12533658~ 2020-04-23 16:51:11 Vegastechh~ en    United~  36.0 -115. 
##  9 479491~ 12533658~ 2020-04-23 16:51:11 Vegastechh~ en    United~  36.0 -115. 
## 10 169480~ 12533658~ 2020-04-23 16:51:11 Coachjmorr~ en    United~  36.9  -81.1
## # ... with 57,207 more rows, and 3 more variables: hour <dttm>, word <chr>,
## #   sentiment <chr>
# Joining with the NRC-EIL. The key columns have different names in the two data sets,
# so by = c("word" = "term") matches "word" in covid_tweets_tidy to "term" in lexicon_nrc_eil().
covid_tweets_tidy %>% 
  inner_join(lexicon_nrc_eil(), by=c("word"="term"))
## # A tibble: 23,895 x 12
##    user_id status_id created_at          screen_name lang  country   lat     lng
##    <chr>   <chr>     <dttm>              <chr>       <chr> <chr>   <dbl>   <dbl>
##  1 479491~ 12533658~ 2020-04-23 16:51:11 Vegastechh~ en    United~  36.0 -115.  
##  2 479491~ 12533658~ 2020-04-23 16:51:11 Vegastechh~ en    United~  36.0 -115.  
##  3 479491~ 12533658~ 2020-04-23 16:51:11 Vegastechh~ en    United~  36.0 -115.  
##  4 169480~ 12533658~ 2020-04-23 16:51:11 Coachjmorr~ en    United~  36.9  -81.1 
##  5 169480~ 12533658~ 2020-04-23 16:51:11 Coachjmorr~ en    United~  36.9  -81.1 
##  6 247382~ 12533657~ 2020-04-23 16:50:54 bkracing123 en    United~  53.9   -1.21
##  7 175662~ 12533657~ 2020-04-23 16:50:51 AnnStrahm   en    United~  37.5 -121.  
##  8 175662~ 12533657~ 2020-04-23 16:50:51 AnnStrahm   en    United~  37.5 -121.  
##  9 175662~ 12533657~ 2020-04-23 16:50:51 AnnStrahm   en    United~  37.5 -121.  
## 10 175662~ 12533657~ 2020-04-23 16:50:51 AnnStrahm   en    United~  37.5 -121.  
## # ... with 23,885 more rows, and 4 more variables: hour <dttm>, word <chr>,
## #   score <dbl>, AffectDimension <chr>
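
To see what these joins do in miniature, here is a small sketch, assuming dplyr and tibble are installed. The objects `toy_tidy`, `toy_lexicon`, and `toy_lexicon_term` are hypothetical stand-ins for the tidy tweet data and the lexicons, not part of the chapter's data.

```r
library(dplyr)
library(tibble)

# Hypothetical tidy tokens: one row per word per tweet
toy_tidy <- tibble(
  status_id = c("t1", "t1", "t1", "t2", "t2"),
  word      = c("stay", "safe", "virus", "great", "news")
)

# Hypothetical mini lexicon with a word column and a sentiment column
toy_lexicon <- tibble(
  word      = c("safe", "virus", "great"),
  sentiment = c("positive", "negative", "positive")
)

# inner_join keeps only the rows whose word appears in the lexicon
toy_joined <- inner_join(toy_tidy, toy_lexicon, by = "word")
toy_joined

# Joining on differently named key columns, as with the "term" column above
toy_lexicon_term <- rename(toy_lexicon, term = word)
toy_joined_term <- inner_join(toy_tidy, toy_lexicon_term, by = c("word" = "term"))
```

Words that are not in the lexicon ("stay" and "news" here) drop out of the result, which is why the joined tables above have far fewer rows than covid_tweets_tidy.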

Counting sentiments by hour

After joining a tidy text data set with a sentiment lexicon, we can count the sentiment variable by the time variable.

covid_tweets_tidy %>% 
  inner_join(lexicon_bing()) %>% 
  count(hour, sentiment) # Counting sentiments by hour 
## Joining, by = "word"
## # A tibble: 40 x 3
##    hour                sentiment     n
##    <dttm>              <chr>     <int>
##  1 2020-04-22 21:00:00 negative    516
##  2 2020-04-22 21:00:00 positive    525
##  3 2020-04-22 22:00:00 negative    524
##  4 2020-04-22 22:00:00 positive    507
##  5 2020-04-22 23:00:00 negative    349
##  6 2020-04-22 23:00:00 positive    418
##  7 2020-04-23 00:00:00 negative    356
##  8 2020-04-23 00:00:00 positive    405
##  9 2020-04-23 01:00:00 negative    370
## 10 2020-04-23 01:00:00 positive    397
## # ... with 30 more rows
covid_tweets_tidy %>% 
  inner_join(lexicon_nrc()) %>% 
  count(hour, sentiment)
## Joining, by = "word"
## # A tibble: 200 x 3
##    hour                sentiment        n
##    <dttm>              <chr>        <int>
##  1 2020-04-22 21:00:00 anger          223
##  2 2020-04-22 21:00:00 anticipation   379
##  3 2020-04-22 21:00:00 disgust        179
##  4 2020-04-22 21:00:00 fear           384
##  5 2020-04-22 21:00:00 joy            280
##  6 2020-04-22 21:00:00 negative       577
##  7 2020-04-22 21:00:00 positive       720
##  8 2020-04-22 21:00:00 sadness        289
##  9 2020-04-22 21:00:00 surprise       201
## 10 2020-04-22 21:00:00 trust          462
## # ... with 190 more rows
covid_tweets_tidy %>% 
  inner_join(lexicon_nrc_eil(), by=c("word"="term")) %>% 
  count(hour, AffectDimension)
## # A tibble: 80 x 3
##    hour                AffectDimension     n
##    <dttm>              <chr>           <int>
##  1 2020-04-22 21:00:00 anger             237
##  2 2020-04-22 21:00:00 fear              467
##  3 2020-04-22 21:00:00 joy               536
##  4 2020-04-22 21:00:00 sadness           338
##  5 2020-04-22 22:00:00 anger             244
##  6 2020-04-22 22:00:00 fear              478
##  7 2020-04-22 22:00:00 joy               548
##  8 2020-04-22 22:00:00 sadness           330
##  9 2020-04-22 23:00:00 anger             146
## 10 2020-04-22 23:00:00 fear              320
## # ... with 70 more rows
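
The hourly counting can be sketched on a toy table, assuming dplyr, lubridate, and tibble are installed. The timestamps and labels below are invented for illustration.

```r
library(dplyr)
library(lubridate)
library(tibble)

# Hypothetical joined rows: a timestamp and a sentiment label per word
toy <- tibble(
  created_at = ymd_hms(c("2020-04-23 16:51:11", "2020-04-23 16:05:00",
                         "2020-04-23 17:10:30", "2020-04-23 17:59:59")),
  sentiment  = c("positive", "negative", "positive", "positive")
)

toy_hourly <- toy %>%
  mutate(hour = floor_date(created_at, unit = "hour")) %>%  # 16:51:11 becomes 16:00:00
  count(hour, sentiment)                                     # one row per hour and sentiment
toy_hourly
```

Because floor_date() rounds every timestamp down to the start of its hour, count() then groups all words posted within the same hour together.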

Let’s visualize the time trend of sentiment in tweets toward COVID-19

covid_tweets_tidy %>% 
  inner_join(lexicon_bing()) %>% 
  count(hour, sentiment) %>% 
  ggplot(aes(x=hour, y=n, colour=sentiment)) +
  geom_line() +
  theme_bw() +
  labs(x = NULL, y = "Hourly Sum",
       title = "Tracing the rhythm of expressing sentiments toward COVID-19 on Twitter",
       subtitle = "The Bing Lexicon was used to measure sentiment in tweets")
## Joining, by = "word"

covid_tweets_tidy %>% 
  inner_join(lexicon_nrc()) %>% 
  count(hour, sentiment) %>% 
  ggplot(aes(x=hour, y=n, colour=sentiment)) +
  geom_line() +
  theme_bw() +
  labs(x = NULL, y = "Hourly Sum",
       title = "Tracing the rhythm of expressing sentiments toward COVID-19 on Twitter",
       subtitle = "The NRC Lexicon was used to measure sentiment in tweets")
## Joining, by = "word"

covid_tweets_tidy %>% 
  inner_join(lexicon_nrc_eil(), by=c("word"="term")) %>% 
  count(hour, AffectDimension) %>% 
  ggplot(aes(x=hour, y=n, colour=AffectDimension)) +
  geom_line() +
  theme_bw() +
  labs(x = NULL, y = "Hourly Sum",
       title = "Tracing the rhythm of expressing sentiments toward COVID-19 on Twitter",
       subtitle = "The NRC-EIL was used to measure sentiment in tweets")

What are the most common sentiment words used in the tweets?

We can analyze the word counts that contribute to positive and negative sentiment in tweets. By calling count() with both word and sentiment as arguments, we find out how much each word contributes to each sentiment.

# Word count on tweets
covid_tweets_tidy %>% 
  inner_join(lexicon_bing()) %>% 
  count(sentiment, word, sort=TRUE) # Counting words by sentiments 
## Joining, by = "word"
## # A tibble: 1,973 x 3
##    sentiment word         n
##    <chr>     <chr>    <int>
##  1 positive  like       460
##  2 positive  good       337
##  3 positive  work       316
##  4 positive  positive   301
##  5 negative  virus      282
##  6 positive  thank      255
##  7 positive  safe       234
##  8 positive  well       232
##  9 positive  trump      223
## 10 negative  crisis     212
## # ... with 1,963 more rows
covid_tweets_tidy %>% 
  inner_join(lexicon_bing()) %>% 
  count(sentiment, word, sort=TRUE) %>% 
  filter(sentiment=="positive") %>% 
  arrange(desc(n))
## Joining, by = "word"
## # A tibble: 751 x 3
##    sentiment word         n
##    <chr>     <chr>    <int>
##  1 positive  like       460
##  2 positive  good       337
##  3 positive  work       316
##  4 positive  positive   301
##  5 positive  thank      255
##  6 positive  safe       234
##  7 positive  well       232
##  8 positive  trump      223
##  9 positive  support    205
## 10 positive  great      187
## # ... with 741 more rows
covid_tweets_tidy %>% 
  inner_join(lexicon_bing()) %>% 
  count(sentiment, word, sort=TRUE) %>% 
  filter(sentiment=="negative") %>% 
  arrange(desc(n))
## Joining, by = "word"
## # A tibble: 1,222 x 3
##    sentiment word         n
##    <chr>     <chr>    <int>
##  1 negative  virus      282
##  2 negative  crisis     212
##  3 negative  death      197
##  4 negative  died       133
##  5 negative  hard        98
##  6 negative  infected    98
##  7 negative  lost        85
##  8 negative  die         84
##  9 negative  risk        83
## 10 negative  sick        83
## # ... with 1,212 more rows

The words like, positive, and trump should be removed from the list of positive words because, in the context of COVID-19 tweets, they are unlikely to express positive feelings (positive, for instance, may refer to a test result). The word virus should also be removed from the list of negative words because it most likely just refers to the coronavirus.

covid_tweets_tidy %>% 
  inner_join(lexicon_nrc()) %>% 
  count(sentiment, word, sort=TRUE) %>% 
  group_by(sentiment) %>% 
  top_n(10) %>% 
  arrange(sentiment, desc(n)) %>% 
  ungroup
## Joining, by = "word"
## Selecting by n
## # A tibble: 102 x 3
##    sentiment word          n
##    <chr>     <chr>     <int>
##  1 anger     fight       217
##  2 anger     death       197
##  3 anger     money       119
##  4 anger     fighting    115
##  5 anger     disease      85
##  6 anger     hit          73
##  7 anger     dying        72
##  8 anger     bad          69
##  9 anger     feeling      51
## 10 anger     challenge    44
## # ... with 92 more rows

The words virus in “negative”, don in “positive” and “trust”, and trump in “surprise” should be excluded from the analysis using the NRC lexicon.

covid_tweets_tidy %>% 
  inner_join(lexicon_nrc_eil(), by=c("word"="term")) %>% 
  count(AffectDimension, word) %>% 
  group_by(AffectDimension) %>% 
  top_n(10) %>% 
  arrange(AffectDimension, desc(n)) %>% 
  ungroup
## Selecting by n
## # A tibble: 40 x 3
##    AffectDimension word          n
##    <chr>           <chr>     <int>
##  1 anger           fight       217
##  2 anger           death       197
##  3 anger           money       119
##  4 anger           fighting    115
##  5 anger           disease      85
##  6 anger           hit          73
##  7 anger           dying        72
##  8 anger           bad          69
##  9 anger           feeling      51
## 10 anger           challenge    44
## # ... with 30 more rows

The word positive in “joy” should be excluded from the analysis using the NRC-EIL lexicon.

Let’s create a vector of the words that need to be excluded from the sentiment lexicons, e.g., positive and trump.

words_out <- c("like", "positive", "trump", "virus", "don")
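
As an alternative to filtering after every join, the excluded words could be dropped from a lexicon once, before joining. The following is a minimal sketch of that idea, assuming dplyr and tibble are installed; `toy_lexicon` is a hypothetical four-word stand-in for a real lexicon such as lexicon_bing().

```r
library(dplyr)
library(tibble)

words_out <- c("like", "positive", "trump", "virus", "don")

# Hypothetical mini lexicon standing in for lexicon_bing()
toy_lexicon <- tibble(
  word      = c("like", "good", "virus", "crisis"),
  sentiment = c("positive", "positive", "negative", "negative")
)

# Drop the excluded words from the lexicon before any join
toy_lexicon_clean <- toy_lexicon %>%
  filter(!word %in% words_out)
toy_lexicon_clean
```

Both approaches give the same joined rows; the code in this chapter filters after the join, which keeps the original lexicons untouched.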

covid_tweets_tidy %>% 
  inner_join(lexicon_bing()) %>% 
  filter(!word %in% words_out) %>% 
  count(hour, sentiment) %>% 
  ggplot(aes(x=hour, y=n, colour=sentiment)) +
  geom_line() +
  theme_bw() +
  labs(x = NULL, y = "Hourly Sum",
       title = "Tracing the rhythm of expressing sentiments toward COVID-19 on Twitter",
       subtitle = "The Bing Lexicon was used to measure sentiment in tweets")
## Joining, by = "word"

covid_tweets_tidy %>% 
  inner_join(lexicon_nrc()) %>% 
  filter(!word %in% words_out) %>% 
  count(hour, sentiment) %>% 
  ggplot(aes(x=hour, y=n, colour=sentiment)) +
  geom_line() +
  theme_bw() +
  labs(x = NULL, y = "Hourly Sum",
       title = "Tracing the rhythm of expressing sentiments toward COVID-19 on Twitter",
       subtitle = "The NRC Lexicon was used to measure sentiment in tweets")
## Joining, by = "word"

covid_tweets_tidy %>% 
  inner_join(lexicon_nrc_eil(), by=c("word"="term")) %>% 
  filter(!word %in% words_out) %>% 
  count(hour, AffectDimension) %>% 
  ggplot(aes(x=hour, y=n, colour=AffectDimension)) +
  geom_line() +
  theme_bw() +
  labs(x = NULL, y = "Hourly Sum",
       title = "Tracing the rhythm of expressing sentiments toward COVID-19 on Twitter",
       subtitle = "The NRC-EIL was used to measure sentiment in tweets") 

We can count the frequency of words of each sentiment type by tweet.

covid_tweets_tidy %>% 
  inner_join(lexicon_bing()) %>% 
  filter(!word %in% words_out) %>% 
  count(hour, status_id, sentiment) 
## Joining, by = "word"
## # A tibble: 8,929 x 4
##    hour                status_id           sentiment     n
##    <dttm>              <chr>               <chr>     <int>
##  1 2020-04-22 21:00:00 1253067601535815680 negative      5
##  2 2020-04-22 21:00:00 1253067601535815680 positive      1
##  3 2020-04-22 21:00:00 1253067622519681024 negative      1
##  4 2020-04-22 21:00:00 1253067629759209472 negative      2
##  5 2020-04-22 21:00:00 1253067638491807748 negative      2
##  6 2020-04-22 21:00:00 1253067644388982785 negative      1
##  7 2020-04-22 21:00:00 1253067644388982785 positive      3
##  8 2020-04-22 21:00:00 1253067656690839552 positive      4
##  9 2020-04-22 21:00:00 1253067673103204353 positive      1
## 10 2020-04-22 21:00:00 1253067716015071232 positive      1
## # ... with 8,919 more rows

Visualize the net score (positive - negative) using the “Bing” lexicon

* Let’s consider how we can calculate the net sentiment score by tweet: the sum of positive words minus the sum of negative words in each tweet.

* To do so, we need two separate columns for the positive and negative scores.

* Some tweets will contain words of only one sentiment, so the missing count needs to be filled with 0.

* So, we will use spread() from the tidyr package.

spread() takes three principal arguments:

  1. the data
  2. the key column variable whose values will become new column names.
  3. the value column variable whose values will fill the new column variables.

Figure: The spread() function

This yields a frequency table in which the sentiment counts for each tweet are spread across multiple rows: 9,559 observations from 7,203 tweets, with 4 variables (hour, status_id, sentiment, n).

library(tibble)

covid_tweets_tidy %>% 
  inner_join(lexicon_bing()) %>% 
  count(hour, status_id, sentiment)
## Joining, by = "word"
## # A tibble: 9,559 x 4
##    hour                status_id           sentiment     n
##    <dttm>              <chr>               <chr>     <int>
##  1 2020-04-22 21:00:00 1253067601535815680 negative      5
##  2 2020-04-22 21:00:00 1253067601535815680 positive      1
##  3 2020-04-22 21:00:00 1253067622519681024 negative      1
##  4 2020-04-22 21:00:00 1253067629759209472 negative      2
##  5 2020-04-22 21:00:00 1253067638491807748 negative      2
##  6 2020-04-22 21:00:00 1253067644388982785 negative      1
##  7 2020-04-22 21:00:00 1253067644388982785 positive      3
##  8 2020-04-22 21:00:00 1253067656690839552 positive      4
##  9 2020-04-22 21:00:00 1253067673103204353 positive      1
## 10 2020-04-22 21:00:00 1253067716015071232 positive      1
## # ... with 9,549 more rows

Using spread() with sentiment as the key and n as the value, this becomes 7,203 observations of 4 variables (hour, status_id, negative, positive).

library(tidyr)
covid_tweets_tidy %>% 
  inner_join(lexicon_bing()) %>% 
  count(hour, status_id, sentiment) %>% 
  spread(key=sentiment, value=n, fill = 0)
## Joining, by = "word"
## # A tibble: 7,203 x 4
##    hour                status_id           negative positive
##    <dttm>              <chr>                  <dbl>    <dbl>
##  1 2020-04-22 21:00:00 1253067601535815680        5        1
##  2 2020-04-22 21:00:00 1253067622519681024        1        0
##  3 2020-04-22 21:00:00 1253067629759209472        2        0
##  4 2020-04-22 21:00:00 1253067638491807748        2        0
##  5 2020-04-22 21:00:00 1253067644388982785        1        3
##  6 2020-04-22 21:00:00 1253067656690839552        0        4
##  7 2020-04-22 21:00:00 1253067673103204353        0        1
##  8 2020-04-22 21:00:00 1253067716015071232        0        1
##  9 2020-04-22 21:00:00 1253067749305323525        2        1
## 10 2020-04-22 21:00:00 1253067782016688128        0        1
## # ... with 7,193 more rows
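
The same reshaping can be followed on a toy table. This is a minimal sketch, assuming dplyr, tidyr (version 1.0.0 or later for pivot_wider()), and tibble are installed; `toy_long` is invented for illustration. It also shows pivot_wider(), the newer tidyr interface for this kind of reshaping, as an assumed alternative to the spread() call used in this chapter.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical long table: one row per tweet and sentiment
toy_long <- tibble(
  status_id = c("t1", "t1", "t2", "t3"),
  sentiment = c("negative", "positive", "negative", "positive"),
  n         = c(5, 1, 2, 4)
)

# spread(): sentiment values become column names, n fills the cells,
# and missing tweet/sentiment combinations are filled with 0
toy_wide <- toy_long %>%
  spread(key = sentiment, value = n, fill = 0)
toy_wide

# pivot_wider(): the same reshaping with the newer interface
toy_wide2 <- toy_long %>%
  pivot_wider(names_from = sentiment, values_from = n,
              values_fill = list(n = 0))
```

Tweet t2 has no positive words and tweet t3 has no negative words, so without `fill = 0` those cells would be NA and the later subtraction would return NA as well.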

Calculate the net sentiment score as the positive score minus the negative score.

covid_tweets_tidy %>% 
  inner_join(lexicon_bing()) %>% 
  filter(!word %in% words_out) %>% 
  count(hour, status_id, sentiment) %>% 
  spread(key=sentiment, value=n, fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining, by = "word"
## # A tibble: 6,866 x 5
##    hour                status_id           negative positive sentiment
##    <dttm>              <chr>                  <dbl>    <dbl>     <dbl>
##  1 2020-04-22 21:00:00 1253067601535815680        5        1        -4
##  2 2020-04-22 21:00:00 1253067622519681024        1        0        -1
##  3 2020-04-22 21:00:00 1253067629759209472        2        0        -2
##  4 2020-04-22 21:00:00 1253067638491807748        2        0        -2
##  5 2020-04-22 21:00:00 1253067644388982785        1        3         2
##  6 2020-04-22 21:00:00 1253067656690839552        0        4         4
##  7 2020-04-22 21:00:00 1253067673103204353        0        1         1
##  8 2020-04-22 21:00:00 1253067716015071232        0        1         1
##  9 2020-04-22 21:00:00 1253067749305323525        2        0        -2
## 10 2020-04-22 21:00:00 1253067791227392002        0        1         1
## # ... with 6,856 more rows
# Assigning each tweet with either positive or negative sentiment by the net score
covid_tweets_tidy %>% 
  inner_join(lexicon_bing()) %>% 
  filter(!word %in% words_out) %>% 
  count(hour, status_id, sentiment) %>% 
  spread(key=sentiment, value=n, fill = 0) %>% 
  mutate(sentiment = positive - negative) %>% 
  mutate(sentiment = ifelse(sentiment > 0, "Positive",
                            ifelse(sentiment < 0, "Negative", "Neutral")))
## Joining, by = "word"
## # A tibble: 6,866 x 5
##    hour                status_id           negative positive sentiment
##    <dttm>              <chr>                  <dbl>    <dbl> <chr>    
##  1 2020-04-22 21:00:00 1253067601535815680        5        1 Negative 
##  2 2020-04-22 21:00:00 1253067622519681024        1        0 Negative 
##  3 2020-04-22 21:00:00 1253067629759209472        2        0 Negative 
##  4 2020-04-22 21:00:00 1253067638491807748        2        0 Negative 
##  5 2020-04-22 21:00:00 1253067644388982785        1        3 Positive 
##  6 2020-04-22 21:00:00 1253067656690839552        0        4 Positive 
##  7 2020-04-22 21:00:00 1253067673103204353        0        1 Positive 
##  8 2020-04-22 21:00:00 1253067716015071232        0        1 Positive 
##  9 2020-04-22 21:00:00 1253067749305323525        2        0 Negative 
## 10 2020-04-22 21:00:00 1253067791227392002        0        1 Positive 
## # ... with 6,856 more rows
# ifelse(test, yes, no) returns a value with the same shape as test which is filled with elements selected from either yes or no depending on whether the element of test is TRUE or FALSE.
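
The net score and the three-way labelling can be checked on a toy table. The sketch below assumes dplyr and tibble are installed and uses case_when() from dplyr as an alternative to the nested ifelse() calls above; the table `toy_wide` and the column name `net` are hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical wide table: positive and negative word counts per tweet
toy_wide <- tibble(
  status_id = c("t1", "t2", "t3"),
  negative  = c(5, 1, 0),
  positive  = c(1, 1, 4)
)

toy_scored <- toy_wide %>%
  mutate(net = positive - negative) %>%   # -4, 0, 4
  mutate(sentiment = case_when(
    net > 0 ~ "Positive",
    net < 0 ~ "Negative",
    TRUE    ~ "Neutral"                   # net == 0
  ))
toy_scored
```

Keeping the numeric score in its own column (`net`) instead of overwriting it with the label may be convenient when both the score and the class are needed later.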

covid_tweets_bing <- covid_tweets_tidy %>% 
  inner_join(lexicon_bing()) %>% 
  filter(!word %in% words_out) %>% 
  count(hour, status_id, sentiment) %>% 
  spread(key=sentiment, value=n, fill = 0) %>% 
  mutate(sentiment = positive - negative) %>% 
  mutate(sentiment = ifelse(sentiment > 0, "Positive",
                            ifelse(sentiment < 0, "Negative", "Neutral")))
## Joining, by = "word"

Time-series visualization of sentiment analysis using “Bing” lexicon

# Now we are going to plot the number of tweets in each net-sentiment class (Positive, Negative, Neutral) across hour-long bins. Note that the hour variable on the x-axis keeps track of the time at which the tweets were posted
covid_tweets_bing %>% 
  count(hour, sentiment) %>% 
  ggplot(aes(x=hour,y=n, colour=sentiment)) +
  geom_line() +
  theme_bw() +
  labs(x = NULL, y = "Hourly Sum of Tweets",
       title = "Tracing the rhythm of expressing sentiments toward COVID-19 on Twitter",
       subtitle = "The Bing Lexicon was used to measure sentiment in tweets")

Let’s compare this with the overall sentiment toward COVID-19, where all sentiment words in the tweets are aggregated by hour.

covid_tweets_tidy %>% 
  inner_join(lexicon_bing()) %>% 
  filter(!word %in% words_out) %>% 
  count(hour, sentiment) %>% 
  ggplot(aes(x=hour, y=n, colour=sentiment)) +
  geom_line() +
  theme_bw() +
  labs(x = NULL, y = "Hourly Sum",
       title = "Tracing the rhythm of expressing sentiments toward COVID-19 on Twitter",
       subtitle = "The Bing Lexicon was used to measure sentiment in tweets")
## Joining, by = "word"