Loading packages necessary for analysis

library(twitteR)##Twitter interaction
library(quanteda)##Text analysis
library(topicmodels)##topic-modelling
library(tidyverse)
library(tidytext)

If you have never used R before, or have never downloaded these packages, you will need to install them first with the install.packages() function, and then load them with the library() function.
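As a minimal sketch of that install-then-load step, the snippet below lists which of the packages used here are not yet installed and installs only those. The helper name find_missing is made up for this illustration, and the install call is guarded by interactive() so that nothing is downloaded when the script runs non-interactively.

```r
# Packages loaded at the top of this document
needed <- c("twitteR", "quanteda", "topicmodels", "tidyverse", "tidytext")

# find_missing() is a hypothetical helper (not part of any package):
# it returns the entries of `pkgs` that are not among the installed packages
find_missing <- function(pkgs, installed = rownames(installed.packages())) {
  pkgs[!pkgs %in% installed]
}

to_install <- find_missing(needed)

# Install only what is missing, and only in an interactive session
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

After this, the library() calls above should load without "there is no package called ..." errors.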

The lines below enable us to access Twitter’s API. Many pages explain how to gain access, for example, https://towardsdatascience.com/access-data-from-twitter-api-using-r-and-or-python-b8ac342d3efe. You will need a regular Twitter account, and you will need to apply for a developer account. It is not complicated.

Gaining access

consumer_key <- "??????????????????????????"

consumer_secret <- "???????????????????????"

access_token <- "??????????????????????????"

access_secret <- "?????????????????????????"

setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
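Hard-coding keys in a script makes them easy to leak when the script is shared. One common alternative, sketched here, is to keep them in environment variables (for example in a .Renviron file) and read them with Sys.getenv(). The variable names (TWITTER_CONSUMER_KEY and so on) and the helper get_credential are assumptions made for this sketch; nothing in twitteR requires them.

```r
# get_credential() is a hypothetical helper: it reads one environment
# variable and stops with a clear message if the variable is empty or unset
get_credential <- function(name) {
  value <- Sys.getenv(name, unset = "")
  if (!nzchar(value)) {
    stop("Environment variable ", name, " is not set", call. = FALSE)
  }
  value
}

# Usage, once the (assumed) variables are defined in your environment:
# consumer_key    <- get_credential("TWITTER_CONSUMER_KEY")
# consumer_secret <- get_credential("TWITTER_CONSUMER_SECRET")
# access_token    <- get_credential("TWITTER_ACCESS_TOKEN")
# access_secret   <- get_credential("TWITTER_ACCESS_SECRET")
```

The four values can then be passed to setup_twitter_oauth() exactly as above.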

We can request only 3200 tweets at a time, and the API may return fewer than we ask for. Here we ask for 1000 tweets in English that use the Covid-19 hashtag.

covid19_tweets<- searchTwitter('#covid-19', lang="en", n = 1000)

Let’s convert the object to a data frame, which will make the analysis easier.

covid19_tweets <- twListToDF(covid19_tweets)#conversion to data frame

covid19_tweets[1:10,1:5]
##                                                                                                                                                       text
## 1             RT @rapplerdotcom: #FactCheck: The all-time high record of 17.6% unemployment rate in the country was recorded in April 2020 due to the COV…
## 2  RT @armymedunion: To celebrate @BTS_twt Jimin’s birthday Oct 13<U+0001F389>, #BAMU presents “Jimin’s Promise Campaign” -a COVID-19 vaccination campaig…
## 3      RT @ACCTB_77: OH MY! DOCTOR SAYS MANY CONGRESSIONAL MEMBERS AND STAFF TREATED WITH IVE*****IN FOR COVID-19<U+200B> https://t.co/znjaDs4PHe via @re…
## 4             RT @SueSuezep: A report from MPs calls the country's early COVID-19 plans 'one of the worst public health failures in UK history' However t…
## 5             RT @RealMattCouch: EXCLUSIVE: Southwest Whistleblower Reveals Internal Documents for Covid-19 Exemptions COMPLETE VIOLATIONS of TITLE VII..…
## 6            RT @YuvaShaktiIND: UPSC aspirants are protesting at Jantar Mantar, Delhi. \n\nThey are demanding an extra attempt as their preparations were…
## 7         @AshwiniVaishnaw Dear sir due to covid 19 right know no blanket &amp; linen shall be provided in train but RAC SEATS AR… https://t.co/qZZvhWgS7M
## 8                                           Pregnant women were kept out of clinical trials. That left them vulnerable to COVID-19 https://t.co/J6JEYbbJyb
## 9             RT @drsandeepbharti: Another unrest, another injustice, another policy paralysis. Upsc aspirants pleading from last 1 year for an 1 time at…
## 10            RT @FSDKe: Women, a mainstay of Kenyan households, earn only 50% of what men earn. Check out the latest wave of the FSD Kenya #COVID19 Trac…
##    favorited favoriteCount       replyToSN             created
## 1      FALSE             0            <NA> 2021-10-13 10:32:09
## 2      FALSE             0            <NA> 2021-10-13 10:32:09
## 3      FALSE             0            <NA> 2021-10-13 10:32:09
## 4      FALSE             0            <NA> 2021-10-13 10:32:09
## 5      FALSE             0            <NA> 2021-10-13 10:32:08
## 6      FALSE             0            <NA> 2021-10-13 10:32:08
## 7      FALSE             0 AshwiniVaishnaw 2021-10-13 10:32:06
## 8      FALSE             0            <NA> 2021-10-13 10:32:06
## 9      FALSE             0            <NA> 2021-10-13 10:32:06
## 10     FALSE             0            <NA> 2021-10-13 10:32:05

Cleanup

Here, we remove the elements that we don’t want (links, hashtags, mentions, the RT marker, non-ASCII characters). This is mostly done with ‘regular expressions’, a powerful pattern-matching syntax used across multiple platforms to work with character strings.

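covid.text <- covid19_tweets$text  # grab the text
# remove links; ".*" is greedy, so this also drops whatever follows the first link,
# up to the last letter, digit or underscore in the tweet
covid.text <- gsub("http.*([A-Za-z0-9_]+)", "", covid.text)
covid.text <- gsub("https.*([A-Za-z0-9_]+)", "", covid.text)  # no effect: the line above already matches https links
covid.text <- gsub("#([A-Za-z0-9_]+)", "", covid.text)  # remove hashtags
covid.text <- gsub("@([A-Za-z0-9_]+)", "", covid.text)  # remove mentions
covid.text <- gsub("[\r\n]", "", covid.text)  # remove line breaks
covid.text <- gsub("[^\x01-\x7F]", "", covid.text)  # remove non-ASCII characters (emoji, curly quotes, ellipses)
covid.text <- gsub("RT :", "", covid.text)  # remove what is left of the retweet prefix
covid.text[1:10]  # let's look at the first ten clean tweets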
##  [1] " : The all-time high record of 17.6% unemployment rate in the country was recorded in April 2020 due to the COV"              
##  [2] " To celebrate  Jimins birthday Oct 13,  presents Jimins Promise Campaign -a COVID-19 vaccination campaig"                     
##  [3] " OH MY! DOCTOR SAYS MANY CONGRESSIONAL MEMBERS AND STAFF TREATED WITH IVE*****IN FOR COVID-19 "                               
##  [4] " A report from MPs calls the country's early COVID-19 plans 'one of the worst public health failures in UK history' However t"
##  [5] " EXCLUSIVE: Southwest Whistleblower Reveals Internal Documents for Covid-19 Exemptions COMPLETE VIOLATIONS of TITLE VII.."    
##  [6] " UPSC aspirants are protesting at Jantar Mantar, Delhi. They are demanding an extra attempt as their preparations were"       
##  [7] " Dear sir due to covid 19 right know no blanket &amp; linen shall be provided in train but RAC SEATS AR "                     
##  [8] "Pregnant women were kept out of clinical trials. That left them vulnerable to COVID-19 "                                      
##  [9] " Another unrest, another injustice, another policy paralysis. Upsc aspirants pleading from last 1 year for an 1 time at"      
## [10] " Women, a mainstay of Kenyan households, earn only 50% of what men earn. Check out the latest wave of the FSD Kenya  Trac"
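To see what the link pattern does, here is a small base-R sketch on a made-up tweet (the text, user names and URL are invented for illustration). It compares the pattern used in the cleanup with a narrower alternative, "http\\S+", which is an assumption of this sketch rather than what the analysis above used: it stops at the first whitespace, so the words after the link survive.

```r
# Hypothetical tweet, invented for this example (not from the dataset)
tweet <- "RT @some_user: New #covid19 data https://t.co/abc123 via @another_user"

# Pattern from the cleanup step: ".*" runs to the end of the string,
# so everything after the link is removed along with it
greedy <- gsub("http.*([A-Za-z0-9_]+)", "", tweet)
greedy

# Narrower alternative: "\\S+" matches non-space characters only,
# so the match ends where the link ends
narrow <- gsub("http\\S+", "", tweet)
narrow
```

Which behaviour is preferable depends on the goal: the greedy version discards trailing text such as "via", while the narrower one keeps it for later steps to handle.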

Additional processing before analysis

Now we turn the cleaned text into a document-feature matrix: a matrix with one row per tweet and one column per word, where each cell counts how many times that word appears in that tweet. This representation is essential for text analysis of large datasets. While building it, we also lowercase the text, stem the words, and drop English stopwords, punctuation and links. We then trim the matrix to get rid of very rare and very common words, and save it in a format that the topicmodels package can work with. Note that we only keep words that appear at least two times in the dataset, and no more than 75 times.

covid_dfm<- dfm(covid.text, tolower=TRUE, stem=TRUE, remove=stopwords("english"),
                remove_punct = TRUE, remove_url=TRUE, verbose=TRUE)##turning to a document feature matrix

covid_dfm_trim <- dfm_trim(covid_dfm, min_termfreq = 2, max_termfreq = 75)#trimming words

dfm_topicmodels<- convert(covid_dfm_trim, to = "topicmodels")#convert so we can use topicmodels
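To make the idea of a document-feature matrix concrete, here is a toy version built with base R only. The three "tweets" are made up, and the code is a simplified illustration of the concept, not how quanteda implements dfm() or dfm_trim() (there is no stemming or stopword removal here).

```r
# Three made-up "tweets" (illustrative only, not from the dataset)
docs <- c(doc1 = "masks help stop covid",
          doc2 = "covid cases rise",
          doc3 = "masks masks everywhere")

# Split each document into lowercase words
words_by_doc <- strsplit(tolower(docs), " ", fixed = TRUE)

# The features are all the distinct words across documents
vocab <- sort(unique(unlist(words_by_doc)))

# Count how often each vocabulary word occurs in one document
count_words <- function(words) as.integer(table(factor(words, levels = vocab)))

# One row per document, one column per word
toy_dfm <- t(vapply(words_by_doc, count_words, integer(length(vocab))))
dimnames(toy_dfm) <- list(names(docs), vocab)
toy_dfm

# Trimming: keep words whose total count is at least 2 and at most 75
term_freq <- colSums(toy_dfm)
toy_dfm_trim <- toy_dfm[, term_freq >= 2 & term_freq <= 75, drop = FALSE]
toy_dfm_trim
```

In this toy example only "covid" and "masks" survive the trimming, because every other word appears just once.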

Topic modelling

Topic models are a family of statistical models used to extract the topics that characterize a large textual dataset. Here we fit a Latent Dirichlet Allocation (LDA) model and ask it for 10 topics. Please come talk to me if you want to learn more or use it in the future.

lda.model <- LDA(dfm_topicmodels, 10)
lda.model
## A LDA_VEM topic model with 10 topics.

Each column is a topic. What topics do you ‘see’ based on the words that characterize them?

as.data.frame(terms(lda.model, 20))
##      Topic 1    Topic 2  Topic 3 Topic 4  Topic 5 Topic 6  Topic 7 Topic 8
## 1       test         go    peopl   covid comeback     get     just     new
## 2        now    support      say      19       us booster  employe  pandem
## 3        can      fulli      one     get      due     flu   airlin  health
## 4       issu     travel      die  doctor     wait   peopl      now    york
## 5       like      medic      ban    time     sinc    news   govern    must
## 6     servic      pleas     year     cdc      got  presid   defund    judg
## 7       home      donat  countri     ask       th protect     unit   today
## 8     rt-pcr          t     said  inform     feel       t   counti    rule
## 9       book  administr      amp   sourc     move       s    state  exempt
## 10 quarantin   restrict       dr   begin      bad    call     texa   allow
## 11         v      jimin      old     say     sick     way      run   octob
## 12    worker      biden    today      uk     help   thank    place  mandat
## 13      form   birthday    world     see     nurs     two governor    high
## 14 philippin   fundrais children struggl      die  former      day       1
## 15    russia     direct   social   avail      icu  safest   school  religi
## 16   million      union  tuesday increas    short  effici   second  govern
## 17      work     unicef        1    need    whose      uk      tri  attack
## 18     nasal        bts   abbott  number outbreak  health        8   feder
## 19   sputnik       armi    share   state      mom  report     live  report
## 20    infect vaccineaid    thing    know     fill     one    break   chang
##      Topic 9 Topic 10
## 1      death    right
## 2       case     2021
## 3        new      amp
## 4     taiwan    india
## 5  scientist     year
## 6     record     also
## 7      adult  economi
## 8     report    virus
## 9       toll  protest
## 10       fda   global
## 11 australia     show
## 12    pfizer        |
## 13     grant     grow
## 14        10     lead
## 15   booster   govern
## 16    pandem   effect
## 17      take    still
## 18       852   expect
## 19   moderna     name
## 20    number    limit

We can then clean up the data some more and plot the ten most characteristic words from each topic!

covid_topics <- tidy(lda.model, matrix = "beta")
covid_topics
## # A tibble: 14,070 x 3
##    topic term       beta
##    <int> <chr>     <dbl>
##  1     1 high  1.42e- 14
##  2     2 high  1.88e- 62
##  3     3 high  2.45e-129
##  4     4 high  3.51e-  6
##  5     5 high  7.60e- 39
##  6     6 high  1.78e-173
##  7     7 high  3.03e- 67
##  8     8 high  1.52e-  2
##  9     9 high  1.79e- 23
## 10    10 high  2.59e- 42
## # ... with 14,060 more rows
covid_top_terms <- covid_topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>% 
  ungroup() %>%
  arrange(topic, -beta)

covid_top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered()
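In the tidy output, beta is the probability of a term within a topic, so the betas of one topic add up to 1 and the largest ones mark its most characteristic words. The base-R sketch below mimics the group_by() and slice_max() step on a tiny made-up table; the numbers and the helper top_n_terms are invented for illustration.

```r
# Made-up per-topic word probabilities (not model output)
toy_topics <- data.frame(
  topic = c(1, 1, 1, 2, 2, 2),
  term  = c("vaccin", "booster", "travel", "vaccin", "booster", "travel"),
  beta  = c(0.6, 0.3, 0.1, 0.1, 0.2, 0.7),
  stringsAsFactors = FALSE
)

# Within each topic, the betas sum to 1
tapply(toy_topics$beta, toy_topics$topic, sum)

# top_n_terms() is a hypothetical helper: for each topic, sort the rows
# by decreasing beta and keep the first n
top_n_terms <- function(topics, n) {
  pieces <- lapply(split(topics, topics$topic), function(one_topic) {
    ranked <- one_topic[order(-one_topic$beta), ]
    head(ranked, n)
  })
  do.call(rbind, pieces)
}

toy_top <- top_n_terms(toy_topics, 2)
toy_top
```

The same word can rank highly in more than one topic ("booster" here), which is also visible in the real topics above.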

Topic modelling is extremely powerful. Come see me if you’d like to learn more. This is just the tip of the iceberg!