library(twitteR)      # Twitter interaction
library(quanteda)     # text analysis
library(topicmodels)  # topic modelling
library(tidyverse)    # data wrangling and plotting
library(tidytext)     # tidy() for topic models, reorder_within()
If you have never used R before, or have never downloaded these packages, you will need to install them first with the install.packages() function and then load them with the library() function.
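As a rough sketch, the snippet below checks which of the packages used here are not yet installed. The helper name find_missing() is made up for this illustration and is not part of any package, and the install call is left commented out so that nothing is downloaded by accident.

```r
# Packages loaded at the top of this walkthrough
needed <- c("twitteR", "quanteda", "topicmodels", "tidyverse", "tidytext")

# find_missing() is a hypothetical helper: it returns the entries of `pkgs`
# that do not appear in the list of installed packages
find_missing <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

missing_pkgs <- find_missing(needed)
missing_pkgs

# Uncomment to install whatever is missing:
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```

Installing is a one-off step; library() still has to be called in every new R session.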
The lines below give us access to Twitter’s API. Many pages explain how to gain access, for example https://towardsdatascience.com/access-data-from-twitter-api-using-r-and-or-python-b8ac342d3efe. You will need a regular Twitter account and you will have to apply for a developer account. It is not complicated.
consumer_key <- "??????????????????????????"
consumer_secret <- "???????????????????????"
access_token <- "??????????????????????????"
access_secret <- "?????????????????????????"
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
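If you plan to share your script, it can be safer to keep the four keys out of the code. A minimal sketch, assuming the keys are stored in environment variables: the helper read_credential() and the variable names (TWITTER_CONSUMER_KEY and so on) are invented for this example, so pick whatever names you like.

```r
# read_credential() is a hypothetical helper: it fetches one value from an
# environment variable and stops with a clear message if it is not set
read_credential <- function(name) {
  value <- Sys.getenv(name, unset = "")
  if (!nzchar(value)) {
    stop("Environment variable ", name, " is not set", call. = FALSE)
  }
  value
}

# Usage (the variable names are assumptions, not something Twitter requires):
# consumer_key    <- read_credential("TWITTER_CONSUMER_KEY")
# consumer_secret <- read_credential("TWITTER_CONSUMER_SECRET")
# access_token    <- read_credential("TWITTER_ACCESS_TOKEN")
# access_secret   <- read_credential("TWITTER_ACCESS_SECRET")
# setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
```

One common place to define such variables is a .Renviron file in your home directory, which R reads at startup.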
We can request only 3200 tweets at a time, and the API may return fewer than we ask for. Here we ask for 1000 tweets in English that use the Covid-19 hashtag.
covid19_tweets <- searchTwitter('#covid-19', lang = "en", n = 1000)
Let’s convert the object to a data frame, which will make the analysis easier.
covid19_tweets <- twListToDF(covid19_tweets)  # conversion to data frame
covid19_tweets[1:10,1:5]
## text
## 1 RT @rapplerdotcom: #FactCheck: The all-time high record of 17.6% unemployment rate in the country was recorded in April 2020 due to the COV…
## 2 RT @armymedunion: To celebrate @BTS_twt Jimin’s birthday Oct 13<U+0001F389>, #BAMU presents “Jimin’s Promise Campaign” -a COVID-19 vaccination campaig…
## 3 RT @ACCTB_77: OH MY! DOCTOR SAYS MANY CONGRESSIONAL MEMBERS AND STAFF TREATED WITH IVE*****IN FOR COVID-19<U+200B> https://t.co/znjaDs4PHe via @re…
## 4 RT @SueSuezep: A report from MPs calls the country's early COVID-19 plans 'one of the worst public health failures in UK history' However t…
## 5 RT @RealMattCouch: EXCLUSIVE: Southwest Whistleblower Reveals Internal Documents for Covid-19 Exemptions COMPLETE VIOLATIONS of TITLE VII..…
## 6 RT @YuvaShaktiIND: UPSC aspirants are protesting at Jantar Mantar, Delhi. \n\nThey are demanding an extra attempt as their preparations were…
## 7 @AshwiniVaishnaw Dear sir due to covid 19 right know no blanket & linen shall be provided in train but RAC SEATS AR… https://t.co/qZZvhWgS7M
## 8 Pregnant women were kept out of clinical trials. That left them vulnerable to COVID-19 https://t.co/J6JEYbbJyb
## 9 RT @drsandeepbharti: Another unrest, another injustice, another policy paralysis. Upsc aspirants pleading from last 1 year for an 1 time at…
## 10 RT @FSDKe: Women, a mainstay of Kenyan households, earn only 50% of what men earn. Check out the latest wave of the FSD Kenya #COVID19 Trac…
## favorited favoriteCount replyToSN created
## 1 FALSE 0 <NA> 2021-10-13 10:32:09
## 2 FALSE 0 <NA> 2021-10-13 10:32:09
## 3 FALSE 0 <NA> 2021-10-13 10:32:09
## 4 FALSE 0 <NA> 2021-10-13 10:32:09
## 5 FALSE 0 <NA> 2021-10-13 10:32:08
## 6 FALSE 0 <NA> 2021-10-13 10:32:08
## 7 FALSE 0 AshwiniVaishnaw 2021-10-13 10:32:06
## 8 FALSE 0 <NA> 2021-10-13 10:32:06
## 9 FALSE 0 <NA> 2021-10-13 10:32:06
## 10 FALSE 0 <NA> 2021-10-13 10:32:05
Here we remove the elements we don’t want (links, hashtags, mentions, the RT marker, etc.). This is mostly done with ‘regular expressions’, a pattern-matching syntax used across many platforms to work with character strings.
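covid.text <- covid19_tweets$text  # grab the text
covid.text <- gsub("http.*([A-Za-z0-9_]+)", "", covid.text)   # links; '.*' is greedy, so text after a link is removed too
covid.text <- gsub("https.*([A-Za-z0-9_]+)", "", covid.text)  # already covered by the line above
covid.text <- gsub("#([A-Za-z0-9_]+)", "", covid.text)        # hashtags
covid.text <- gsub("@([A-Za-z0-9_]+)", "", covid.text)        # mentions
covid.text <- gsub("[\r\n]", "", covid.text)                  # line breaks
covid.text <- gsub("[^\x01-\x7F]", "", covid.text)            # non-ASCII characters
covid.text <- gsub("RT :", "", covid.text)                    # what is left of "RT @user:" once the mention is gone
covid.text[1:10]  # let's look at the first ten clean tweets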
## [1] " : The all-time high record of 17.6% unemployment rate in the country was recorded in April 2020 due to the COV"
## [2] " To celebrate Jimins birthday Oct 13, presents Jimins Promise Campaign -a COVID-19 vaccination campaig"
## [3] " OH MY! DOCTOR SAYS MANY CONGRESSIONAL MEMBERS AND STAFF TREATED WITH IVE*****IN FOR COVID-19 "
## [4] " A report from MPs calls the country's early COVID-19 plans 'one of the worst public health failures in UK history' However t"
## [5] " EXCLUSIVE: Southwest Whistleblower Reveals Internal Documents for Covid-19 Exemptions COMPLETE VIOLATIONS of TITLE VII.."
## [6] " UPSC aspirants are protesting at Jantar Mantar, Delhi. They are demanding an extra attempt as their preparations were"
## [7] " Dear sir due to covid 19 right know no blanket & linen shall be provided in train but RAC SEATS AR "
## [8] "Pregnant women were kept out of clinical trials. That left them vulnerable to COVID-19 "
## [9] " Another unrest, another injustice, another policy paralysis. Upsc aspirants pleading from last 1 year for an 1 time at"
## [10] " Women, a mainstay of Kenyan households, earn only 50% of what men earn. Check out the latest wave of the FSD Kenya Trac"
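Notice that tweets 3 and 7 lost everything that came after their link. That happens because the pattern "http.*" is greedy. If you would rather keep the words that follow a link, here is a small sketch on a made-up tweet; the helper clean_tweet() and the sample text are invented for this illustration, and this is not the code that produced the output above.

```r
# A made-up tweet, used only for illustration
sample_tweet <- "RT @some_user: Vaccines work https://t.co/abc123 via @other_user #health"

# Greedy pattern from the cleaning step: everything after the link is lost
greedy <- gsub("http.*([A-Za-z0-9_]+)", "", sample_tweet)

# clean_tweet() is a hypothetical helper that removes only the link itself
clean_tweet <- function(x) {
  x <- gsub("http\\S+", "", x)            # a link ends at the next whitespace
  x <- gsub("[#@][A-Za-z0-9_]+", "", x)   # hashtags and mentions
  x <- gsub("^RT\\s*:\\s*", "", x)        # retweet marker left at the start
  x <- gsub("\\s+", " ", x)               # collapse runs of whitespace
  trimws(x)
}

cleaned <- clean_tweet(sample_tweet)
greedy   # "RT @some_user: Vaccines work "
cleaned  # "Vaccines work via"
```

The only real change is "http\\S+", which stops at the first space instead of running to the end of the tweet.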
Now we turn the cleaned text into a document-feature matrix: a matrix with one row per tweet and one column per word, which records how many times each word appears in each tweet. It is essential for text analysis of large datasets. We trim it to get rid of very rare and very common words, and then convert it to a format that the topicmodels package can work with. Note that we only keep words that appear at least 2 times and no more than 75 times in the dataset.
covid_dfm <- dfm(covid.text, tolower = TRUE, stem = TRUE, remove = stopwords("english"),
                 remove_punct = TRUE, remove_url = TRUE, verbose = TRUE)  # turn into a document-feature matrix
covid_dfm_trim <- dfm_trim(covid_dfm, min_termfreq = 2, max_termfreq = 75)  # trim rare and very common words
dfm_topicmodels <- convert(covid_dfm_trim, to = "topicmodels")  # convert so we can use topicmodels
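The dfm() call above passes raw text together with the stem and remove arguments, a style that quanteda deprecated from version 3 onward in favour of building tokens first. Below is a minimal sketch of the tokens-first workflow on three made-up tweets. The toy_ objects are invented for this illustration, and the sketch is an alternative to try, not the code that produced the results shown here.

```r
library(quanteda)

# Three made-up tweets, used only for illustration
toy_tweets <- c("Vaccines are working well",
                "Testing and vaccines for travel",
                "Travel rules are changing")

toy_tokens <- tokens(toy_tweets, remove_punct = TRUE, remove_url = TRUE)
toy_tokens <- tokens_tolower(toy_tokens)                          # lower-case
toy_tokens <- tokens_remove(toy_tokens, stopwords("english"))     # drop stopwords
toy_tokens <- tokens_wordstem(toy_tokens)                         # stem each word

toy_dfm <- dfm(toy_tokens)                           # rows = tweets, columns = stems
toy_dfm_trim <- dfm_trim(toy_dfm, min_termfreq = 2)  # keep stems seen at least twice
toy_dfm_trim
```

With real tweets you would start from covid.text instead of toy_tweets and keep the same min_termfreq and max_termfreq values as above.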
Topic models are a family of statistical models used to extract the topics that characterize a large textual dataset. Here we fit a latent Dirichlet allocation (LDA) model with 10 topics. Please come talk to me if you want to learn more or use it in the future.
lda.model <- LDA(dfm_topicmodels, k = 10)
lda.model
## A LDA_VEM topic model with 10 topics.
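LDA starts from a random initialisation, so two runs on the same data can give different topics. One way to make a run repeatable is to pass a seed through the control argument. The sketch below uses the AssociatedPress example data that ships with topicmodels rather than our tweets; the seed value and the toy_ names are arbitrary choices made for this illustration.

```r
library(topicmodels)

data("AssociatedPress", package = "topicmodels")
toy_dtm <- AssociatedPress[1:20, ]  # a small slice keeps the example fast

# Same data and same seed should give the same fitted model
toy_lda_a <- LDA(toy_dtm, k = 2, control = list(seed = 1234))
toy_lda_b <- LDA(toy_dtm, k = 2, control = list(seed = 1234))

terms(toy_lda_a, 5)                                  # top five terms per topic
identical(terms(toy_lda_a, 5), terms(toy_lda_b, 5))  # compare the two runs
```

For our tweets, the equivalent call would be LDA(dfm_topicmodels, k = 10, control = list(seed = 1234)).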
Each column is a topic, listing its 20 most characteristic (stemmed) words. What topics do you ‘see’ based on the words that characterize them?
as.data.frame(terms(lda.model, 20))
## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8
## 1 test go peopl covid comeback get just new
## 2 now support say 19 us booster employe pandem
## 3 can fulli one get due flu airlin health
## 4 issu travel die doctor wait peopl now york
## 5 like medic ban time sinc news govern must
## 6 servic pleas year cdc got presid defund judg
## 7 home donat countri ask th protect unit today
## 8 rt-pcr t said inform feel t counti rule
## 9 book administr amp sourc move s state exempt
## 10 quarantin restrict dr begin bad call texa allow
## 11 v jimin old say sick way run octob
## 12 worker biden today uk help thank place mandat
## 13 form birthday world see nurs two governor high
## 14 philippin fundrais children struggl die former day 1
## 15 russia direct social avail icu safest school religi
## 16 million union tuesday increas short effici second govern
## 17 work unicef 1 need whose uk tri attack
## 18 nasal bts abbott number outbreak health 8 feder
## 19 sputnik armi share state mom report live report
## 20 infect vaccineaid thing know fill one break chang
## Topic 9 Topic 10
## 1 death right
## 2 case 2021
## 3 new amp
## 4 taiwan india
## 5 scientist year
## 6 record also
## 7 adult economi
## 8 report virus
## 9 toll protest
## 10 fda global
## 11 australia show
## 12 pfizer |
## 13 grant grow
## 14 10 lead
## 15 booster govern
## 16 pandem effect
## 17 take still
## 18 852 expect
## 19 moderna name
## 20 number limit
We can then tidy the model output and plot the most characteristic words of each topic! Here beta is the estimated probability of a term within a given topic.
covid_topics <- tidy(lda.model, matrix = "beta")
covid_topics
## # A tibble: 14,070 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 high 1.42e- 14
## 2 2 high 1.88e- 62
## 3 3 high 2.45e-129
## 4 4 high 3.51e- 6
## 5 5 high 7.60e- 39
## 6 6 high 1.78e-173
## 7 7 high 3.03e- 67
## 8 8 high 1.52e- 2
## 9 9 high 1.79e- 23
## 10 10 high 2.59e- 42
## # ... with 14,060 more rows
covid_top_terms <- covid_topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup() %>%
  arrange(topic, -beta)
covid_top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered()
Topic modelling is extremely powerful. Come see me if you’d like to learn more. This is just the tip of the iceberg!