
Introduction
COVID-19 has dominated public discussion over the last three years, and a great deal of misinformation about it has circulated, largely because social networks such as Facebook and Twitter are fertile ground for manipulation. People often jump to conclusions without checking their sources. This is especially dangerous when well-known people with many followers are involved, because they can influence their audience.
Purpose of the project
This observation motivated us to use a Kaggle dataset of tweets from verified and unverified Twitter users that carry the #Covid19 hashtag and were posted during July and August 2020. We investigate whether public figures (mostly verified users) are more likely to express strong sentiments about COVID-19 or more restrained ones. The answer is not obvious: some of them may be very vocal about such a controversial topic, while others may filter themselves to please the public or their followers. With the analysis below, we try to determine whether their tweets carry strong sentiment through strong, persuasive language, or whether they stay general and avoid taking a political stance on COVID-19. We also examine whether verified users showed more positive sentiment overall than unverified users.
For a Twitter account to be “verified”, it must meet several conditions. As a rule, the account must be authentic, noteworthy and active. To be considered “authentic”, the owner must provide official identification documents. To be noteworthy, the account must be associated with a reputable person or brand. Many categories of people and organizations are considered notable: government figures, companies, news organizations, journalists, etc. Verified status can be revoked if a user violates Twitter rules. On the other hand, anyone can be an unverified user.
Main Assumptions
For text mining and sentiment analysis, this project focuses on English-language tweets and uses two variables, a flag indicating whether the user is verified and the tweet text:
a) user_verified - whether the author is a verified user (here we use it to assign each tweet to a group).
b) text - the actual content of the tweet.
Description of the data set
We started the analysis by loading all the necessary libraries:
library(readr) # reads in CSV
library(ggplot2) # plot library
library(tidyverse) # data manipulation
library(gridExtra) # multiple plots
library(magick) # visualizations
library(tidytext) # text preprocessing
library(rtweet) # collecting Twitter Data
library(tidygraph) # a tidy API for graph manipulation
library(ggraph) # an implementation of grammar of graphics
library(wordcloud) # plot wordclouds
library(pacman) # package management
Next, we load the Kaggle dataset of tweets with the #Covid19 hashtag, which was collected during July and August 2020 and updated on a daily basis.
data <- read_csv("covid19_tweets.csv")
data$text <- as.character(data$text)
head(data)
## # A tibble: 6 × 13
## user_name user_location user_description user_created user_followers
## <chr> <chr> <chr> <dttm> <dbl>
## 1 ᏉᎥ☻լꂅϮ astroworld wednesday addams … 2017-05-26 05:46:42 624
## 2 Tom Basil… New York, NY Husband, Father, … 2009-04-16 20:06:23 2253
## 3 Time4fist… Pewee Valley… #Christian #Catho… 2009-02-28 18:57:41 9275
## 4 ethel mer… Stuck in the… #Browns #Indians … 2019-03-07 01:45:06 197
## 5 DIPR-J&K Jammu and Ka… 🖊️Official Twitter… 2017-02-12 06:45:15 101009
## 6 🎹 Franz S… Новоро́ссия 🎼 #Новоро́ссия #N… 2018-03-19 16:29:52 1180
## # … with 8 more variables: user_friends <dbl>, user_favourites <dbl>,
## # user_verified <lgl>, date <dttm>, text <chr>, hashtags <chr>, source <chr>,
## # is_retweet <lgl>
The dataset contains 179,109 rows and 13 variables describing each tweet and its author:
names(data)
## [1] "user_name" "user_location" "user_description" "user_created"
## [5] "user_followers" "user_friends" "user_favourites" "user_verified"
## [9] "date" "text" "hashtags" "source"
## [13] "is_retweet"
Preparation of data for modeling
As mentioned before, we focus on the text and user_verified variables, so the remaining columns are dropped.
clean_data <-
data %>%
select(user_verified, text)
head(clean_data)
## # A tibble: 6 × 2
## user_verified text
## <lgl> <chr>
## 1 FALSE "If I smelled the scent of hand sanitizers today on someone in …
## 2 TRUE "Hey @Yankees @YankeesPR and @MLB - wouldn't it have made more …
## 3 FALSE "@diane3443 @wdunlap @realDonaldTrump Trump never once claimed …
## 4 FALSE "@brookbanktv The one gift #COVID19 has give me is an appreciat…
## 5 FALSE "25 July : Media Bulletin on Novel #CoronaVirusUpdates #COVID19…
## 6 FALSE "#coronavirus #covid19 deaths continue to rise. It's almost as…
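Converting the contents of the “text” column into tokens.
Before tokenizing, we strip URLs, @mentions, leftover HTML-encoded ampersands, line breaks and punctuation. We then split the tweets into single words, which we use as the unit of analysis.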
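clean_data$text <- gsub("https\\S*", "", clean_data$text) # remove URLs
clean_data$text <- gsub("@\\S*", "", clean_data$text) # remove @mentions
clean_data$text <- gsub("amp", "", clean_data$text) # remove "amp" left from "&amp;" (note: also matches inside words such as "campaign")
clean_data$text <- gsub("[\r\n]", "", clean_data$text) # remove line breaks
clean_data$text <- gsub("[[:punct:]]", "", clean_data$text) # remove punctuation (note: "don't" becomes "dont")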
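The cleaning steps above are deliberately simple, and two of them have side effects: the pattern "amp" also matches inside ordinary words, and deleting line breaks can glue two words together. The following is only a sketch of a more conservative variant, not the code used to produce the results in this report. The helper name `clean_tweet` is hypothetical, and using it would change the token counts shown later.

```r
# Hypothetical helper (not part of the original analysis): cleans a character vector of tweets.
clean_tweet <- function(x) {
  x <- gsub("https?://\\S+", " ", x)        # drop URLs
  x <- gsub("@\\w+", " ", x)                # drop @mentions
  x <- gsub("&amp;", " ", x, fixed = TRUE)  # drop only the HTML entity, not "amp" inside words
  x <- gsub("[\r\n]+", " ", x)              # turn line breaks into spaces so words do not merge
  x <- gsub("[[:punct:]]+", "", x)          # strip remaining punctuation
  trimws(gsub("\\s+", " ", x))              # collapse repeated whitespace
}

example <- "Campaign update &amp; news\nhttps://t.co/abc @WHO #COVID19"
clean_tweet(example)

# For comparison, the plain "amp" pattern damages ordinary words:
gsub("amp", "", "Campaign")
```

The sketch still removes apostrophes, so contractions such as "don't" end up as "dont", exactly as in the original pipeline.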
Check whether any column in the dataset contains missing values.
any(is.na(clean_data))
## [1] FALSE
tokens <-
clean_data %>%
unnest_tokens(output = word, input = text)
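To make the tokenization step concrete, here is a small sketch on made-up data (the two toy tweets are assumptions for illustration, not rows from the dataset). It shows that `unnest_tokens()` produces one row per word, lowercases the words, and repeats the other columns, here `user_verified`, for every token.

```r
library(dplyr)
library(tidytext)

# Toy input: two invented tweets with their verification flag
toy <- tibble(
  user_verified = c(TRUE, FALSE),
  text = c("Stay safe everyone", "Wear a MASK")
)

# One row per word; the text column is replaced by a lowercase `word` column
toy_tokens <- toy %>%
  unnest_tokens(output = word, input = text)

toy_tokens
```

Because the verification flag travels with every token, the later split into verified and unverified tokens needs no extra join.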
Stopwords
Stop words are linking words, pronouns, negations, etc. that usually carry no sentiment and can be removed from the text.
data(stop_words)
tokens <-
tokens %>%
anti_join(stop_words, by = "word")
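One consequence of removing punctuation before this step is worth illustrating. The `stop_words` table lists contractions with their apostrophe, so a contraction that has lost its apostrophe no longer matches and stays in the data. The sketch below uses a made-up four-word tibble to show the effect; it is an illustration, not part of the original analysis.

```r
library(dplyr)
library(tidytext)

data(stop_words)

# Toy tokens: a contraction with and without its apostrophe, a stop word and a content word
toy_words <- tibble(word = c("don't", "dont", "the", "mask"))

# "don't" and "the" are removed, while "dont" and "mask" survive
kept <- toy_words %>%
  anti_join(stop_words, by = "word")

kept
```

This is why forms such as "dont" and "wouldnt" can show up among the tokens even after stop-word removal.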
Calculating word frequency.
tokens %>%
count(word, sort = TRUE)
## # A tibble: 142,667 × 2
## word n
## <chr> <int>
## 1 covid19 104617
## 2 coronavirus 14262
## 3 people 9090
## 4 pandemic 7975
## 5 deaths 7093
## 6 health 5235
## 7 positive 4728
## 8 covid 4691
## 9 total 4186
## 10 india 3861
## # … with 142,657 more rows
tokens
## # A tibble: 1,466,857 × 2
## user_verified word
## <lgl> <chr>
## 1 FALSE smelled
## 2 FALSE scent
## 3 FALSE hand
## 4 FALSE sanitizers
## 5 FALSE past
## 6 FALSE intoxicated
## 7 TRUE hey
## 8 TRUE wouldnt
## 9 TRUE sense
## 10 TRUE players
## # … with 1,466,847 more rows
Splitting the dataset
We split the dataset into two separate data frames, one for verified users and one for non-verified users.
ver_data <- tokens %>%
filter(user_verified) # user_verified is logical, so no string comparison is needed
unver_data <- tokens %>%
filter(!user_verified)
Modeling
For the tweets of both verified and unverified users, we performed three types of sentiment analysis, based on the AFINN, bing and NRC lexicons. All three are word-level dictionaries: each assigns sentiment to individual words, independent of the surrounding text. The codings were made by people, either through crowdsourcing or by the lexicon authors:
AFINN: scores each word on a scale from -5 (most negative) to 5 (most positive) - this gives us an idea of the direction of sentiment for a given word in a tweet, as well as the strength of that positive or negative sentiment.
bing: assigns each word a positive or negative sentiment.
NRC: shows associations with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive).
get_sentiments('afinn')
## # A tibble: 2,477 × 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # … with 2,467 more rows
get_sentiments('bing')
## # A tibble: 6,786 × 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # … with 6,776 more rows
get_sentiments('nrc')
## # A tibble: 13,875 × 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # … with 13,865 more rows
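All three lexicons are applied in the same way: the token table is joined to the lexicon on the `word` column, and tokens that are not in the lexicon drop out. The sketch below illustrates this with the bing lexicon and three toy tokens (the tibble is invented for illustration). It assumes the bing lexicon is available through `get_sentiments("bing")` without an extra download.

```r
library(dplyr)
library(tidytext)

# Toy tokens: two words that appear in the bing listing above and one that is in no lexicon
toy_tokens <- tibble(word = c("abnormal", "abolish", "covid19"))

# inner_join keeps only the tokens that the lexicon knows
scored <- toy_tokens %>%
  inner_join(get_sentiments("bing"), by = "word")

scored
```

The counts reported in the following sections are therefore counts of matched words, not of tweets, and a frequent but unscored word such as "covid19" does not contribute to them.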
Sentiment analysis for:
Verified Users
Wordcloud
The most-used word is, unsurprisingly, ‘covid19’. Other frequent words, though far less common, include ‘coronavirus’, ‘health’, ‘pandemic’ and ‘people’.
frequency_ver <- ver_data %>% count(word, sort=T)
frequency_ver %>% top_n(10)
## # A tibble: 10 × 2
## word n
## <chr> <int>
## 1 covid19 15113
## 2 coronavirus 1863
## 3 health 1119
## 4 pandemic 1106
## 5 people 1078
## 6 positive 1006
## 7 deaths 999
## 8 india 988
## 9 total 911
## 10 testing 747
set.seed(1234)
wordcloud(frequency_ver$word, frequency_ver$n, min.freq = 1, max.words=300, random.order=FALSE, rot.per=0.35,colors=brewer.pal(8, "Dark2"))

Afinn
Based on the AFINN analysis of verified users’ tweets, most matched words score between -3 and 2, i.e. they are moderately negative to slightly positive. Positive words (8,499) narrowly outnumber negative ones (8,003), and the most common score is 2.
ver_data_affin <- ver_data %>%
inner_join(get_sentiments("afinn"))
table(ver_data_affin$value)
##
##   -4   -3   -2   -1    1    2    3    4    5
##   47 1503 3825 2628 3605 4203  549  135    7
Bing
The bing dictionary also shows a fairly even split between negative and positive words, but here the negative group predominates (8,843 vs. 7,835, about 53% negative).
ver_data_bing <- ver_data%>%
inner_join(get_sentiments("bing"))
table(ver_data_bing$sentiment)
##
## negative positive
## 8843 7835
NRC
Somewhat surprisingly, with the NRC lexicon positive is the largest category. The categories that most words from verified users’ tweets fall into are positive, negative, trust and fear (a word can belong to several categories). The negative/positive part broadly echoes what we found using AFINN and bing. However, the trust and fear parts are new: it seems that verified users, at the time of these tweets (summer 2020), were expressing two conflicting emotions about COVID-19, trust and fear.
ver_data_nrc <- ver_data%>%
inner_join(get_sentiments("nrc"))
table(ver_data_nrc$sentiment)
##
##        anger anticipation      disgust         fear          joy     negative
##         4186         7194         2560         8680         3911        10952
##     positive      sadness     surprise        trust
##        16885         5852         2725        10181
ggplot(data = ver_data_nrc, aes(x=sentiment))+
geom_bar()+ # bar chart of category counts (geom_bar counts rows by default)
ggtitle("Sentiment Frequency for verified users' Tweets")+
theme_minimal()+theme(plot.title = element_text(hjust = 0.5))

Unverified Users
Wordcloud
The most-used word among unverified users was the same as for verified users: ‘covid19’. Other frequent words include ‘coronavirus’, ‘people’, ‘pandemic’ and ‘deaths’. The word ‘Trump’ also appears, which may suggest that unverified users posted on political topics more often and more boldly.
frequency_unver <- unver_data %>% count(word, sort=T)
frequency_unver %>% top_n(10)
## # A tibble: 10 × 2
## word n
## <chr> <int>
## 1 covid19 89504
## 2 coronavirus 12399
## 3 people 8012
## 4 pandemic 6869
## 5 deaths 6094
## 6 covid 4339
## 7 health 4116
## 8 positive 3722
## 9 dont 3588
## 10 mask 3453
set.seed(1234)
wordcloud(frequency_unver$word, frequency_unver$n, min.freq = 1, max.words=300, random.order=FALSE, rot.per=0.35,colors=brewer.pal(8, "Dark2"))

Afinn
The first thing we noticed is that unverified users posted far more tweets, so all counts are much larger. As for verified users, most AFINN scores fall between -3 and 2. However, the most common score here is -2 (29,267 words), whereas for verified users it was 2, and negative words (64,026) outnumber positive ones (56,851).
unver_data_affin <- unver_data %>%
inner_join(get_sentiments("afinn"))
table(unver_data_affin$value)
##
##    -5    -4    -3    -2    -1     1     2     3     4     5
##    99  2193 15146 29267 17321 22228 25896  6394  2254    79
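Because the two groups differ so much in size, raw counts are hard to compare. The sketch below summarises the two AFINN tables from this report with a few size-independent numbers: the share of positive words, the mean score and the most common score. The counts are copied from the `table()` outputs above. The helper `summarise_afinn` is a hypothetical name introduced only for this illustration.

```r
# Counts copied from the table() outputs for verified and unverified users
afinn_verified <- c("-4" = 47, "-3" = 1503, "-2" = 3825, "-1" = 2628,
                    "1" = 3605, "2" = 4203, "3" = 549, "4" = 135, "5" = 7)
afinn_unverified <- c("-5" = 99, "-4" = 2193, "-3" = 15146, "-2" = 29267, "-1" = 17321,
                      "1" = 22228, "2" = 25896, "3" = 6394, "4" = 2254, "5" = 79)

# Hypothetical helper: summarise a named vector of counts per AFINN score
summarise_afinn <- function(counts) {
  scores <- as.numeric(names(counts))
  c(
    share_positive = sum(counts[scores > 0]) / sum(counts),  # fraction of matched words with score > 0
    mean_score     = sum(scores * counts) / sum(counts),     # count-weighted average score
    modal_score    = scores[which.max(counts)]               # most frequent score
  )
}

afinn_summary <- rbind(
  verified   = summarise_afinn(afinn_verified),
  unverified = summarise_afinn(afinn_unverified)
)
round(afinn_summary, 3)
```

These are word-level summaries. A per-tweet score would require keeping a tweet identifier through tokenization, which this analysis does not do.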
Bing
The bing dictionary shows an imbalance in favor of negative words (73,999 vs. 52,188, about 59% negative).
unver_data_bing <- unver_data%>%
inner_join(get_sentiments("bing"))
table(unver_data_bing$sentiment)
##
## negative positive
## 73999 52188
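To check whether the difference in negative share between the two groups is more than noise, one option is a two-sample test of proportions on the bing counts reported in this document. This is only a rough sketch: it treats every matched word as an independent observation, which is an assumption that does not hold for words from the same tweet or the same author, so the p-value is likely too optimistic.

```r
# bing counts copied from the two table() outputs in this report
negative <- c(verified = 8843, unverified = 73999)
total    <- c(verified = 8843 + 7835, unverified = 73999 + 52188)

# Two-sample test for equality of proportions (base R, stats package)
res <- prop.test(negative, total)

res$estimate  # share of negative words in each group
res$p.value   # small values indicate that the shares differ
```

With samples this large, even small differences come out as statistically significant, so the size of the gap between the two shares is more informative than the p-value itself.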
NRC
Most matched words fall into the positive, negative, trust and fear categories, in that order, which is the same pattern as for verified users.
unver_data_nrc <- unver_data%>%
inner_join(get_sentiments("nrc"))
table(unver_data_nrc$sentiment)
##
##        anger anticipation      disgust         fear          joy     negative
##        30561        50499        21497        54877        31832        79615
##     positive      sadness     surprise        trust
##       105073        40787        21697        65363
ggplot(data = unver_data_nrc, aes(x=sentiment))+
geom_bar()+ # bar chart of category counts (geom_bar counts rows by default)
ggtitle("Sentiment Frequency for unverified users' Tweets")+
theme_minimal()+theme(plot.title = element_text(hjust = 0.5))
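The two NRC charts use very different scales, so the groups are easier to compare as shares. The sketch below converts the NRC counts reported in this document into percentages per group. The counts are copied from the `table()` outputs. Since a word can belong to several NRC categories, the shares refer to word-category matches rather than to distinct words.

```r
# NRC counts copied from the two table() outputs in this report
nrc_verified <- c(anger = 4186, anticipation = 7194, disgust = 2560, fear = 8680,
                  joy = 3911, negative = 10952, positive = 16885, sadness = 5852,
                  surprise = 2725, trust = 10181)
nrc_unverified <- c(anger = 30561, anticipation = 50499, disgust = 21497, fear = 54877,
                    joy = 31832, negative = 79615, positive = 105073, sadness = 40787,
                    surprise = 21697, trust = 65363)

# Share of each category within its own group
nrc_shares <- rbind(
  verified   = nrc_verified / sum(nrc_verified),
  unverified = nrc_unverified / sum(nrc_unverified)
)

# Percentages, rounded for display
round(100 * nrc_shares, 1)
```

Reading the result row by row shows whether a category is relatively more common in one group, independent of how many tweets each group contributed.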

Summary
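Based on the above analysis, we conclude that the sentiments of verified and non-verified users in our dataset were broadly similar, ranging from moderately negative to slightly positive, with trust and fear as the most prominent emotions. Unverified users leaned somewhat more negative: about 59% of their bing-matched words were negative, compared with about 53% for verified users. An interesting observation was that unverified users appeared bolder in talking about politics in their posts. Moreover, they account for far more of the tweets in the dataset.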
References
- Lecture materials from Text Mining and Social Media Mining course.