Introduction

COVID-19 has been a hot topic over the last three years, and it has been surrounded by a great deal of misinformation, mainly because social networks like Facebook and Twitter are fertile ground for manipulation. People often jump to conclusions without thinking or checking their sources. This is especially dangerous in the case of famous people with many followers, because they often have real influence over their audience.

Purpose of the project

The above observation motivated us to use a Kaggle dataset containing tweets from verified and unverified Twitter users, tagged with the #Covid19 hashtag and posted during July and August 2020, to investigate whether public figures (mostly verified users) tend to express stronger sentiments towards COVID-19 or more toned-down ones. The answer to this question is not obvious, as some of them may be very vocal about their opinions on such a controversial topic, while others may filter themselves to please the public or their followers. Consequently, in the analysis below we try to determine whether their tweets convey strong sentiments through forceful, persuasive language, or whether they remain general and avoid politicizing topics such as COVID vaccines. We also examine whether verified users showed more positive sentiment as a whole than unverified users.

To get a Twitter account “verified”, the owner must meet several conditions. As a rule, the account must be authentic, notable, and active. To be considered “authentic”, the owner must provide official identification documents. To be notable, the account must be associated with a reputable person or brand; many categories of people and organizations qualify, such as government figures, companies, and news organizations/journalists. Verified status can be revoked if a user violates Twitter's rules. Anyone, on the other hand, can be an unverified user.

Main Assumptions

For the purpose of text mining and sentiment analysis, this project focuses on English-language tweets and, in particular, on two variables:
a) user_verified - whether the author is a verified or a non-verified user (used here to group tweets).
b) text - the actual content of the tweet.

Description of the data set

We started the analysis by loading all the necessary libraries:

library(readr)        # reads in CSV
library(ggplot2)      # plot library
library(tidyverse)    # data manipulation
library(gridExtra)    # multiple plots 
library(magick)       # visualizations
library(tidytext)     # text preprocessing
library(rtweet)       # collecting Twitter Data
library(tidygraph)    # a tidy API for graph manipulation
library(ggraph)       # an implementation of grammar of graphics 
library(wordcloud)    # plot wordclouds
library(pacman)       # package management helper
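Note that pacman is loaded but not otherwise used in this report; as a sketch, its p_load() helper could replace the individual library() calls above, installing any missing packages along the way:

pacman::p_load(readr, ggplot2, tidyverse, gridExtra, magick,
               tidytext, rtweet, tidygraph, ggraph, wordcloud)  # install if missing, then attach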

Next, we load the Kaggle dataset, which contains tweets with a #Covid19 hashtag collected during July and August 2020 and updated on a daily basis.

data <- read_csv("covid19_tweets.csv")
data$text <- as.character(data$text)
head(data)
## # A tibble: 6 × 13
##   user_name  user_location user_description   user_created        user_followers
##   <chr>      <chr>         <chr>              <dttm>                       <dbl>
## 1 ᏉᎥ☻լꂅϮ    astroworld    wednesday addams … 2017-05-26 05:46:42            624
## 2 Tom Basil… New York, NY  Husband, Father, … 2009-04-16 20:06:23           2253
## 3 Time4fist… Pewee Valley… #Christian #Catho… 2009-02-28 18:57:41           9275
## 4 ethel mer… Stuck in the… #Browns #Indians … 2019-03-07 01:45:06            197
## 5 DIPR-J&K   Jammu and Ka… 🖊️Official Twitter… 2017-02-12 06:45:15         101009
## 6 🎹 Franz S… Новоро́ссия    🎼  #Новоро́ссия #N… 2018-03-19 16:29:52           1180
## # … with 8 more variables: user_friends <dbl>, user_favourites <dbl>,
## #   user_verified <lgl>, date <dttm>, text <chr>, hashtags <chr>, source <chr>,
## #   is_retweet <lgl>

The dataset contains 179,109 rows and 13 Twitter-related variables:

names(data)
##  [1] "user_name"        "user_location"    "user_description" "user_created"    
##  [5] "user_followers"   "user_friends"     "user_favourites"  "user_verified"   
##  [9] "date"             "text"             "hashtags"         "source"          
## [13] "is_retweet"
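Before narrowing the data down, it is worth checking how the two user groups are represented; a quick sketch (the exact counts depend on the dataset version):

table(data$user_verified)   # tweets from unverified (FALSE) vs. verified (TRUE) accounts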

Preparation of data for modeling

As mentioned above, we will focus on the text and user_verified variables; the remaining columns will be removed.

clean_data <-
  data %>%
  select(user_verified, text)

head(clean_data)
## # A tibble: 6 × 2
##   user_verified text                                                            
##   <lgl>         <chr>                                                           
## 1 FALSE         "If I smelled the scent of hand sanitizers today on someone in …
## 2 TRUE          "Hey @Yankees @YankeesPR and @MLB - wouldn't it have made more …
## 3 FALSE         "@diane3443 @wdunlap @realDonaldTrump Trump never once claimed …
## 4 FALSE         "@brookbanktv The one gift #COVID19 has give me is an appreciat…
## 5 FALSE         "25 July : Media Bulletin on Novel #CoronaVirusUpdates #COVID19…
## 6 FALSE         "#coronavirus #covid19 deaths continue to rise. It's almost  as…

Converting the contents of the “text” column into tokens.

We will split sentences into words, which will serve as our unit of analysis. First, we strip URLs, mentions, HTML remnants, line breaks, and punctuation:

clean_data$text <- gsub("https\\S*", "", clean_data$text) 
clean_data$text <- gsub("@\\S*", "", clean_data$text) 
clean_data$text <- gsub("amp", "", clean_data$text) 
clean_data$text <- gsub("[\r\n]", "", clean_data$text)
clean_data$text <- gsub("[[:punct:]]", "", clean_data$text)

Checking whether any column in the dataset contains missing values:

any(is.na(clean_data))
## [1] FALSE
tokens <- 
  clean_data %>%
  unnest_tokens(output = word, input = text)

Stopwords

These are function words: conjunctions, pronouns, negations, and so on, which usually carry no sentiment of their own and can be removed from the text.

data(stop_words)

tokens <-
  tokens %>%
  anti_join(stop_words, by = "word")
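Beyond standard English stop words, corpus-specific tokens such as the hashtag words themselves dominate the counts without carrying any sentiment. An optional sketch for removing them as well (the word list is illustrative; we keep these tokens here so the frequency tables below are unaffected):

custom_stops <- tibble(word = c("covid19", "covid", "coronavirus"))  # illustrative list
# tokens <- tokens %>% anti_join(custom_stops, by = "word")          # not run in this report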

Calculating word frequency.

tokens %>% 
    count(word, sort = TRUE)
## # A tibble: 142,667 × 2
##    word             n
##    <chr>        <int>
##  1 covid19     104617
##  2 coronavirus  14262
##  3 people        9090
##  4 pandemic      7975
##  5 deaths        7093
##  6 health        5235
##  7 positive      4728
##  8 covid         4691
##  9 total         4186
## 10 india         3861
## # … with 142,657 more rows
tokens
## # A tibble: 1,466,857 × 2
##    user_verified word       
##    <lgl>         <chr>      
##  1 FALSE         smelled    
##  2 FALSE         scent      
##  3 FALSE         hand       
##  4 FALSE         sanitizers 
##  5 FALSE         past       
##  6 FALSE         intoxicated
##  7 TRUE          hey        
##  8 TRUE          wouldnt    
##  9 TRUE          sense      
## 10 TRUE          players    
## # … with 1,466,847 more rows
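As an alternative to the raw frequency table, the same information can be plotted; a minimal sketch:

# Bar chart of the 10 most frequent words across all tweets.
tokens %>%
  count(word, sort = TRUE) %>%
  slice_max(n, n = 10) %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Frequency", title = "Top 10 words in #Covid19 tweets") +
  theme_minimal()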

Splitting the dataset

Splitting the dataset into two separate data frames: one for verified users and one for non-verified users.

ver_data<-tokens %>%
  filter(user_verified == "TRUE")

unver_data<-tokens%>%
  filter(user_verified=="FALSE")
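Because all later tables report raw counts, it helps to keep the very different group sizes in mind; a quick sketch:

nrow(ver_data)     # tokens from verified users
nrow(unver_data)   # tokens from unverified users (a much larger group)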

Modeling

For both verified and unverified users' tweets, we performed three types of sentiment analysis: AFINN, bing, and NRC. All three lexicons assign sentiment at the level of individual words, based on the “semantic orientation” of each word; the codings were produced by people, largely through crowdsourcing:
afinn: scores each word on a negative-to-positive integer scale, giving both the direction and the strength of its sentiment.
bing: assigns each word a binary positive/negative sentiment.
nrc: associates words with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive).

get_sentiments('afinn') 
## # A tibble: 2,477 × 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # … with 2,467 more rows
get_sentiments('bing') 
## # A tibble: 6,786 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # … with 6,776 more rows
get_sentiments('nrc') 
## # A tibble: 13,875 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # … with 13,865 more rows

Sentiment analysis for:

Verified Users

Wordcloud

The most-used word is, unsurprisingly, COVID-19. Other frequently used words, though far less common, include ‘coronavirus’, ‘people’, ‘pandemic’, and ‘deaths’.

frequency_ver <- ver_data %>% count(word, sort=T) 
frequency_ver %>% top_n(10)
## # A tibble: 10 × 2
##    word            n
##    <chr>       <int>
##  1 covid19     15113
##  2 coronavirus  1863
##  3 health       1119
##  4 pandemic     1106
##  5 people       1078
##  6 positive     1006
##  7 deaths        999
##  8 india         988
##  9 total         911
## 10 testing       747
set.seed(1234) 
wordcloud(frequency_ver$word, frequency_ver$n,
          min.freq = 1, max.words = 300, random.order = FALSE,
          rot.per = 0.35, colors = brewer.pal(8, "Dark2"))

AFINN

Based on the AFINN analysis of verified users' tweets, most word scores fall between -3 and 2, meaning the matched words were mostly mildly negative or mildly positive towards COVID-19; the single most frequent score is 2, i.e. mildly positive.

ver_data_affin <- ver_data %>%
  inner_join(get_sentiments("afinn"))

table(ver_data_affin$value)
## 
##   -4   -3   -2   -1    1    2    3    4    5 
##   47 1503 3825 2628 3605 4203  549  135    7
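The table can be condensed into a single number; a sketch (given the near-balance above, the mean should land close to zero):

mean(ver_data_affin$value)   # average AFINN score over all matched words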

Bing

A similar picture emerges from the bing lexicon: the split between negative and positive words is fairly even, with negative words slightly predominating.

ver_data_bing <- ver_data%>%
  inner_join(get_sentiments("bing"))

table(ver_data_bing$sentiment)
## 
## negative positive 
##     8843     7835

NRC

Perhaps surprisingly, with the NRC lexicon the single largest category is positive. The categories into which most words from verified users' tweets fall are positive, negative, trust, and fear. The negative/positive split echoes what we found with AFINN and bing; trust and fear, however, are new: it seems that verified users, at the time these tweets were posted (summer 2020), were expressing two conflicting emotions about COVID-19, trust and fear.

ver_data_nrc <- ver_data%>%
  inner_join(get_sentiments("nrc"))

table(ver_data_nrc$sentiment)
## 
##        anger anticipation      disgust         fear          joy     negative 
##         4186         7194         2560         8680         3911        10952 
##     positive      sadness     surprise        trust 
##        16885         5852         2725        10181
ggplot(data = ver_data_nrc, aes(x = sentiment)) +
  geom_bar() +
  ggtitle("Sentiment Frequency for verified users' Tweets") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

Unverified Users

Wordcloud

The most-used word among unverified users was the same as for verified users: COVID-19. Other high-frequency words include ‘coronavirus’, ‘pandemic’, and ‘deaths’. The word ‘Trump’ also appears, suggesting that unverified users posted more often, and more boldly, on political topics.

frequency_unver <- unver_data %>% count(word, sort=T) 
frequency_unver %>% top_n(10)
## # A tibble: 10 × 2
##    word            n
##    <chr>       <int>
##  1 covid19     89504
##  2 coronavirus 12399
##  3 people       8012
##  4 pandemic     6869
##  5 deaths       6094
##  6 covid        4339
##  7 health       4116
##  8 positive     3722
##  9 dont         3588
## 10 mask         3453
set.seed(1234) 
wordcloud(frequency_unver$word, frequency_unver$n,
          min.freq = 1, max.words = 300, random.order = FALSE,
          rot.per = 0.35, colors = brewer.pal(8, "Dark2"))

AFINN

The first thing we notice is that unverified users simply posted many more tweets. As with verified users, most AFINN scores fall between -3 and 2; here, however, the most frequent score is -2, closely followed by 2, whereas for verified users the score 2 dominated.

unver_data_affin <- unver_data %>%
  inner_join(get_sentiments("afinn"))

table(unver_data_affin$value)
## 
##    -5    -4    -3    -2    -1     1     2     3     4     5 
##    99  2193 15146 29267 17321 22228 25896  6394  2254    79
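The same one-number summary allows a direct comparison with verified users; a sketch:

mean(unver_data_affin$value)   # average AFINN score for unverified users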

Bing

The bing lexicon shows an imbalance in favor of negative words.

unver_data_bing <- unver_data%>%
  inner_join(get_sentiments("bing"))

table(unver_data_bing$sentiment)
## 
## negative positive 
##    73999    52188
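Because the two groups differ so much in size, proportions are more comparable than raw counts; a sketch:

prop.table(table(ver_data_bing$sentiment))     # negative/positive share, verified
prop.table(table(unver_data_bing$sentiment))   # negative/positive share, unverified

From the tables above, the negative share works out to roughly 53% for verified users versus about 59% for unverified ones, so the imbalance is somewhat stronger among unverified users.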

NRC

The largest categories are positive, negative, trust, and fear, which is similar to what we saw for verified users.

unver_data_nrc <- unver_data%>%
  inner_join(get_sentiments("nrc"))

table(unver_data_nrc$sentiment)
## 
##        anger anticipation      disgust         fear          joy     negative 
##        30561        50499        21497        54877        31832        79615 
##     positive      sadness     surprise        trust 
##       105073        40787        21697        65363
ggplot(data = unver_data_nrc, aes(x = sentiment)) +
  geom_bar() +
  ggtitle("Sentiment Frequency for unverified users' Tweets") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))
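For the same reason, the NRC profiles are easier to compare as proportions; a sketch placing the two groups side by side:

round(cbind(
  verified   = prop.table(table(ver_data_nrc$sentiment)),
  unverified = prop.table(table(unver_data_nrc$sentiment))
), 3)   # share of each emotion within each group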

Summary

Based on the above analysis, we can conclude that the sentiments expressed by the verified and unverified users in our dataset were similar, ranging from mildly negative to mildly positive. An interesting observation is that unverified users were bolder in addressing politics in their posts; they also tweeted far more frequently.
