Background and Research Question

In just the past month, the COVID-19 outbreak has become a global emergency. Although the coronavirus is highly contagious and deadly, people respond to it with mixed feelings on social media. According to Feng Lim’s analysis of 15,000 tweets tagged #Coronavirus and #COVID19 between January 30 and March 15, 2020, “false” was the most frequent word, which suggests a lack of understanding of, and misconceptions about, the virus [1]. At the same time, people around the world express sadness. Manlio De Domenico, a scientist at the Center for Information and Communication Technology of Italy’s Bruno Kessler Foundation, analyzed millions of coronavirus tweets and found that sentiment worldwide is negative [2].

As governments issue stricter regulations and social distancing is practiced worldwide in response to the exponential increase in coronavirus cases, this project examines whether and how people have changed their attitudes toward COVID-19. Specifically, the project aims to answer the following research questions:

How did people respond to the coronavirus as worldwide COVID-19 cases rose to 2 million on April 14?
What are people’s primary concerns during the pandemic?

Exploratory Data Analysis

Data Summary

The dataset used for this project was extracted by Shane Smith and uploaded to Kaggle [3]. It includes 449,492 tweets from April 14 with the following hashtags: #coronavirus, #coronavirusoutbreak, #coronavirusPandemic, #covid19, #covid_19, #epitwitter, #ihavecorona, #StayHomeStaySafe, #TestTraceIsolate.

The dataset contains 22 variables associated with Twitter. For text mining and sentiment analysis, this project focuses on English-language tweets and the following variables:

created_at: The date and time of the tweet.
text: The text of the tweet.
country_code: The country code of the tweet’s location.
favourites_count: The number of favourites this tweet has received.
retweet_count: The number of retweets this tweet has received.

Exploratory Analysis

#load libraries
suppressMessages(library(tidytext))
suppressMessages(library(stringr))
suppressMessages(library(readr))
suppressMessages(library(knitr))
suppressMessages(library(tidyverse))
suppressMessages(library(wordcloud))
suppressMessages(library(textdata))
suppressMessages(library(plotrix))
suppressMessages(library(radarchart))
suppressMessages(library(ggplot2))
suppressMessages(library(choroplethr))
suppressMessages(library(choroplethrMaps))
suppressMessages(library(sentimentr))
suppressMessages(library(sjmisc))

#get data
tweets0414 <- read_csv("2020-04-14 Coronavirus Tweets.csv")
tweets <- tweets0414 %>% filter(lang == 'en') #retrieve English tweets; filter() also drops rows where lang is NA
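
A caveat on the subsetting step: when a logical index contains NA, base R's bracket subsetting returns a row of NAs for each missing value instead of dropping it. The minimal sketch below uses a made-up data frame (the toy object and its values are illustrative, not taken from the dataset) and assumes that some tweets may lack a lang value:

```r
# Toy data: three tweets, one with a missing language code (illustrative values)
toy <- data.frame(
  text = c("stay home", "restez chez vous", "wash your hands"),
  lang = c("en", "fr", NA),
  stringsAsFactors = FALSE
)

# Logical subsetting keeps an all-NA row where lang is NA
with_na <- toy[toy$lang == "en", ]

# Wrapping the condition in which() drops rows where the test is NA
english_only <- toy[which(toy$lang == "en"), ]

nrow(with_na)       # 2: one real row plus one all-NA row
nrow(english_only)  # 1
```

dplyr's filter() behaves like the which() version, keeping only rows where the condition is TRUE.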

1. Geographical Tweet Distribution

Most tweets were sent from the US, followed by Canada, the UK, Nigeria, and India.

data(country.regions) #a dataset that contains country names in different versions from choroplethr
countryname<-as.data.frame(country.regions) #convert it as a dataframe

plotdata <- tweets %>% 
  filter(is.na(country_code) ==FALSE) %>% #filter na value
  rename(iso2c = country_code) %>% #rename column
  left_join(countryname) %>% #join countryname
  count(region,sort = T) %>% #count region
  rename(value = n) %>% #rename column
  select(region, value) #select region and value


labs <- data.frame(region =tail(plotdata[order(plotdata$value),],5)$region, #5 region that tweets most frequently
                   lon = c(5.44,-105,77.3,-2.6,-93.4), #longtitude
                   lat = c(13.1,60,31.7,51.5,35)) #latitude

nplotdata <- plotdata %>% left_join(labs) #join labs

country_choropleth(nplotdata, num_colors  = 1) + #plot choropleth
  scale_fill_gradient(high = "#e34a33", low = "#fee8c8", #set color by stats
                      guide ="colorbar", na.value="white", name="Counts of Tweets") + 
  geom_point(data = as.data.frame(nplotdata), #mark the top 5 regions 
        aes(x = lon, y = lat),
        inherit.aes = FALSE,
        color = ifelse(!is.na(nplotdata$lon) & nplotdata$value > 1500, 'navy', 'blue'),
        size = ifelse(!is.na(nplotdata$lon), nplotdata$value/100, 0), #use the joined table so sizes align with the coordinates
        alpha = .6) +
  geom_point(data = as.data.frame(nplotdata), #mark the top 5 regions
                aes(x = lon, y = lat),
                inherit.aes = FALSE,
                color = 'green',
                size = 1) +
  geom_text(data = as.data.frame(nplotdata), #annotate the top 5 regions
                     aes(x = lon, y = lat, 
                         label = toupper(region) , vjust = 1.5, hjust = 0.5), color='black',
            inherit.aes = FALSE)+
  labs(title = "Tweets by Country") +
  theme(plot.title = element_text(size = 14, face = "bold"))
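
The label coordinates in the code above are paired with regions by position, so they only line up if the top five regions come out in the assumed order. A sketch of a more robust alternative keys the coordinates by region name and joins on that name. The counts and the coords lookup table below are illustrative assumptions, and the lowercase region names are assumed to match the naming used in country.regions:

```r
# Illustrative tweet counts per region (made-up numbers, not the real results)
plot_counts <- data.frame(
  region = c("india", "united states of america", "canada", "france"),
  value  = c(900, 5000, 1200, 300),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table: approximate label positions keyed by region name
coords <- data.frame(
  region = c("united states of america", "canada", "india"),
  lon    = c(-93.4, -105, 77.3),
  lat    = c(35, 60, 31.7),
  stringsAsFactors = FALSE
)

# Keep the three most frequent regions, then attach coordinates by name
top_regions <- head(plot_counts[order(-plot_counts$value), ], 3)
labelled <- merge(top_regions, coords, by = "region", all.x = TRUE)

# Show the result from most to least frequent
labelled[order(-labelled$value), ]
```

With this layout, a change in the ranking cannot silently attach a label to the wrong country; a region without an entry in the lookup table simply gets NA coordinates.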

2. Top 5 Favorited Tweets

Global

The most-favorited tweets carry negative messages about topics such as hoarding of supplies, governments’ insufficient response to COVID-19, the relocation of homeless people, and changes to logistics.

fav<-tweets %>%
  #order the tweets descendingly by counts of favorites
  arrange(desc(favourites_count)) %>% 
  #select the text and count
  select(text,favourites_count) %>% 
  #get the top 5
  head(5)

kable(fav,format = "html")

text	favourites_count
We are not #InThisTogether if people have to go about hoarding supplies from stores & leaving others defenseless without the necessary supplies like what the heck is wrong with you people stop being a hoarder!!! 😡😤 #COVID19	1989070
It’s absolutely irresponsible and reckless for Trump to be talking about “opening up” the country when TODAY ALONE there were 24,215 new #COVID cases and 2,284 fatalities from #coronavirus and there are no signs of the #COVID19 outbreak slowing.	1543111
Ordine presidenziale #civid19 #coronavirus BBC News - Coronavirus: Amazon ordered to deliver only essential items in France https://t.co/hxlo2NFrEj	1122418
Messaggi disorientanti # UK #covid19 #coronavirus government’s coronavirus response beset by mixed messages and U-turns https://t.co/WxxGpEpH8f	1122107
Nessuna pietà per i senza casa #covid19 #coronavirus Hotels sit vacant during the pandemic. But some locals don’t want homeless people moving in. https://t.co/cuXe9z4JdU	1122107

US

In contrast, the most-favorited tweets from the US concern cuts to vital public services, economic decline, and growing cybersecurity threats as people work from home.

fav_us <- tweets %>%
  #get tweets located in the US (filter() already drops rows where country_code is NA)
  filter(country_code == 'US') %>%
  #order the tweets descendingly by counts of favorites
  arrange(desc(favourites_count)) %>% 
  #select the text and count
  select(text,favourites_count) %>% 
  #get the top 5 (head() returns exactly 5 rows, even when counts are tied)
  head(5)

kable(fav_us, format = "html")

text	favourites_count
More than 2,100 US cities brace for huge budget shortfalls that will lead to thousands of layoffs, cuts in vital services and less cops on the streets during the #coronavirus pandemic https://t.co/8SdwHycfAE #EconTwitter #economy #publicservices	570236
A 500% increase in attacks related to #workfromhome individuals as a result of the #coronavirus pandemic #Security #Cybersecurity #Hackers #Databreach #Cybercrime #DataPrivacy #Ransomware #Cyberattacks #CSO #Infosec #Malware #CIS #CyberDefense #WFM https://t.co/6j4IyM5MX1	570234
#AirTravel in the era of #coronavirus 😯 #airlines #aviation #aviationinlockdown #avgeek #avgeeks https://t.co/2EfFuXzuga	570234
#Taiwan steps up the fight against #COVID19 @MHiesboeck #COVID2019 https://t.co/ieypNL5I2D	570096
#Google to Display More Virtual #Healthcare Options in Search and Maps https://t.co/ymEh57HQMy… #telehealth #TelemedNow @IrmaRaste @eViRaHealth @HealthTap #COVID19 #COVID https://t.co/6qDhnMZLjx	570089

3. Top 5 Retweeted Tweets

Global

The most-retweeted tweets display mixed sentiments: some spread positive and accurate messages, such as social distancing and prevention tips, while others discuss conspiracy theories and lies about the virus.

rt<-tweets %>%
  #order the tweets descendingly by counts of retweets
  arrange(desc(retweet_count)) %>%
  #select the text and count
  select(text,retweet_count) %>%
  #get the top 5 (head() returns exactly 5 rows, even when counts are tied)
  head(5)

kable(rt,format = 'html')

text	retweet_count
We can fight the spread of COVID-19 together by sticking to the basics. 💓 Follow the steps to protect yourself with BT21! 📝 #COVID19 #Coronavirus #Prevention #tips #StayatHome #SocialDistancing #SelfQuarantine #FlatteningtheCurve #BT21 https://t.co/obH9uFvxu7	18350
We can fight the spread of COVID-19 together by sticking to the basics. 💓 Follow the steps to protect yourself with BT21! 📝 #COVID19 #Coronavirus #Prevention #tips #StayatHome #SocialDistancing #SelfQuarantine #FlatteningtheCurve #BT21 https://t.co/2FgwSefIbq	16720
#Wapo has legit bombshell indicating the #coronavirus was created by & escaped from a #Chinese lab experimenting on bats, which means the whole wet market story was just BS cover for a bio-experiment fuckup of epic proportion.	12484
We’ve hit PEAK #COVID19 INTERNET https://t.co/GsY78RzBPK	12350
Dear @LindseyGrahamS: You lie. The impeachment trial ended Feb 5. Democrats in the House started writing legislation to address the pandemic in FEBRUARY. Democrats in the House held hearings on the #coronavirus in FEBRUARY. #FactsMatter https://t.co/rlN5nwGiHu	11600

US

In the US, the most-retweeted tweets publicize information about virus testing, social distancing, and the needs of seriously ill patients, along with political criticism.

rt_us<-tweets %>%
  #get tweets located in the US (filter() already drops rows where country_code is NA)
  filter(country_code == 'US') %>%
  #order the tweets descendingly by counts of retweets
  arrange(desc(retweet_count)) %>%
  #select the text and count
  select(text,retweet_count) %>%
  #get the top 5 (head() returns exactly 5 rows, even when counts are tied)
  head(5)

kable(rt_us,format = "html")

text	retweet_count
PRAYER REQUEST This brother in arms & his wife are both veterans. She has breast cancer, he has bladder cancer. #Covid_19 has complicated their situation with not being able to have anyone around, meaning no support. They are isolated at home. Please this vet couple up in prayer. https://t.co/G5MoNfthjH	235
Our city operated testing sites are now open to anyone who would like a test. 1,000 tests per day. Please call 832-393-4220 to get you unique code. We started with 18 operators, ramped up to 25, and tomorrow we will have 50 operators with the increase demand. #COVID19	149
Florida Surgeon General suggests social distancing measures should go on until there is a vaccine and is scurried away by governor’s team b/c @GovRonDeSantis is probably going to make some bad policy decisions instead 😒 #coronavirusoutbreak #demcastfl https://t.co/cU8lAnOfeR	99
Do you have a close family member or close friend that is in the Trump cult? If so, has the undisputedly lethally inept #coronavirus response by Trump made a dent in their support of him?	88
The ambassador of China has been summoned to the French foreign ministry after a media campaign of its embassy to criticize the handling of #COVID19 by FRANCE (as a way of whitewashing China of any responsibility in the pandemic). https://t.co/wnskIWwyRb	80

Method

Based on the geographical distribution of tweets in the exploratory analysis, the project focuses on people’s responses to the coronavirus on Twitter both worldwide and in the United States.

In terms of methodology, text mining is used to answer the research questions. The project uses tokenization to extract word-level information from tweets. By counting word frequencies, it explores people’s top concerns during the pandemic. By performing word-level sentiment analysis with the bing and nrc lexicons and sentence-level sentiment analysis with the sentimentr package, it examines people’s attitudes in specific contexts.

1. Word Frequency

Word tokenization is applied before formal analysis:

remove_reg <- "&amp;|&lt;|&gt;" #regular expression matching HTML entities
newstops <- c('covid_19','covid-19','covid 19','coronavirus','covid19', '#coronavirus', '#coronavirusoutbreak', '#coronavirusPandemic', '#covid19', '#covid_19', '#epitwitter', '#ihavecorona', '#StayHomeStaySafe', '#TestTraceIsolate') #keywords and hashtags that need to be removed

tidy_tweets <- tweets %>%  
  mutate(text = str_remove_all(text, remove_reg)) %>%  #remove HTML entities
  unnest_tokens(word, text, token = 'tweets',strip_url = TRUE) %>% #word tokenization (tokens are lowercased)
  filter(!word %in% stop_words$word, #remove stopwords
         !word %in% str_remove_all(stop_words$word, "'"),
         !word %in% tolower(newstops), #remove those hashtags; lowercased so mixed-case entries match the tokens
         str_detect(word, "[a-z]"))
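
To make the tokenization step concrete, here is a minimal base R sketch of the same idea: lowercase the text, split it into word tokens, drop stop words, and count what remains. It is a simplified stand-in for unnest_tokens(), not a reimplementation; the sample tweets and the short toy_stops list are made up for illustration. Because the tokens are lowercased, the stop list is lowercased as well before matching.

```r
# Made-up sample tweets (illustrative only)
sample_tweets <- c(
  "People need support during the pandemic #StayHomeStaySafe",
  "Support local business, people! #COVID19"
)

# Hypothetical stop list mixing ordinary stop words and hashtags to drop
toy_stops <- c("the", "during", "need", "#StayHomeStaySafe", "#covid19")

# Lowercase, then split on anything that is not a letter, digit, '#', or '_'
tokens <- unlist(strsplit(tolower(sample_tweets), "[^a-z0-9#_]+"))
tokens <- tokens[tokens != ""]

# Lowercase the stop list too, otherwise mixed-case entries never match
kept <- tokens[!tokens %in% tolower(toy_stops)]

# Word frequencies, most frequent first
freq <- sort(table(kept), decreasing = TRUE)
freq
```

In this toy input, "people" and "support" each occur twice and lead the table, while both hashtags are removed regardless of how they were capitalized in the stop list.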

Most Frequent Words Worldwide

10 Most Frequent Words

#get words and their frequency
frequency_global <- tidy_tweets %>% count(word, sort=T) 
#get the top 10
frequency_global %>% top_n(10)

## # A tibble: 10 x 2
##    word          n
##    <chr>     <int>
##  1 people    26899
##  2 support   18246
##  3 pandemic  14664
##  4 time      11084
##  5 health    10969
##  6 economy   10799
##  7 million   10434
##  8 deaths     9705
##  9 home       9382
## 10 #stayhome  8844

WordCloud

wordcloud(frequency_global$word,frequency_global$n, min.freq = 2200,
          scale=c(4.5, .2), random.order = FALSE, random.color = FALSE,
          colors = brewer.pal(8, "Dark2")) #res is not a wordcloud() argument, so it is dropped

As shown above, “people,” “support,” “pandemic,” “health,” and “economy” are among the words mentioned most often on Twitter.

Most Frequent Words in the US

10 Most Frequent Words

#get cleaned tweets that are located in the US
tidy_us <- tidy_tweets[is.na(tidy_tweets$country_code)==F & tidy_tweets$country_code == "US", ]

#get words and their frequency
frequency_us <- tidy_us %>% count(word, sort=T)
#get the top 10
frequency_us %>% top_n(10)

## # A tibble: 10 x 2
##    word                     n
##    <chr>                <int>
##  1 people                 556
##  2 economy                311
##  3 #stayhome              303
##  4 million                293
##  5 student                278
##  6 package                271
##  7 debt                   264
##  8 urge                   263
##  9 #cancelstudentdebt     261
## 10 #studentdebtstimulus   259

Word Cloud

wordcloud(frequency_us$word,frequency_us$n, min.freq = 50,
          scale=c(4.5, .2), random.order = FALSE, random.color = FALSE,
          colors = brewer.pal(8, "Dark2")) #res is not a wordcloud() argument, so it is dropped

Besides the worldwide concerns, student debt is the main topic of discussion among people in the US. President Trump is also frequently mentioned.

2. Word-Level Sentiment Analysis

(a) Positive/Negative Sentiment

Globally, people express a relatively negative attitude on Twitter during the pandemic.

tweets_bing<-tidy_tweets%>% 
  # Implement sentiment analysis using the "bing" lexicon
  inner_join(get_sentiments("bing")) 

perc<-tweets_bing %>% 
  count(sentiment)%>% #count sentiment
  mutate(total=sum(n)) %>% #get sum
  group_by(sentiment) %>% #group by sentiment
  mutate(percent=round(n/total,2)*100) %>% #get the proportion
  ungroup()

label <-c( paste(perc$percent[1],'%',' - ',perc$sentiment[1],sep=''),#create label
     paste(perc$percent[2],'%',' - ',perc$sentiment[2],sep=''))

pie3D(perc$percent,labels=label,labelcex=1.1,explode= 0.1, 
      main="Worldwide Sentiment") #create a pie chart
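
The percentages behind the pie chart come from matching each token against a lexicon and tallying the labels. A minimal sketch with a tiny hand-made lexicon (the toy_lexicon entries are illustrative and far smaller than the real bing lexicon) shows the computation. Note that the share is taken over matched sentiment words, not over tweets:

```r
# Illustrative tokens, as they might look after tokenization
toy_tokens <- data.frame(
  word = c("support", "death", "debt", "economy", "support", "crisis", "time"),
  stringsAsFactors = FALSE
)

# Tiny hand-made lexicon (illustrative; not the bing lexicon itself)
toy_lexicon <- data.frame(
  word      = c("support", "death", "debt", "crisis"),
  sentiment = c("positive", "negative", "negative", "negative"),
  stringsAsFactors = FALSE
)

# Inner join: tokens without a lexicon entry ("economy", "time") drop out
matched <- merge(toy_tokens, toy_lexicon, by = "word")

# Share of each sentiment among matched words, as a percentage
counts  <- table(matched$sentiment)
percent <- round(100 * counts / sum(counts))
percent
```

Here five of the seven tokens match the lexicon, three of them negative, so the negative share is 60%.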

Sentiment Word Frequency

Global

People have negative feelings about death and the virus, and especially about the economic side effects of the pandemic: “debt” appears 8,370 times in global tweets and is the most common negative word. Among positive words, “support” and, interestingly, “trump” stand out.

top_words <- tweets_bing %>%
  # Count by word and sentiment
  count(word, sentiment) %>%
  group_by(sentiment) %>% #group by sentiment
  # Take the top 10 for each sentiment
  top_n(10) %>%
  ungroup() %>%
  # Make word a factor in order of n
  mutate(word = reorder(word, n))

#plot the result
ggplot(top_words, aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = n, hjust=1), size = 3.5, color = "black") +
  facet_wrap(~sentiment, scales = "free") +  
  coord_flip() +
  ggtitle("Most Common Positive and Negative words (Global)") + 
  theme(plot.title = element_text(size = 14, face = "bold",hjust = 0.5))

US

While the most common negative word in the US is still “debt,” the most frequent positive word is “stimulate.” However, we still need to evaluate the context around these words.

top_words_us <- tidy_us %>%
  # Implement sentiment analysis using the "bing" lexicon
  inner_join(get_sentiments("bing")) %>%
  # Count by word and sentiment
  count(word, sentiment) %>%
  group_by(sentiment) %>%
  # Take the top 10 for each sentiment
  top_n(10) %>%
  ungroup() %>%
  # Make word a factor in order of n
  mutate(word = reorder(word, n))

#plot the result above
ggplot(top_words_us, aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = n, hjust=1), size = 3.5, color = "black") +
  facet_wrap(~sentiment, scales = "free") +  
  coord_flip() +
  ggtitle("Most common positive and negative words (US)") + 
  theme(plot.title = element_text(size = 14, face = "bold",hjust = 0.5))

(b) NRC Emotional Lexicon

General Sentiments

People primarily express trust, fear and anticipation in tweets.

tidy_tweets %>%
  # implement sentiment analysis using the "nrc" lexicon
  inner_join(get_sentiments("nrc")) %>%
  # remove "positive/negative" sentiments
  filter(!sentiment %in% c("positive", "negative")) %>%
  #get the frequencies of sentiments
  count(sentiment,sort = T) %>% 
  #calculate the proportion
  mutate(percent=100*n/sum(n)) %>%
  select(sentiment, percent) %>%
  #plot the result
  chartJSRadar(showToolTipLabel = TRUE, main = "NRC Radar")
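
Unlike a positive/negative lexicon, an emotion lexicon can assign several emotions to one word, so the join can return more rows than there are tokens. The sketch below uses a small made-up lexicon (toy_nrc is hypothetical and its word-to-emotion pairs are for illustration only) to show how emotion shares like those in the radar chart are computed:

```r
# Illustrative tokens
toy_tokens <- data.frame(
  word = c("pandemic", "hope", "pandemic", "doctor"),
  stringsAsFactors = FALSE
)

# Hypothetical emotion lexicon: one word may carry several emotions
toy_nrc <- data.frame(
  word      = c("pandemic", "pandemic", "hope", "hope", "doctor"),
  sentiment = c("fear", "sadness", "anticipation", "joy", "trust"),
  stringsAsFactors = FALSE
)

# Many-to-many join: each token is repeated once per matching emotion
joined <- merge(toy_tokens, toy_nrc, by = "word")

# Percentage share of each emotion among all matched rows
emotion_share <- 100 * prop.table(table(joined$sentiment))
round(emotion_share, 1)
```

Four tokens produce seven joined rows here, which is why the shares describe word-emotion pairs and not individual words or tweets.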

Sentiment Word Frequency

Global

People display anger, disgust, fear, and sadness toward the pandemic and related deaths. People also show trust in the economy and the president, but they are surprised about Trump.

tidy_tweets %>%
  # implement sentiment analysis using the "nrc" lexicon
  inner_join(get_sentiments("nrc")) %>%
  # remove "positive/negative" sentiments
  filter(!sentiment %in% c("positive", "negative")) %>%
  #get the frequencies of sentiments of words
  count(word,sentiment) %>% 
  group_by(sentiment) %>%
  top_n(10) %>% 
  ungroup() %>%
  mutate(word=reorder(word,n)) %>% 
  #plot the sentiment word frequency
  ggplot(aes(x=word,y=n,fill=sentiment)) +
    geom_col(show.legend = FALSE) +
    facet_wrap(~ sentiment, scales = "free") +
    coord_flip() +
  ggtitle(label = "Sentiment Word Frequency (Global)") + 
  theme(plot.title = element_text(size = 14, face = "bold",hjust = 0.5))

US

Similarly, people located in the US are unhappy with the virus, but compared to the pandemic itself, debt seems to cause more sadness.

tidy_us %>%
  # implement sentiment analysis using the "nrc" lexicon
  inner_join(get_sentiments("nrc")) %>%
  # remove "positive/negative" sentiments
  filter(!sentiment %in% c("positive", "negative")) %>%
  #get the frequencies of sentiments of words
  count(word,sentiment) %>% 
  group_by(sentiment) %>%
  top_n(10) %>% 
  ungroup() %>%
  mutate(word=reorder(word,n)) %>% 
  #plot the sentiment word frequency
  ggplot(aes(x=word,y=n,fill=sentiment)) +
    geom_col(show.legend = FALSE) +
    facet_wrap(~ sentiment, scales = "free") +
    coord_flip() + 
  ggtitle(label = "Sentiment Word Frequency (US)") + 
  theme(plot.title = element_text(size = 14, face = "bold",hjust = 0.5))

3. Sentence-Level Sentiment Analysis

In the previous sections, “support,” “trump,” and “stimulate” appear most frequently among positive words, and “debt” among negative words. In this section, these words are examined in the context of sentences to understand more fully how people respond to critical topics on Twitter.

The sentimentr package by Tyler Rinker is applied to analyze the tweets that contain each of these words. Instead of only matching words against a dictionary of words labeled “positive,” “negative,” or “neutral,” sentimentr accounts for valence shifters such as negators, amplifiers, and de-amplifiers, and outputs a sentiment score for a sentence based on the sentiment scores of the words it contains.
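
To illustrate why valence shifters matter, the sketch below scores a sentence with a toy polarity table and flips the sign of a word that directly follows a negator. This is a deliberately simplified illustration and not the algorithm that sentimentr implements; the polarity values, the negators list, and the toy_score() helper are all hypothetical.

```r
# Hypothetical polarity table and negator list (illustrative values)
polarity <- c(good = 1, helpful = 1, bad = -1, deadly = -1)
negators <- c("not", "never", "no")

# Score a sentence: average the word polarities, flipping the sign of any
# word that directly follows a negator
toy_score <- function(sentence) {
  words  <- strsplit(tolower(sentence), "[^a-z']+")[[1]]
  words  <- words[words != ""]
  scores <- ifelse(words %in% names(polarity), polarity[words], 0)
  prev_is_negator <- c(FALSE, head(words, -1) %in% negators)
  scores[prev_is_negator] <- -scores[prev_is_negator]
  sum(scores) / length(words)
}

toy_score("The response was good")      # positive score
toy_score("The response was not good")  # negative score
```

A plain dictionary lookup would rate both sentences as positive because both contain "good"; accounting for the negator is what separates them.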

“Support”

The tweets that contain “support” are about channels for donating to support health institutions and local businesses.

#get tweets that contain "support"
support <- tweets %>% filter(str_detect(tolower(text), fixed("support"))) #vectorized, case-insensitive substring match
#View(support$text)
head(support$text)

## [1] "Our Founders are only prepared to support a different city weekly. Our focus has shifted more towards New Orleans. Unemployment rate has sky rocketed over the last 3-weeks. Also if you can support please direct message.GOD BLESS. \n#NewOrleans #coronavirus #usaCoronavirus #cashapp"                               
## [2] "\U0001f4e2 The Deeper their States of mind scrambles for support the more they show their\U0001f918. Symbolism will be their downfall \n\U0001f53b\nNorth Carolina woman gets coronavirus despite staying home for three weeks \U0001f914...\n#NorthCarolina \n#CoronavirusOutbreak\nCLICK LINK\nhttps://t.co/Vv4kqV3VVi"
## [3] "JCF guest @ljiresearch's Dr. @EOSaphire &amp; other #SanDiego researchers featured in @sdut for their work studying #coronavirus. Support our #PublicHealth #COVID19 Fund with your #donoradvisedfund or donate to support institutions doing this critical work. https://t.co/0UBoCs6NL1 https://t.co/jEHjY3aAwI"       
## [4] "IN THE HOME--Kieding Senior Project Manager Jaime Brunner gets into some Friday at-home work.  #takeabreath #kieding #doingourpart  #kiedingith #supportingsmallbusiness #workathome #vigilance #coronavirus https://t.co/sL2WypY7n3"                                                                                    
## [5] "Link to Cauyunan appeal, please see \n\nhttps://t.co/DsioYKIrT1\n\nHow you can help, please see \n\nhttps://t.co/cIBtp9FCmt\n\nWe thank you for your continuing support and generosity. We are in this together!\n\n#resilienceinthetimeofcovid19\n#weareinthistogether\n#Covid_19"                                      
## [6] "Thank you, Senator Gillibrand for your support of the Capital Region’s incredible medical professionals and local small businesses. #Covid19 #TroyNY #coronavirus https://t.co/zzqLyP9aCL"

Most of the related tweets tend to be optimistic, given the positive mean and median, but there are extreme outliers, with the most negative score being -1.11.

# get average sentiment score for each sentence
sentiment_support <- sentiment_by(get_sentences(support$text))

summary(sentiment_support$ave_sentiment)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1.10780  0.09751  0.16013  0.15320  0.18237  1.48076

#plot the score distribution
ggplot(sentiment_support,aes(ave_sentiment)) +
  geom_histogram(bins = 50) + 
  labs(title = "Sentiment Histogram of Tweets that Contain 'Support' ", x = "Sentiment Score") +
  theme_bw() +
  theme(plot.title = element_text(size = 14, face = "bold",hjust = 0.5)) +
   geom_vline(xintercept = 0, color = "red")

“Trump”

In the tweets that contain “trump,” people express lots of criticism regarding President Trump’s inaction to cope with the Coronavirus.

#get tweets that contain "trump"
trump <- tweets %>% filter(str_detect(tolower(text), fixed("trump"))) #vectorized, case-insensitive substring match
#View(trump$text)
head(trump$text)

## [1] "@BamaStephen When the World looks back on the #TrumpPresidency they will see how a smarter &amp; more Compassionate person,would have taken Action in Feb. instead of calling the #coronavirus a Hoax &amp; no more than a typical flu bug!Thousands died b/c of the #IdiotPresident #Trump"                                                    
## [2] "\U0001f534 LIVE PODCAST: CWR#866 4_13_20 on @Spreaker #china #coronavirus #fauci #pharma #trump https://t.co/HYfVXGJZxr"                                                                                                                                                                                                                        
## [3] "Another reason why the Trump propaganda unit should be dismantled due to gross incompetence...this video just skipped the month of February. So basically the Trump Admin did nothing that whole month. Well done. #Trump #coronavirus https://t.co/g5LugStcs6"                                                                                 
## [4] "US's global reputation hits rock-bottom over Trump's #coronavirus response\n\nhttps://t.co/scYsse1IFK"                                                                                                                                                                                                                                          
## [5] "\U0001f51d #PlattsCommodityNews Americas Apr 13\n\U0001f4f0 WTI retreats as market weighs OPEC+ cuts | https://t.co/f1I7GyfTfz\n\U0001f4f0 Baker Hughes plans $15B impairment, citing #coronavirus  | https://t.co/s27hN8m9Nw\n\U0001f3a7 Podcast: Has Trump found religion on low oil prices | https://t.co/cmUZGwQBFG https://t.co/4mLLt8xxET"
## [6] "@POTUSrox @ScottPresler @realDonaldTrump We Dems will watch you Reps die from #coronavirus and will take over government in the 2020 election. Darwin was right!"

Most of the related tweets tend to be moderately negative, given the negative mean and median, but there are extreme outliers, with the most positive score being 1.30 and the most negative being -1.48.

# get average sentiment score for each sentence
sentiment_trump <- sentiment_by(get_sentences(trump$text))
summary(sentiment_trump$ave_sentiment)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1.48464 -0.17613 -0.05180 -0.06594  0.04564  1.30479

#plot the score distribution
ggplot(sentiment_trump,aes(ave_sentiment)) +
  geom_histogram(bins = 50) + 
  labs(title = "Sentiment Histogram of Tweets that Contain 'Trump' ", x = "Sentiment Score") +
  theme_bw() +
  theme(plot.title = element_text(size = 14, face = "bold",hjust = 0.5)) +
   geom_vline(xintercept = 0, color = "red")

“Stimulate”

The word “stimulate” seems to come primarily from tweets with the hashtag “#StudentDebtStimulus.” These tweets urge lawmakers to cancel student debt as a way to stimulate the economy.

#get tweets that contain "stimulate"
stimulate <- tweets %>% filter(str_detect(tolower(text), fixed("stimulate"))) #vectorized, case-insensitive substring match
#View(stimulate$text)
head(stimulate$text)

## [1] ".@SenSchumer, I urge you to #cancelstudentdebt in the next #coronavirus package. A #StudentDebtStimulus will help the 45 million people with student debt and stimulate the economy when it is needed most."   
## [2] ".@Kilili_Sablan, I urge you to #cancelstudentdebt in the next #coronavirus package. A #StudentDebtStimulus will help the 45 million people with student debt and stimulate the economy when it is needed most."
## [3] ".@Senatemajldr, I urge you to #cancelstudentdebt in the next #coronavirus package. A #StudentDebtStimulus will help the 45 million people with student debt and stimulate the economy when it is needed most." 
## [4] ".@GOPLeader, I urge you to #cancelstudentdebt in the next #coronavirus package. A #StudentDebtStimulus will help the 45 million people with student debt and stimulate the economy when it is needed most."    
## [5] ".@RepBonamici, I urge you to #cancelstudentdebt in the next #coronavirus package. A #StudentDebtStimulus will help the 45 million people with student debt and stimulate the economy when it is needed most."  
## [6] ".@CongressmanGT, I urge you to #cancelstudentdebt in the next #coronavirus package. A #StudentDebtStimulus will help the 45 million people with student debt and stimulate the economy when it is needed most."

The scores are tightly clustered: the first quartile, median, and third quartile all equal -0.03, so at least 75% of these tweets are slightly negative.

# get average sentiment score for each sentence
sentiment_stimulate <- sentiment_by(get_sentences(stimulate$text))
summary(sentiment_stimulate$ave_sentiment)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.40367 -0.03130 -0.03130 -0.02837 -0.03130  0.82957
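
The reading of the quartiles can be checked directly: if the third quartile is below zero, at least three quarters of the scores are negative. A small sketch with made-up scores (the toy_scores vector is illustrative and only mimics the shape of a distribution dominated by one repeated template tweet) shows the check:

```r
# Made-up scores: many identical slightly negative values plus a few outliers
toy_scores <- c(rep(-0.0313, 16), -0.40, -0.10, 0.25, 0.83)

# Five-number summary plus mean, as printed by summary()
summary(toy_scores)

# A third quartile below zero implies at least 75% of scores are negative
q3 <- unname(quantile(toy_scores, 0.75))
share_negative <- mean(toy_scores < 0)

q3 < 0          # TRUE
share_negative  # 0.9
```

When the quartiles coincide like this, the mean is driven mostly by the handful of outliers, so the median is the more telling summary.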

#plot the score distribution
ggplot(sentiment_stimulate,aes(ave_sentiment)) +
  geom_histogram(bins = 50) + 
  labs(title = "Sentiment Histogram of Tweets that Contain 'stimulate' ", x = "Sentiment Score") +
  theme_bw() +
  theme(plot.title = element_text(size = 14, face = "bold",hjust = 0.5)) +
   geom_vline(xintercept = 0, color = "red")

“Debt”

Similar to the tweets that contain “stimulate,” tweets that contain “debt” mainly carry the hashtag “#cancelstudentdebt” and often “#StudentDebtStimulus.” These tweets urge lawmakers to cancel student debt as a way to stimulate the economy.

#get tweets that contain "debt"
debt <- tweets %>% filter(str_detect(tolower(text), fixed("debt"))) #vectorized, case-insensitive substring match

#View(debt$text)
head(debt$text)

## [1] ".@SenSchumer, I urge you to #cancelstudentdebt in the next #coronavirus package. A #StudentDebtStimulus will help the 45 million people with student debt and stimulate the economy when it is needed most."   
## [2] ".@Kilili_Sablan, I urge you to #cancelstudentdebt in the next #coronavirus package. A #StudentDebtStimulus will help the 45 million people with student debt and stimulate the economy when it is needed most."
## [3] ".@Senatemajldr, I urge you to #cancelstudentdebt in the next #coronavirus package. A #StudentDebtStimulus will help the 45 million people with student debt and stimulate the economy when it is needed most." 
## [4] ".@GOPLeader, I urge you to #cancelstudentdebt in the next #coronavirus package. A #StudentDebtStimulus will help the 45 million people with student debt and stimulate the economy when it is needed most."    
## [5] ".@RepBonamici, I urge you to #cancelstudentdebt in the next #coronavirus package. A #StudentDebtStimulus will help the 45 million people with student debt and stimulate the economy when it is needed most."  
## [6] ".@CongressmanGT, I urge you to #cancelstudentdebt in the next #coronavirus package. A #StudentDebtStimulus will help the 45 million people with student debt and stimulate the economy when it is needed most."

At least 75% of these tweets are slightly negative, and on average they are marginally more negative than the tweets that contain “stimulate” (mean -0.032 versus -0.028).

# get average sentiment score for each sentence
sentiment_debt <- sentiment_by(get_sentences(debt$text))
summary(sentiment_debt$ave_sentiment)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.98198 -0.03130 -0.03130 -0.03198 -0.03130  0.67082

#plot the score distribution
ggplot(sentiment_debt,aes(ave_sentiment)) + 
  geom_histogram(bins = 50) + 
  labs(title = "Sentiment Histogram of Tweets that Contain 'debt' ", 
       x = "Sentiment Score") + 
  theme_bw() +
  theme(plot.title = element_text(size = 14, face = "bold",hjust = 0.5)) +
   geom_vline(xintercept = 0, color = "red")

Conclusion

Overall, the tweets convey a moderately pessimistic sentiment, with 56% of the sentiment words in the tweets marked as negative. This is also reflected in the emotion lexicon analysis chart, where words labeled joy or trust have lower frequencies than words with other emotion labels, especially fear and sadness, which have maximum frequencies of 15,000.

The primary source of negativity is not only the related health issues but also the destructive effect of the virus on the economy. On the one hand, compared to the previous research, in which people held the misconception that COVID-19 is similar to the flu, the current analysis suggests that people have recognized the fatal and pervasive nature of the virus and expressed concern. People complain about governments’ insufficient response to COVID-19, and President Trump is frequently mentioned in this context. On the other hand, economic decline profoundly harms local businesses, education, and the workforce. In the United States, student debt is one of the most mentioned topics; people appeal to lawmakers to cancel student debt as a means to stimulate the economy.

Ultimately, “people” and “support” are the two most frequent words in all tweets and contribute the most positivity. People continuously share resources and channels to support people in need and express appreciation to health care workers.

Future work

This project, as an exploratory analysis, works well for detecting social attitudes toward the coronavirus and for gaining insights that can direct future research.

However, there are certain limitations. The Twitter data, though it has many entries, consists of tweets from only a single day. The analysis is also limited in that it focuses on English-language tweets and thus fails to capture topics and sentiments expressed in other languages.

In the future, to produce a comprehensive and representative analysis, multilingual sentiment analysis should be applied to account for tweets in various languages. Further, future research should collect data over a longer period. Such data would allow us to observe and understand trends and patterns in issues related to health, the economy, and politics during the pandemic. Based on the conclusions of this exploratory analysis, we could potentially predict future trends and identify solutions to improve the current situation through modeling techniques such as time series analysis, topic modeling, and natural language processing.

Resources

[1] https://towardsdatascience.com/how-did-twitter-react-to-the-coronavirus-pandemic-2857592b449a

[2] https://www.washingtonpost.com/science/2020/03/17/analysis-millions-coronavirus-tweets-shows-whole-world-is-sad/

[3] https://www.kaggle.com/smid80/coronavirus-covid19-tweets-early-april

Exploratory and Sentiment Analysis of COVID-19 Tweets

Yulin Yu
