Introduction

In this project, we look at the coverage of the coronavirus in mainstream news articles and on Twitter in the USA. The main goal is to inspect the difference in the focus of each media outlet (topic modelling) and their expressed attitudes (sentiment analysis) on COVID-19.

The project follows these steps: data collection, data cleaning and pre-processing, topic modelling, and sentiment analysis. The data was gathered using web-scraping techniques in R and Python. After data cleaning, the Quanteda package in R was used to prepare the text for analysis. Then, we applied topic modelling to identify distinct topics. For sentiment analysis, we chose a lexicon, calculated sentiment scores and looked at the shift in sentiment over time.

The results reveal some disparities between the media sources. Social media, namely Twitter, reflects more of people’s sentiments and quick updates on the situation, while news articles focus more on macro-level problems. The sentiment in both media remained negative throughout the observed period.

Data Collection and Cleaning

News Articles

Using the keyword coronavirus for querying, we collected news articles published on news websites in the United States from the beginning of 2020, when the first cases of coronavirus were reported, until April 19, 2020. For sources, we looked at the top 15 U.S. news websites measured by unique monthly visitors (Statista, 2020), excluding news aggregators (Yahoo News, Google News) and topic-specific newspapers (The Wall Street Journal). We used a third-party API called Currents API in R to get all the available URLs, then passed them to the newspaper3k module in Python to scrape the full article text. After checking the data quality from each source, we narrowed the list down from 15 to 6 news outlets: CNN, The New York Times, Fox News, The Guardian, USA Today and The LA Times.

# Load the scraped data
news_data <- read_csv("news_coronavirus.csv")

# Add week and remove unnecessary column
news_data <- news_data %>% 
  mutate(week = isoweek(publish_date)) %>% 
  select(-title)

# Preview the data
glimpse(news_data)
Rows: 14,149
Columns: 4
$ publish_date <date> 2020-04-19, 2020-04-19, 2020-04-19, 2020-04-19, 202…
$ source       <chr> "CNN", "CNN", "CNN", "CNN", "CNN", "CNN", "CNN", "CN…
$ text         <chr> "(CNN) Nothing can bring back the months of wedding …
$ week         <dbl> 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, …

Twitter

Thanks to the support of the team at <crowdbreaks.org>, which tracks Twitter trends on COVID-19 in real time, we obtained all tweet IDs that they had collected in the US since January 2020 with the keywords “wuhan”, “ncov”, “coronavirus”, “covid” and “sars-cov-2”. We took a 50% sample of the data, stratified by week, which keeps the weekly distribution of tweets while speeding up data collection enough to obtain the data in time for the report. The Twitter data were collected until April 16, 2020.
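The sampling code itself is not part of this notebook. As a minimal sketch of what a 50% sample stratified by week could look like, assuming a hypothetical table `tweet_ids` with one row per tweet ID and its ISO week (both the table and its values are made up for illustration):

```r
library(dplyr)

# Hypothetical stand-in for the list of tweet IDs with their ISO week
tweet_ids <- tibble(
  id   = 1:300,
  week = rep(c(3, 4, 5), times = c(40, 60, 200))
)

set.seed(2020)  # make the random draw reproducible

# Draw 50% of the IDs within each week so the weekly shape is preserved
sampled_ids <- tweet_ids %>%
  group_by(week) %>%
  slice_sample(prop = 0.5) %>%
  ungroup()

# Weekly counts before and after sampling
count(tweet_ids, week)
count(sampled_ids, week)
```

Because the draw happens within each week, every week keeps half of its tweets and the relative weekly volumes stay the same.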

# Load the scraped data
tweets_data <- read_csv("twitter_coronavirus.csv")

# Transform the date column
tweets_data <- tweets_data %>% 
  separate(created_at, c("wday", "month", "day", "time", "plus", "year"), 
           sep = " ") %>% 
  mutate(date = paste(month, day, year, sep = " ")) %>% 
  select(c("text", "date"))

tweets_data$date <- as.Date(tweets_data$date, format = "%b %d %Y")
tweets_data <- tweets_data %>% 
  mutate(week = isoweek(date)) %>% 
  filter(week > 2)

# Clean hashtags, links and special characters
tweets_data$text <- tweets_data$text %>% 
  str_replace_all("#", " ") %>%
  str_remove_all("(?<=^|\\s)http[^\\s]+") %>% 
  str_remove_all("[^a-zA-Z0-9 ]") %>% 
  trimws()

# Remove tweets with fewer than two words left after cleaning
tweets_data <- tweets_data %>% 
  filter(str_count(text, pattern = boundary("word")) > 1)
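To illustrate what the cleaning steps above do, here is a small sketch that applies the same patterns to two made-up tweets (the tweet texts are hypothetical and not taken from the data set):

```r
library(stringr)

# Made-up tweets for illustration only
toy_tweets <- c(
  "Stay home! #coronavirus https://t.co/abc123",
  "https://t.co/xyz789"
)

cleaned <- str_replace_all(toy_tweets, "#", " ")             # drop the hash sign, keep the tag text
cleaned <- str_remove_all(cleaned, "(?<=^|\\s)http[^\\s]+")  # drop links
cleaned <- str_remove_all(cleaned, "[^a-zA-Z0-9 ]")          # drop remaining special characters
cleaned <- trimws(cleaned)

# Count words; a tweet that consisted only of a link has no words left
n_words <- str_count(cleaned, pattern = boundary("word"))

# Keep only tweets with more than one word, as in the filter above
cleaned[n_words > 1]
```

The second tweet is dropped because nothing remains once its link is removed, which is the case the word-count filter is meant to catch.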

Data Overview

For further data cleaning, we removed duplicates and texts that were too short, and kept only texts written in English. We kept the time variable in week format and the full text of each document for analysis. Consequently, 14,149 news articles and 109,329 tweets were ready for analysis.

The data distribution can be seen below. The number of documents on COVID-19 in the U.S. was relatively low in the first six weeks (week 3 to week 8). However, it increased dramatically from week 9 and peaked in week 11 for Twitter and in week 13 for news. Week 13 corresponds to the end of March, when the number of infected cases in the country surged.

Pre-processing

For text data preparation, firstly, we created a corpus for the data set of each source. A corpus consists of a collection of documents (each document is an article or a tweet), and the document variables which describe the characteristics of the document, for instance published date and source. Subsequently, we tokenized each corpus, which separates the text into its single words (also called terms or tokens). A summary of the corpus for news articles can be seen below.

corpus_News_data <- corpus(news_data, text_field = c("text"), unique_docnames = F) %>%
  `docnames<-`(news_data$week) # update the name of document in the corpus
summary(corpus_News_data, 5)
Corpus consisting of 14149 documents, showing 5 documents:

 Text Types Tokens Sentences publish_date source week
 16.1   134    190         6   2020-04-19    CNN   16
 16.2    92    155         7   2020-04-19    CNN   16
 16.3   127    230         8   2020-04-19    CNN   16
 16.4   113    192         5   2020-04-19    CNN   16
 16.5   581   1509        54   2020-04-19    CNN   16
head(corpus_News_data, 5) # take a look at the corpus
Corpus consisting of 5 documents and 3 docvars.
16.1 :
"(CNN) Nothing can bring back the months of wedding planning ..."

16.2 :
"(CNN) Broadway actor Nick Cordero is recovering after having..."

16.3 :
"(CNN) A Louisiana pastor who defied state orders and repeate..."

16.4 :
"(CNN) More than 100,000 people defied Bangladesh lockdown or..."

16.5 :
"Hong Kong (CNN) Less than a month ago, Singapore was being h..."
corpus_Twitter_data <- corpus(tweets_data, text_field = c("text"), unique_docnames = F) %>%
  `docnames<-`(tweets_data$week)
stopwords_extended <- readLines("stopwords_en.txt", encoding = "UTF-8")
lemma_data <- read.csv("baseform_en.tsv", encoding = "UTF-8")

We pre-processed the data sets by lowercasing all letters and removing numbers, punctuation, symbols, URLs and separators. The main idea is to reduce the number of distinct terms extracted, which helps improve the accuracy of both topic modelling and sentiment analysis. If two words are similar, it is convenient to combine them into one unique word. Moreover, if a word is not relevant for the analysis, it can be removed. Hence, we applied stop-word removal and lemmatization. Stop words are words that appear in texts but do not carry substantial meaning (e.g., “the”, “a”, or “for”), and lemmatization deals with inflected forms of words by replacing them with their base forms. The total number of tokens and the number of unique tokens in each source decreased by 40 to 70% after the above steps:

Token_news_data <- corpus_News_data %>% 
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE, 
         remove_url = TRUE, remove_separators = TRUE) %>% 
  tokens_tolower(keep_acronyms = FALSE) %>%
  tokens_remove(pattern = c("amid", "updates", "live", "video", 
                            "didn", "briefing")) %>%
  tokens_replace(lemma_data$inflected_form, lemma_data$lemma, 
                 valuetype = "glob") %>%
  tokens_replace(pattern = c("covid-19", "ncov", "cov"), 
                 c("coronavirus", "coronavirus", "coronavirus"),
                 valuetype = "fixed") %>% 
  tokens_remove(pattern = stopwords_extended, padding = F)

Token_Twitter_data <- corpus_Twitter_data %>% 
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE,
         remove_url = TRUE, remove_separators = TRUE) %>% 
  tokens_tolower(keep_acronyms = FALSE) %>%
  tokens_remove(pattern = c("people", "covid19", "covid", "rt", "dont")) %>%
  tokens_replace(lemma_data$inflected_form, lemma_data$lemma, 
                 valuetype = "glob") %>% 
  tokens_remove(pattern = stopwords_extended, padding = T)
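To make the effect of these steps concrete, the following sketch runs the same kind of pipeline on a single made-up sentence. The sentence, the three-row lemma table and the short stop-word list are hypothetical stand-ins for `baseform_en.tsv` and `stopwords_en.txt`:

```r
library(quanteda)

# Made-up input and tiny stand-ins for the lemma table and stop-word list
toy_text <- c(doc1 = "The viruses are spreading quickly in 2 cities!")
toy_lemmas <- data.frame(
  inflected_form = c("viruses", "spreading", "cities"),
  lemma          = c("virus", "spread", "city"),
  stringsAsFactors = FALSE
)
toy_stopwords <- c("the", "are", "in")

# Tokenize, dropping punctuation and numbers
toy_tokens <- tokens(toy_text, remove_punct = TRUE, remove_numbers = TRUE)
# Lowercase every token
toy_tokens <- tokens_tolower(toy_tokens)
# Lemmatize: replace inflected forms with their base forms
toy_tokens <- tokens_replace(toy_tokens, toy_lemmas$inflected_form,
                             toy_lemmas$lemma, valuetype = "fixed")
# Remove stop words
toy_tokens <- tokens_remove(toy_tokens, pattern = toy_stopwords)

as.list(toy_tokens)$doc1
```

Of the eight whitespace-separated items in the sentence, only four tokens survive (`virus`, `spread`, `quickly`, `city`), which is the kind of reduction reported above.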

In the next step, we generated a document-feature matrix (also known as a document-term matrix) for each source. It records how frequently each term (token) occurs in each document of the corpus. We kept only the top 5% most frequent features (minimum term frequency set at the 0.95 quantile) that are present in at most 10% of all documents (maximum document frequency set at 0.1), to focus on common but distinctive features. We also dropped documents that had no tokens left after trimming. The dimensions of the two document-feature matrices are as follows:

News_DFM <- Token_news_data %>% 
  tokens_remove("") %>%
  dfm() %>% 
  dfm_trim(min_termfreq = 0.95, termfreq_type = "quantile",
           max_docfreq = 0.1, docfreq_type = "prop")
# Drop documents that have no tokens left after trimming
News_DFM <- News_DFM[ntoken(News_DFM) > 0, ]
#head(News_DFM, n = 5, nf = 8)

Twitter_DFM <- Token_Twitter_data %>% 
  tokens_remove("") %>%
  dfm() %>%
  dfm_trim(min_termfreq = 0.95, termfreq_type = "quantile", 
           max_docfreq = 0.1, docfreq_type = "prop")
# Drop documents that have no tokens left after trimming
Twitter_DFM <- Twitter_DFM[ntoken(Twitter_DFM) > 0,]
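The idea behind the two trimming thresholds can be sketched in base R on a toy count matrix. This is only an illustration of the logic; the matrix values are made up, the 0.5 quantile is used because the toy vocabulary is tiny, and the exact quantile handling inside `dfm_trim()` may differ in detail:

```r
# Toy count matrix: 20 documents (rows) by 4 terms (columns); values are made up
counts <- matrix(0, nrow = 20, ncol = 4,
                 dimnames = list(paste0("doc", 1:20),
                                 c("virus", "mask", "beer", "hall")))
counts[, "virus"] <- 5    # appears in every document
counts[1, "mask"] <- 20   # frequent overall, but concentrated in one document
counts[2, "beer"] <- 1    # rare
counts[3, "hall"] <- 1    # rare

term_freq <- colSums(counts)                     # total occurrences per term
doc_prop  <- colSums(counts > 0) / nrow(counts)  # share of documents containing the term

# Keep terms at or above a term-frequency quantile AND in at most 10% of documents
min_tf <- quantile(term_freq, probs = 0.5, names = FALSE)
keep   <- term_freq >= min_tf & doc_prop <= 0.1

names(keep)[keep]
```

Here `virus` is frequent but appears everywhere, and `beer` and `hall` are too rare, so only `mask` passes both conditions.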

For better interpretation, the first 3 documents and 10 features of the document-feature matrix of news articles can be seen below.

News_DFM[1:3, 1:10]
Document-feature matrix of: 3 documents, 10 features (66.67% sparse) and 3 docvars.
      features
docs   wed waste alleviate engage couple stress beer chance hall postpone
  16.1   3     1         1      1      4      1    2      1    1        1
  16.2   0     0         0      0      0      0    0      0    0        0
  16.3   0     0         0      0      0      0    0      0    0        0

Topic Modelling

We discovered topics distributed in news articles and tweets on a weekly basis over the three-month period and compared the topics between the two media outlets. For topic modelling we used Structural Topic Modeling (STM), since it incorporates metadata (information about each document) into the topic modelling framework, so that we can discover topics and estimate their relationship to the documents’ metadata.

STM is an unsupervised machine learning model for which the number of topics K must be defined in advance. As with k-means clustering, we do not know ahead of time how many topics suit a given corpus. Therefore, we trained a group of topic models with different values of K and then evaluated how many topics are appropriate. Here we proceeded with K = 25 for each of the media forms. The evaluation method is described in a blog post by Julia Silge.

News_stm <- convert(News_DFM, to = "stm")
Twitter_stm <- convert(Twitter_DFM, to = "stm")
News_topic_model <- stm(
  documents = News_stm$documents, 
  vocab = News_stm$vocab,
  K = 25,
  prevalence =~ s(week),
  data = News_stm$meta,
  init.type = "Spectral",
  seed = 123456)
Twitter_topic_model <- stm(
  documents = Twitter_stm$documents, 
  vocab = Twitter_stm$vocab,
  K = 25,
  prevalence =~ s(week),
  data = Twitter_stm$meta,
  init.type = "Spectral",
  seed = 123456)

The plots below show the top 10 topics in each corpus (News, Twitter), together with the seven words that have the highest probability within each topic. A closer look into each source showed us some interesting patterns.

News_beta <- tidy(News_topic_model)
News_gamma <- tidy(News_topic_model, matrix = "gamma")

News_top_terms <- News_beta %>%
  arrange(beta) %>%
  group_by(topic) %>%
  top_n(7, beta) %>%
  arrange(-beta) %>%
  select(topic, term) %>%
  summarise(terms = list(term)) %>%
  mutate(terms = map(terms, paste, collapse = ", ")) %>% 
  unnest(cols = terms)

News_gamma_terms <- News_gamma %>%
  group_by(topic) %>%
  summarise(gamma = mean(gamma)) %>%
  arrange(desc(gamma)) %>%
  left_join(News_top_terms, by = "topic") %>%
  mutate(topic = paste0("Topic ", topic),
         topic = reorder(topic, gamma))

News_gamma_terms %>%
  top_n(10, gamma) %>%
  ggplot(aes(topic, gamma, label = terms, fill = topic)) +
  geom_col(show.legend = FALSE) +
  geom_text(hjust = 0.8, nudge_y = 0.0005, size = 3,
            family = "IBMPlexSans") +
  coord_flip() +
  scale_y_continuous(expand = c(0,0),
                     limits = c(0, 0.09),
                     labels = percent_format()) +
  theme_tufte(base_family = "IBMPlexSans", ticks = FALSE) +
  theme(plot.title = element_text(size = 16,
                                  family="IBMPlexSans-Bold"),
        plot.subtitle = element_text(size = 13)) +
  labs(x = NULL, y = expression(gamma),
       title = "Top 10 topics by popularity on News Articles",
       subtitle = "With the top words that contribute to each topic")

Twitter_beta <- tidy(Twitter_topic_model)
Twitter_gamma <- tidy(Twitter_topic_model, matrix = "gamma")

Twitter_top_terms <- Twitter_beta %>%
  arrange(beta) %>%
  group_by(topic) %>%
  top_n(7, beta) %>%
  arrange(-beta) %>%
  select(topic, term) %>%
  summarise(terms = list(term)) %>%
  mutate(terms = map(terms, paste, collapse = ", ")) %>% 
  unnest(cols = terms)

Twitter_gamma_terms <- Twitter_gamma %>%
  group_by(topic) %>%
  summarise(gamma = mean(gamma)) %>%
  arrange(desc(gamma)) %>%
  left_join(Twitter_top_terms, by = "topic") %>%
  mutate(topic = paste0("Topic ", topic),
         topic = reorder(topic, gamma))

Twitter_gamma_terms %>%
  top_n(10, gamma) %>%
  ggplot(aes(topic, gamma, label = terms, fill = topic)) +
  geom_col(show.legend = FALSE) +
  geom_text(hjust = 0.85, nudge_y = 0.0005, size = 3,
            family = "IBMPlexSans") +
  coord_flip() +
  scale_y_continuous(expand = c(0,0),
                     limits = c(0, 0.09),
                     labels = percent_format()) +
  theme_tufte(base_family = "IBMPlexSans", ticks = FALSE) +
  theme(plot.title = element_text(size = 16,
                                  family="IBMPlexSans-Bold"),
        plot.subtitle = element_text(size = 13)) +
  labs(x = NULL, y = expression(gamma),
       title = "Top 10 topics by popularity on Twitter",
       subtitle = "With the top words that contribute to each topic")

The key topics on Twitter concern “the spread of the virus” (Topic 11), “negative reactions” (Topics 7, 23) and “updates on the pandemic” (Topics 13, 4). In contrast, the News topics show a rather different pattern, relating more to “political matters” (Topic 22), “the situation in other countries” (Topic 19), “the economic situation” (Topic 4) and “the healthcare system” (Topics 20, 21). Messages about family are popular in both sources (News Topic 13, Twitter Topic 18). Furthermore, politicians are well covered, with Trump on Twitter and Pence and Cuomo in mainstream papers.

Sentiment Analysis

Corpus sentiment analysis

To investigate sentiments in news articles and tweets, tokens extracted from the documents are used to calculate sentiment scores with the AFINN lexicon developed by Finn Årup Nielsen, which is available through the tidytext package. The AFINN lexicon is a list of English terms manually rated for valence with an integer between -5 (negative) and +5 (positive). It comprises 878 positive terms and 1,598 negative terms. The sentiment score is the difference between the sum of positive scores and the absolute sum of negative scores, expressed as a percentage of their total.
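Written out, with P the sum of positive scores and N the absolute sum of negative scores in a week, the score is 100 × (P − N) / (P + N). A small worked sketch with made-up weekly totals (the numbers are hypothetical, chosen only to make the arithmetic easy to follow):

```r
# Made-up weekly totals of AFINN scores (frequency * value), split by sign
weekly <- data.frame(
  week     = c(3, 12),
  positive = c(30, 45),    # sum of positive scores
  negative = c(-70, -55)   # sum of negative scores
)

# Total sentiment mass: positive sum plus absolute negative sum
total <- weekly$positive + abs(weekly$negative)

# Polarity in percent: ranges from -100 (all negative) to +100 (all positive)
weekly$polarity <- 100 * (weekly$positive - abs(weekly$negative)) / total
weekly
```

In this toy case the polarity is -40 for week 3 and -10 for week 12, so both weeks are negative overall but the second one less so.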

# Make a dataframe with the AFINN dictionary
afinn <- get_sentiments("afinn")

News_ext <- convert(News_DFM, to = "tripletlist")
News_df <- data.frame(doc = News_ext$document,
                        word = News_ext$feature,
                        freq = News_ext$frequency)
News_df$word <- as.character(News_df$word)

News_afinn <- News_df %>% 
  inner_join(afinn, by = "word") %>% 
  mutate(score = freq * value) %>% 
  separate(doc, into = c("week", "no"), sep = "\\.") %>% 
  mutate(sentiment = ifelse(score > 0, "positive", "negative")) %>% 
  group_by(week, sentiment) %>% 
  summarise(sentiment_score = sum(score)) %>% 
  ungroup() %>% 
  mutate(source = "News")

Twitter_ext <- convert(Twitter_DFM, to = "tripletlist")
Twitter_df <- data.frame(doc = Twitter_ext$document,
                      word = Twitter_ext$feature,
                      freq = Twitter_ext$frequency)
Twitter_df$word <- as.character(Twitter_df$word)

Twitter_afinn <- Twitter_df %>% 
  inner_join(afinn, by = "word") %>% 
  mutate(score = freq * value) %>% 
  separate(doc, into = c("week", "no"), sep = "\\.") %>% 
  mutate(sentiment = ifelse(score > 0, "positive", "negative")) %>% 
  group_by(week, sentiment) %>% 
  summarise(sentiment_score = sum(score)) %>% 
  ungroup() %>% 
  mutate(source = "Twitter")

# Create a dataframe to plot sentiment
News_afinn_plot <- News_afinn %>% 
  pivot_wider(names_from = sentiment, values_from = sentiment_score) %>% 
  mutate(sum = positive + (-1)*negative, 
         pos_percent = positive/sum*100,
         neg_percent = (-1)*negative/sum*100, 
         polarity = round((pos_percent - neg_percent),3)) %>%
  select(week, polarity, source)

Twitter_afinn_plot <- Twitter_afinn %>% 
  pivot_wider(names_from = sentiment, values_from = sentiment_score) %>% 
  mutate(sum = positive + (-1)*negative, 
         pos_percent = positive/sum*100,
         neg_percent = (-1)*negative/sum*100, 
         polarity = round((pos_percent - neg_percent),3)) %>%
  select(week, polarity, source)

# Extract beta matrix to data frame format. Using AFINN lexicon, make a data frame with the sentiment for each word
News_beta <- tidy(News_topic_model, matrix = "beta")
News_topic_sc <- inner_join(News_beta, afinn, by = c("term" = "word")) %>% 
  mutate(score = beta * value) %>% 
  group_by(topic) %>% 
  summarise(sentiment_score = sum(score))
News_topic_sc$topic <- factor(News_topic_sc$topic, levels = c(1:25))

Twitter_beta <- tidy(Twitter_topic_model, matrix = "beta")
Twitter_topic_sc <- inner_join(Twitter_beta, afinn, by = c("term" = "word")) %>%  
  mutate(score = beta * value) %>% 
  group_by(topic) %>% 
  summarise(sentiment_score = sum(score))
Twitter_topic_sc$topic <- factor(Twitter_topic_sc$topic, levels = c(1:25))
afinn_chart_all <- rbind(News_afinn_plot, Twitter_afinn_plot)
afinn_chart_all$week <- as.numeric(as.character(afinn_chart_all$week))
afinn_chart_all %>% 
  arrange(week) %>% 
  ggplot(aes(x = week, y = polarity, color = source)) + 
  geom_point() +
  geom_path() +
  geom_hline(yintercept = 0, color = "gold3") +
  geom_vline(xintercept = 12, color = "coral2") +
  labs(x = "Week", y = "Sentiment Score") + 
  scale_x_continuous(  # This handles replacement of row 
    breaks = afinn_chart_all$week, # notice need to reuse data frame
    labels = afinn_chart_all$week) +
  scale_y_continuous(n.breaks = 10) +
  scale_colour_manual(values = c("#E3211C", "#1F78B4")) +
  theme(panel.grid = element_blank(),
        panel.background = NULL,
        axis.title = element_text(size = 16),
        axis.text = element_text(size = 16),
        legend.title = element_text(size = 16),
        legend.text = element_text(size = 16),
        legend.position = "bottom")

Overall, the opinion towards COVID-19 remained negative throughout the period. While news was a bit more positive than Twitter, there was a similar pattern as the sentiment scores were highly negative in week 3 but showed a positive trend in the subsequent weeks. At week 12 when the situation got worse in the US, the sentiment score of news articles went down then picked up again at week 15, whereas Twitter showed a consistent sentiment.

With the purpose of examining from several perspectives, we also made use of the NRC lexicon from Saif Mohammad and Peter Turney. The NRC lexicon categorizes words in a binary fashion (“yes”/ “no”) into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. We filtered out the positive and negative categories which left us with 4,463 distinct words in the lexicon (Note: one word can be in multiple sentiment categories). We counted the number of words in each category and computed their percentage distribution in each of our sources. Applying NRC lexicon to inspect the data in different categorical sentiments, we presented the result in a radar chart:

nrc <- get_sentiments("nrc") %>% 
  filter(!sentiment %in%  c("positive", "negative"))

# News join and calculate score
News_nrc <- News_df %>% 
  inner_join(nrc, by = "word") %>% 
  separate(doc, into = c("week", "no"), sep = "\\.") %>% 
  group_by(week, sentiment) %>% 
  summarise(count = sum(freq))
News_nrc_sum <- News_nrc %>% 
  group_by(sentiment) %>% 
  summarise(total_count = sum(count)) %>% 
  ungroup()
News_nrc_sum <- News_nrc_sum %>% 
  mutate(percent = total_count/ sum(News_nrc_sum$total_count),
         source = "News")

# Twitter join and calculate score
Twitter_nrc <- Twitter_df %>% 
  inner_join(nrc, by = "word") %>% 
  separate(doc, into = c("week", "no"), sep = "\\.") %>% 
  group_by(week, sentiment) %>% 
  summarise(count = sum(freq))
Twitter_nrc_sum <- Twitter_nrc %>% 
  group_by(sentiment) %>% 
  summarise(total_count = sum(count)) %>% 
  ungroup()
Twitter_nrc_sum <- Twitter_nrc_sum %>% 
  mutate(percent = total_count/ sum(Twitter_nrc_sum$total_count),
         source = "Twitter")

# Combine data and plot radar chart
nrc_chart_all <- rbind(News_nrc_sum, Twitter_nrc_sum)
nrc_chart_all %>% 
  select(-total_count) %>% 
  pivot_wider(names_from = source, values_from = percent) %>% 
  chartJSRadar(width = 8,
               height = 5,
               showToolTipLabel = FALSE,
               colMatrix = grDevices::col2rgb(c("red", "blue", "green")),
               labelSize = 18)

Overall, news articles showed the most positivity, ranking highest on “joy” and “trust”, but they also contained more “anger” words. Twitter posts were more negative, expressing more “disgust”, “sadness” and, to some extent, “surprise”. This is consistent with the outcome of the AFINN lexicon analysis above.

Topic-based sentiment analysis

At the topic level, we assigned each topic word its sentiment value from the lexicon and used the probability of the word in the topic, as given by the STM topic model, as its term weight. As a result, the topic sentiment score is calculated as

\[\text{Sentiment score of topic } A = \sum_{i=1}^{n} \text{prob}(\text{word}_i \mid \text{topic } A) \times \text{sentiment value}(\text{word}_i)\]

An overall topic sentiment score is computed by multiplying the sentiment value by the probability of words and summing the products in a chosen topic. Multiple topics could have the same words with different probability values. Therefore, the sentiment score of a topic would be distinguished from others even if the same set of words appear in different topics.
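As a worked sketch of this formula, consider two topics that share the same two words with different probabilities. The probabilities and the lexicon values below are made up for illustration and are not taken from the fitted models:

```r
# Hypothetical word probabilities per topic (the "beta" values of a topic model)
beta <- data.frame(
  topic = c(1, 1, 2, 2),
  term  = c("love", "die", "love", "die"),
  beta  = c(0.20, 0.10, 0.05, 0.30),
  stringsAsFactors = FALSE
)

# Hypothetical sentiment values for the two words
lexicon <- data.frame(
  word  = c("love", "die"),
  value = c(3, -3),
  stringsAsFactors = FALSE
)

# Attach the sentiment value to each topic word and weight it by its probability
scored <- merge(beta, lexicon, by.x = "term", by.y = "word")
scored$score <- scored$beta * scored$value

# Sum the weighted values within each topic
topic_score <- tapply(scored$score, scored$topic, sum)
topic_score
```

Topic 1 scores 0.2 × 3 + 0.1 × (−3) = 0.3 and topic 2 scores 0.05 × 3 + 0.3 × (−3) = −0.75, so the same pair of words yields opposite signs depending on how the topic weights them.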


# News
News_topic_sc_plot <- News_topic_sc %>% 
  ggplot(aes(x = topic, y = sentiment_score)) + 
  geom_col(fill = "#E3211C") +
  geom_hline(yintercept = 0, color = "gold3") +
  labs(x = NULL, y = NULL, subtitle = "News") + 
  scale_y_continuous(limits = c(-1, 0.2), n.breaks = 8) +
  theme(panel.grid = element_blank(),
        panel.background = NULL,
        axis.title = element_text(size = 16),
        axis.text = element_text(size = 14),
        plot.subtitle = element_text(size = rel(1.5)))

# Twitter
Twitter_topic_sc_plot <- Twitter_topic_sc %>% 
  ggplot(aes(x = topic, y = sentiment_score)) + 
  geom_col(fill = "#1F78B4") +
  geom_hline(yintercept = 0, color = "gold3") +
  labs(x = "Topic", y = NULL, subtitle = "Twitter") + 
  scale_y_continuous(limits = c(-1, 0.2), n.breaks = 8) +
  theme(panel.grid = element_blank(),
        panel.background = NULL,
        axis.title = element_text(size = 16),
        axis.text = element_text(size = 14),
        plot.subtitle = element_text(size = rel(1.5)))

grid.arrange(News_topic_sc_plot, Twitter_topic_sc_plot)

Using the STM models with the same number of topics (25), we calculated the sentiment score of each topic. Again, the general opinion in most topics was negative. The sentiments were more pronounced on Twitter, while in news articles most topics had scores closer to neutral. Topic 7 on Twitter was extremely negative and consisted mostly of sentiment-laden words (hell, yall, shit…). This is one downside of unsupervised topic modelling: it grouped these words together in one topic, but the result was not useful for topic discovery. Apart from that topic, news and Twitter each had a few notably negative topics: topic 22 (political, pence, tweet, blame, party, attack…) and topic 23 (police, prison, court, law, jail…) in news, and topic 12 (die, flu, trump…) on Twitter. Twitter users’ concerns seemed to center on President Trump’s behavior, while the news media gravitated towards the conflict between political parties and worries about social security.

Among the positive topics, there was a common theme in both sources. Topic 13 in the news (love, church, moment, wife, mother…) and Topics 18 (family, friend, pray…) and 15 (protect, god, love, jesus…) on Twitter shared the theme of faith and belief, though the positive scores were mild.

Conclusion

This project studies the topic coverage and sentiment dynamics of COVID-19, a sensitive topic, in news media and on Twitter. Although the project has revealed different characteristics of news articles and Twitter, it only analyzes the text and the time variable. Much of the metadata was removed from the raw data, which might leave out important aspects that could be explored further. Other choices that might introduce research bias are the data samples, which cover only a subset of keywords, and the dictionaries used for pre-processing and sentiment analysis. Lastly, the time frame of data collection is only three months, ending in the middle of April when the pandemic was still escalating in the US. It would be better to collect the data and conduct the analysis after the pandemic is over, which would give a full overview of the situation.

---
title: Text mining of mainstream news and Twitter about Coronavirus outbreak in the USA.
author: Phong Viet Nguyen
date: "2020-12-05"
output: 
    html_notebook:
      toc: true
---

# Introduction

In this project, we look at the coverage of the coronavirus in mainstream news articles and on Twitter in the USA. The main goal is to inspect the difference in the focus of each media outlet (topic modelling) and their expressed attitudes (sentiment analysis) on COVID-19.

The project follows these steps: data collection, data cleaning and pre-processing, topic modelling, and sentiment analysis. The data was gathered using web-scraping techniques in R and Python. After data cleaning, the *Quanteda* package in R was used to prepare the text for analysis. Then, we applied topic modelling to identify distinct topics. For sentiment analysis, we chose a lexicon, calculated sentiment scores and looked at the shift in sentiment over time.

The results reveal some disparities between the media sources. Social media, namely Twitter, reflects more of people's sentiments and quick updates on the situation, while news articles focus more on macro-level problems. The sentiment in both media remained negative throughout the observed period.

```{r Packages, echo=FALSE, message=FALSE, warning=FALSE}
library(tidyverse) # data manipulation
library(quanteda) # text pre-processing
library(quanteda.dictionaries)
library(tidytext) # text analysis
library(lubridate) # date function
library(gridExtra) # multiple graphs
library(stm) # topic modelling
library(stmCorrViz)
library(furrr)
library(tm)
library(scales)
library(ggthemes)
library(radarchart)
```

# Data Collection and Cleaning

## News Articles

Using the keyword *coronavirus* for querying, we collected news articles published on news websites in the United States from the beginning of 2020, when the first cases of coronavirus were reported, until April 19, 2020. For sources, we looked at the top 15 U.S. news websites measured by unique monthly visitors (Statista, 2020), excluding news aggregators (Yahoo News, Google News) and topic-specific newspapers (The Wall Street Journal). We used a third-party API called *Currents API* in R to get all the available URLs, then passed them to the newspaper3k module in Python to scrape the full article text. After checking the data quality from each source, we narrowed the list down from 15 to 6 news outlets: CNN, The New York Times, Fox News, The Guardian, USA Today and The LA Times.

```{r, message=FALSE}
# Load the scraped data
news_data <- read_csv("news_coronavirus.csv")

# Add week and remove unnecessary column
news_data <- news_data %>% 
  mutate(week = isoweek(publish_date)) %>% 
  select(-title)

# Preview the data
glimpse(news_data)
```

## Twitter

Thanks to the support of the team at <crowdbreaks.org>, which tracks Twitter trends on COVID-19 in real time, we obtained all tweet IDs that they had collected in the US since January 2020 with the keywords “wuhan”, “ncov”, “coronavirus”, “covid” and “sars-cov-2”. We took a 50% sample of the data, stratified by week, which keeps the weekly distribution of tweets while speeding up data collection enough to obtain the data in time for the report. The Twitter data were collected until April 16, 2020.

```{r, message=FALSE}
# Load the scraped data
tweets_data <- read_csv("twitter_coronavirus.csv")

# Transform the date column
tweets_data <- tweets_data %>% 
  separate(created_at, c("wday", "month", "day", "time", "plus", "year"), 
           sep = " ") %>% 
  mutate(date = paste(month, day, year, sep = " ")) %>% 
  select(c("text", "date"))

tweets_data$date <- as.Date(tweets_data$date, format = "%b %d %Y")
tweets_data <- tweets_data %>% 
  mutate(week = isoweek(date)) %>% 
  filter(week > 2)

# Clean hashtags, links and special characters
tweets_data$text <- tweets_data$text %>% 
  str_replace_all("#", " ") %>%
  str_remove_all("(?<=^|\\s)http[^\\s]+") %>% 
  str_remove_all("[^a-zA-Z0-9 ]") %>% 
  trimws()

# Remove tweets with fewer than two words left after cleaning
tweets_data <- tweets_data %>% 
  filter(str_count(text, pattern = boundary("word")) > 1)
```

## Data Overview

For further data cleaning, we removed duplicates and texts that were too short, and kept only texts written in English. We kept the time variable in week format and the full text of each document for analysis. Consequently, 14,149 news articles and 109,329 tweets were ready for analysis.

The data distribution can be seen below. The number of documents on COVID-19 in the U.S. was relatively low in the first six weeks (week 3 to week 8). However, it increased dramatically from week 9 and peaked in week 11 for Twitter and in week 13 for news. Week 13 corresponds to the end of March, when the number of infected cases in the country surged.

```{r Data Count, echo=FALSE, message=FALSE}
plot_news <- news_data %>%
  group_by(week) %>%
  summarise(n = n()) %>% 
  ungroup() %>% 
  mutate(source = "News")

plot_tweets <- tweets_data %>%
  group_by(week) %>%
  summarise(n = n()) %>% 
  ungroup() %>% 
  mutate(source = "Twitter")
```

```{r Data Distribution, echo=FALSE}

news_distribution <- plot_news %>%
  ggplot(aes(x = week)) + 
  geom_col(aes(y = n, fill = -n), show.legend = FALSE) +
  scale_x_continuous(n.breaks = 10) +
  labs(x = "week", y = "number of documents", subtitle = "News") +
  theme(
    axis.title = element_text(size = 14),
    axis.text.x = element_text(size = 14, hjust = .5, vjust = .5),
    axis.text.y = element_text(size = 14),
    panel.border = element_rect("lightgray", fill = NA),
    plot.subtitle = element_text(hjust = 0.5, size = rel(1.8)))

tweets_distribution <- plot_tweets %>%
  ggplot(aes(x = week)) + 
  geom_col(aes(y = n, fill = -n), show.legend = FALSE) +
  scale_x_continuous(n.breaks = 10) +
  labs(x = "week", y = NULL, subtitle = "Twitter") +
  theme(
    axis.title = element_text(size = 14),
    axis.text.x = element_text(size = 14, hjust = .5, vjust = .5),
    axis.text.y = element_text(size = 14),
    panel.border = element_rect("lightgray", fill = NA),
    plot.subtitle = element_text(hjust = 0.5, size = rel(1.8)))

grid.arrange(news_distribution, tweets_distribution, ncol = 2)
```

# Pre-processing

For text data preparation, we first created a corpus for the data set of each source. A corpus consists of a collection of documents (each document is an article or a tweet) and the document variables that describe each document, for instance its publication date and source. Subsequently, we tokenized each corpus, that is, separated the text into single words (also called terms or tokens). A summary of the corpus for news articles can be seen below.

```{r Create a corpus}
corpus_News_data <- corpus(news_data, text_field = c("text"), unique_docnames = F) %>%
  `docnames<-`(news_data$week) # update the name of document in the corpus
summary(corpus_News_data, 5)
head(corpus_News_data, 5) # take a look at the corpus

corpus_Twitter_data <- corpus(tweets_data, text_field = c("text"), unique_docnames = F) %>%
  `docnames<-`(tweets_data$week)
```

```{r Stop-words and Base-formed words}
stopwords_extended <- readLines("stopwords_en.txt", encoding = "UTF-8")
lemma_data <- read.csv("baseform_en.tsv", encoding = "UTF-8")
```

We pre-processed the data sets by converting upper-case letters to lower case and removing numbers, punctuation, symbols, URLs and separators. The main idea is to reduce the number of distinct terms extracted, which helps to improve the accuracy of both topic modeling and sentiment analysis. If two words are similar, it is convenient to combine them into one unique word. Moreover, if a word is not relevant for the analysis, it can be removed. Hence, we applied stop-word removal and lemmatization. Stop-words are words that appear in texts but do not add substantial meaning (e.g., "the", "a", or "for"), and lemmatization deals with the inflected forms of words by replacing them with their base forms. The total number of tokens and the number of unique tokens in each source decreased by 40 – 70% after the above steps:

```{r Create tokens}
Token_news_data <- corpus_News_data %>% 
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE, 
         remove_url = TRUE, remove_separators = TRUE) %>% 
  tokens_tolower(keep_acronyms = FALSE) %>%
  tokens_remove(pattern = c("amid", "updates", "live", "video", 
                            "didn", "briefing")) %>%
  tokens_replace(lemma_data$inflected_form, lemma_data$lemma, 
                 valuetype = "glob") %>%
  tokens_replace(pattern = c("covid-19", "ncov", "cov"), 
                 c("coronavirus", "coronavirus", "coronavirus"),
                 valuetype = "fixed") %>% 
  tokens_remove(pattern = stopwords_extended, padding = F)

Token_Twitter_data <- corpus_Twitter_data %>% 
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE,
         remove_url = TRUE, remove_separators = TRUE) %>% 
  tokens_tolower(keep_acronyms = FALSE) %>%
  tokens_remove(pattern = c("people", "covid19", "covid", "rt", "dont")) %>%
  tokens_replace(lemma_data$inflected_form, lemma_data$lemma, 
                 valuetype = "glob") %>% 
  tokens_remove(pattern = stopwords_extended, padding = T)
```


```{r Number of tokens, echo=FALSE, message=FALSE}
tibble(
  Source = c("News", "Twitter"), 
  Corpus = c(14149, 352842), 
  Number_of_tokens = c("3938639 (-59.6%)", "5729360 (-68.7%)"),
  Unique_tokens = c("2534518 (-42.5%)", "5065136 (-62.9%)")
)
```
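
The effect of these steps on the token count can be illustrated on a toy example. The following base-R sketch only mimics the idea of the quanteda pipeline above: the function `preprocess()`, the two sentences, the tiny stop-word list and the lemma lookup are all hypothetical and much smaller than the real resources.

```r
# Made-up documents and resources (the real lists are far larger)
docs <- c("The viruses are spreading in 3 cities.",
          "A city reported the first cases of the virus!")
stopwords_toy <- c("the", "a", "are", "in", "of")
lemma_toy <- c(viruses = "virus", cities = "city", cases = "case",
               spreading = "spread", reported = "report")

# Hypothetical helper: lower-case, strip non-letters, lemmatize, drop stop-words
preprocess <- function(text, lemmas, stopwords) {
  text <- tolower(text)
  text <- gsub("[^a-z ]", " ", text)            # numbers, punctuation, symbols
  tokens <- strsplit(trimws(text), " +")[[1]]   # split into single words
  hit <- tokens %in% names(lemmas)
  tokens[hit] <- unname(lemmas[tokens[hit]])    # replace inflected forms
  tokens[!tokens %in% stopwords]                # remove stop-words
}

tokens_clean <- lapply(docs, preprocess,
                       lemmas = lemma_toy, stopwords = stopwords_toy)

# Compare token counts before and after pre-processing
n_raw   <- sum(lengths(strsplit(docs, " +")))
n_clean <- sum(lengths(tokens_clean))
n_types <- length(unique(unlist(tokens_clean)))
c(raw = n_raw, clean = n_clean, unique_clean = n_types)
```

In this toy case "viruses" and "virus", as well as "cities" and "city", collapse into one term each, so both the total and the unique token counts shrink.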

In the next step, we generated a document-feature matrix (also known as a document-term matrix) for each source. It records how frequently each term (token) occurs in each document of the corpus. We kept only the top 5% most frequent features (minimum term frequency set at the 0.95 quantile) that appear in no more than 10% of all documents (maximum document frequency set at 0.1), in order to focus on common but distinctive features. We also dropped the documents that had no tokens left after trimming. The dimensions of the two document-feature matrices are as follows:

```{r Keep top features}
News_DFM <- Token_news_data %>% 
  tokens_remove("") %>%
  dfm() %>% 
  dfm_trim(min_termfreq = 0.95, termfreq_type = "quantile",
           max_docfreq = 0.1, docfreq_type = "prop")
# Drop documents that have no tokens left after trimming
News_DFM <- News_DFM[ntoken(News_DFM) > 0, ]
#head(News_DFM, n = 5, nf = 8)

Twitter_DFM <- Token_Twitter_data %>% 
  tokens_remove("") %>%
  dfm() %>%
  dfm_trim(min_termfreq = 0.95, termfreq_type = "quantile", 
           max_docfreq = 0.1, docfreq_type = "prop")
# Drop documents that have no tokens left after trimming
Twitter_DFM <- Twitter_DFM[ntoken(Twitter_DFM) > 0,]
```

```{r Document-feature matrix dimension, echo=FALSE, message=FALSE}
tibble(
  Source = c("News", "Twitter"), 
  "Number_of_documents (rows)" = c(14149, 351603),
  "Number_of_features (columns)" = c(3492, 2846)
)
```
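
The trimming rule can be made concrete with a small base-R sketch. It mirrors the idea behind `dfm_trim()` on an ordinary count matrix and is not quanteda's own implementation: the helper `trim_dtm()` and the toy matrix are hypothetical, the quantile type is an assumption, and the thresholds are loosened (0.5 and 0.6 instead of 0.95 and 0.1) so that the effect is visible on five documents.

```r
# Hypothetical helper: keep frequent terms that do not occur in too many documents
trim_dtm <- function(dtm, min_tf_quantile, max_df_prop) {
  term_freq <- colSums(dtm)       # total count of each term
  doc_freq  <- colSums(dtm > 0)   # number of documents containing each term
  # Assumption: inverse-ECDF quantile (type 1) as the term-frequency cutoff
  tf_cutoff <- quantile(term_freq, probs = min_tf_quantile,
                        names = FALSE, type = 1)
  keep <- term_freq >= tf_cutoff & doc_freq <= max_df_prop * nrow(dtm)
  trimmed <- dtm[, keep, drop = FALSE]
  trimmed[rowSums(trimmed) > 0, , drop = FALSE]  # drop empty documents
}

# Toy counts: 5 documents (rows) x 4 terms (columns), made up for illustration
toy_dtm <- matrix(
  c(3, 0, 1, 0,
    2, 0, 1, 0,
    4, 5, 0, 0,
    1, 0, 0, 1,
    2, 4, 0, 0),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("doc", 1:5), c("virus", "mask", "rare", "once")))

toy_trimmed <- trim_dtm(toy_dtm, min_tf_quantile = 0.5, max_df_prop = 0.6)
toy_trimmed
```

Here "virus" is removed because it occurs in every document, "once" because it is too infrequent, and `doc4` because nothing is left in it after trimming.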

For better interpretation, the first 3 documents and 10 features of the document-feature matrix of news articles can be seen below.

```{r DFM preview}
News_DFM[1:3, 1:10]
```

# Topic Modelling

We discovered topics distributed in news articles and tweets on a weekly basis over the three-month period and compared the topics between the two media outlets. For topic modelling we used Structural Topic Modeling (STM), since it incorporates metadata (information about each document) into the topic modelling framework, so that we can discover topics and estimate their relationship to the documents' metadata.

STM is an unsupervised machine learning model for which the number of topics K has to be defined in advance. As with k-means clustering, we do not know beforehand how many topics are suitable for a given corpus. Therefore, we trained a group of topic models with different values of K and then evaluated how many topics were appropriate. Here we proceeded with **K = 25** for each of the media forms. The evaluation method is described in [this post](https://juliasilge.com/blog/evaluating-stm/) by *Julia Silge*.

```{r Convert dfm to stm model}
News_stm <- convert(News_DFM, to = "stm")
Twitter_stm <- convert(Twitter_DFM, to = "stm")
```

```{r Run topic modelling, message=FALSE, results='hide'}
News_topic_model <- stm(
  documents = News_stm$documents, 
  vocab = News_stm$vocab,
  K = 25,
  prevalence =~ s(week),
  data = News_stm$meta,
  init.type = "Spectral",
  seed = 123456)

Twitter_topic_model <- stm(
  documents = Twitter_stm$documents, 
  vocab = Twitter_stm$vocab,
  K = 25,
  prevalence =~ s(week),
  data = Twitter_stm$meta,
  init.type = "Spectral",
  seed = 123456)
```

The plots below show the top 10 topics in each corpus (News, Twitter), together with the 7 words that have the highest probability in each topic. A closer look into each source showed us some interesting patterns.

```{r Plot news topic, warning=FALSE, message=FALSE}
News_beta <- tidy(News_topic_model)
News_gamma <- tidy(News_topic_model, matrix = "gamma")

News_top_terms <- News_beta %>%
  arrange(beta) %>%
  group_by(topic) %>%
  top_n(7, beta) %>%
  arrange(-beta) %>%
  select(topic, term) %>%
  summarise(terms = list(term)) %>%
  mutate(terms = map(terms, paste, collapse = ", ")) %>% 
  unnest(terms)

News_gamma_terms <- News_gamma %>%
  group_by(topic) %>%
  summarise(gamma = mean(gamma)) %>%
  arrange(desc(gamma)) %>%
  left_join(News_top_terms, by = "topic") %>%
  mutate(topic = paste0("Topic ", topic),
         topic = reorder(topic, gamma))

News_gamma_terms %>%
  top_n(10, gamma) %>%
  ggplot(aes(topic, gamma, label = terms, fill = topic)) +
  geom_col(show.legend = FALSE) +
  geom_text(hjust = 0.8, nudge_y = 0.0005, size = 3,
            family = "IBMPlexSans") +
  coord_flip() +
  scale_y_continuous(expand = c(0,0),
                     limits = c(0, 0.09),
                     labels = percent_format()) +
  theme_tufte(base_family = "IBMPlexSans", ticks = FALSE) +
  theme(plot.title = element_text(size = 16,
                                  family="IBMPlexSans-Bold"),
        plot.subtitle = element_text(size = 13)) +
  labs(x = NULL, y = expression(gamma),
       title = "Top 10 topics by popularity on News Articles",
       subtitle = "With the top words that contribute to each topic")
```

```{r Plot twitter topic, warning=FALSE, message=FALSE}
Twitter_beta <- tidy(Twitter_topic_model)
Twitter_gamma <- tidy(Twitter_topic_model, matrix = "gamma")

Twitter_top_terms <- Twitter_beta %>%
  arrange(beta) %>%
  group_by(topic) %>%
  top_n(7, beta) %>%
  arrange(-beta) %>%
  select(topic, term) %>%
  summarise(terms = list(term)) %>%
  mutate(terms = map(terms, paste, collapse = ", ")) %>% 
  unnest(terms)

Twitter_gamma_terms <- Twitter_gamma %>%
  group_by(topic) %>%
  summarise(gamma = mean(gamma)) %>%
  arrange(desc(gamma)) %>%
  left_join(Twitter_top_terms, by = "topic") %>%
  mutate(topic = paste0("Topic ", topic),
         topic = reorder(topic, gamma))

Twitter_gamma_terms %>%
  top_n(10, gamma) %>%
  ggplot(aes(topic, gamma, label = terms, fill = topic)) +
  geom_col(show.legend = FALSE) +
  geom_text(hjust = 0.85, nudge_y = 0.0005, size = 3,
            family = "IBMPlexSans") +
  coord_flip() +
  scale_y_continuous(expand = c(0,0),
                     limits = c(0, 0.09),
                     labels = percent_format()) +
  theme_tufte(base_family = "IBMPlexSans", ticks = FALSE) +
  theme(plot.title = element_text(size = 16,
                                  family="IBMPlexSans-Bold"),
        plot.subtitle = element_text(size = 13)) +
  labs(x = NULL, y = expression(gamma),
       title = "Top 10 topics by popularity on Twitter",
       subtitle = "With the top words that contribute to each topic")
```

The key focuses on Twitter are “the spread of the virus” (Topic 11), “negative reactions” (Topics 7, 23) and “updates on pandemic information” (Topics 13, 4). In contrast, the News topics show a rather different pattern, which is more related to “political matters” (Topic 22), “the situation in other countries” (Topic 19), “the economic situation” (Topic 4) and “the healthcare system” (Topics 20, 21). Messages about family are popular in both sources (News Topic 13, Twitter Topic 18). Furthermore, politicians are well covered, with Trump on Twitter and Pence and Cuomo in the mainstream papers.

# Sentiment Analysis

## Corpus sentiment analysis

To investigate sentiments in news articles and tweets, we used the tokens extracted from the documents to calculate sentiment scores with the AFINN lexicon developed by Finn Årup Nielsen, which is accessible through the tidytext package. The AFINN lexicon is a list of English terms manually rated for valence with an integer between -5 (negative) and +5 (positive). It comprises 878 positive terms and 1,598 negative terms. For each week, the sentiment score is the difference between the sum of the positive scores and the absolute sum of the negative scores, expressed as a percentage of their total.
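
As a worked illustration of this score, the sketch below computes the polarity for one hypothetical week in base R. The helper `polarity()` and the numbers are made up; each value stands for the frequency of a word multiplied by its AFINN rating.

```r
# Hypothetical helper: polarity in percent, ranging from -100 to +100
polarity <- function(scores) {
  positive <- sum(scores[scores > 0])         # sum of positive word scores
  negative <- abs(sum(scores[scores <= 0]))   # absolute sum of negative word scores
  (positive - negative) / (positive + negative) * 100
}

# Made-up word scores (frequency * lexicon value) for one week
week_scores <- c(6, 4, 20, -15, -25, -30)
week_polarity <- polarity(week_scores)
week_polarity
```

With a positive sum of 30 and a negative sum of 70, the polarity of this toy week is -40, i.e. clearly negative. A value of 0 would mean that positive and negative scores balance each other.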

```{r Preparation, warning=FALSE, message=FALSE}
# Make a dataframe with the AFINN dictionary
afinn <- get_sentiments("afinn")

News_ext <- convert(News_DFM, to = "tripletlist")
News_df <- data.frame(doc = News_ext$document,
                        word = News_ext$feature,
                        freq = News_ext$frequency)
News_df$word <- as.character(News_df$word)

News_afinn <- News_df %>% 
  inner_join(afinn, by = "word") %>% 
  mutate(score = freq * value) %>% 
  separate(doc, into = c("week", "no"), sep = "\\.") %>% 
  mutate(sentiment = ifelse(score > 0, "positive", "negative")) %>% 
  group_by(week, sentiment) %>% 
  summarise(sentiment_score = sum(score)) %>% 
  ungroup() %>% 
  mutate(source = "News")

Twitter_ext <- convert(Twitter_DFM, to = "tripletlist")
Twitter_df <- data.frame(doc = Twitter_ext$document,
                      word = Twitter_ext$feature,
                      freq = Twitter_ext$frequency)
Twitter_df$word <- as.character(Twitter_df$word)

Twitter_afinn <- Twitter_df %>% 
  inner_join(afinn, by = "word") %>% 
  mutate(score = freq * value) %>% 
  separate(doc, into = c("week", "no"), sep = "\\.") %>% 
  mutate(sentiment = ifelse(score > 0, "positive", "negative")) %>% 
  group_by(week, sentiment) %>% 
  summarise(sentiment_score = sum(score)) %>% 
  ungroup() %>% 
  mutate(source = "Twitter")

# Create a dataframe to plot sentiment
News_afinn_plot <- News_afinn %>% 
  pivot_wider(names_from = sentiment, values_from = sentiment_score) %>% 
  mutate(sum = positive + (-1)*negative, 
         pos_percent = positive/sum*100,
         neg_percent = (-1)*negative/sum*100, 
         polarity = round((pos_percent - neg_percent),3)) %>%
  select(week, polarity, source)

Twitter_afinn_plot <- Twitter_afinn %>% 
  pivot_wider(names_from = sentiment, values_from = sentiment_score) %>% 
  mutate(sum = positive + (-1)*negative, 
         pos_percent = positive/sum*100,
         neg_percent = (-1)*negative/sum*100, 
         polarity = round((pos_percent - neg_percent),3)) %>%
  select(week, polarity, source)

# Extract the beta matrix as a data frame and use the AFINN lexicon
# to attach a sentiment value to each word
News_beta <- tidy(News_topic_model, matrix = "beta")
News_topic_sc <- inner_join(News_beta, afinn, by = c("term" = "word")) %>% 
  mutate(score = beta * value) %>% 
  group_by(topic) %>% 
  summarise(sentiment_score = sum(score))
News_topic_sc$topic <- factor(News_topic_sc$topic, levels = c(1:25))

Twitter_beta <- tidy(Twitter_topic_model, matrix = "beta")
Twitter_topic_sc <- inner_join(Twitter_beta, afinn, by = c("term" = "word")) %>%  
  mutate(score = beta * value) %>% 
  group_by(topic) %>% 
  summarise(sentiment_score = sum(score))
Twitter_topic_sc$topic <- factor(Twitter_topic_sc$topic, levels = c(1:25))
```

```{r Plot AFINN sentiment analysis, warning=FALSE, message=FALSE}
afinn_chart_all <- rbind(News_afinn_plot, Twitter_afinn_plot)
afinn_chart_all$week <- as.numeric(as.character(afinn_chart_all$week))
afinn_chart_all %>% 
  arrange(week) %>% 
  ggplot(aes(x = week, y = polarity, color = source)) + 
  geom_point() +
  geom_path() +
  geom_hline(yintercept = 0, color = "gold3") +
  geom_vline(xintercept = 12, color = "coral2") +
  labs(x = "Week", y = "Sentiment Score") + 
  scale_x_continuous(  # label every observed week on the x-axis
    breaks = unique(afinn_chart_all$week)) +
  scale_y_continuous(n.breaks = 10) +
  scale_colour_manual(values = c("#E3211C", "#1F78B4")) +
  theme(panel.grid = element_blank(),
        panel.background = NULL,
        axis.title = element_text(size = 16),
        axis.text = element_text(size = 16),
        legend.title = element_text(size = 16),
        legend.text = element_text(size = 16),
        legend.position = "bottom")

```

Overall, the opinion towards COVID-19 remained negative throughout the period. While news was a bit more positive than Twitter, there was a similar pattern as the sentiment scores were highly negative in week 3 but showed a positive trend in the subsequent weeks. At week 12 when the situation got worse in the US, the sentiment score of news articles went down then picked up again at week 15, whereas Twitter showed a consistent sentiment.

To examine the sentiments from several perspectives, we also made use of the NRC lexicon from Saif Mohammad and Peter Turney. The NRC lexicon categorizes words in a binary fashion (“yes”/“no”) into the categories positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. We filtered out the positive and negative categories, which left us with 4,463 distinct words in the lexicon (note: one word can be in multiple sentiment categories). We counted the number of words in each category and computed their percentage distribution in each of our sources. The result is presented in a radar chart:

```{r Plot NRC sentiment analysis, warning=FALSE, message=FALSE}
nrc <- get_sentiments("nrc") %>% 
  filter(!sentiment %in%  c("positive", "negative"))

# News join and calculate score
News_nrc <- News_df %>% 
  inner_join(nrc, by = "word") %>% 
  separate(doc, into = c("week", "no"), sep = "\\.") %>% 
  group_by(week, sentiment) %>% 
  summarise(count = sum(freq))
News_nrc_sum <- News_nrc %>% 
  group_by(sentiment) %>% 
  summarise(total_count = sum(count)) %>% 
  ungroup()
News_nrc_sum <- News_nrc_sum %>% 
  mutate(percent = total_count/ sum(News_nrc_sum$total_count),
         source = "News")

# Twitter join and calculate score
Twitter_nrc <- Twitter_df %>% 
  inner_join(nrc, by = "word") %>% 
  separate(doc, into = c("week", "no"), sep = "\\.") %>% 
  group_by(week, sentiment) %>% 
  summarise(count = sum(freq))
Twitter_nrc_sum <- Twitter_nrc %>% 
  group_by(sentiment) %>% 
  summarise(total_count = sum(count)) %>% 
  ungroup()
Twitter_nrc_sum <- Twitter_nrc_sum %>% 
  mutate(percent = total_count/ sum(Twitter_nrc_sum$total_count),
         source = "Twitter")

# Combine data and plot radar chart
nrc_chart_all <- rbind(News_nrc_sum, Twitter_nrc_sum)
nrc_chart_all %>% 
  select(-total_count) %>% 
  pivot_wider(names_from = source, values_from = percent) %>% 
  chartJSRadar(width = 8,
               height = 5,
               showToolTipLabel = FALSE,
               colMatrix = grDevices::col2rgb(c("red", "blue", "green")),
               labelSize = 18)
```

Overall, news articles showed the most positivity by coming out on top for “joy” and “trust”, but they also had more “anger” words. Twitter’s posts were more negative, as they expressed more “disgust”, “sadness”, and to some extent “surprise”. This is consistent with the outcome of the AFINN lexicon analysis above.

## Topic-based sentiment analysis

On the topic-based level, we assigned each topic word its sentiment value from the lexicon and used the probability of the word in the topic, as given by the STM topic model, as the term weight. As a result, the topic sentiment score is calculated as

$$Sentiment\ Score\ of\ Topic\ A = \sum_{i=1}^{n} (prob(word_i|Topic_A)*sentiment\ value(word_i))$$

An overall topic sentiment score is computed by multiplying the sentiment value by the probability of words and summing the products in a chosen topic. Multiple topics could have the same words with different probability values. Therefore, the sentiment score of a topic would be distinguished from others even if the same set of words appear in different topics.
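
A minimal base-R sketch of this calculation is shown below. The word probabilities and lexicon values are hypothetical and only serve to show how two topics with the same words can end up with different scores. The chunk above does the same with the real beta matrix and the AFINN lexicon.

```r
# Made-up word probabilities for two topics that share the same words
toy_beta <- data.frame(
  topic = c("A", "A", "A", "B", "B", "B"),
  term  = c("love", "die", "virus", "love", "die", "virus"),
  beta  = c(0.5, 0.1, 0.4, 0.1, 0.5, 0.4))

# Made-up lexicon: "virus" has no sentiment value
toy_lexicon <- data.frame(word = c("love", "die"), value = c(3, -3))

# Inner join keeps only words found in the lexicon
scored <- merge(toy_beta, toy_lexicon, by.x = "term", by.y = "word")
scored$score <- scored$beta * scored$value   # probability * sentiment value

# Sum the weighted values within each topic
toy_topic_scores <- aggregate(score ~ topic, data = scored, FUN = sum)
toy_topic_scores
```

Topic A puts most of its probability on "love" and gets a score of about 1.2, whereas topic B puts it on "die" and gets about -1.2, although both topics contain exactly the same words.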

```{r Topic-based sentiment analysis, warning=FALSE, message=FALSE}

# News
News_topic_sc_plot <- News_topic_sc %>% 
  ggplot(aes(x = topic, y = sentiment_score)) + 
  geom_col(fill = "#E3211C") +
  geom_hline(yintercept = 0, color = "gold3") +
  labs(x = NULL, y = NULL, subtitle = "News") + 
  scale_y_continuous(limits = c(-1, 0.2), n.breaks = 8) +
  theme(panel.grid = element_blank(),
        panel.background = NULL,
        axis.title = element_text(size = 16),
        axis.text = element_text(size = 14),
        plot.subtitle = element_text(size = rel(1.5)))

# Twitter
Twitter_topic_sc_plot <- Twitter_topic_sc %>% 
  ggplot(aes(x = topic, y = sentiment_score)) + 
  geom_col(fill = "#1F78B4") +
  geom_hline(yintercept = 0, color = "gold3") +
  labs(x = "Topic", y = NULL, subtitle = "Twitter") + 
  scale_y_continuous(limits = c(-1, 0.2), n.breaks = 8) +
  theme(panel.grid = element_blank(),
        panel.background = NULL,
        axis.title = element_text(size = 16),
        axis.text = element_text(size = 14),
        plot.subtitle = element_text(size = rel(1.5)))

grid.arrange(News_topic_sc_plot, Twitter_topic_sc_plot)
```

Using the STM models with 25 topics each, we calculated the sentiment score of each topic. Again, the general opinion in most topics was negative. The sentiments were stronger on Twitter, while most topics in news articles had scores closer to neutral. Topic 7 on Twitter was extremely negative and mostly consisted of sentimental words (hell, yall, shit…). This is one downside of unsupervised topic modelling: it grouped these sentimental words together in one topic, but the result was not useful for topic discovery. Apart from that topic, news and Twitter had a few notably negative topics, namely topic 22 (political, pence, tweet, blame, party, attack…) and topic 23 (police, prison, court, law, jail…) in news and topic 12 (die, flu, trump…) on Twitter. Twitter users' concerns seemed to center on President Trump’s behavior, while the news media gravitated towards the conflict between political parties and worries about social security.

Among the positive topics, there was a common theme in both sources. Topic 13 in the news (love, church, moment, wife, mother…) and Topics 18 (family, friend, pray…) and 15 (protect, god, love, jesus...) on Twitter shared the theme of faith and belief, though the positive scores were mild.

# Conclusion

This project studies the topic coverage and sentiment dynamics of the sensitive topic COVID-19 in news media and on Twitter. Although the project has revealed different characteristics of news articles and Twitter, it only analyses the text and the time variable. Much of the metadata was removed from the raw data, which might leave out some important aspects that could be elaborated further. Other choices that might introduce research bias are the data samples, which only cover a subset of keywords, and the dictionaries used for pre-processing and sentiment analysis. Lastly, the time frame of data collection is only three months, ending in the middle of April when the pandemic was still escalating in the US. Collecting the data and conducting the analysis after the pandemic is over would give a fuller overview of the situation.