Introduction
In this project, we look at how the coronavirus was covered in
mainstream news articles and on Twitter in the USA. The main goal is to
compare what each medium focused on (topic modelling) and the attitudes
it expressed towards COVID-19 (sentiment analysis).
The project was conducted in the following steps: data collection,
data cleaning and pre-processing, topic modelling, and sentiment
analysis. The data was gathered using web-scraping techniques in R and
Python. After data cleaning, the quanteda package in R was used to
prepare the text for analysis. Then, we applied topic modelling to
identify distinct topics. For sentiment analysis, we chose a lexicon,
calculated sentiment scores and looked at how sentiment shifted over
time.
The results revealed some disparities between the two media. Social
media, namely Twitter, reflected more of people's sentiments and quick
updates on the situation, while news articles were more concerned with
macro-level problems. Sentiment in both media remained negative
throughout the observed period.
Data Collection and Cleaning
News Articles
Using the keyword coronavirus as the query, we collected news articles
published on news websites in the United States from the beginning of
2020, when the first cases of coronavirus were reported, until April 19,
2020. For sources, we looked at the top 15 U.S. news websites measured
by unique monthly visitors (Statista, 2020), excluding news aggregators
(Yahoo News, Google News) and topic-specific newspapers (The Wall Street
Journal). We used a third-party API called Currents API in R to get all
the available URLs, then passed them to the newspaper3k module in
Python, which scraped the full text of each article. After checking the
data quality of each source, we narrowed the list down from 15 to 6 news
outlets: CNN, The New York Times, Fox News, The Guardian, USA Today and
The LA Times.
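# Load the packages used in this step
library(tidyverse)  # read_csv() and the dplyr verbs
library(lubridate)  # isoweek()
# Load the scraped data
news_data <- read_csv("news_coronavirus.csv")
# Add the ISO week number and drop the unneeded title column
news_data <- news_data %>%
  mutate(week = isoweek(publish_date)) %>%
  select(-title)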
# Preview the data
glimpse(news_data)
Rows: 14,149
Columns: 4
$ publish_date <date> 2020-04-19, 2020-04-19, 2020-04-19, 2020-04-19, 202…
$ source <chr> "CNN", "CNN", "CNN", "CNN", "CNN", "CNN", "CNN", "CN…
$ text <chr> "(CNN) Nothing can bring back the months of wedding …
$ week <dbl> 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, …
Data Overview
For further data cleaning, we removed duplicated records and texts that
were too short, and kept only texts written in English. We kept the time
variable in week format and the full text of each document for analysis.
As a result, 14,149 news articles and 109,329 tweets were ready for
analysis.
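The de-duplication and length filter described above can be sketched in base R as follows. This is only an illustration on made-up data: the `raw` data frame and the `min_words` threshold are assumptions, not the values used in the project, and the English-only filter is left out because it needs a separate language-detection package.

```r
# Toy data standing in for the scraped articles (hypothetical values)
raw <- data.frame(
  publish_date = as.Date(c("2020-03-01", "2020-03-01",
                           "2020-03-09", "2020-04-19")),
  text = c("Officials confirmed new coronavirus cases in the state today.",
           "Officials confirmed new coronavirus cases in the state today.",
           "Live updates",
           "Hospitals prepare for a surge of coronavirus patients this week."),
  stringsAsFactors = FALSE
)

min_words <- 5  # assumed cut-off for "too short"

# 1. Drop exact duplicate texts
dedup <- raw[!duplicated(raw$text), ]

# 2. Drop texts with fewer than min_words whitespace-separated words
n_words <- lengths(strsplit(trimws(dedup$text), "\\s+"))
clean <- dedup[n_words >= min_words, ]

# 3. Keep the ISO 8601 week number as the time variable
clean$week <- as.integer(format(clean$publish_date, "%V"))

clean
```

Only the first and the last article survive here: the second row is a duplicate and the third has two words.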
The data distribution can be seen below. The number of news articles
on COVID-19 in the U.S. was relatively low in the first 6 weeks (week 3
to week 8). From week 9 onwards, however, the volume increased
dramatically, peaking in week 11 for Twitter and in week 13 for news.
The later peak falls at the end of March, when the number of infected
cases in the country surged.

Pre-processing
To prepare the text data, we first created a corpus for the data set
of each source. A corpus consists of a collection of documents (each
document is an article or a tweet) and the document variables that
describe each document, for instance its publication date and source.
We then tokenized each corpus, which splits the text into single words
(also called terms or tokens). A summary of the corpus for news articles
can be seen below.
library(quanteda)
# Build the corpus and name each document after its week number
corpus_News_data <- corpus(news_data, text_field = "text", unique_docnames = FALSE) %>%
  `docnames<-`(news_data$week)
summary(corpus_News_data, 5)
Corpus consisting of 14149 documents, showing 5 documents:
Text Types Tokens Sentences publish_date source week
16.1 134 190 6 2020-04-19 CNN 16
16.2 92 155 7 2020-04-19 CNN 16
16.3 127 230 8 2020-04-19 CNN 16
16.4 113 192 5 2020-04-19 CNN 16
16.5 581 1509 54 2020-04-19 CNN 16
head(corpus_News_data, 5) # take a look at the corpus
Corpus consisting of 5 documents and 3 docvars.
16.1 :
"(CNN) Nothing can bring back the months of wedding planning ..."
16.2 :
"(CNN) Broadway actor Nick Cordero is recovering after having..."
16.3 :
"(CNN) A Louisiana pastor who defied state orders and repeate..."
16.4 :
"(CNN) More than 100,000 people defied Bangladesh lockdown or..."
16.5 :
"Hong Kong (CNN) Less than a month ago, Singapore was being h..."
corpus_Twitter_data <- corpus(tweets_data, text_field = "text", unique_docnames = FALSE) %>%
  `docnames<-`(tweets_data$week)
# Extended English stop-word list and lemma (base form) table
stopwords_extended <- readLines("stopwords_en.txt", encoding = "UTF-8")
lemma_data <- read.csv("baseform_en.tsv", encoding = "UTF-8")
We pre-processed the data sets by converting upper-case letters to
lower case and removing numbers, punctuation, symbols, URLs and
separators. The main idea is to reduce the number of terms extracted,
which helps to improve the accuracy of both topic modelling and
sentiment analysis. If two words are similar, it is convenient to
combine them into one unique word. Moreover, if a word is not relevant
for the analysis, it can be removed. Hence, we applied stop-word removal
and lemmatization. Stop words are words that appear in texts but add no
substantial meaning (e.g., “the”, “a”, or “for”), and lemmatization
deals with the inflected forms of words by replacing them with their
base forms. The total number of tokens and the number of unique tokens
in each source decreased by 40 to 70% after the above steps:
Token_news_data <- corpus_News_data %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE,
         remove_url = TRUE, remove_separators = TRUE) %>%
  tokens_tolower(keep_acronyms = FALSE) %>%
  tokens_remove(pattern = c("amid", "updates", "live", "video",
                            "didn", "briefing")) %>%
  tokens_replace(lemma_data$inflected_form, lemma_data$lemma,
                 valuetype = "glob") %>%
  tokens_replace(pattern = c("covid-19", "ncov", "cov"),
                 c("coronavirus", "coronavirus", "coronavirus"),
                 valuetype = "fixed") %>%
  tokens_remove(pattern = stopwords_extended, padding = FALSE)
Token_Twitter_data <- corpus_Twitter_data %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE,
         remove_url = TRUE, remove_separators = TRUE) %>%
  tokens_tolower(keep_acronyms = FALSE) %>%
  tokens_remove(pattern = c("people", "covid19", "covid", "rt", "dont")) %>%
  tokens_replace(lemma_data$inflected_form, lemma_data$lemma,
                 valuetype = "glob") %>%
  tokens_remove(pattern = stopwords_extended, padding = TRUE)
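To make the effect of these steps concrete, here is a minimal base R sketch on a single sentence. It does not use quanteda, and the tiny stop-word list and lemma table are hypothetical stand-ins for `stopwords_en.txt` and `baseform_en.tsv`.

```r
text <- "The hospitals were reporting 120 new cases; doctors reported shortages."

# Hypothetical mini stop-word list and lemma table, for illustration only
stop_words <- c("the", "were", "a", "for")
lemmas <- c(hospitals = "hospital", reporting = "report", reported = "report",
            cases = "case", doctors = "doctor", shortages = "shortage")

# Lowercase, replace digits and punctuation with spaces, split on whitespace
tokens <- tolower(text)
tokens <- gsub("[[:digit:][:punct:]]", " ", tokens)
tokens <- strsplit(trimws(tokens), "\\s+")[[1]]

# Replace inflected forms with their base form where the table has an entry
hit <- tokens %in% names(lemmas)
tokens[hit] <- lemmas[tokens[hit]]

# Drop stop words
tokens_clean <- tokens[!tokens %in% stop_words]

length(tokens)                # tokens before stop-word removal
length(tokens_clean)          # tokens after stop-word removal
length(unique(tokens_clean))  # unique tokens after both steps
```

Lemmatization merges "reporting" and "reported" into one term, so the number of unique tokens drops even where the total count does not.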
In the next step, we generated a document-feature matrix (also known
as a document-term matrix) for each source. It records how often each
term (token) occurs in each document of the corpus. We kept only
features in the top 5% by overall frequency (minimum term frequency set
at the 0.95 quantile) that appear in at most 10% of all documents
(maximum document frequency set at 0.1), in order to focus on features
that are common but distinctive. We also dropped the documents that had
no tokens left after trimming. The two document-feature matrices are
built as follows:
News_DFM <- Token_news_data %>%
  tokens_remove("") %>%
  dfm() %>%
  dfm_trim(min_termfreq = 0.95, termfreq_type = "quantile",
           max_docfreq = 0.1, docfreq_type = "prop")
# Drop documents with no tokens left after trimming
News_DFM <- News_DFM[ntoken(News_DFM) > 0, ]
#head(News_DFM, n = 5, nf = 8)
Twitter_DFM <- Token_Twitter_data %>%
  tokens_remove("") %>%
  dfm() %>%
  dfm_trim(min_termfreq = 0.95, termfreq_type = "quantile",
           max_docfreq = 0.1, docfreq_type = "prop")
# Drop documents with no tokens left after trimming
Twitter_DFM <- Twitter_DFM[ntoken(Twitter_DFM) > 0, ]
For better interpretation, the first 3 documents and 10 features of
the document-feature matrix of news articles can be seen below.
News_DFM[1:3, 1:10]
Document-feature matrix of: 3 documents, 10 features (66.67% sparse) and 3 docvars.
features
docs wed waste alleviate engage couple stress beer chance hall postpone
16.1 3 1 1 1 4 1 2 1 1 1
16.2 0 0 0 0 0 0 0 0 0 0
16.3 0 0 0 0 0 0 0 0 0 0
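The two trimming rules can be illustrated on a toy count matrix in base R. This is a sketch of the idea as described in the text, not of quanteda's internals: the data are made up, and the threshold uses base R's default `quantile()` rule, which is an assumption and may not match the rule `dfm_trim()` applies.

```r
n_docs <- 20
rare_terms <- paste0("rare", 1:38)

# Toy document-feature counts: 20 documents (rows) x 40 terms (columns)
counts <- matrix(0, nrow = n_docs, ncol = 40,
                 dimnames = list(paste0("doc", 1:n_docs),
                                 c("virus", "ventilator", rare_terms)))
counts[, "virus"] <- 3           # frequent, appears in every document
counts[1:2, "ventilator"] <- 25  # frequent, but only in 2 of 20 documents
for (j in seq_along(rare_terms)) {
  # each rare term occurs once, in a single document
  counts[(j - 1) %% n_docs + 1, rare_terms[j]] <- 1
}

term_freq <- colSums(counts)               # total occurrences per term
doc_prop  <- colSums(counts > 0) / n_docs  # share of documents with the term

# Keep terms at or above the 0.95 quantile of term frequency
# that occur in at most 10% of the documents
min_freq <- quantile(term_freq, 0.95)
keep <- term_freq >= min_freq & doc_prop <= 0.1

trimmed <- counts[, keep, drop = FALSE]
# Drop documents that have no counts left after trimming
trimmed <- trimmed[rowSums(trimmed) > 0, , drop = FALSE]

colnames(trimmed)
dim(trimmed)
```

In this toy case "virus" is frequent but too widespread, the rare terms are too infrequent, and only "ventilator" is both common and distinctive, which is the kind of feature the filter aims to keep.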
Topic Modelling
We identified the topics in news articles and tweets on a weekly basis
over the three-month period and compared them between the two media.
For topic modelling we used Structural Topic Modeling (STM), since it
incorporates metadata (information about each document) into the topic
modelling framework, so that we can discover topics and estimate their
relationship to the documents' metadata.
STM is an unsupervised machine learning model in which the number of
topics K has to be defined in advance. As with k-means clustering, we
do not know beforehand how many topics are appropriate for a given
corpus. Therefore, we trained a group of topic models with different
values of K and then evaluated which number of topics fits best. Here we
proceeded with K = 25 for each of the two media. The evaluation method
is described in this post by Julia Silge.
library(stm)
# Convert the quanteda DFMs to the input format expected by stm
News_stm <- convert(News_DFM, to = "stm")
Twitter_stm <- convert(Twitter_DFM, to = "stm")
# Fit one 25-topic model per source, with topic prevalence varying by week
News_topic_model <- stm(
  documents = News_stm$documents,
  vocab = News_stm$vocab,
  K = 25,
  prevalence = ~ s(week),
  data = News_stm$meta,
  init.type = "Spectral",
  seed = 123456)
Twitter_topic_model <- stm(
  documents = Twitter_stm$documents,
  vocab = Twitter_stm$vocab,
  K = 25,
  prevalence = ~ s(week),
  data = Twitter_stm$meta,
  init.type = "Spectral",
  seed = 123456)
The plots below show the top 10 topics in each corpus (News, Twitter)
and the 7 words with the highest probability of belonging to each
topic. A closer look at each source revealed some interesting
patterns.
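library(tidytext)  # tidy() methods for stm models
library(scales)    # percent_format()
library(ggthemes)  # theme_tufte()
News_beta <- tidy(News_topic_model)
News_gamma <- tidy(News_topic_model, matrix = "gamma")
# Top 7 terms per topic, collapsed into one comma-separated string
News_top_terms <- News_beta %>%
  arrange(beta) %>%
  group_by(topic) %>%
  top_n(7, beta) %>%
  arrange(-beta) %>%
  select(topic, term) %>%
  summarise(terms = list(term)) %>%
  mutate(terms = map(terms, paste, collapse = ", ")) %>%
  unnest(cols = terms)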
News_gamma_terms <- News_gamma %>%
  group_by(topic) %>%
  summarise(gamma = mean(gamma)) %>%
  arrange(desc(gamma)) %>%
  left_join(News_top_terms, by = "topic") %>%
  mutate(topic = paste0("Topic ", topic),
         topic = reorder(topic, gamma))
News_gamma_terms %>%
  top_n(10, gamma) %>%
  ggplot(aes(topic, gamma, label = terms, fill = topic)) +
  geom_col(show.legend = FALSE) +
  geom_text(hjust = 0.8, nudge_y = 0.0005, size = 3,
            family = "IBMPlexSans") +
  coord_flip() +
  scale_y_continuous(expand = c(0, 0),
                     limits = c(0, 0.09),
                     labels = percent_format()) +
  theme_tufte(base_family = "IBMPlexSans", ticks = FALSE) +
  theme(plot.title = element_text(size = 16,
                                  family = "IBMPlexSans-Bold"),
        plot.subtitle = element_text(size = 13)) +
  labs(x = NULL, y = expression(gamma),
       title = "Top 10 topics by popularity on News Articles",
       subtitle = "With the top words that contribute to each topic")

Twitter_beta <- tidy(Twitter_topic_model)
Twitter_gamma <- tidy(Twitter_topic_model, matrix = "gamma")
# Top 7 terms per topic, collapsed into one comma-separated string
Twitter_top_terms <- Twitter_beta %>%
  arrange(beta) %>%
  group_by(topic) %>%
  top_n(7, beta) %>%
  arrange(-beta) %>%
  select(topic, term) %>%
  summarise(terms = list(term)) %>%
  mutate(terms = map(terms, paste, collapse = ", ")) %>%
  unnest(cols = terms)
Twitter_gamma_terms <- Twitter_gamma %>%
  group_by(topic) %>%
  summarise(gamma = mean(gamma)) %>%
  arrange(desc(gamma)) %>%
  left_join(Twitter_top_terms, by = "topic") %>%
  mutate(topic = paste0("Topic ", topic),
         topic = reorder(topic, gamma))
Twitter_gamma_terms %>%
  top_n(10, gamma) %>%
  ggplot(aes(topic, gamma, label = terms, fill = topic)) +
  geom_col(show.legend = FALSE) +
  geom_text(hjust = 0.85, nudge_y = 0.0005, size = 3,
            family = "IBMPlexSans") +
  coord_flip() +
  scale_y_continuous(expand = c(0, 0),
                     limits = c(0, 0.09),
                     labels = percent_format()) +
  theme_tufte(base_family = "IBMPlexSans", ticks = FALSE) +
  theme(plot.title = element_text(size = 16,
                                  family = "IBMPlexSans-Bold"),
        plot.subtitle = element_text(size = 13)) +
  labs(x = NULL, y = expression(gamma),
       title = "Top 10 topics by popularity on Twitter",
       subtitle = "With the top words that contribute to each topic")

The key themes on Twitter are the spread of the virus (Topic 11),
negative reactions (Topics 7 and 23) and updates on the pandemic
(Topics 13 and 4). News topics, by contrast, show a rather different
pattern, centred on political matters (Topic 22), the situation in
other countries (Topic 19), the economy (Topic 4) and the healthcare
system (Topics 20 and 21). Messages about family are popular in both
sources (News Topic 13, Twitter Topic 18). Furthermore, politicians are
well covered, with Trump on Twitter and Pence and Cuomo in the
mainstream press.
Sentiment Analysis
Corpus sentiment analysis
To investigate sentiment in news articles and tweets, we used the
tokens extracted from the documents to calculate sentiment scores with
the AFINN lexicon developed by Finn Årup Nielsen, which is available
through the tidytext package. The AFINN lexicon is a list of English
terms manually rated for valence with an integer between -5 (negative)
and +5 (positive). It comprises 878 positive terms and 1,598 negative
terms. The sentiment score is the difference between the absolute sum
of the positive scores and the absolute sum of the negative scores,
divided by the total of the two and expressed as a percentage.
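As a small worked example of this score, the sketch below applies the formula to made-up weekly totals. The numbers are hypothetical and only show the arithmetic: with P the sum of positive scores and N the absolute sum of negative scores, the polarity is 100 * (P - N) / (P + N).

```r
# Hypothetical weekly sums of AFINN scores (not taken from the project data)
weekly <- data.frame(
  week     = c(3, 12),
  positive = c(120, 450),    # sum of positive word scores
  negative = c(-480, -550)   # sum of negative word scores
)

# Total absolute sentiment per week: P + N
total <- weekly$positive + abs(weekly$negative)

# Polarity in percent: 100 * (P - N) / (P + N), bounded by -100 and +100
weekly$polarity <- round(
  100 * (weekly$positive - abs(weekly$negative)) / total, 3)

weekly
```

A polarity of -100 would mean that only negative terms were matched, +100 that only positive terms were matched, and 0 that both sides balance out.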
# Load the AFINN lexicon as a data frame (columns: word, value)
afinn <- get_sentiments("afinn")
# Convert the news DFM to long format: one row per document-word pair
News_ext <- convert(News_DFM, to = "tripletlist")
News_df <- data.frame(doc = News_ext$document,
                      word = News_ext$feature,
                      freq = News_ext$frequency)
News_df$word <- as.character(News_df$word)
# Weekly sums of positive and negative AFINN scores for news
News_afinn <- News_df %>%
  inner_join(afinn, by = "word") %>%
  mutate(score = freq * value) %>%
  separate(doc, into = c("week", "no"), sep = "\\.") %>%
  mutate(sentiment = ifelse(score > 0, "positive", "negative")) %>%
  group_by(week, sentiment) %>%
  summarise(sentiment_score = sum(score)) %>%
  ungroup() %>%
  mutate(source = "News")
# Same steps for Twitter
Twitter_ext <- convert(Twitter_DFM, to = "tripletlist")
Twitter_df <- data.frame(doc = Twitter_ext$document,
                         word = Twitter_ext$feature,
                         freq = Twitter_ext$frequency)
Twitter_df$word <- as.character(Twitter_df$word)
Twitter_afinn <- Twitter_df %>%
  inner_join(afinn, by = "word") %>%
  mutate(score = freq * value) %>%
  separate(doc, into = c("week", "no"), sep = "\\.") %>%
  mutate(sentiment = ifelse(score > 0, "positive", "negative")) %>%
  group_by(week, sentiment) %>%
  summarise(sentiment_score = sum(score)) %>%
  ungroup() %>%
  mutate(source = "Twitter")
# Compute the weekly polarity (in percent) used in the sentiment plot
News_afinn_plot <- News_afinn %>%
  pivot_wider(names_from = sentiment, values_from = sentiment_score) %>%
  mutate(sum = positive + (-1) * negative,
         pos_percent = positive / sum * 100,
         neg_percent = (-1) * negative / sum * 100,
         polarity = round((pos_percent - neg_percent), 3)) %>%
  select(week, polarity, source)
Twitter_afinn_plot <- Twitter_afinn %>%
  pivot_wider(names_from = sentiment, values_from = sentiment_score) %>%
  mutate(sum = positive + (-1) * negative,
         pos_percent = positive / sum * 100,
         neg_percent = (-1) * negative / sum * 100,
         polarity = round((pos_percent - neg_percent), 3)) %>%
  select(week, polarity, source)
# Extract the beta matrix (word probabilities per topic) as a data frame and
# join it with the AFINN lexicon to get a sentiment value for each word
News_beta <- tidy(News_topic_model, matrix = "beta")
News_topic_sc <- inner_join(News_beta, afinn, by = c("term" = "word")) %>%
  mutate(score = beta * value) %>%
  group_by(topic) %>%
  summarise(sentiment_score = sum(score))
News_topic_sc$topic <- factor(News_topic_sc$topic, levels = 1:25)
Twitter_beta <- tidy(Twitter_topic_model, matrix = "beta")
Twitter_topic_sc <- inner_join(Twitter_beta, afinn, by = c("term" = "word")) %>%
  mutate(score = beta * value) %>%
  group_by(topic) %>%
  summarise(sentiment_score = sum(score))
Twitter_topic_sc$topic <- factor(Twitter_topic_sc$topic, levels = 1:25)
afinn_chart_all <- rbind(News_afinn_plot, Twitter_afinn_plot)
afinn_chart_all$week <- as.numeric(as.character(afinn_chart_all$week))
afinn_chart_all %>%
  arrange(week) %>%
  ggplot(aes(x = week, y = polarity, color = source)) +
  geom_point() +
  geom_path() +
  geom_hline(yintercept = 0, color = "gold3") +
  geom_vline(xintercept = 12, color = "coral2") +
  labs(x = "Week", y = "Sentiment Score") +
  # Show one axis tick per observed week
  scale_x_continuous(breaks = unique(afinn_chart_all$week),
                     labels = unique(afinn_chart_all$week)) +
  scale_y_continuous(n.breaks = 10) +
  scale_colour_manual(values = c("#E3211C", "#1F78B4")) +
  theme(panel.grid = element_blank(),
        panel.background = NULL,
        axis.title = element_text(size = 16),
        axis.text = element_text(size = 16),
        legend.title = element_text(size = 16),
        legend.text = element_text(size = 16),
        legend.position = "bottom")

Overall, opinion towards COVID-19 remained negative throughout the
period. While news was slightly more positive than Twitter, both
followed a similar pattern: sentiment scores were highly negative in
week 3 and then trended upwards in the subsequent weeks. In week 12,
when the situation got worse in the US, the sentiment score of news
articles went down and then picked up again in week 15, whereas Twitter
sentiment stayed consistent.
To examine sentiment from several perspectives, we also made use of
the NRC lexicon from Saif Mohammad and Peter Turney. The NRC lexicon
categorizes words in a binary fashion (“yes”/“no”) into the categories
positive, negative, anger, anticipation, disgust, fear, joy, sadness,
surprise, and trust. We filtered out the positive and negative
categories, which left us with 4,463 distinct words in the lexicon
(note: one word can belong to multiple sentiment categories). We
counted the number of words in each category and computed their
percentage distribution in each of our sources. The result is presented
in a radar chart:
# NRC lexicon without the generic positive/negative categories
nrc <- get_sentiments("nrc") %>%
  filter(!sentiment %in% c("positive", "negative"))
# News: join with the lexicon and count words per week and emotion
News_nrc <- News_df %>%
  inner_join(nrc, by = "word") %>%
  separate(doc, into = c("week", "no"), sep = "\\.") %>%
  group_by(week, sentiment) %>%
  summarise(count = sum(freq))
News_nrc_sum <- News_nrc %>%
  group_by(sentiment) %>%
  summarise(total_count = sum(count)) %>%
  ungroup()
News_nrc_sum <- News_nrc_sum %>%
  mutate(percent = total_count / sum(News_nrc_sum$total_count),
         source = "News")
# Twitter: join with the lexicon and count words per week and emotion
Twitter_nrc <- Twitter_df %>%
  inner_join(nrc, by = "word") %>%
  separate(doc, into = c("week", "no"), sep = "\\.") %>%
  group_by(week, sentiment) %>%
  summarise(count = sum(freq))
Twitter_nrc_sum <- Twitter_nrc %>%
  group_by(sentiment) %>%
  summarise(total_count = sum(count)) %>%
  ungroup()
Twitter_nrc_sum <- Twitter_nrc_sum %>%
  mutate(percent = total_count / sum(Twitter_nrc_sum$total_count),
         source = "Twitter")
# Combine data and plot radar chart
library(radarchart)  # chartJSRadar()
nrc_chart_all <- rbind(News_nrc_sum, Twitter_nrc_sum)
nrc_chart_all %>%
  select(-total_count) %>%
  pivot_wider(names_from = source, values_from = percent) %>%
  chartJSRadar(width = 8,
               height = 5,
               showToolTipLabel = FALSE,
               colMatrix = grDevices::col2rgb(c("red", "blue", "green")),
               labelSize = 18)
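The percentage distribution behind the radar chart can be sketched on a toy table in base R. The words, categories and counts below are made up for illustration and are not entries taken from the NRC lexicon.

```r
# Hypothetical matched words with their corpus frequency and emotion category;
# a word can appear in more than one category and is then counted in each
matches <- data.frame(
  word      = c("death", "death", "hope", "hope", "crisis"),
  sentiment = c("fear", "sadness", "anticipation", "joy", "fear"),
  freq      = c(4, 4, 3, 3, 2)
)

# Total word count per emotion category
totals <- aggregate(freq ~ sentiment, data = matches, FUN = sum)

# Share of each category among all matched emotion words
totals$percent <- totals$freq / sum(totals$freq)

totals
```

Because the shares sum to 1 within a source, the radar chart compares the emotional profile of each source rather than its raw volume.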
Overall, news articles showed the most positivity, ranking highest on
“joy” and “trust”, but they also contained more “anger” words. Reddit
was dominant in most negative categories, such as “sadness”, “fear” and
“anticipation”. Twitter posts were more negative, as they expressed
“disgust”, “sadness”, and to some extent “surprise”. This is consistent
with the outcome of the AFINN lexicon analysis above.
Topic-based sentiment analysis
On the topic level, we assigned each topic word its sentiment value
from the lexicon and used the probability of the word in the topic, as
given by the STM topic model, as the term weight of the word. As a
result, the topic sentiment score is calculated as
\[\text{Sentiment score of topic } A =
\sum_{i=1}^{n} P(\text{word}_i \mid \text{topic } A) \times
\text{sentiment value}(\text{word}_i)\]
In other words, the overall sentiment score of a topic is obtained by
multiplying the sentiment value of each word by its probability in that
topic and summing the products. Several topics can contain the same
words with different probabilities, so the sentiment score of one topic
can differ from that of another even when the same set of words appears
in both topics.
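A minimal worked example of this weighted sum, in base R, is given below. The word probabilities and sentiment values are hypothetical and chosen only so that the arithmetic is easy to follow.

```r
# Hypothetical word probabilities for two topics (rows of a beta matrix)
beta <- data.frame(
  topic = c(1, 1, 1, 2, 2, 2),
  term  = c("love", "die", "hospital", "love", "die", "hospital"),
  beta  = c(0.50, 0.10, 0.40, 0.10, 0.60, 0.30)
)

# Hypothetical sentiment values on an AFINN-like scale (-5 to +5);
# "hospital" has no entry and therefore does not contribute
lexicon <- data.frame(word = c("love", "die"), value = c(3, -3))

# Keep only words found in the lexicon, then weight each value by beta
scored <- merge(beta, lexicon, by.x = "term", by.y = "word")
scored$score <- scored$beta * scored$value

# Sum the weighted values within each topic
topic_score <- aggregate(score ~ topic, data = scored, FUN = sum)
topic_score
```

Both topics contain the same three words, yet topic 1 scores 1.2 and topic 2 scores -1.5, because the probabilities attached to "love" and "die" differ.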
# News
News_topic_sc_plot <- News_topic_sc %>%
  ggplot(aes(x = topic, y = sentiment_score)) +
  geom_col(fill = "#E3211C") +
  geom_hline(yintercept = 0, color = "gold3") +
  labs(x = NULL, y = NULL, subtitle = "News") +
  scale_y_continuous(limits = c(-1, 0.2), n.breaks = 8) +
  theme(panel.grid = element_blank(),
        panel.background = NULL,
        axis.title = element_text(size = 16),
        axis.text = element_text(size = 14),
        plot.subtitle = element_text(size = rel(1.5)))
# Twitter
Twitter_topic_sc_plot <- Twitter_topic_sc %>%
  ggplot(aes(x = topic, y = sentiment_score)) +
  geom_col(fill = "#1F78B4") +
  geom_hline(yintercept = 0, color = "gold3") +
  labs(x = "Topic", y = NULL, subtitle = "Twitter") +
  scale_y_continuous(limits = c(-1, 0.2), n.breaks = 8) +
  theme(panel.grid = element_blank(),
        panel.background = NULL,
        axis.title = element_text(size = 16),
        axis.text = element_text(size = 14),
        plot.subtitle = element_text(size = rel(1.5)))
# Stack the two plots vertically
library(gridExtra)  # grid.arrange()
grid.arrange(News_topic_sc_plot, Twitter_topic_sc_plot)

Using the STM models with 25 topics each, we calculated the sentiment
score of each topic. Again, the general opinion in most topics was
negative. Sentiment was more pronounced on Twitter, while in news
articles most topics had scores closer to neutral. Topic 7 on Twitter
was extremely negative and consisted mostly of sentimental words (hell,
yall, shit…). This is one downside of unsupervised topic modelling: it
grouped these sentimental words together in one topic, but the result
was not useful for topic discovery. Apart from that topic, news and
Twitter each had a few notably negative topics: topic 22 (political,
pence, tweet, blame, party, attack…) and topic 23 (police, prison,
court, law, jail…) in news, and topic 12 (die, flu, trump…) on Twitter.
Twitter users seemed to have concerns centred on President Trump's
behaviour, while the news media gravitated towards the conflict between
political parties and worries about social security.
Among the positive topics, there was a common theme across both
sources. Topic 13 in the news (love, church, moment, wife, mother…) and
Topics 18 (family, friend, pray…) and 15 (protect, god, love, jesus…) on
Twitter shared the theme of faith and belief, though the positive scores
were mild.
Conclusion
This project studied the topic coverage and sentiment dynamics of the
sensitive topic of COVID-19 in news media and on Twitter. Although the
project revealed different characteristics of news articles and Twitter,
it only analysed the text and the time variable. Much of the metadata
was removed from the raw data, which might leave out some important
aspects that could be explored further. Other choices that might
introduce bias are the data samples, which cover only a subset of
keywords, and the dictionaries used for pre-processing and sentiment
analysis. Lastly, the time frame of data collection is only three
months, ending in the middle of April when the pandemic was still
escalating in the US. Collecting the data and conducting the analysis
after the pandemic is over would give a fuller overview of the
situation.
