Sentiment Analysis of US Financial News Headlines Data

For this assignment I am going to perform a sentiment analysis on US Financial News Headlines data, which were obtained from Kaggle.com at the address below:

https://www.kaggle.com/notlucasp/financial-news-headlines

Context

The data consist of three sets of headlines scraped from the official websites of CNBC, the Guardian, and Reuters. The headlines give a daily overview of the U.S. economy and stock market over roughly the past one to two years.

Content

Data scraped from CNBC contains the headlines, last updated date, and the preview text of articles from the end of December 2017 to July 19th, 2020. Data scraped from the Guardian Business covers the same period but contains only the headlines and last updated date, since the Guardian Business does not offer preview text. Data scraped from Reuters contains the headlines, last updated date, and the preview text of articles from the end of March 2018 to July 19th, 2020.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.0      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(tidytext)
library(textdata)    # Needed for loughran lexicon
library(ggplot2)

Let’s use the Loughran lexicon, a word list built for financial text, to perform the sentiment analysis.

loughran_sentiments <- get_sentiments("loughran")

Let’s take a peek at the sentiments from the “loughran” lexicon.

loughran_sentiments 
## # A tibble: 4,150 × 2
##    word         sentiment
##    <chr>        <chr>    
##  1 abandon      negative 
##  2 abandoned    negative 
##  3 abandoning   negative 
##  4 abandonment  negative 
##  5 abandonments negative 
##  6 abandons     negative 
##  7 abdicated    negative 
##  8 abdicates    negative 
##  9 abdicating   negative 
## 10 abdication   negative 
## # … with 4,140 more rows

Read the data

cnbc_csv <- read.csv("https://raw.githubusercontent.com/arinolan/week-11-assignment/main/cnbc_headlines.csv")

head(cnbc_csv)
##                                                                             Headlines
## 1                Jim Cramer: A better way to invest in the Covid-19 vaccine gold rush
## 2                                      Cramer's lightning round: I would own Teradyne
## 3                                                                                    
## 4           Cramer's week ahead: Big week for earnings, even bigger week for vaccines
## 5                      IQ Capital CEO Keith Bliss says tech and healthcare will rally
## 6 Wall Street delivered the 'kind of pullback I've been waiting for,' Jim Cramer says
##                             Time
## 1  7:51  PM ET Fri, 17 July 2020
## 2  7:33  PM ET Fri, 17 July 2020
## 3                               
## 4  7:25  PM ET Fri, 17 July 2020
## 5  4:24  PM ET Fri, 17 July 2020
## 6  7:36  PM ET Thu, 16 July 2020
##                                                                                                                                          Description
## 1                                              "Mad Money" host Jim Cramer recommended buying four companies that are supporting vaccine developers.
## 2        "Mad Money" host Jim Cramer rings the lightning round bell, which means he's giving his answers to callers' stock questions at rapid speed.
## 3                                                                                                                                                   
## 4 "We'll pay more for the earnings of the non-Covid companies if The Lancet publishes some good news from AstraZeneca's vaccine trial," Cramer said.
## 5      Keith Bliss, IQ Capital CEO, joins "Closing Bell" to talk about the broader markets, including the performance of the S&P 500 and the Nasdaq.
## 6          "Look for the stocks of high-quality companies that are going lower even though they deserve to go higher," the "Mad Money" host advised.
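# Drop rows where every column is NA. Note: the blank rows in this CSV are read as
# empty strings (""), not NA, so this filter leaves them in place (see row 3 below).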
cnbc_headlines <- cnbc_csv[rowSums(is.na(cnbc_csv)) != ncol(cnbc_csv),]

head(cnbc_headlines)
##                                                                             Headlines
## 1                Jim Cramer: A better way to invest in the Covid-19 vaccine gold rush
## 2                                      Cramer's lightning round: I would own Teradyne
## 3                                                                                    
## 4           Cramer's week ahead: Big week for earnings, even bigger week for vaccines
## 5                      IQ Capital CEO Keith Bliss says tech and healthcare will rally
## 6 Wall Street delivered the 'kind of pullback I've been waiting for,' Jim Cramer says
##                             Time
## 1  7:51  PM ET Fri, 17 July 2020
## 2  7:33  PM ET Fri, 17 July 2020
## 3                               
## 4  7:25  PM ET Fri, 17 July 2020
## 5  4:24  PM ET Fri, 17 July 2020
## 6  7:36  PM ET Thu, 16 July 2020
##                                                                                                                                          Description
## 1                                              "Mad Money" host Jim Cramer recommended buying four companies that are supporting vaccine developers.
## 2        "Mad Money" host Jim Cramer rings the lightning round bell, which means he's giving his answers to callers' stock questions at rapid speed.
## 3                                                                                                                                                   
## 4 "We'll pay more for the earnings of the non-Covid companies if The Lancet publishes some good news from AstraZeneca's vaccine trial," Cramer said.
## 5      Keith Bliss, IQ Capital CEO, joins "Closing Bell" to talk about the broader markets, including the performance of the S&P 500 and the Nasdaq.
## 6          "Look for the stocks of high-quality companies that are going lower even though they deserve to go higher," the "Mad Money" host advised.
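
The output above still shows the blank row 3, because is.na() does not flag empty strings. A minimal sketch of a filter that treats both NA and "" as blank is given below. It runs on a small made-up data frame (`toy` is a hypothetical stand-in for cnbc_csv, not the real data) and uses base R only.

```r
# Toy data frame mimicking the CNBC layout; row 2 consists of empty strings only.
toy <- data.frame(
  Headlines   = c("Stocks rally", "", "Fed warns of recession"),
  Time        = c("7:51 PM ET Fri, 17 July 2020", "", "7:36 PM ET Thu, 16 July 2020"),
  Description = c("Markets rose.", "", "Officials are cautious."),
  stringsAsFactors = FALSE
)

# Logical matrix: TRUE where a cell is NA or an empty string
is_blank <- is.na(toy) | toy == ""

# Keep rows that have at least one non-blank cell
toy_clean <- toy[rowSums(is_blank) != ncol(toy), ]

nrow(toy_clean)
```

Applied to the real data, the same two lines would use cnbc_csv in place of `toy`. The row numbers printed by head() would then skip the removed rows.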

Sentiment Analysis with Inner Join

First, we need to take the text of the headlines and convert it to the tidy format using unnest_tokens(). Let’s also set up a column to keep track of which headline each word comes from.

Add new columns to the data frame containing the headline date and month (YYYY-MM).

# Add columns containing the headline date and its year-month (YYYY-MM).
# sub() and as.Date() are vectorized, so rowwise() is not needed; leaving the
# data frame rowwise would also make row_number() return 1 for every row later on.

cnbc_headlines <- cnbc_headlines %>%
  mutate(Headline_Date = as.Date(sub(".*, ", "", Time), format = "%d %B %Y"),
         Headline_YYYYMM = format(Headline_Date, "%Y-%m"))
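
To see what the date parsing does, here is a small sketch on a few Time values copied from the output above, plus one empty string standing in for a blank row. The variable names are illustrative only. The call to Sys.setlocale() is an added assumption: %B matches month names in the current locale, so the sketch switches to the "C" locale to get English month names.

```r
# Sample values from the Time column, plus an empty string for a blank row
time_raw <- c(" 7:51  PM ET Fri, 17 July 2020",
              " 7:36  PM ET Thu, 16 July 2020",
              "")

# Use English month names when parsing %B
invisible(Sys.setlocale("LC_TIME", "C"))

# Remove everything up to and including the last ", ", leaving e.g. "17 July 2020"
date_text <- sub(".*, ", "", time_raw)

# Parse day, full month name, and four-digit year; unparseable input becomes NA
headline_date <- as.Date(date_text, format = "%d %B %Y")

# Year-month label used for grouping
headline_yyyymm <- format(headline_date, "%Y-%m")

headline_date
headline_yyyymm
```

Blank rows end up with an NA date, which is worth keeping in mind when grouping by date later.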

Convert headlines to tidytext format

tidy_cnbc_headlines <- cnbc_headlines %>%
  select(Headline_YYYYMM, Headline_Date, Headlines) %>%
  mutate(linenumber = row_number()) %>%
  unnest_tokens(output = word, input = Headlines, token = "words", format = "text", to_lower = TRUE)
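
As a quick illustration of what unnest_tokens() produces, the sketch below tokenizes two short headlines. The first is taken from the data shown earlier; the second is made up for the example. It assumes an ordinary, ungrouped tibble, so row_number() gives each headline its own identifier.

```r
library(dplyr)
library(tidytext)

toy_headlines <- tibble(
  Headlines = c("Cramer's lightning round: I would own Teradyne",
                "Fed warns recession could be bad")   # second headline is invented
)

tidy_toy <- toy_headlines %>%
  mutate(linenumber = row_number()) %>%               # one id per headline
  unnest_tokens(output = word, input = Headlines)     # one row per lowercased word

tidy_toy
```

Each output row carries the linenumber of the headline it came from, so words can always be traced back to their source.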

First, we find a sentiment category for each word using the “loughran” lexicon and inner_join(). Words that are not in the lexicon are dropped by the join.

Next, we count how many words of each sentiment category appear on each date.

We then use spread() so that each sentiment category has its own column, and lastly calculate a net sentiment (positive - negative).

cnbc_sentiment <- tidy_cnbc_headlines %>%
  inner_join(loughran_sentiments) %>%
  count(Headline_YYYYMM, Headline_Date, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
ggplot(cnbc_sentiment, aes(Headline_YYYYMM, sentiment)) +
  geom_col(show.legend = FALSE) +
  #facet_wrap(~Headline_YYYYMM, ncol = 4, scales = "free_x")
  coord_flip()
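
The join, count, and spread steps are easier to follow on a tiny example. The sketch below uses a four-word made-up lexicon (`toy_lexicon`) and five made-up tokens (`toy_words`) in place of the real data, so the numbers are illustrative only.

```r
library(dplyr)
library(tidyr)

# Hypothetical mini-lexicon in the same shape as get_sentiments("loughran")
toy_lexicon <- tibble(
  word      = c("good", "best", "recession", "warns"),
  sentiment = c("positive", "positive", "negative", "negative")
)

# Hypothetical tokens, one row per word, tagged with the headline date
toy_words <- tibble(
  Headline_Date = as.Date(c("2020-07-16", "2020-07-16", "2020-07-16",
                            "2020-07-17", "2020-07-17")),
  word          = c("recession", "warns", "good", "best", "stocks")
)

toy_sentiment <- toy_words %>%
  inner_join(toy_lexicon, by = "word") %>%   # "stocks" is not in the lexicon and is dropped
  count(Headline_Date, sentiment) %>%        # words per date and sentiment
  spread(sentiment, n, fill = 0) %>%         # one column per sentiment, 0 where absent
  mutate(net = positive - negative)          # net sentiment per date

toy_sentiment
```

Here 16 July has two negative words and one positive word, giving a net score of -1, while 17 July has a single positive word and a net score of 1.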

Most Common Positive and Negative Words

One advantage of having the data frame with both sentiment and word is that we can analyze word counts that contribute to each sentiment. By implementing count() here with arguments of both word and sentiment, we find out how much each word contributed to each sentiment.

loughran_word_counts <- tidy_cnbc_headlines %>%
  inner_join(get_sentiments("loughran")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining, by = "word"
loughran_word_counts
## # A tibble: 431 × 3
##    word        sentiment       n
##    <chr>       <chr>       <int>
##  1 could       uncertainty   167
##  2 good        positive       57
##  3 may         uncertainty    55
##  4 best        positive       46
##  5 recession   negative       35
##  6 opportunity positive       32
##  7 warns       negative       30
##  8 bad         negative       27
##  9 better      positive       27
## 10 wrong       negative       26
## # … with 421 more rows

This can be shown visually. Because we consistently use tools built for tidy data frames, we can pipe straight into ggplot2.

loughran_word_counts %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()
## Selecting by n
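
The "top words per sentiment" step can also be written with slice_max(), which names the ranking column explicitly, so nothing has to be inferred as in the "Selecting by n" message. A minimal sketch follows, using seven of the word counts printed in the table above and keeping the top two per sentiment instead of ten so the effect is visible on a small input.

```r
library(dplyr)

# A few rows copied from the loughran_word_counts output above
toy_counts <- tibble(
  word      = c("could", "may", "good", "best", "better", "recession", "warns"),
  sentiment = c("uncertainty", "uncertainty", "positive", "positive",
                "positive", "negative", "negative"),
  n         = c(167L, 55L, 57L, 46L, 27L, 35L, 30L)
)

top_words <- toy_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 2) %>%   # rank by the n column, keep the two largest per group
  ungroup()

top_words
```

With this input, "better" is the only word dropped, since it ranks third among the positive words.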