My Dataset - Sentiment Analysis of US Financial News Headlines Data

For this assignment I am going to perform a sentiment analysis on US Financial News Headlines data obtained from Kaggle. The link to the dataset is https://www.kaggle.com/notlucasp/financial-news-headlines.

The dataset consists of three sets scraped from the official CNBC, Guardian, and Reuters news websites. The headlines in these sets reflect the state of the U.S. economy and stock market each day for the years 2017 to 2020.

Data obtained from CNBC contains the headlines, last updated date, and the preview text of articles from the end of December 2017 to July 19th, 2020.

Data obtained from the Guardian Business contains only the headlines and last updated date of articles from the end of December 2017 to July 19th, 2020, since the Guardian Business does not offer preview text.

Data obtained from Reuters contains the headlines, last updated date, and the preview text of articles from the end of March 2018 to July 19th, 2020.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidytext)
library(textdata)   # supplies the lexicons downloaded by get_sentiments()
library(ggplot2)    # already attached by tidyverse; loading it again is harmless

Using the loughran lexicon, I perform the sentiment analysis. First, I take a peek at the sentiments the lexicon contains.

loughran_sentiments <- get_sentiments("loughran")
loughran_sentiments 
## # A tibble: 4,150 × 2
##    word         sentiment
##    <chr>        <chr>    
##  1 abandon      negative 
##  2 abandoned    negative 
##  3 abandoning   negative 
##  4 abandonment  negative 
##  5 abandonments negative 
##  6 abandons     negative 
##  7 abdicated    negative 
##  8 abdicates    negative 
##  9 abdicating   negative 
## 10 abdication   negative 
## # ℹ 4,140 more rows
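
The Loughran-McDonald lexicon was developed specifically for financial text, and unlike a simple positive/negative lexicon it labels words with six categories: negative, positive, uncertainty, litigious, constraining, and superfluous. A quick check I find useful is the breakdown by category, reusing the loughran_sentiments tibble loaded above:

loughran_sentiments %>%
  count(sentiment, sort = TRUE)   # number of words per sentiment category
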
cnbc_headlines <- read_csv('cnbc_headlines.csv')
## Rows: 3080 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Headlines, Time, Description
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(cnbc_headlines, 10)
## # A tibble: 10 × 3
##    Headlines                                                   Time  Description
##    <chr>                                                       <chr> <chr>      
##  1 Jim Cramer: A better way to invest in the Covid-19 vaccine… 7:51… "\"Mad Mon…
##  2 Cramer's lightning round: I would own Teradyne              7:33… "\"Mad Mon…
##  3 <NA>                                                        <NA>   <NA>      
##  4 Cramer's week ahead: Big week for earnings, even bigger we… 7:25… "\"We'll p…
##  5 IQ Capital CEO Keith Bliss says tech and healthcare will r… 4:24… "Keith Bli…
##  6 Wall Street delivered the 'kind of pullback I've been wait… 7:36… "\"Look fo…
##  7 Cramer's lightning round: I would just stay long Wex        7:23… "\"Mad Mon…
##  8 Acorns CEO: Parents can turn $5 into five figures for thei… 8:03… "Investing…
##  9 Dividend cuts may mean rethinking your retirement income s… 8:54… "Hundreds …
## 10 <NA>                                                        <NA>   <NA>
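
The Guardian and Reuters sets can be read the same way; a minimal sketch, assuming the file names from the Kaggle download are guardian_headlines.csv and reuters_headlines.csv (adjust if yours differ):

guardian_headlines <- read_csv('guardian_headlines.csv')   # no Description column
reuters_headlines  <- read_csv('reuters_headlines.csv')

The rest of this assignment sticks with the CNBC set.
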
# clean the CNBC headlines data: drop rows where every column is NA
cnbc_headlines_clean <- cnbc_headlines[rowSums(is.na(cnbc_headlines)) != ncol(cnbc_headlines), ]
head(cnbc_headlines_clean)
## # A tibble: 6 × 3
##   Headlines                                                    Time  Description
##   <chr>                                                        <chr> <chr>      
## 1 Jim Cramer: A better way to invest in the Covid-19 vaccine … 7:51… "\"Mad Mon…
## 2 Cramer's lightning round: I would own Teradyne               7:33… "\"Mad Mon…
## 3 Cramer's week ahead: Big week for earnings, even bigger wee… 7:25… "\"We'll p…
## 4 IQ Capital CEO Keith Bliss says tech and healthcare will ra… 4:24… "Keith Bli…
## 5 Wall Street delivered the 'kind of pullback I've been waiti… 7:36… "\"Look fo…
## 6 Cramer's lightning round: I would just stay long Wex         7:23… "\"Mad Mon…
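
For reference, the same all-NA rows can be dropped inside a pipe with dplyr; a sketch that should produce an identical result:

cnbc_headlines_clean <- cnbc_headlines %>%
  filter(!if_all(everything(), is.na))   # keep rows with at least one non-NA value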

Sentiment Analysis with Inner Join

First, I add new columns to the data frame containing the headline date and month (YYYY-MM), parsed from the Time field. Then I convert the headline text to the tidy format using unnest_tokens(), with a linenumber column to keep track of which headline each word comes from. Next, I find a sentiment for each word using the “loughran” lexicon and inner_join(), and count how many positive and negative words appear on each headline date. I then use spread() so that negative and positive sentiment are in separate columns, and lastly calculate a net sentiment (positive - negative).

cnbc_headlines_clean <- cnbc_headlines_clean %>%
  mutate(Headline_Date = as.Date(sub(".*, ", "", Time), format = "%d %B %Y"),  # keep the text after the last comma in Time and parse it as a date
         Headline_YYYYMM = format(Headline_Date, "%Y-%m"))                      # month key, e.g. "2020-07"

tidy_cnbc_headlines <- cnbc_headlines_clean %>%
  select(Headline_YYYYMM, Headline_Date, Headlines) %>%
  mutate(linenumber = row_number()) %>%
  unnest_tokens(output = word, input = Headlines, token = "words", format = "text", to_lower = TRUE)
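
Before joining on the lexicon, it can be worth glancing at the most frequent tokens overall, with common function words removed via the stop_words data frame that ships with tidytext; a quick sketch:

tidy_cnbc_headlines %>%
  anti_join(stop_words, by = "word") %>%   # drop common stop words
  count(word, sort = TRUE)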

cnbc_sentiment <- tidy_cnbc_headlines %>%
  inner_join(loughran_sentiments, by = "word",
             relationship = "many-to-many") %>%  # some words carry more than one sentiment in loughran
  count(Headline_YYYYMM, Headline_Date, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
ggplot(cnbc_sentiment, aes(Headline_YYYYMM, sentiment)) +
  geom_col(show.legend = FALSE) +
  coord_flip()
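
The plot above aggregates the net score by month. For a finer-grained view, the same data can be plotted by day using the Headline_Date column, which also gives a proper date axis; a sketch:

ggplot(cnbc_sentiment, aes(Headline_Date, sentiment)) +
  geom_col(show.legend = FALSE) +
  labs(x = "Date", y = "Net sentiment (positive - negative)")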

Most Common Negative and Positive Words

One advantage of having a data frame with both sentiment and word is that we can analyze the word counts that contribute to each sentiment. By calling count() with both word and sentiment as arguments, we find out how much each word contributed to each sentiment. Because every step works on tidy data frames, the result can be piped straight into ggplot2 for a visual summary.

loughran_word_counts <- tidy_cnbc_headlines %>%
  inner_join(loughran_sentiments, by = "word",
             relationship = "many-to-many") %>%
  count(word, sentiment, sort = TRUE)
loughran_word_counts
## # A tibble: 431 × 3
##    word        sentiment       n
##    <chr>       <chr>       <int>
##  1 could       uncertainty   167
##  2 good        positive       57
##  3 may         uncertainty    55
##  4 best        positive       46
##  5 recession   negative       35
##  6 opportunity positive       32
##  7 warns       negative       30
##  8 bad         negative       27
##  9 better      positive       27
## 10 wrong       negative       26
## # ℹ 421 more rows
loughran_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>%   # top_n() is superseded; slice_max() keeps ties the same way
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()
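
One caveat: reorder() sorts words by their overall counts, not within each facet, so a word's position can look off in panels where it is less frequent. tidytext provides reorder_within() and scale_x_reordered() for per-facet ordering; a sketch of the same plot with those helpers:

loughran_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>%
  ungroup() %>%
  mutate(word = reorder_within(word, n, sentiment)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  scale_x_reordered() +   # strips the internal ___sentiment suffix from labels
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment", x = NULL) +
  coord_flip()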