In Text Mining with R, Chapter 2 looks at Sentiment Analysis. In this assignment, you should start by getting the primary example code from Chapter 2 working in an R Markdown document. You should provide a citation to this base code. You're then asked to extend the code in two ways: work with a different corpus of your choosing, and incorporate at least one additional sentiment lexicon.
As usual, please submit links to both an .Rmd file posted in your GitHub repository and to your code on rpubs.com. You may work in a small team on this assignment.
Sentiment Analysis of US Financial News Headlines Data
For this assignment I am going to perform a sentiment analysis on US Financial News Headlines data, which were obtained from Kaggle.com at the address below:
https://www.kaggle.com/notlucasp/financial-news-headlines
Context
The data consist of three sets scraped from the official CNBC, Guardian, and Reuters websites. The headlines in these datasets reflect the state of the U.S. economy and the stock market, day by day, over roughly the past one to two years.
Content
library(tidyverse)
library(tidytext)
library(textdata) # Needed for loughran lexicon
library(ggplot2)
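The Kaggle dataset ships as three CSV files, one per outlet; this analysis focuses on the CNBC file. For completeness, the other two can be read the same way, as a sketch (file names assumed from the Kaggle listing; adjust paths as needed):
# The other two outlets' headline files from the same Kaggle dataset;
# file names are assumed from the Kaggle listing
guardian_csv <- read_csv('guardian_headlines.csv')
reuters_csv <- read_csv('reuters_headlines.csv')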
Let's use the Loughran-McDonald ("loughran") lexicon to perform the sentiment analysis.
loughran_sentiments <- get_sentiments("loughran")
Let's take a peek at the sentiments from the "loughran" lexicon.
loughran_sentiments
## # A tibble: 4,150 x 2
## word sentiment
## <chr> <chr>
## 1 abandon negative
## 2 abandoned negative
## 3 abandoning negative
## 4 abandonment negative
## 5 abandonments negative
## 6 abandons negative
## 7 abdicated negative
## 8 abdicates negative
## 9 abdicating negative
## 10 abdication negative
## # ... with 4,140 more rows
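Unlike general-purpose lexicons, the Loughran-McDonald lexicon was built for financial text and goes beyond positive/negative: it also tags words as uncertainty, litigious, constraining, or superfluous. A quick count shows how the lexicon's words are distributed across categories (a sketch; output omitted):
# Tally the lexicon's words by sentiment category
loughran_sentiments %>%
  count(sentiment, sort = TRUE)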
cnbc_csv <- read_csv('cnbc_headlines.csv')
head(cnbc_csv, 10)
## # A tibble: 10 x 3
## Headlines Time Description
## <chr> <chr> <chr>
## 1 Jim Cramer: A better way to in~ 7:51 PM ET ~ "\"Mad Money\" host Jim Cramer~
## 2 Cramer's lightning round: I wo~ 7:33 PM ET ~ "\"Mad Money\" host Jim Cramer~
## 3 <NA> <NA> <NA>
## 4 Cramer's week ahead: Big week ~ 7:25 PM ET ~ "\"We'll pay more for the earn~
## 5 IQ Capital CEO Keith Bliss say~ 4:24 PM ET ~ "Keith Bliss, IQ Capital CEO, ~
## 6 Wall Street delivered the 'kin~ 7:36 PM ET ~ "\"Look for the stocks of high~
## 7 Cramer's lightning round: I wo~ 7:23 PM ET ~ "\"Mad Money\" host Jim Cramer~
## 8 Acorns CEO: Parents can turn $~ 8:03 PM ET ~ "Investing $5 per day can comp~
## 9 Dividend cuts may mean rethink~ 8:54 AM ET ~ "Hundreds of companies have cu~
## 10 <NA> <NA> <NA>
# Remove rows where every column value is NA
cnbc_headlines <- cnbc_csv[rowSums(is.na(cnbc_csv)) != ncol(cnbc_csv), ]
head(cnbc_headlines)
## # A tibble: 6 x 3
## Headlines Time Description
## <chr> <chr> <chr>
## 1 Jim Cramer: A better way to in~ 7:51 PM ET ~ "\"Mad Money\" host Jim Cramer ~
## 2 Cramer's lightning round: I wo~ 7:33 PM ET ~ "\"Mad Money\" host Jim Cramer ~
## 3 Cramer's week ahead: Big week ~ 7:25 PM ET ~ "\"We'll pay more for the earni~
## 4 IQ Capital CEO Keith Bliss say~ 4:24 PM ET ~ "Keith Bliss, IQ Capital CEO, j~
## 5 Wall Street delivered the 'kin~ 7:36 PM ET ~ "\"Look for the stocks of high-~
## 6 Cramer's lightning round: I wo~ 7:23 PM ET ~ "\"Mad Money\" host Jim Cramer ~
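For reference, the same all-NA row removal can be written in dplyr style; a sketch, assuming dplyr 1.0.4+ for if_all():
# Equivalent: keep only rows where at least one column is non-missing
cnbc_headlines <- cnbc_csv %>%
  filter(!if_all(everything(), is.na))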
First, we need to take the text of the headlines and convert it to the tidy one-word-per-row format using unnest_tokens(). Let's also set up a column to keep track of which headline each word comes from.
Add new columns to the dataframe containing the headline date and month (YYYY-MM).
# Add new columns with the parsed headline date and its month (YYYY-MM).
# The date is everything after the last ", " in the Time string.
# No rowwise() is needed: sub() and as.Date() are vectorized, and a
# rowwise frame would break row_number() downstream.
cnbc_headlines <- cnbc_headlines %>%
  mutate(Headline_Date = as.Date(sub(".*, ", "", Time), format = "%d %B %Y"),
         Headline_YYYYMM = format(Headline_Date, "%Y-%m"))
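A quick sanity check on the parsing is worthwhile, since any NA dates would signal a format (or locale) mismatch; a sketch:
# Spot-check parsed dates; NAs mean the format string didn't match.
# Note %B expects English month names, so results are locale-dependent.
cnbc_headlines %>%
  select(Time, Headline_Date, Headline_YYYYMM) %>%
  head()
sum(is.na(cnbc_headlines$Headline_Date))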
Convert headlines to tidytext format
tidy_cnbc_headlines <- cnbc_headlines %>%
  select(Headline_YYYYMM, Headline_Date, Headlines) %>%
  mutate(linenumber = row_number()) %>%
  unnest_tokens(output = word, input = Headlines, token = "words", format = "text", to_lower = TRUE)
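To confirm the one-word-per-row structure, a quick peek (output omitted):
# Each row should now hold a single lowercase token plus its headline metadata
tidy_cnbc_headlines %>%
  select(linenumber, Headline_Date, word) %>%
  head()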
First, we find a sentiment score for each word using the "loughran" lexicon and inner_join().
Next, we count how many positive and negative words appear in each day's headlines.
We then use spread() so that negative and positive sentiment are in separate columns, and lastly calculate a net sentiment (positive - negative).
cnbc_sentiment <- tidy_cnbc_headlines %>%
  inner_join(loughran_sentiments, by = "word") %>%
  count(Headline_YYYYMM, Headline_Date, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
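Note that spread() is superseded in current tidyr; the same step can be written with pivot_wider(), as a sketch:
# Equivalent using pivot_wider() (tidyr >= 1.1)
cnbc_sentiment <- tidy_cnbc_headlines %>%
  inner_join(loughran_sentiments, by = "word") %>%
  count(Headline_YYYYMM, Headline_Date, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)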
ggplot(cnbc_sentiment, aes(Headline_YYYYMM, sentiment)) +
  geom_col(show.legend = FALSE) +
  coord_flip()
Most Common Positive and Negative Words
One advantage of having the data frame with both sentiment and word is that we can analyze word counts that contribute to each sentiment. By implementing count() here with arguments of both word and sentiment, we find out how much each word contributed to each sentiment.
loughran_word_counts <- tidy_cnbc_headlines %>%
  inner_join(get_sentiments("loughran"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
loughran_word_counts
## # A tibble: 431 x 3
## word sentiment n
## <chr> <chr> <int>
## 1 could uncertainty 167
## 2 good positive 57
## 3 may uncertainty 55
## 4 best positive 46
## 5 recession negative 35
## 6 opportunity positive 32
## 7 warns negative 30
## 8 bad negative 27
## 9 better positive 27
## 10 wrong negative 26
## # ... with 421 more rows
This can be shown visually, and we can pipe straight into ggplot2 because we are consistently using tools built for handling tidy data frames.
loughran_word_counts %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()
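As a possible extension, the same word-count pipeline can be rerun against the general-purpose "bing" lexicon for comparison; a sketch:
# Compare against the bing lexicon, which labels words only positive/negative
bing_word_counts <- tidy_cnbc_headlines %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE)

bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment (bing)", x = NULL) +
  coord_flip()
Modal words such as "could" and "may" are not sentiment words in bing, so the uncertainty signal seen above is something only the finance-specific Loughran-McDonald lexicon captures.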