1 Overview

This vignette demonstrates how to construct a basic sentiment analysis using public data and open-source tools. We focus on using BERT models for this analysis, calling the relevant Python packages from R.

library(tidytext)
library(stringr)
library(dplyr)
library(tidyverse)
library(lubridate)
library(quanteda)
library(data.table)
library(ggplot2)
library(gridExtra)
library(kableExtra)
library(pdftools)

2 Data Import

path_root = "."
path_data = file.path(path_root, "data")

load(file = file.path(path_data, "fed_raw_text.Rdata"))

raw_text %>% 
  select(report, rdate, ctext, page, line) 
# A tibble: 64,847 × 5
   report   rdate      ctext                      page  line
   <chr>    <date>     <chr>                     <int> <int>
 1 20080305 2008-03-05 ""                            1     1
 2 20080305 2008-03-05 "SUMMARY OF COMMENTARY O…     2     2
 3 20080305 2008-03-05 ""                            3     3
 4 20080305 2008-03-05 ""                            4     4
 5 20080305 2008-03-05 ""                            5     5
 6 20080305 2008-03-05 ""                            6     6
 7 20080305 2008-03-05 ""                            7     7
 8 20080305 2008-03-05 ""                            8     8
 9 20080305 2008-03-05 ""                            9     9
10 20080305 2008-03-05 ""                           10    10
# … with 64,837 more rows

3 Data Hierarchy

Let’s split the data into two hierarchical levels:

  • Line Level - a more granular dataset, one row per line of text, used to understand the relationships and occurrences of words within each document

  • Document Level - an overall document-level dataset, with the text of each report aggregated into a single row

Let’s take a look at what the two datasets look like:

#== Line Level ======
line_df <- raw_text %>% 
  select(date = rdate, texts = ctext) %>% 
  mutate(
    id = row_number(),
    fed = 1) %>% 
  select(id, date, texts, fed) %>% 
  arrange(id)

head(line_df)
# A tibble: 6 × 4
     id date       texts                                 fed
  <int> <date>     <chr>                               <dbl>
1     1 2008-03-05 ""                                      1
2     2 2008-03-05 "SUMMARY OF COMMENTARY ON CURRENT …     1
3     3 2008-03-05 ""                                      1
4     4 2008-03-05 ""                                      1
5     5 2008-03-05 ""                                      1
6     6 2008-03-05 ""                                      1

#== Document Level ======
doc_df <- raw_text %>% 
  select(id = report, date = rdate, texts = ctext) %>% 
  group_by(id, date) %>% 
  # NOTE: collapse = "" joins lines with no separator, which can merge words
  # across line breaks (e.g. "rosein" in the output further below); collapse = " " may be safer
  summarise(texts = paste0(texts, collapse = "")) %>% 
  ungroup() %>% 
  mutate(fed = 1) %>% 
  select(id, date, texts, fed) %>% 
  arrange(id)

head(doc_df)
# A tibble: 6 × 4
  id       date       texts                              fed
  <chr>    <date>     <chr>                            <dbl>
1 20080305 2008-03-05 "SUMMARY OF COMMENTARY ON CURRE…     1
2 20080416 2008-04-16 "SUMMARY OF COMMENTARY ON CURRE…     1
3 20080611 2008-06-11 "SUMMARY OF COMMENTARY ON CURRE…     1
4 20080723 2008-07-23 "SUMMARY OF COMMENTARY ON CURRE…     1
5 20080903 2008-09-03 "SUMMARY OF COMMENTARY ON CURRE…     1
6 20081015 2008-10-15 "SUMMARY OF COMMENTARY ON CURRE…     1

copy_doc_df <- doc_df

4 Transformers

The Transformer is a relatively recent deep learning architecture built around positional encoding and the attention mechanism. Despite its novelty, it quickly took over as the state of the art within the Natural Language Processing (NLP) domain. Because self-attention gives each position direct access to every other position in the sequence, little to no information is lost when parsing a message, in contrast to recurrent models that pass information along step by step. The self-attention mechanism also delivers the benefit of bidirectional RNNs without the roughly 2x computational cost of running two recurrent passes.
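To make the attention idea concrete, below is a minimal NumPy sketch of scaled dot-product self-attention. It is illustrative only: a real Transformer adds learned query/key/value projections, multiple heads, and positional encodings.

import numpy as np

def self_attention(X):
    # Scaled dot-product attention with Q = K = V = X.
    # Every row (token) attends directly to every other row.
    d_k = X.shape[-1]
    scores = X @ X.T / np.sqrt(d_k)                       # pairwise similarity, shape (n, n)
    scores = scores - scores.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ X                                    # weighted mixture of all tokens

X = np.random.rand(4, 8)        # 4 toy token embeddings of dimension 8
print(self_attention(X).shape)  # (4, 8): each output token summarizes the whole sequence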

Recommended materials:

  1. Attention Is All You Need - the original paper from Google (https://arxiv.org/abs/1706.03762)

  2. Detailed explanation of Transformers by D2L (http://d2l.ai/chapter_attention-mechanisms/transformer.html)

4.1 Sentiment Analysis with BERT

We will use the doc_df created above, since BERT expects sentence-structured input due to its use of the attention mechanism and positional encoding. We implement BERT through the reticulate library, which lets us call Python functions from R.

4.1.1 Using NLTK to Separate Sentences

There are two ways to use Python from R. With the first method, shown below, we import Python modules directly into the R environment and use them as needed.

Here, we use the Natural Language Toolkit (NLTK) Python package to tokenize the text and split it into sentences.

library(reticulate)

nltk = import("nltk")

nltk$download("punkt")
[1] TRUE

# Initialize the accumulators
id = c(); date = c(); final_sentences = c()

# Split each report into sentences, repeating its id and date once per sentence
for (i in seq_along(copy_doc_df$texts)){
  sentences = nltk$tokenize$sent_tokenize(copy_doc_df$texts[i])
  id = append(id, rep(copy_doc_df$id[i], length(sentences)))
  date = append(date, rep(copy_doc_df$date[i], length(sentences)))
  final_sentences = append(final_sentences, sentences)
}

BERT_df = data.frame(id, date, final_sentences)

4.2 Using BERT via the Available Pipeline

An alternative to importing modules is to use the Python interpreter directly, as shown below. A variable defined earlier in the R environment can be accessed from Python as r.variableName.

We use the pre-trained BERT sentiment analysis pipeline from the Transformers package to generate a sentiment classification for each sentence in our input.

import transformers

from transformers import pipeline
import numpy as np
import pandas as pd

sentiment_analysis = pipeline("sentiment-analysis")
No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
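To silence this warning and keep results reproducible, the checkpoint can be pinned explicitly when creating the pipeline, for example:

# Pin the model instead of relying on the pipeline default
sentiment_analysis = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english")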
label = []
confidence = []
error_sentences = []

for sentence in r.BERT_df['final_sentences']:
  try:
    result = sentiment_analysis(sentence)
    label += [result[0]['label']]
    confidence += [result[0]['score']]
  except Exception:  # over-long or malformed sentences fail; record them and move on
    label += [np.nan]
    confidence += [np.nan]
    error_sentences += [sentence]
Token indices sequence length is longer than the specified maximum sequence length for this model (518 > 512). Running this sequence through the model will result in indexing errors
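The warning above comes from the occasional sentence that exceeds BERT's 512-token limit; such sentences raise an error and land in error_sentences. If you would rather score them on their first 512 tokens, recent versions of transformers let the pipeline truncate (a mitigation we suggest here, not part of the original run):

# Truncate over-long inputs to the model's maximum length instead of failing
result = sentiment_analysis(sentence, truncation=True)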
BERT_df_sentiment = pd.DataFrame(r.BERT_df)

BERT_df_sentiment["Label"] = label
BERT_df_sentiment["ConfidenceScore"] = confidence

# Saving it to a csv file
BERT_df_sentiment.to_csv("data/BERT_df_sentiment.csv")
BERT_df_sentiment.head()
         id        date  ...     Label ConfidenceScore
0  20080305  2008-03-05  ...  NEGATIVE        0.702392
1  20080305  2008-03-05  ...  POSITIVE        0.644930
2  20080305  2008-03-05  ...  POSITIVE        0.973016
3  20080305  2008-03-05  ...  NEGATIVE        0.994362
4  20080305  2008-03-05  ...  NEGATIVE        0.731147

[5 rows x 5 columns]

4.3 Sentiment by Report Date

With 1 representing sentences classified as "POSITIVE" and 0 for those classified as "NEGATIVE", we compute the mean sentiment score across all sentences within each published report.

BERT_df_sentiment["SentimentScore"] = list(map(lambda x: 1 if x == "POSITIVE" else 0, BERT_df_sentiment["Label"]))

df_sentiment_by_date = BERT_df_sentiment.groupby(["date"]).mean().reset_index()

df_sentiment_by_date.date = df_sentiment_by_date.date.astype(str)

df_sentiment_by_date.head(n=10)
         date  ConfidenceScore  SentimentScore
0  2008-03-05         0.960681        0.220339
1  2008-04-16         0.964310        0.249617
2  2008-06-11         0.956411        0.243243
3  2008-07-23         0.964427        0.217391
4  2008-09-03         0.961305        0.196557
5  2008-10-15         0.966293        0.162275
6  2008-12-03         0.972762        0.110795
7  2009-01-14         0.976464        0.097662
8  2009-03-04         0.973050        0.103801
9  2009-04-15         0.973187        0.151773
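One caveat: pandas 2.0 changed groupby().mean() to raise an error on non-numeric columns instead of silently dropping them, so on a recent pandas the aggregation above may need the numeric columns selected explicitly:

# Equivalent aggregation that is robust to newer pandas versions
df_sentiment_by_date = (BERT_df_sentiment
    .groupby("date")[["ConfidenceScore", "SentimentScore"]]
    .mean()
    .reset_index())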

4.4 Plotting the Sentiment by Report Date (BERT)

ggplot(data = py$df_sentiment_by_date, 
       aes(x = as.Date(date), y = SentimentScore, group = 1)) +
  geom_line(color = "#27408b")+
  geom_point(shape = 21, fill = "white", color = "#27408b", size = 1, stroke = 1.1)+
  scale_y_continuous(labels = scales::comma)+
  scale_x_date(date_labels = "%b/%Y")+
  theme(plot.title = element_text(size = 13))+
  labs(
    x = "Year", 
    y = "Average Sentiment (BERT)",
    title = "Average Sentiment (BERT) of each Federal Reserve Biege Book",
    subtitle = "March 2008 - Oct 2020",
    caption = "Source: US FED Beige Book"
  )

If you would like to create your own pipeline or understand the basic structure of BERT, refer to this article: https://curiousily.com/posts/sentiment-analysis-with-bert-and-hugging-face-using-pytorch-and-python/

4.5 Improvements to the BERT Model

  • Better Data - The data used in this vignette is not fully processed: many lines are not proper sentences, and a few contain illegible text. Investing more time in cleaning and processing the data would improve performance significantly.

  • Customize BERT - BERT is a customizable model: we can annotate a portion of our data and fine-tune BERT on what we are trying to accomplish. We can even introduce new sentiment classes such as Neutral or Overwhelmingly Positive/Negative. An example of fine-tuning BERT can be found here: https://skimai.com/fine-tuning-bert-for-sentiment-analysis/ (refer to Chapter 3: Train Our Model to set up the uncased BERT model and train it on your own dataset); a minimal sketch also follows this list.

  • Use an Appropriate BERT Model - There is a wide array of BERT-family models, such as BART for text summarization and FinBERT for financial and company reports. Choosing the model suited to the use case helps the performance of the analysis.
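As a minimal illustration of the fine-tuning route, the sketch below uses the Hugging Face Trainer API. The CSV path, column names, and hyperparameters are placeholders of our own; it assumes a labeled dataset with text and label (0/1/2) columns.

from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Three classes, e.g. negative / neutral / positive
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)

# Hypothetical labeled data: one sentence per row with an integer label
dataset = load_dataset("csv", data_files={"train": "data/labeled_sentences.csv"})
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=128),
    batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-finetuned", num_train_epochs=3),
    train_dataset=dataset["train"])
trainer.train()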

4.6 FinBERT Application

FinBERT is a pre-trained NLP model for analyzing the sentiment of financial text. It is built by further training the BERT language model on a large financial corpus, fine-tuning it for financial sentiment classification.
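If you just want to experiment with FinBERT without cloning the repository (as we do below), the same model is published on the Hugging Face Hub as ProsusAI/finbert and can be loaded through the standard pipeline:

from transformers import pipeline

# Load the FinBERT checkpoint published on the Hugging Face Hub
finbert = pipeline("sentiment-analysis", model="ProsusAI/finbert")
finbert("Farm incomes declined in several Districts.")
# e.g. [{'label': 'negative', 'score': ...}] (scores will vary)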

Recommended Reading Materials:

  1. Fine-tuning FinBERT (https://github.com/ProsusAI/finBERT/blob/master/notebooks/finbert_training.ipynb)

4.7 Using FinBERT

Here, we will reuse the results from BERT, and compare them to the sentiment scores produced by FinBERT.

# Set the working directory to the finBERT folder within the model folder
import os
os.chdir(os.path.join(os.getcwd(), "model/finBERT"))

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import nltk
from nltk.tokenize import sent_tokenize
import pandas as pd
import numpy as np
# The star import supplies the helpers used below:
# chunks, InputExample, convert_examples_to_features, softmax, logger
from UtilityTools import *
import finbert

nltk.download('punkt')
True
# Load the fine-tuned FinBERT sentiment weights from the local models/sentiment folder
model = AutoModelForSequenceClassification.from_pretrained("models/sentiment")
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

4.8 Applying FinBERT

The native FinBERT model classifies sentiment into three classes: positive, neutral, and negative. FinBERT also provides the intensity of the sentiment on a [-1,1] scale, where -1 indicates overwhelmingly negative and 1 indicates overwhelmingly positive; in the code below, this score is computed as the softmax probability of the positive class minus that of the negative class.

label_list = ['positive', 'negative', 'neutral']
label_dict = {0: 'positive', 1: 'negative', 2: 'neutral'}
result = pd.DataFrame(columns=['sentence', 'logit', 'prediction', 'sentiment_score'])

sentences = list(BERT_df_sentiment["final_sentences"])

# Suppress logging messages from UtilityTools (level 30 = logging.WARNING)
logger.setLevel(30)

# Score the sentences in batches of 5
for batch in chunks(sentences, 5):
  examples = [InputExample(str(i), sentence) for i, sentence in enumerate(batch)]

  # Tokenize, padding/truncating each example to at most 64 tokens
  features = convert_examples_to_features(examples, label_list, 64, tokenizer)

  all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
  all_attention_mask = torch.tensor([f.attention_mask for f in features], dtype=torch.long)
  all_token_type_ids = torch.tensor([f.token_type_ids for f in features], dtype=torch.long)

  with torch.no_grad():
    logits = model(all_input_ids, all_attention_mask, all_token_type_ids)[0]
    logits = softmax(np.array(logits))  # convert to class probabilities (positive, negative, neutral)
    sentiment_score = pd.Series(logits[:, 0] - logits[:, 1])  # P(positive) - P(negative)
    predictions = np.squeeze(np.argmax(logits, axis=1))

    batch_result = {'sentence': batch,
                    'logit': list(logits),
                    'prediction': predictions,
                    'sentiment_score': sentiment_score}

    batch_result = pd.DataFrame(batch_result)
    result = pd.concat([result, batch_result], ignore_index=True)
Token indices sequence length is longer than the specified maximum sequence length for this model (516 > 512). Running this sequence through the model will result in indexing errors
# Converting prediction from class (0,1,2) to sentiment (negative,neutral,positive)
result['prediction'] = result["prediction"].apply(lambda x: label_dict[x])

# Check that the sentences and their order are the same (0 mismatches expected)
sum(BERT_df_sentiment["final_sentences"] != result["sentence"])
0

# Port the date over from BERT_df_sentiment so that we can analyze by date
result['date'] = BERT_df_sentiment["date"]

# Navigate two levels up, back to the project root
os.chdir(os.path.abspath(os.path.join(os.getcwd(), os.pardir)))
os.chdir(os.path.abspath(os.path.join(os.getcwd(), os.pardir)))
result.to_csv("data/FinBERT.csv")

result.head()
                                            sentence  ...        date
0  SUMMARY OF COMMENTARY ON CURRENT ECONOMIC COND...  ...  2008-03-05
1  VI-1Seventh District – Chicago ………………………………………...  ...  2008-03-05
2  XI-1Twelfth District – San Francisco ..……….………...  ...  2008-03-05
3  Several Districts noted declines in sales of b...  ...  2008-03-05
4  Farm incomes and/or value of production rosein...  ...  2008-03-05

[5 rows x 5 columns]

4.9 Looking at the Reports’ Sentiment (FinBERT) Throughout the Years

FB_sentiment_by_date = result[["date","logit","sentiment_score"]].groupby(["date"]).mean().reset_index()

FB_sentiment_by_date.date = FB_sentiment_by_date.date.astype(str)

FB_sentiment_by_date.head(n=10)
         date                                  logit  sentiment_score
0  2008-03-05    [0.2652003, 0.57976055, 0.15503936]        -0.314560
1  2008-04-16    [0.2845241, 0.55292267, 0.16255294]        -0.268399
2  2008-06-11    [0.28699178, 0.5470741, 0.16593392]        -0.260083
3  2008-07-23   [0.24519339, 0.57913643, 0.17567025]        -0.333943
4  2008-09-03    [0.27335647, 0.5567819, 0.16986164]        -0.283425
5  2008-10-15   [0.21417552, 0.63015056, 0.15567341]        -0.415976
6  2008-12-03  [0.16563646, 0.71037775, 0.123986036]        -0.544741
7  2009-01-14    [0.16178429, 0.7197153, 0.11850039]        -0.557931
8  2009-03-04    [0.15317762, 0.7147659, 0.13205646]        -0.561589
9  2009-04-15     [0.1963015, 0.6658742, 0.13782418]        -0.469573

Using ggplot to plot the FinBERT series:

ggplot(data=py$FB_sentiment_by_date, aes(x=as.Date(date), y=sentiment_score, group=1)) +
  geom_line(color = "#27408b")+
  geom_point(shape = 21, fill = "white", color = "#27408b", size = 1, stroke = 1.1)+
  scale_y_continuous(labels = scales::comma)+
  scale_x_date(date_labels = "%b/%Y")+
  theme(plot.title = element_text(size=13))+
  labs(
    x = "Year", 
    y = "Average Sentiment (FinBERT)",
    title = "Average Sentiment (FinBERT) of each Federal Reserve Biege Book",
    subtitle = "March 2008 - Oct 2020",
    caption = "Source: US FED Beige Book"
  )

4.10 Comparing Results of BERT and FinBERT

To compare the results of BERT and FinBERT, we need to rescale FinBERT's [-1,1] output to [0,1]; the affine map x -> (x+1)/2 does this, sending -1 to 0, 0 to 0.5, and 1 to 1.

FB_sentiment_by_date["NormalizedFinBERT"] = list(map(lambda x: (x+1)/2, FB_sentiment_by_date["sentiment_score"]))
FB_sentiment_by_date
           date  ... NormalizedFinBERT
0    2008-03-05  ...          0.342720
1    2008-04-16  ...          0.365801
2    2008-06-11  ...          0.369959
3    2008-07-23  ...          0.333028
4    2008-09-03  ...          0.358287
..          ...  ...               ...
110  2022-03-02  ...          0.483172
111  2022-04-20  ...          0.489750
112  2022-06-01  ...          0.443476
113  2022-07-13  ...          0.407980
114  2022-09-07  ...          0.376452

[115 rows x 4 columns]

Now we will plot both series together using ggplot:

# Reuse the SentimentScore column name so the two data frames can be row-bound
py$FB_sentiment_by_date$SentimentScore = py$FB_sentiment_by_date$NormalizedFinBERT

df_plot <- py$FB_sentiment_by_date %>% 
  mutate(Type = "FinBERT") %>% 
  bind_rows(py$df_sentiment_by_date %>% mutate(Type = 'BERT'))


ggplot(df_plot,aes(x=as.Date(date), y=SentimentScore, color = Type)) +
  geom_line()+
  scale_y_continuous(labels = scales::comma)+
  scale_x_date(date_labels = "%b/%Y")+
  theme(plot.title = element_text(size=13))+
  labs(
    x = "Year", 
    y = "Average Sentiment",
    title = "Average Sentiment of each Federal Reserve Biege Book via BERT and FinBERT",
    subtitle = "March 2008 - Oct 2020",
    caption = "Source: US FED Beige Book"
  )

4.10.1 Looking at the Change in Sentiment over the Last Report

We may be able to derive additional insights by looking at the change in sentiment compared to the previously released report.
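The delta computed in the R code below is a simple period-over-period percent change; the same quantity could be obtained on the Python side with pandas' pct_change:

# Percent change of each report's score over the previous report
FB_sentiment_by_date["delta"] = FB_sentiment_by_date["NormalizedFinBERT"].pct_change() * 100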

py$FB_sentiment_by_date <- py$FB_sentiment_by_date %>% 
  mutate(delta = (NormalizedFinBERT-lag(NormalizedFinBERT))/lag(NormalizedFinBERT)*100)

py$df_sentiment_by_date <- py$df_sentiment_by_date %>% 
  mutate(delta = (SentimentScore-lag(SentimentScore))/lag(SentimentScore)*100)

df_change_plot <- py$FB_sentiment_by_date %>% 
  mutate(Type = "FinBERT") %>% 
  bind_rows(
    py$df_sentiment_by_date %>% 
      mutate(Type = 'BERT')
  )

ggplot(df_change_plot,aes(x=as.Date(date), y=delta, color = Type)) +
  geom_line()+
  scale_y_continuous(labels = scales::comma)+
  scale_x_date(date_labels = "%b/%Y")+
  theme(plot.title = element_text(size=13))+
  labs(
    x = "Year", 
    y = "Change in Sentiment (%)",
    title = "Change in Sentiment of the Federal Reserve Biege Book via BERT and FinBERT",
    subtitle = "March 2008 - Oct 2020",
    caption = "Source: US FED Beige Book"
  )