This vignette demonstrates how to construct a basic sentiment analysis using public data and open-source tools. We focus on applying BERT models to this task, leveraging several Python packages from within R.
library(tidytext)
library(stringr)
library(dplyr)
library(tidyverse)
library(lubridate)
library(quanteda)
library(data.table)
library(ggplot2)
library(gridExtra)
library(kableExtra)
library(pdftools)
path_root = "."
path_data = file.path(path_root, "data")
load(file = file.path(path_data, "fed_raw_text.Rdata"))
raw_text %>%
  select(report, rdate, ctext, page, line)
# A tibble: 64,847 × 5
report rdate ctext page line
<chr> <date> <chr> <int> <int>
1 20080305 2008-03-05 "" 1 1
2 20080305 2008-03-05 "SUMMARY OF COMMENTARY O… 2 2
3 20080305 2008-03-05 "" 3 3
4 20080305 2008-03-05 "" 4 4
5 20080305 2008-03-05 "" 5 5
6 20080305 2008-03-05 "" 6 6
7 20080305 2008-03-05 "" 7 7
8 20080305 2008-03-05 "" 8 8
9 20080305 2008-03-05 "" 9 9
10 20080305 2008-03-05 "" 10 10
# … with 64,837 more rows
Let’s split the data into two hierarchical levels:
Line Level - a more granular, line-level dataset for studying the relationships and occurrences of words within each document
Document Level - an overall document-level dataset, with the text of each report aggregated into a single row
Let’s take a look at what the two datasets look like:
#== Line Level ======
line_df <- raw_text %>%
  select(date = rdate, texts = ctext) %>%
  mutate(
    id = row_number(),
    fed = 1) %>%
  select(id, date, texts, fed) %>%
  arrange(id)
head(line_df)
# A tibble: 6 × 4
id date texts fed
<int> <date> <chr> <dbl>
1 1 2008-03-05 "" 1
2 2 2008-03-05 "SUMMARY OF COMMENTARY ON CURRENT … 1
3 3 2008-03-05 "" 1
4 4 2008-03-05 "" 1
5 5 2008-03-05 "" 1
6 6 2008-03-05 "" 1
#== Document Level ======
doc_df <- raw_text %>%
  select(id = report, date = rdate, texts = ctext) %>%
  group_by(id, date) %>%
  # Note: collapsing with "" joins lines without a separator, which can merge
  # words across line breaks; collapse = " " may be preferable
  summarise(texts = paste0(texts, collapse = "")) %>%
  ungroup() %>%
  mutate(fed = 1) %>%
  select(id, date, texts, fed) %>%
  arrange(id)
head(doc_df)
# A tibble: 6 × 4
id date texts fed
<chr> <date> <chr> <dbl>
1 20080305 2008-03-05 "SUMMARY OF COMMENTARY ON CURRE… 1
2 20080416 2008-04-16 "SUMMARY OF COMMENTARY ON CURRE… 1
3 20080611 2008-06-11 "SUMMARY OF COMMENTARY ON CURRE… 1
4 20080723 2008-07-23 "SUMMARY OF COMMENTARY ON CURRE… 1
5 20080903 2008-09-03 "SUMMARY OF COMMENTARY ON CURRE… 1
6 20081015 2008-10-15 "SUMMARY OF COMMENTARY ON CURRE… 1
copy_doc_df <- doc_df
The Transformer is a relatively new deep learning architecture built on positional encoding and an attention mechanism. Although introduced only recently, it quickly became the state of the art in Natural Language Processing (NLP). Because self-attention gives every position in the sequence direct access to every other position, little to no information is lost when parsing a long passage. It also gains the benefit of bidirectional RNNs without roughly doubling the computational cost.
Recommended materials:
Attention is all you need - published by Google (https://arxiv.org/abs/1706.03762)
Detailed explanation of Transformers by D2L (http://d2l.ai/chapter_attention-mechanisms/transformer.html)
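To make the attention mechanism concrete, here is a minimal numpy sketch of the scaled dot-product attention described in the paper above; the toy inputs are hypothetical.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Toy self-attention: 4 tokens with 8-dimensional embeddings (Q = K = V)
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8): each token is now a weighted mix of all tokens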
We will use the doc_df created above, since BERT expects sentence-level inputs owing to its attention mechanism and positional encoding. We implement BERT by using the reticulate package to call Python functions from R.
There are two ways to use Python from R. With the first method, shown below, we import Python modules directly into the R session and use them as needed.
Here, we use the Natural Language Toolkit (NLTK) Python package to tokenize our input into sentences.
library(reticulate)
nltk = import("nltk")
nltk$download("punkt")
[1] TRUE
# Initialize accumulators
id = c(); date = c(); final_sentences = c()

# Split each report into sentences, carrying along its id and date
for (i in 1:length(copy_doc_df$texts)){
  sentences = nltk$tokenize$sent_tokenize(copy_doc_df$texts[i])
  id = append(id, rep(copy_doc_df$id[i], length(sentences)))
  date = append(date, rep(copy_doc_df$date[i], length(sentences)))
  final_sentences = append(final_sentences, sentences)
}
BERT_df = data.frame(id, date, final_sentences)
An alternative is to use the Python interpreter directly, as shown below. A variable defined in the R session can be accessed from Python as r.variableName; conversely, objects created in Python are visible from R as py$objectName, which we use later for plotting.
We use the pre-trained sentiment-analysis pipeline from the transformers package to generate a sentiment classification for each sentence in our input.
import transformers
from transformers import pipeline
import numpy as np
import pandas as pd
sentiment_analysis = pipeline("sentiment-analysis")
No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
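One way to address this warning is to pin the model and revision explicitly, using the defaults named in the message above, so results stay reproducible even if the library's default ever changes:

# Pin the model and revision named in the warning above for reproducibility
sentiment_analysis = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    revision="af0f99b",
)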
label = []
confidence = []
error_sentences = []

# Classify each sentence; inputs the model cannot handle are recorded as NaN
for sentence in r.BERT_df['final_sentences']:
    try:
        result = sentiment_analysis(sentence)
        label += [result[0]['label']]
        confidence += [result[0]['score']]
    except Exception:
        label += [np.nan]
        confidence += [np.nan]
        error_sentences += [sentence]
Token indices sequence length is longer than the specified maximum sequence length for this model (518 > 512). Running this sequence through the model will result in indexing errors
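This warning comes from sentences longer than BERT's 512-token limit; those are the inputs caught by the except branch above. In recent versions of transformers the text-classification pipeline forwards tokenizer keyword arguments, so an alternative (under that version assumption) is to truncate long inputs rather than skip them:

# Truncate over-long inputs to the model's 512-token limit instead of letting
# them fail (assumes a transformers version whose pipeline forwards tokenizer
# kwargs such as truncation)
result = sentiment_analysis("a very long sentence ...", truncation=True)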
BERT_df_sentiment = pd.DataFrame(r.BERT_df)
BERT_df_sentiment["Label"] = label
BERT_df_sentiment["ConfidenceScore"] = confidence
# Saving it to a csv file
BERT_df_sentiment.to_csv("data/BERT_df_sentiment.csv")
BERT_df_sentiment.head()
id date ... Label ConfidenceScore
0 20080305 2008-03-05 ... NEGATIVE 0.702392
1 20080305 2008-03-05 ... POSITIVE 0.644930
2 20080305 2008-03-05 ... POSITIVE 0.973016
3 20080305 2008-03-05 ... NEGATIVE 0.994362
4 20080305 2008-03-05 ... NEGATIVE 0.731147
[5 rows x 5 columns]
With 1 representing sentences classified as Positive and 0 representing those classified as Negative, we compute the mean sentiment score across all sentences within each published report.
BERT_df_sentiment["SentimentScore"] = list(map(lambda x: 1 if x == "POSITIVE" else 0, BERT_df_sentiment["Label"]))
df_sentiment_by_date = BERT_df_sentiment.groupby(["date"]).mean().reset_index()
df_sentiment_by_date.date = df_sentiment_by_date.date.astype(str)
df_sentiment_by_date.head(n=10)
date ConfidenceScore SentimentScore
0 2008-03-05 0.960681 0.220339
1 2008-04-16 0.964310 0.249617
2 2008-06-11 0.956411 0.243243
3 2008-07-23 0.964427 0.217391
4 2008-09-03 0.961305 0.196557
5 2008-10-15 0.966293 0.162275
6 2008-12-03 0.972762 0.110795
7 2009-01-14 0.976464 0.097662
8 2009-03-04 0.973050 0.103801
9 2009-04-15 0.973187 0.151773
ggplot(data = py$df_sentiment_by_date,
       aes(x = as.Date(date), y = SentimentScore, group = 1)) +
  geom_line(color = "#27408b") +
  geom_point(shape = 21, fill = "white", color = "#27408b", size = 1, stroke = 1.1) +
  scale_y_continuous(labels = scales::comma) +
  scale_x_date(date_labels = "%b/%Y") +
  theme(plot.title = element_text(size = 13)) +
  labs(
    x = "Year",
    y = "Average Sentiment (BERT)",
    title = "Average Sentiment (BERT) of each Federal Reserve Beige Book",
    subtitle = "March 2008 - September 2022",
    caption = "Source: US FED Beige Book"
  )
If you would like to create your own pipeline or understand the basic structure of BERT, refer to this article: https://curiousily.com/posts/sentiment-analysis-with-bert-and-hugging-face-using-pytorch-and-python/
Better Data - The data used in this vignette is not fully processed: many lines are not in proper sentence structure, and a few sentences contain illegible text. Investing more time in cleaning and processing the data would improve performance significantly.
Customize BERT - BERT can be fine-tuned: we can annotate part of our data and train the model on the classes we care about, even introducing new classes such as Neutral or Overwhelmingly Positive/Negative. An example of fine-tuning BERT can be found here: https://skimai.com/fine-tuning-bert-for-sentiment-analysis/ (refer to Chapter 3: Train our Model to set up the uncased BERT model and train it on your own dataset); a minimal fine-tuning sketch follows this list.
Use an Appropriate BERT Model - There is a wide array of BERT-family models, such as BART for text summarization and FinBERT for financial text. Choosing a model suited to the use case helps performance.
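As a companion to the fine-tuning guide linked above, here is a minimal sketch of a BERT fine-tuning loop with transformers and PyTorch; the texts and labels are hypothetical placeholders standing in for an annotated dataset.

# Minimal fine-tuning sketch (hypothetical two-example dataset; in practice
# you would annotate hundreds or thousands of sentences)
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

texts = ["Activity expanded modestly.", "Sales fell sharply."]  # placeholders
labels = [1, 0]                                                 # 1 = positive, 0 = negative

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

enc = tokenizer(texts, padding=True, truncation=True, max_length=64,
                return_tensors="pt")
dataset = TensorDataset(enc["input_ids"], enc["attention_mask"],
                        torch.tensor(labels))
loader = DataLoader(dataset, batch_size=2, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(2):  # a couple of epochs is enough for a sketch
    for input_ids, attention_mask, y in loader:
        out = model(input_ids=input_ids, attention_mask=attention_mask, labels=y)
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()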
FinBERT is a pre-trained NLP model for analyzing the sentiment of financial text. It is built by further training the BERT language model on a large financial corpus, thereby fine-tuning it for financial sentiment classification.
Recommended Reading Materials: - Fine-tuning FinBERT (https://github.com/ProsusAI/finBERT/blob/master/notebooks/finbert_training.ipynb)
Here, we will reuse the results from BERT, and compare them to the sentiment scores produced by FinBERT.
# Set working directory to the finBERT folder within the model folder
import os
os.chdir(os.path.join(os.getcwd(), "model/finBERT"))
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import nltk
from nltk.tokenize import sent_tokenize
import pandas as pd
import numpy as np
from UtilityTools import *
import finbert
nltk.download('punkt')
True
model = AutoModelForSequenceClassification.from_pretrained("models/sentiment")
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
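If the local models/sentiment checkpoint is not available, a similar setup can pull FinBERT directly from the Hugging Face hub instead; note that the hub weights (the ProsusAI/finbert repository) may differ slightly from the local checkpoint used here.

# Alternative: load FinBERT from the Hugging Face hub rather than a local
# checkpoint (may not match models/sentiment exactly)
hub_model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")
hub_tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")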
The native FinBERT model classifies sentiment into 3 classes, namely positive, neutral, and negative. FinBERT also provides the intensity of the sentiment on a scale of [-1,1], where -1 indicates overwhelmingly negative and 1 indicates overwhelmingly positive.
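The label order hard-coded below must match the order the checkpoint was trained with. If the checkpoint records its labels in its config, a quick sanity check is possible; the output in the comment is what we would expect, not a guarantee.

# Sanity-check the label order stored in the checkpoint config, if present
print(model.config.id2label)  # expected: {0: 'positive', 1: 'negative', 2: 'neutral'}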
label_list = ['positive', 'negative', 'neutral']
label_dict = {0: 'positive', 1: 'negative', 2: 'neutral'}
result = pd.DataFrame(columns=['sentence', 'logit', 'prediction', 'sentiment_score'])
sentences = list(BERT_df_sentiment["final_sentences"])
# Suppress INFO-level logging from UtilityTools (30 = logging.WARNING)
logger.setLevel(30)
# Score sentences in batches of 5 using FinBERT
for batch in chunks(sentences, 5):
    examples = [InputExample(str(i), sentence) for i, sentence in enumerate(batch)]
    features = convert_examples_to_features(examples, label_list, 64, tokenizer)
    all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
    all_attention_mask = torch.tensor([f.attention_mask for f in features], dtype=torch.long)
    all_token_type_ids = torch.tensor([f.token_type_ids for f in features], dtype=torch.long)
    # Run the model without tracking gradients (inference only)
    with torch.no_grad():
        logits = model(all_input_ids, all_attention_mask, all_token_type_ids)[0]
    # Convert logits to class probabilities; score each sentence as
    # P(positive) - P(negative), giving a value in [-1, 1]
    logits = softmax(np.array(logits))
    sentiment_score = pd.Series(logits[:, 0] - logits[:, 1])
    predictions = np.squeeze(np.argmax(logits, axis=1))
    batch_result = pd.DataFrame({'sentence': batch,
                                 'logit': list(logits),
                                 'prediction': predictions,
                                 'sentiment_score': sentiment_score})
    result = pd.concat([result, batch_result], ignore_index=True)
Token indices sequence length is longer than the specified maximum sequence length for this model (516 > 512). Running this sequence through the model will result in indexing errors
# Converting prediction from class (0,1,2) to sentiment (negative,neutral,positive)
result['prediction'] = result["prediction"].apply(lambda x: label_dict[x])
# Check that the sentences and their order match across the two dataframes
sum(BERT_df_sentiment["final_sentences"] != result["sentence"])
0
# Port the dates over from BERT_df_sentiment so that we can analyze by date
result['date'] = BERT_df_sentiment["date"]
# Navigate two levels up, back to the project root
os.chdir(os.path.abspath(os.path.join(os.getcwd(), os.pardir)))
os.chdir(os.path.abspath(os.path.join(os.getcwd(), os.pardir)))
result.to_csv("data/FinBERT.csv")
result.head()
sentence ... date
0 SUMMARY OF COMMENTARY ON CURRENT ECONOMIC COND... ... 2008-03-05
1 VI-1Seventh District – Chicago ………………………………………... ... 2008-03-05
2 XI-1Twelfth District – San Francisco ..……….………... ... 2008-03-05
3 Several Districts noted declines in sales of b... ... 2008-03-05
4 Farm incomes and/or value of production rosein... ... 2008-03-05
[5 rows x 5 columns]
FB_sentiment_by_date = result[["date","logit","sentiment_score"]].groupby(["date"]).mean().reset_index()
FB_sentiment_by_date.date = FB_sentiment_by_date.date.astype(str)
FB_sentiment_by_date.head(n=10)
date logit sentiment_score
0 2008-03-05 [0.2652003, 0.57976055, 0.15503936] -0.314560
1 2008-04-16 [0.2845241, 0.55292267, 0.16255294] -0.268399
2 2008-06-11 [0.28699178, 0.5470741, 0.16593392] -0.260083
3 2008-07-23 [0.24519339, 0.57913643, 0.17567025] -0.333943
4 2008-09-03 [0.27335647, 0.5567819, 0.16986164] -0.283425
5 2008-10-15 [0.21417552, 0.63015056, 0.15567341] -0.415976
6 2008-12-03 [0.16563646, 0.71037775, 0.123986036] -0.544741
7 2009-01-14 [0.16178429, 0.7197153, 0.11850039] -0.557931
8 2009-03-04 [0.15317762, 0.7147659, 0.13205646] -0.561589
9 2009-04-15 [0.1963015, 0.6658742, 0.13782418] -0.469573
Plotting the FinBERT series with ggplot:
ggplot(data = py$FB_sentiment_by_date, aes(x = as.Date(date), y = sentiment_score, group = 1)) +
  geom_line(color = "#27408b") +
  geom_point(shape = 21, fill = "white", color = "#27408b", size = 1, stroke = 1.1) +
  scale_y_continuous(labels = scales::comma) +
  scale_x_date(date_labels = "%b/%Y") +
  theme(plot.title = element_text(size = 13)) +
  labs(
    x = "Year",
    y = "Average Sentiment (FinBERT)",
    title = "Average Sentiment (FinBERT) of each Federal Reserve Beige Book",
    subtitle = "March 2008 - September 2022",
    caption = "Source: US FED Beige Book"
  )
To compare the results of BERT and FinBERT, we normalize FinBERT's [-1,1] output to [0,1] using the linear map (x + 1) / 2.
FB_sentiment_by_date["NormalizedFinBERT"] = list(map(lambda x: (x+1)/2, FB_sentiment_by_date["sentiment_score"]))
FB_sentiment_by_date
date ... NormalizedFinBERT
0 2008-03-05 ... 0.342720
1 2008-04-16 ... 0.365801
2 2008-06-11 ... 0.369959
3 2008-07-23 ... 0.333028
4 2008-09-03 ... 0.358287
.. ... ... ...
110 2022-03-02 ... 0.483172
111 2022-04-20 ... 0.489750
112 2022-06-01 ... 0.443476
113 2022-07-13 ... 0.407980
114 2022-09-07 ... 0.376452
[115 rows x 4 columns]
Now we plot the two models' scores together using ggplot:
# Align the column name with the BERT dataframe before row-binding
py$FB_sentiment_by_date$SentimentScore = py$FB_sentiment_by_date$NormalizedFinBERT
df_plot <- py$FB_sentiment_by_date %>%
  mutate(Type = "FinBERT") %>%
  bind_rows(py$df_sentiment_by_date %>% mutate(Type = "BERT"))
ggplot(df_plot, aes(x = as.Date(date), y = SentimentScore, color = Type)) +
  geom_line() +
  scale_y_continuous(labels = scales::comma) +
  scale_x_date(date_labels = "%b/%Y") +
  theme(plot.title = element_text(size = 13)) +
  labs(
    x = "Year",
    y = "Average Sentiment",
    title = "Average Sentiment of each Federal Reserve Beige Book via BERT and FinBERT",
    subtitle = "March 2008 - September 2022",
    caption = "Source: US FED Beige Book"
  )
We may be able to derive additional insight by looking at the change in sentiment relative to the previous report.
py$FB_sentiment_by_date <- py$FB_sentiment_by_date %>%
  mutate(delta = (NormalizedFinBERT - lag(NormalizedFinBERT)) / lag(NormalizedFinBERT) * 100)
py$df_sentiment_by_date <- py$df_sentiment_by_date %>%
  mutate(delta = (SentimentScore - lag(SentimentScore)) / lag(SentimentScore) * 100)
df_change_plot <- py$FB_sentiment_by_date %>%
  mutate(Type = "FinBERT") %>%
  bind_rows(py$df_sentiment_by_date %>% mutate(Type = "BERT"))
ggplot(df_change_plot, aes(x = as.Date(date), y = delta, color = Type)) +
  geom_line() +
  scale_y_continuous(labels = scales::comma) +
  scale_x_date(date_labels = "%b/%Y") +
  theme(plot.title = element_text(size = 13)) +
  labs(
    x = "Year",
    y = "Change in Sentiment (%)",
    title = "Change in Sentiment of the Federal Reserve Beige Book via BERT and FinBERT",
    subtitle = "March 2008 - September 2022",
    caption = "Source: US FED Beige Book"
  )