A few years ago I saw a sentiment analysis by Michael Toth of Warren Buffett’s letters to shareholders. It is a super interesting, well-executed analysis, but some of its plots suggest that the specifically financial nature of these documents makes a financial sentiment lexicon a better choice. Sentiment lexicons are lists of words used to assess the emotion or opinion content of a text by adding up the sentiment scores of the individual words within it. The tidytext package provides access to three general-purpose English sentiment lexicons. The positive or negative meaning of a word can depend on its context, though: a word like “risk” has a negative meaning in most general contexts but may be more neutral in financial reporting. Context-specific sentiment lexicons, such as the Loughran-McDonald dictionary of financial terms, provide a way to deal with this.
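For orientation, the lexicons used below can be pulled up with tidytext::get_sentiments(); the AFINN and Loughran-McDonald lexicons are downloaded through the textdata package the first time they are requested.

library(tidytext)
get_sentiments("bing")     # general purpose: words labelled positive/negative
get_sentiments("afinn")    # general purpose: integer scores from -5 to 5
get_sentiments("loughran") # finance-specific: six sentiment categories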
Let’s download the letters from Berkshire Hathaway, Warren Buffett’s company (see utils/download.R), and then implement a sentiment analysis.
# Packages: the tidyverse for wrangling, rvest/pdftools to read the raw
# letters, and tidytext for tokenization and the sentiment lexicons
library(tidyverse)
library(rvest)
library(pdftools)
library(tidytext)

# The letters live in path_data (see utils/download.R) as HTML or PDF files,
# named by year
berkshire_names <- list.files(path_data, pattern = "html|pdf")
berkshire_names <- berkshire_names %>%
  set_names(str_extract(berkshire_names, "\\d+"))

raw_text <- berkshire_names %>%
  file.path(path_data, .) %>%
  # PDFs are read page by page and collapsed; HTML files are stripped to text
  map_if(
    ~str_detect(.x, "pdf"),
    ~pdf_text(.x) %>% paste(collapse = " "),
    .else = ~read_html(.x) %>% html_text()
  ) %>%
  set_names(str_extract(berkshire_names, "\\d+")) %>%
  # one single-column tibble per letter, then one row per year
  map(~map_dfc(.x, ~.x)) %>%
  bind_rows(.id = "year") %>%
  rename(data = `...1`) %>%
  mutate(text = map(data, ~.x)) %>%
  select(-data) %>%
  suppressMessages()
glimpse(raw_text)
Rows: 45
Columns: 2
$ year <chr> "1977", "1978", "1979", "1980", "1981", "1982", "1983", "1984", "…
$ text <list> "\r\n window.dataLayer = window.dataLayer || [];\r\n function …
Using tidy data principles is a powerful way to make handling data easier and more effective, and this is no less true when it comes to dealing with text. As described by Hadley Wickham, tidy data has a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table.
The tidy text format is thus defined as a table with one token per row. A token is a meaningful unit of text, such as a word, that we are interested in using for analysis, and tokenization is the process of splitting text into tokens. This one-token-per-row structure is in contrast to the ways text is often stored in analyses, perhaps as strings or in a document-term matrix. For tidy text mining, the token stored in each row is most often a single word, but it can also be an n-gram, sentence, or paragraph. The tidytext package provides functionality to tokenize by commonly used units of text like these and to convert to a one-term-per-row format.
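As a tiny illustration of what this format looks like, here is a toy sentence (not one of the letters) run through unnest_tokens():

library(tibble)
library(tidytext)

toy <- tibble(text = "Our gain in net worth during 1977 was satisfactory")
unnest_tokens(toy, word, text)
# one lowercase word per row: "our", "gain", "in", "net", "worth", ...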
tidy_text <- raw_text %>%
  unnest_tokens(word, text) %>%
  filter(
    str_detect(word, "[a-z']$"),  # keep tokens ending in a letter or apostrophe (drops numbers)
    !word %in% stop_words$word    # drop common stop words ("the", "of", ...)
  )
glimpse(tidy_text)
Rows: 208,422
Columns: 2
$ year <chr> "1977", "1977", "1977", "1977", "1977", "1977", "1977", "1977", "…
$ word <chr> "window.datalayer", "window.datalayer", "function", "gtag", "data…
We use unnest_tokens() to split the dataset (all the letters) into one-word-per-row tokens and remove the stop words.
Common words throughout 45 years of letters
tidy_text %>% count(word, sort=TRUE)
# A tibble: 15,282 × 2
word n
<chr> <int>
1 berkshire 2311
2 business 2243
3 earnings 1986
4 company 1353
5 million 1265
6 insurance 1261
7 businesses 1084
8 billion 937
9 companies 891
10 market 833
# … with 15,272 more rows
tidy_text %>%
count(word, sort = TRUE) %>%
filter(n > 600) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col() +
xlab(NULL) +
coord_flip() +
ggtitle("Most Common Words in Buffett's Letters") +
theme_minimal()
Most common words each year
words_by_year <- tidy_text %>%
count(year, word, sort = TRUE) %>%
ungroup()
words_by_year
# A tibble: 91,165 × 3
year word n
<chr> <chr> <int>
1 2014 berkshire 203
2 1985 business 112
3 1983 business 97
4 1984 business 96
5 2014 business 92
6 1990 business 90
7 2015 berkshire 90
8 1980 earnings 87
9 2016 berkshire 86
10 1989 business 85
# … with 91,155 more rows
Let’s examine how often positive and negative words occurred in these letters. Which years were the most positive or negative overall? The AFINN lexicon provides a positivity score for each word, from \(-5\) (most negative) to \(5\) (most positive). Here I calculate the average sentiment score for each year.
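Concretely, the per-year value computed below is the occurrence-weighted average \( \text{score}_y = \sum_w s_w \, n_{y,w} \big/ \sum_w n_{y,w} \), where \(s_w\) is the AFINN score of word \(w\) and \(n_{y,w}\) is the number of times that word occurs in year \(y\)’s letter.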
letters_sentiments <- words_by_year %>%
inner_join(get_sentiments("afinn"), by = "word") %>%
rename(score=value) %>%
group_by(year) %>%
summarize(score = sum(score * n) / sum(n))
letters_sentiments %>%
mutate(year = reorder(year, score)) %>%
ggplot(aes(year, score, fill = score > 0)) +
geom_col(show.legend = FALSE) +
coord_flip() +
ylab("Average sentiment score") +
ggtitle(
"Sentiment Score of Buffett's Letters to Shareholders 1977-2021"
) +
theme_minimal()
Warren Buffett is known for his long-term, optimistic economic outlook. Only 1 out of 45 letters appeared negative overall: Berkshire’s loss in net worth during 2001 was \(\$3.77\) billion, and the September 11 terrorist attacks contributed further to the negative tone of that year’s letter.
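As a quick check (not shown in the original output), filtering the per-year averages for negative values should single out that one letter:

letters_sentiments %>%
  filter(score < 0)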
Let’s now examine the total positive and negative contributions of each word.
contributions <- tidy_text %>%
inner_join(get_sentiments("afinn"), by = "word") %>%
rename(score=value) %>%
group_by(word) %>%
summarize(
occurences = n(),
contribution = sum(score)
)
contributions %>%
format.dt.f(.)
For example, abandon appears \(5\) times and carries an AFINN score of \(-2\), so its total contribution is \(5 \times (-2) = -10\):
contributions %>% slice(1)
# A tibble: 1 × 3
word occurences contribution
<chr> <int> <dbl>
1 abandon 5 -10
contributions %>%
top_n(25, abs(contribution)) %>%
mutate(word = reorder(word, contribution)) %>%
ggplot(aes(word, contribution, fill = contribution > 0)) +
geom_col(show.legend = FALSE) +
coord_flip() +
ggtitle(
'Words with the Most Contributions to Positive/Negative Sentiment Scores'
) + theme_minimal()
The word outstanding made the most positive contribution, and the word loss made the most negative contribution.
sentiment_messages <- tidy_text %>%
inner_join(get_sentiments("afinn"), by = "word") %>%
rename(score=value) %>%
group_by(year, word) %>%
summarize(
sentiment = mean(score),
words = n()
) %>%
ungroup() %>%
filter(words >= 5)
Now we look for the words with the highest positive scores in each letter; at the top, once again, is outstanding:
sentiment_messages %>%
arrange(desc(sentiment)) %>%
format.dt.f(.)
Unsurprisingly, the word loss secured the most negative score.
sentiment_messages %>%
arrange(sentiment) %>%
format.dt.f(.)
The assignments of words to sentiments look reasonable. However, a general-purpose lexicon is not tailored to financial text: the Loughran and McDonald sentiment lexicon of words specific to financial reporting, for instance, does not count outstanding or superb among its positive words. This financial lexicon labels words with six possible sentiments: positive, negative, litigious, uncertainty, constraining, and superfluous.
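Before applying it, it can help to see how many words fall under each label; the exact counts depend on the lexicon version downloaded through the textdata package.

get_sentiments("loughran") %>%
  count(sentiment, sort = TRUE)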
Relative changes in these sentiments over the years:
tidy_text %>%
add_count(year) %>%
rename(year_total = n) %>%
# Implement the sentiment analysis using Loughran-McDonald lexicon
inner_join(get_sentiments("loughran"), by = "word") %>%
count(year, year_total, sentiment) %>%
filter(sentiment %in% c("positive", "negative", "uncertainty", "litigious")) %>%
mutate(
sentiment = factor(
sentiment,
levels = c("negative", "positive", "uncertainty", "litigious")
)
) %>%
# Plot relative sentiment frequency per year as overlapping areas
# (following https://juliasilge.com/blog/tidytext-0-1-3/)
ggplot(aes(x = as.integer(year), y = n / year_total, fill = sentiment)) +
geom_area(position = "identity", alpha = 0.5) +
labs(
y = "Relative frequency", x = NULL,
title = "Sentiment analysis of Warren Buffett's shareholder letters",
subtitle = "Using the Loughran-McDonald lexicon"
)
We see negative sentiment spiking higher than positive sentiment during the financial upheaval of \(2008\), the collapse of the dot-com bubble in the early \(2000\)s, and the recession of the early \(1990\)s. Overall, though, notice that the balance of positive to negative sentiment is not as skewed toward positive as when using one of the general-purpose sentiment lexicons.
This happens because of the words that are driving the sentiment score in these different cases. When using the financial sentiment lexicon, the words have specifically been chosen for a financial context. What words are driving these sentiment scores?
tidy_text %>%
count(word) %>%
inner_join(get_sentiments("loughran"), by = "word") %>%
filter(sentiment %in% c("positive", "negative", "uncertainty", "litigious")) %>%
group_by(sentiment) %>%
top_n(5, n) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
mutate(sentiment = factor(sentiment, levels = c("negative", "positive", "uncertainty", "litigious"))) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(alpha = 0.8, show.legend = FALSE) +
coord_flip() +
scale_y_continuous(expand = c(0,0)) +
facet_wrap(~ sentiment, scales = "free") +
labs(
x = NULL,
y = "Total number of occurrences",
title = "Words driving sentiment scores in Warren Buffett's shareholder letters",
subtitle = "From the Loughran-McDonald lexicon"
)
Relationship Between Words: now for the most interesting part. By tokenizing the text into consecutive sequences of words (bigrams), we can examine how often one word is followed by another and thus study the relationship between words. Here we define a list of six words used in negation, namely don’t, not, no, can’t, won’t, and without, and visualize the sentiment-associated words that most often follow them.
letters_bigrams <- raw_text %>%
unnest(cols = c(text)) %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2)
letters_bigram_counts <- letters_bigrams %>%
count(year, bigram, sort = TRUE) %>%
ungroup() %>%
separate(bigram, c("word1", "word2"), sep = " ")
negate_words <- c("not", "without", "no", "can't", "don't", "won't")
letters_bigram_counts %>%
filter(word1 %in% negate_words) %>%
count(word1, word2, wt = n, sort = TRUE) %>%
inner_join(get_sentiments("afinn"), by = c(word2 = "word")) %>%
rename(score=value) %>%
mutate(contribution = -score * n) %>%
group_by(word1) %>%
top_n(10, abs(contribution)) %>%
ungroup() %>%
mutate(word2 = reorder(paste(word2, word1, sep = "__"), contribution)) %>%
ggplot(aes(word2, contribution, fill = contribution > 0)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ word1, scales = "free", nrow = 3) +
scale_x_discrete(labels = function(x) gsub("__.+$", "", x)) +
xlab("Words followed by a negation") +
ylab("Sentiment score * # of occurrences") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
coord_flip() +
ggtitle("Words that contributed the most to sentiment when they followed a ‘negation'") +
theme_minimal()
It looks like the largest sources of misidentifying a word as positive are “no matter”, “no better”, “not worth”, and “not good”, while the largest sources of incorrectly classified negative sentiment are “no debt”, “no problem”, and “not charged”.
Welcome to Text Mining with R: https://www.tidytextmining.com/
Supervised Machine Learning for Text Analysis in R: https://smltar.com/