A few years ago I saw a sentiment analysis by Michael Toth of Warren Buffett’s letters to shareholders. It’s a super interesting analysis, done well, but some of the plots in it suggest that the specifically financial nature of these documents makes a financial sentiment lexicon a great choice. Sentiment lexicons are lists of words used to assess the emotion or opinion content of a text by adding up the sentiment scores of the individual words within it. The tidytext package provides access to three general-purpose English sentiment lexicons. The positive or negative meaning of a word can depend on its context, though. A word like “risk” has a negative meaning in most general contexts but may be more neutral in financial reporting. Context-specific sentiment lexicons like the Loughran-McDonald dictionary provide a way to deal with this.
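
As a quick sanity check on that point, we can look up a word like “risk” in both a general-purpose lexicon and the financial one (a minimal sketch; it assumes the textdata package is installed so that get_sentiments() can fetch the lexicons):

library(dplyr)
library(tidytext)

# How is "risk" treated by the general-purpose AFINN lexicon versus the
# finance-specific Loughran-McDonald lexicon? (Results depend on the
# lexicon versions downloaded via textdata.)
get_sentiments("afinn")    %>% filter(word == "risk")
get_sentiments("loughran") %>% filter(word == "risk")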

1 Download

Let’s download (utils/download.R) the letters from Berkshire Hathaway, Warren Buffett’s company, and then implement a sentiment analysis.

berkshire_names <- list.files(path_data, pattern = "html|pdf")
berkshire_names <- berkshire_names %>%
  set_names(str_extract(berkshire_names, "\\d+"))

raw_text <- berkshire_names %>%
  file.path(path_data, .) %>%
  # read PDFs with pdftools::pdf_text(), HTML pages with read_html()/html_text()
  map_if(
    function(x) str_detect(x, "pdf"), 
    ~pdf_text(.x) %>% paste(collapse = " "), 
    .else = ~read_html(.x) %>% html_text()
  ) %>%
  set_names(str_extract(berkshire_names, "\\d+")) %>%
  # wrap each letter in a one-column tibble, then stack into a year/text tibble
  map(~map_dfc(.x, ~.x)) %>%
  bind_rows(.id = "year") %>%
  rename(data = `...1`) %>%
  mutate(text = map(data, ~.x)) %>%
  select(-data) %>%
  suppressMessages(.)

glimpse(raw_text)
Rows: 45
Columns: 2
$ year <chr> "1977", "1978", "1979", "1980", "1981", "1982", "1983", "1984", "…
$ text <list> "\r\n  window.dataLayer = window.dataLayer || [];\r\n  function …

2 TidyText

Using tidy data principles is a powerful way to make handling data easier and more effective, and this is no less true when it comes to dealing with text. As described by Hadley Wickham, tidy data has a specific structure:

  • Each variable is a column
  • Each observation is a row
  • Each type of observational unit is a table

We thus define the tidy text format as a table with one token per row. A token is a meaningful unit of text, such as a word, that we are interested in using for analysis, and tokenization is the process of splitting text into tokens. This one-token-per-row structure is in contrast to the ways text is often stored in current analyses, perhaps as strings or in a document-term matrix. For tidy text mining, the token stored in each row is most often a single word, but can also be an n-gram, sentence, or paragraph. The tidytext package provides functionality to tokenize by commonly used units of text like these and to convert to a one-term-per-row format.
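
As a tiny illustration of the one-token-per-row idea (a toy example, separate from the letters pipeline):

library(dplyr)
library(tidytext)

# One row per token: unnest_tokens() lower-cases and strips punctuation by default
tibble(line = 1, text = "Price is what you pay. Value is what you get.") %>%
  unnest_tokens(word, text)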

tidy_text <- raw_text %>%
  unnest_tokens(word, text) %>%
  filter(
    str_detect(word, "[a-z']$"),   # keep tokens ending in a letter or apostrophe (drops numbers)
    !word %in% stop_words$word     # remove stop words
  )

glimpse(tidy_text)
Rows: 208,422
Columns: 2
$ year <chr> "1977", "1977", "1977", "1977", "1977", "1977", "1977", "1977", "…
$ word <chr> "window.datalayer", "window.datalayer", "function", "gtag", "data…

We use unnest_tokens() to split the dataset (all the letters) into tokens and then remove stop words.

Common words throughout 45 years of letters

tidy_text %>% count(word, sort=TRUE)
# A tibble: 15,282 × 2
   word           n
   <chr>      <int>
 1 berkshire   2311
 2 business    2243
 3 earnings    1986
 4 company     1353
 5 million     1265
 6 insurance   1261
 7 businesses  1084
 8 billion      937
 9 companies    891
10 market       833
# … with 15,272 more rows

tidy_text %>%
  count(word, sort = TRUE) %>%
  filter(n > 600) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() + 
  ggtitle("Most Common Words in Buffett's Letters") + 
  theme_minimal()

Most common words each year

words_by_year <- tidy_text %>%
  count(year, word, sort = TRUE) %>%
  ungroup()

words_by_year
# A tibble: 91,165 × 3
   year  word          n
   <chr> <chr>     <int>
 1 2014  berkshire   203
 2 1985  business    112
 3 1983  business     97
 4 1984  business     96
 5 2014  business     92
 6 1990  business     90
 7 2015  berkshire    90
 8 1980  earnings     87
 9 2016  berkshire    86
10 1989  business     85
# … with 91,155 more rows

3 Sentiment

3.1 AFINN lexicon

Let’s examine how often positive and negative words occur in these letters. Which years were the most positive or negative overall? The AFINN lexicon provides a positivity score for each word, from \(-5\) (most negative) to \(5\) (most positive). Here I calculate the frequency-weighted average sentiment score for each year.
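
Concretely, the yearly score computed below is a frequency-weighted average of the AFINN word scores,

\[
\text{score}_{\text{year}} = \frac{\sum_{w} \text{score}_w \, n_{w,\text{year}}}{\sum_{w} n_{w,\text{year}}},
\]

where \(n_{w,\text{year}}\) is the number of times word \(w\) appears in that year’s letter.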

letters_sentiments <- words_by_year %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  rename(score=value) %>%
  group_by(year) %>%
  summarize(score = sum(score * n) / sum(n))

letters_sentiments %>%
  mutate(year = reorder(year, score)) %>%
  ggplot(aes(year, score, fill = score > 0)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  ylab("Average sentiment score") + 
  ggtitle(
    "Sentiment Score of Buffett's Letters to Shareholders 1977-2021"
  ) + 
  theme_minimal()

Warren Buffett is known for his long-term, optimistic economic outlook. Only 1 out of 45 letters appears negative. Berkshire’s loss in net worth during 2001 was \(\$3.77\) billion; in addition, the September 11 terrorist attacks contributed to the negative sentiment score in that year’s letter.

Let’s now examine the total positive and negative contributions of each word.

contributions <- tidy_text %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  rename(score=value) %>%
  group_by(word) %>%
  summarize(
    occurences = n(),
    contribution = sum(score)
  )

contributions %>%
  format.dt.f(.)

For example, abandon occurs \(5\) times, each with an AFINN score of \(-2\), for a total contribution of \(-10\):

contributions %>% slice(1)
# A tibble: 1 × 3
  word    occurences contribution
  <chr>        <int>        <dbl>
1 abandon          5          -10

contributions %>%
  top_n(25, abs(contribution)) %>%
  mutate(word = reorder(word, contribution)) %>%
  ggplot(aes(word, contribution, fill = contribution > 0)) +
  geom_col(show.legend = FALSE) +
  coord_flip() + 
  ggtitle(
    'Words with the Most Contributions to Positive/Negative Sentiment Scores'
  ) + theme_minimal()

The word “outstanding” made the most positive contribution and the word “loss” the most negative.

sentiment_messages <- tidy_text %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  rename(score=value) %>%
  group_by(year, word) %>%
  summarize(
    sentiment = mean(score),
    words = n()
  ) %>%
  ungroup() %>%
  filter(words >= 5)

Now we look for the words with the highest positive scores in each letter; at the top, once again, is “outstanding”:

sentiment_messages %>% 
  arrange(desc(sentiment)) %>%
  format.dt.f(.)

Unsurprisingly, the word “loss” has the most negative score.

sentiment_messages %>% 
  arrange(sentiment) %>%
  format.dt.f(.)

The assignments of words to sentiments look reasonable. However, the finance-specific Loughran-McDonald lexicon, which we turn to next, does not count “outstanding” and “superb” as positive.

3.2 Loughran and McDonald Lexicon

Another option is the Loughran-McDonald sentiment lexicon, which is specific to financial reporting. This financial lexicon labels words with six possible sentiments: positive, negative, litigious, uncertainty, constraining, and superfluous.
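
To get a quick feel for the lexicon, we can count how many words it assigns to each category (a small sketch; the exact counts depend on the lexicon version fetched via the textdata package):

library(dplyr)
library(tidytext)

# Number of Loughran-McDonald words per sentiment category
get_sentiments("loughran") %>%
  count(sentiment, sort = TRUE)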

Relative changes in these sentiments over the years:

tidy_text %>%
  add_count(year) %>%
  rename(year_total = n) %>%
  # Implement the sentiment analysis using the Loughran-McDonald lexicon
  inner_join(get_sentiments("loughran"), by = "word") %>%
  count(year, year_total, sentiment) %>%
  filter(sentiment %in% c("positive", "negative", "uncertainty", "litigious")) %>%
  mutate(
    sentiment = factor(
      sentiment, 
      levels = c("negative", "positive", "uncertainty", "litigious")
    )
  ) %>%
  # overlaid area chart, following https://juliasilge.com/blog/tidytext-0-1-3/
  ggplot(aes(x = as.numeric(year), y = n / year_total, fill = sentiment)) +
  geom_area(position = "identity", alpha = 0.5) +
  labs(
    y = "Relative frequency", x = NULL,
    title = "Sentiment analysis of Warren Buffett's shareholder letters",
    subtitle = "Using the Loughran-McDonald lexicon"
  )  

We see negative sentiment spiking, higher than positive sentiment, during the financial upheaval of \(2008\), the collapse of the dot-com bubble in the early 2000s, and the recession of the 1990s. Overall, though, notice that the balance of positive to negative sentiment is not as skewed toward positive as it is with one of the general-purpose sentiment lexicons.

This happens because of the words that are driving the sentiment score in these different cases. When using the financial sentiment lexicon, the words have specifically been chosen for a financial context. What words are driving these sentiment scores?

tidy_text %>%
  count(word) %>%
  inner_join(get_sentiments("loughran"), by = "word") %>%
  filter(sentiment %in% c("positive", "negative", "uncertainty", "litigious")) %>%
  group_by(sentiment) %>%
  top_n(5, n) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  mutate(sentiment = factor(sentiment, levels = c("negative", "positive", "uncertainty", "litigious"))) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  coord_flip() +
  scale_y_continuous(expand = c(0,0)) +
  facet_wrap(~ sentiment, scales = "free") +
  labs(
    x = NULL, 
    y = "Total number of occurrences",
    title = "Words driving sentiment scores in Warren Buffett's shareholder letters",
    subtitle = "From the Loughran-McDonald lexicon"
  )

4 Bigrams

Relationships between words: now for the most interesting part. By tokenizing the text into consecutive pairs of words (bigrams), we can examine how often one word is followed by another and study the relationships between words. Here we define a list of six words used in negative constructions, namely don’t, not, no, can’t, won’t and without, and visualize the sentiment-associated words that most often follow them.
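
As a tiny illustration of what a bigram token looks like (a toy example, not part of the letters pipeline):

library(dplyr)
library(tidytext)

# Consecutive word pairs, one bigram per row
tibble(text = "we have no debt and no problem") %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)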

letters_bigrams <- raw_text %>%
  unnest(cols = c(text)) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

letters_bigram_counts <- letters_bigrams %>%
  count(year, bigram, sort = TRUE) %>%
  ungroup() %>%
  separate(bigram, c("word1", "word2"), sep = " ")

negate_words <- c("not", "without", "no", "can't", "don't", "won't")

letters_bigram_counts %>%
  filter(word1 %in% negate_words) %>%
  count(word1, word2, wt = n, sort = TRUE) %>%
  inner_join(get_sentiments("afinn"), by = c(word2 = "word")) %>%
  rename(score=value) %>%
  mutate(contribution = -score * n) %>%
  group_by(word1) %>%
  top_n(10, abs(contribution)) %>%
  ungroup() %>%
  mutate(word2 = reorder(paste(word2, word1, sep = "__"), contribution)) %>%
  ggplot(aes(word2, contribution, fill = contribution > 0)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ word1, scales = "free", nrow = 3) +
  scale_x_discrete(labels = function(x) gsub("__.+$", "", x)) +
  xlab("Words followed by a negation") +
  ylab("Sentiment score * # of occurrences") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  coord_flip() + 
  ggtitle("Words that contributed the most to sentiment when they followed a ‘negation'") + 
  theme_minimal()

It looks like the largest sources of misidentified positive sentiment are “no matter”, “no better”, “not worth” and “not good”, while the largest sources of incorrectly classified negative sentiment are “no debt”, “no problem” and “not charged”.

5 References