Text Mining with Warren Buffett’s Letters

Pavan Singh

2022-02-17

Preliminary Discussion and Background

Text mining can be described as the practice of analyzing large collections of textual material to capture key concepts, trends and hidden relationships. Essentially, it is the data analysis of natural language works (articles, books, etc.), treating text as a form of data and combining it with numerical analysis. The analysis is ultimately intended to uncover hidden relationships or trends. Text mining is often accompanied by other statistical methods, including but not limited to sentiment analysis. Sentiment analysis (opinion mining) is a text mining technique that uses natural language processing (NLP) to analyze text for the sentiment of the writer (positive, negative, neutral, and beyond).

Within the context of this project, we discuss the shareholder letters of Warren Buffett. Warren Buffett’s letters to shareholders are widely renowned and are often of great interest to the public. The shareholder letter is generally written once per year, is included at the beginning of the firm’s annual report, and can usually be found in the investor relations section of a company’s website. These letters have become an annual required read across the investing world, providing insight into how Buffett and his team think about everything from investment strategy to stock ownership to company culture, and more (CB Insights, 2021). Amongst other things, the letters generally discuss the performance of Berkshire Hathaway and its portfolio of businesses and investments, as well as Buffett’s views on business, the market, and investing. The letters can be found on the Berkshire Hathaway website here. They are a demonstration of Warren Buffett’s atypical ethos: while many companies fill their reports with dense technical language, Buffett’s letters take a different approach. The letters are written in simple language, making them digestible and accessible to everybody.

Interestingly, in annually compounded returns, Berkshire stock has gained 20.8% since 1965, while the S&P 500 as a whole has gained only 9.7% over the same period. That is, if you had invested \(\$1\) in the S&P 500 in 1965 you would have \(\$112.34\), whilst \(\$1\) in Berkshire stock would have left you with a wonderful sum of \(\$15{,}325.46\).
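These figures follow directly from compounding the quoted average annual rates. A minimal sketch in R, assuming 51 annual compounding periods (1965 through 2016) and the rates quoted above, roughly reproduces them; small differences are due to rounding.

# Rough check of the compounding arithmetic (assumption: 51 annual periods)
years <- 51
sp500     <- (1 + 0.097) ^ years   # roughly 112
berkshire <- (1 + 0.208) ^ years   # roughly 15,300
c(sp500 = sp500, berkshire = berkshire)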


Aim and Project Goal

This project looks to take the shareholder letters written by Warren Buffet and perform a sentiment analysis of the text within the letters. Essentially, we look to identify or quantify the overall sentiment of a particular set of text. This is a common use-case (and also the aim of this project), to use Sentiment Analysis to determine how positive or negative a particular text document is.


Data Description

The data used in this project was taken from Berkshire Hathaway’s public repository of shareholder letters, available for free access. The repository contains other documents and information available to the public; for the scope of this project, we are concerned only with the annual shareholder letters written by Warren Buffett.


Setup

Begin by loading (and, if necessary, installing) the packages required for the study. The main package used for scraping the HTML text is rvest, which helps scrape (or harvest) data from web pages.

library(dplyr)         # For data manipulation
library(ggplot2)       # For plotting
library(hrbrthemes)    # For ggplot2 theme.
library(tidyr)         # For data cleaning
library(tidytext)      # For data cleaning of text corpus
library(pdftools)      # For reading text from pdf files
library(rvest)         # For scraping html text
library(wordcloud)     # For wordclouds
library(XML)           # For easily reading HTML Tables
library(knitr)         # For report rendering and tables
library(stringr)       # For working with strings
library(ggthemes)      # For extra themes

The code shown below extracts the HTML letters and downloads and reads the PDF letters into our R environment. We then use a combination of functions to combine all the letters into a single data frame. From here we can use data manipulation to explore and retrieve the information in the text for our sentiment analysis. The code used here was taken from online resources. Note that scraping the data can take a while to run. If you prefer, you may load the saved data manually and skip the next two chunks.

###############################################################
# CAN TAKE SOME TIME TO SCRAPE
# USE .RDATA FILES AND SKIP THIS CHUNK IF NEED BE
###############################################################

# Getting & Reading in HTML Letters
urls_77_97 <- paste('http://www.berkshirehathaway.com/letters/', seq(1977, 1997), '.html', sep='')
html_urls <- c(urls_77_97,
               'http://www.berkshirehathaway.com/letters/1998htm.html',
               'http://www.berkshirehathaway.com/letters/1999htm.html',
               'http://www.berkshirehathaway.com/2000ar/2000letter.html',
               'http://www.berkshirehathaway.com/2001ar/2001letter.html')

letters_html <- lapply(html_urls, function(x) read_html(x) %>% html_text())

# Getting & Reading in PDF Letters
urls_03_16 <- paste('http://www.berkshirehathaway.com/letters/', seq(2003, 2016), 'ltr.pdf', sep = '')
pdf_urls <- data.frame('year' = seq(2002, 2016),
                       'link' = c('http://www.berkshirehathaway.com/letters/2002pdf.pdf', urls_03_16))

download_pdfs <- function(x) {
  myfile = paste0(x['year'], '.pdf')
  download.file(url = x['link'], destfile = myfile, mode = 'wb')
  return(myfile)
}

pdfs <- apply(pdf_urls, 1, download_pdfs)
letters_pdf <- lapply(pdfs, function(x) pdf_text(x) %>% paste(collapse=" "))
tmp <- lapply(pdfs, function(x) if(file.exists(x)) file.remove(x)) # Clean up directory

# Combine all letters in a data frame
letters <- do.call(rbind, Map(data.frame, year=seq(1977, 2016), text=c(letters_html, letters_pdf)))
letters$text <- as.character(letters$text)

#Saving Data
#save(letters_html, file = "letter_html.rdata")
#save(letters_pdf, file = "letter_pdf.rdata")

Here, we use unnest_tokens to split the data set (all the letters) into tokens and remove stop words. For more information about tokens and tidy text, look here. Essentially, a token is a meaningful unit of text, most often a word, that we are interested in using for further analysis, and tokenization is the process of splitting text into tokens. For our letters, we therefore need to both tokenize the text and transform it into a tidy data structure. To do this, we use tidytext’s unnest_tokens() function.

letter_words <- letters %>%
  unnest_tokens(word, text) %>%
  filter(str_detect(word, "[a-z']$"),
         !word %in% stop_words$word)

#save(letter_words, file = "letter_words.rdata")

Exploration of Data

An extract of the current data is shown below, along with some details about the data stored in the data frame. We have words tokenized from the text of the letters from 1977 all the way to 2016. Let’s look at the structure; it should have year and word as variables.

str(letter_words)
## 'data.frame':    193706 obs. of  2 variables:
##  $ year: int  1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 ...
##  $ word: chr  "window.datalayer" "window.datalayer" "function" "gtag" ...

Our data frame contains two columns, year and the words in the letter for that year. Let’s look at some of the data.

DT::datatable(tail(letter_words, 15))

The output above simply shows the last 15 words used in the 2016 letter. Let’s look at the most common words used throughout 40 years of letters.

letter_words %>% 
  dplyr::count(word, sort=TRUE) %>% head(10) %>% DT::datatable()

We see that, as one would expect, business is the most used word. Not surprising. Berkshire is a close second, and earnings a close third. Nothing stands out as extraordinary from the top 10 words used in the letters. Let’s visualize this.

letter_words %>%
  dplyr::count(word, sort = TRUE) %>%
  filter(n > 450) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col(fill = "#4682B4") +
  xlab(NULL) +
  coord_flip() + ggtitle("The Most Common Words in Buffett's Letters") + theme_minimal()

We only show words which appeared more than 450 times. Looking at the results, it is interesting to see that all the words are associated with, or understandably used in the context of, a business and its annual report. As mentioned, the shareholder letters contain minimal technical language, and this is exemplified by the fact that most of these commonly used words are easily understandable and identifiable from a non-technical background. The Government Employees Insurance Company (GEICO), which appears among these words, is a private American auto insurance company and a wholly owned subsidiary of Berkshire Hathaway, Inc.

We can further break down this analysis by looking at the most common words used each year. The resulting data frame will be central to our sentiment analysis, as it allows us to track how the sentiment in the letters changes over the years. An extract of the resulting data frame is shown below.

words_by_year <- letter_words %>%
  dplyr::count(year, word, sort = TRUE) %>%
  ungroup()

DT::datatable(head(words_by_year,10))

We shall conduct three stages of sentiment analysis. There are a variety of methods and dictionaries for evaluating the opinion or emotion in text. The tidytext package provides access to several sentiment lexicons. Four commonly used lexicons are

  • AFINN from Finn Årup Nielsen

  • bing from Liu and collaborators

  • loughran from Loughran and McDonald

  • nrc from Saif Mohammad and Peter Turney

We shall use three of these lexicons: AFINN in Sentiment Analysis 1.0, loughran in 2.0, and bing in the final Sentiment Analysis (3.0). All three of these lexicons are based on unigrams. They contain many English words, and the words are assigned scores for positive/negative sentiment, and in some cases emotions like joy, anger, sadness, and so forth. The bing lexicon categorizes words in a binary fashion into positive and negative categories. The AFINN lexicon assigns each word a score between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment. The loughran lexicon divides words into constraining, litigious, negative, positive, superfluous and uncertainty categories. More information and examples can be found here, in the notes for Text Mining with R.
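To get a feel for how these lexicons differ, we can preview each one directly with tidytext’s get_sentiments(); the sketch below simply prints the first few rows of each (the first call to each lexicon may prompt a one-time download via the textdata package).

# Preview the three lexicons used in this project
get_sentiments("afinn")    %>% head(3)   # word + numeric value from -5 to 5
get_sentiments("bing")     %>% head(3)   # word + positive/negative label
get_sentiments("loughran") %>% head(3)   # word + one of six finance-oriented categories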


Sentiment Analysis

Using the results from our prior exploratory analysis, we have a data frame containing the most common words used each year. We can now examine how often positive and negative words occurred in these letters. For example, we may be interested to see which years were the most positive or negative overall.

Part 1 - AFINN

We use the AFINN lexicon, a list of English terms manually rated for valence by Finn Årup Nielsen, each with an integer between -5 (negative) and +5 (positive). It was primarily developed to analyze Twitter sentiment. It consists of 2,477 words, 878 positive and 1,598 negative, with a mean score of -0.59.

  • A cool little implementation of AFINN sentiment analysis can be found here, offering real-time (in-browser) sentiment analysis of any text you enter.
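As a quick sanity check, the counts quoted above can be recomputed from the lexicon itself; a minimal sketch is below (the exact figures may differ slightly depending on the lexicon version downloaded via textdata).

# Recompute the summary statistics quoted above from the AFINN lexicon
afinn <- get_sentiments("afinn")
c(total    = nrow(afinn),
  positive = sum(afinn$value > 0),
  negative = sum(afinn$value < 0),
  mean     = round(mean(afinn$value), 2))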

We conduct our sentiment analysis by calculating the average sentiment score for each year.

letters_sentiments <- words_by_year %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(year) %>%
  summarize(value = sum(value * n) / sum(n))

letters_sentiments %>%
  dplyr::mutate(year = reorder(year, value)) %>%
  ggplot(aes(year, value, fill = value > 0)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  ylab("Average sentiment score") + 
  ggtitle("Sentiment Score of Buffett's Letters to Shareholders 1977-2016") + theme_minimal()

Warren Buffett is known for his long-term, optimistic economic outlook. This is truly exemplified by the sentiment of his letters over the 40-year period: only one of the 40 letters appears negative. Berkshire’s loss in net worth during 2001 was \(\$3.77\) billion; in addition, the September 11th terrorist attacks contributed to the negative sentiment score in that year’s letter.

Let’s examine the total positive and negative contribution of each word in the letters. For example, the word “abandon” appeared 4 times and contributed a total score of -8.

contributions <- letter_words %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(word) %>%
  summarize(occurences = n(),
            contribution = sum(value))

contributions %>%
  top_n(25, abs(contribution)) %>%
  mutate(word = reorder(word, contribution)) %>%
  ggplot(aes(word, contribution, fill = contribution > 0)) +
  geom_col(show.legend = FALSE) +
  coord_flip() + ggtitle('Words with the Most Contributions to Positive/Negative Sentiment Scores') + theme_minimal()

The word “outstanding” made the largest positive contribution and the word “loss” made the largest negative contribution. It is worth noting that this is based on the value scoring of the chosen lexicon, i.e., AFINN.

Continuing with our analysis, we can look at the sentiment in each letter and see the words with the highest positive scores. An extract of the resulting set is shown below. Here we see that “outstanding” appears in eight of the top ten rows.

sentiment_messages <- letter_words %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(year, word) %>%
  summarize(sentiment = mean(value),
            words = n()) %>%
  ungroup() %>%
  filter(words >= 5)
sentiment_messages %>%
  arrange(desc(sentiment)) %>% DT::datatable()

We can also look at the letters where “loss” was used a lot.

sentiment_messages %>% 
  arrange(sentiment) %>% head(10)  %>% DT::datatable()

Unsurprisingly, the word “loss” carries the most negative score in seven of the ten rows shown.

In the final part of this phase of the analysis (using AFINN), we look at the relationships between words used in Warren Buffett’s letters to shareholders. Specifically, by tokenizing the text into consecutive sequences of words, we can examine how often one word is followed by another, which allows us to study the relationships between words.

In this case, we define a list of six words that are used in negative situations, such as “don’t”, “not”, “no”, “can’t”, “won’t” and “without”, and visualize the sentiment-associated words that most often followed them.

letters_bigrams <- letters %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
letters_bigram_counts <- letters_bigrams %>%
  count(year, bigram, sort = TRUE) %>%
  ungroup() %>%
  separate(bigram, c("word1", "word2"), sep = " ")
  
negate_words <- c("not", "without", "no", "can't", "don't", "won't")

letters_bigram_counts %>%
  filter(word1 %in% negate_words) %>%
  count(word1, word2, wt = n, sort = TRUE) %>%
  inner_join(get_sentiments("afinn"), by = c(word2 = "word")) %>%
  mutate(contribution = value * n) %>%
  group_by(word1) %>%
  top_n(10, abs(contribution)) %>%
  ungroup() %>%
  mutate(word2 = reorder(paste(word2, word1, sep = "__"), contribution)) %>%
  ggplot(aes(word2, contribution, fill = contribution > 0)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ word1, scales = "free", nrow = 3) +
  scale_x_discrete(labels = function(x) gsub("__.+$", "", x)) +
  xlab("Words followed by a negation") +
  ylab("Sentiment score * # of occurrences") +
  coord_flip() + ggtitle("Words that contributed the most to sentiment when they followed a 'negation'") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

It looks like the largest sources of misidentifying a word as positive come from “no matter”, “no better”, “not worth” and “not good”, while the largest sources of incorrectly classified negative sentiment are “no debt”, “no problem” and “not charged”.

Part 2 - Loughran

We can explore using another sentiment lexicon. Here we look to use “loughran”, which was developed based on analyses of financial reports. The Loughran dictionary divides words into six sentiments: “positive”, “negative”, “litigious”, “uncertainty”, “constraining”, and “superfluous” (Loughran-McDonald).

letter_words %>%
  count(word) %>%
  inner_join(get_sentiments("loughran"), by = "word") %>%
  group_by(sentiment) %>%
  top_n(5, n) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~ sentiment, scales = "free") +
  ggtitle("Frequency of This Word in Buffett's Letters") + theme_minimal()

The assignments of words to sentiments look reasonable. However, “outstanding” and “superb” no longer appear under the positive sentiment. This is in great contrast to the previous implementation of the analysis (Sentiment Analysis 1.0), where we found that the word “outstanding” carried a heavy weight in determining positive sentiment using the AFINN lexicon.
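One way to see the difference is to check how each lexicon treats a given word directly; the sketch below looks up “outstanding” in both lexicons (a word that is absent from a lexicon simply returns zero rows).

# Look up how each lexicon treats the word "outstanding"
get_sentiments("afinn")    %>% filter(word == "outstanding")
get_sentiments("loughran") %>% filter(word == "outstanding")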

Part 3 - Bing

For this sentiment analysis we look to employ the bing lexicon.

# Tidy letters
tidy_letters <- letters %>% 
  unnest_tokens(word, text) %>%                           # split text into words
  anti_join(stop_words, by = "word") %>%                  # remove stop words
  filter(!grepl('[0-9]', word)) %>%                       # remove numbers
  left_join(get_sentiments("bing"), by = "word") %>%      # add sentiment scores to words
  group_by(year) %>% 
  mutate(linenumber = row_number(),                       # add line numbers
         sentiment = ifelse(is.na(sentiment), 'neutral', sentiment)) %>%
  ungroup()
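The per-year score letters_sentiment referenced below is not constructed in the code shown here, so the following is a plausible sketch, assuming sentiment_pct is the share of positive words minus the share of negative words in each letter.

# Sketch (assumption): per-year net sentiment percentage from the bing-labelled
# tokens, where sentiment_pct = (positive - negative) / total words
letters_sentiment <- tidy_letters %>%
  count(year, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment_pct = (positive - negative) / (positive + negative + neutral)) %>%
  select(year, sentiment_pct)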

We can further delve into some of the trends identified here and perhaps look to find suitable explanations for lower sentiment scores in some years.

  • 1987: This letter shows a sentiment value of 0.38. The market crash on October 19th, 1987 (Black Monday) is widely known as the largest single-day percentage decline ever experienced by the Dow Jones Industrial Average, 22.61% in one day. This could help explain the slightly low sentiment score in the 1987 letter.

  • 1990: A value of 0.42. Higher than 1987, but still lower than many other years. The recession of 1990, triggered by an oil price shock following Iraq’s invasion of Kuwait, resulted in a notable increase in unemployment.

  • 2001: Following the 1990s, which represented the longest period of growth in American history, 2001 saw the collapse of the dot-com bubble and associated declining market values, as well as the September 11th attacks.

  • 2002: The market, already falling in 2001, continued to see declines throughout much of 2002.

  • 2008: The Great Recession was a large worldwide economic recession, characterized by the International Monetary Fund as the worst global recession since the Great Depression. Other related events during this period included the financial crisis of 2007-2008 and the subprime mortgage crisis of 2007-2009.

rbind(
letters_sentiment[which(letters_sentiment$year == 1987),],
letters_sentiment[which(letters_sentiment$year == 1990),],
letters_sentiment[which(letters_sentiment$year == 2001),],
letters_sentiment[which(letters_sentiment$year == 2002),],
letters_sentiment[which(letters_sentiment$year == 2008),]
)
## # A tibble: 5 × 2
##    year sentiment_pct
##   <int>         <dbl>
## 1  1987     -0.00206 
## 2  1990     -0.000657
## 3  2001     -0.0120  
## 4  2002     -0.00136 
## 5  2008     -0.00928

We can also, as we did with the AFINN lexicon in the previous analyses, examine which words were the strongest contributors to positive and negative sentiment in the letters. For this exercise, we analyze the letters as one single text and present the most common positive and negative words in the graph below.
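The code that produced this graph is not shown in the original, so the following is a plausible sketch built from the bing-labelled tokens in tidy_letters.

# Sketch (assumption): most common positive and negative words across all
# letters treated as a single text, using the bing labels in tidy_letters
tidy_letters %>%
  filter(sentiment != "neutral") %>%
  count(word, sentiment, sort = TRUE) %>%
  group_by(sentiment) %>%
  top_n(15, n) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ sentiment, scales = "free_y") +
  coord_flip() +
  ggtitle("Most Common Positive and Negative Words (bing)") +
  theme_minimal()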

The results here are interesting. Many of the most common words (‘gain’, ‘gains’, ‘loss’, ‘losses’, ‘worth’, ‘liability’, and ‘debt’) are what we’d expect given the financial nature of these documents. As expected, the words ‘gain’, ‘worth’ and ‘gains’ make the largest contribution to positive sentiment, while ‘loss’ and ‘losses’ make the largest contribution to negative sentiment.

Note that the adjectives that make their way into this set are of particular interest as well, since they invite us to reflect on how Warren Buffett thinks and expresses his own feelings and prospects. For example, on the positive side we have ‘significant’, ‘outstanding’, ‘excellent’, ‘extraordinary’, and ‘competitive’. On the negative side there are ‘negative’, ‘unusual’, ‘difficult’, and ‘bad’.

A clear limitation of sentiment analysis is brought to light by the word ‘casualty’, where Buffett is not referring to death but to the basket of property and casualty insurance companies that make up a significant portion of Berkshire Hathaway’s business holdings.

The plot is somewhat limited, as we are restricted in the number of words we can effectively show without crowding. To see a larger number of words, we can use a word cloud. The word cloud below shows 400 of the most commonly used words, split by positive and negative sentiment.
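The code for the word cloud is likewise not shown; a minimal sketch is below, using wordcloud::comparison.cloud() with reshape2::acast() (an assumed extra dependency) to pivot the word counts into the matrix that function expects.

library(reshape2)   # Assumed dependency, used here for acast()

# Sketch (assumption): comparison cloud of the 400 most common positive and
# negative words across all letters
tidy_letters %>%
  filter(sentiment != "neutral") %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("firebrick", "steelblue"), max.words = 400)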

References