R Packages Used

This project uses the following R packages for data analysis and visualization.

library(dplyr)
library(SentimentAnalysis)
library(lubridate)
library(ggplot2)
library(tidyr)
library(stringr)
library(rlang)
library(tidyverse)
library(tidytext)
library(xlsx)
library(RCurl)
library(XML)
library(kableExtra)
library(tm)
library(ngram)
library(wordcloud)
library(ggridges)
library(gridExtra)
library(rcompanion)
library(ggcorrplot)
library(caret)
library(e1071)
library(R.utils)
library(DT)
library(lattice)
library(kernlab)
library(mlbench)
library(caretEnsemble)
library(nnet)
library(LiblineaR)
library(knitr)

1. The Purpose of the Study

Central bank announcements of interest rate trajectories and monetary policy are among the most impactful events for global financial markets. This project uses data science to analyze Federal Reserve policy statements, seeking insight into their sentiment and its relationship to financial market variables. In conducting this study, we identify some tangible results of interest and also lay a basic foundation for a program of future inquiry.

Our approach to analyzing the FOMC statements is through the lens of textual analysis. Two approaches are used in this project. The first is sentiment analysis: it is applied to the FOMC text corpus and used to construct a sentiment time series, which is then used as an explanatory variable to compare against real-world financial time series. Time series charting and linear regression are the main tools of that approach.

The second approach is via text classification. The textual analysis of FOMC statements is not exclusive to data science. Financial market practitioners pore over every word, gesture and media interview of the Federal Reserve chairman or chairwoman, governors and District presidents. Some financial news vendors, like Bloomberg, even publish text comparisons showing the redlined differences between two consecutive FOMC statements. Our text classification approach attempts to automate the classification of each FOMC statement along several attributes, including the hawkishness or dovishness of the statement and the FOMC's assessment of economic growth, labor market health and inflation. The method used here is to manually label each FOMC statement and then train a support vector machine algorithm to predict each attribute.

This paper is organized as follows: Section 2 gives background on the Federal Reserve, the FOMC and past research. Section 3 describes the sources of data: the FOMC statements, the manual labelling of statements, and the financial time series. Section 4 performs exploratory data analysis on these data sources. Section 5 conducts two analyses: the first half addresses the machine learning work to classify statements by five attributes; the second half conducts sentiment analysis. Section 6 discusses the results and Section 7 concludes the project.

2. Background

2.1 Federal Reserve

2.1.1 Background

The Federal Reserve System - commonly called “the Fed” - serves as the central bank of the United States. Congress passed the Federal Reserve Act in 1913, which President Woodrow Wilson supported and signed into law on December 23, 1913. Congress structured the Fed as a distinctly American version of a central bank: a “decentralized” central bank, with Reserve Banks and Branches in 12 Districts spread across the country and coordinated by a Board of Governors in Washington, D.C. Congress also gave the Fed System a mixture of public and private characteristics. The 12 Reserve Banks share many features with private-sector corporations, including boards of directors and stockholders (the member banks within their Districts). The Board of Governors, though, is an independent government agency, with oversight responsibilities for the Reserve Banks.

The Fed conducts monetary policy, supervises and regulates banking, serves as lender of last resort, maintains an effective and efficient payments system, and serves as banker for banks and the U.S. government. Conducting the nation’s monetary policy is one of the most important - and often the most visible - functions of the Fed.

2.1.2 Monetary Policy

So, what is monetary policy? Simply put, it refers to the actions taken by the Fed to influence the supply of money and credit in order to foster price stability (i.e. control inflation) and maintain maximum sustainable employment. These two objectives are called the “dual mandate”. This distinguishes the Fed from other central banks which typically have a single mandate to control inflation.

The Fed’s instrument for implementing monetary policy is the FOMC’s target for the federal funds rate - the interest rate at which banks lend to each other overnight. By buying and selling U.S. government securities in the open market, the Fed influences the interest rate that banks charge each other. Movements in this rate and expectations about those changes influence all other interest rates and asset prices in the economy.

The Federal Reserve also issues the nation’s currency (Federal Reserve notes) and manages the amount of funds the banking system holds as reserves. Currency and reserves make up what is called the monetary base. However, because the vast majority of money in the US economy is in intangible form rather than physical notes, monetary policy focuses on interest rates instead of currency supply.

In the early days of the FOMC, controversy swirled around how to structure the vote. Should monetary policy be set by the 12 Reserve Banks or the Board of Governors? Or both? In 1935 Congress decided that the seven Governors would vote along with only five of the 12 presidents. The president of the New York Fed always votes - since the Open Market Trading Desk operates in that District - along with four presidents who rotate from among the groups shown below. In that way, voting members always come from different parts of the country.

2.2 The FOMC

As long as the U.S. economy is growing steadily and inflation is low, few people give much thought to the Federal Open Market Committee (FOMC), the group within the Federal Reserve System charged with setting monetary policy. Yet, when economic volatility makes the evening news, this Committee and its activities become much more prominent. Investors and workers, shoppers and savers all pay more attention to the FOMC’s decisions and the wording of its announcements at the end of each meeting.

Why? Because the decisions made by the FOMC have a ripple effect throughout the economy. The FOMC is a key part of the Federal Reserve System, which serves as the central bank of the United States. Among the Fed’s duties are managing the growth of the money supply, providing liquidity in times of crisis, and ensuring the integrity of the financial system. The FOMC’s decisions to change the growth of the nation’s money supply affect the availability of credit and the level of interest rates that businesses and consumers pay. Those changes in money supply and interest rates, in turn, influence the nation’s economic growth and employment in the short run and the general level of prices in the long run.

2.2.1 FOMC Meetings

The FOMC meets regularly - typically every six to eight weeks - in Washington, D.C., although the Committee can and does meet more often by phone or videoconference if needed. The meetings are generally one-day or two-day events, with the two-day meetings providing more time to discuss a special topic. Around the table in the Federal Reserve Board’s headquarters sit all 19 FOMC participants (seven Governors and 12 Reserve Bank presidents) as well as select staff and economists from the Board and the Reserve Banks. Because of the nature of the discussions, attendance is restricted. A Reserve Bank president, for instance, typically brings along only one staff member, usually his or her director of research.

The objective at each meeting is to set the Committee’s target for the federal funds rate - the interest rate at which banks lend to each other overnight - at a level that will support the two key objectives of U.S. monetary policy: price stability and maximum sustainable economic growth. The meeting’s agenda follows a structured and logical process that results in well-informed and thoroughly deliberated decisions on the future course of monetary policy.

2.2.2 Structure of a Typical Meeting

The meeting begins with a report from the manager of the System Open Market Account (SOMA) at the Federal Reserve Bank of New York, who is responsible for keeping the federal funds rate close to the target level set by the FOMC. The manager explains how well the Open Market Trading Desk has done in hitting the target level since the last FOMC meeting and discusses recent developments in the financial and foreign exchange markets. Up next is the Federal Reserve Board’s director of the Division of Research and Statistics, along with the director of the Division of International Finance. They review the Board staff’s outlook for the U.S. economy and foreign economies. This detailed forecast is circulated the week before the meeting to FOMC members in what is called the “Greenbook” - named for its green cover in the days when it was a printed document.

Then the meeting progresses to the first of two “go-rounds,” which are the core of FOMC meetings. During the first go-round, all of the Fed Governors and Reserve Bank presidents discuss how they see economic and financial conditions. The Reserve Bank presidents speak about conditions in their Districts, as well as offering their views on national economic conditions. The data and information discussed vary by region and therefore spotlight a wide range of industries. For example, one would expect the review of regional conditions in the San Francisco District to lend insight into the tech sector of Silicon Valley.

The policymakers have prepared for this go-round through weeks of information gathering. Before the FOMC meeting, each Reserve Bank prepares a “Summary of Commentary on Current Economic Conditions,” which is published two weeks before each meeting in what most people call the “Beige Book,” for the color of its cover when originally printed. One Federal Reserve Bank, designated on a rotating basis, publishes the overall summary of the 12 District reports. The Reserve Bank presidents have also gathered information by talking with executives in a variety of business sectors and through meetings with the Banks’ boards of directors and advisory councils.

This first go-round covers valuable information about economic activity throughout the country, measured in hard data and recent anecdotal information, as well as the analysis and interpretation conveyed by the policymakers sitting around the table. This is a key way in which each region of the U.S. has input into the making of national monetary policy. This portion of the meeting concludes with the FOMC Chair summarizing the discussion and providing the Chair’s own view of the economy. At this point, the policy discussion begins with the Federal Reserve Board’s director of the Division of Monetary Affairs, who outlines the Committee’s various policy options.

The outlook options could include no change, an increase, or a decrease in the federal funds rate target. Each option is described, along with a clear rationale, the pros and cons, and some alternatives for how the Committee could explain its decision in a public statement to be released that afternoon. Then, there is a second go-round. The Reserve Bank presidents and Governors each make the best case for the policy alternative they prefer, given current economic conditions and their personal outlook for the economy. They also comment on how they think the statement explaining the decision should be worded. One of the most important aspects of an FOMC meeting is that all voices matter. The analysis and viewpoints of each committee participant - whether a voting member or not - play an instrumental role in the FOMC’s policy decisions.

At the end of this policy go-round, the Chair summarizes a proposal for action based on the Committee’s discussion, as well as a proposed statement to explain the policy decision. The Fed Governors and presidents then get a chance to question or comment on the Chair’s proposed approach. Once a motion for a decision is on the table, the Committee tries to come to a consensus through its deliberations. Although the final decision is most often one that all can support, there are times when some differences of opinion may remain, and voting members may dissent. At the end of the policy discussion, all seven of the Fed Governors and the five voting Reserve Bank presidents cast a formal vote on the proposed decision and the wording of the statement.

2.2.3 Announcing the Policy Decision

After the vote has been taken, the FOMC publicly announces its policy decision at 2:15 p.m. The announcement includes the federal funds rate target, the statement explaining its actions, and the vote tally, including the names of the voters and the preferred action of those who dissented.

In addition, the FOMC releases its official minutes three weeks after each meeting. The minutes include a more complete explanation of the views expressed, which allows the public to get a better sense of the range of views within the FOMC and promotes awareness and understanding of how monetary policy is made. In recent years, the FOMC has improved communications with the public. What’s more, the FOMC now releases Committee participants’ projections for the economy and inflation four times a year, which provides added insight into the policymakers’ perspectives.

2.2.4 Implementing Policy

Once the FOMC establishes a target for the federal funds rate, the Open Market Trading Desk at the Federal Reserve Bank of New York conducts daily open market operations - buying or selling U.S. government securities on the open market - as necessary to achieve the federal funds rate target. Open market operations affect the amount of money and credit available in the banking system, thereby affecting interest rates, which in turn affect the spending decisions of households and businesses and ultimately the overall performance of the U.S. economy.

2.2.5 Connecting To Our Project

This detailed description of the FOMC serves two purposes: (a) to describe the monetary policymaking activities of the FOMC, and (b) to identify the dataset we will analyze. We focus exclusively on the FOMC policy statements released at 2:15 p.m. ET. Anecdotally, these policy statements have the greatest short-term impact on financial markets and the greatest potential for surprise. In the next section, we identify past research that examines the FOMC policy statements from a data science perspective.

2.3 Past Research

Our project is inspired and guided by past work in this field. There is a research literature on FOMC statement analysis using data science methods. A related but distinct literature on the financial market impact of central bank communications provides additional motivation. Moreover, the authors are aware of several financial institutions and companies that use machine learning techniques to analyze central bank communications. We will touch on these in turn.

Cannon's 2015 paper on FOMC sentiment analysis uses the meeting transcripts instead of the FOMC statements. He uses the Loughran-McDonald financial dictionary to construct sentiment, defining it with a bag-of-words count. The R package we use to derive sentiment calculates the same metric, namely:

\(sentiment(doc) = \frac{P-N}{P+N}\), where \(P\) and \(N\) are the numbers of positive and negative words in the document, respectively.
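For concreteness, a minimal sketch of this bag-of-words metric, computed with tidytext and the Loughran-McDonald lexicon over the scraped reports data frame built in Section 3 (get_sentiments("loughran") may prompt a one-time lexicon download via the textdata package; the object and column names below are illustrative, not the project's final implementation):

library(dplyr)
library(tidyr)
library(tidytext)

# Keep only the positive and negative word lists from the Loughran-McDonald lexicon
lm_dict <- get_sentiments("loughran") %>%
  filter(sentiment %in% c("positive", "negative"))

# Tokenize each statement, count dictionary hits and form (P - N) / (P + N) per statement
statement_sentiment <- reports %>%
  unnest_tokens(word, statement.content) %>%
  inner_join(lm_dict, by = "word") %>%
  count(statement.dates, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment_score = (positive - negative) / (positive + negative))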

However, Cannon compares the sentiment index against real economic activity, proxied by the Chicago Fed National Activity Index, rather than against financial market variables. A 2011 paper by Lucca and Trebbi analyzes the FOMC statements directly but measures hawkish and dovish tone rather than sentiment. In addition, their measurement technique uses search engines (Google or Factiva) to compute, for each relevant word or N-gram in the policy statement, a correlation of word count hits in the search engine's corpus with the words "hawk" or "dove". They call their approach a semantic orientation method. This measurement technique is not reproducible and is computationally impractical. Another paper, by Schmeling and Wagner (2019), analyzes sentiment in European Central Bank (ECB) policy statements, which are in English, using the Loughran-McDonald dictionary. They find that tone does seem to affect equity risk premia through a risk-based channel: higher-beta stocks respond more to ECB tone than lower-beta stocks. They also find that corporate bond credit spreads between BBB and AAA rated bonds tighten when tone is positive. Lastly, Fuksa and Sornette (2012) analyze the FOMC Beige Book, minutes and policy statements for sentiment. They find predictive power in the Beige Book, which is released ahead of the policy statement; thus, analyzing Beige Book sentiment could help predict FOMC policy actions.

A separate literature on central banks' impact on asset prices finds that FOMC meetings and statements are important. Cieslak et al. (2018) find that US and global stock returns are driven by the FOMC meeting cycle: since 1994, the equity premium has been earned in even-numbered weeks of the FOMC cycle. However, Brusa et al. (2017) find evidence that no other central bank has the same equity market impact as the Fed.

3. Data

3.1 FOMC Statements

3.1.2 Data Staging - prepare metadata for data extraction and create a dataframe
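The staging code below assumes a character vector links containing the URLs of the individual FOMC statement press releases; the step that collects those links is not shown in this section. Purely as an illustration, a hedged sketch of how such a vector might be gathered with RCurl and XML (the index URL, XPath and regex pattern are assumptions, not the project's actual extraction step):

# Hypothetical sketch: collect FOMC statement URLs from a Federal Reserve index page.
index.url <- "https://www.federalreserve.gov/monetarypolicy/fomccalendars.htm"
index.html <- getURL(index.url)
index.tree <- htmlTreeParse(index.html, useInternal = TRUE)
hrefs <- unlist(xpathApply(index.tree, path = "//a/@href"))
# Keep only monetary policy statement pages (monetaryYYYYMMDDa.htm) and build absolute URLs
links <- paste0("https://www.federalreserve.gov",
                grep("monetary[0-9]{8}a\\.htm", hrefs, value = TRUE))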

# Extract year of publication from the statement's release date, and create a data frame with date, year and URL. 
statement.dates<-NULL
year<-NULL
for(i in seq(from=1, to=length(links))) {
  statement.dates[i]<-(str_extract(links[i],"[[:digit:]]+"))
  year[i]<-substr(statement.dates[i],1,4)
}
reports<-data.frame(year,statement.dates, links)
# Convert factors to characters
reports <- reports %>% mutate_if(is.factor, as.character) %>% arrange(statement.dates)

3.1.3 Data Extraction via web-scraping

# Loop through the statement links and scrape the content from the Federal Reserve website.
# Discard irrelevant portions of the extracted content i.e. preliminary paragraphs and last paragraph.
statement.content<-NULL
statement.length<-NULL
for(i in seq(from=1, to=length(reports$links))) {
  stm.url<-getURL(reports$links[i])
  stm.tree<-htmlTreeParse(stm.url, useInternal=TRUE)
  stm.tree.parse<-unlist(xpathApply(stm.tree, path="//p", fun=xmlValue))
  n<-(which(!is.na(str_locate(stm.tree.parse, "release")))+1)[1]
  l<-length(stm.tree.parse)-1
  # Condense separate paragraphs into one element per statement date
  reports$statement.content[i]<-paste(stm.tree.parse[n:l], collapse = "")
  # Remove line breaks
  reports$statement.content[i]<-gsub("\r?\n|\r"," ",reports$statement.content[i])
  #reports$statement.content[i]<-gsub("\\.+\\;+\\,+","",reports$statement.content[i])
  # Count number of characters per statement
  reports$statement.length[i]<-nchar(reports$statement.content[i])
  #reports$statement.length[i]<-wordcount(reports$statement.content[i], sep = " ", count.function = sum)
}
# Create R data object
saveRDS(reports, file = "fomc_data.rds")

3.1.4 Data cleansing - correct a statement date

# Correct the date for one statement, because the URL is not in sync with the actual date inside the statement content
reports$statement.dates[match(c("20070618"),reports$statement.dates)]<-"20070628"

3.2 Human Classification

We use specialist knowledge of financial markets to read and manually label all 102 FOMC statements from 2007 to May 2019. Specifically, five attributes were reviewed and manually collected into a CSV file. This file was then merged with the web-scraped FOMC data to build a classification-enriched dataset. In this section, we demonstrate the data wrangling steps to merge the two data sets. We then define each of the possible outcomes for each of the five attributes and illustrate them with examples from various statements. Providing transparency into the classification method is essential for understanding the challenges that interpreting "FedSpeak" poses even to human judgment.

3.2.1 Wrangling the Data into a Merged Dataset

First we load the FOMC statement data set into memory as a dataframe.

d4<-readRDS(file = "fomc_data.rds")
dim(d4)
## [1] 102   5
str(d4)
## 'data.frame':    102 obs. of  5 variables:
##  $ year             : chr  "2007" "2007" "2007" "2007" ...
##  $ statement.dates  : chr  "20070131" "20070321" "20070509" "20070618" ...
##  $ links            : chr  "https://www.federalreserve.gov/newsevents/pressreleases/monetary20070131a.htm" "https://www.federalreserve.gov/newsevents/pressreleases/monetary20070321a.htm" "https://www.federalreserve.gov/newsevents/pressreleases/monetary20070509a.htm" "https://www.federalreserve.gov/newsevents/pressreleases/monetary20070618a.htm" ...
##  $ statement.content: chr  "The Federal Open Market Committee decided today to keep its target for the federal funds rate at 5-1/4 percent."| __truncated__ "The Federal Open Market Committee decided today to keep its target for the federal funds rate at 5-1/4 percent."| __truncated__ "The Federal Open Market Committee decided today to keep its target for the federal funds rate at 5-1/4 percent."| __truncated__ "The Federal Open Market Committee decided today to keep its target for the federal funds rate at 5-1/4 percent."| __truncated__ ...
##  $ statement.length : int  1155 1098 1087 1179 1388 864 1710 1977 1903 1669 ...

We explicitly import the Date column as a string because we will join the two dataframes on this column. In other words, we use a date column in "yyyymmdd" string format as the common key for joining the disparate datasets.

classificationFile = "https://raw.githubusercontent.com/completegraph/DATA607FINAL/master/Code/Classification_FOMC_Statements.csv"
cls = read_csv(classificationFile , col_types = cols( Date = col_character() ) )
cls %>% rename( Economic.Growth = "Economic Growth", Employment.Growth = "Employment Growth", Medium.Term.Rate = "Medium Term Rate", Policy.Rate = "Policy Rate") -> cls
str(cls)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 102 obs. of  8 variables:
##  $ Index            : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ Date2            : chr  "1/31/07" "3/21/07" "5/9/07" "6/28/07" ...
##  $ Date             : chr  "20070131" "20070321" "20070509" "20070628" ...
##  $ Economic.Growth  : chr  "Up" "Flat" "Down" "Up" ...
##  $ Employment.Growth: chr  "Flat" "Flat" "Flat" "Flat" ...
##  $ Inflation        : chr  "Down" "Up" "Up" "Down" ...
##  $ Medium.Term.Rate : chr  "Hawk" "Hawk" "Hawk" "Hawk" ...
##  $ Policy.Rate      : chr  "Flat" "Flat" "Flat" "Flat" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Index = col_double(),
##   ..   Date2 = col_character(),
##   ..   Date = col_character(),
##   ..   `Economic Growth` = col_character(),
##   ..   `Employment Growth` = col_character(),
##   ..   Inflation = col_character(),
##   ..   `Medium Term Rate` = col_character(),
##   ..   `Policy Rate` = col_character()
##   .. )

3.2.2 Merging FOMC data and Classification Data

d4 %>% inner_join( cls , by = c("statement.dates" = "Date")) %>%
  mutate( date_mdy = mdy(Date2)) %>%
  select(Index, 
         year ,
         statement.dates, 
         links, 
         statement.content, 
         statement.length ,
         date_mdy,
         Economic.Growth,
         Employment.Growth,
         Inflation,
         Medium.Term.Rate,
         Policy.Rate ) -> mgData
str(mgData)
## 'data.frame':    101 obs. of  12 variables:
##  $ Index            : num  1 2 3 5 6 7 8 9 10 11 ...
##  $ year             : chr  "2007" "2007" "2007" "2007" ...
##  $ statement.dates  : chr  "20070131" "20070321" "20070509" "20070807" ...
##  $ links            : chr  "https://www.federalreserve.gov/newsevents/pressreleases/monetary20070131a.htm" "https://www.federalreserve.gov/newsevents/pressreleases/monetary20070321a.htm" "https://www.federalreserve.gov/newsevents/pressreleases/monetary20070509a.htm" "https://www.federalreserve.gov/newsevents/pressreleases/monetary20070807a.htm" ...
##  $ statement.content: chr  "The Federal Open Market Committee decided today to keep its target for the federal funds rate at 5-1/4 percent."| __truncated__ "The Federal Open Market Committee decided today to keep its target for the federal funds rate at 5-1/4 percent."| __truncated__ "The Federal Open Market Committee decided today to keep its target for the federal funds rate at 5-1/4 percent."| __truncated__ "The Federal Open Market Committee decided today to keep its target for the federal funds rate at 5-1/4 percent."| __truncated__ ...
##  $ statement.length : int  1155 1098 1087 1388 864 1710 1977 1903 1669 1653 ...
##  $ date_mdy         : Date, format: "2007-01-31" "2007-03-21" ...
##  $ Economic.Growth  : chr  "Up" "Flat" "Down" "Up" ...
##  $ Employment.Growth: chr  "Flat" "Flat" "Flat" "Flat" ...
##  $ Inflation        : chr  "Down" "Up" "Up" "Flat" ...
##  $ Medium.Term.Rate : chr  "Hawk" "Hawk" "Hawk" "Hawk" ...
##  $ Policy.Rate      : chr  "Flat" "Flat" "Flat" "Flat" ...

Let us view a sample of the classified data from the statements:

mgData %>% select( Index, date_mdy, Economic.Growth, Employment.Growth, Inflation, Medium.Term.Rate, Policy.Rate) %>% kable() %>% kable_styling(bootstrap_options = c("hover", "striped")) %>%
scroll_box(width = "90%", height = "300px")
Index date_mdy Economic.Growth Employment.Growth Inflation Medium.Term.Rate Policy.Rate
1 2007-01-31 Up Flat Down Hawk Flat
2 2007-03-21 Flat Flat Up Hawk Flat
3 2007-05-09 Down Flat Up Hawk Flat
5 2007-08-07 Up Flat Flat Hawk Flat
6 2007-08-17 Down Flat Flat Dove Flat
7 2007-09-18 Flat Flat Down Dove Lower
8 2007-10-31 Up Flat Down Dove Lower
9 2007-12-11 Down Flat Down Dove Lower
10 2008-01-22 Down Down Flat Dove Lower
11 2008-01-30 Down Down Flat Dove Lower
12 2008-03-18 Down Down Up Dove Lower
13 2008-04-30 Down Down Up Dove Lower
14 2008-06-25 Up Down Up Hawk Flat
15 2008-08-05 Up Down Up Hawk Flat
16 2008-09-16 Down Down Up Dove Flat
17 2008-10-08 Down Flat Down Dove Lower
18 2008-10-29 Down Flat Down Dove Lower
19 2008-12-16 Down Down Down Dove Lower
20 2009-01-28 Down Down Down Dove Flat
21 2009-03-18 Down Down Down Dove Flat
22 2009-04-29 Down Down Flat Dove Flat
23 2009-06-24 Down Down Down Dove Flat
24 2009-08-12 Flat Down Down Dove Flat
25 2009-09-23 Up Down Down Dove Flat
26 2009-11-04 Up Down Down Dove Flat
27 2009-12-16 Up Flat Down Dove Flat
28 2010-01-27 Flat Flat Down Dove Flat
29 2010-03-16 Flat Flat Down Dove Flat
30 2010-04-28 Flat Flat Flat Dove Flat
31 2010-06-23 Flat Up Down Dove Flat
32 2010-08-10 Down Down Down Dove Flat
33 2010-09-21 Down Down Flat Dove Flat
34 2010-11-03 Flat Flat Down Dove Flat
35 2010-12-14 Flat Down Down Dove Flat
36 2011-01-26 Flat Down Down Dove Flat
37 2011-03-15 Up Flat Up Dove Flat
38 2011-04-27 Up Flat Up Dove Flat
39 2011-06-22 Flat Down Up Dove Flat
40 2011-08-09 Flat Down Down Dove Flat
41 2011-09-21 Flat Down Down Dove Flat
42 2011-11-02 Up Down Flat Dove Flat
43 2011-12-13 Up Flat Down Dove Flat
44 2012-01-25 Up Up Flat Dove Flat
45 2012-03-13 Up Up Flat Dove Flat
46 2012-04-25 Up Up Up Dove Flat
47 2012-06-20 Up Flat Down Dove Flat
48 2012-08-01 Down Flat Down Dove Flat
49 2012-09-13 Up Flat Flat Dove Flat
50 2012-10-24 Up Flat Up Dove Flat
51 2012-12-12 Up Up Flat Dove Flat
52 2013-01-30 Flat Up Flat Dove Flat
53 2013-03-20 Up Up Flat Dove Flat
54 2013-05-01 Up Up Flat Dove Flat
55 2013-06-19 Up Up Down Dove Flat
56 2013-07-31 Up Up Flat Dove Flat
57 2013-09-18 Up Up Flat Dove Flat
58 2013-10-30 Up Up Flat Dove Flat
59 2013-12-18 Up Up Flat Dove Flat
60 2014-01-29 Up Up Flat Dove Flat
61 2014-03-19 Down Up Flat Dove Flat
62 2014-04-30 Up Up Flat Dove Flat
63 2014-06-18 Up Up Flat Dove Flat
64 2014-07-30 Up Up Up Dove Flat
65 2014-09-17 Up Up Flat Dove Flat
66 2014-10-29 Up Up Down Dove Flat
67 2014-12-17 Up Up Down Dove Flat
68 2015-01-28 Up Up Down Dove Flat
69 2015-03-18 Flat Up Down Dove Flat
70 2015-04-29 Down Flat Flat Dove Flat
71 2015-06-17 Up Up Flat Dove Flat
72 2015-07-29 Up Up Flat Dove Flat
73 2015-09-17 Up Up Down Dove Flat
74 2015-10-28 Up Flat Down Dove Flat
75 2015-12-16 Up Up Flat Dove Raise
76 2016-01-27 Down Up Down Dove Flat
77 2016-03-16 Up Up Up Dove Flat
78 2016-04-27 Down Up Flat Dove Flat
79 2016-06-15 Up Flat Down Dove Flat
80 2016-07-27 Up Up Flat Dove Flat
81 2016-09-21 Up Up Flat Dove Flat
82 2016-11-02 Flat Up Flat Dove Flat
83 2016-12-14 Up Up Up Dove Raise
84 2017-02-01 Up Up Up Dove Flat
85 2017-03-15 Up Up Flat Hawk Raise
86 2017-05-03 Down Up Down Hawk Flat
87 2017-06-14 Up Up Down Hawk Raise
88 2017-07-26 Up Up Down Hawk Flat
89 2017-09-20 Up Up Down Hawk Flat
90 2017-11-01 Up Up Down Hawk Flat
91 2017-12-13 Up Up Down Hawk Raise
92 2018-01-31 Up Up Flat Hawk Flat
93 2018-03-21 Up Up Up Hawk Raise
94 2018-05-02 Up Up Flat Hawk Flat
95 2018-06-13 Up Up Flat Hawk Raise
96 2018-08-01 Up Up Flat Hawk Flat
97 2018-09-26 Up Up Flat Hawk Raise
98 2018-11-08 Up Up Flat Hawk Flat
99 2018-12-19 Up Up Flat Hawk Raise
100 2019-01-30 Up Up Down Dove Flat
101 2019-03-20 Flat Flat Down Dove Flat
102 2019-05-01 Up Up Flat Dove Flat

3.2.3 Exporting the Merged Data Frame

We export the merged dataframe as a single RDS object for research use.

rds_filename = "fomc_merged_data_v2.rds"
saveRDS(mgData, file = rds_filename)

3.2.4 Economic Growth

The attribute Economic.Growth is assigned one of three classifications: Up, Flat or Down. It refers to the near-term trend in economic growth since the last FOMC meeting or within the last quarter (whichever is mentioned). Most statements give an explicit assessment of economic growth in the first three sentences. An example of an UP classification is the July 29, 2015 statement (the coloring below is ours):

Information received since the Federal Open Market Committee met in June indicates that \(\color{red}{\text{economic activity has been expanding moderately in recent months}}\).

An example of a FLAT classification is the March 18, 2015 statement:

Information received since … January suggests that \(\color{red}{\text{economic growth has moderated somewhat.}}\)

An example of a DOWN classification is in the May 3, 2017 statement:

Information received since … March indicates that … \(\color{red}{\text{growth in economic activity slowed.}}\)

Rarely does the FOMC statement exclude an assessment of near-term economic growth trends in the US.

3.2.5 Employment Growth

The attribute Employment.Growth refers to the near-term trend of the labor market in the US. We use the same classification values as for Economic.Growth. If the labor market indicators are improving, we mark the attribute as UP. This requires a decrease in the unemployment rate (if stated) and/or an increase in job creation. These two key indicators broadly define the health of the labor market.

An example of an UP classification is in the Feb 1, 2017 statement:

the labor market has continued to strengthen … Job gains remained solid and the unemployment rate stayed near its recent low.

An example of a FLAT classification is the Dec 13, 2011 statement, where the indicators are mixed:

While indicators point to some improvement in overall labor market conditions, the unemployment rate remains elevated.

An example of a DOWN classification is the April 29, 2009 statement, where the labor market is discussed indirectly:

Household spending has shown signs of stabilizing but remains constrained by \(\color{red}{\text{ongoing job losses}}\), lower housing wealth, and tight credit. Weak sales prospects and difficulties in obtaining credit have led businesses to cut back on inventories, fixed investment, and \(\color{red}{\text{staffing.}}\)

Sometimes the FOMC statement does not mention labor market conditions. In this case, we assume the information is irrelevant or not a concern and assign a FLAT classification.

3.2.6 Inflation

When measuring inflation, we refer to the realized price movement of core PCE (where available) in the period since the last FOMC meeting. When this is not explicitly stated, we look at overall price movements (including food and energy) since the last meeting. Where that too is unstated, we rely on market-driven indicators of medium-term inflation risk as described in the statement. We do not rely on shifts in long-term inflation expectations. Of the various attributes from the FOMC statements, this indicator is the most challenging to classify due to the multiple dimensions of inflation.

An example of a UP classification comes from the April 27, 2011 meeting:

\(\color{red}{\text{Commodity prices have risen}}\) significantly since last summer, and concerns about global supplies of crude oil have contributed to a further \(\color{red}{\text{increase in oil prices}}\) since the Committee met in March. \(\color{red}{\text{Inflation has picked up}}\) in recent months, but longer-term inflation expectations have remained stable and measures of underlying inflation are still subdued.

An example of a FLAT classification comes from the November 8, 2018 meeting. Note that the FOMC views a 2 percent inflation rate as the natural rate of inflation, thus inflation near 2 percent is perceived as flat. FLAT refers to either an absence of information or a rate near the natural rate.

On a 12-month basis, \(\color{red}{\text{both overall inflation and inflation for items other than food and energy remain near 2 percent}}\). Indicators of longer-term inflation expectations are little changed, on balance.

An example of a DOWN classification comes from the Jan 28, 2009 statement, during the depths of the financial crisis.

In light of the \(\color{red}{\text{declines in the prices}}\) of energy and other commodities in recent months and the prospects for considerable economic slack, the Committee expects that \(\color{red}{\text{inflation pressures}}\) will remain subdued in coming quarters. Moreover, the Committee sees some risk that inflation could persist for a time below rates that best foster economic growth and price stability in the longer term.

3.2.7 Medium Term Outlook

The FOMC tries to provide guidance on where it believes the target fed funds rate will be positioned over the next one to two years, based on current information. The Medium.Term.Rate attribute attempts to capture this guidance.

An example of a HAWK classification comes from the March 15, 2017 statement:

The Committee expects that economic conditions will evolve in a manner that will warrant \(\color{red}{\text{gradual increases in the federal funds rate}}\); the federal funds rate is likely to remain, for some time, below levels that are expected to prevail in the longer run. However, the actual path of the federal funds rate will depend on the economic outlook as informed by incoming data.

An example of a DOVE classification comes from the Sept 17, 2014 statement:

the Committee today reaffirmed its view that a \(\color{red}{\text{highly accommodative stance of monetary policy remains appropriate}}\). In determining how long to maintain the current 0 to 1/4 percent target range for the federal funds rate, the Committee will assess progress–both realized and expected–toward its objectives of maximum employment and 2 percent inflation. This assessment will take into account a wide range of information, including measures of labor market conditions, indicators of inflation pressures and inflation expectations, and readings on financial developments. The Committee continues to anticipate, based on its assessment of these factors, that it likely will be appropriate to \(\color{red}{\text{maintain the current target range for the federal funds rate for a considerable time}}\) after the asset purchase program ends, especially if projected inflation continues to run below the Committee’s 2 percent longer-run goal, and provided that longer-term inflation expectations remain well anchored.

3.2.8 Policy Rate

This last attribute is objective rather than subjective. It identifies whether the FOMC decided to raise, keep unchanged, or lower the federal funds target rate, and the classification (Raise, Flat or Lower) is assigned on that basis.
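Because this attribute is mechanical, it could in principle be reproduced from the published target rate series rather than read from the text. A minimal sketch, assuming a hypothetical data frame fedTarget with a DATE column and a TARGET column (for example, a FRED target rate series restricted to statement dates); these names are assumptions, not part of the project data:

library(dplyr)
# Hypothetical sketch: label each meeting by the change in the target rate.
policy.moves <- fedTarget %>%
  arrange(DATE) %>%
  mutate(change = TARGET - lag(TARGET),
         Policy.Rate = case_when(change > 0 ~ "Raise",
                                 change < 0 ~ "Lower",
                                 TRUE       ~ "Flat"))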

3.3 Financial Time Series

Our observation period for the FOMC data is 2007-2019. We wanted a period that covers both hawkish and dovish regimes and that includes enough observations, spanning the 2008 financial crisis and a full business cycle, to be useful for statistical estimation in the regression analysis.

The selection of our time series data from FRED followed specific criteria. First, we need a public source of financial time series data; FRED meets that requirement. Second, we need time series that are relevant to our analysis. The three financial time series described below meet those requirements.

The US Treasury 10 Year yield is associated with the bellwether fixed income asset. It is possibly the most followed bond yield in the world. The breadth of its historical data is more than sufficient to cover the observation period of our study.

The Russell 1000 Index, which we will use below, is a liquid and large cross-section of the most well-known, large-capitalization US stocks. The particular time series is a total return index (dividends are assumed to be reinvested).

The US federal funds target rates are the policy rates of the FOMC, and they can be changed at each FOMC meeting. Prior to 2014, the FOMC published a single point target for the rate. After 2014, the FOMC decided to publish a tight range with an upper and lower bound on the federal funds rate, which gives the Open Market Trading Desk some latitude to buy and sell Treasuries within that range.

4. Exploratory Analysis

4.1 Statements

4.1.1 Analyze FOMC statement lengths and word frequency

# Compute total statement length per year by aggregating across individual statements
# (note: statement.length was computed with nchar(), so these totals are character counts rather than word counts)
yearly.length<-reports %>% group_by(year) %>% summarize(words.per.year=sum(statement.length))
yearly.length
## # A tibble: 13 x 2
##    year  words.per.year
##    <chr>          <int>
##  1 2007           12361
##  2 2008           19660
##  3 2009           23410
##  4 2010           24857
##  5 2011           26634
##  6 2012           27816
##  7 2013           40310
##  8 2014           46081
##  9 2015           32005
## 10 2016           30787
## 11 2017           28423
## 12 2018           19457
## 13 2019            6845

As can be seen, the total statement length was the highest for the year 2014. As expected, the count for 2019 is low because the year is still in progress and there have been only 3 meetings so far this year.

# Graph the total statement length per year
ggplot(yearly.length, aes(x=year, y=words.per.year)) + geom_bar(stat="identity", fill="darkblue", colour="black") + coord_flip() + xlab("Year") + ylab("Statement Length (characters)")

#Verify word count for a sample word in a sample statement
sample<-reports%>%filter(reports$statement.dates=="20140319")
sample[,4]
## [1] "        Information received since the Federal Open Market Committee met in January indicates that growth in economic activity slowed during the winter months, in part reflecting adverse weather conditions. Labor market indicators were mixed but on balance showed further improvement. The unemployment rate, however, remains elevated. Household spending and business fixed investment continued to advance, while the recovery in the housing sector remained slow. Fiscal policy is restraining economic growth, although the extent of restraint is diminishing. Inflation has been running below the Committee's longer-run objective, but longer-term inflation expectations have remained stable.             Consistent with its statutory mandate, the Committee seeks to foster maximum employment and price stability. The Committee expects that, with appropriate policy accommodation, economic activity will expand at a moderate pace and labor market conditions will continue to improve gradually, moving toward those the Committee judges consistent with its dual mandate. The Committee sees the risks to the outlook for the economy and the labor market as nearly balanced. The Committee recognizes that inflation persistently below its 2 percent objective could pose risks to economic performance, and it is monitoring inflation developments carefully for evidence that inflation will move back toward its objective over the medium term.             The Committee currently judges that there is sufficient underlying strength in the broader economy to support ongoing improvement in labor market conditions. In light of the cumulative progress toward maximum employment and the improvement in the outlook for labor market conditions since the inception of the current asset purchase program, the Committee decided to make a further measured reduction in the pace of its asset purchases. Beginning in April, the Committee will add to its holdings of agency mortgage-backed securities at a pace of $25 billion per month rather than $30 billion per month, and will add to its holdings of longer-term Treasury securities at a pace of $30 billion per month rather than $35 billion per month. The Committee is maintaining its existing policy of reinvesting principal payments from its holdings of agency debt and agency mortgage-backed securities in agency mortgage-backed securities and of rolling over maturing Treasury securities at auction. The Committee's sizable and still-increasing holdings of longer-term securities should maintain downward pressure on longer-term interest rates, support mortgage markets, and help to make broader financial conditions more accommodative, which in turn should promote a stronger economic recovery and help to ensure that inflation, over time, is at the rate most consistent with the Committee's dual mandate.             The Committee will closely monitor incoming information on economic and financial developments in coming months and will continue its purchases of Treasury and agency mortgage-backed securities, and employ its other policy tools as appropriate, until the outlook for the labor market has improved substantially in a context of price stability. If incoming information broadly supports the Committee's expectation of ongoing improvement in labor market conditions and inflation moving back toward its longer-run objective, the Committee will likely reduce the pace of asset purchases in further measured steps at future meetings. 
However, asset purchases are not on a preset course, and the Committee's decisions about their pace will remain contingent on the Committee's outlook for the labor market and inflation as well as its assessment of the likely efficacy and costs of such purchases.             To support continued progress toward maximum employment and price stability, the Committee today reaffirmed its view that a highly accommodative stance of monetary policy remains appropriate. In determining how long to maintain the current 0 to 1/4 percent target range for the federal funds rate, the Committee will assess progress--both realized and expected--toward its objectives of maximum employment and 2 percent inflation. This assessment will take into account a wide range of information, including measures of labor market conditions, indicators of inflation pressures and inflation expectations, and readings on financial developments. The Committee continues to anticipate, based on its assessment of these factors, that it likely will be appropriate to maintain the current target range for the federal funds rate for a considerable time after the asset purchase program ends, especially if projected inflation continues to run below the Committee's 2 percent longer-run goal, and provided that longer-term inflation expectations remain well anchored.             When the Committee decides to begin to remove policy accommodation, it will take a balanced approach consistent with its longer-run goals of maximum employment and inflation of 2 percent. The Committee currently anticipates that, even after employment and inflation are near mandate-consistent levels, economic conditions may, for some time, warrant keeping the target federal funds rate below levels the Committee views as normal in the longer run.             With the unemployment rate nearing 6-1/2 percent, the Committee has updated its forward guidance. The change in the Committee's guidance does not indicate any change in the Committee's policy intentions as set forth in its recent statements.             Voting for the FOMC monetary policy action were: Janet L. Yellen, Chair; William C. Dudley, Vice Chairman; Richard W. Fisher; Sandra Pianalto; Charles I. Plosser; Jerome H. Powell; Jeremy C. Stein; and Daniel K. Tarullo.             Voting against the action was Narayana Kocherlakota, who supported the sixth paragraph, but believed the fifth paragraph weakens the credibility of the Committee's commitment to return inflation to the 2 percent target from below and fosters policy uncertainty that hinders economic activity.             Statement Regarding Purchases of Treasury Securities and Agency Mortgage-Backed Securities      Board of Governors of the Federal Reserve System"
str_count(sample, pattern="inflation")
## [1]  0  0  0 15  0

4.1.2 Trend in Statement Length by year and Fed Chair

It seems that the FOMC statements became progressively more verbose under Chairman Bernanke until they reached a peak in 2014, when Janet Yellen took over as Fed Chair. This can be attributed to the fact that during 2014 there was a lot of discussion around when the Fed would end the quantitative easing measures it had put in place to combat the recession that followed the financial crisis. There were two schools of thought: one felt that the time was right for the Fed to start trimming its large balance sheet, and the other wanted to wait a bit longer for more definite signs of growth before starting to reverse the quantitative easing measures. So the Fed tried to provide more transparency into its thinking, which resulted in longer FOMC statements.

Since 2014, the statements have gotten shorter. The current chairman Jerome Powell took over in February 2018.

# Graph the annual trend in statement length, annotated by Fed Chair
p<-ggplot(reports, aes(x=year,y=statement.length))+geom_point(stat="identity",color=statement.dates)+scale_fill_brewer(palette="Pastel1")+theme(legend.position="right")+xlab("Year") + ylab("Length of Statement")
p + ggplot2::annotate("text", x = 4,y = 5000, label = "Bernanke", family="serif", fontface="bold", colour="blue", size=4)+ggplot2::annotate("text", x=10, y=5500, label="Yellen", family="serif", fontface="bold", colour="darkred",size=4)+ggplot2::annotate("text", x=13, y=3600, label="Powell", family="serif", fontface="bold", colour="black",size=4)+ggplot2::annotate("segment", x = 0, xend = 8.1, y = 2700, yend = 6500, colour = "blue", size=1, arrow=arrow(ends="both"))+ggplot2::annotate("segment", x = 8.1, xend = 12.1, y = 6500, yend = 3200, colour = "darkred", size=1, arrow=arrow(ends="both"))+ggplot2::annotate("segment", x = 12.1, xend = 14, y = 3200, yend = 3200, colour = "black", size=1, arrow=arrow(ends="both"))

4.1.3 Adding custom words and names to the list of stop words

We remove proper nouns and irrelevant words from further analysis by adding them as custom words to the stop word lexicon.

# Add custom words to the stop words list to exclude proper nouns/names and words such as "committee" which would provide no meaningful insight into the statement's sentiment analysis
#print(stop_words)
words<-c("committee", "ben", "geithner", "bernanke", "timothy", "hoenig", "thomas", "donald", "kevin", "mishkin", "kroszner", "kohn", "charles", "frederic")
lexicon<-c("Custom")
my.stop_words<-data.frame(words, lexicon)
colnames(my.stop_words)<-c("word","lexicon")
new.stop_words <- rbind(my.stop_words, stop_words)
new.stop_words$word<-as.character(new.stop_words$word)
new.stop_words$lexicon<-as.character(new.stop_words$lexicon)
head(new.stop_words)
##        word lexicon
## 1 committee  Custom
## 2       ben  Custom
## 3  geithner  Custom
## 4  bernanke  Custom
## 5   timothy  Custom
## 6    hoenig  Custom

4.1.4 Cleanse data - remove irrelevant characters and calculate the frequency of the main words per statement date

# Strip out punctuations, white space and custom stop words, and calculate the word frequency by statement date
report.words <- reports %>%
  mutate(date = statement.dates, year = year, text = statement.content) %>%
  unnest(text) %>%
  unnest_tokens(word, text) %>%
  mutate(word = stripWhitespace(gsub("[^A-Za-z ]", " ", word))) %>%
  filter(word != "") %>%
  filter(word != " ") %>%
  anti_join(new.stop_words) %>%
  count(date, year, word, sort = TRUE) %>%
  mutate(frequency = n) %>%
  select(date, year, word, frequency)
## Joining, by = "word"

4.1.5 Verify if the count is correct for a given combination of sample word and statement

# Verify the count for the word "inflation" during the statements published in 2007 
report.words%>%filter(year=='2007', word=='inflation')
## # A tibble: 8 x 4
##   date     year  word      frequency
##   <chr>    <chr> <chr>         <int>
## 1 20070131 2007  inflation         5
## 2 20071031 2007  inflation         5
## 3 20071211 2007  inflation         5
## 4 20070321 2007  inflation         4
## 5 20070509 2007  inflation         4
## 6 20070628 2007  inflation         4
## 7 20070807 2007  inflation         4
## 8 20070918 2007  inflation         3
# Rank most frequent words by year
f_text<-report.words%>% group_by(year,word) %>% summarize(total=sum(frequency))%>%arrange(year,desc(total),word)%>% mutate(rank=row_number())%>%ungroup() %>% arrange(rank,year)
# Select the top 10 ranked words per year
topWords <- f_text %>% filter(rank<11)%>%arrange(year,rank)
print(topWords)
## # A tibble: 130 x 4
##    year  word      total  rank
##    <chr> <chr>     <int> <int>
##  1 2007  inflation    34     1
##  2 2007  federal      31     2
##  3 2007  growth       23     3
##  4 2007  economic     20     4
##  5 2007  action       19     5
##  6 2007  moderate     19     6
##  7 2007  policy       19     7
##  8 2007  chairman     18     8
##  9 2007  rate         13     9
## 10 2007  governors    12    10
## # ... with 120 more rows

4.1.6 Graph the most frequent words per year

# Graph top 10 most frequent words by year
gg <- ggplot(head(topWords, 130), aes(y=total,x=reorder(word,rank))) + geom_col(fill="#27408b") +
  facet_wrap(~year,scales="free", ncol=3)+ coord_flip()+theme_ridges(font_size=11) + 
  labs(x="",y="",title="Most Frequent Words in FOMC Statements grouped by years (2007 - 2019)")
gg

4.1.7 Conclusion

As can be seen from the above analysis, the types of words that show up in the top 10 list are largely the same from year to year. This is because, in almost all cases, the FOMC statements start by making a reference to the previous statement and refer to the common economic parameters that the Committee tracks. So there is a large amount of consistency in how the statements are worded and in the type of terms they employ, and there is no surprise in the most frequently used words. In fact, one could argue that it is the differential, i.e. the new words, which are likely to be among the least frequently used words in the statements, that provides the real information needed for sentiment analysis.
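For illustration, the word-level differential between two consecutive statements can be computed as a simple set difference over the report.words tokens built above (a sketch only; the two dates chosen are the January and March 2014 statements):

# Words appearing in the 2014-03-19 statement but not in the 2014-01-29 statement
words_jan <- report.words %>% filter(date == "20140129") %>% pull(word)
words_mar <- report.words %>% filter(date == "20140319") %>% pull(word)
new_words <- setdiff(words_mar, words_jan)
head(new_words)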

Given this, we do not pursue the word-frequency path further and change tack to other approaches for our analysis.

4.2 Human Classification

4.2.1 Exploratory Data Analysis: Human Classification

We inspect the categorical data to check for distribution and covariation patterns.

mgData<-readRDS(file = "fomc_merged_data_v2.rds")

4.2.2 Frequency Distributions of Each Attribute

gEcon <- ggplot(data=mgData, aes(x=Economic.Growth, fill=Economic.Growth)) + 
  geom_bar() + theme(legend.position = "none")
gEmp  <- ggplot(data=mgData, aes(x=Employment.Growth, fill=Employment.Growth)) + 
  geom_bar() +  theme(legend.position = "none")
gInf  <- ggplot(data=mgData, aes(x=Inflation, fill=Inflation)) + 
  geom_bar() + theme(legend.position = "none")
gRate <- ggplot(data=mgData, aes(x=Medium.Term.Rate, fill=Medium.Term.Rate)) + 
  geom_bar() + theme(legend.position = "none")
gPolicy <- ggplot(data=mgData, aes(x=Policy.Rate, fill=Policy.Rate)) + 
  geom_bar() + theme(legend.position = "none")
grid.arrange(gEcon, gEmp, gInf, gRate, gPolicy, ncol=3, nrow=2 )

Inspecting the above charts, we infer some tendencies and align them with our understanding of the markets; the exact label shares can be tabulated directly, as sketched after this list.

  • Economic.Growth is Up over 60% of the time. This is consistent with the US economy having had a positive growth rate over the last 200 years. Since the long-term trend is positive growth, the histogram is not a surprise.

  • Employment.Growth is Up over 52% of the time. This is likewise consistent with the US economy having positive employment and economic growth.

  • Inflation is Down or Flat about 80% of the time. This is not consistent with the long-term trend; however, over the 2007-2019 period inflation has run below its long-term trend.

  • Medium.Term.Rate is Dove about 80% of the time. This is inconsistent with the long-term trend, where Dove and Hawk are more balanced, but it is consistent with the 2008-2015 period being part of a long recovery cycle.

  • Policy.Rate is Flat over 80% of the time. This is consistent with the FOMC being a patient body that watches economic data and trends before acting: generally, the FOMC does not move rates at most meetings, which is consistent with its long-term history.
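The label shares quoted above can be checked directly from the labeled data; a quick tabulation (rounded proportions over the 101 merged statements) looks like this:

# Proportion of each label per attribute
round(prop.table(table(mgData$Economic.Growth)), 2)
round(prop.table(table(mgData$Employment.Growth)), 2)
round(prop.table(table(mgData$Inflation)), 2)
round(prop.table(table(mgData$Medium.Term.Rate)), 2)
round(prop.table(table(mgData$Policy.Rate)), 2)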

4.2.3 Covariation of Attributes

Now we attempt to measure the degree of covariation between the categorical attributes. Although we would use a correlation matrix if the data were continuous and normally distributed, categorical data defies such an approach. Instead, we can use a statistical measure called Cramer's V, which measures the degree of association between two categorical variables. Its value ranges from 0 (no association) to 1 (perfectly associated). Like correlation, Cramer's V is symmetric in the variables \(x\) and \(y\). We calculate and display this measure using the rcompanion and ggcorrplot packages. A reference to this statistic may be found here: https://en.wikipedia.org/w/index.php?title=Cram%C3%A9r%27s_V&oldid=882900387
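For reference, for a two-way contingency table with \(r\) rows, \(c\) columns, \(n\) observations and chi-squared statistic \(\chi^2\), Cramer's V is defined as:

\(V = \sqrt{\dfrac{\chi^2 / n}{\min(r-1,\, c-1)}}\)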

mgData %>% select(Economic.Growth:Policy.Rate) -> catData  # categorical data
cv = matrix(rep(0,25), nrow=5, ncol=5)  # Allocate a 5x5 matrix of cramerV values initialized to 0.
for(idx in 1:5){
   for(jdx in 1:5){
       cv[idx,jdx] = cramerV(catData[,idx], catData[,jdx])
   }
}
rownames( cv ) = colnames(catData)
colnames( cv ) = colnames(catData)
ggcorrplot(cv, lab=TRUE, ggtheme = ggplot2::theme_classic(), colors=c("violet", "white", "lightgreen")) +
  ggtitle("CramerV Matrix", subtitle="Classification Attributes Comparison")

None of the Cramer's V values is high, suggesting limited dependence among the variables. The strongest association, at 0.44, is between policy rate changes and the medium-term rate outlook. The surprising finding is that inflation is only weakly associated with the medium-term rate outlook and with policy rate changes. One explanation is that inflation has not been the dominant focus of the FOMC during this period: in this last business cycle, the key challenges have been the financial crisis, significant unemployment, and stagnant growth until the last three years, while inflation has drifted sideways.

4.3 Financial Time Series

4.3.1 10-Year Treasury Constant Maturity Rate

Constant maturity is the theoretical value of a U.S. Treasury that is based on recent values of auctioned U.S. Treasuries. The value is obtained by the U.S. Treasury on a daily basis through interpolation of the Treasury yield curve which, in turn, is based on closing bid-yields of actively-traded Treasury securities. It is calculated using the daily yield curve of U.S. Treasury securities. Constant maturity yields are often used by lenders to determine mortgage rates. The one-year constant maturity Treasury index is one of the most widely used, and is mainly used as a reference point for adjustable-rate mortgages (ARMs) whose rates are adjusted annually. Source: Investopedia

# Reading the data
DGS10<-read.csv("https://raw.githubusercontent.com/DataScienceAR/Cuny-Assignments/master/Data-607/Data-Sets/DGS10.csv",stringsAsFactors = FALSE)

# Data Cleanup
str(DGS10)
## 'data.frame':    3223 obs. of  2 variables:
##  $ DATE : chr  "2007-01-02" "2007-01-03" "2007-01-04" "2007-01-05" ...
##  $ DGS10: chr  "4.68" "4.67" "4.62" "4.65" ...
DGS10$DATE<- as_date(DGS10$DATE)
DGS10$DGS10<-as.numeric(DGS10$DGS10)
## Warning: NAs introduced by coercion
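The "NAs introduced by coercion" warning is expected: FRED CSV downloads mark missing observations (for example, market holidays) with a "." character, which as.numeric() converts to NA. If desired, those rows can be dropped before plotting; a minimal sketch (the same applies to the Russell series below):

# Drop rows whose value was "." in the raw FRED file (optional sketch).
DGS10_complete <- DGS10 %>% filter(!is.na(DGS10))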
# Analysis of 10-Year Treasury Constant Maturity Rate

ggplot(data = DGS10)+
  aes(x=DATE,y=`DGS10`)+
  geom_line(size=.98,color="steelblue")+
  labs(x="Date",y="Percent",title="10 Year Constant Maturity Rate")+
  theme(panel.background = element_rect(fill = "white"))

The 10-year US Treasury yield is associated with the bellwether fixed income asset and is possibly the most followed bond yield in the world. The breadth of its historical data is more than sufficient to cover the observation period of our study.

4.3.2 Russell 3000® Total Market Index

The Russell 3000 Index is a market-capitalization-weighted equity index maintained by FTSE Russell that provides exposure to the entire U.S. stock market. The index tracks the performance of the 3,000 largest U.S.-traded stocks, which represent about 98% of all U.S.-incorporated equity securities. The Russell 3000 Index serves as a building block for a broad range of financial products, which include the large-cap Russell 1000 and the small-cap Russell 2000 index. The largest 1,000 stocks of the Russell 3000 constitute the Russell 1000, while the Russell 2000 is a subset of the smallest 2,000 components. Funds tracking the Russell 3000 do not attempt to outperform a benchmark or take a defensive position when the markets appear overvalued; instead, they employ a fully passive strategy. Source: Investopedia

# Reading the data


RU3000TR<-read.csv("https://raw.githubusercontent.com/DataScienceAR/Cuny-Assignments/master/Data-607/Data-Sets/RU3000TR.csv",stringsAsFactors = FALSE)



# Data Cleanup

str(RU3000TR)
## 'data.frame':    3223 obs. of  2 variables:
##  $ DATE    : chr  "2007-01-03" "2007-01-04" "2007-01-05" "2007-01-08" ...
##  $ RU3000TR: chr  "3399.40" "3405.66" "3381.19" "3389.61" ...
RU3000TR$DATE<- as_date(RU3000TR$DATE)
RU3000TR$RU3000TR<-as.numeric(RU3000TR$RU3000TR)
## Warning: NAs introduced by coercion
# Analysis of Russell 3000® Total Market Index

ggplot(data = RU3000TR)+
  aes(x=DATE,y=`RU3000TR`)+
  geom_line(size=.98,color="steelblue")+
  labs(x="Date",y="Percent",title="Russell 3000® Total Market Index")+
  theme(panel.background = element_rect(fill = "white"))

The Russell 3000 Index used above is a liquid and broad cross-section of US-listed stocks. The particular time series is a total return index (dividends are assumed to be reinvested).

4.3.3 Russell 1000® Total Market Index

The Russell 1000 Index is an index of approximately 1,000 of the largest companies in the U.S. equity market. The Russell 1000 is a subset of the Russell 3000 Index and represents the top companies by market capitalization. The Russell 1000 typically comprises approximately 90% of the total market capitalization of all listed U.S. stocks and is considered a bellwether index for large-cap investing. The Russell 1000 is a much broader index than the often-quoted Dow Jones Industrial Average and Standard & Poor's 500 Index, although all three are considered large-cap stock benchmarks. The Russell 1000 is managed by FTSE Russell, which also manages the Russell 3000 and Russell 2000 as well as numerous alternative indexes derived from each. Source: Investopedia

# Reading the data


RU1000TR<-read.csv("https://raw.githubusercontent.com/DataScienceAR/Cuny-Assignments/master/Data-607/Data-Sets/RU1000TR.csv",stringsAsFactors = FALSE)



# Data Cleanup

str(RU1000TR)
## 'data.frame':    3223 obs. of  2 variables:
##  $ DATE    : chr  "2007-01-03" "2007-01-04" "2007-01-05" "2007-01-08" ...
##  $ RU1000TR: chr  "3430.01" "3435.87" "3414.67" "3423.50" ...
RU1000TR$DATE<- as_date(RU1000TR$DATE)
RU1000TR$RU1000TR<-as.numeric(RU1000TR$RU1000TR)
## Warning: NAs introduced by coercion
# Analysis of Russell 1000® Total Market Index

ggplot(data = RU1000TR)+
  aes(x=DATE,y=`RU1000TR`)+
  geom_line(size=.98,color="steelblue")+
  labs(x="Date",y="Percent",title="Russell 1000® Total Market Index")+
  theme(panel.background = element_rect(fill = "white"))

The Russell 1000 Index used above is a liquid and large cross-section of the most well-known, large-capitalization US stocks. The particular time series is a total return index (dividends are assumed to be reinvested).

4.3.4 Federal Funds Target Range

The federal funds rate refers to the interest rate that banks charge other banks for lending them money from their reserve balances on an overnight basis. By law, banks must maintain a reserve equal to a certain percentage of their deposits in an account at a Federal Reserve bank. Any money in their reserve that exceeds the required level is available for lending to other banks that might have a shortfall. Banks and other depository institutions are required to maintain non-interest-bearing accounts at Federal Reserve banks to ensure that they will have enough money to cover depositors’ withdrawals and other obligations. How much money a bank must keep in its account is known as a reserve requirement and is based on a percentage of the bank’s total deposits.

Source: Investopedia

# Reading the data


FEDTARGET<-read.csv("https://raw.githubusercontent.com/DataScienceAR/Cuny-Assignments/master/Data-607/Data-Sets/FEDTARGET.csv",stringsAsFactors = FALSE)



# Data Cleanup

str(FEDTARGET)
## 'data.frame':    13536 obs. of  3 variables:
##  $ DATE   : chr  "1/1/2007" "1/2/2007" "1/3/2007" "1/4/2007" ...
##  $ Type   : chr  "DFEDTARL" "DFEDTARL" "DFEDTARL" "DFEDTARL" ...
##  $ Percent: num  NA NA NA NA NA NA NA NA NA NA ...
FEDTARGET$DATE<- as.Date(strptime(FEDTARGET$DATE,format="%m/%d/%Y"),format="%Y-%m-%d")
FEDTARGET$Percent<-as.numeric(FEDTARGET$Percent)

# Analysis of Federal Funds Target Range

ggplot(data = FEDTARGET)+
  aes(x=DATE,y=`Percent`,color=Type)+
  geom_line(size=.98)+
  labs(x="Date",y="Percent",title="Federal Funds Target Range")+
  theme(panel.background = element_rect(fill = "white"))
## Warning: Removed 7783 rows containing missing values (geom_path).

The US federal funds target rates shown above are the policy rates of the FOMC. These rates can be changed at each FOMC meeting, though in practice they often are not. Prior to December 2008, the FOMC published a single point target for the federal funds rate; since then, it has published a tight range with an upper and lower bound. This gives the open market operations desk some latitude to buy and sell Treasuries within this range.

5. Analysis

5.1 Text Classification

The manually labelled attributes of the FOMC statements are leveraged to validate the automated classification that we conduct below. We hope this can serve as a good basis for future research, and there is room for improvement. Our analysis trains a separate prediction model for each of the 5 attributes. Although we use the same code framework and model, the parameters used to tune and train the model differ for each attribute. Final results are summarized in a table after all the individual backtest results are presented.

One additional challenge of this data is that the categorical variables are not all binary: four of the attributes are ternary-valued. For example, the policy rate can take one of three values: Raise, Flat, or Lower. The caret framework handles multi-class attributes with no difficulty, so this poses no real problem.

Data preparation: here, we prepare the data so that it can be reused for all the classifications without needing to repeat the cleaning and preparation processes.

fomc_data <-readRDS(file = "fomc_merged_data_v2.rds")
head(select(fomc_data, Index,year,statement.dates,statement.length,date_mdy,Employment.Growth,Economic.Growth,Inflation,Medium.Term.Rate,Policy.Rate))
##   Index year statement.dates statement.length   date_mdy Employment.Growth
## 1     1 2007        20070131             1155 2007-01-31              Flat
## 2     2 2007        20070321             1098 2007-03-21              Flat
## 3     3 2007        20070509             1087 2007-05-09              Flat
## 4     5 2007        20070807             1388 2007-08-07              Flat
## 5     6 2007        20070817              864 2007-08-17              Flat
## 6     7 2007        20070918             1710 2007-09-18              Flat
##   Economic.Growth Inflation Medium.Term.Rate Policy.Rate
## 1              Up      Down             Hawk        Flat
## 2            Flat        Up             Hawk        Flat
## 3            Down        Up             Hawk        Flat
## 4              Up      Flat             Hawk        Flat
## 5            Down      Flat             Dove        Flat
## 6            Flat      Down             Dove       Lower
Data preparation: first, randomise the rows so that statements from different eras of the economic cycle are well represented in the training and test splits.
set.seed(1234567)
fomc_Rand <- fomc_data[sample(nrow(fomc_data)),]
Preliminary data cleansing: convert the statements' textual content to lower case and remove the phrases "the federal open market committee" and "committee", as they are present in all the statements.
customStopWords <- c("the federal open market committee", "committee")
fomc_dataX <- fomc_Rand %>%
  mutate(statement.content = tolower(statement.content)) %>%
  mutate(statement.content = str_replace_all(statement.content, paste(customStopWords, collapse = "|"), ""))
Corpus preparation: build a corpus from the statement contents and apply standard text-mining transformations (punctuation and number removal, lower-casing, stopword removal, stemming and whitespace stripping) before creating a document-term matrix and removing sparse terms.
# form a corpus
corpus <- VCorpus(VectorSource(fomc_dataX$statement.content))
# Remove Punctuation
corpus <- tm_map(corpus, content_transformer(removePunctuation))
# Remove numbers
corpus <- tm_map(corpus, removeNumbers)
# Convert to lower case
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove stop words
corpus <- tm_map(corpus, content_transformer(removeWords), stopwords("english"))
##Stemming
corpus <- tm_map(corpus, stemDocument)
# Remove Whitespace
corpus <- tm_map(corpus, stripWhitespace)
# Create Document Term Matrix
dtm <- DocumentTermMatrix(corpus)
# handle sparsity
corpusX <- removeSparseTerms(dtm, 0.30)
# convert to matrix
data_matrix <- as.matrix(corpusX)
From here, we perform classifications targeting the Medium.Term.Rate, Employment.Growth, Economic.Growth, Inflation, and Policy.Rate variables.
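The five subsections below repeat the same workflow with different target columns and train/test ratios. As a compact sketch only (the helper name classify_attribute is ours and is not used in the analysis that follows), the shared pattern looks roughly like this, assuming the data_matrix and fomc_dataX objects built above:

classify_attribute <- function(term_matrix, labels, train_frac = 0.68, seed = 314) {
  # Combine the document-term matrix with the target label
  df <- as.data.frame(term_matrix)
  df$label <- as.factor(tolower(labels))
  # Hold out a test set, train a linear SVM via caret, and report a confusion matrix
  set.seed(seed)
  train_idx <- sample(nrow(df), round(nrow(df) * train_frac))
  model <- train(label ~ ., data = df[train_idx, ], method = "svmLinear3")
  confusionMatrix(predict(model, newdata = df[-train_idx, ]), df$label[-train_idx])
}
# e.g. classify_attribute(data_matrix, fomc_dataX$Medium.Term.Rate, train_frac = 0.68)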

5.1.1 Medium.Term.Rate

Classification targeting the Medium.Term.Rate variable
mRate <- data_matrix
# attach the 'medium.term.rate' column
mRate_matrix <- cbind(mRate, fomc_dataX$Medium.Term.Rate)
# rename it to 'tone'
colnames(mRate_matrix)[ncol(mRate_matrix)] <- "tone"
# convert to data frame
mRateData <- as.data.frame(mRate_matrix)
# convert 'tone' to lower case and make it a factor column as well
mRateData$tone <- as.factor(tolower(mRateData$tone))
Partition the data into training and test sets
mRate_n <- nrow(mRateData)
mRateTrainVolume <- round(mRate_n * 0.68)
set.seed(314)
mRateTrainIndex <- sample(mRate_n, mRateTrainVolume)
mRateTrain <- mRateData[mRateTrainIndex,]
mRateTest <- mRateData[-mRateTrainIndex,]
mRateModel <- train(tone ~., data = mRateTrain, method = 'svmLinear3')
mRateResult <- predict(mRateModel, newdata = mRateTest)
( mRateStats = confusionMatrix( mRateResult, mRateTest$tone))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction dove hawk
##       dove   22    2
##       hawk    0    8
##                                           
##                Accuracy : 0.9375          
##                  95% CI : (0.7919, 0.9923)
##     No Information Rate : 0.6875          
##     P-Value [Acc > NIR] : 0.0007323       
##                                           
##                   Kappa : 0.8462          
##                                           
##  Mcnemar's Test P-Value : 0.4795001       
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.8000          
##          Pos Pred Value : 0.9167          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.6875          
##          Detection Rate : 0.6875          
##    Detection Prevalence : 0.7500          
##       Balanced Accuracy : 0.9000          
##                                           
##        'Positive' Class : dove            
## 

5.1.2 Economic.Growth

Classification targeting the Economic.Growth variable
econGrowth <- data_matrix
# attach the 'Economic.Growth' column
econG_matrix <- cbind(econGrowth, tolower(fomc_dataX$Economic.Growth))
# rename it to 'egrowth'
colnames(econG_matrix)[ncol(econG_matrix)] <- "egrowth"
# convert to data frame
econData <- as.data.frame(econG_matrix)
# convert 'egrowth' to a factor column as well
econData$egrowth <- as.factor(econData$egrowth)
Partition the data into training and test sets: note that the ratios here are different from the other models
econ_n <- nrow(econData)
econTrainVolume <- round(econ_n * 0.70)
set.seed(314)
econTrainIndex <- sample(econ_n, econTrainVolume)
econTrain <- econData[econTrainIndex,]
econTest <- econData[-econTrainIndex,]
econModel <- train(egrowth ~., data = econTrain, method = 'svmLinear3')
econResult <- predict(econModel, newdata = econTest)
(econStats = confusionMatrix( econResult, econTest$egrowth))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction down flat up
##       down    3    0  0
##       flat    3    0  0
##       up      4    3 17
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6667          
##                  95% CI : (0.4719, 0.8271)
##     No Information Rate : 0.5667          
##     P-Value [Acc > NIR] : 0.17896         
##                                           
##                   Kappa : 0.3377          
##                                           
##  Mcnemar's Test P-Value : 0.01857         
## 
## Statistics by Class:
## 
##                      Class: down Class: flat Class: up
## Sensitivity               0.3000      0.0000    1.0000
## Specificity               1.0000      0.8889    0.4615
## Pos Pred Value            1.0000      0.0000    0.7083
## Neg Pred Value            0.7407      0.8889    1.0000
## Prevalence                0.3333      0.1000    0.5667
## Detection Rate            0.1000      0.0000    0.5667
## Detection Prevalence      0.1000      0.1000    0.8000
## Balanced Accuracy         0.6500      0.4444    0.7308

5.1.3 Inflation

Classification targeting the Inflation variable
# Create Document Term Matrix
dtmI <- DocumentTermMatrix(corpus)
# handle sparsity (note the looser 0.80 threshold used for this attribute)
corpusI <- removeSparseTerms(dtmI, 0.80)
# convert to matrix
data_matrixI <- as.matrix(corpusI)
inflation <- data_matrixI
# attach the 'Inflation' column
inflation_matrix <- cbind(inflation, tolower(fomc_dataX$Inflation))
# rename it to 'inflation'
colnames(inflation_matrix)[ncol(inflation_matrix)] <- "inflation"
# convert to data frame
inflationData <- as.data.frame(inflation_matrix)
# convert 'inflation' to a factor column
inflationData$inflation <- as.factor(inflationData$inflation)
Remove columns that will not contribute meaningfully to the model fitting (mostly stems of FOMC members' names and similar boilerplate tokens)
infDataX <- inflationData[, -which(names(inflationData) %in% c("although", "william", "richard", "raphael", "randal", "san", "sarah","sandra", "togeth", "timothi","committe","dudley","esther"))]
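# (Sketch only, not part of the original analysis.) A data-driven alternative to the
# hand-picked removal above would be to drop near-constant terms with caret::nearZeroVar:
#   nzv <- nearZeroVar(inflationData[, -ncol(inflationData)])
#   if (length(nzv) > 0) infDataX <- inflationData[, -nzv]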
inf_n <- nrow(infDataX)
infTrainVolume <- round(inf_n * 0.68)
set.seed(314)
infTrainIndex <- sample(inf_n, infTrainVolume)
infTrain <- infDataX[infTrainIndex,]
infTest <- infDataX[-infTrainIndex,]
inflationModel <- train(inflation ~., data = infTrain, method="svmLinear3")
inflationResult <- predict(inflationModel, newdata = infTest)
( infStats = confusionMatrix( inflationResult, infTest$inflation))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction down flat up
##       down    8    4  0
##       flat    2   10  1
##       up      2    1  4
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6875          
##                  95% CI : (0.4999, 0.8388)
##     No Information Rate : 0.4688          
##     P-Value [Acc > NIR] : 0.01037         
##                                           
##                   Kappa : 0.5077          
##                                           
##  Mcnemar's Test P-Value : 0.44592         
## 
## Statistics by Class:
## 
##                      Class: down Class: flat Class: up
## Sensitivity               0.6667      0.6667    0.8000
## Specificity               0.8000      0.8235    0.8889
## Pos Pred Value            0.6667      0.7692    0.5714
## Neg Pred Value            0.8000      0.7368    0.9600
## Prevalence                0.3750      0.4688    0.1562
## Detection Rate            0.2500      0.3125    0.1250
## Detection Prevalence      0.3750      0.4062    0.2188
## Balanced Accuracy         0.7333      0.7451    0.8444

5.1.4 Employment.Growth

Classification targeting the Employment.Growth variable
empGrowth <- data_matrix
# attach the 'Employment.Growth' column
emp_matrix <- cbind(empGrowth, tolower(fomc_dataX$Employment.Growth))
# rename it to 'empGrowth'
colnames(emp_matrix)[ncol(emp_matrix)] <- "empGrowth"
# convert to data frame
empData <- as.data.frame(emp_matrix)
# convert 'empGrowth' to a factor column as well
empData$empGrowth <- as.factor(empData$empGrowth)
emp_n <- nrow(empData)
empTrainVolume <- round(emp_n * 0.70)
set.seed(314)
empTrainIndex <- sample(emp_n, empTrainVolume)
empTrain <- empData[empTrainIndex,]
empTest <- empData[-empTrainIndex,]
empModel <- train(empGrowth ~., data = empTrain, method = 'svmLinear3')
empResult <- predict(empModel, newdata = empTest)
( empStats = confusionMatrix( empResult, empTest$empGrowth))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction down flat up
##       down    5    0  0
##       flat    4    2  1
##       up      0    1 17
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8             
##                  95% CI : (0.6143, 0.9229)
##     No Information Rate : 0.6             
##     P-Value [Acc > NIR] : 0.01718         
##                                           
##                   Kappa : 0.6471          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: down Class: flat Class: up
## Sensitivity               0.5556     0.66667    0.9444
## Specificity               1.0000     0.81481    0.9167
## Pos Pred Value            1.0000     0.28571    0.9444
## Neg Pred Value            0.8400     0.95652    0.9167
## Prevalence                0.3000     0.10000    0.6000
## Detection Rate            0.1667     0.06667    0.5667
## Detection Prevalence      0.1667     0.23333    0.6000
## Balanced Accuracy         0.7778     0.74074    0.9306

5.1.5 Policy.Rate

Classification targeting the Policy.Rate variable
plRate <- data_matrix
# attach the 'Policy.Rate' column
pl_matrix <- cbind(plRate, tolower(fomc_dataX$Policy.Rate))
# rename it to 'policy'
colnames(pl_matrix)[ncol(pl_matrix)] <- "policy"
# convert to data frame
plData <- as.data.frame(pl_matrix)
# convert 'policy' to a factor column as well
plData$policy <- as.factor(plData$policy)
pl_n <- nrow(plData)
plTrainVolume <- round(pl_n * 0.68)
set.seed(314)
plTrainIndex <- sample(pl_n, plTrainVolume)
plTrain <- plData[plTrainIndex,]
plTest <- plData[-plTrainIndex,]
plModel <- train(policy ~., data = plTrain, method = 'svmLinear3')
plResult <- predict(plModel, newdata = plTest)
( plStats = confusionMatrix( plResult, plTest$policy))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction flat lower raise
##      flat    21     0     5
##      lower    0     4     0
##      raise    0     0     0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8333          
##                  95% CI : (0.6528, 0.9436)
##     No Information Rate : 0.7             
##     P-Value [Acc > NIR] : 0.07659         
##                                           
##                   Kappa : 0.5562          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: flat Class: lower Class: raise
## Sensitivity               1.0000       1.0000       0.0000
## Specificity               0.4444       1.0000       1.0000
## Pos Pred Value            0.8077       1.0000          NaN
## Neg Pred Value            1.0000       1.0000       0.8333
## Prevalence                0.7000       0.1333       0.1667
## Detection Rate            0.7000       0.1333       0.0000
## Detection Prevalence      0.8667       0.1333       0.0000
## Balanced Accuracy         0.7222       1.0000       0.5000

5.1.6 Summary and Conclusion

Summary table of the classification results:

results <- tibble(variable = c("Medium.Term.Rate","Employment.Growth","Economic.Growth","Inflation","Policy.Rate"), modelling = c("68 : 32", "70 : 30", "70 : 30", "68 : 32", "68 : 32"), accuracy= c(93.75, 80, 66.67, 68.75, 83.33)) 

  kable(results,
      col.names = linebreak(c("Variable", "Modelling (Train : Test)", "Accuracy (%)"), align = "c")) %>%
  kable_styling("striped", bootstrap_options = c("hover", "striped")) %>%
  column_spec(1:3, bold = T, color = "#000") %>%
  row_spec(1:5, bold = T, color = "#000")
Variable Modelling (Train : Test) Accuracy (%)
Medium.Term.Rate 68 : 32 93.75
Employment.Growth 70 : 30 80.00
Economic.Growth 70 : 30 66.67
Inflation 68 : 32 68.75
Policy.Rate 68 : 32 83.33

For some of the variables, extra fine-tuning of the data was not needed to achieve appreciable accuracy. But for the Economic.Growth variable, we needed to adjust the ratio of training to test data to 70:30 to achieve an accuracy of 66.67%. For the Inflation variable we had to loosen the sparsity threshold to 0.80, remove 13 uninformative columns, and adjust the ratio of training to test data to 68:32 before we could achieve an accuracy of 68.75%.
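Because the corpus contains only about a hundred statements, accuracy from a single train/test split can be noisy. A minimal sketch (using the mRateData frame from section 5.1.1) of how repeated cross-validation via caret could provide more stable accuracy estimates:

# Repeated 5-fold cross-validation instead of a single hold-out split (sketch only).
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
set.seed(314)
cvModel <- train(tone ~ ., data = mRateData, method = "svmLinear3", trControl = ctrl)
cvModel$results   # accuracy and kappa averaged across the resamples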

We can conclude that the accuracies obtained, though not perfect, go a long way toward aligning with the human-based scoring/classification of economic trends for the variables and years considered. There is much room left for improvement and further analysis, but time did not permit us to go beyond this level in the current project.

5.2 Findings: Text Classification

Text classification of FOMC statements is not generally a research objective, but we think it is worthwhile. Classification addresses a potential need: can a machine correctly infer the opinion or direction of forward guidance or policy decisions in a structured text by the FOMC? In this regard, the classification problem for the FOMC is isomorphic to the ham-spam classification of incoming emails by an email program. The reader may object that FOMC statements are not so voluminous as to require automated processing. Our response is that FOMC statements are merely the first baby step in a much larger classification problem: the public communications of all FOMC and Federal Reserve System members. As previously explained, FOMC members give speeches, publish articles and appear in TV interviews. Moreover, FOMC meeting minutes are released several weeks after the policy statement; these are much longer and require more effort to read and digest. The FOMC transcripts, released several years after each meeting, may run to over 100 pages each and contain a word-for-word record of the entire meeting (excluding private discussions). Lastly, there are at least 16 relevant central banks around the world. Although the Fed is the world's most important central bank, the ECB, the Bank of England, the Bank of Japan, the People's Bank of China, the Reserve Bank of Australia and the Reserve Bank of New Zealand all produce communications. In summary, no single person can read all central bank communications, so the ability to extract key messages from plain text remains a valuable capability.

Our machine learning prediction backtest suggests that automated classification is feasible for detecting limited features of a central bank communication. Our algorithm succeeds at detecting medium-term rate outlooks, employment growth and policy rate changes, attaining accuracy rates between 80 and 93.75 percent. The most challenging attributes to understand are economic growth (66.67%) and inflation (68.75%). The difficulty with inflation is consistent with financial practitioner opinion: inflation is the most complex of these areas to quantify, control and manage because it has four distinct aspects: realized inflation (price changes from past surveys); market-based measures such as real yields on TIPS bonds and inflation swaps; long-term inflation expectations; and inflation measured with or without the volatile food and energy sectors. Because the statements may treat some or all of these aspects, we believe accuracy in understanding FOMC inflation views is hard to achieve.

5.3 Sentiment Analysis

5.3.1 Sentiment Analysis Used

Sentiment analysis is a research branch located at the heart of natural language processing (NLP), computational linguistics and text mining. It refers to any measures by which subjective information is extracted from textual documents. In other words, it extracts the polarity of the expressed opinion in a range spanning from positive to negative. Current research in finance and the social sciences utilizes sentiment analysis to understand human decisions in response to textual materials. This immediately reveals manifold implications for practitioners, as well as those involved in the fields of finance research and the social sciences: researchers can use R to extract text components that are relevant for readers and test their hypotheses on this basis.

source

5.3.2 Choice of Dictionary for Text Processing - Loughran-McDonald

We used the Loughran-McDonald dictionary as our choice of finance dictionary for text processing. The Loughran-McDonald Master Dictionary was initially developed in conjunction with Loughran and McDonald's paper published in the Journal of Finance. The dictionary provides a means of determining which tokens (collections of characters) are actual words, which is important for consistency in word counts.

source

5.3.3 Methods for Sentiment Analysis

As sentiment analysis is applied to a broad variety of domains and textual sources, research has devised various approaches to measuring sentiment (Pang and Lee 2008; see References).

In the process of performing sentiment analysis, one must convert the running text into a machine-readable format. This is achieved by executing a series of preprocessing operations. First, the text is tokenized into single words, followed by common preprocessing steps: stopword removal, stemming, removal of punctuation and conversion to lower case. These operations are also conducted by default in SentimentAnalysis, but can be adapted to one's personal needs.

5.3.4 Functionality of Sentiment Analysis

Sentiment analysis tokenizes each document and converts the input into a document-term matrix. All of these operations are undertaken automatically, without manual specification. The analyzeSentiment() routine also accepts other input formats in case the user has already performed a preprocessing step or wants to implement a specific set of operations.

# Reading FOMC Data

fomcStatements <-readRDS(file = "fomc_merged_data_v2.rds") %>% select(statement.dates, statement.content)
# Sentimental Analysis

fomcX <- fomcStatements %>%  mutate(date = statement.dates, year = as.numeric(str_extract(statement.dates,'\\d{4}')),text= statement.content)%>%   select(date, year, text)
# Sentiment analysis with Loughran-Mcdonald dictionary

sentiment <- analyzeSentiment(fomcX$text, language = "english", aggregate = fomcX$year,
                              removeStopwords = TRUE, stemming = TRUE,
                              rules=list("SentimentLM"=list(ruleSentiment,
                                                            loadDictionaryLM())))
# Summary of sentiment score

summary(sentiment)
##   SentimentLM       
##  Min.   :-0.064748  
##  1st Qu.:-0.025974  
##  Median :-0.015444  
##  Mean   :-0.017269  
##  3rd Qu.:-0.008475  
##  Max.   : 0.026316
# Table showing breakdown of Sentiments

table(convertToDirection(sentiment$SentimentLM))
## 
## negative  neutral positive 
##       88        5        8
# Line plot to visualize the evolution of sentiment scores. This is especially helpful when studying a time series of sentiment scores.




plotSentiment(sentiment, xlab="Tone")

Conclusion: the sentiment score is lowest in the 2008-2009 period. This observation correlates with the 2008 recession, when markets collapsed.

Sentiment<-data.frame(fomcX$date,fomcX$year,sentiment$SentimentLM,convertToDirection(sentiment$SentimentLM))

names(Sentiment)<-(c("FOMC_Date","FOMC_Year","Sentiment_Score","Sentiment"))


# Structure before date type change

str(Sentiment)
## 'data.frame':    101 obs. of  4 variables:
##  $ FOMC_Date      : Factor w/ 101 levels "20070131","20070321",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ FOMC_Year      : num  2007 2007 2007 2007 2007 ...
##  $ Sentiment_Score: num  0.0263 -0.0286 -0.028 -0.0308 -0.0127 ...
##  $ Sentiment      : Factor w/ 3 levels "negative","neutral",..: 3 1 1 1 1 1 1 1 1 1 ...
# Change the date format to Ymd
Sentiment$FOMC_Date<- ymd(Sentiment$FOMC_Date)

# Change the Year format

Sentiment$FOMC_Year<- as.numeric(Sentiment$FOMC_Year)

# Structure after date type change

str(Sentiment)
## 'data.frame':    101 obs. of  4 variables:
##  $ FOMC_Date      : Date, format: "2007-01-31" "2007-03-21" ...
##  $ FOMC_Year      : num  2007 2007 2007 2007 2007 ...
##  $ Sentiment_Score: num  0.0263 -0.0286 -0.028 -0.0308 -0.0127 ...
##  $ Sentiment      : Factor w/ 3 levels "negative","neutral",..: 3 1 1 1 1 1 1 1 1 1 ...
# Distribution of Sentiment Score for period of 2007 to 2019


ggplot(Sentiment,aes(x=Sentiment_Score))+
  geom_histogram(binwidth =.0125,color="black",fill="lightblue")+
  labs(x="Setiment Score",y="Frequency",title="Sentiment Score Distribution from 2007 to 2019")+
   theme(panel.background = element_rect(fill = "white"))

The overall distribution of sentiment scores is skewed toward negative rather than positive values.

# Sentiment Score Trend

ggplot(data = Sentiment)+
  aes(x=FOMC_Date,y=Sentiment_Score)+
  geom_line(size=.98,color="firebrick")+
  labs(x="FOMC Date",y="Sentiment Score",title="Sentiment Score trend over the period of 2007 to 2019")+
   theme(panel.background = element_rect(fill = "white"))

The sentiment score is lowest in the 2008-2009 period. This observation correlates with the 2008 recession, when markets collapsed.

# Scatter plot of score vs Date

## Grouped

ggplot(Sentiment,aes(x=FOMC_Date,y=Sentiment_Score,color=Sentiment))+
  geom_point()+
  labs(x="FOMC Date",y="Sentiment Score",title="Sentiments spread over the period of 2007 to 2019")+
   theme(panel.background = element_rect(fill = "white"))

The chart shows the spread of negative sentiment throughout the 2008 to 2015 period, with neutral and positive sentiment appearing mostly in the later years.

# Exporting data frame to RDS
## Changing the Date format
Sentiment$FOMC_Date<-format(Sentiment$FOMC_Date, format = "%Y%m%d")
## Exporting to .RDS
saveRDS(Sentiment,"SentimentDF.rds")

5.4 Financial Impact of Sentiment

5.4.1 Sentiment and Equity Markets

In this section, we evaluate how our sentiment index relates to a broad US equity index (the Russell 1000 Index). This section examines the fluctuations of sentiment compared to the equity market in two ways: a visual analysis of the normalized levels of both variables, and a linear regression of the time series data. To accomplish this, we first merge 3 data sets aligned by the 102 FOMC meeting dates. To normalize the variables, we calculate Z-scores of both over the sample period. Lastly, we perform both analyses using the Z-score data.

# First load all 3 files into data frames.
# ------------------------------------------------------
mgData<-readRDS(file = "fomc_merged_data_v2.rds")
sData <- readRDS( file = "../DATA/SentimentDF.rds")
file_fred_ru1000tr = "https://raw.githubusercontent.com/completegraph/DATA607FINAL/master/DATA/FRED_RU1000TR.csv"
ru1000tr = read_csv(file_fred_ru1000tr, 
                    col_types = cols(DATE=col_character(), 
                                     RU1000TR = col_double() ) )
## Warning: 111 parsing failures.
## row      col expected actual                                                                                         file
##   9 RU1000TR a double      . 'https://raw.githubusercontent.com/completegraph/DATA607FINAL/master/DATA/FRED_RU1000TR.csv'
##  34 RU1000TR a double      . 'https://raw.githubusercontent.com/completegraph/DATA607FINAL/master/DATA/FRED_RU1000TR.csv'
##  68 RU1000TR a double      . 'https://raw.githubusercontent.com/completegraph/DATA607FINAL/master/DATA/FRED_RU1000TR.csv'
## 104 RU1000TR a double      . 'https://raw.githubusercontent.com/completegraph/DATA607FINAL/master/DATA/FRED_RU1000TR.csv'
## 131 RU1000TR a double      . 'https://raw.githubusercontent.com/completegraph/DATA607FINAL/master/DATA/FRED_RU1000TR.csv'
## ... ........ ........ ...... ............................................................................................
## See problems(...) for more details.
# Generate a lubridate date column to join with the FOMC data.
# -----------------------------------------------------------------
ru1000tr %>% mutate( date_mdy = lubridate::ymd( DATE ) )-> ruData
 #z_ru_daily = (RU1000TR - mean(RU1000TR, na.rm=TRUE))/sd(RU1000TR, na.rm = TRUE )
#  Second, join the data:
#  Since this is a 2-way inner join, we start with the FOMC statement data
#  and join it to the sentiment data by date string (yyyymmdd)
# -------------------------------------------------------------------------
mgData %>% inner_join(sData, by = c( "statement.dates" = "FOMC_Date")) -> msData
#  Join the sentiment-FOMC data to the Russell 1000 Index data from FRED
#  Make sure to add a Z-score for each of the time series: sentiment and Russell index values
#     Save the raw data and normalized data by FOMC date.
# ----------------------------------------------------------------------------------
msEQdata = msData %>% left_join(ruData, by = c("date_mdy" = "date_mdy") ) %>% 
                    select( date_mdy, Sentiment_Score, RU1000TR ) %>%
                    mutate( z_ru_fomc = (RU1000TR - mean(RU1000TR, na.rm = TRUE) ) / sd( RU1000TR, na.rm=TRUE ) ,
                            z_sentiment = ( Sentiment_Score - mean( Sentiment_Score, na.rm = TRUE) ) / 
                              sd( Sentiment_Score, na.rm=TRUE) )

5.4.2 Data Transformation: Scale and Frequency Domain Issues

Let’s inspect the data for accuracy and scaling issues. Exploratory data analysis shows 3 issues.

  • Normalization to z-score format is needed to ensure that scale is not a problem. Since the Russell Index levels are expressed in the thousands, while sentiment is expressed in units of 0.01, scaling is essential along the y-dimension. To solve the scale problem, we convert the entire sample to Z-score equivalents, which brings both time series to the same order of magnitude and mean.

  • There is also a need to normalize in the frequency domain. FOMC meetings occur 8 times per year, so their sentiment levels and changes reflect nearly 2 months of news, while Russell equity index levels are collected daily. The volatility of lower-frequency data is much greater in absolute terms than that of higher-frequency (daily) data. To address this, we calculate Z-scores of the Russell equity index levels observed only on the FOMC dates.

  • Lastly, Russell Index levels increase at a (roughly) geometric rate, so values at the start of the sample period are much smaller than values at the end. The residuals of a regression on such data show significantly increasing volatility over the sample period. This is solved by applying a logarithmic transformation to the Russell Index levels, which fixes the non-constant residual volatility and also improves the model fit from roughly 36 to 39 percent adjusted R-squared. The two scalings are summarized in the formulas below.
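In symbols, the plain and log-level z-scores used below are

\[
z_i = \frac{x_i - \bar{x}}{s_x}
\qquad \text{and} \qquad
z_i^{\log} = \frac{\log x_i - \overline{\log x}}{s_{\log x}},
\]

where \(\bar{x}\) and \(s_x\) denote the sample mean and standard deviation computed over the FOMC-date observations.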

The following code produces the log-transformed z-scores of FOMC periodic equity values.

msEQdata %>% mutate( logEquity = log(RU1000TR) ) %>%
             mutate( z_logEquity = ( logEquity - mean(logEquity) )/ sd( logEquity ) ) -> msEQdata
  
msEQdata %>%  kable() %>% scroll_box(width="100%", height="200px")
date_mdy Sentiment_Score RU1000TR z_ru_fomc z_sentiment logEquity z_logEquity
2007-01-31 0.0263158 3497.78 -0.7169469 2.8673178 8.159884 -0.6473283
2007-03-21 -0.0285714 3505.86 -0.7129262 -0.7435143 8.162191 -0.6416587
2007-05-09 -0.0280374 3698.71 -0.6169627 -0.7083814 8.215739 -0.5100826
2007-08-07 -0.0307692 3610.26 -0.6609760 -0.8880998 8.191535 -0.5695562
2007-08-17 -0.0126582 3531.38 -0.7002273 0.3033577 8.169444 -0.6238373
2007-09-18 -0.0250000 3726.95 -0.6029103 -0.5085629 8.223346 -0.4913932
2007-10-31 -0.0370370 3816.32 -0.5584391 -1.3004362 8.247042 -0.4331676
2007-12-11 -0.0169492 3650.86 -0.6407732 0.0210733 8.202718 -0.5420780
2008-01-22 -0.0522876 3234.17 -0.8481211 -2.3037145 8.081528 -0.8398612
2008-01-30 -0.0259740 3353.44 -0.7887715 -0.5726406 8.117742 -0.7508770
2008-03-18 -0.0451977 3299.30 -0.8157120 -1.8372994 8.101466 -0.7908705
2008-04-30 -0.0310881 3452.16 -0.7396477 -0.9090759 8.146755 -0.6795866
2008-06-25 -0.0379747 3328.66 -0.8011022 -1.3621206 8.110325 -0.7691014
2008-08-05 -0.0454545 3219.69 -0.8553265 -1.8541937 8.077040 -0.8508870
2008-09-16 -0.0647482 3047.69 -0.9409149 -3.1234537 8.022139 -0.9857875
2008-10-08 -0.0490566 2454.63 -1.2360259 -2.0911601 7.805731 -1.5175344
2008-10-29 -0.0478723 2309.41 -1.3082884 -2.0132517 7.744747 -1.6673812
2008-12-16 -0.0466926 2277.87 -1.3239829 -1.9356413 7.730996 -1.7011701
2009-01-28 -0.0136519 2201.85 -1.3618110 0.2379891 7.697053 -1.7845729
2009-03-18 -0.0270270 2016.35 -1.4541171 -0.6419137 7.609044 -2.0008243
2009-04-29 -0.0447761 2234.29 -1.3456686 -1.8095625 7.711679 -1.7486357
2009-06-24 -0.0430622 2311.77 -1.3071140 -1.6968100 7.745769 -1.6648715
2009-08-12 -0.0270270 2597.64 -1.1648630 -0.6419137 7.862359 -1.3783925
2009-09-23 -0.0377358 2751.63 -1.0882365 -1.3464085 7.919949 -1.2368849
2009-11-04 -0.0307167 2711.23 -1.1083398 -0.8846455 7.905158 -1.2732287
2009-12-16 -0.0224719 2889.67 -1.0195468 -0.3422490 7.968898 -1.1166102
2010-01-27 -0.0146628 2868.68 -1.0299916 0.1714870 7.961607 -1.1345236
2010-03-16 -0.0071942 3049.00 -0.9402631 0.6628135 8.022569 -0.9847315
2010-04-28 -0.0154440 3143.24 -0.8933686 0.1200908 8.053009 -0.9099347
2010-06-23 -0.0084746 2886.76 -1.0209949 0.5785851 7.967890 -1.1190859
2010-08-10 -0.0273973 2963.05 -0.9830324 -0.6662700 7.993974 -1.0549926
2010-09-21 -0.0185185 3026.69 -0.9513647 -0.0821696 8.015225 -1.0027770
2010-11-03 -0.0193548 3195.65 -0.8672890 -0.1371881 8.069546 -0.8693023
2010-12-14 -0.0263158 3329.18 -0.8008435 -0.5951240 8.110481 -0.7687176
2011-01-26 -0.0144928 3485.32 -0.7231471 0.1826709 8.156315 -0.6560969
2011-03-15 0.0034965 3460.65 -0.7354231 1.3661192 8.149212 -0.6735511
2011-04-27 0.0034843 3674.91 -0.6288057 1.3653178 8.209284 -0.5259446
2011-06-22 -0.0350877 3499.60 -0.7160413 -1.1721976 8.160404 -0.6460501
2011-08-09 -0.0280374 3173.99 -0.8780671 -0.7083814 8.062745 -0.8860135
2011-09-21 -0.0298913 3167.68 -0.8812070 -0.8303442 8.060755 -0.8909033
2011-11-02 -0.0132013 3366.74 -0.7821533 0.2676297 8.121700 -0.7411511
2011-12-13 -0.0110294 3337.07 -0.7969174 0.4105117 8.112849 -0.7629011
2012-01-25 -0.0114504 3630.15 -0.6510786 0.3828176 8.197029 -0.5560562
2012-03-13 -0.0185185 3838.77 -0.5472678 -0.0821696 8.252907 -0.4187554
2012-04-25 0.0000000 3827.04 -0.5531047 1.1360969 8.249847 -0.4262751
2012-06-20 -0.0198675 3731.04 -0.6008750 -0.1709175 8.224442 -0.4886982
2012-08-01 -0.0140351 3781.35 -0.5758404 0.2127791 8.237836 -0.4557869
2012-09-13 -0.0028736 4040.51 -0.4468805 0.9470555 8.304126 -0.2929028
2012-10-24 -0.0086207 3908.01 -0.5128135 0.5689728 8.270784 -0.3748307
2012-12-12 -0.0160183 3988.68 -0.4726715 0.0823103 8.291216 -0.3246260
2013-01-30 -0.0218978 4215.41 -0.3598490 -0.3044811 8.346502 -0.1887787
2013-03-20 -0.0169903 4397.11 -0.2694338 0.0183669 8.388703 -0.0850853
2013-05-01 -0.0141844 4464.70 -0.2358005 0.2029566 8.403957 -0.0476027
2013-06-19 -0.0160920 4604.29 -0.1663395 0.0774653 8.434744 0.0280443
2013-07-31 -0.0227790 4789.46 -0.0741976 -0.3624542 8.474173 0.1249278
2013-09-18 -0.0181087 4931.06 -0.0037364 -0.0552060 8.503309 0.1965201
2013-10-30 -0.0204918 5048.76 0.0548319 -0.2119849 8.526898 0.2544811
2013-12-18 -0.0255941 5198.37 0.1292789 -0.5476499 8.556100 0.3262359
2014-01-29 -0.0227704 5114.60 0.0875944 -0.3618855 8.539855 0.2863172
2014-03-19 -0.0232975 5393.54 0.2263969 -0.3965610 8.592957 0.4167987
2014-04-30 -0.0211946 5446.93 0.2529642 -0.2582197 8.602807 0.4410021
2014-06-18 -0.0116279 5680.51 0.3691953 0.3711388 8.644796 0.5441751
2014-07-30 -0.0092937 5719.90 0.3887960 0.5246992 8.651707 0.5611548
2014-09-17 -0.0069808 5831.98 0.4445678 0.6768551 8.671112 0.6088364
2014-10-29 -0.0043290 5772.43 0.4149353 0.8513073 8.660848 0.5836176
2014-12-17 -0.0064795 5873.73 0.4653429 0.7098352 8.678245 0.6263640
2015-01-28 -0.0053619 5872.69 0.4648254 0.7833548 8.678068 0.6259289
2015-03-18 -0.0106383 6193.92 0.6246718 0.4362417 8.731323 0.7567854
2015-04-29 -0.0214477 6219.01 0.6371568 -0.2748713 8.735366 0.7667186
2015-06-17 -0.0056180 6225.98 0.6406251 0.7665104 8.736486 0.7694709
2015-07-29 -0.0084746 6244.98 0.6500796 0.5785851 8.739533 0.7769581
2015-09-17 -0.0105541 5927.08 0.4918903 0.4417814 8.687287 0.6485811
2015-10-28 -0.0134771 6193.52 0.6244728 0.2494878 8.731259 0.7566267
2015-12-16 -0.0107527 6147.98 0.6018117 0.4287163 8.723879 0.7384929
2016-01-27 -0.0226629 5577.69 0.3180313 -0.3548129 8.626530 0.4992921
2016-03-16 0.0000000 6035.22 0.5457015 1.1360969 8.705368 0.6930079
2016-04-27 -0.0027397 6265.80 0.6604398 0.9558602 8.742862 0.7851363
2016-06-15 -0.0205882 6215.64 0.6354798 -0.2183288 8.734824 0.7653867
2016-07-27 -0.0084507 6512.26 0.7830801 0.5801555 8.781442 0.8799339
2016-09-21 -0.0053763 6531.17 0.7924898 0.7824066 8.784341 0.8870585
2016-11-02 -0.0054348 6335.69 0.6952176 0.7785622 8.753954 0.8123921
2016-12-14 -0.0117302 6840.08 0.9462057 0.3644090 8.830555 1.0006116
2017-02-01 -0.0123457 6941.77 0.9968073 0.3239192 8.845312 1.0368727
2017-03-15 -0.0117302 7274.32 1.1622866 0.3644090 8.892106 1.1518515
2017-05-03 -0.0235294 7298.39 1.1742640 -0.4118182 8.895409 1.1599686
2017-06-14 -0.0085470 7472.19 1.2607482 0.5738200 8.918943 1.2177960
2017-07-26 -0.0092593 7607.26 1.3279600 0.5269636 8.936858 1.2618157
2017-09-20 -0.0260870 7727.34 1.3877126 -0.5800699 8.952520 1.3002987
2017-11-01 -0.0212766 7954.93 1.5009631 -0.2636136 8.981547 1.3716229
2017-12-13 -0.0129870 8231.14 1.6384072 0.2817282 9.015680 1.4554920
2018-01-31 -0.0037037 8732.78 1.8880268 0.8924436 9.074839 1.6008552
2018-03-21 0.0069686 8440.61 1.7426409 1.5945387 9.040810 1.5172403
2018-05-02 0.0000000 8213.04 1.6294005 1.1360969 9.013478 1.4500828
2018-06-13 0.0000000 8687.40 1.8654454 1.1360969 9.069629 1.5880532
2018-08-01 0.0147059 8800.09 1.9215208 2.1035439 9.082517 1.6197216
2018-09-26 0.0104712 9118.25 2.0798395 1.8249597 9.118033 1.7069897
2018-11-08 0.0050761 8798.28 1.9206201 1.4700380 9.082311 1.6192162
2018-12-19 0.0000000 7877.65 1.4625080 1.1360969 8.971785 1.3476356
2019-01-30 0.0046729 8468.86 1.7566983 1.4435099 9.044151 1.5254505
2019-03-20 -0.0177778 8948.39 1.9953159 -0.0334390 9.099229 1.6607847
2019-05-01 -0.0144231 9277.94 2.1593024 0.1872547 9.135395 1.7496499

5.4.3 Charting the Time Series Alternatives

In this section, we will show 3 time series charts illustrating the alternative considerations of regression modeling.

The first chart below shows the raw sentiment compared to raw Russell equity levels. Scale issues are obvious: the sentiment values are compressed into what appears to be a slightly fuzzy flat line, so scaling is essential.

ggplot() + 
  geom_line(data=msEQdata, aes(x=date_mdy, y=Sentiment_Score) , color = "red" ) +
  geom_line(data=msEQdata, aes(x=date_mdy, y=RU1000TR), color="green") +
  ggtitle("Sentiment vs. Russell 1000 Equity Level", subtitle="Not usable without fixes")

The second chart shows scaled sentiment versus scaled Russell equity levels. Scale issues remain because the right-hand side (the more recent years) shows higher variation than the left-hand side (the earliest years).

ggplot() + 
  geom_line(data=msEQdata, aes(x=date_mdy, y=z_sentiment) , color = "red" ) +
  geom_line(data=msEQdata, aes(x=date_mdy, y=z_ru_fomc), color="green") +
  ggtitle("Scaled Sentiment vs. Scaled Equity Index", subtitle = "Nearly There...")

Finally, the third chart shows the variables we will use in the regression analysis.

ggplot() + 
  geom_line(data=msEQdata, aes(x=date_mdy, y=z_sentiment) , color = "red" ) +
  geom_line(data=msEQdata, aes(x=date_mdy, y=z_logEquity), color="green") +
  ggtitle("Scaled-Sentiment vs. Scaled Log Equity Price", subtitle="What we will use")

5.4.4 Regressing Sentiment to Financial Variables

The final regression model we present uses the scaled, log-transformed data with an influential outlier removed (observation 1, from January 2007). For a reason yet to be determined, January 2007 generates the highest sentiment of the entire observation period; this is arguably wrong, as the September 2018 period was possibly the most euphoric in recent memory. The model is fit in the code chunk below.

mod1 = lm( z_logEquity ~ z_sentiment, data=msEQdata[2:102,])
summary(mod1)
## 
## Call:
## lm(formula = z_logEquity ~ z_sentiment, data = msEQdata[2:102, 
##     ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.96400 -0.58297  0.09103  0.56555  1.65740 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.02507    0.07893   0.318    0.751    
## z_sentiment  0.64857    0.08239   7.872 4.76e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.789 on 98 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.3874, Adjusted R-squared:  0.3811 
## F-statistic: 61.96 on 1 and 98 DF,  p-value: 4.761e-12

The model mod1 clearly has a statistically significant slope coefficient, with a p-value of 4.76e-12. The adjusted R-squared of roughly 38 percent suggests the model has some explanatory power.

Examining the diagnostic plots below shows:

  • The Q-Q plot and histogram of residuals show a reasonable approximation to normality.
  • The residuals have relatively homogeneous variance across the range of observations.
  • The residuals show little trend relative to the fitted values.
  • The leverage plot confirms that the most influential outlier (observation 1) has been removed.
par(mfrow=c(3,2))
plot(mod1)
hist(mod1$residuals )

Finally, we present the scatterplot of the regressed values overlaid with the regression line to study the model fit.

ggplot(data=msEQdata[2:102,], aes(x=z_sentiment, y=z_logEquity) ) + 
   geom_point() + 
   geom_smooth(method=lm) +
   ggtitle("ScatterPlot of Fitted Regression Model", subtitle="X=Z-Sentiment, Y=Z-LogRussell 1000 (2007-2019)")
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).

5.4.5 Discussion of Results

There are two comments related to the time series and regression we should make.

First, the time series of sentiment clearly shows a pattern characteristic of other financial variables over the 2007-2019 period. During Q4 2008, at the depths of the financial crisis, sentiment is at a low. During H2 2009, when financial markets had recovered sharply, sentiment spikes upward. Other signs that the sentiment measure is effective include the 2018 euphoria, when equity markets reached repeated highs during the summer and fall. Moreover, sentiment in Q4 2018 and Q1 2019 declined in concert with the observed selloff of risk assets in the same period.

However, the sentiment index is imperfect. The 2013 taper tantrum is not reflected correctly from a bond investor's point of view. As we recall, on May 22, 2013, bond markets panicked when Bernanke's testimony to Congress signalled that quantitative easing would likely be tapered at a future date. More investigation is needed to understand the market and FOMC dynamics around that historical episode, and we regard this as future work.

Second, the regression suggests that sentiment is positively associated with equity levels: positive sentiment is associated with higher Russell 1000 Index levels. We think this makes sense. Whether sentiment causes equity markets to move, or vice versa, is too complex to answer with the crude econometric analysis we have conducted. However, the trend and regression results suggest that a more detailed regression analysis of sentiment changes versus equity returns (instead of levels), both contemporaneous and lagged, might reveal some predictive value in the sentiment series. The project timeline did not allow for this more extensive regression work, but we view it as fertile ground for future research; a sketch of this extension follows.
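A minimal sketch of that extension (not run as part of this project's results), assuming the msEQdata frame built in section 5.4.1:

# Regress contemporaneous log equity returns on changes in sentiment (sketch only).
msDiff <- msEQdata %>%
  arrange(date_mdy) %>%
  mutate(d_sentiment   = Sentiment_Score - lag(Sentiment_Score),
         equity_return = log(RU1000TR) - lag(log(RU1000TR)))
mod_diff <- lm(equity_return ~ d_sentiment, data = msDiff)
summary(mod_diff)
# Adding a lagged column in the mutate() above (e.g. lag(d_sentiment)) would allow
# testing whether sentiment changes lead subsequent returns.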

6. Discussion of Results and Impact

We have discussed the results of our two analyses previously. This section focuses on broader considerations of FOMC statement analysis. How useful were the results in practical terms? First, the training and classification of FOMC statement attributes is clearly feasible. Our effort, combining laborious human review of the statements for labelling purposes with traditional supervised learning methods, can produce useful predictions for some attributes. Other attributes, such as inflation, are not easily predicted using our approach; this does not mean the problem cannot be solved with this line of attack.

Second, sentiment analysis is likely to be the more impactful tool for forecasting in investment management. Our preliminary results, effectively a conventional approach to sentiment analysis, yielded a realistic-looking financial sentiment indicator. We demonstrate a moderate level of explanatory power of sentiment for equity market levels using a linear regression. Much more extensive regression analysis covering first differences of the sentiment and price series is required. We believe the results are insufficient to be useful in practice, but sufficient to justify further refinement and investigation to unlock value.

7. Conclusion

The project analyzed the FOMC statements using text-based methods. The results are encouraging but not definitive in their utility. A much longer programme of research would be needed to explore the implications of this work. Others have gone down this path: companies such as JPMorgan Chase and BlackRock have dedicated teams that analyze interest rate markets and central bank communications with machine learning tools, and one company, Prattle (www.prattle.co), has even commercialized this idea and provides central bank sentiment analysis for 16 central banks including the Fed. Therefore, we are on the right path – one led by earlier pioneers.

8. References

Pang, Bo, and Lillian Lee. "Opinion Mining and Sentiment Analysis." Foundations and Trends in Information Retrieval, Vol. 2, Issue 1-2, January 2008, pp. 1-135.
