This document reports on a final project for the Responsible Machine Learning course delivered to Ph.D. students at the University of Warsaw (Ph.D. program in Quantitative Psychology and Economics) by prof. Przemysław Biecek in the summer semester of 2021.
Our models showed that the emotional tone of the FOMC press conference texts can be used to predict trading behavior.
Although the random forest model captures a more complex relationship between the target variable and the features, it falls short of the logistic regression model in terms of the range of its predictions. Because of its wider prediction range, the logistic regression model might be more versatile when deployed in production, whereas the random forest model might suffer some drawbacks and need to be retrained more frequently.
Emotions perform better than general sentiment as predictors. Emotions formulated under a particular narrative, e.g., policy stance (hawkish-dovish) and the relative sentiment score (excitement-anxiety), perform better than raw emotion indices (e.g., fear, anger, anticipation).
Policy stance and Speaker have good fairness scores, passing 4 out of 5 metrics. The probable sources of the bias are historical bias and algorithmic bias. Unfortunately, the RSS cannot be tested.
Predicting the stock market is hard, especially during uncertain times, for example during shifts in monetary policy. Most human agents, even the most experienced market players, do not behave fully rationally or attentively when dealing with problems of strategic uncertainty. Economists are therefore starting to move away from rational expectations toward behavioral expectations when predicting how humans react to monetary policy.
A number of studies have shown that emotions (see, for example, Duxbury et al., 2020), particularly fear, play a dominant role in shaping decisions in the stock market. In behavioral economics, terms such as loss aversion and the disposition effect capture the significant phenomenon that negative feelings, rather than positive ones, drive people’s behavior and decisions when they anticipate losses.
Can we predict trading behavior if we have information on the emotions conveyed in economics-related news? Increasingly sophisticated machine learning algorithms and the FOMC press conferences provide ideal tools for researchers to answer this question.
We try to predict the trading behavior captured by two US indices, the E-mini S&P 500 futures (ES) and the CBOE volatility index (VIX), using emotion analysis of the scripts of the FOMC prepared remarks and Q&A sessions. The ES contract is a derivative that speculates on the future movement of the S&P 500 index. The VIX index, in turn, reflects the 30-day forward projection of S&P 500 volatility. The FOMC conference scripts cover the period 2011-2019.
The Federal Open Market Committee (FOMC) is a committee within the Federal Reserve System (the Fed) in the USA. It oversees open market operations and makes critical decisions about interest rates and the growth of the US money supply. Eight regularly scheduled meetings are held each year at intervals of five to eight weeks. At each meeting, the Committee votes on the policy to be carried out until the next meeting. After each meeting there is a press conference, at which the Committee conveys its economic predictions and guidance for the upcoming quarters to the public.
First, the Chairman reads the written prepared remarks for approximately 15 minutes; a Q&A session then follows for the next 45 minutes. This event is undoubtedly one of the most awaited by market players, because much significant decision-making depends on whether, for example, the Fed will raise the interest rate or not.
In the present project, we investigate the mentioned effects using machine learning models. The objective is to classify financial returns from investments in the ES and VIX indices. Our project makes use of logistic regression and random forest algorithms. A total of four models are constructed, including the following:
Two baseline models are constructed, using the ES and VIX index returns as target variables and their lag terms as features. Information criteria determine the optimal number of lags.
Four additional features are added to the baseline models: implied policy stance, sentiment score, the interaction term between implied policy stance and sentiment score, and a categorical variable to differentiate the section of the press conference.
First of all, there is a growing interest in studying the effectiveness of Central Bank communication to manage inflation better. Perhaps much can be read from textual analysis, voice tone, facial expression, and gesture of the Chairman during the public announcement. However, studying the emotional aspect is relatively new and somewhat challenging. Most economic texts produced by the Central Bank are “cold” and predisposed to be as unemotional as possible.
Second of all, by using a responsible machine learning approach to analyze predictive models in trading behavior using emotions analysis from textual data, we expect to have a better, more reliable, fairer, and more rigorous model.
Subjectivity in detecting and interpreting emotions may cause bias within the algorithm and between the users of the machine (or model). Therefore, achieving good model performance alone may not be enough; we have to ensure that the emotion detection has been thoroughly examined for biases regarding culture, language, context, etc.
Until recently, there has been no solid emotion detection program/algorithm specialized for analyzing economics/finance texts, let alone a model with good predictive value to use the emotion of economics text to predict market behavior.
Emotion analysis of the Central Bank text communication can be used to predict trading behavior.
Investors make a rational investment decision in response to the policy stance of the Federal Reserve.
The random forest model should capture a more complex relationship between the target variable and the features and give more accurate predictions.
The primary objective of our project is to extract sentiment (by sentiment, we refer to both sentiment and emotions) and implied policy stance from the scripts of FOMC statements and investigate the impact of these features on two US indices based on trading data.
A total of five sentiment lexica are used in our work:
Loughran and McDonald (2011) dictionary
We utilize a new metric created by Nyman et al. (2021), the relative sentiment score (RSS). This metric was developed using findings from social and psychological theories of narratives, which hold that humans are driven by context and sentiment. In particular, the authors suggest that the shift between excitement and anxiety is the most vital sentiment for predicting responses to financial news. They developed their own dictionary to capture the “excitement” and “anxiety” emotions, which differs from the Loughran and McDonald dictionary. The RSS is formulated as follows:
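$$\mathrm{RSS} = \frac{n_{\text{excitement}} - n_{\text{anxiety}}}{n_{\text{total}}}$$

where $n_{\text{excitement}}$ and $n_{\text{anxiety}}$ are the counts of excitement (“approach”) and anxiety (“avoid”) words in a piece of text and $n_{\text{total}}$ is its total word count. This is a reconstruction based on the approach and avoid word counts in our data; the exact normalization in Nyman et al. (2021) may differ.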
We also use the policy stance index, which indicates whether a text implies a “hawkish” or a “dovish” tone. In our study, we follow the procedure of Gorodnichenko et al. (2021). The formula is as follows:
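$$\mathrm{PS} = \frac{n_{\text{hawkish}} - n_{\text{dovish}}}{n_{\text{hawkish}} + n_{\text{dovish}}}$$

where $n_{\text{hawkish}}$ and $n_{\text{dovish}}$ are the counts of hawkish and dovish terms in a piece of text. This is a sketch of the assumed normalized-difference form; the exact counting window and sign convention in our implementation follow Gorodnichenko et al. (2021) and may differ from this reconstruction.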
We use NRC EmoLex (Mohammad & Turney, 2013) as an addition to typical emotion analysis from texts and utilize the LM dictionary, which was designed for analyzing sentiment from economics and finance text.
For general sentiment, we use the Sentiword lexicon (Baccianella et al., n.d.).
We want to compare the performance between the simple and complex models in terms of their predictability, explainability, and fairness, by comparing the policy stance, RSS, Sentiword, and NRC emotions performance.
Here we provide a basic description of the data used in our project.

* FOMC_sentiment_ES_m is the dataset containing all sentiment scores extracted from the FOMC scripts together with the ES index trading data aggregated into one-minute intervals.
* FOMC_sentiment_VX_m is the dataset containing all sentiment scores extracted from the FOMC scripts together with the VIX index trading data aggregated into one-minute intervals.
summary(FOMC_sentiment_ES_m)
## X1 positive negative anger
## Min. : 1.0 Min. : 0.000 Min. : 0.000 Min. : 0.0000
## 1st Qu.: 525.5 1st Qu.: 0.000 1st Qu.: 1.000 1st Qu.: 0.0000
## Median :1050.0 Median : 1.000 Median : 2.000 Median : 0.0000
## Mean :1050.0 Mean : 1.058 Mean : 2.214 Mean : 0.6417
## 3rd Qu.:1574.5 3rd Qu.: 2.000 3rd Qu.: 3.000 3rd Qu.: 1.0000
## Max. :2099.0 Max. :10.000 Max. :16.000 Max. :11.0000
## anticipation sadness surprise fear
## Min. : 0.000 Min. : 0.0000 Min. :0.0000 Min. : 0.000
## 1st Qu.: 1.000 1st Qu.: 0.0000 1st Qu.:0.0000 1st Qu.: 0.000
## Median : 3.000 Median : 1.0000 Median :1.0000 Median : 2.000
## Mean : 3.142 Mean : 0.8552 Mean :0.9909 Mean : 1.949
## 3rd Qu.: 4.000 3rd Qu.: 1.0000 3rd Qu.:2.0000 3rd Qu.: 3.000
## Max. :16.000 Max. :11.0000 Max. :8.0000 Max. :12.000
## trust joy disgust sentiword
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. :-4.5938
## 1st Qu.: 3.000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.:-0.5208
## Median : 5.000 Median : 1.000 Median : 0.000 Median : 0.3542
## Mean : 5.234 Mean : 1.348 Mean : 0.404 Mean : 0.3210
## 3rd Qu.: 7.000 3rd Qu.: 2.000 3rd Qu.: 1.000 3rd Qu.: 1.1271
## Max. :19.000 Max. :11.000 Max. :10.000 Max. : 6.0000
## ps_score rss_score total_hawkish total_dovish
## Min. :-0.5000 Min. :-0.005891 Min. : 0.000 Min. : 2.000
## 1st Qu.: 0.0000 1st Qu.:-0.003631 1st Qu.: 2.000 1st Qu.: 5.000
## Median : 0.3333 Median :-0.002722 Median : 4.000 Median : 8.000
## Mean : 0.2903 Mean :-0.002925 Mean : 4.915 Mean : 8.923
## 3rd Qu.: 0.5556 3rd Qu.:-0.001998 3rd Qu.: 6.000 3rd Qu.:11.000
## Max. : 1.0000 Max. :-0.000709 Max. :21.000 Max. :21.000
## approach_score avoid_score DateTime
## Min. : 5.00 Min. :17.00 Min. :2011-04-27 14:31:00
## 1st Qu.: 9.00 1st Qu.:25.00 1st Qu.:2013-06-19 14:37:30
## Median :11.00 Median :33.00 Median :2015-06-17 15:15:00
## Mean :12.69 Mean :36.82 Mean :2015-07-14 11:13:59
## 3rd Qu.:14.00 3rd Qu.:47.00 3rd Qu.:2017-09-20 15:03:30
## Max. :28.00 Max. :86.00 Max. :2019-10-30 15:18:00
## Date Time part Speaker
## Min. :2011-04-27 Length:2099 Length:2099 Length:2099
## 1st Qu.:2013-06-19 Class1:hms Class :character Class :character
## Median :2015-06-17 Class2:difftime Mode :character Mode :character
## Mean :2015-07-13 Mode :numeric
## 3rd Qu.:2017-09-20
## Max. :2019-10-30
## Price Volume No_trades sentiment_score
## Min. :1226 Min. : 3.0 Min. : 2.0 Min. :-1.0000
## 1st Qu.:1635 1st Qu.: 531.5 1st Qu.: 129.5 1st Qu.:-1.0000
## Median :2028 Median : 1298.0 Median : 291.0 Median :-0.3333
## Mean :2065 Mean : 2378.9 Mean : 477.5 Mean :-0.2992
## 3rd Qu.:2501 3rd Qu.: 3196.5 3rd Qu.: 655.5 3rd Qu.: 0.0000
## Max. :3043 Max. :25301.0 Max. :4742.0 Max. : 1.0000
summary(FOMC_sentiment_VX)
## X1 positive negative anger
## Min. : 1.0 Min. : 0.000 Min. : 0.000 Min. : 0.0000
## 1st Qu.: 565.2 1st Qu.: 0.000 1st Qu.: 1.000 1st Qu.: 0.0000
## Median :1108.5 Median : 1.000 Median : 2.000 Median : 0.0000
## Mean :1105.8 Mean : 1.083 Mean : 2.235 Mean : 0.6452
## 3rd Qu.:1649.8 3rd Qu.: 2.000 3rd Qu.: 3.000 3rd Qu.: 1.0000
## Max. :2192.0 Max. :13.000 Max. :16.000 Max. :11.0000
## anticipation sadness surprise fear
## Min. : 0.000 Min. : 0.0000 Min. :0.0000 Min. : 0.000
## 1st Qu.: 1.000 1st Qu.: 0.0000 1st Qu.:0.0000 1st Qu.: 0.000
## Median : 3.000 Median : 1.0000 Median :1.0000 Median : 2.000
## Mean : 3.189 Mean : 0.8631 Mean :0.9981 Mean : 1.964
## 3rd Qu.: 5.000 3rd Qu.: 1.0000 3rd Qu.:2.0000 3rd Qu.: 3.000
## Max. :16.000 Max. :11.0000 Max. :8.0000 Max. :12.000
## trust joy disgust sentiword
## Min. : 0.000 Min. : 0.000 Min. : 0.0000 Min. :-6.0208
## 1st Qu.: 3.000 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.:-0.5094
## Median : 5.000 Median : 1.000 Median : 0.0000 Median : 0.3750
## Mean : 5.332 Mean : 1.355 Mean : 0.4043 Mean : 0.3397
## 3rd Qu.: 7.000 3rd Qu.: 2.000 3rd Qu.: 1.0000 3rd Qu.: 1.1833
## Max. :27.000 Max. :11.000 Max. :10.0000 Max. : 6.0000
## ps_score rss_score total_hawkish total_dovish
## Min. :-0.5000 Min. :-0.005891 Min. : 0.000 Min. : 2.000
## 1st Qu.: 0.0000 1st Qu.:-0.003716 1st Qu.: 2.000 1st Qu.: 5.000
## Median : 0.3333 Median :-0.002953 Median : 4.000 Median : 8.000
## Mean : 0.3164 Mean :-0.002987 Mean : 4.747 Mean : 9.337
## 3rd Qu.: 0.5556 3rd Qu.:-0.002016 3rd Qu.: 6.000 3rd Qu.:12.000
## Max. : 1.0000 Max. :-0.000709 Max. :21.000 Max. :21.000
## approach_score avoid_score DateTime
## Min. : 5.00 Min. :17.00 Min. :2011-04-27 14:31:00
## 1st Qu.: 9.00 1st Qu.:25.00 1st Qu.:2013-06-19 15:20:15
## Median :11.00 Median :34.00 Median :2015-09-17 15:22:30
## Mean :12.61 Mean :37.17 Mean :2015-10-14 11:42:31
## 3rd Qu.:14.00 3rd Qu.:47.00 3rd Qu.:2017-12-13 15:29:45
## Max. :28.00 Max. :86.00 Max. :2019-10-30 15:18:00
## Date Time part Speaker
## Min. :2011-04-27 Length:2162 Length:2162 Length:2162
## 1st Qu.:2013-06-19 Class1:hms Class :character Class :character
## Median :2015-09-17 Class2:difftime Mode :character Mode :character
## Mean :2015-10-13 Mode :numeric
## 3rd Qu.:2017-12-13
## Max. :2019-10-30
## Price Volume No_trades sentiment_score
## Min. :10.35 Min. : 1.0 Min. : 1.00 Min. :-1.0000
## 1st Qu.:14.40 1st Qu.: 53.0 1st Qu.: 13.00 1st Qu.:-1.0000
## Median :15.80 Median : 201.0 Median : 40.00 Median :-0.3333
## Mean :16.40 Mean : 336.4 Mean : 65.93 Mean :-0.3002
## 3rd Qu.:18.16 3rd Qu.: 458.8 3rd Qu.: 83.00 3rd Qu.: 0.0000
## Max. :33.66 Max. :10747.0 Max. :1062.00 Max. : 1.0000
First, we transform the variable Price from both datasets to four different period returns: one-minute, two-minute, five-minute, and ten-minute.
This is to determine which return yields the highest prediction accuracy that can best reflect the behavior of the investors.
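A minimal sketch of this transformation for the ten-minute horizon, assuming dplyr; the column and object names are illustrative, and the actual code may differ, e.g., in how session boundaries are handled.

library(dplyr)

# ten-minute simple return and the corresponding gain/loss class
# (simplified sketch: ignores breaks between press conference sessions)
FOMC_sentiment_ES_m <- FOMC_sentiment_ES_m %>%
  arrange(DateTime) %>%
  mutate(ret_10m        = lead(Price, 10) / Price - 1,
         ten.min.return = factor(if_else(ret_10m > 0, "gain", "loss"),
                                 levels = c("gain", "loss")))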
In total, we train 16 models: one for each combination of return horizon (four horizons), index (ES and VIX), and algorithm (logistic regression and random forest).
We split the data into training and testing sets using 70 percent of observation as the training data.
We use the tidymodels package to build the machine learning models.
First, we create model recipes for every period return of both indices.
Next, we set the computational engine and working pipelines. Then, we fit the models.
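As an illustration, here is a minimal sketch of one such pipeline (logistic regression on the ES data with the ten-minute return). The object names and the recipe formula are illustrative and may differ from the exact specification used in our scripts.

library(tidymodels)

# 70/30 train-test split of the ES dataset
es_split <- initial_split(FOMC_sentiment_ES_m, prop = 0.7)
es_train <- training(es_split)
es_test  <- testing(es_split)

# recipe: classify ten-minute gain/loss from sentiment features and conference metadata
es_ten_recipe <- recipe(ten.min.return ~ ps_score + rss_score + sentiword + fear + part + Speaker,
                        data = es_train) %>%
  step_dummy(all_nominal_predictors())

# computational engine
lr_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

# working pipeline (workflow) and fit
es_lr_ten_wf  <- workflow() %>%
  add_recipe(es_ten_recipe) %>%
  add_model(lr_spec)

es_lr_ten_fit <- fit(es_lr_ten_wf, data = es_train)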
Once the trained models are ready, we make out-of-sample predictions. Based on the features from the testing dataset, the predicted values from the trained models are compared directly with the actual classes.
We use two metrics to measure model performance: prediction accuracy and AUC score.
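A hedged sketch of how the accuracy and AUC objects reported below could be computed with yardstick; the object names mirror the printed ones, but the exact code may differ.

# out-of-sample predictions (classes and probabilities) for the ten-minute ES model
es_lr_ten_pred <- predict(es_lr_ten_fit, es_test) %>%
  bind_cols(predict(es_lr_ten_fit, es_test, type = "prob")) %>%
  bind_cols(es_test["ten.min.return"])

# prediction accuracy and AUC; "gain" is assumed to be the event (first factor) level
es.lr.ten.accuracy <- accuracy(es_lr_ten_pred, truth = ten.min.return, estimate = .pred_class)
es.lr.ten.auc      <- roc_auc(es_lr_ten_pred, truth = ten.min.return, .pred_gain)

Analogous objects (e.g., es.lr.one.accuracy, vx.lr.ten.auc) are built for the other horizons and for the VIX data.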
bind_rows(es.lr.one.accuracy,es.lr.one.auc) #ES index one-minute
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.541
## 2 roc_auc binary 0.513
bind_rows(es.lr.two.accuracy,es.lr.two.auc) #ES index two-minute
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.537
## 2 roc_auc binary 0.539
bind_rows(es.lr.five.accuracy,es.lr.five.auc) #ES index five-minute
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.583
## 2 roc_auc binary 0.587
bind_rows(es.lr.ten.accuracy,es.lr.ten.auc) #ES index ten-minute
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.662
## 2 roc_auc binary 0.697
bind_rows(vx.lr.one.accuracy,vx.lr.one.auc) #VX index one-minute
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.553
## 2 roc_auc binary 0.580
bind_rows(vx.lr.two.accuracy,vx.lr.two.auc) #VX index two-minute
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.556
## 2 roc_auc binary 0.563
bind_rows(vx.lr.five.accuracy,vx.lr.five.auc) #VX index five-minute
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.598
## 2 roc_auc binary 0.625
bind_rows(vx.lr.ten.accuracy,vx.lr.ten.auc) #VX index ten-minute
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.667
## 2 roc_auc binary 0.740
Both the prediction accuracy and AUC score indicate that ten-minute return models have the best performance in classifying gain/loss from investments.
However, this finding can be misleading, since the trained models might suffer from overfitting. Although we use the testing dataset to validate the model predictions, both the training and testing data come from the same original dataset, so the variation in the observations of the two sets may be similar.
Furthermore, the sample sizes of both the ES and VIX index are not large. We would need to employ a resampling technique to reevaluate the performance of each model.
To prevent the logistic regression model from overfitting the training dataset, we perform a ten-fold repeated cross-validation.
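A minimal sketch of the resampling step, assuming the rsample and tune packages; the number of repeats is not fixed in the text and is an assumption here.

# ten-fold repeated cross-validation on the ES training data
es_folds <- vfold_cv(es_train, v = 10, repeats = 5)   # repeats = 5 is an assumption

# re-estimate the ten-minute logistic regression workflow over the folds
es_lr_ten_cv <- fit_resamples(es_lr_ten_wf,
                              resamples = es_folds,
                              metrics   = metric_set(accuracy, roc_auc))
collect_metrics(es_lr_ten_cv)

The comparison tables below aggregate such resampled metrics across the four horizon-specific workflows for each index.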
After performing the ten-fold repeated cross-validation, we retrieve the best-performing model for each index.
## # A tibble: 8 x 4
## wflow_id .metric mean preprocessor
## <chr> <chr> <dbl> <chr>
## 1 ten.min_logistic_reg accuracy 0.634 recipe
## 2 ten.min_logistic_reg roc_auc 0.698 recipe
## 3 five.min_logistic_reg accuracy 0.590 recipe
## 4 five.min_logistic_reg roc_auc 0.608 recipe
## 5 two.min_logistic_reg accuracy 0.570 recipe
## 6 two.min_logistic_reg roc_auc 0.591 recipe
## 7 one.min_logistic_reg accuracy 0.555 recipe
## 8 one.min_logistic_reg roc_auc 0.557 recipe
## # A tibble: 8 x 4
## wflow_id .metric mean preprocessor
## <chr> <chr> <dbl> <chr>
## 1 ten.min_logistic_reg accuracy 0.681 recipe
## 2 ten.min_logistic_reg roc_auc 0.747 recipe
## 3 five.min_logistic_reg accuracy 0.596 recipe
## 4 five.min_logistic_reg roc_auc 0.637 recipe
## 5 two.min_logistic_reg accuracy 0.554 recipe
## 6 two.min_logistic_reg roc_auc 0.558 recipe
## 7 one.min_logistic_reg accuracy 0.524 recipe
## 8 one.min_logistic_reg roc_auc 0.524 recipe
For both the ES and VIX indices, the logistic regression with a ten-minute return yields the highest AUC score and prediction accuracy. Thus, we select the ten-minute return model as the representative model among other different time interval returns.
First, we use the recipes for every model that we previously created.
Then, we set up the algorithm’s computational engine and working pipeline to train the models easily.
To train the random forest models, we use an engine from the ranger package.
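A minimal sketch of the random forest specification, assuming the parsnip interface to the ranger engine; hyperparameters are left at their defaults here (tuning is discussed later), and the recipe is reused from the logistic regression sketch above.

# random forest specification with the ranger engine
rf_spec <- rand_forest() %>%
  set_engine("ranger") %>%
  set_mode("classification")

# reuse the ten-minute ES recipe in a new workflow and fit it
es_rf_ten_wf  <- workflow() %>%
  add_recipe(es_ten_recipe) %>%
  add_model(rf_spec)

es_rf_ten_fit <- fit(es_rf_ten_wf, data = es_train)

Fitting such workflows for every horizon and index and predicting on the test sets, we obtain the following prediction accuracies: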
bind_rows(es.rf.one.accuracy, es.rf.one.auc) #ES index one-minute
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.540
## 2 roc_auc binary 0.525
bind_rows(es.rf.two.accuracy, es.rf.two.auc) #ES index two-minute
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.538
## 2 roc_auc binary 0.550
bind_rows(es.rf.five.accuracy, es.rf.five.auc) #ES index five-minute
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.598
## 2 roc_auc binary 0.636
bind_rows(es.rf.ten.accuracy, es.rf.ten.auc) #ES index ten-minute
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.652
## 2 roc_auc binary 0.726
bind_rows(vx.rf.one.accuracy, vx.rf.one.auc) #VX index one-minute
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.558
## 2 roc_auc binary 0.564
bind_rows(vx.rf.two.accuracy, vx.rf.two.auc) #VX index two-minute
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.567
## 2 roc_auc binary 0.572
bind_rows(vx.rf.five.accuracy, vx.rf.five.auc) #VX index five-minute
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.619
## 2 roc_auc binary 0.659
bind_rows(vx.rf.ten.accuracy, vx.rf.ten.auc) #VX index ten-minute
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.700
## 2 roc_auc binary 0.766
From the prediction accuracy and the AUC scores, we can see that the ten-minute return models have the best prediction performance among other models.
This finding is consistent with the case of logistic regression models.
The random forest model has hyperparameter values that must be tuned to obtain the best fit. Instead of using the default values, we search for the values that best fit our data. We use a random grid to select combinations of three hyperparameters of the random forest model: mtry, trees, and min_n. We find that the best hyperparameter values are mtry = 9, trees = 1945, min_n = 14 for the ES index model and mtry = 3, trees = 555, min_n = 39 for the VIX index model.
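A hedged sketch of the random-grid tuning step for the ES model; the grid size and parameter ranges are illustrative, while the reported best values come from our actual runs.

# random forest with tunable hyperparameters
rf_tune_spec <- rand_forest(mtry = tune(), trees = tune(), min_n = tune()) %>%
  set_engine("ranger") %>%
  set_mode("classification")

rf_tune_wf <- workflow() %>%
  add_recipe(es_ten_recipe) %>%
  add_model(rf_tune_spec)

# random grid over the three hyperparameters (grid size is an assumption)
rf_grid <- grid_random(mtry(range = c(1, 15)), trees(), min_n(), size = 20)

# evaluate the grid on the resampling folds from the cross-validation sketch
rf_tuned <- tune_grid(rf_tune_wf, resamples = es_folds, grid = rf_grid)
select_best(rf_tuned, metric = "roc_auc")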
Above, we trained sixteen models in total to select the best ones for each modeling algorithm and trading dataset, measured in terms of their highest prediction accuracy.
In all cases, ten-minute return models were consistently the best-performing models.
We investigate further how each feature fares in terms of its importance in predicting gain/loss from investments, using permutation feature importance (PFI).
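A hedged sketch of the PFI computation, assuming a DALEX explainer built around a ranger probability forest (the explainer printed later in the fairness section has the same shape); the feature list and object names are illustrative.

library(DALEX)
library(ranger)

# probability forest predicting the ten-minute return class (illustrative feature set)
rf_fit <- ranger(ten.min.return ~ ps_score + rss_score + fear + sentiword + part + Speaker,
                 data = es_train, probability = TRUE)

rf_features  <- c("ps_score", "rss_score", "fear", "sentiword", "part", "Speaker")
rf_explainer <- explain(rf_fit,
                        data  = es_train[, rf_features],
                        y     = as.numeric(es_train$ten.min.return == "gain"),
                        predict_function = function(m, d) predict(m, d)$predictions[, "gain"],
                        label = "ranger")

# permutation feature importance: drop in (1 - AUC) after permuting each feature
rf_pfi <- model_parts(rf_explainer, loss_function = loss_one_minus_auc)
plot(rf_pfi)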
It can be observed that among the top ten most important features, the rankings across the models are not drastically different. All four models seem to capture similar patterns among target variables and various features.
From the plots, it can be observed that the variable part has the highest impact in predicting gain/loss from investments, implying that investors react differently during the two sections of the FOMC press conference: the prepared remarks and the Q&A.
Speaker seems to have a high predictive impact in logistic models, but is less important in the random forest model.
ps_score is consistently ranked among the most important features in terms of predictive power for all models.
We compute PD profiles for each model using the ten most important features as shown in the previous section.
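A minimal sketch of the PD computation, reusing the explainer from the PFI sketch above; the variable list is illustrative.

# partial dependence profiles for selected features
rf_pdp <- model_profile(rf_explainer,
                        variables = c("ps_score", "rss_score", "fear"))
plot(rf_pdp)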
The logistic model does not capture the difference in market response to different speakers well.
ES index market seems to react more to Bernanke while VIX index market reacts more to Yellen.
The Partial Dependence profiles reveal that the logistic model captures only linear relationships, whereas the random forest model can capture more non-linear relationships.
Both models highlight the different market reactions during prepared remarks and Q&A, but only random forest models can capture different market reactions to different FOMC speakers clearly.
The PD profile of ps_score confirms the hypothesis that investors react rationally to the policy direction of the Fed. The positive relationship indicates that investors always look for an opportunity to gain, no matter how the Fed positions its monetary policy.
The PD profile of fear also highlights the possibility of investors reacting purely based on emotion, but the degree of importance of ps_score is much higher than that of fear.
The PD profile of rss_score does not reveal a clear pattern in its relationship with the target variable.
Contrastive PD profiles reveal that although the random forest model manages to capture a more complex relationship between target variable and features, it falls short in terms of the range of prediction compared to the logistic regression model.
Logistic regression models have a wider range of predictions: for some features the predictions span roughly 0.2 to 0.6, although on average they stay around 0.4 to 0.6. The predictions computed by the random forest models are limited to around 0.4, consistently across all features. The logistic regression model might therefore be more versatile when deployed in production, whereas the random forest model might suffer some drawbacks and need to be retrained more frequently.
Once we have the results from the model prediction and interpretability test, we conduct a fairness test. We choose a ten-minute return as our target variable because it gives the best prediction model.
This project aims to test whether emotion is a good predictor for trading behavior. Therefore, we will test policy stance, RSS score, and fear. We are also aware that gender and race may affect the tone and language when people speak; thus, we will also include the speaker in the tests.
We use the fairmodels package to run the fairness tests (see Wiśniewski & Biecek, 2021).
Please note the naming: ESMS corresponds to what was previously called FOMC_sentiment_ES_m.
After loading the ESMS dataset, to run the fairmodels package we need to convert the variables we will test into factor variables and remove all unused variables.
## X positive negative anger anticipation sadness surprise fear trust joy
## 1 1 1 1 0 1 0 0 0 3 1
## 2 2 2 0 0 2 0 0 4 12 0
## 3 3 0 2 0 3 0 2 3 4 0
## 4 4 0 5 2 2 0 2 5 3 1
## 5 5 1 0 0 2 2 0 7 5 1
## 6 6 1 5 1 7 1 2 1 10 2
## disgust sentiword ps_score rss_score total_hawkish total_dovish
## 1 0 2.6770833 -0.05 -0.0007089685 21 19
## 2 1 -0.5312500 -0.05 -0.0007089685 21 19
## 3 0 -0.6250000 -0.05 -0.0007089685 21 19
## 4 1 -2.0625000 -0.05 -0.0007089685 21 19
## 5 0 -0.2180556 -0.05 -0.0007089685 21 19
## 6 0 1.7916667 -0.05 -0.0007089685 21 19
## approach_score avoid_score DateTime Date Time
## 1 17 23 2011-04-27 14:31:00 2011-04-27 14:31:00
## 2 17 23 2011-04-27 14:32:00 2011-04-27 14:32:00
## 3 17 23 2011-04-27 14:33:00 2011-04-27 14:33:00
## 4 17 23 2011-04-27 14:34:00 2011-04-27 14:34:00
## 5 17 23 2011-04-27 14:35:00 2011-04-27 14:35:00
## 6 17 23 2011-04-27 14:36:00 2011-04-27 14:36:00
## part Speaker Volume No_trades sentiment_score one.min.return
## 1 Prepared remarks Bernanke 5609 1054 0.0000000 gain
## 2 Prepared remarks Bernanke 1369 284 1.0000000 gain
## 3 Prepared remarks Bernanke 1162 210 -1.0000000 gain
## 4 Prepared remarks Bernanke 1633 279 -1.0000000 gain
## 5 Prepared remarks Bernanke 1404 265 1.0000000 loss
## 6 Prepared remarks Bernanke 2260 403 -0.6666667 gain
## two.min.return five.min.return ten.min.return
## 1 gain gain gain
## 2 gain gain gain
## 3 gain gain gain
## 4 gain gain gain
## 5 loss gain gain
## 6 gain gain gain
Our policy stance and RSS scores are still numeric, so we create new factor variables from them. For the policy stance, we bin the original score into “hawkish”, “neutral”, and “dovish” levels, save the result as a factor, and label the levels accordingly.
The same procedure is applied to the RSS score: we create a new variable rss by binning the original score into anxiety (“avoid”), “neutral”, and excitement (“approach”) levels, saving it as a factor and labeling the levels accordingly.
We also factorize our target variable, ten.min.return. We do not flip the score because the model is created to predict the positive outcome, that is, the gain return.
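A hedged sketch of this preprocessing; the cut points and the ordering of the labels are assumptions, not the exact thresholds used in our scripts.

# bin the numeric policy stance score into a three-level factor (thresholds and label
# order are assumptions that depend on the sign convention of ps_score)
esms$ps <- cut(esms$ps_score,
               breaks = c(-Inf, -0.1, 0.1, Inf),
               labels = c("dovish", "neutral", "hawkish"))

# same idea for the RSS score: anxiety ("avoid"), "neutral", excitement ("approach")
esms$rss <- cut(esms$rss_score, breaks = 3,
                labels = c("avoid", "neutral", "approach"))

# factorize the target; "gain" is the positive outcome the models predict
esms$ten.min.return <- factor(esms$ten.min.return)

A ranger model is then fitted on these data and wrapped in a DALEX explainer, which produces the output below.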
## Preparation of a new explainer is initiated
## -> model label : ranger ( default )
## -> data : 2099 rows 29 cols
## -> target variable : 2099 values
## -> predict function : yhat.ranger will be used ( default )
## -> predicted values : No value for predict function target column. ( default )
## -> model_info : package ranger , ver. 0.12.1 , task classification ( default )
## -> predicted values : numerical, min = 0.02421984 , mean = 0.6054935 , max = 0.994481
## -> residual function : difference between y and yhat ( default )
## -> residuals : numerical, min = -0.6471508 , mean = 0.0009857974 , max = 0.5797627
## A new explainer has been created!
We chose policy stance as the protected group variable; it may contain sensitive information and needs to be tested for fairness. Policy stance consists of “hawkish” and “dovish” and is classified by the algorithm using a ‘search and count’ function, which leaves some room for potential bias depending on the language and wording style of the communicator.
Further, we select “dovish” as the privileged subgroup because we found that the dovish score dominates as a predictor in our models. This may imply that dovish texts have been treated differently by the algorithm.
fobject_ps <- fairness_check(rf_explainer, # explainer
protected = esms$ps, # ps as protected variable as factor
privileged = "dovish", # level in protected variable, potentially more privileged
cutoff = 0.5, # cutoff - optional, default = 0.5
colorize = FALSE)
## Creating fairness classification object
## -> Privileged subgroup : character ( Ok )
## -> Protected variable : factor ( Ok )
## -> Cutoff values for explainers : 0.5 ( for all subgroups )
## -> Fairness objects : 0 objects
## -> Checking explainers : 1 in total ( compatible )
## -> Metric calculation : 12/12 metrics calculated for all models
## Fairness object created succesfully
We use the “ranger” model to check whether this variable introduces bias:
##
## Fairness check for models: ranger
##
## ranger passes 4/5 metrics
## Total loss: 1.915318
The one failed metric is the Predictive Equality Ratio; overall, this result is good enough.
None of our ranger models reach the red fields on the left, which means there is no bias against the unprivileged groups (in our case, “neutral” and “hawkish”). However, in one metric, the Predictive Equality Ratio (PER), “hawkish” enters the red field on the right quite far. This indicates a bias relative to the privileged subgroup (i.e., “dovish”), driven mainly by hawkish, with a ratio of more than 2. This is an interesting result, since the overall policy stance of the texts in our datasets is actually dovish. Why does this bias occur?
From the density plot, we can see that the model is more likely to categorize “hawkish” as contributing to the ten-minute return than “neutral” and “dovish”, respectively.
We could also see the raw (unscaled) metrics using the metric_score object.
Because it is infrequent for a model to pass all metrics, it is worth elaborating on how the fairness test plays out across different models and explainers.
Here we present one object with all explainers.
From the plot, we can see that our models perform relatively well, satisfying the fairness test on three metrics: the Accuracy equality ratio, the Equal opportunity ratio, and the Predictive parity ratio. However, for the Predictive equality ratio, the gbm and ranger models are strongly biased for the neutral subgroup and, for the hawkish subgroup, toward the privileged group. Also in this metric, in the ranger_2 model, hawkish is strongly biased relative to dovish, with a ratio of about 2.2.
Among all four models, only ranger_3 has no values reaching the red fields on either the left or the right, which means this model treats the subgroups equally. Note that ranger_3 is the model built with fear and Speaker as predictors.
Let us see the fairness metrics for all models.
##
## Fairness check for models: ranger, ranger_2, ranger_3, gbm
##
## ranger passes 5/5 metrics
## Total loss: 0.3213423
##
## ranger_2 passes 4/5 metrics
## Total loss: 1.984299
##
## ranger_3 passes 5/5 metrics
## Total loss: 0.3213423
##
## gbm passes 3/5 metrics
## Total loss: 1.857252
The metric scores support our previous inference. The ranger and ranger_3 models pass all 5 metrics, ranger_2 passes 4 out of 5, and gbm passes 3 out of 5. In terms of fairness metrics, the best models are thus ranger and ranger_3.
The contents of the fairness object can be examined by evaluating the parity loss of each metric.
fobject_ps_mix$parity_loss_metric_data
## TPR TNR PPV NPV FNR FPR FDR
## 1 0.000000000 NaN 0.15081155 NA NaN 0.000000 0.2595238
## 2 0.007362293 0.0766040 0.02785503 0.003873998 0.6545323 1.159271 0.7779351
## 3 0.000000000 NaN 0.15081155 NA NaN 0.000000 0.2595238
## 4 0.154319680 0.4462236 0.15478199 0.169365348 1.0705113 1.241188 0.8722794
## FOR TS STP ACC F1
## 1 NA 0.15081155 0.0000000 0.15081155 0.09252516
## 2 0.1896209 0.02139447 0.1669535 0.01826086 0.01095359
## 3 NA 0.15081155 0.0000000 0.15081155 0.09252516
## 4 0.5733673 0.26411634 0.3229578 0.16168642 0.15270457
The closer a value is to 0, the more equally the model treats all subgroups. The majority of the metrics are dominated by values at or near 0, which means our models are likely to treat the subgroups equally.
Now we want to see how the models perform based on different metrics.
From the plot, we see that ranger and ranger_3 have the smallest metric scores, meaning that they are the best models; their parity losses appear only in ACC and PPV. This again confirms our previous metric-score analysis.
We also compute the fairness PCA and plot it.
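A minimal sketch of this step, assuming the fairness_pca() helper from fairmodels.

# principal component analysis of the parity loss metrics across models
fpca <- fairness_pca(fobject_ps_mix)
fpca
plot(fpca)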
## Fairness PCA :
## PC1 PC2 PC3 PC4
## [1,] 0.1201169 1.678501 -1.318390e-16 1.076342e-16
## [2,] -2.6322166 -1.517985 1.110223e-16 -8.500145e-17
## [3,] 0.1201169 1.678501 -1.318390e-16 1.076342e-16
## [4,] 2.3919828 -1.839018 2.220446e-16 3.556183e-17
##
## Created with:
## [1] "ranger" "ranger_2" "ranger_3" "gbm"
##
## First two components explained 100 % of variance.
To depict the fairness of our grouped data more clearly, we use a heatmap.
We now want to see the metric within groups and decide, which model to use of the two best models we already have, that is ranger and ranger_3.
Regarding the fairness metric (the FPR scale), ranger shows a better fairness score than ranger_3. However, in terms of the performance metric, ranger_3 shows a higher score, meaning higher accuracy. Both models show similar patterns in both parity loss metrics. This can also be seen in the radar plot below:
We can also test how cut-off points affect the parity loss:
We can also see how parity loss metrics change if we modify the cutoff only for one subgroup (here, “dovish”).
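A hedged sketch of these two cutoff analyses, assuming the all_cutoffs() and ceteris_paribus_cutoff() helpers from fairmodels.

# parity loss metrics as a function of a common cutoff applied to all subgroups
plot(all_cutoffs(fobject_ps))

# parity loss metrics when only the cutoff for the "dovish" subgroup is varied
plot(ceteris_paribus_cutoff(fobject_ps, subgroup = "dovish"))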
From the above results, we conclude that the variable policy stance is treated sufficiently fairly in our models, and there is no need to run bias mitigation. Not every model passes all metrics, however: our first ranger model passes only 4 out of 5. This is not an excellent result, but not a bad one either, so we decide not to proceed with bias mitigation. We suspect that the most probable cause of the bias is algorithmic bias, which could still be treated using various bias mitigation techniques.
Please note the naming: ESVX corresponds to what was previously FOMC_sentiment_VX_m.
We now move to the fairness test for the ESVX data. The whole procedure is similar to the previous one for policy stance. We load the data and, to preprocess it for the fairness test, remove all variables that are not required and/or not factorized.
Next, we add the policy stance score as a factor with levels. This time we do not convert the rss_score, because it consists of only one level.
We also factorize our target variable, ten.min.return. We do not flip the score because the model already predicts the positive outcome, that is, the gain return. We also factorize “part” to be able to compute model comparisons later.
## Preparation of a new explainer is initiated
## -> model label : ranger ( default )
## -> data : 2162 rows 28 cols
## -> target variable : 2162 values
## -> predict function : yhat.ranger will be used ( default )
## -> predicted values : No value for predict function target column. ( default )
## -> model_info : package ranger , ver. 0.12.1 , task classification ( default )
## -> predicted values : numerical, min = 0.02676032 , mean = 0.5460687 , max = 0.9973651
## -> residual function : difference between y and yhat ( default )
## -> residuals : numerical, min = -0.6608349 , mean = -0.0002777185 , max = 0.6426
## A new explainer has been created!
We chose Speaker as the protected group, i.e., the variable that contains sensitive data and needs to be tested for fairness. The reason is apparent: not only does it contain gender variation, the speaker is also a perfect medium for analyzing possible bias in text-based emotion analysis, as it captures language style, wording, tone, and, to some extent, the culture behind the communicator.
Further, we select Bernanke as the suggested privileged subgroup because we found different patterns between the logistic regression and random forest models for Bernanke, but not so much for Powell or Yellen. This may imply that Bernanke’s wording has been treated somewhat differently.
## Creating fairness classification object
## -> Privileged subgroup : character ( Ok )
## -> Protected variable : factor ( Ok )
## -> Cutoff values for explainers : 0.5 ( for all subgroups )
## -> Fairness objects : 0 objects
## -> Checking explainers : 1 in total ( compatible )
## -> Metric calculation : 12/12 metrics calculated for all models
## Fairness object created succesfully
The ranger model result for fairness checks:
##
## Fairness check for models: ranger
##
## ranger passes 4/5 metrics
## Total loss: 1.07354
Let’s see which part of the metric passes and which fails.
Our ranger model passes 4 metrics. The failed metric is the Predictive Equality Ratio, similar to our previous fairness test for the policy stance variable. According to this metric, the model is biased against Yellen as an unprivileged subgroup, with a ratio just slightly beyond the threshold of 0.8.
Why does this bias occur?
From the density plot above, we can see that the ranger model is more likely to categorize Bernanke and Powell, respectively, as contributing to the ten-minute return compared to Yellen. However, the difference between Bernanke and Powell is very small.
We can also see the raw (unscaled) metrics using the metric_score object:
Again, we elaborate on how the fairness test results differ across models and explainers to better understand the bias tendency in our models.
According to this plot, only in the ranger model is there a strong bias in favor of Bernanke (the privileged subgroup) relative to Yellen (an unprivileged subgroup). This can be seen in three metrics: the Equal opportunity ratio, the Predictive equality ratio (where it is very strong, with a ratio of more than 16), and the Statistical parity ratio.
Note that the ranger model here predicts the ten-minute return from policy stance and part. The strong bias in this model may reflect a speaker effect, or a “Yellen effect” in particular, that is strongly sensitive to policy stance (i.e., the use of hawkish and dovish terms) and part (i.e., prepared remarks versus Q&A). There are two possible reasons for the differences: the speakers’ monetary policy styles are saliently different, and the direct communication between the speaker and the journalists during the Q&A may reveal a wider variation between Yellen and Bernanke (and Powell) in the emotional tone of their communication styles.
Now, we want to see what the plot_density object can tell us regarding the likelihood of each model.
As we can see, ranger and ranger_3 give more or less similar results, in which the probability of each of the three speakers contributing to the ten-minute return is roughly equal, around 0.5; this result basically tells us nothing. The gbm and ranger_2 models show a significantly different pattern from the previous two, although they are quite similar to each other. In both gbm and ranger_2, Yellen is the least likely to contribute to the return and Bernanke receives the highest probability; ranger_2, however, shows larger differences between the speakers.
Let us see what the metrics say.
From this plot we can infer that most models suffer from bias, because they show relatively large parity losses in many metrics. Only in FPR and STP does the ranger model have a score close to 0.
But, let us see the fairness metrics for each model.
##
## Fairness check for models: ranger, ranger_2, ranger_3, gbm
##
## ranger passes 1/5 metrics
## Total loss: 18.69909
##
## ranger_2 passes 4/5 metrics
## Total loss: 1.385423
##
## ranger_3 passes 3/5 metrics
## Total loss: 1.217589
##
## gbm passes 3/5 metrics
## Total loss: 1.63332
The ranger model passes only 1 metric, ranger_3 and gbm pass 3 metrics each, while ranger_2 passes 4. This suggests that ranger_2 and gbm are our best models here, that is, they treat the variables most equally.
Note that in the ranger_2 model we predict the ten-minute return from all variables in the dataset using the random forest model.
Now we want to see how our models perform based on different metrics.
From the plot, we see that ranger_2 and gbm have the smallest metric scores, meaning that they are the best models. They both have the biggest parity loss in STP. This plot also confirms our previous metrics scores analysis.
For a better visual, we also try to compute the fairness PCA and then plot it.
## Fairness PCA :
## PC1 PC2 PC3 PC4
## [1,] -3.9216739 -0.5298641 0.03516476 5.551115e-17
## [2,] 2.2923151 -1.5312095 -0.37072023 2.775558e-17
## [3,] 0.5006532 1.4239014 -0.96961498 4.163336e-17
## [4,] 1.1287056 0.6371722 1.30517045 1.665335e-16
##
## Created with:
## [1] "ranger" "ranger_2" "ranger_3" "gbm"
##
## First two components explained 91 % of variance.
The heatmap depicts the fairness of our group data more clearly.
We now want to see the metric within groups and decide which model to use out of the two best models we already have, that is ranger_2 and gbm.
The two models show very different patterns in the FPR fairness metric. The gbm shows higher scores for all subgroups and is more dispersed than ranger_2, meaning that ranger_2 performs better in terms of the fairness metric (at least for FPR). In terms of the performance metric, ranger_2 also has higher accuracy than gbm.
Overall, we can say that the result of the fairness test for the Speaker variable is worse than for policy stance. However, two of our models still pass four out of five metrics, so we can say that our variables are treated sufficiently fairly. As for the types of bias, we suggest that for Speaker the bias can be classified as a “historical bias”, following Mehrabi et al. (2019).
We have no control over how the data were extracted from the source material or over the quality of that extraction. In fact, we found one session that was extracted twice (Bernanke, January 24, 2012; lines 15156-21916 and 21917-31608). The other problem is that one session is missing, namely the transcript of Yellen (September 17, 2014).
Currently, the pursuit of fairness in emotion and sentiment algorithms is a new agenda in sentiment analysis (see, for example, Mohammad, 2021). There are no conclusive findings at the moment, but factors such as gender and race, which can be traced back to culture and language, evidently also have the potential to affect fairness. Our findings support this argument, particularly for the Speaker variable.
The RSS score variable is also crucial in our model. Unfortunately, we have no solution for testing fairness on numeric variables, nor for deriving our own factor levels from the score (within norms) rather than from predetermined levels. The fairmodels package that we use does not have an option to test variables that are not categorical.
Apart from the emotion detection algorithm, we also found that the speaker and the part of the conference (the prepared remarks or the Q&A session) matter. This makes sense, as non-verbal communication such as tone and emotion is conveyed during the Q&A.
Another essential potential impact takes us back to the primary goal of this project: a central bank can use emotional tone to sharpen its message. We learn from our results that Powell uses a more ‘emotional’ tone than Bernanke and Yellen, which affects the number of trades.
On the other hand, we should also consider that emotions alone may only give a short-term outcome and may suffer from inconsistency in the long term. A typical solution to this problem is commitment to and consistency of the implemented policies.