This document reports on a final project for the Responsible Machine Learning course delivered to Ph.D. students at the University of Warsaw (Ph.D. program in Quantitative Psychology and Economics) by prof. Przemysław Biecek in the summer semester of 2021.
Our models showed that the emotional tone of the FOMC press conference texts can be used to predict trading behavior.
Although the random forest model captures a more complex relationship between the target variable and the features, it falls short of the logistic regression model in terms of the range of its predictions. Because of its wider prediction range, the logistic regression model might be more versatile when deployed in production, whereas the random forest model might suffer some drawbacks and need to be retrained more frequently.
Emotions perform better than general sentiment as predictors. Emotions formulated under a particular narrative, e.g., policy stance (hawkish-dovish) and the relative sentiment score (excitement-anxiety), perform better than raw emotion indices (e.g., fear, anger, anticipation).
Policy stance and Speaker have good fairness scores, passing 4 out of 5 metrics. The probable sources of the bias are historical bias and algorithmic bias. Unfortunately, the RSS cannot be tested.
Predicting the stock market is hard, especially during uncertain times, for example during shifts in monetary policy. Most human agents, even the most experienced market players, do not behave fully rationally or attentively when dealing with problems of strategic uncertainty. Economists are therefore starting to move away from rational expectations toward behavioral expectations when predicting how humans react to monetary policy.
A number of studies have shown that emotions (see, for example, Duxbury et al., 2020), particularly fear, play a dominant role in shaping decisions in the stock market. In behavioral economics, terms such as loss aversion and the disposition effect capture the significant phenomenon that negative feelings, rather than positive ones, drive people’s behavior and decisions when they anticipate losses.
Can we predict trading behavior if we have information on the emotions conveyed in economics-related news? Increasingly sophisticated machine learning algorithms and the FOMC press conferences provide ideal tools for researchers to answer this question.
We try to predict the trading behavior captured by two US indices, the E-mini S&P 500 futures (ES) and the CBOE volatility index (VIX), using emotion analysis of the scripts of the FOMC prepared remarks and Q&A sessions. The ES contract is a derivative that speculates on the future movement of the S&P 500 index. The VIX index, in turn, reflects the 30-day forward projection of S&P 500 volatility. The FOMC conference scripts cover the period 2011-2019.
The Federal Open Market Committee (FOMC) is a committee within the Federal Reserve System (the Fed) in the USA. It oversees open market operations and makes critical decisions about interest rates and the growth of the US money supply. Eight regularly scheduled meetings are held each year at intervals of five to eight weeks. At each meeting, the Committee votes on the policy to be carried out until the next meeting. After each meeting there is a press conference, at which the Committee conveys its economic predictions and guidance for the upcoming quarters to the public.
First, the Chairman reads the written prepared remarks for approximately 15 minutes; a Q&A session then follows for the next 45 minutes. This event is undoubtedly one of the most awaited by market players, because much significant decision-making depends on whether, for example, the Fed will raise the interest rate or not.
In the present project, we investigate the mentioned effects using machine learning models. The objective is to classify financial returns from investments in the ES and VIX indices. Our project makes use of logistic regression and random forest algorithms. A total of four models are constructed, including the following:
Two baseline models are constructed, using the ES and VIX index returns as target variables and their lag terms as features. Information criteria determine the optimal number of lags.
Four additional features are added to the baseline models: implied policy stance, sentiment score, the interaction term between implied policy stance and sentiment score, and a categorical variable to differentiate the section of the press conference.
First of all, there is a growing interest in studying the effectiveness of Central Bank communication to manage inflation better. Perhaps much can be read from textual analysis, voice tone, facial expression, and gesture of the Chairman during the public announcement. However, studying the emotional aspect is relatively new and somewhat challenging. Most economic texts produced by the Central Bank are “cold” and predisposed to be as unemotional as possible.
Second of all, by using a responsible machine learning approach to analyze predictive models in trading behavior using emotions analysis from textual data, we expect to have a better, more reliable, fairer, and more rigorous model.
Subjectivity in detecting and interpreting emotions may cause bias within the algorithm and between the users of the machine (or model). Therefore, achieving good model performance alone may not be enough; we have to ensure that the emotion detection has been thoroughly examined for biases regarding culture, language, context, etc.
Until recently, there has been no solid emotion detection program/algorithm specialized for analyzing economics/finance texts, let alone a model with good predictive value to use the emotion of economics text to predict market behavior.
Emotion analysis of the Central Bank text communication can be used to predict trading behavior.
Investors make a rational investment decision in response to the policy stance of the Federal Reserve.
The random forest model should capture a more complex relationship between the target variable and the features and give more accurate predictions.
The primary objective of our project is to extract sentiment (by sentiment, we refer to both sentiment and emotions) and implied policy stance from the scripts of FOMC statements and investigate the impact of these features on two US indices based on trading data.
A total of five sentiment lexica are used in our work:
Loughran and McDonald (2011) dictionary
We utilize a new metric created by Nyman et al. (2021), the relative sentiment score (RSS). This metric was developed using findings from social and psychological theories of narratives, which hold that humans are driven by context and sentiment. In particular, the authors suggest that the shift between excitement and anxiety is the most vital sentiment for predicting responses to financial news. They developed their own dictionary to capture the “excitement” and “anxiety” emotions, which differs from the Loughran and McDonald dictionary. The RSS is formulated as follows:
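$$\mathrm{RSS} = \frac{n_{\text{excitement}} - n_{\text{anxiety}}}{n_{\text{total}}}$$

where $n_{\text{excitement}}$ and $n_{\text{anxiety}}$ are the counts of excitement (“approach”) and anxiety (“avoid”) words in a piece of text and $n_{\text{total}}$ is its total word count. This is a reconstruction based on the approach and avoid word counts in our data; the exact normalization in Nyman et al. (2021) may differ.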
We also use the policy stance index, which indicates whether a text implies a “hawkish” or a “dovish” tone. In our study, we follow the procedure of Gorodnichenko et al. (2021). The formula is as follows:
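$$\mathrm{PS} = \frac{n_{\text{hawkish}} - n_{\text{dovish}}}{n_{\text{hawkish}} + n_{\text{dovish}}}$$

where $n_{\text{hawkish}}$ and $n_{\text{dovish}}$ are the counts of hawkish and dovish terms in a piece of text. This is a sketch of the assumed normalized-difference form; the exact counting window and sign convention in our implementation follow Gorodnichenko et al. (2021) and may differ from this reconstruction.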
We use NRC EmoLex (Mohammad & Turney, 2013) as an addition to typical emotion analysis from texts and utilize the LM dictionary, which was designed for analyzing sentiment from economics and finance text.
For general sentiment, we use the Sentiword lexicon (Baccianella et al., n.d.).
We want to compare the performance between the simple and complex models in terms of their predictability, explainability, and fairness, by comparing the policy stance, RSS, Sentiword, and NRC emotions performance.
Here we provide a basic description of the data used in our project.

* FOMC_sentiment_ES_m is the dataset containing all sentiment scores extracted from the FOMC scripts together with the ES index trading data aggregated into one-minute intervals.
* FOMC_sentiment_VX_m is the dataset containing all sentiment scores extracted from the FOMC scripts together with the VIX index trading data aggregated into one-minute intervals.
summary(FOMC_sentiment_ES_m)
## X1 positive negative anger
## Min. : 1.0 Min. : 0.000 Min. : 0.000 Min. : 0.0000
## 1st Qu.: 525.5 1st Qu.: 0.000 1st Qu.: 1.000 1st Qu.: 0.0000
## Median :1050.0 Median : 1.000 Median : 2.000 Median : 0.0000
## Mean :1050.0 Mean : 1.058 Mean : 2.214 Mean : 0.6417
## 3rd Qu.:1574.5 3rd Qu.: 2.000 3rd Qu.: 3.000 3rd Qu.: 1.0000
## Max. :2099.0 Max. :10.000 Max. :16.000 Max. :11.0000
## anticipation sadness surprise fear
## Min. : 0.000 Min. : 0.0000 Min. :0.0000 Min. : 0.000
## 1st Qu.: 1.000 1st Qu.: 0.0000 1st Qu.:0.0000 1st Qu.: 0.000
## Median : 3.000 Median : 1.0000 Median :1.0000 Median : 2.000
## Mean : 3.142 Mean : 0.8552 Mean :0.9909 Mean : 1.949
## 3rd Qu.: 4.000 3rd Qu.: 1.0000 3rd Qu.:2.0000 3rd Qu.: 3.000
## Max. :16.000 Max. :11.0000 Max. :8.0000 Max. :12.000
## trust joy disgust sentiword
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. :-4.5938
## 1st Qu.: 3.000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.:-0.5208
## Median : 5.000 Median : 1.000 Median : 0.000 Median : 0.3542
## Mean : 5.234 Mean : 1.348 Mean : 0.404 Mean : 0.3210
## 3rd Qu.: 7.000 3rd Qu.: 2.000 3rd Qu.: 1.000 3rd Qu.: 1.1271
## Max. :19.000 Max. :11.000 Max. :10.000 Max. : 6.0000
## ps_score rss_score total_hawkish total_dovish
## Min. :-0.5000 Min. :-0.005891 Min. : 0.000 Min. : 2.000
## 1st Qu.: 0.0000 1st Qu.:-0.003631 1st Qu.: 2.000 1st Qu.: 5.000
## Median : 0.3333 Median :-0.002722 Median : 4.000 Median : 8.000
## Mean : 0.2903 Mean :-0.002925 Mean : 4.915 Mean : 8.923
## 3rd Qu.: 0.5556 3rd Qu.:-0.001998 3rd Qu.: 6.000 3rd Qu.:11.000
## Max. : 1.0000 Max. :-0.000709 Max. :21.000 Max. :21.000
## approach_score avoid_score DateTime
## Min. : 5.00 Min. :17.00 Min. :2011-04-27 14:31:00
## 1st Qu.: 9.00 1st Qu.:25.00 1st Qu.:2013-06-19 14:37:30
## Median :11.00 Median :33.00 Median :2015-06-17 15:15:00
## Mean :12.69 Mean :36.82 Mean :2015-07-14 11:13:59
## 3rd Qu.:14.00 3rd Qu.:47.00 3rd Qu.:2017-09-20 15:03:30
## Max. :28.00 Max. :86.00 Max. :2019-10-30 15:18:00
## Date Time part Speaker
## Min. :2011-04-27 Length:2099 Length:2099 Length:2099
## 1st Qu.:2013-06-19 Class1:hms Class :character Class :character
## Median :2015-06-17 Class2:difftime Mode :character Mode :character
## Mean :2015-07-13 Mode :numeric
## 3rd Qu.:2017-09-20
## Max. :2019-10-30
## Price Volume No_trades sentiment_score
## Min. :1226 Min. : 3.0 Min. : 2.0 Min. :-1.0000
## 1st Qu.:1635 1st Qu.: 531.5 1st Qu.: 129.5 1st Qu.:-1.0000
## Median :2028 Median : 1298.0 Median : 291.0 Median :-0.3333
## Mean :2065 Mean : 2378.9 Mean : 477.5 Mean :-0.2992
## 3rd Qu.:2501 3rd Qu.: 3196.5 3rd Qu.: 655.5 3rd Qu.: 0.0000
## Max. :3043 Max. :25301.0 Max. :4742.0 Max. : 1.0000
summary(FOMC_sentiment_VX)
## X1 positive negative anger
## Min. : 1.0 Min. : 0.000 Min. : 0.000 Min. : 0.0000
## 1st Qu.: 565.2 1st Qu.: 0.000 1st Qu.: 1.000 1st Qu.: 0.0000
## Median :1108.5 Median : 1.000 Median : 2.000 Median : 0.0000
## Mean :1105.8 Mean : 1.083 Mean : 2.235 Mean : 0.6452
## 3rd Qu.:1649.8 3rd Qu.: 2.000 3rd Qu.: 3.000 3rd Qu.: 1.0000
## Max. :2192.0 Max. :13.000 Max. :16.000 Max. :11.0000
## anticipation sadness surprise fear
## Min. : 0.000 Min. : 0.0000 Min. :0.0000 Min. : 0.000
## 1st Qu.: 1.000 1st Qu.: 0.0000 1st Qu.:0.0000 1st Qu.: 0.000
## Median : 3.000 Median : 1.0000 Median :1.0000 Median : 2.000
## Mean : 3.189 Mean : 0.8631 Mean :0.9981 Mean : 1.964
## 3rd Qu.: 5.000 3rd Qu.: 1.0000 3rd Qu.:2.0000 3rd Qu.: 3.000
## Max. :16.000 Max. :11.0000 Max. :8.0000 Max. :12.000
## trust joy disgust sentiword
## Min. : 0.000 Min. : 0.000 Min. : 0.0000 Min. :-6.0208
## 1st Qu.: 3.000 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.:-0.5094
## Median : 5.000 Median : 1.000 Median : 0.0000 Median : 0.3750
## Mean : 5.332 Mean : 1.355 Mean : 0.4043 Mean : 0.3397
## 3rd Qu.: 7.000 3rd Qu.: 2.000 3rd Qu.: 1.0000 3rd Qu.: 1.1833
## Max. :27.000 Max. :11.000 Max. :10.0000 Max. : 6.0000
## ps_score rss_score total_hawkish total_dovish
## Min. :-0.5000 Min. :-0.005891 Min. : 0.000 Min. : 2.000
## 1st Qu.: 0.0000 1st Qu.:-0.003716 1st Qu.: 2.000 1st Qu.: 5.000
## Median : 0.3333 Median :-0.002953 Median : 4.000 Median : 8.000
## Mean : 0.3164 Mean :-0.002987 Mean : 4.747 Mean : 9.337
## 3rd Qu.: 0.5556 3rd Qu.:-0.002016 3rd Qu.: 6.000 3rd Qu.:12.000
## Max. : 1.0000 Max. :-0.000709 Max. :21.000 Max. :21.000
## approach_score avoid_score DateTime
## Min. : 5.00 Min. :17.00 Min. :2011-04-27 14:31:00
## 1st Qu.: 9.00 1st Qu.:25.00 1st Qu.:2013-06-19 15:20:15
## Median :11.00 Median :34.00 Median :2015-09-17 15:22:30
## Mean :12.61 Mean :37.17 Mean :2015-10-14 11:42:31
## 3rd Qu.:14.00 3rd Qu.:47.00 3rd Qu.:2017-12-13 15:29:45
## Max. :28.00 Max. :86.00 Max. :2019-10-30 15:18:00
## Date Time part Speaker
## Min. :2011-04-27 Length:2162 Length:2162 Length:2162
## 1st Qu.:2013-06-19 Class1:hms Class :character Class :character
## Median :2015-09-17 Class2:difftime Mode :character Mode :character
## Mean :2015-10-13 Mode :numeric
## 3rd Qu.:2017-12-13
## Max. :2019-10-30
## Price Volume No_trades sentiment_score
## Min. :10.35 Min. : 1.0 Min. : 1.00 Min. :-1.0000
## 1st Qu.:14.40 1st Qu.: 53.0 1st Qu.: 13.00 1st Qu.:-1.0000
## Median :15.80 Median : 201.0 Median : 40.00 Median :-0.3333
## Mean :16.40 Mean : 336.4 Mean : 65.93 Mean :-0.3002
## 3rd Qu.:18.16 3rd Qu.: 458.8 3rd Qu.: 83.00 3rd Qu.: 0.0000
## Max. :33.66 Max. :10747.0 Max. :1062.00 Max. : 1.0000
First, we transform the variable Price from both datasets to four different period returns: one-minute, two-minute, five-minute, and ten-minute.
This is to determine which return yields the highest prediction accuracy that can best reflect the behavior of the investors.
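A minimal sketch of this transformation for the ten-minute horizon, assuming dplyr; the column and object names are illustrative, and the actual code may differ, e.g., in how session boundaries are handled.

library(dplyr)

# ten-minute simple return and the corresponding gain/loss class
# (simplified sketch: ignores breaks between press conference sessions)
FOMC_sentiment_ES_m <- FOMC_sentiment_ES_m %>%
  arrange(DateTime) %>%
  mutate(ret_10m        = lead(Price, 10) / Price - 1,
         ten.min.return = factor(if_else(ret_10m > 0, "gain", "loss"),
                                 levels = c("gain", "loss")))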
In total, we train 16 models: one for each combination of return horizon (four horizons), index (ES and VIX), and algorithm (logistic regression and random forest).
We split the data into training and testing sets using 70 percent of observation as the training data.
We use the tidymodels package to build the machine learning models.
First, we create model recipes for every period return of both indices.
Next, we set the computational engine and working pipelines. Then, we fit the models.
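As an illustration, here is a minimal sketch of one such pipeline (logistic regression on the ES data with the ten-minute return). The object names and the recipe formula are illustrative and may differ from the exact specification used in our scripts.

library(tidymodels)

# 70/30 train-test split of the ES dataset
es_split <- initial_split(FOMC_sentiment_ES_m, prop = 0.7)
es_train <- training(es_split)
es_test  <- testing(es_split)

# recipe: classify ten-minute gain/loss from sentiment features and conference metadata
es_ten_recipe <- recipe(ten.min.return ~ ps_score + rss_score + sentiword + fear + part + Speaker,
                        data = es_train) %>%
  step_dummy(all_nominal_predictors())

# computational engine
lr_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

# working pipeline (workflow) and fit
es_lr_ten_wf  <- workflow() %>%
  add_recipe(es_ten_recipe) %>%
  add_model(lr_spec)

es_lr_ten_fit <- fit(es_lr_ten_wf, data = es_train)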
Once the trained models are ready, we make out-of-sample predictions. Based on the features from the testing dataset, the predicted values from the trained models are compared directly with the actual classes.
We use two metrics to measure model performance: prediction accuracy and AUC score.
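A hedged sketch of how the accuracy and AUC objects reported below could be computed with yardstick; the object names mirror the printed ones, but the exact code may differ.

# out-of-sample predictions (classes and probabilities) for the ten-minute ES model
es_lr_ten_pred <- predict(es_lr_ten_fit, es_test) %>%
  bind_cols(predict(es_lr_ten_fit, es_test, type = "prob")) %>%
  bind_cols(es_test["ten.min.return"])

# prediction accuracy and AUC; "gain" is assumed to be the event (first factor) level
es.lr.ten.accuracy <- accuracy(es_lr_ten_pred, truth = ten.min.return, estimate = .pred_class)
es.lr.ten.auc      <- roc_auc(es_lr_ten_pred, truth = ten.min.return, .pred_gain)

Analogous objects (e.g., es.lr.one.accuracy, vx.lr.ten.auc) are built for the other horizons and for the VIX data.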
bind_rows(es.lr.one.accuracy,es.lr.one.auc) #ES index one-minute
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.541
## 2 roc_auc binary 0.513
bind_rows(es.lr.two.accuracy,es.lr.two.auc) #ES index two-minute
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.537
## 2 roc_auc binary 0.539
bind_rows(es.lr.five.accuracy,es.lr.five.auc) #ES index five-minute
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.583
## 2 roc_auc binary 0.587
bind_rows(es.lr.ten.accuracy,es.lr.ten.auc) #ES index ten-minute
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.662
## 2 roc_auc binary 0.697
bind_rows(vx.lr.one.accuracy,vx.lr.one.auc) #VX index one-minute
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.553
## 2 roc_auc binary 0.580
bind_rows(vx.lr.two.accuracy,vx.lr.two.auc) #VX index two-minute
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.556
## 2 roc_auc binary 0.563
bind_rows(vx.lr.five.accuracy,vx.lr.five.auc) #VX index five-minute
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.598
## 2 roc_auc binary 0.625
bind_rows(vx.lr.ten.accuracy,vx.lr.ten.auc) #VX index ten-minute
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.667
## 2 roc_auc binary 0.740
Both the prediction accuracy and AUC score indicate that ten-minute return models have the best performance in classifying gain/loss from investments.
However, this finding can be misleading, since the trained models might suffer from overfitting. Although we use the testing dataset to validate the model predictions, both the training and testing data come from the same original dataset, so the variation in the observations of the two sets may be similar.
Furthermore, the sample sizes of both the ES and VIX index are not large. We would need to employ a resampling technique to reevaluate the performance of each model.
To prevent the logistic regression model from overfitting the training dataset, we perform a ten-fold repeated cross-validation.
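A minimal sketch of the resampling step, assuming the rsample and tune packages; the number of repeats is not fixed in the text and is an assumption here.

# ten-fold repeated cross-validation on the ES training data
es_folds <- vfold_cv(es_train, v = 10, repeats = 5)   # repeats = 5 is an assumption

# re-estimate the ten-minute logistic regression workflow over the folds
es_lr_ten_cv <- fit_resamples(es_lr_ten_wf,
                              resamples = es_folds,
                              metrics   = metric_set(accuracy, roc_auc))
collect_metrics(es_lr_ten_cv)

The comparison tables below aggregate such resampled metrics across the four horizon-specific workflows for each index.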
After performing the ten-fold repeated cross-validation, we retrieve the best-performing model for each index.
## # A tibble: 8 x 4
## wflow_id .metric mean preprocessor
## <chr> <chr> <dbl> <chr>
## 1 ten.min_logistic_reg accuracy 0.634 recipe
## 2 ten.min_logistic_reg roc_auc 0.698 recipe
## 3 five.min_logistic_reg accuracy 0.590 recipe
## 4 five.min_logistic_reg roc_auc 0.608 recipe
## 5 two.min_logistic_reg accuracy 0.570 recipe
## 6 two.min_logistic_reg roc_auc 0.591 recipe
## 7 one.min_logistic_reg accuracy 0.555 recipe
## 8 one.min_logistic_reg roc_auc 0.557 recipe
## # A tibble: 8 x 4
## wflow_id .metric mean preprocessor
## <chr> <chr> <dbl> <chr>
## 1 ten.min_logistic_reg accuracy 0.681 recipe
## 2 ten.min_logistic_reg roc_auc 0.747 recipe
## 3 five.min_logistic_reg accuracy 0.596 recipe
## 4 five.min_logistic_reg roc_auc 0.637 recipe
## 5 two.min_logistic_reg accuracy 0.554 recipe
## 6 two.min_logistic_reg roc_auc 0.558 recipe
## 7 one.min_logistic_reg accuracy 0.524 recipe
## 8 one.min_logistic_reg roc_auc 0.524 recipe
For both the ES and VIX indices, the logistic regression with a ten-minute return yields the highest AUC score and prediction accuracy. Thus, we select the ten-minute return model as the representative model among other different time interval returns.
First, we use the recipes for every model that we previously created.
Then, we set up the algorithm’s computational engine and working pipeline to train the models easily.
To train the random forest models, we use an engine from the ranger package.
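A minimal sketch of the random forest specification, assuming the parsnip interface to the ranger engine; hyperparameters are left at their defaults here (tuning is discussed later), and the recipe is reused from the logistic regression sketch above.

# random forest specification with the ranger engine
rf_spec <- rand_forest() %>%
  set_engine("ranger") %>%
  set_mode("classification")

# reuse the ten-minute ES recipe in a new workflow and fit it
es_rf_ten_wf  <- workflow() %>%
  add_recipe(es_ten_recipe) %>%
  add_model(rf_spec)

es_rf_ten_fit <- fit(es_rf_ten_wf, data = es_train)

Fitting such workflows for every horizon and index and predicting on the test sets, we obtain the following prediction accuracies: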
bind_rows(es.rf.one.accuracy, es.rf.one.auc) #ES index one-minute
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.540
## 2 roc_auc binary 0.525
bind_rows(es.rf.two.accuracy, es.rf.two.auc) #ES index two-minute
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.538
## 2 roc_auc binary 0.550
bind_rows(es.rf.five.accuracy, es.rf.five.auc) #ES index five-minute
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.598
## 2 roc_auc binary 0.636
bind_rows(es.rf.ten.accuracy, es.rf.ten.auc) #ES index ten-minute
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.652
## 2 roc_auc binary 0.726
bind_rows(vx.rf.one.accuracy, vx.rf.one.auc) #VX index one-minute
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.558
## 2 roc_auc binary 0.564
bind_rows(vx.rf.two.accuracy, vx.rf.two.auc) #VX index two-minute
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.567
## 2 roc_auc binary 0.572
bind_rows(vx.rf.five.accuracy, vx.rf.five.auc) #VX index five-minute
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.619
## 2 roc_auc binary 0.659
bind_rows(vx.rf.ten.accuracy, vx.rf.ten.auc) #VX index ten-minute
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.700
## 2 roc_auc binary 0.766
From the prediction accuracy and the AUC scores, we can see that the ten-minute return models have the best prediction performance among other models.
This finding is consistent with the case of logistic regression models.
The random forest model has hyperparameter values that must be tuned to obtain the best fit. Instead of using the default values, we search for the values that best fit our data. We use a random grid to select combinations of three hyperparameters of the random forest model: mtry, trees, and min_n. We find that the best hyperparameter values are mtry = 9, trees = 1945, min_n = 14 for the ES index model and mtry = 3, trees = 555, min_n = 39 for the VIX index model.
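A hedged sketch of the random-grid tuning step for the ES model; the grid size and parameter ranges are illustrative, while the reported best values come from our actual runs.

# random forest with tunable hyperparameters
rf_tune_spec <- rand_forest(mtry = tune(), trees = tune(), min_n = tune()) %>%
  set_engine("ranger") %>%
  set_mode("classification")

rf_tune_wf <- workflow() %>%
  add_recipe(es_ten_recipe) %>%
  add_model(rf_tune_spec)

# random grid over the three hyperparameters (grid size is an assumption)
rf_grid <- grid_random(mtry(range = c(1, 15)), trees(), min_n(), size = 20)

# evaluate the grid on the resampling folds from the cross-validation sketch
rf_tuned <- tune_grid(rf_tune_wf, resamples = es_folds, grid = rf_grid)
select_best(rf_tuned, metric = "roc_auc")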
Above, we trained sixteen models in total to select the best ones for each modeling algorithm and trading dataset, measured in terms of their highest prediction accuracy.
In all cases, ten-minute return models were consistently the best-performing models.
We investigate further how each feature fares in terms of its importance in predicting gain/loss from investments, using permutation feature importance (PFI).
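A hedged sketch of the PFI computation, assuming a DALEX explainer built around a ranger probability forest (the explainer printed later in the fairness section has the same shape); the feature list and object names are illustrative.

library(DALEX)
library(ranger)

# probability forest predicting the ten-minute return class (illustrative feature set)
rf_fit <- ranger(ten.min.return ~ ps_score + rss_score + fear + sentiword + part + Speaker,
                 data = es_train, probability = TRUE)

rf_features  <- c("ps_score", "rss_score", "fear", "sentiword", "part", "Speaker")
rf_explainer <- explain(rf_fit,
                        data  = es_train[, rf_features],
                        y     = as.numeric(es_train$ten.min.return == "gain"),
                        predict_function = function(m, d) predict(m, d)$predictions[, "gain"],
                        label = "ranger")

# permutation feature importance: drop in (1 - AUC) after permuting each feature
rf_pfi <- model_parts(rf_explainer, loss_function = loss_one_minus_auc)
plot(rf_pfi)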
It can be observed that among the top ten most important features, the rankings across the models are not drastically different. All four models seem to capture similar patterns among target variables and various features.
From the plots, it can be observed that the variable part has the highest impact in predicting gain/loss from investments, implying that investors react differently during the two sections of the FOMC press conference: the prepared remarks and the Q&A.
Speaker seems to have a high predictive impact in logistic models, but is less important in the random forest model.
ps_score is consistently ranked among the most important features in terms of predictive power for all models.
We compute PD profiles for each model using the ten most important features as shown in the previous section.
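A minimal sketch of the PD computation, reusing the explainer from the PFI sketch above; the variable list is illustrative.

# partial dependence profiles for selected features
rf_pdp <- model_profile(rf_explainer,
                        variables = c("ps_score", "rss_score", "fear"))
plot(rf_pdp)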
The logistic model does not capture the difference in market response to different speakers well.
ES index market seems to react more to Bernanke while VIX index market reacts more to Yellen.
The Partial Dependence profiles reveal that the logistic model captures only linear relationships, whereas the random forest model can capture more non-linear relationships.
Both models highlight the different market reactions during prepared remarks and Q&A, but only random forest models can capture different market reactions to different FOMC speakers clearly.
The PD profile of ps_score confirms the hypothesis that investors react rationally to the policy direction of the Fed. The positive relationship indicates that investors always look for an opportunity to gain, no matter how the Fed positions its monetary policy.
The PD profile of fear also highlights the possibility of investors reacting purely based on emotion, but the degree of importance of ps_score is much higher than that of fear.
The PD profile of rss_score does not reveal a clear pattern in its relationship with the target variable.
Contrastive PD profiles reveal that although the random forest model manages to capture a more complex relationship between target variable and features, it falls short in terms of the range of prediction compared to the logistic regression model.
Logistic regression models have a wider range of predictions: for some features the predictions span roughly 0.2 to 0.6, although on average they stay around 0.4 to 0.6. The predictions computed by the random forest models are limited to around 0.4, consistently across all features. The logistic regression model might therefore be more versatile when deployed in production, whereas the random forest model might suffer some drawbacks and need to be retrained more frequently.
Once we have the results from the model prediction and interpretability test, we conduct a fairness test. We choose a ten-minute return as our target variable because it gives the best prediction model.
This project aims to test whether emotion is a good predictor for trading behavior. Therefore, we will test policy stance, RSS score, and fear. We are also aware that gender and race may affect the tone and language when people speak; thus, we will also include the speaker in the tests.
We use the fairmodels package to run the fairness tests (see Wiśniewski & Biecek, 2021).
Please note the naming: ESMS corresponds to what was previously called FOMC_sentiment_ES_m.
After loading the ESMS dataset, to run the fairmodels package we need to convert the variables we will test into factor variables and remove all unused variables.
## X positive negative anger anticipation sadness surprise fear trust joy
## 1 1 1 1 0 1 0 0 0 3 1
## 2 2 2 0 0 2 0 0 4 12 0
## 3 3 0 2 0 3 0 2 3 4 0
## 4 4 0 5 2 2 0 2 5 3 1
## 5 5 1 0 0 2 2 0 7 5 1
## 6 6 1 5 1 7 1 2 1 10 2
## disgust sentiword ps_score rss_score total_hawkish total_dovish
## 1 0 2.6770833 -0.05 -0.0007089685 21 19
## 2 1 -0.5312500 -0.05 -0.0007089685 21 19
## 3 0 -0.6250000 -0.05 -0.0007089685 21 19
## 4 1 -2.0625000 -0.05 -0.0007089685 21 19
## 5 0 -0.2180556 -0.05 -0.0007089685 21 19
## 6 0 1.7916667 -0.05 -0.0007089685 21 19
## approach_score avoid_score DateTime Date Time
## 1 17 23 2011-04-27 14:31:00 2011-04-27 14:31:00
## 2 17 23 2011-04-27 14:32:00 2011-04-27 14:32:00
## 3 17 23 2011-04-27 14:33:00 2011-04-27 14:33:00
## 4 17 23 2011-04-27 14:34:00 2011-04-27 14:34:00
## 5 17 23 2011-04-27 14:35:00 2011-04-27 14:35:00
## 6 17 23 2011-04-27 14:36:00 2011-04-27 14:36:00
## part Speaker Volume No_trades sentiment_score one.min.return
## 1 Prepared remarks Bernanke 5609 1054 0.0000000 gain
## 2 Prepared remarks Bernanke 1369 284 1.0000000 gain
## 3 Prepared remarks Bernanke 1162 210 -1.0000000 gain
## 4 Prepared remarks Bernanke 1633 279 -1.0000000 gain
## 5 Prepared remarks Bernanke 1404 265 1.0000000 loss
## 6 Prepared remarks Bernanke 2260 403 -0.6666667 gain
## two.min.return five.min.return ten.min.return
## 1 gain gain gain
## 2 gain gain gain
## 3 gain gain gain
## 4 gain gain gain
## 5 loss gain gain
## 6 gain gain gain
Our policy stance and RSS scores are still numeric, so we create new factor variables from them. For the policy stance, we bin the original score into “hawkish”, “neutral”, and “dovish” levels, save the result as a factor, and label the levels accordingly.
The same procedure is applied to the RSS score: we create a new variable rss by binning the original score into anxiety (“avoid”), “neutral”, and excitement (“approach”) levels, saving it as a factor and labeling the levels accordingly.
We also factorize our target variable, ten.min.return. We do not flip the score because the model is created to predict the positive outcome, that is, the gain return.
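A hedged sketch of this preprocessing; the cut points and the ordering of the labels are assumptions, not the exact thresholds used in our scripts.

# bin the numeric policy stance score into a three-level factor (thresholds and label
# order are assumptions that depend on the sign convention of ps_score)
esms$ps <- cut(esms$ps_score,
               breaks = c(-Inf, -0.1, 0.1, Inf),
               labels = c("dovish", "neutral", "hawkish"))

# same idea for the RSS score: anxiety ("avoid"), "neutral", excitement ("approach")
esms$rss <- cut(esms$rss_score, breaks = 3,
                labels = c("avoid", "neutral", "approach"))

# factorize the target; "gain" is the positive outcome the models predict
esms$ten.min.return <- factor(esms$ten.min.return)

A ranger model is then fitted on these data and wrapped in a DALEX explainer, which produces the output below.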
## Preparation of a new explainer is initiated
## -> model label : ranger ( default )
## -> data : 2099 rows 29 cols
## -> target variable : 2099 values
## -> predict function : yhat.ranger will be used ( default )
## -> predicted values : No value for predict function target column. ( default )
## -> model_info : package ranger , ver. 0.12.1 , task classification ( default )
## -> predicted values : numerical, min = 0.02421984 , mean = 0.6054935 , max = 0.994481
## -> residual function : difference between y and yhat ( default )
## -> residuals : numerical, min = -0.6471508 , mean = 0.0009857974 , max = 0.5797627
## A new explainer has been created!
We chose policy stance as the protected group variable; it may contain sensitive information and needs to be tested for fairness. Policy stance consists of “hawkish” and “dovish” and is classified by the algorithm using a ‘search and count’ function, which leaves some room for potential bias depending on the language and wording style of the communicator.
Further, we select “dovish” as the privileged subgroup because we found that the dovish score dominates as a predictor in our models. This may imply that dovish texts have been treated differently by the algorithm.
fobject_ps <- fairness_check(rf_explainer, # explainer
protected = esms$ps, # ps as protected variable as factor
privileged = "dovish", # level in protected variable, potentially more privileged
cutoff = 0.5, # cutoff - optional, default = 0.5
colorize = FALSE)
## Creating fairness classification object
## -> Privileged subgroup : character ( Ok )
## -> Protected variable : factor ( Ok )
## -> Cutoff values for explainers : 0.5 ( for all subgroups )
## -> Fairness objects : 0 objects
## -> Checking explainers : 1 in total ( compatible )
## -> Metric calculation : 12/12 metrics calculated for all models
## Fairness object created succesfully
We use the “ranger” model to check whether this variable introduces bias:
##
## Fairness check for models: ranger
##
## ranger passes 4/5 metrics
## Total loss: 1.915318
The one failed metric is the Predictive Equality Ratio; overall, this result is good enough.
None of our ranger models reach the red fields on the left, which means there is no bias against the unprivileged groups (in our case, “neutral” and “hawkish”). However, in one metric, the Predictive Equality Ratio (PER), “hawkish” enters the red field on the right quite far. This indicates a bias relative to the privileged subgroup (i.e., “dovish”), driven mainly by hawkish, with a ratio of more than 2. This is an interesting result, since the overall policy stance of the texts in our datasets is actually dovish. Why does this bias occur?
From the density plot, we can see that the model is more likely to categorize “hawkish” as contributing to the ten-minute return than “neutral” and “dovish”, respectively.
We could also see the raw (unscaled) metrics using the metric_score object.
Because it is infrequent for a model to pass all metrics, it is worth elaborating on how the fairness test plays out across different models and explainers.
Here we present one object with all explainers.
From the plot, we can see that our models perform relatively well, satisfying the fairness test on three metrics: the Accuracy equality ratio, the Equal opportunity ratio, and the Predictive parity ratio. However, for the Predictive equality ratio, the gbm and ranger models are strongly biased for the neutral subgroup and, for the hawkish subgroup, toward the privileged group. Also in this metric, in the ranger_2 model, hawkish is strongly biased relative to dovish, with a ratio of about 2.2.
Among all four models, only ranger_3 has no values reaching the red fields on either the left or the right, which means this model treats the subgroups equally. Note that ranger_3 is the model built with fear and Speaker as predictors.
Let us see the fairness metrics for all models.
##
## Fairness check for models: ranger, ranger_2, ranger_3, gbm
##
## ranger passes 5/5 metrics
## Total loss: 0.3213423
##
## ranger_2 passes 4/5 metrics
## Total loss: 1.984299
##
## ranger_3 passes 5/5 metrics
## Total loss: 0.3213423
##
## gbm passes 3/5 metrics
## Total loss: 1.857252
The metric scores support our previous inference. The ranger and ranger_3 models pass all 5 metrics, ranger_2 passes 4 out of 5, and gbm passes 3 out of 5. In terms of fairness metrics, the best models are thus ranger and ranger_3.
The contents of the fairness object can be examined by evaluating the parity loss of each metric.
fobject_ps_mix$parity_loss_metric_data
## TPR TNR PPV NPV FNR FPR FDR
## 1 0.000000000 NaN 0.15081155 NA NaN 0.000000 0.2595238
## 2 0.007362293 0.0766040 0.02785503 0.003873998 0.6545323 1.159271 0.7779351
## 3 0.000000000 NaN 0.15081155 NA NaN 0.000000 0.2595238
## 4 0.154319680 0.4462236 0.15478199 0.169365348 1.0705113 1.241188 0.8722794
## FOR TS STP ACC F1
## 1 NA 0.15081155 0.0000000 0.15081155 0.09252516
## 2 0.1896209 0.02139447 0.1669535 0.01826086 0.01095359
## 3 NA 0.15081155 0.0000000 0.15081155 0.09252516
## 4 0.5733673 0.26411634 0.3229578 0.16168642 0.15270457
The closer a value is to 0, the more equally the model treats all subgroups. The majority of the metrics are dominated by values at or near 0, which means our models are likely to treat the subgroups equally.
Now we want to see how the models perform based on different metrics.
From the plot, we see that ranger and ranger_3 have the smallest metric scores, meaning that they are the best models; their parity losses appear only in ACC and PPV. This again confirms our previous metric-score analysis.
We also compute the fairness PCA and plot it.
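A minimal sketch of this step, assuming the fairness_pca() helper from fairmodels.

# principal component analysis of the parity loss metrics across models
fpca <- fairness_pca(fobject_ps_mix)
fpca
plot(fpca)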
## Fairness PCA :
## PC1 PC2 PC3 PC4
## [1,] 0.1201169 1.678501 -1.318390e-16 1.076342e-16
## [2,] -2.6322166 -1.517985 1.110223e-16 -8.500145e-17
## [3,] 0.1201169 1.678501 -1.318390e-16 1.076342e-16
## [4,] 2.3919828 -1.839018 2.220446e-16 3.556183e-17
##
## Created with:
## [1] "ranger" "ranger_2" "ranger_3" "gbm"
##
## First two components explained 100 % of variance.
To depict the fairness of our grouped data more clearly, we use a heatmap.
We now want to see the metric within groups and decide, which model to use of the two best models we already have, that is ranger and ranger_3.
Regarding the fairness metric (the FPR scale), ranger shows a better fairness score than ranger_3. However, in terms of the performance metric, ranger_3 shows a higher score, meaning higher accuracy. Both models show similar patterns in both parity loss metrics. This can also be seen in the radar plot below:
We can also test how cut-off points affect the parity loss:
We can also see how parity loss metrics change if we modify the cutoff only for one subgroup (here, “dovish”).
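A hedged sketch of these two cutoff analyses, assuming the all_cutoffs() and ceteris_paribus_cutoff() helpers from fairmodels.

# parity loss metrics as a function of a common cutoff applied to all subgroups
plot(all_cutoffs(fobject_ps))

# parity loss metrics when only the cutoff for the "dovish" subgroup is varied
plot(ceteris_paribus_cutoff(fobject_ps, subgroup = "dovish"))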
From the above results, we conclude that the variable policy stance is treated sufficiently fairly in our models, and there is no need to run bias mitigation. Not every model passes all metrics, however: our first ranger model passes only 4 out of 5. This is not an excellent result, but not a bad one either, so we decide not to proceed with bias mitigation. We suspect that the most probable cause of the bias is algorithmic bias, which could still be treated using various bias mitigation techniques.
Please note the naming: ESVX corresponds to what was previously FOMC_sentiment_VX_m.
We now move to the fairness test for the ESVX data. The whole procedure is similar to the previous one for policy stance. We load the data and, to preprocess it for the fairness test, remove all variables that are not required and/or not factorized.
Next, we add the policy stance score as a factor with levels. This time we do not convert the rss_score, because it consists of only one level.
We also factorize our target variable, ten.min.return. We do not flip the score because the model already predicts the positive outcome, that is, the gain return. We also factorize “part” to be able to compute model comparisons later.
## Preparation of a new explainer is initiated
## -> model label : ranger ( default )
## -> data : 2162 rows 28 cols
## -> target variable : 2162 values
## -> predict function : yhat.ranger will be used ( default )
## -> predicted values : No value for predict function target column. ( default )
## -> model_info : package ranger , ver. 0.12.1 , task classification ( default )
## -> predicted values : numerical, min = 0.02676032 , mean = 0.5460687 , max = 0.9973651
## -> residual function : difference between y and yhat ( default )
## -> residuals : numerical, min = -0.6608349 , mean = -0.0002777185 , max = 0.6426
## A new explainer has been created!
We chose Speaker as the protected group, i.e., the variable that contains sensitive data and needs to be tested for fairness. The reason is apparent: not only does it contain gender variation, the speaker is also a perfect medium for analyzing possible bias in text-based emotion analysis, as it captures language style, wording, tone, and, to some extent, the culture behind the communicator.
Further, we select Bernanke as the suggested privileged subgroup because we found different patterns between the logistic regression and random forest models for Bernanke, but not so much for Powell or Yellen. This may imply that Bernanke’s wording has been treated somewhat differently.
## Creating fairness classification object
## -> Privileged subgroup : character ( Ok )
## -> Protected variable : factor ( Ok )
## -> Cutoff values for explainers : 0.5 ( for all subgroups )
## -> Fairness objects : 0 objects
## -> Checking explainers : 1 in total ( compatible )
## -> Metric calculation : 12/12 metrics calculated for all models
## Fairness object created succesfully
The ranger model result for fairness checks:
##
## Fairness check for models: ranger
##
## ranger passes 4/5 metrics
## Total loss: 1.07354
Let’s see which part of the metric passes and which fails.
Our ranger model passes 4 metrics. The failed metric is the Predictive Equality Ratio, similar to our previous fairness test for the policy stance variable. According to this metric, the model is biased against Yellen as an unprivileged subgroup, with a ratio just slightly beyond the threshold of 0.8.
Why does this bias occur?
From the density plot above, we can see that the ranger model is more likely to categorize Bernanke and Powell, respectively, as contributing to the ten-minute return compared to Yellen. However, the difference between Bernanke and Powell is very small.
We can also see the raw (unscaled) metrics using the metric_score object:
Again, we elaborate on how the fairness test results differ across models and explainers to better understand the bias tendency in our models.
According to this plot, only in the ranger model is there a strong bias in favor of Bernanke (the privileged subgroup) relative to Yellen (an unprivileged subgroup). This can be seen in three metrics: the Equal opportunity ratio, the Predictive equality ratio (where it is very strong, with a ratio of more than 16), and the Statistical parity ratio.
Note that the ranger model here predicts the ten-minute return from policy stance and part. The strong bias in this model may reflect a speaker effect, or a “Yellen effect” in particular, that is strongly sensitive to policy stance (i.e., the use of hawkish and dovish terms) and part (i.e., prepared remarks versus Q&A). There are two possible reasons for the differences: the speakers’ monetary policy styles are saliently different, and the direct communication between the speaker and the journalists during the Q&A may reveal a wider variation between Yellen and Bernanke (and Powell) in the emotional tone of their communication styles.
Now, we want to see what the plot_density object can tell us regarding the likelihood of each model.
As we can see, ranger and ranger_3 give more or less similar results, in which the probability of each of the three speakers contributing to the ten-minute return is roughly equal, around 0.5; this result basically tells us nothing. The gbm and ranger_2 models show a significantly different pattern from the previous two, although they are quite similar to each other. In both gbm and ranger_2, Yellen is the least likely to contribute to the return and Bernanke receives the highest probability; ranger_2, however, shows larger differences between the speakers.
Let us see what the metrics say.
From this plot we can infer that most models suffer from bias, because they show relatively large parity losses in many metrics. Only in FPR and STP does the ranger model have a score close to 0.
But, let us see the fairness metrics for each model.
##
## Fairness check for models: ranger, ranger_2, ranger_3, gbm
##
## ranger passes 1/5 metrics
## Total loss: 18.69909
##
## ranger_2 passes 4/5 metrics
## Total loss: 1.385423
##
## ranger_3 passes 3/5 metrics
## Total loss: 1.217589
##
## gbm passes 3/5 metrics
## Total loss: 1.63332
The ranger model passes only 1 metric, ranger_3 and gbm pass 3 metrics each, while ranger_2 passes 4. This suggests that ranger_2 and gbm are our best models here, that is, they treat the variables most equally.
Note that in the ranger_2 model we predict the ten-minute return from all variables in the dataset using the random forest model.
Now we want to see how our models perform based on different metrics.
From the plot, we see that ranger_2 and gbm have the smallest metric scores, meaning that they are the best models. They both have the biggest parity loss in STP. This plot also confirms our previous metrics scores analysis.
For a better visual, we also try to compute the fairness PCA and then plot it.
## Fairness PCA :
## PC1 PC2 PC3 PC4
## [1,] -3.9216739 -0.5298641 0.03516476 5.551115e-17
## [2,] 2.2923151 -1.5312095 -0.37072023 2.775558e-17
## [3,] 0.5006532 1.4239014 -0.96961498 4.163336e-17
## [4,] 1.1287056 0.6371722 1.30517045 1.665335e-16
##
## Created with:
## [1] "ranger" "ranger_2" "ranger_3" "gbm"
##
## First two components explained 91 % of variance.
The heatmap depicts the fairness of our group data more clearly.
We now want to see the metric within groups and decide which model to use out of the two best models we already have, that is ranger_2 and gbm.
The two models show very different patterns in the FPR fairness metric. The gbm shows higher scores for all subgroups and is more dispersed than ranger_2, meaning that ranger_2 performs better in terms of the fairness metric (at least for FPR). In terms of the performance metric, ranger_2 also has higher accuracy than gbm.
Overall, we can say that the result of the fairness test for the Speaker variable is worse than for policy stance. However, two of our models still pass four out of five metrics, so we can say that our variables are treated sufficiently fairly. As for the types of bias, we suggest that for Speaker the bias can be classified as a “historical bias”, following Mehrabi et al. (2019).
We have no control over how the data were extracted from the source material or over the quality of that extraction. In fact, we found one session that was extracted twice (Bernanke, January 24, 2012; lines 15156-21916 and 21917-31608). The other problem is that one session is missing, namely the transcript of Yellen (September 17, 2014).
Currently, the pursuit of fairness in emotion and sentiment algorithms is a new agenda in sentiment analysis (see, for example, Mohammad, 2021). There are no conclusive findings at the moment, but factors such as gender and race, which can be traced back to culture and language, evidently also have the potential to affect fairness. Our findings support this argument, particularly for the Speaker variable.
The RSS score variable is also crucial in our model. Unfortunately, we have no solution for testing fairness on numeric variables, nor for deriving our own factor levels from the score (within norms) rather than from predetermined levels. The fairmodels package that we use does not have an option to test variables that are not categorical.
Apart from the emotion detection algorithm, we also found that the speaker and the part of the conference (the prepared remarks or the Q&A session) matter. This makes sense, as non-verbal communication such as tone and emotion is conveyed during the Q&A.
Another essential potential impact takes us back to the primary goal of this project: a central bank can use emotional tone to sharpen its message. We learn from our results that Powell uses a more ‘emotional’ tone than Bernanke and Yellen, which affects the number of trades.
On the other hand, we should also consider that emotions alone may only give a short-term outcome and may suffer from inconsistency in the long term. A typical solution to this problem is commitment to and consistency of the implemented policies.