1

We have chosen to work with five firm characteristics (beta, turn, mom1m, mom12m and mom36m) and four macro indicators (dp, ep, bm and ntis). Beta is the market beta, which indicates how the individual stock moves relative to the market. Turn is the turnover of a stock and is a measure of liquidity. Mom1m is the one-month momentum, which is included to capture short-term reversal. Mom12m (12-month momentum) is included as the general momentum measure and mom36m as a long-term reversal indicator.

Including three momentum variables in the model might be a stretch, but the variable importance results in Gu, Kelly and Xiu (2019) show them to be quite significant predictors in asset pricing.

Summary table of variables
Variable               Mean    Sd.    Min     25th Pctl.  75th Pctl.  Max
Excess return          0.01    0.18   -1.00   -0.06       0.07        19.88
Market beta            0.08    0.55   -0.99   -0.36       0.55        0.99
Turnover               0.04    0.58   -1.00   -0.45       0.54        0.99
Momentum (1m)          0.00    0.60   -0.99   -0.55       0.55        0.99
Momentum (12m)         -0.01   0.59   -0.99   -0.53       0.52        0.99
Momentum (36m)         0.00    0.56   -0.99   -0.47       0.47        0.99
Dividend-Price         -3.92   0.14   -4.13   -4.01       -3.88       -3.28
Earnings-Price         -3.09   0.41   -4.84   -3.16       -2.87       -2.57
Book-to-Market         0.30    0.05   0.22    0.26        0.34        0.44
Net Equity Expansion   -0.01   0.02   -0.06   -0.02       0.01        0.03
Note: We note the oddly similar maximum values for beta, turnover and the three momentum variables. We have checked that this is in fact not an error on our part; it most likely reflects that these firm characteristics are cross-sectionally rank-transformed to the [-1, 1] interval, as in Gu, Kelly and Xiu (2019).

Bm is the book-to-market ratio, which measures the book value of equity (from the balance sheet) relative to the market value of equity (on the stock market). Ntis is the Net Equity Expansion, the ratio of the 12-month moving sum of net equity issues on the NYSE to the total end-of-year market capitalization of NYSE stocks. Ep is the earnings-to-price ratio, computed as the difference between the log of earnings and the log of prices. Dp is the dividend-price ratio, computed as the difference between the log of dividends and the log of prices.

We create a recipe from tidymodels to perform two further preprocessing steps: 1) to add interaction terms between the macro indicators and the company characteristics, and 2) to encode the industry classification as dummy variables, since this format is more friendly to machine learning. A minimal sketch of such a recipe is shown below.
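The sketch below illustrates the two recipe steps under the assumption that the prepared dataset is called data, with excess returns in ret_excess, macro indicators and firm characteristics identified by "macro" and "characteristic" prefixes, and the industry code in sic2; these names are illustrative, not our exact column names.

```r
library(tidymodels)

# Illustrative sketch of the preprocessing recipe described above
rec <- recipe(ret_excess ~ ., data = data) |>
  # 1) interactions between every macro indicator and every firm characteristic
  step_interact(terms = ~ contains("macro"):contains("characteristic")) |>
  # 2) industry code encoded as dummy variables, which is friendlier to ML models
  step_dummy(sic2, one_hot = TRUE)
```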

After specifying the recipe-steps that should transform the data into something more machine learning-friendly, we can estimate a model.

2

As described in the paper, one limitation of the modelling approach is that the function for the conditional expected return \(g^*(·)\) depends on neither i nor t. This implies that the function retains the same form over time and relies on the entire panel of stocks. While this brings stability to the model, it comes at the cost that the ability to estimate risk premiums for individual stocks is significantly reduced. Furthermore, the function depends on the vector of predictors, z, only through \(z_{i,t}\). This implies that the prediction uses information strictly from the i'th stock at time t, so historical observations are not accounted for.
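For reference, the additive prediction-error model from Gu, Kelly and Xiu that this discussion refers to can be written as

\[
r_{i,t+1} = E_t\left[r_{i,t+1}\right] + \epsilon_{i,t+1}, \qquad E_t\left[r_{i,t+1}\right] = g^*(z_{i,t}),
\]

where \(g^*(\cdot)\) is the same function for all stocks \(i\) and all dates \(t\) and depends only on the predictor vector \(z_{i,t}\).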

As for the estimation approaches, there are limitations associated with all of those employed by the authors.

Firstly, a limitation of the penalized regressions stems from the fact that shrinkage and variable selection force the coefficients on most regressors close to, or exactly to, zero when managing high dimensionality, which can produce suboptimal forecasts when predictors are highly correlated. On page 2235, an example is provided: “A simple example (…) is a case in which all of the predictors are equal to the forecast target plus an iid noise term. In this situation, choosing a subset of predictors via lasso penalty is inferior to taking a simple average of the predictors and using this as the sole predictor in a univariate regression”.

Secondly, when using a random forest model for estimation, one advantage of the framework is that it makes the model very flexible. As summarized by the authors on pages 2240-2241: “Advantages (…) are that it is invariant to monotonic transformations of predictors, that it naturally accommodates categorical and numerical data in the same model, that it can approximate potentially severe nonlinearities, and that a tree of depth L can capture (L−1)-way interactions”. However, this flexibility also implies the limitation that tree models usually suffer from overfitting, which requires thorough regularization of the model.

As for the neural networks, their complexity makes them non-transparent, non-interpretable and highly parameterized, which makes them difficult to use. Furthermore, the structure of neural networks makes cross-validation a difficult task during model selection, which in this case leads the authors to fix a selection of network architectures ex ante for estimation, i.e., they take a guess.

As described in the paper, the model deviates from standard asset pricing approaches because the function \(g^*(·)\) maintains the same form over time and across stocks, instead of reestimating a cross-sectional model in each period or estimating a model for each stock independently. Therefore, two alternative modelling approaches could be ones that do just that: A) in each period, we reestimate \(g^*(·)\) as a cross-sectional model, and B) we estimate \(g^*(·)\) as a time-series model for each stock individually.

A third alternative is to utilize simple variations of OLS regressions. Making use of the APT, one could map the linear factor model for excess returns using betas on the various factors that affect, or are believed to affect, returns. This would be done with an OLS regression. The benefit of OLS compared to ML approaches is that OLS would produce an unbiased estimator of excess returns, while ML models produce estimates that are, to some extent, biased. Z in the APT model would contain all included predictors of excess returns, i.e. the regressors, and the functional form would be linear.
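In the notation of the conditional-expectation function above, this alternative amounts to the simple linear specification

\[
g(z_{i,t}; \theta) = z_{i,t}'\theta,
\]

which can be estimated by OLS, with \(z_{i,t}\) containing the factor exposures or other predictors included under the APT.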

3

The objective function is to minimize the root mean squared prediction error (RMSE) of the ML model, that is, to estimate a model that predicts excess returns as well as possible. Hyperparameter tuning is a step-wise process in which the engine tries out several different combinations of parameter values. For an Elastic Net model, this means different values of the penalty and of alpha (the mixture between Ridge and Lasso). The hyperparameters are thus tuned so that the RMSE is minimized.
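Concretely, for predictions \(\hat{r}_{i,t+1} = g(z_{i,t};\theta)\) over \(N\) stock-month observations, the quantity minimized during tuning is

\[
\text{RMSE} = \sqrt{\frac{1}{N}\sum_{(i,t)}\left(r_{i,t+1} - \hat{r}_{i,t+1}\right)^2}.
\]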

When the goal of a specific model is prediction, the means to get there is to minimize a prediction-error metric such as RMSE (to get the best average prediction), whereas if the goal is to uncover the most consistent model, you would want to use all of your data to get the best fit. Thus, to minimize RMSE you need both training and testing data to verify that your model actually works when it encounters new data - otherwise the RMSE could blow up once the model is used in practice.

Selecting hyperparameters from a validation dataset can be problematic for a number of reasons. The validation dataset can be too small - and if this is the case, the worry is that it may poorly reflect the rest of the data, for instance if some specific events dominate the validation set. Carving out a validation dataset also requires you to shrink your training set, which reduces the data available for estimation.

As we have already split the dataset into a training (80\(\%\)) and a test set (20\(\%\)), we will now further split the training set into a training and a validation set. The new training set will thus include 53\(\%\) of the original data and the validation set 27\(\%\); a sketch of these splits is shown below. The reasoning behind the additional split of the training set is that we want a reliable model: if the split is not performed, the results will be biased and we might end up with a false impression of the model accuracy. The training set is fed into the learning phase to uncover patterns in the data. The validation set is used to validate the model performance during the training phase and provides information helpful for tuning the models' hyperparameters. The test set is completely separated from the training phase and will only be used to test the model after training is complete, to provide an unbiased performance metric.
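A minimal sketch of the splits, assuming the prepared dataset is called data; the nested split of the training data reproduces the 53 %/27 % proportions approximately, and the seed is illustrative.

```r
library(rsample)

set.seed(1234)                                            # illustrative seed
split     <- initial_split(data, prop = 0.80)             # 80 % training, 20 % test
train_val <- initial_split(training(split), prop = 2 / 3) # 2/3 of 80 % of the data

training_set   <- training(train_val)  # ~53 % of the original data
validation_set <- testing(train_val)   # ~27 % of the original data
test_set       <- testing(split)       # 20 % of the original data
```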

4

We start by specifying the models that we plan to test. This includes an Elastic Net (EN) model, for which we tune the penalty term and the mixture (between Ridge and Lasso estimation), and a Neural Network (NN), for which we tune the number of epochs. Due to constraints on computing power, we have chosen not to tune any further hyperparameters of the NN, since even this simple tuning procedure took more than 8 hours. A sketch of the two specifications is shown below.
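The sketch below shows the two model specifications with parsnip; fixing the number of hidden units at 15 is an assumption taken from the tuning results reported further below.

```r
library(tidymodels)

# Elastic Net: tune the penalty and the Ridge/Lasso mixture
elnet_spec <- linear_reg(penalty = tune(), mixture = tune()) |>
  set_engine("glmnet")

# Neural network: tune only the number of epochs; 15 hidden units assumed fixed
nn_spec <- mlp(epochs = tune(), hidden_units = 15) |>
  set_engine("keras") |>
  set_mode("regression")
```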

In an elastic net, we specify a model where we, through tuning of hyperparameters, decide on the mixture between Lasso and Ridge regression. In Lasso regression, the coefficients on a subset of covariates are set to exactly zero, which means that the Lasso can be thought of as a variable selection method: we do not want to include all variables with possible explanatory power in the model, only variables that have a large enough effect. The Ridge penalty instead draws all coefficient estimates closer to zero and can therefore be seen as a way of reducing the variance of the estimates. An elastic net is a combination of these two models.
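In terms of the objective function (up to the exact scaling used by glmnet), the elastic net estimates

\[
\hat{\beta} = \arg\min_{\beta} \sum_{i} \left(y_i - x_i'\beta\right)^2 + \lambda\left[(1-\alpha)\,\tfrac{1}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1\right],
\]

where \(\lambda\) is the penalty and \(\alpha\) the mixture: \(\alpha = 1\) gives the Lasso and \(\alpha = 0\) the Ridge regression.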

Neural networks take information from an input layer, through one or multiple hidden layers, to an output layer. The output layer predicts future data. In a neural network, weights are assigned to variables to determine their importance for the output. If an output exceeds a certain threshold, it activates a node that passes data through to the next layer in the network. The hyperparameters to be tuned are the number of layers, the number of hidden units per layer and the number of epochs, i.e. the complexity of the network. The network aims to learn patterns in the data and enables predictive modelling of future data.

Next, we set up a cross-validation technique - in our case K-fold cross-validation. The idea behind the technique is to evaluate performance on 10 different subsets of the training data and calculate the average prediction error. Each resample trains on three years of data and is assessed on the following two years; a sketch is shown below. This is a general step that we will utilize for both models.
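A hedged sketch of such time-based resamples, assuming the training data contain a date column called month; the exact resampling function and arguments used in our analysis may differ.

```r
library(rsample)

# Each resample trains on three consecutive years and assesses on the next two
data_folds <- sliding_period(
  training_set,
  index       = month,   # assumed date column
  period      = "year",
  lookback    = 2,       # current year plus two years back = three years of training
  assess_stop = 2        # assessed on the following two years
)
```

A plain 10-fold alternative that ignores the time dimension would be vfold_cv(training_set, v = 10).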

The next step is to create the models. For the EN model, this includes using cross-validation as a resampling technique and a grid of 30 hyperparameter combinations to tune - 10 different penalty values and three \(\alpha\) mixtures. The NN is only tuned over two values of the number of epochs, namely 10 and 1000 (we would have included more hyperparameters if computing power had allowed it).
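A sketch of the tuning grid and the tuning call for the Elastic Net; rec, elnet_spec and data_folds refer to the sketches above, and the workflow name is illustrative.

```r
library(tidymodels)

elnet_workflow <- workflow() |>
  add_recipe(rec) |>
  add_model(elnet_spec)

elnet_grid <- grid_regular(
  penalty(), mixture(),
  levels = c(10, 3)   # 10 penalties x 3 mixtures = 30 combinations
)

elnet_tuned <- tune_grid(
  elnet_workflow,
  resamples = data_folds,
  grid      = elnet_grid,
  metrics   = metric_set(rmse)
)
```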

After tuning the Elastic Net and the Neural Network, we can extract the best performing hyperparameter combinations based on their performance on the validation set; a sketch of this selection step follows the table below. The two best performing models are very close when comparing the RMSE on the validation sample: the NN has an RMSE of 0.1615, while the EN model has an RMSE of 0.1658.

Validation set prediction error
model rmse
nn 0.1615665
elnet 0.1658702
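The selection step sketched below extracts the winning hyperparameters and refits on the training data; nn_tuned denotes the assumed name of the analogous tuning result for the neural network.

```r
best_elnet <- select_best(elnet_tuned, metric = "rmse")
best_nn    <- select_best(nn_tuned, metric = "rmse")

elnet_final <- finalize_workflow(elnet_workflow, best_elnet) |>
  fit(data = training_set)
```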

From the graph, it becomes visible that the best performing model in terms of minimizing RMSE is, for the Elastic Net, a model with a mixture of 0 (i.e. no Lasso weight, meaning a Ridge regression) and a penalty of 0.077. Lasso and Elastic Net models with high penalty terms also perform well: a Lasso specification comes in a close second, with the remaining Elastic Net and Ridge models falling only slightly behind.

For the Neural Network model, we obtain the lowest RMSE with 10 epochs (training cycles through the dataset). The simpler NN is by far more precise than the complex one with 1000 epochs. More precisely, the hyperparameters chosen for the final models are a Ridge regression with a penalty of 0.077, and a Neural Network with 10 epochs and 15 hidden units.

Having fitted the models, we are now able to use them for making predictions. We gather all predictions in a vector, together with the corresponding actual observations for comparison.
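A minimal sketch of collecting the out-of-sample predictions together with the realized excess returns; the column name ret_excess is an assumption, and elnet_final and test_set refer to the earlier sketches.

```r
library(dplyr)

pred_collected_testing <- predict(elnet_final, new_data = test_set) |>
  bind_cols(test_set |> select(ret_excess)) |>
  rename(prediction = .pred, actual = ret_excess)
```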

Hyperparameter combinations for the Neural Network model
top model engine epochs hidden.units rmse std_err
1 Preprocessor1_Model1 keras 10 15 0.2421783 0.0054685
2 Preprocessor1_Model2 keras 1000 15 1.2521498 0.2293456
Top 10 hyperparameter combinations for the Elastic Net model
top model engine penalty mixture rmse std_err
1 Preprocessor1_Model09 glmnet 0.0774264 0.0 0.2354330 0.0045426
2 Preprocessor1_Model28 glmnet 0.0059948 1.0 0.2359042 0.0044302
3 Preprocessor1_Model10 glmnet 1.0000000 0.0 0.2360243 0.0044327
4 Preprocessor1_Model18 glmnet 0.0059948 0.5 0.2361817 0.0046046
5 Preprocessor1_Model19 glmnet 0.0774264 0.5 0.2372554 0.0045635
6 Preprocessor1_Model20 glmnet 1.0000000 0.5 0.2372554 0.0045635
7 Preprocessor1_Model29 glmnet 0.0774264 1.0 0.2372554 0.0045635
8 Preprocessor1_Model30 glmnet 1.0000000 1.0 0.2372554 0.0045635
9 Preprocessor1_Model27 glmnet 0.0004642 1.0 0.2381957 0.0045387
10 Preprocessor1_Model08 glmnet 0.0059948 0.0 0.2388621 0.0044622

The complexity of a model and its hyperparameters is best illustrated through the bias-variance trade-off, i.e. the relation between model complexity and flexibility. A highly complex model might have a low bias, since it is well specified, but will encounter a high variance, since it has fitted patterns (including noise) that do not generalize to unknown data - it is overfitted. On the other hand, a model that is too general might not capture the underlying structure in the data, which we refer to as underfitting. Such models will usually have a high bias and a low variance, since the model does not pick up patterns in the data and its predictions therefore vary less.

In the case of this analysis, it can be a bit difficult to discuss the complexity relationship, since the models themselves are not very complex. Usually, the neural net would be the most complex model, but in this case we only have two versions of it, with 10 and 1000 epochs respectively. Nevertheless, it is still clear that the simpler model performs best, with an RMSE of 0.24 against 1.25 for the model with 1000 epochs. Sadly, it was not possible to test more models due to constraints on computing power. In the case of the Elastic Net we have more options to interpret. The best performing model is basically a Ridge regression, since \(\alpha = 0\). However, it is difficult to say whether Lasso or Ridge definitively performs best, since the next best model is a Lasso regression, and the top five also contains three \(50\%\) mixture models. Moreover, the RMSE values of the different models are very close to each other. Furthermore, interpreting the regularization values (penalties) that make the top 10 list is a bit cumbersome, since there is no clear pattern: the top three consists of high (1.0), low (0.006) and low-to-medium (0.077) penalty levels, although low penalty values seem slightly overrepresented. A fairly loose conclusion is that the neural net performs best with low complexity (a low number of epochs), and the elastic net performs best with a low-to-medium penalty and a mixture of 0 (i.e. a Ridge model).

The RMSE of both models is lower on the validation set than in the resampling results above, which is an indication of two well specified models. Furthermore, the validation set is drawn randomly across the time series, so the lower RMSE is not driven by certain time periods being easier to predict.

5

We start out by taking the out-of-sample (“testing(split)”) stock return prediction for our tuned elastic net and neural network models. For the convenience of the reader, we store our results in a csv-file that the reader may refer to when going through our interpretation.

Next, we create a function to sort the stocks into an arbitrary number of portfolios, as described in chapter 4 of “Tidy Finance with R”; a sketch of the function is shown below. We use the curly-curly operator to add flexibility concerning which variable to use for the sorting, denoted “var”. We then use quantile() to compute breakpoints for the n portfolios. Lastly, we assign portfolios to each stock using the findInterval() function. The function adds a new column that contains the portfolio number to which a stock belongs. The portfolios are reconstituted each month using value weights.
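A sketch of such a sorting function along the lines of chapter 4 of “Tidy Finance with R”; the exact argument names are assumptions rather than our verbatim code.

```r
library(dplyr)

assign_portfolio <- function(data, var, n_portfolios) {
  # Breakpoints from the cross-sectional quantiles of the sorting variable
  breakpoints <- quantile(
    data |> pull({{ var }}),
    probs = seq(0, 1, length.out = n_portfolios + 1),
    na.rm = TRUE
  )
  # Portfolio number of the interval each stock's value falls into
  findInterval(data |> pull({{ var }}), breakpoints, all.inside = TRUE)
}
```

The returned portfolio numbers can then be attached as the new column, e.g. with mutate(portfolio = assign_portfolio(...)).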

We now apply the assign_portfolio function to the elastic net and neural network respectively. We sort the stocks into 10 portfolios using the momentum column in the assign_portfolios dataset and add the portfolio column to data_for_sorting.

A note for the reader: If you run this after you have tuned the model yourself and the fit is in the directory, then rename ‘pred_collected_testing_copy’ to ‘pred_collected_testing’ as this will allow you to use your own estimated values instead of the ones that we got earlier.

We now construct a zero-net-investment portfolio that buys the stocks with the highest expected returns and sells those with the lowest, i.e. we go long in the decile 10 stocks and short in the decile 1 stocks. To do this, we add two columns, breakpoint_low and breakpoint_high, to our portfolio-sorted datasets for the elastic net and the neural network models respectively. We then rename all portfolios that match the breakpoints to either low or high in order to distinguish between them, and calculate the value-weighted return for the high and low portfolios. After that, we subtract the low-portfolio return from the high-portfolio return to reflect our trading strategy. Lastly, we run a linear regression of the zero-net-investment strategy's return on the market excess return to see if we are generating positive alpha; a sketch of this regression is shown after the table below.
CAPM alpha and beta of the long-short prediction-sorted portfolios
model term estimate std.error statistic p.value
Elastic net (Intercept) 0.0211131 0.0063842 3.3070885 0.0011283
Elastic net mkt_excess 0.1052429 0.1421096 0.7405755 0.4598699
Neural net (Intercept) 0.0310839 0.0060633 5.1265784 0.0000007
Neural net mkt_excess 0.0493632 0.1349663 0.3657446 0.7149647
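The regression behind the table is sketched below; long_short, mkt_excess and long_short_returns are assumed names for the monthly strategy return, the market excess return and the data frame containing them.

```r
# CAPM regression of the long-short strategy's return on the market excess return
capm_fit <- lm(long_short ~ mkt_excess, data = long_short_returns)
broom::tidy(capm_fit)  # intercept = CAPM alpha, slope = CAPM beta
```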

The results can be summarized as follows: both the elastic net and the neural network seem able to predict CAPM-alpha-generating strategies after observing the training set, i.e. excess returns after adjusting for market risk. The neural network appears to be the slightly better predictor, with an estimated alpha exceeding that of the elastic net; its CAPM beta is also lower, implying a lower risk profile for the portfolio formed on its predictions. In this respect, the additional complexity of the neural network relative to the elastic net does seem to carry over to the out-of-sample predictions. The estimated alphas of both strategies are statistically significant at the 1 percent level, whereas the market betas are not statistically different from zero.