Introduction

Machine learning uses technology and statistics to automate the analysis of data. Increased computing power and the greater availability of data have encouraged its expansion. The development of deep learning models (or neural networks), and of large language models (LLMs) in particular, has been the most widely noticed advance.

Two major ideas

There are two major papers that help to explain the differences between traditional statistics and machine learning and between traditional quantitative investment and the modern variety:

  • Leo Breiman: Statistical Modeling: The Two Cultures. Breiman (2001). Breiman describes two cultures. The first tries to model the data-generating process with linear regression and other tools and uses the model for interpretation; the second uses tools like decision trees and neural nets to find algorithms that can be used for prediction.

  • Rich Sutton: The Bitter Lesson. Sutton (2019). This argues that more computing power, more data and scaling will, in the long run, overcome intuition and subject knowledge.

Taken together, these suggest that a computer scientist with a big, powerful computer and lots of data can do anything that a specialist in any field can do. Sutton is a computer scientist; Breiman was a statistician who championed the algorithmic approach.

The most famous Artificial Intelligence paper in recent years is Attention Is All You Need, Vaswani et al. (2023). It introduces the Transformer architecture that underlies modern Large Language Models (LLMs). The architecture is designed for parallelisation and the use of Graphics Processing Units (GPUs), which allows larger models to be processed more rapidly.

The field has evolved towards less human curation, more computing power and letting the machine learn from the data. Older concerns about over-fitting, and about having too many explanatory variables relative to the number of observations, have been relaxed in some areas.

Meanwhile, econometrics tends to focus on issues of understanding, transparency and causation, which are less well suited to machine learning. One of the main papers to consider in this field is Let’s take the con out of econometrics, Leamer (1983), which criticised the fragility of conventional econometric inference and helped to usher in quasi-experimental methods for uncovering causality, such as difference-in-differences and synthetic control.

Machine learning

There are two types of machine learning:

Machine learning can be:

Standard machine learning models include:

Nearest Neighbours

In this version with eight nearest neighbours, the eight blue elements are closest to the new red data point and will determine its classification. Using different numbers of neighbours can result in different classifications. Here is a toy example of using KNN to identify bad loans.

This will use the class package and will create artificial data for income, debt-to-income and credit score. These numbers then will be used to create a probability of default.

library(class)
set.seed(124) # for reproducibility
n <- 50 # training cases
train_data <- data.frame(
             income = rnorm(n, 50000, 15000),
             debt_to_income = runif(n, 0.1, 0.6),
             credit_score = rnorm(n, 680, 80)
             )

Now we create a binary variable that is 1 if there is a default and 0 if there is no default. Lower income, higher debt-to-income and a lower credit score increase the probability of default.

train_data$default <- ifelse(
                 train_data$income < 20000 | train_data$debt_to_income > 0.4 | train_data$credit_score < 600, 
                 sample(c(0, 1), n, replace = TRUE, prob = c(0.2, 0.8)), 
                 sample(c(0, 1), n, replace = TRUE, prob = c(0.8, 0.2))
                 )

Now we create the test data set that will be used to assess the model.

test_data <- data.frame(income = rnorm(50, 50000, 15000),
            debt_to_income = runif(50, 0.1, 0.6),
            credit_score = rnorm(50, 680, 80)
            )

We will also create the default outcomes for the test set in the same way.

test_data$default <- ifelse(
                test_data$income < 20000 | test_data$debt_to_income > 0.4 | test_data$credit_score < 600,
                sample(c(0, 1), 50, replace = TRUE, prob = c(0.2, 0.8)),
                sample(c(0, 1), 50, replace = TRUE, prob = c(0.8, 0.2))
                )

Now we normalise the data so that everything is on the same scale. We are calculating distances, so it makes a difference whether we record a length as 1 metre or 100 cm.

normalise <- function(x){
    return((x - min(x)) / (max(x) - min(x)))
}

train_norm <- as.data.frame(lapply(train_data[, 1:3], 
                                      FUN = normalise))
test_norm <- as.data.frame(lapply(test_data[, 1:3], 
                                     FUN = normalise))
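To see why the scale matters, the following minimal sketch compares the Euclidean distance between two hypothetical borrowers before and after min-max scaling. The borrowers `a` and `b`, the ranges `lo` and `hi` and the helper `scale01` are all assumptions introduced for illustration.

```r
# Two hypothetical borrowers: income in currency units, debt-to-income as a ratio
a <- c(income = 50000, debt_to_income = 0.10)
b <- c(income = 50500, debt_to_income = 0.60)

# Raw Euclidean distance: the income difference swamps the ratio difference
raw_dist <- sqrt(sum((a - b)^2))

# Assumed feature ranges for min-max scaling (illustrative only)
lo <- c(income = 20000, debt_to_income = 0.1)
hi <- c(income = 80000, debt_to_income = 0.6)
scale01 <- function(x) (x - lo) / (hi - lo)

# Scaled distance: both features now contribute on a 0-1 scale
scaled_dist <- sqrt(sum((scale01(a) - scale01(b))^2))

raw_dist
scaled_dist
```

Before scaling, the two borrowers differ almost entirely through income, even though their debt-to-income ratios sit at opposite ends of the range; after scaling the ratio difference dominates.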

Now we run K-nearest neighbours using the knn function. The key arguments are the training data, the test data and the class labels of the training data (cl), which here are the observed defaults. It is also necessary to specify how many neighbours are used to estimate the default. In this case we use 21.

predictions <- knn(
  train = train_norm,
  test = test_norm,
  cl = train_data$default,  # class labels of the training rows
  k = 21
)
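What knn does for a single new observation can be sketched by hand: measure the distance to every training row, keep the k closest and take a majority vote. The sketch below uses simulated, already-scaled data; `train_x`, `train_y`, `new_point` and `k = 5` are assumptions for illustration and are not the loan data above.

```r
library(class)
set.seed(1)

# Hypothetical, already-normalised training features and labels
train_x <- data.frame(x1 = runif(30), x2 = runif(30))
train_y <- ifelse(train_x$x1 + train_x$x2 > 1, 1, 0)
new_point <- c(x1 = 0.9, x2 = 0.8)
k <- 5

# Euclidean distance from the new point to every training row
d <- sqrt((train_x$x1 - new_point[["x1"]])^2 +
          (train_x$x2 - new_point[["x2"]])^2)

# Labels of the k closest rows, then a majority vote
nearest <- order(d)[1:k]
votes <- table(train_y[nearest])
manual_pred <- names(votes)[which.max(votes)]

# The same classification through class::knn for comparison
knn_pred <- knn(train = train_x, test = as.data.frame(t(new_point)),
                cl = factor(train_y), k = k)

manual_pred
as.character(knn_pred)
```

An odd k avoids tied votes when there are two classes.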

Evaluation

Now we compare the estimated defaults relative to the actual defaults, using the table function.

mytable <- table(predictions, test_data$default)
mytable
##            
## predictions  0  1
##           0 25 14
##           1  5  6

This shows that only six of the twenty actual defaults are predicted by the model, and that six of its eleven predicted defaults are correct. It is usual to assess the quality of this sort of categorical model using the following metrics:

\[\text{Accuracy} = \frac{\text{True Positive} + \text{True Negative}}{\text{Total}} = \frac{6 + 25}{50} = 0.62\]

\[\text{Sensitivity} = \frac{\text{True Positive}}{\text{All Actual Positive}} = \frac{6}{20} = 0.30\]

\[\text{Specificity} = \frac{\text{True Negative}}{\text{All Actual Negative}} = \frac{25}{30} = 0.83\]

In this case we would be most interested in catching as many defaults (positives) as possible. Therefore, we would be looking at the sensitivity. At 30% the model identifies fewer than a third of the defaults.
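The metrics can be computed directly from the confusion matrix printed above. The following minimal sketch re-enters the four counts by hand; the names `cm`, `tp`, `tn`, `fp` and `fn` are introduced here for illustration. Sensitivity is taken as true positives over all actual positives and specificity as true negatives over all actual negatives; precision is added for comparison.

```r
# Counts copied from the table above: rows are predictions, columns are actual values
cm <- matrix(c(25, 5, 14, 6), nrow = 2,
             dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))

tp <- cm["1", "1"]  # predicted default, actual default
tn <- cm["0", "0"]  # predicted no default, actual no default
fp <- cm["1", "0"]  # predicted default, actual no default
fn <- cm["0", "1"]  # predicted no default, actual default

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)   # share of actual defaults that are caught
specificity <- tn / (tn + fp)   # share of actual non-defaults that are cleared
precision   <- tp / (tp + fp)   # share of predicted defaults that are right

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 2)
```

Precision and sensitivity share a numerator but answer different questions, so it helps to label the denominator explicitly.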

  • K-means: carries out unsupervised clustering by dividing the data into k groups, assigning each data point to the nearest of the k cluster centres. It can find hidden patterns in the data, but it can be computationally intensive on large data sets. There is an example of using K-means to find different financial regimes on My Studies, under K-means example.

  • Support Vector Machines: split categories by maximising the margin, which is the distance between the dividing line and the nearest elements of each category. This is supervised learning. To be completed.

  • Decision Trees: break classifications into branches like a tree with nodes and leaves. They are easy to understand, handle non-linear responses and handle some interactions. However, they have high variance and are prone to overfitting.
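As a brief aside on the K-means item in the list above, the following is a minimal sketch using the base R kmeans function on simulated data. The two "regimes", their return and volatility parameters and the object names are assumptions made for illustration; this is not the My Studies example.

```r
set.seed(42)

# Simulated daily return and volatility for two hypothetical regimes
calm   <- data.frame(ret = rnorm(100, 0.05, 0.5), vol = rnorm(100, 12, 2))
stress <- data.frame(ret = rnorm(100, -0.10, 1.5), vol = rnorm(100, 35, 5))
regimes <- rbind(calm, stress)
true_regime <- rep(c("calm", "stress"), each = 100)

# Standardise the columns so both features contribute to the distance,
# then fit k = 2 clusters from 20 random starts
fit <- kmeans(scale(regimes), centers = 2, nstart = 20)

# Compare the discovered clusters with the regimes used in the simulation
regime_table <- table(cluster = fit$cluster, true = true_regime)
regime_table
```

The cluster numbers are arbitrary labels, so the comparison table should be read by row rather than by matching cluster 1 to a particular regime.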

This is a decision-tree example using the same set of market tension data that is used later for the neural net. The FinRegimes.csv data contain information about the VIX index, the slope of the yield curve, the SPY ETF value, the TLT ETF value, the credit spread between government and corporate bonds and the level of the Fed funds rate at which US banks lend money to each other. The aim is to determine whether the next day is likely to be an up day or a down day.

The initial steps are to prepare the data.

da <- read.csv("../Data/FinRegimes.csv")
da$Date <- as.Date(da$Date, format = "%d/%m/%Y")
colnames(da) <- c("Date", "VIX", "Yield", "SPY", 
                  "TLT", "Credit", "Fed")
# remove nas
da <- na.omit(da)

Instead of using continuous data we will sort each variable into categories such as high, low and normal.

da$VIXC <- ifelse(da$VIX > 60, "High", ifelse(da$VIX < 20, 
                                              "Low", "Stable"))
da$YC <- ifelse(da$Yield > 50, "Steep", ifelse(da$Yield < 0, 
                                               "Inverted", "Normal"))
da$ET <- c(ifelse(da$SPY[1:(length(da$SPY) - 4)] /
                         da$SPY[5:length(da$SPY)] > 1.2, 
                "Uptrend", ifelse(da$SPY[1:(length(da$SPY) - 4)] /
                          da$SPY[5:length(da$SPY)] < 0.8, 
                          "Downtrend", "Flat")), rep("Flat", 4))
da$TT <- c(ifelse(da$TLT[1:(length(da$TLT) - 4)] /
                  da$TLT[5:length(da$TLT)] > 1.2, "Uptrend", 
                ifelse(da$TLT[1:(length(da$TLT) - 4)] /
                         da$TLT[5:length(da$TLT)] < 0.8, 
                       "Downtrend", "Flat")), 
           rep("Flat", 4))
da$CR <- ifelse(da$Credit < 2100, "Low", ifelse(da$Credit > 2500, 
                                                "High", "Normal"))
da$FC <- ifelse(da$Fed > 4, "High", ifelse(da$Fed < 1, "Low", 
                                           "Normal"))
upday <- c(ifelse(da$SPY[1:(length(da$SPY) -1)]
                   /da$SPY[2:length(da$SPY)] >= 1, 
                        1, 0), 0)
# the return is the latest date and the signals are the 
# day before. 
dac <- cbind(upday[1:(length(upday) - 1)], da[2:length(upday), ])
dac <- data.frame(dac)
dac <- dac[, -c(3, 4, 5, 6, 7)]
colnames(dac)[1] <- "Upday"
head(dac)
##   Upday       Date  Fed   VIXC     YC   ET   TT   CR   FC
## 4     1 2025-03-24 4.33    Low Normal Flat Flat High High
## 5     1 2025-03-21 4.33    Low Normal Flat Flat High High
## 6     0 2025-03-20 4.33    Low Normal Flat Flat High High
## 7     0 2025-03-19 4.33    Low Normal Flat Flat High High
## 8     1 2025-03-18 4.33 Stable Normal Flat Flat High High
## 9     0 2025-03-17 4.33 Stable Normal Flat Flat High High

Now split the data into training and test sets. The code samples 75% of the first 75% of rows for training (699 observations) and keeps the remaining rows for testing.

# Now split into test and train
train <- NROW(dac) * 0.75
train_sample <- sample(1:train, size = train * 0.75)
dac_train <- dac[train_sample, ] 
dac_test <- dac[-train_sample, ]
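Because the code above samples only from the first 75% of the rows, the training set ends up holding roughly 56% of the data rather than 75%. If a plain random 75/25 split is wanted instead, it could be sketched as follows. The data frame `toy` is a hypothetical stand-in for dac, and the object names are assumptions.

```r
set.seed(123)

# Hypothetical data frame standing in for dac
toy <- data.frame(id = 1:200, y = rbinom(200, 1, 0.5))

# Draw 75% of all row indices for training; the rest form the test set
n_train <- floor(0.75 * nrow(toy))
train_idx <- sample(seq_len(nrow(toy)), size = n_train)
toy_train <- toy[train_idx, ]
toy_test <- toy[-train_idx, ]

c(train = nrow(toy_train), test = nrow(toy_test))
```

With time-ordered market data, a chronological split (earliest rows for training, latest for testing) is a common alternative to random sampling.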

Create the formula

myformula <- as.formula(paste("Upday ~", 
                              paste(colnames(dac[-c(1, 2, 3)]), 
                              collapse = "+")))
myformula
## Upday ~ VIXC + YC + ET + TT + CR + FC

The rpart package can be used to create a decision tree. In this case we take the classification option, but it is also possible to use it for regression analysis. The package automatically carries out cross-validation: it performs 10-fold cross-validation by default and reports the cross-validated error (xerror) for each value of the complexity parameter, which can be used to choose the best tree. The default can be changed and the cross-validation outcomes can be investigated. See Le Chat for fuller details.

The rpart package has a very useful vignette with a walk-through of the key features.

require(rpart)
## Loading required package: rpart
model <- rpart(
  formula = myformula,
  data = dac_train,
  method = "class"
)
summary(model)
## Call:
## rpart(formula = myformula, data = dac_train, method = "class")
##   n= 699 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.04892966      0 1.0000000 1.0000000 0.04034215
## 2 0.02752294      1 0.9510703 1.0244648 0.04039120
## 3 0.01000000      2 0.9235474 0.9755352 0.04027178
## 
## Variable importance
##   FC   YC VIXC   CR 
##   40   31   24    5 
## 
## Node number 1: 699 observations,    complexity param=0.04892966
##   predicted class=1  expected loss=0.4678112  P(node) =1
##     class counts:   327   372
##    probabilities: 0.468 0.532 
##   left son=2 (282 obs) right son=3 (417 obs)
##   Primary splits:
##       FC   splits as  RLL, improve=3.4670340, (0 missing)
##       CR   splits as  RRL, improve=3.3154150, (0 missing)
##       VIXC splits as  RL,  improve=2.6857260, (0 missing)
##       YC   splits as  RLR, improve=0.1161307, (0 missing)
##   Surrogate splits:
##       VIXC splits as  RL,  agree=0.770, adj=0.429, (0 split)
##       YC   splits as  RRL, agree=0.767, adj=0.422, (0 split)
##       CR   splits as  RLL, agree=0.652, adj=0.138, (0 split)
## 
## Node number 2: 282 observations,    complexity param=0.02752294
##   predicted class=0  expected loss=0.4716312  P(node) =0.4034335
##     class counts:   149   133
##    probabilities: 0.528 0.472 
##   left son=4 (163 obs) right son=5 (119 obs)
##   Primary splits:
##       YC   splits as  LLR, improve=1.8036130, (0 missing)
##       FC   splits as  -RL, improve=0.5776853, (0 missing)
##       VIXC splits as  RL,  improve=0.3160259, (0 missing)
##   Surrogate splits:
##       VIXC splits as  RL,  agree=0.830, adj=0.597, (0 split)
##       FC   splits as  -RL, agree=0.762, adj=0.437, (0 split)
## 
## Node number 3: 417 observations
##   predicted class=1  expected loss=0.4268585  P(node) =0.5965665
##     class counts:   178   239
##    probabilities: 0.427 0.573 
## 
## Node number 4: 163 observations
##   predicted class=0  expected loss=0.4233129  P(node) =0.2331903
##     class counts:    94    69
##    probabilities: 0.577 0.423 
## 
## Node number 5: 119 observations
##   predicted class=1  expected loss=0.4621849  P(node) =0.1702432
##     class counts:    55    64
##    probabilities: 0.462 0.538

This tells us…

It is also possible to plot the decision tree. The rpart.plot package is useful here.

require(rpart.plot)
## Loading required package: rpart.plot
rpart.plot(model, main = "Investment Decision Tree", 
           box.palette = 2)

This shows the categorical variables that are important: whether the VIX index is stable; whether the credit risk is normal; whether the yield curve is normal.
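The cross-validation results that rpart stores can be inspected through the complexity-parameter table and used to prune the tree. The following minimal sketch uses the kyphosis data set that ships with rpart as a stand-in for the market data; the choice of `cp = 0.001` and the rule of picking the row with the smallest xerror are assumptions for illustration, not the only sensible options.

```r
library(rpart)
set.seed(1)

# kyphosis ships with rpart and is used here only as a stand-in data set
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class",
             control = rpart.control(xval = 10, cp = 0.001))

# One row per candidate tree size; xerror is the cross-validated error
cp_table <- fit$cptable
cp_table

# Choose the complexity parameter with the smallest cross-validated error
# and prune the tree back to that size
best_cp <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
pruned <- prune(fit, cp = best_cp)
pruned
```

Because the folds are drawn at random, the xerror column changes slightly from run to run unless the seed is fixed.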

Resampling

In addition, machine learning often utilises the power of the machine to simulate and re-sample as a way to understand the uncertainty in the data and to separate the signal from the noise. Among these techniques are:

  • Bootstrapping:
  • Random Forests:
  • Cross Validation: resamples the data or uses sections of the data set to estimate statistics or assess the variability of statistical estimates. Non-exhaustive cross-validation does not compute all the ways of splitting the original data. For example, k-fold cross-validation splits the data into k equal-sized sub-samples. Each sub-sample is used once to test a model that is estimated on the other k - 1 sub-samples, and the average of the k test results is used as a single estimate. The advantage of this over the bootstrap is that every observation is used for testing exactly once. Here is a walk-through of the KNN method, using cross-validation. I have run this and it works fine.
  • Monte Carlo Simulation:
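The k-fold procedure described in the list above can be sketched by hand for a KNN classifier. The data are simulated and already on a 0-1 scale; the number of folds, the candidate numbers of neighbours and the helper `cv_accuracy` are assumptions introduced for illustration.

```r
library(class)
set.seed(2024)

# Simulated, already-scaled features and a noisy binary label
n <- 200
x <- data.frame(x1 = runif(n), x2 = runif(n))
y <- factor(ifelse(x$x1 + x$x2 + rnorm(n, 0, 0.2) > 1, 1, 0))

# Randomly assign each row to one of five folds of equal size
k_folds <- 5
fold <- sample(rep(1:k_folds, length.out = n))

# Average out-of-fold accuracy for a given number of neighbours
cv_accuracy <- function(k_neighbours) {
  acc <- numeric(k_folds)
  for (f in 1:k_folds) {
    held_out <- fold == f
    pred <- knn(train = x[!held_out, ], test = x[held_out, ],
                cl = y[!held_out], k = k_neighbours)
    acc[f] <- mean(pred == y[held_out])
  }
  mean(acc)
}

# Compare a few candidate values of k
candidates <- c(1, 5, 11, 21)
cv_results <- sapply(candidates, cv_accuracy)
names(cv_results) <- candidates
cv_results
```

The number of neighbours with the highest average accuracy would then be refitted on the full training set.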

Deep learning

Deep learning uses the brain as a metaphor. A network of neurons is combined through a variety of weights and biases that are optimised to allow the features to explain the target variable.
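How weights and biases turn features into a prediction can be sketched with a single forward pass through a tiny network. All of the numbers below are made up, and the sigmoid activation and the 3-2-1 layout are assumptions for illustration; in practice the weights are found by optimisation rather than set by hand.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Hypothetical network: 3 input features, 2 hidden neurons, 1 output
x  <- c(0.2, 0.7, 0.5)                         # one observation
W1 <- matrix(c( 0.5, -0.3, 0.8,
               -0.6,  0.1, 0.4),
             nrow = 2, byrow = TRUE)           # one row of weights per hidden neuron
b1 <- c(0.1, -0.2)                             # hidden-layer biases
W2 <- matrix(c(1.2, -0.7), nrow = 1)           # output weights
b2 <- 0.05                                     # output bias

# Each hidden neuron takes a weighted sum of the inputs plus its bias
hidden <- sigmoid(W1 %*% x + b1)

# The output neuron does the same with the hidden activations
output <- sigmoid(W2 %*% hidden + b2)
as.numeric(output)
```

Training adjusts W1, b1, W2 and b2 so that the output moves closer to the observed target across the training data.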

Neural network

Example

This is an example of using the neuralnet package in R. It will use financial variables to predict whether the stock market will move up or down. The variables to be used are: the VIX index of implied option volatility; the level of the SPY S&P 500 ETF; the level of the TLT US government bond ETF; the US 10-year less 2-year yield curve; the credit spread between government and corporate bonds; and the level of the Fed funds rate. A neural net with one hidden layer is used to forecast whether the S&P 500 will go up or down the next day.

Prepare the data

Use the caret package and import the data (this was originally downloaded from Bloomberg).

require(caret)
da <- read.csv('../Data/FinRegimes.csv')
da$Date <- as.Date(da$Date, format = "%d/%m/%Y")
colnames(da) <- c("Date", "VIC", "Yield", "SPY", 
      "TLT", "Credit", "Fed")
# Remove NAs
da <- na.omit(da)
head(da)
##         Date   VIC  Yield    SPY   TLT  Credit  Fed
## 3 2025-03-25 17.15 29.163 575.46 89.76 2730.54 4.33
## 4 2025-03-24 17.48 29.580 574.08 89.77 2730.29 4.33
## 5 2025-03-21 19.28 29.392 563.98 90.70 2723.89 4.33
## 6 2025-03-20 19.80 26.908 565.49 91.24 2725.84 4.33
## 7 2025-03-19 19.90 26.224 567.13 91.18 2722.72 4.33
## 8 2025-03-18 21.70 23.911 561.02 90.71 2715.76 4.33

These are the explanatory variables that are supposed to explain the direction of US stocks. Now create a binary variable that is 1 if stocks go up and zero if they go down.

upday <- c(ifelse(da$SPY[1:(length(da$SPY) -1)]
                   /da$SPY[2:length(da$SPY)] >= 1, 
                        1, 0), 0)

Create a function that will normalise the data to lie between zero and one, and apply it to each column. Combine the upday and normalised data but lag the explanatory variables by one day so that the trading signal is available in advance: explanatory variables will assess whether the US stock market will go up or down tomorrow.

normalise <- function(x){
  return((x - min(x)) / (max(x) - min(x)))
}
da <- apply(da[, -1], MARGIN = 2, FUN = normalise)
dac <- cbind(upday[1:(length(upday) - 1)], da[2:length(upday), ])
dac <- data.frame(dac)
colnames(dac)[1] <- "Upday"
head(dac)
##   Upday       VIC     Yield       SPY        TLT    Credit       Fed
## 4     1 0.1242813 0.5192425 0.8940782 0.07882883 0.9910820 0.8109641
## 5     1 0.1640867 0.5185366 0.8665413 0.08930180 0.9837364 0.8109641
## 6     0 0.1755860 0.5092101 0.8706582 0.09538288 0.9859745 0.8109641
## 7     0 0.1777974 0.5066419 0.8751295 0.09470721 0.9823935 0.8109641
## 8     1 0.2176028 0.4979575 0.8584710 0.08941441 0.9744052 0.8109641
## 9     0 0.1912870 0.5021251 0.8751840 0.08840090 0.9753922 0.8109641

Split the data into training and test sets.

set.seed(123)
train <- NROW(dac) * 0.75
train_sample <- sample(1:train, size = train * 0.75)
dac_train <- dac[train_sample, ] 
dac_test <- dac[-train_sample, ]

Now create the formula that will be applied to the data, forecasting whether market is up or down based on the explanatory variables.

myformula <- as.formula(paste("Upday ~", paste(colnames(da),
                           collapse = "+")))
myformula
## Upday ~ VIC + Yield + SPY + TLT + Credit + Fed

The next step will be to run the deep-learning neural network. This will be based on one hidden layer with two neurons (hidden = 2). There are many options that can be fine-tuned to improve the model, but we will take the default values.

require(neuralnet)
## Loading required package: neuralnet
mynet <- neuralnet(myformula,
           hidden = 2, 
           data = dac_train)  # fit on the training set
summary(mynet)
##                     Length Class      Mode    
## call                   4   -none-     call    
## response             545   -none-     numeric 
## covariate           3270   -none-     numeric 
## model.list             2   -none-     list    
## err.fct                1   -none-     function
## act.fct                1   -none-     function
## linear.output          1   -none-     logical 
## data                   7   data.frame list    
## exclude                0   -none-     NULL    
## net.result             1   -none-     list    
## weights                1   -none-     list    
## generalized.weights    1   -none-     list    
## startweights           1   -none-     list    
## result.matrix         20   -none-     numeric
plot(mynet, 'best')

This is a visual representation of the neural net with one hidden layer.

Now use the model to predict the probability of an upday using the testing dataset. Create a dataframe of prediction probability and actual and then calculate the key performance metrics.

myprediction <- data.frame(cbind(predict(mynet, dac_test), 
                                 dac_test$Upday))
colnames(myprediction) <- c("Predict", "Actual")
head(myprediction)
##      Predict Actual
## 4  0.6769887      1
## 6  0.6812168      0
## 10 0.6649707      1
## 12 0.7184533      0
## 15 0.6585777      0
## 18 0.6813815      1
mytable <- table(Actual = myprediction$Actual, Predict = 
        myprediction$Predict > 0.5)
# Rows are actual outcomes (0, 1); columns are forecasts (FALSE, TRUE)
MyDown <- sum(mytable[1, ]) / sum(mytable)
MyUp <- sum(mytable[2, ]) / sum(mytable)
MyAccuracy <- sum(mytable[1, 1] + mytable[2, 2]) / sum(mytable)
# Correct up-day forecasts / actual up days (sensitivity, also called recall)
MySensitivity <- mytable[2, 2] / (mytable[2, 2] + mytable[2, 1])
# Correct down-day forecasts / actual down days
MySpecificity <- mytable[1, 1] / (mytable[1, 1] + mytable[1, 2])

Now we have the following results:

  • Share of up days is 0.57
  • Share of down days is 0.43
  • Accuracy (correct day forecasts / total days) is 0.6
  • Sensitivity (correct up-day forecasts / total up days) is 0.81
  • Specificity (correct down-day forecasts / total down days) is 0.33

Therefore, we can see that nearly sixty percent of days are up days and just over forty percent are down days. Are we trying to identify the times to invest in stocks or the times to be outside of the market? In the first case we want a high sensitivity; in the second we would like to see a high specificity. The model catches most up days but only a third of down days, so it is suggested that we hold stocks so long as the neural net classifier suggests that the next day will be an up day.

Temperature

10 Jan 2025. Heat is like entropy: a higher temperature means more disorder and a greater variety of possible solutions.

The softmax function will convert the model logit scores into probabilities.

\[P(x_i) = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}\]

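To change the temperature, divide each logit z by T before applying the softmax. If T = 1, the probabilities are unchanged; if T = 0.5, the distribution sharpens and the most probable outcome is given more weight; if T = 1.5, the distribution flattens and the least probable outcomes are given more weight.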
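The effect of the temperature can be checked numerically. The following minimal sketch applies a softmax to three made-up logits at the three temperatures mentioned above; the function name `softmax_t` and the logit values are assumptions for illustration.

```r
# Softmax with temperature: divide the logits by temp, then normalise
softmax_t <- function(z, temp = 1) {
  z <- z / temp
  e <- exp(z - max(z))   # subtracting the maximum avoids overflow
  e / sum(e)
}

logits <- c(2, 1, 0.1)   # hypothetical model scores

p_sharp <- softmax_t(logits, temp = 0.5)
p_base  <- softmax_t(logits, temp = 1)
p_flat  <- softmax_t(logits, temp = 1.5)

round(rbind(T_0.5 = p_sharp, T_1 = p_base, T_1.5 = p_flat), 3)
```

Reading down the first column shows the weight on the most probable outcome falling as T rises, while the last column shows the weight on the least probable outcome growing.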

Use in finance

If you look around you will find hundreds and hundreds of papers that apply deep learning or neural networks to the trading and investment of financial assets.

Bibliography

Breiman, Leo. 2001. “Statistical Modeling: The Two Cultures.” Statistical Science 16 (3): 199–213. https://www.jstor.org/stable/2676681.
Leamer, Edward E. 1983. “Let’s Take the Con Out of Econometrics.” American Economic Review. https://www.jstor.org/stable/1803924.
Sutton, Rich. 2019. “The Bitter Lesson.” http://www.incompleteideas.net/IncIdeas/BitterLesson.html.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2023. “Attention Is All You Need.” https://arxiv.org/abs/1706.03762.