Machine learning uses computing technology and statistics to automate the analysis of data. Increased computing power and the greater availability of data have encouraged its expansion. The development of deep learning models (neural networks) and, more recently, of large language models (LLMs) has been the most prominent advance.
There are two major papers that help to explain the differences between traditional statistics and machine learning and between traditional quantitative investment and the modern variety:
Leo Breiman: Statistical Modeling: The Two Cultures, Breiman (2001). This contrasts two cultures: the first tries to model the data-generating process with linear regression and similar tools and uses these models for interpretation; the second uses tools such as decision trees and neural nets to find algorithms that predict well.
Rich Sutton: The Bitter Lesson, Sutton (2019). This argues that more computing power, more data and greater scale will ultimately overcome intuition and subject knowledge.
Together these suggest that someone with a big, powerful computer and lots of data can do anything that a domain specialist can. Breiman was a statistician and Sutton is a computer scientist, but both make the case for letting the data and the computation do the work.
The most famous artificial intelligence paper of recent years is Attention Is All You Need, Vaswani et al. (2023). This introduced the Transformer architecture that underlies modern Large Language Models (LLMs); its reliance on parallelisation and Graphics Processing Units (GPUs) allows larger models to be trained more rapidly.
The evolution has been in the direction of less human curation, more computing power and letting the machine learn from the data. Older concerns about having too many explanatory variables relative to the data, and about over-fitting, have been set aside in some areas.
Meanwhile, econometrics tends to focus on issues of understanding, transparency and causation, which are less well suited to machine learning. One of the main papers in this field is Let's Take the Con Out of Econometrics, Leamer (1983), which criticised the fragility of regression-based inference and helped usher in the quasi-experimental methods used to uncover causality, such as difference-in-differences and synthetic control.
There are two types of machine learning:
classification (deciding on a category)
regression (predicting a number)
Machine learning can be:
Supervised: where we have examples of the categories or numbers that can be used to train the model, which then predicts the category or value for new cases.
Unsupervised: where the machine determines the categories itself by finding patterns in the data. This is not normally done for regression problems.
Standard machine learning models include:
K-nearest neighbours (KNN): classifies a new data point according to the categories of its nearest neighbours. In this version with eight nearest neighbours, the eight blue elements are closest to the new red data point and determine its classification. Using a different number of neighbours can result in a different classification. Here is a toy example of using KNN to identify bad loans.
This example uses the class package and creates artificial data for income, debt-to-income and credit score. These numbers are then used to create a probability of default.
library(class)
set.seed(124) # for reproducibility
n <- 50 # training cases
train_data <- data.frame(
income = rnorm(n, 50000, 15000),
debt_to_income = runif(n, 0.1, 0.6),
credit_score = rnorm(n, 680, 80)
)
Now we create a binary variable that is 1 if there is a default and 0 if there is no default. Lower income, a higher debt-to-income ratio and a lower credit score increase the probability of default.
train_data$default <- ifelse(
train_data$income < 20000 | train_data$debt_to_income > 0.4 | train_data$credit_score < 600,
sample(c(0, 1), n, replace = TRUE, prob = c(0.2, 0.8)),
sample(c(0, 1), n, replace = TRUE, prob = c(0.8, 0.2))
)
Now we create the test data set that will be used to assess the model.
test_data <- data.frame(income = rnorm(50, 50000, 15000),
debt_to_income = runif(50, 0.1, 0.6),
credit_score = rnorm(50, 680, 80)
)
We will also create probabilities of default.
test_data$default <- ifelse(
test_data$income < 20000 | test_data$debt_to_income > 0.4 | test_data$credit_score < 600,
sample(c(0, 1), 50, replace = TRUE, prob = c(0.2, 0.8)),
sample(c(0, 1), 50, replace = TRUE, prob = c(0.8, 0.2))
)
Now we normalise the data so that everything is on the same scale. We are calculating distances, so it matters whether a variable is measured in, say, metres or centimetres.
normalise <- function(x){
return((x - min(x)) / (max(x) - min(x)))
}
train_norm <- as.data.frame(lapply(train_data[, 1:3],
FUN = normalise))
test_norm <- as.data.frame(lapply(test_data[, 1:3],
FUN = normalise))
Now we run K-nearest neighbours using the knn function. The key arguments are the training data, the test data and the class labels of the training cases, which here record whether each training loan defaulted. It is also necessary to say how many neighbours are used to estimate the default; in this case we use 21.
predictions <- knn(
train = train_norm,
test = test_norm,
  cl = train_data$default,
k = 21
)
Now we compare the estimated defaults relative to the actual
defaults, using the table function.
mytable <- table(predictions, test_data$default)
mytable
##
## predictions 0 1
## 0 25 14
## 1 5 6
However, the table shows that only six of the twenty actual defaults are predicted by the model. It is usual to assess the quality of this sort of categorical model using the following metrics:
\[\text{Accuracy} = \frac{\text{True Positives} + \text{True Negatives}}{\text{Total}} = \frac{6 + 25}{50} = 0.62\]
\[\text{Sensitivity} = \frac{\text{True Positives}}{\text{All Actual Positives}} = \frac{6}{20} = 0.30\]
\[\text{Specificity} = \frac{\text{True Negatives}}{\text{All Actual Negatives}} = \frac{25}{30} = 0.83\]
In this case we would be most interested in catching as many defaults (the positives) as possible, so we would focus on the sensitivity. At 30%, the model identifies fewer than a third of the defaults.
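These metrics can also be computed directly in R from the confusion table above (a small added sketch that assumes the mytable object created earlier; rows are predictions and columns are actual defaults):
TP <- mytable["1", "1"] # predicted default, actual default
TN <- mytable["0", "0"] # predicted no default, actual no default
FP <- mytable["1", "0"] # predicted default, actual no default
FN <- mytable["0", "1"] # predicted no default, actual default
accuracy <- (TP + TN) / sum(mytable)
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
round(c(Accuracy = accuracy, Sensitivity = sensitivity, Specificity = specificity), 2)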
K-means carries out unsupervised clustering by dividing the data into k groups, assigning each data point to the nearest of the k cluster centres. It can find hidden patterns in the data, but it can be computationally intensive. There is an example of using K-means to find different financial regimes on My Studies, under K-means example, and a brief sketch follows below.
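Here is a minimal sketch of K-means on simulated return and volatility data (this is not the My Studies example; the data and the choice of three clusters are purely illustrative):
set.seed(42)
# simulate daily returns and volatility for three loose regimes
regimes <- data.frame(
  ret = c(rnorm(50, 0.05, 0.02), rnorm(50, 0.00, 0.05), rnorm(50, -0.04, 0.03)),
  vol = c(rnorm(50, 0.10, 0.02), rnorm(50, 0.20, 0.05), rnorm(50, 0.35, 0.05))
)
# scale first so that both variables contribute equally to the distance
km <- kmeans(scale(regimes), centers = 3, nstart = 25)
table(km$cluster) # how many observations fall in each cluster
km$centers # cluster centres in standardised units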
Support Vector Machines: split categories by maximising the margin between the dividing boundary and the nearest elements of each class. This is supervised learning. You can read more here about support vector machines. To be completed.
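A minimal sketch of a support vector classifier, using the e1071 package on simulated data (the package, the data and the kernel choice are illustrative assumptions, not part of the original example):
require(e1071)
set.seed(7)
# two simulated features and a binary class that depends on their sum
svm_data <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
svm_data$y <- factor(ifelse(svm_data$x1 + svm_data$x2 > 0, "Up", "Down"))
# fit a linear support vector machine and compare fitted and actual classes
svm_fit <- svm(y ~ x1 + x2, data = svm_data, kernel = "linear")
table(Predicted = predict(svm_fit, svm_data), Actual = svm_data$y)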
Decision Trees: break classifications into branches like a tree, with nodes and leaves. They are easy to understand, handle non-linear responses and capture some interactions. However, they have high variance and are prone to overfitting.
This is an example using the same set of market tension data that are used later for the neural net. The FinRegimes.csv data contain the VIX index, the slope of the yield curve, the SPY ETF value, the TLT ETF value, the credit spread between government and corporate bonds, and the level of the Fed funds rate at which US banks lend to each other overnight. The aim is to determine whether the next day is likely to be an up day or a down day.
The initial steps are to prepare the data.
da <- read.csv("../Data/FinRegimes.csv")
da$Date <- as.Date(da$Date, format = "%d/%m/%Y")
colnames(da) <- c("Date", "VIX", "Yield", "SPY",
"TLT", "Credit", "Fed")
# remove nas
da <- na.omit(da)
Instead of using the continuous data we will categorise each variable into bands such as high, low and normal.
da$VIXC <- ifelse(da$VIX > 60, "High", ifelse(da$VIX < 20,
"Low", "Stable"))
da$YC <- ifelse(da$Yield > 50, "Steep", ifelse(da$Yield < 0,
"Inverted", "Normal"))
da$ET <- c(ifelse(da$SPY[1:(length(da$SPY) - 4)] /
da$SPY[5:length(da$SPY)] > 1.2,
"Uptrend", ifelse(da$SPY[1:(length(da$SPY) - 4)] /
da$SPY[5:length(da$SPY)] < 0.8,
"Downtrend", "Flat")), rep("Flat", 4))
da$TT <- c(ifelse(da$TLT[1:(length(da$TLT) - 4)] /
da$TLT[5:length(da$TLT)] > 1.2, "Uptrend",
ifelse(da$TLT[1:(length(da$TLT) - 4)] /
da$TLT[5:length(da$TLT)] < 0.8,
"Downtrend", "Flat")),
rep("Flat", 4))
da$CR <- ifelse(da$Credit < 2100, "Low", ifelse(da$Credit > 2500,
"High", "Normal"))
da$FC <- ifelse(da$Fed > 4, "High", ifelse(da$Fed < 1, "Low",
"Normal"))
upday <- c(ifelse(da$SPY[1:(length(da$SPY) -1)]
/da$SPY[2:length(da$SPY)] >= 1,
1, 0), 0)
# the return is the latest date and the signals are the
# day before.
dac <- cbind(upday[1:(length(upday) - 1)], da[2:length(upday), ])
dac <- data.frame(dac)
dac <- dac[, -c(3, 4, 5, 6, 7)]
colnames(dac)[1] <- "Upday"
head(dac)
## Upday Date Fed VIXC YC ET TT CR FC
## 4 1 2025-03-24 4.33 Low Normal Flat Flat High High
## 5 1 2025-03-21 4.33 Low Normal Flat Flat High High
## 6 0 2025-03-20 4.33 Low Normal Flat Flat High High
## 7 0 2025-03-19 4.33 Low Normal Flat Flat High High
## 8 1 2025-03-18 4.33 Stable Normal Flat Flat High High
## 9 0 2025-03-17 4.33 Stable Normal Flat Flat High High
Now create the test and training set in the same way that we did before.
# Now split into test and train
train <- NROW(dac) * 0.75
train_sample <- sample(1:train, size = train * 0.75)
dac_train <- dac[train_sample, ]
dac_test <- dac[-train_sample, ]
Create the formula
myformula <- as.formula(paste("Upday ~",
paste(colnames(dac[-c(1, 2, 3)]),
collapse = "+")))
myformula
## Upday ~ VIXC + YC + ET + TT + CR + FC
The rpart package can be used to create a decision tree. In this case we take the classification option, but it is also possible to use it for regression analysis. The package automatically carries out cross-validation: it performs 10-fold validation by default and keeps the best model. The default can be changed and the cross-validation outcomes can be investigated, as illustrated after the model summary below. See Le Chat for fuller details.
The rpart package has a very useful vignette with a walk-through of the key features.
require(rpart)
## Loading required package: rpart
model <- rpart(
formula = myformula,
data = dac_train,
method = "class"
)
summary(model)
## Call:
## rpart(formula = myformula, data = dac_train, method = "class")
## n= 699
##
## CP nsplit rel error xerror xstd
## 1 0.04892966 0 1.0000000 1.0000000 0.04034215
## 2 0.02752294 1 0.9510703 1.0244648 0.04039120
## 3 0.01000000 2 0.9235474 0.9755352 0.04027178
##
## Variable importance
## FC YC VIXC CR
## 40 31 24 5
##
## Node number 1: 699 observations, complexity param=0.04892966
## predicted class=1 expected loss=0.4678112 P(node) =1
## class counts: 327 372
## probabilities: 0.468 0.532
## left son=2 (282 obs) right son=3 (417 obs)
## Primary splits:
## FC splits as RLL, improve=3.4670340, (0 missing)
## CR splits as RRL, improve=3.3154150, (0 missing)
## VIXC splits as RL, improve=2.6857260, (0 missing)
## YC splits as RLR, improve=0.1161307, (0 missing)
## Surrogate splits:
## VIXC splits as RL, agree=0.770, adj=0.429, (0 split)
## YC splits as RRL, agree=0.767, adj=0.422, (0 split)
## CR splits as RLL, agree=0.652, adj=0.138, (0 split)
##
## Node number 2: 282 observations, complexity param=0.02752294
## predicted class=0 expected loss=0.4716312 P(node) =0.4034335
## class counts: 149 133
## probabilities: 0.528 0.472
## left son=4 (163 obs) right son=5 (119 obs)
## Primary splits:
## YC splits as LLR, improve=1.8036130, (0 missing)
## FC splits as -RL, improve=0.5776853, (0 missing)
## VIXC splits as RL, improve=0.3160259, (0 missing)
## Surrogate splits:
## VIXC splits as RL, agree=0.830, adj=0.597, (0 split)
## FC splits as -RL, agree=0.762, adj=0.437, (0 split)
##
## Node number 3: 417 observations
## predicted class=1 expected loss=0.4268585 P(node) =0.5965665
## class counts: 178 239
## probabilities: 0.427 0.573
##
## Node number 4: 163 observations
## predicted class=0 expected loss=0.4233129 P(node) =0.2331903
## class counts: 94 69
## probabilities: 0.577 0.423
##
## Node number 5: 119 observations
## predicted class=1 expected loss=0.4621849 P(node) =0.1702432
## class counts: 55 64
## probabilities: 0.462 0.538
This tells us…
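The cross-validation results reported in the summary can also be examined and used for pruning (a short added illustration using standard rpart functions):
printcp(model) # the CP table with cross-validated error (xerror) for each split
# prune back to the tree with the lowest cross-validated error
best_cp <- model$cptable[which.min(model$cptable[, "xerror"]), "CP"]
pruned <- prune(model, cp = best_cp)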
It is also possible to plot the decision tree. The
rpart.plot package is useful here.
require(rpart.plot)
## Loading required package: rpart.plot
rpart.plot(model, main = "Investment Decision Tree",
box.palette = 2)
The plotted tree shows which categorical variables drive the splits: the level of the Fed funds rate and the slope of the yield curve, with the VIX and credit categories acting as close surrogates (see the variable importance in the summary above).
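The dac_test set created earlier has not yet been used. As a check, the fitted tree can be applied to it and compared with the actual outcomes (an added sketch using the standard predict method for rpart):
tree_pred <- predict(model, newdata = dac_test, type = "class")
# confusion table of predicted against actual up days
tree_table <- table(Predicted = tree_pred, Actual = dac_test$Upday)
tree_table
sum(diag(tree_table)) / sum(tree_table) # overall accuracy on the test set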
In addition, machine learning often uses the power of the machine to simulate and re-sample as a way of understanding the uncertainty in the data and separating the signal from the noise; bootstrap re-sampling and cross-validation are examples of such techniques.
Neural networks use the brain metaphor: a network of neurons is combined through weights and biases that are optimised so that the features explain the target variable.
This is an example of using the neuralnet package in R. It uses financial variables to predict whether the stock market will move up or down. The variables are: the VIX index of implied option volatility; the level of the SPY S&P 500 ETF; the level of the TLT US government bond ETF; the US 10-year less 2-year yield-curve slope; the credit spread between corporate and government bonds; and the level of the Fed funds rate. A neural net with one hidden layer is used to forecast whether the S&P 500 will go up or down the next day.
Use the caret package and import the data (this was originally downloaded from Bloomberg).
require(caret)
da <- read.csv('../Data/FinRegimes.csv')
da$Date <- as.Date(da$Date, format = "%d/%m/%Y")
colnames(da) <- c("Date", "VIC", "Yield", "SPY",
"TLT", "Credit", "Fed")
# Remove NAs
da <- na.omit(da)
head(da)
## Date VIC Yield SPY TLT Credit Fed
## 3 2025-03-25 17.15 29.163 575.46 89.76 2730.54 4.33
## 4 2025-03-24 17.48 29.580 574.08 89.77 2730.29 4.33
## 5 2025-03-21 19.28 29.392 563.98 90.70 2723.89 4.33
## 6 2025-03-20 19.80 26.908 565.49 91.24 2725.84 4.33
## 7 2025-03-19 19.90 26.224 567.13 91.18 2722.72 4.33
## 8 2025-03-18 21.70 23.911 561.02 90.71 2715.76 4.33
These are the explanatory variables that are supposed to explain the direction of US stocks. Now create a binary variable that is 1 if stocks go up and zero if they go down.
upday <- c(ifelse(da$SPY[1:(length(da$SPY) -1)]
/da$SPY[2:length(da$SPY)] >= 1,
1, 0), 0)
Create a function that will normalise the data between zero and one and apply it to each column. Combine the upday variable with the normalised data, but lag the explanatory variables by one day so that the trading signal is available in advance: the explanatory variables assess whether the US stock market will go up or down tomorrow.
normalise <- function(x){
return((x - min(x)) / (max(x) - min(x)))
}
da <- apply(da[, -1], MARGIN = 2, FUN = normalise)
dac <- cbind(upday[1:(length(upday) - 1)], da[2:length(upday), ])
dac <- data.frame(dac)
colnames(dac)[1] <- "Upday"
head(dac)
## Upday VIC Yield SPY TLT Credit Fed
## 4 1 0.1242813 0.5192425 0.8940782 0.07882883 0.9910820 0.8109641
## 5 1 0.1640867 0.5185366 0.8665413 0.08930180 0.9837364 0.8109641
## 6 0 0.1755860 0.5092101 0.8706582 0.09538288 0.9859745 0.8109641
## 7 0 0.1777974 0.5066419 0.8751295 0.09470721 0.9823935 0.8109641
## 8 1 0.2176028 0.4979575 0.8584710 0.08941441 0.9744052 0.8109641
## 9 0 0.1912870 0.5021251 0.8751840 0.08840090 0.9753922 0.8109641
Split the data into training and test sets.
set.seed(123)
train <- NROW(dac) * 0.75
train_sample <- sample(1:train, size = train * 0.75)
dac_train <- dac[train_sample, ]
dac_test <- dac[-train_sample, ]
Now create the formula that will be applied to the data, forecasting whether the market is up or down based on the explanatory variables.
myformula <- as.formula(paste("Upday ~", paste(colnames(da),
collapse = "+")))
myformula
## Upday ~ VIC + Yield + SPY + TLT + Credit + Fed
The next step is to run the neural network. It has one hidden layer with two neurons. There are many options that can be fine-tuned to improve the model, but here we take the default values (a tuned example is sketched at the end of this section).
require(neuralnet)
## Loading required package: neuralnet
mynet <- neuralnet(myformula,
hidden = 2,
  data = dac_train)
summary(mynet)
## Length Class Mode
## call 4 -none- call
## response 699 -none- numeric
## covariate 4194 -none- numeric
## model.list 2 -none- list
## err.fct 1 -none- function
## act.fct 1 -none- function
## linear.output 1 -none- logical
## data 7 data.frame list
## exclude 0 -none- NULL
## net.result 1 -none- list
## weights 1 -none- list
## generalized.weights 1 -none- list
## startweights 1 -none- list
## result.matrix 20 -none- numeric
plot(mynet, 'best')
This is a visual representation of the neural net with one hidden layer.
Now use the model to predict the probability of an up day for the test dataset. Create a data frame of the predicted probability and the actual outcome, and then calculate the key performance metrics.
myprediction <- data.frame(cbind(predict(mynet, dac_test),
dac_test$Upday))
colnames(myprediction) <- c("Predict", "Actual")
head(myprediction)
## Predict Actual
## 4 0.6769887 1
## 6 0.6812168 0
## 10 0.6649707 1
## 12 0.7184533 0
## 15 0.6585777 0
## 18 0.6813815 1
mytable <- table(Actual = myprediction$Actual, Predict =
myprediction$Predict > 0.5)
MyDown <- sum(mytable[1, ]) / sum(mytable)
MyUp <- sum(mytable[2, ]) / sum(mytable)
MyAccuracy <- sum(mytable[1, 1] + mytable[2, 2]) / sum(mytable)
MyPrecision <- mytable[2, 2] / (mytable[2, 2] + mytable[1, 2])
MySpecificity <- mytable[1, 1] / (mytable[1, 1] + mytable[1, 2])
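To make them easier to compare, the five measures can be gathered into a single named vector (a small added convenience using the objects defined above):
results <- c(DownShare = MyDown, UpShare = MyUp, Accuracy = MyAccuracy,
             Precision = MyPrecision, Specificity = MySpecificity)
round(results, 2)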
Now we have the following results:
Therefore, we can see that nearly sixty percent of days are up days and about forty percent are down days. Are we trying to identify the times to invest in stocks, or the times to be out of the market? In the first case we want high precision; in the second we want high specificity. On that basis, it is suggested that we hold stocks so long as the neural net classifier indicates that the next day will be an up day.
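The model above uses the neuralnet defaults apart from the two hidden neurons. For reference, a more heavily tuned call might look like the following (an illustrative sketch of standard neuralnet arguments; the particular settings are assumptions, not recommendations):
mynet_tuned <- neuralnet(myformula,
  data = dac_train,
  hidden = c(4, 2), # two hidden layers with four and two neurons
  err.fct = "ce", # cross-entropy error for a binary target
  linear.output = FALSE, # logistic output for a 0/1 response
  stepmax = 1e5, # maximum number of training steps
  rep = 3) # train three times from different random starts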
In a language model, the sampling temperature behaves like heat and entropy: at a higher temperature there is more disorder and a greater variety of possible outputs.
The softmax function converts the model's logit scores into probabilities.
\[P(x_i) = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}\]
To change the temperature, divide each z by T before applying the softmax. If T = 1 the probabilities are unchanged; if T < 1 (say 0.5) the distribution sharpens and the most probable outcomes receive even more weight; if T > 1 (say 1.5) the distribution flattens and less probable outcomes receive more weight.
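With temperature, the softmax becomes \[P(x_i) = \frac{e^{z_i/T}}{\sum_{j=1}^{n} e^{z_j/T}}\] A small sketch in R with made-up logit scores illustrates the effect:
softmax <- function(z, temp = 1) {
  exp(z / temp) / sum(exp(z / temp))
}
z <- c(2.0, 1.0, 0.5) # made-up logit scores for three tokens
round(softmax(z, temp = 1.0), 3) # unchanged probabilities
round(softmax(z, temp = 0.5), 3) # sharper: more weight on the most probable token
round(softmax(z, temp = 1.5), 3) # flatter: more weight on the less probable tokens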
If you look around you will find hundreds of papers that apply deep learning or neural networks to the trading and investment of financial assets.