As an example of machine learning applied to finance, I chose the
K-Nearest Neighbors (KNN) model for a classic classification
problem: “Will prices go up or down in the future?” It is a simple,
direct illustration of how a machine learning model, KNN in
particular, can be used with financial data.
For this example, I am using the historical prices of SPY, an
exchange-traded fund that tracks the S&P 500 index, taken from
Yahoo! Finance. My goal is to use a KNN model with price information
at time \(t\) to predict whether the price is higher or lower at
time \(t+1\). The features used for the prediction are the Stochastic
Oscillator, Williams %R, Price Rate of Change, Moving Average
Convergence Divergence, and daily trading volume. The download requests
the years 1990 to 2023; the returned data start on 1993-01-29.
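# Load packages (quantmod attaches xts, zoo, TTR; caret attaches ggplot2), then the SPY dataset
library(quantmod)
library(dplyr)
library(caret)
library(class)
startDate <- as.Date("1990-01-01")
endDate <- as.Date("2023-12-31")
getSymbols("SPY", src = "yahoo", from = startDate, to = endDate)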
## [1] "SPY"
# Prepare the dataset
spy <- SPY[, c("SPY.Open", "SPY.High", "SPY.Low", "SPY.Close", "SPY.Volume")]
# Convert to data frame
spy <- data.frame(Date = index(spy), coredata(spy))
# Drop all NA values if there are any
spy <- na.omit(spy)
# Glimpse the dataset
head(spy)
## Date SPY.Open SPY.High SPY.Low SPY.Close SPY.Volume
## 1 1993-01-29 43.96875 43.96875 43.75000 43.93750 1003200
## 2 1993-02-01 43.96875 44.25000 43.96875 44.25000 480500
## 3 1993-02-02 44.21875 44.37500 44.12500 44.34375 201300
## 4 1993-02-03 44.40625 44.84375 44.37500 44.81250 529400
## 5 1993-02-04 44.96875 45.09375 44.46875 45.00000 531500
## 6 1993-02-05 44.96875 45.06250 44.71875 44.96875 492100
Next, I add an extra variable called “Change” to the dataset. It records the daily price change: the closing price on day \(t\) minus the closing price on the previous day \(t-1\).
# Create a column called "Change" that holds the daily change in price.
# dplyr::lag() shifts the vector by one row; base stats::lag() would not.
spy$Change <- spy$SPY.Close - dplyr::lag(spy$SPY.Close)
# Drop the rows that do not have the data to compute change in price
spy <- na.omit(spy)
# Glimpse the dataset
head(spy)
## Date SPY.Open SPY.High SPY.Low SPY.Close SPY.Volume Change
## 2 1993-02-01 43.96875 44.25000 43.96875 44.25000 480500 0.31250
## 3 1993-02-02 44.21875 44.37500 44.12500 44.34375 201300 0.09375
## 4 1993-02-03 44.40625 44.84375 44.37500 44.81250 529400 0.46875
## 5 1993-02-04 44.96875 45.09375 44.46875 45.00000 531500 0.18750
## 6 1993-02-05 44.96875 45.06250 44.71875 44.96875 492100 -0.03125
## 7 1993-02-08 44.96875 45.12500 44.90625 44.96875 596100 0.00000
Next, I add a column labeled “Flag” to the dataset. It takes the value 1 when the change in price is positive, -1 when the change is negative, and 0 when the price is unchanged.
# Return 1 if change is positive, -1 if change is negative, and 0 if there's no change
spy$Flag <- ifelse(spy$Change > 0, 1, ifelse(spy$Change < 0, -1, 0))
# Glimpse the dataset
head(spy)
## Date SPY.Open SPY.High SPY.Low SPY.Close SPY.Volume Change Flag
## 2 1993-02-01 43.96875 44.25000 43.96875 44.25000 480500 0.31250 1
## 3 1993-02-02 44.21875 44.37500 44.12500 44.34375 201300 0.09375 1
## 4 1993-02-03 44.40625 44.84375 44.37500 44.81250 529400 0.46875 1
## 5 1993-02-04 44.96875 45.09375 44.46875 45.00000 531500 0.18750 1
## 6 1993-02-05 44.96875 45.06250 44.71875 44.96875 492100 -0.03125 -1
## 7 1993-02-08 44.96875 45.12500 44.90625 44.96875 596100 0.00000 0
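“Flag” as built above labels the move from day \(t-1\) to day \(t\). If the aim is to predict day \(t+1\) from information available on day \(t\), one option is to shift the label one row forward. The following is a minimal sketch on made-up closing prices, assuming dplyr is installed; `toy`, `NextChange`, and `NextFlag` are illustrative names and are not part of the analysis in this post.

```r
library(dplyr)

# Made-up closing prices, for illustration only
toy <- data.frame(Close = c(100, 101, 101, 99, 102))

toy <- toy %>%
  mutate(
    # Change from day t to day t+1 (NA on the last row: no next day yet)
    NextChange = dplyr::lead(Close) - Close,
    # 1 = up tomorrow, -1 = down tomorrow, 0 = unchanged
    NextFlag = sign(NextChange)
  ) %>%
  na.omit()

print(toy)
```

With a label built this way, every feature on a row is known at the close of day \(t\), while the label refers to the following day.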
The original dataset taken from Yahoo! Finance has 5 variables: daily opening price, daily highest price, daily lowest price, daily closing price, and daily trading volume. I will use these 5 variables to calculate additional technical indicators: the Stochastic Oscillator, Williams %R, Price Rate of Change, and Moving Average Convergence Divergence.
The Stochastic Oscillator tracks the speed, or momentum, of the price. As a rule, momentum changes direction before the price does. The oscillator measures where the closing price sits within the low-high range over a period of time, here 14 trading days.
# Calculate the Stochastic Oscillator (%K) over a 14-day window
n <- 14
spy <- spy %>%
  mutate(Low_14 = rollapply(SPY.Low, n, min, fill = NA, align = 'right'),
         High_14 = rollapply(SPY.High, n, max, fill = NA, align = 'right'),
         k_percent = 100 * ((SPY.Close - Low_14) / (High_14 - Low_14)))
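In symbols, the `k_percent` column above can be written as follows, where \(C_t\) is the close on day \(t\) and \(L_{14}\) and \(H_{14}\) are the lowest low and highest high over the 14 trading days ending on day \(t\) (this notation is introduced here for illustration):

\[
\%K_t = 100 \times \frac{C_t - L_{14}}{H_{14} - L_{14}}
\]

A value near 100 means the close sits near the top of its recent range; a value near 0 means it sits near the bottom.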
Williams %R ranges from -100 to 0. A value above -20 indicates a sell signal, and a value below -80 indicates a buy signal.
# Calculate the Williams %R indicator
spy <- spy %>%
  mutate(r_percent = ((High_14 - SPY.Close) / (High_14 - Low_14)) * -100)
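Because `r_percent` uses the same 14-day high and low as `k_percent`, the two columns are related by a constant shift. Writing \(C_t\) for the close and \(L_{14}\), \(H_{14}\) for the 14-day low and high (notation introduced here for illustration):

\[
\%R_t = -100 \times \frac{H_{14} - C_t}{H_{14} - L_{14}}
      = 100 \times \frac{C_t - L_{14}}{H_{14} - L_{14}} - 100
      = \%K_t - 100
\]

As computed here, the two features therefore carry the same information, which is worth keeping in mind when both are passed to a distance-based model such as KNN.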
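Moving Average Convergence Divergence (MACD) is the difference between the 12-day and 26-day Exponential Moving Averages (EMA) of the closing price. The Signal Line is the 9-day EMA of the MACD. When the MACD goes below the Signal Line, it indicates a sell signal. When it goes above the Signal Line, it indicates a buy signal.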
# Calculate the MACD
spy <- spy %>%
  mutate(ema_26 = EMA(SPY.Close, n = 26),
         ema_12 = EMA(SPY.Close, n = 12),
         MACD = ema_12 - ema_26,
         MACD_EMA = EMA(MACD, n = 9))
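Written out, and assuming `EMA()` is used with its default smoothing ratio \(\alpha = 2/(n+1)\), the columns above correspond to the following, with \(C_t\) the close on day \(t\) (notation introduced here for illustration):

\[
\mathrm{EMA}^{(n)}_t = \alpha\, C_t + (1 - \alpha)\, \mathrm{EMA}^{(n)}_{t-1},
\qquad \alpha = \frac{2}{n + 1}
\]

\[
\mathrm{MACD}_t = \mathrm{EMA}^{(12)}_t - \mathrm{EMA}^{(26)}_t,
\qquad
\mathrm{Signal}_t = \mathrm{EMA}^{(9)}\!\left(\mathrm{MACD}\right)_t
\]

The 12-day average reacts faster than the 26-day average, so a positive MACD means recent prices are running above the longer-run average.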
Price Rate of Change measures the most recent change in price relative to the price \(n\) days earlier. In this case, I will set \(n\) to 7.
# Calculate the Price Rate of Change
n <- 7
spy <- spy %>%
  mutate(Price_Rate_Of_Change = (SPY.Close - lag(SPY.Close, n)) / lag(SPY.Close, n))
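As a formula, with \(C_t\) the close on day \(t\) (notation introduced here for illustration):

\[
\mathrm{PROC}_t = \frac{C_t - C_{t-n}}{C_{t-n}}, \qquad n = 7
\]

For instance, with made-up prices, a close of 100 seven trading days ago and a close of 103 today give \(\mathrm{PROC}_t = (103 - 100)/100 = 0.03\), a 3% rise.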
# Remove rows with NA values
spy <- na.omit(spy)
# Select features and target
spy_features <- spy %>%
  select(k_percent, r_percent, Price_Rate_Of_Change, MACD, SPY.Volume)
spy_target <- spy$Flag
# Split the data into training (60%) and testing (40%) sets
set.seed(8012024)
trainIndex <- createDataPartition(spy_target, p = 0.6, list = FALSE)
trainX <- spy_features[trainIndex, ]
testX <- spy_features[-trainIndex, ]
trainY <- spy_target[trainIndex]
testY <- spy_target[-trainIndex]
# Standardize the features (center and scale), using training-set statistics
preProcess_range_model <- preProcess(trainX, method = c("center", "scale"))
trainX <- predict(preProcess_range_model, trainX)
testX <- predict(preProcess_range_model, testX)
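`createDataPartition()` samples rows at random, so training and test days end up interleaved in time. For time-ordered data, a common alternative is to train on the earliest rows and test on the latest, while still scaling with training-set statistics only. The following is a minimal base R sketch on simulated features; `X`, `cut`, `train_raw`, `test_raw`, `mu`, and `sigma` are illustrative names, not objects from the analysis above.

```r
set.seed(1)

# Simulated feature table, rows assumed to be in date order (illustrative data)
X <- data.frame(a = rnorm(100, mean = 5, sd = 2),
                b = rnorm(100, mean = -1, sd = 10))

# Chronological 60/40 split: earliest 60% for training, latest 40% for testing
cut       <- floor(0.6 * nrow(X))
train_raw <- X[seq_len(cut), ]
test_raw  <- X[(cut + 1):nrow(X), ]

# Center and scale both sets using training-set statistics only
mu      <- colMeans(train_raw)
sigma   <- apply(train_raw, 2, sd)
train_z <- scale(train_raw, center = mu, scale = sigma)
test_z  <- scale(test_raw,  center = mu, scale = sigma)

# Training columns now have mean 0 and sd 1; test columns are only close to that
round(colMeans(train_z), 10)
round(apply(train_z, 2, sd), 10)
```

Reusing `mu` and `sigma` on the test rows mirrors what `predict(preProcess_range_model, testX)` does above: no test-set information enters the scaling.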
# Compute test-set accuracy for k = 1, ..., 20
acc <- numeric(20)
for (i in 1:20) {
  knn_pred <- knn(train = trainX, test = testX, cl = trainY, k = i)
  acc[i] <- mean(knn_pred == testY)
}
# Plot accuracy against k, highlighting the best value
tibble(acc = acc) %>%
  mutate(k = row_number()) %>%
  ggplot(aes(k, acc)) +
  geom_col(aes(fill = k == which.max(acc))) +
  labs(x = 'K', y = 'Accuracy', title = 'KNN Accuracy for different values of K') +
  scale_x_continuous(breaks = 1:20) +
  scale_y_continuous(breaks = round(seq(min(acc), max(acc), length.out = 5),
                                    digits = 3)) +
  geom_hline(yintercept = max(acc), lty = 2) +
  coord_cartesian(ylim = c(min(acc), max(acc))) +
  guides(fill = "none")
Running k from 1 to 20, accuracy on the test set is highest at k = 14.
Thus, I will use k = 14 for the final model.
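The loop above scores each k on the test set, so the accuracy reported for the chosen k is likely to be somewhat optimistic. One way to choose k without touching the test set is leave-one-out cross-validation on the training data, which `class::knn.cv()` provides. The following is a minimal sketch that uses the built-in `iris` data as a stand-in for the training set; `train_x`, `train_y`, `ks`, `cv_acc`, and `best_k` are illustrative names.

```r
library(class)

set.seed(42)  # knn.cv() breaks ties at random

# Stand-in training data: standardized iris measurements and species labels
train_x <- scale(iris[, 1:4])
train_y <- iris$Species

# Leave-one-out cross-validated accuracy for each candidate k
ks <- 1:20
cv_acc <- sapply(ks, function(k) {
  pred <- knn.cv(train = train_x, cl = train_y, k = k)
  mean(pred == train_y)
})

# Candidate with the highest cross-validated accuracy
best_k <- ks[which.max(cv_acc)]
best_k
```

The test set is then used once, at the end, to estimate how the chosen k performs on unseen data.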
# Predict test-set labels with k = 14 (knn() returns predictions, not a fitted model)
knn_model <- knn(train = trainX, test = testX, cl = trainY, k = 14)
I will use a confusion matrix to evaluate the model.
# Evaluate the model
conf_matrix <- confusionMatrix(knn_model, as.factor(testY))
print(conf_matrix)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction   -1    0    1
##         -1  892   11  439
##         0     0    0    0
##         1   526   19 1213
##
## Overall Statistics
##
## Accuracy : 0.679
## 95% CI : (0.6623, 0.6955)
## No Information Rate : 0.5329
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3578
##
## Mcnemar's Test P-Value : 3.051e-08
##
## Statistics by Class:
##
##                      Class: -1 Class: 0 Class: 1
## Sensitivity             0.6291 0.000000   0.7343
## Specificity             0.7325 1.000000   0.6236
## Pos Pred Value          0.6647      NaN   0.6900
## Neg Pred Value          0.7008 0.990323   0.6729
## Prevalence              0.4574 0.009677   0.5329
## Detection Rate          0.2877 0.000000   0.3913
## Detection Prevalence    0.4329 0.000000   0.5671
## Balanced Accuracy       0.6808 0.500000   0.6789
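The headline figures can be re-derived from the counts printed above, which makes it clearer what they measure. The following is a small base R sketch; the counts are copied from the confusion matrix, and `cm`, `accuracy`, and `nir` are illustrative names.

```r
# Counts copied from the confusion matrix printed above
# (rows = predicted class, columns = actual class)
cm <- matrix(c(892, 11,  439,
                 0,  0,    0,
               526, 19, 1213),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = c("-1", "0", "1"),
                             Reference  = c("-1", "0", "1")))

# Accuracy: correct predictions (the diagonal) over all predictions
accuracy <- sum(diag(cm)) / sum(cm)

# No-information rate: accuracy of always predicting the most common actual class
nir <- max(colSums(cm)) / sum(cm)

round(c(accuracy = accuracy, no_information_rate = nir), 4)
# accuracy is about 0.6790, no-information rate about 0.5329
```

The no-information rate is the natural baseline here: up days make up about 53% of the test set, so a model has to beat that figure, not 50%, to add anything.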
# Plot the confusion matrix
# as_tibble() on the table gives the columns Prediction, Reference, and n
cm_tbl <- tibble::as_tibble(conf_matrix$table)
prediction_levels <- c("1", "0", "-1")
reference_levels <- c("-1", "0", "1")
ggplot(data = cm_tbl, aes(x = Reference, y = Prediction, fill = n)) +
  geom_tile(color = "black") +
  geom_text(aes(label = n), vjust = 1) +
  scale_fill_gradient(low = "white", high = "blue") +
  labs(x = "Actual", y = "Predicted", title = "Confusion Matrix") +
  scale_x_discrete(limits = reference_levels) +
  scale_y_discrete(limits = prediction_levels)
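The KNN model’s accuracy of 67.9% is well above the no-information rate
of 53.3%, the accuracy obtained by always predicting the most common
class, so the model does better than naive guessing on this test set.
One caveat applies to the next-day interpretation: as coded, “Flag” is
the direction of the price change on day \(t\), the same day on which
the indicators are measured, so the result describes how well the
indicators classify that day’s move. Within that limit, the example
shows how a machine learning model can be applied to a financial
classification task.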