Using KNN in Predicting Price Direction

For my example of applying machine learning in finance, I chose the K-Nearest Neighbors (KNN) model for a classic classification problem: “Will prices go up or down in the future?” It is a simple, straightforward illustration of how a machine learning model, KNN in particular, can be used with financial data.

For this example, I am using historical prices of SPY, an ETF that tracks the S&P 500 index, taken from Yahoo Finance. My goal is to use a KNN model with price information at time \(t\) to predict whether the price is higher or lower at time \(t+1\). The features used for prediction are the Stochastic Oscillator, Williams %R, Price Rate of Change, Moving Average Convergence Divergence, and daily trading volume. The download requests data from 1990 to 2023, although the SPY history returned begins on 1993-01-29.

1. Data

1.1 Load the data

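# Load the packages used throughout
library(quantmod)  # getSymbols(); also attaches xts, zoo, and TTR
library(dplyr)     # %>%, mutate(), select(), lag(), tibble()
library(caret)     # createDataPartition(), preProcess(), confusionMatrix(); attaches ggplot2
library(class)     # knn()

# Load the SPY dataset
startDate <- as.Date("1990-01-01")
endDate <- as.Date("2023-12-31")
getSymbols("SPY", src = "yahoo", from = startDate, to = endDate)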

## [1] "SPY"

# Prepare the dataset
spy <- SPY[, c("SPY.Open", "SPY.High", "SPY.Low", "SPY.Close", "SPY.Volume")]

# Convert to data frame
spy <- data.frame(Date = index(spy), coredata(spy))

# Drop all NA values if there are any
spy <- na.omit(spy)

# Glimpse the dataset
head(spy)

##         Date SPY.Open SPY.High  SPY.Low SPY.Close SPY.Volume
## 1 1993-01-29 43.96875 43.96875 43.75000  43.93750    1003200
## 2 1993-02-01 43.96875 44.25000 43.96875  44.25000     480500
## 3 1993-02-02 44.21875 44.37500 44.12500  44.34375     201300
## 4 1993-02-03 44.40625 44.84375 44.37500  44.81250     529400
## 5 1993-02-04 44.96875 45.09375 44.46875  45.00000     531500
## 6 1993-02-05 44.96875 45.06250 44.71875  44.96875     492100

1.2 Create signal flags

Next, an extra variable called “Change” is added to the dataset. It records the daily price change: the closing price on day \(t\) minus the closing price on the previous day \(t-1\).

# Create a column called "Change" that holds the daily change in price.
# dplyr::lag() is named explicitly because stats::lag() does not shift a plain vector.
spy$Change <- spy$SPY.Close - dplyr::lag(spy$SPY.Close)

# Drop the rows that do not have the data to compute change in price
spy <- na.omit(spy)

# Glimpse the dataset
head(spy)

##         Date SPY.Open SPY.High  SPY.Low SPY.Close SPY.Volume   Change
## 2 1993-02-01 43.96875 44.25000 43.96875  44.25000     480500  0.31250
## 3 1993-02-02 44.21875 44.37500 44.12500  44.34375     201300  0.09375
## 4 1993-02-03 44.40625 44.84375 44.37500  44.81250     529400  0.46875
## 5 1993-02-04 44.96875 45.09375 44.46875  45.00000     531500  0.18750
## 6 1993-02-05 44.96875 45.06250 44.71875  44.96875     492100 -0.03125
## 7 1993-02-08 44.96875 45.12500 44.90625  44.96875     596100  0.00000

Following the addition of the “Change” variable, a column labeled “Flag” is added to the dataset. It takes the value 1 when the change in price is positive, -1 when it is negative, and 0 when the price does not change.

# Return 1 if change is positive, -1 if change is negative, and 0 if there's no change
spy$Flag <- ifelse(spy$Change > 0, 1, ifelse(spy$Change < 0, -1, 0))

# Glimpse the dataset
head(spy)

##         Date SPY.Open SPY.High  SPY.Low SPY.Close SPY.Volume   Change Flag
## 2 1993-02-01 43.96875 44.25000 43.96875  44.25000     480500  0.31250    1
## 3 1993-02-02 44.21875 44.37500 44.12500  44.34375     201300  0.09375    1
## 4 1993-02-03 44.40625 44.84375 44.37500  44.81250     529400  0.46875    1
## 5 1993-02-04 44.96875 45.09375 44.46875  45.00000     531500  0.18750    1
## 6 1993-02-05 44.96875 45.06250 44.71875  44.96875     492100 -0.03125   -1
## 7 1993-02-08 44.96875 45.12500 44.90625  44.96875     596100  0.00000    0
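
Note that Flag, as built here, labels the move from day \(t-1\) to day \(t\). For the stated goal of predicting \(t+1\) from information at \(t\), one option would be to shift the label forward by one row. The following is only a minimal sketch on toy prices; the vector `close` and the name `next_flag` are hypothetical and not part of the analysis in this post.

```r
# Hypothetical toy closing prices
close <- c(10, 11, 10.5, 10.5, 12)

change    <- c(NA, diff(close))   # close[t] - close[t-1]
flag      <- sign(change)         # direction of the move INTO day t
next_flag <- c(flag[-1], NA)      # direction of the move from day t to day t+1

# Each row now pairs day t's data with the following day's direction;
# the last row has no label yet and would be dropped with na.omit()
data.frame(close, flag, next_flag)
```

With such a shifted label, every feature in a row is known at the close of day \(t\), while the label describes what happens afterwards.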

2. Feature engineering

The original dataset from Yahoo Finance has 5 variables: daily opening price, daily high, daily low, daily closing price, and daily trading volume. I will use them to calculate additional technical indicators: the Stochastic Oscillator, Williams %R, Price Rate of Change, and Moving Average Convergence Divergence.

2.1 Indicator calculation: Stochastic Oscillator

The Stochastic Oscillator tracks the speed, or momentum, of the price. As a rule, momentum changes before the price changes. It measures the level of the closing price relative to the low-high range over a period of time, here 14 trading days.
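
In symbols, a sketch of the usual definition with a 14-day lookback, where \(C_t\) is the close on day \(t\) and \(L_{14,t}\) and \(H_{14,t}\) are the lowest low and the highest high over the 14 trading days ending on day \(t\) (this notation is introduced here for illustration):

\[
\%K_t = 100 \times \frac{C_t - L_{14,t}}{H_{14,t} - L_{14,t}}
\]

For example, with hypothetical values \(C_t = 45\), \(L_{14,t} = 43\), and \(H_{14,t} = 46\), \(\%K_t = 100 \times 2/3 \approx 66.7\): the close sits two-thirds of the way up the 14-day range.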

# Calculate the Stochastic Oscillator (%K) over a 14-day rolling window
n <- 14
spy <- spy %>%
  mutate(Low_14 = rollapply(SPY.Low, n, min, fill = NA, align = 'right'),
         High_14 = rollapply(SPY.High, n, max, fill = NA, align = 'right'),
         k_percent = 100 * ((SPY.Close - Low_14) / (High_14 - Low_14)))

2.2 Indicator calculation: Williams %R

Williams %R ranges from -100 to 0. A value above -20 indicates a sell signal, and a value below -80 indicates a buy signal.

# Calculate the Williams %R indicator (reuses the 14-day high and low from above)
spy <- spy %>%
  mutate(r_percent = ((High_14 - SPY.Close) / (High_14 - Low_14)) * -100)
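
One thing worth noticing is that the two formulas coded so far are algebraically linked. Writing \(C\) for the close and \(L\), \(H\) for the 14-day low and high (notation introduced here for illustration), the r_percent expression can be rearranged as follows:

\[
\%R = -100 \times \frac{H - C}{H - L}
    = 100 \times \frac{(C - L) - (H - L)}{H - L}
    = 100 \times \frac{C - L}{H - L} - 100
    = \%K - 100
\]

So r_percent is k_percent shifted by a constant. After centering and scaling, the two columns coincide, which in a distance-based model such as KNN effectively gives this one signal double weight.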

2.3 Indicator calculation: Moving Average Convergence Divergence (MACD)

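MACD is the difference between the 12-day and the 26-day Exponential Moving Average (EMA) of the closing price, and the Signal Line is the 9-day EMA of the MACD. When the MACD goes below the Signal Line, it indicates a sell signal. When it goes above the Signal Line, it indicates a buy signal.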

# Calculate the MACD
spy <- spy %>%
  mutate(ema_26 = EMA(SPY.Close, n = 26),
         ema_12 = EMA(SPY.Close, n = 12),
         MACD = ema_12 - ema_26,
         MACD_EMA = EMA(MACD, n = 9))
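
To make the recursion behind these columns concrete, here is a rough base-R sketch. The helper `ema()` and the toy `close` series are hypothetical; the smoothing factor \(2/(n+1)\) and the seeding with a simple mean of the first \(n\) values are assumptions about how TTR's EMA behaves by default, so small numerical differences from the package are possible.

```r
# Base-R EMA sketch: smoothing factor 2 / (n + 1), seeded with the simple
# mean of the first n values (assumed to mirror the TTR defaults)
ema <- function(x, n) {
  stopifnot(length(x) >= n)
  alpha <- 2 / (n + 1)
  out <- rep(NA_real_, length(x))
  out[n] <- mean(x[1:n])
  if (length(x) > n) {
    for (i in (n + 1):length(x)) {
      out[i] <- alpha * x[i] + (1 - alpha) * out[i - 1]
    }
  }
  out
}

# Hypothetical toy closing prices (60 days)
close <- 100 + cumsum(sin(1:60))

# MACD line: fast EMA minus slow EMA (NA for the first 25 days)
macd <- ema(close, 12) - ema(close, 26)

# Signal line: 9-day EMA of the MACD, computed on its non-missing part
signal <- rep(NA_real_, length(close))
ok <- !is.na(macd)
signal[ok] <- ema(macd[ok], 9)

# +1 where the MACD is above the signal line (buy), -1 where below (sell)
macd_signal <- sign(macd - signal)
```

The first usable signal value appears only after 26 + 9 - 1 = 34 observations, which is why the later na.omit() call drops the earliest rows of the dataset.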

2.4 Indicator calculation: Price Rate of Change

It measures the most recent change in price relative to the price \(n\) days ago. In this case, I set \(n\) to 7.
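
Written out, with \(C_t\) denoting the close on day \(t\) (notation introduced here for illustration):

\[
\mathrm{ROC}_t = \frac{C_t - C_{t-n}}{C_{t-n}}
\]

For example, with hypothetical closes \(C_t = 45.00\) and \(C_{t-7} = 44.25\), \(\mathrm{ROC}_t = 0.75 / 44.25 \approx 0.0169\), a gain of about 1.7% over seven trading days. The value is kept as a fraction rather than multiplied by 100.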

# Calculate the Price Rate of Change
n <- 7
spy <- spy %>%
  mutate(Price_Rate_Of_Change = (SPY.Close - lag(SPY.Close, n)) / lag(SPY.Close, n))

3. Using the KNN model

3.1 Split data into training and testing sets

# Remove rows with NA values
spy <- na.omit(spy)

# Select features and target
spy_features <- spy %>%
  select(k_percent, r_percent, Price_Rate_Of_Change, MACD, SPY.Volume)
spy_target <- spy$Flag

# Split the data into training (60%) and testing (40%) sets
set.seed(8012024)
trainIndex <- createDataPartition(spy_target, p = 0.6, list = FALSE)
trainX <- spy_features[trainIndex, ]
testX <- spy_features[-trainIndex, ]
trainY <- spy_target[trainIndex]
testY <- spy_target[-trainIndex]

3.2 Normalize the data

# Standardize the features (center and scale) using training-set statistics only
preProcess_range_model <- preProcess(trainX, method = c("center", "scale"))
trainX <- predict(preProcess_range_model, trainX)
testX <- predict(preProcess_range_model, testX)
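
KNN classifies by distance between feature vectors, so a feature measured in hundreds of thousands (volume) would otherwise swamp oscillators that live between 0 and 100. The sketch below repeats the center-and-scale step in base R on hypothetical toy values; the data frames `train` and `test` are made up for illustration, and the point is that the test row is transformed with the training mean and standard deviation.

```r
# Hypothetical toy feature values
train <- data.frame(k_percent = c(20, 50, 80), volume = c(4e5, 5e5, 6e5))
test  <- data.frame(k_percent = 50, volume = 5.5e5)

# Raw Euclidean distances from the test row: volume dominates completely
raw_dist <- sqrt((train$k_percent - test$k_percent)^2 +
                 (train$volume - test$volume)^2)

# Mean and standard deviation come from the training set only
mu    <- colMeans(train)
sigma <- sapply(train, sd)

train_z <- scale(train, center = mu, scale = sigma)
test_z  <- scale(test,  center = mu, scale = sigma)

# Distances after standardization: both features now contribute
z_dist <- sqrt(rowSums(sweep(train_z, 2, as.numeric(test_z))^2))
```

On the raw scale the second and third training rows are almost equally close to the test row, because the 30-point gap in k_percent is invisible next to the volume gap. After standardization the second row is clearly the nearest neighbor.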

3.3 Find the best k value

# Test-set accuracy for each k from 1 to 20
acc <- list()
for (i in 1:20) {
  knn_pred <- knn(train = trainX, test = testX, cl = trainY, k = i)
  acc[[as.character(i)]] <- mean(knn_pred == testY)
}

acc <- unlist(acc)
tibble(acc = acc) %>%
  mutate(k = row_number()) %>%
  ggplot(aes(k, acc)) +
  geom_col(aes(fill = k == which.max(acc))) +
  labs(x = 'K', y = 'Accuracy', title = 'KNN Accuracy for different values of K') +
  scale_x_continuous(breaks = 1:20) +
  scale_y_continuous(breaks = round(c(seq(0.90, 0.94, 0.01), max(acc)),
                                    digits = 3)) +
  geom_hline(yintercept = max(acc), lty = 2) +
  coord_cartesian(ylim = c(min(acc), max(acc))) +
  guides(fill = "none")

Running k from 1 to 20, the highest test-set accuracy is reached at k = 14, so I will use k = 14 for the final model.
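
Because k is picked here on the same test set that is later used to report accuracy, the reported figure can be somewhat optimistic. One common alternative, sketched below on hypothetical simulated data, is to choose k on a separate validation split and touch the test split only once. The data, the split sizes, and the variable names are all assumptions made for illustration.

```r
library(class)  # knn(); ships with standard R distributions

set.seed(1)
# Hypothetical toy data: two standardized features, two classes
n <- 300
x <- matrix(rnorm(2 * n), ncol = 2)
y <- factor(ifelse(x[, 1] + rnorm(n, sd = 0.5) > 0, 1, -1))

# Three-way split: fit / validation (to choose k) / test (used once)
idx <- sample(rep(c("fit", "val", "test"), times = c(150, 75, 75)))

# Validation accuracy for each candidate k
ks <- 1:20
val_acc <- sapply(ks, function(k) {
  pred <- knn(train = x[idx == "fit", ], test = x[idx == "val", ],
              cl = y[idx == "fit"], k = k)
  mean(pred == y[idx == "val"])
})
best_k <- ks[which.max(val_acc)]

# Final evaluation on the untouched test split, using fit + validation rows
final_pred <- knn(train = x[idx != "test", ], test = x[idx == "test", ],
                  cl = y[idx != "test"], k = best_k)
test_acc <- mean(final_pred == y[idx == "test"])
```

The same pattern would apply to trainX and testX above by carving a validation part out of the training rows.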

3.4 Train the model

# Run k-NN with k = 14; knn() has no separate training step and
# returns the predicted classes for testX directly
knn_model <- knn(train = trainX, test = testX, cl = trainY, k = 14)

4. Results

I will use a confusion matrix to evaluate the model.

# Evaluate the model
conf_matrix <- confusionMatrix(knn_model, as.factor(testY))
print(conf_matrix)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   -1    0    1
##         -1  892   11  439
##         0     0    0    0
##         1   526   19 1213
## 
## Overall Statistics
##                                           
##                Accuracy : 0.679           
##                  95% CI : (0.6623, 0.6955)
##     No Information Rate : 0.5329          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3578          
##                                           
##  Mcnemar's Test P-Value : 3.051e-08       
## 
## Statistics by Class:
## 
##                      Class: -1 Class: 0 Class: 1
## Sensitivity             0.6291 0.000000   0.7343
## Specificity             0.7325 1.000000   0.6236
## Pos Pred Value          0.6647      NaN   0.6900
## Neg Pred Value          0.7008 0.990323   0.6729
## Prevalence              0.4574 0.009677   0.5329
## Detection Rate          0.2877 0.000000   0.3913
## Detection Prevalence    0.4329 0.000000   0.5671
## Balanced Accuracy       0.6808 0.500000   0.6789
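
The headline numbers can be recomputed by hand from the table, which helps in reading the output. The sketch below copies the counts from the confusion matrix above into a plain matrix; the variable names are chosen here for illustration.

```r
# Counts copied from the output above (rows = predicted, columns = actual)
cm <- matrix(c(892, 11,  439,
                 0,  0,    0,
               526, 19, 1213),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = c("-1", "0", "1"),
                             Reference  = c("-1", "0", "1")))

accuracy <- sum(diag(cm)) / sum(cm)      # share of correct predictions
nir      <- max(colSums(cm)) / sum(cm)   # always predict the most common class
recall   <- diag(cm) / colSums(cm)       # sensitivity per actual class

round(c(accuracy = accuracy, nir = nir), 4)
round(recall, 4)
```

The accuracy is 2105 correct out of 3100 test days, and the No Information Rate is the share of up days (1652 out of 3100), which is the score a model would get by always predicting 1.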

# Plot the confusion matrix
cm_tbl <- tibble::as_tibble(conf_matrix$table)

# Axis orderings for the plot
prediction_levels <- c("1", "0", "-1")
reference_levels <- c("-1", "0", "1")

ggplot(data = cm_tbl, aes(x = Reference, y = Prediction, fill = n)) +
  geom_tile(color = "black") +
  geom_text(aes(label = n), vjust = 1) +
  scale_fill_gradient(low = "white", high = "blue") +
  labs(x = "Actual", y = "Predicted", title = "Confusion Matrix") +
  scale_x_discrete(limits = reference_levels) +
  scale_y_discrete(limits = prediction_levels)

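The KNN model reaches an accuracy of 67.9% on the test set, above the No Information Rate of 53.3% that would be obtained by always predicting an up day. Note, however, that Flag as built in section 1.2 is the change from day \(t-1\) to day \(t\), and the indicators are computed with the close of day \(t\), so this figure measures how well the indicators classify the same day's direction rather than the next day's. Within that limit, the outcome illustrates how a machine learning model can be applied to a financial classification task.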

Machine Learning Application in Finance

PhamMinhTam

2024-06-12