Forex trading involves buying or selling currency. There are various approaches to forex trading, one of which is technical analysis, which involves using technical indicators in the trading strategy. Historical forex prices can be used to assess strategies before trading the live markets, to maximize profits and, more importantly, minimize losses.
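Build machine learning models to identify profitable buying and selling opportunities for the EUR/CAD currency pair using historical forex data sourced from OANDA. Here, we will build a random forest and a neural network model. The predictors will primarily be technical indicators derived from the price. The target will be a variable with 3 categories: 0 if the price did not move by more than 75 pips in either direction within 24 hours, 1 if the price moved up by more than 75 pips first, and 2 if the price moved down by more than 75 pips first.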
Model performance will be assessed from a trading perspective by quantifying the predictive power and the number of pips gained or lost. We assume a trading strategy where any open trade is closed once 75 pips have been gained, 75 pips have been lost, or 24 hours have passed, whichever comes first.
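The exit levels implied by this strategy can be sketched for a single buy trade. This is a minimal illustration, assuming one pip equals 0.0001 for the pair; the entry price, the 24-hour exit price and the helper names `pips_to_price()` and `price_to_pips()` are hypothetical.

```r
# Assumption: one pip is 0.0001 for this pair (four-decimal quoting).
pip_size <- 0.0001

# Convert a distance in pips to a price distance, and back.
pips_to_price <- function(pips) pips * pip_size
price_to_pips <- function(price_diff) price_diff / pip_size

entry_price <- 1.45644                          # hypothetical entry price of a buy trade
take_profit <- entry_price + pips_to_price(75)  # close here with a 75-pip gain
stop_loss   <- entry_price - pips_to_price(75)  # close here with a 75-pip loss

# If neither level is reached, the trade is closed after 24 hours.
# A hypothetical exit at 1.45900 would then be worth:
timed_exit_pips <- price_to_pips(1.45900 - entry_price)
```

For a sell trade the two levels swap roles and the sign of the timed exit flips.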
Loading the data from OANDA’s REST API using the previously defined HisPrices() function. The open, high, low, close (OHLC) prices from the 4-hour and 15-minute timeframes will be used.
# Retrieving 4-hour data from 2020 to present
date_interval <- as.character(seq(lubridate::ymd_hm("2020-01-01 0:00"),
lubridate::ymd_hm(stringr::str_sub(Sys.time(), 1, 16)),
by = "6 months"))
date_interval <- data.frame(from = date_interval, to = dplyr::lead(date_interval, default = as.character(Sys.Date())))
dataH4 <- mapply(function(from, to){HisPrices(accountType = "practice",
instrument = "EUR_CAD",
granularity = "H4",
tzone = "Canada/Eastern",
from = from, to = to)},
date_interval$from, date_interval$to,
SIMPLIFY = F) %>% dplyr::bind_rows()
# Retrieving 15-minute data from 2020 to present
date_interval <- as.character(seq(lubridate::ymd_hm("2020-01-01 0:00"),
lubridate::ymd_hm(stringr::str_sub(Sys.time(), 1, 16)),
by = "1 month"))
date_interval <- data.frame(from = date_interval, to = dplyr::lead(date_interval, default = as.character(Sys.Date())))
dataM15 <- mapply(function(from, to){HisPrices(accountType = "practice",
instrument = "EUR_CAD",
granularity = "M15",
tzone = "Canada/Eastern",
from = from, to = to)},
date_interval$from, date_interval$to,
SIMPLIFY = F) %>% dplyr::bind_rows()
save(list = c("dataH4", "dataM15"), file = "../data/forex_trade_prediction_data.Rdata")
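The retrieval code above splits the date range into consecutive request windows, presumably because the API limits how many candles one request can return (an assumption, not stated in this post). A minimal base R sketch of the same windowing, with a fixed hypothetical end date so that it is reproducible:

```r
# Fixed end date for reproducibility (the code above uses the current time instead).
end_date <- as.Date("2021-06-30")

# Window start dates, six months apart.
starts <- seq(as.Date("2020-01-01"), end_date, by = "6 months")

# Each window ends where the next one starts; the last one ends at end_date.
windows <- data.frame(
  from = as.character(starts),
  to   = as.character(c(starts[-1], end_date)),
  stringsAsFactors = FALSE
)
windows
```

Each row of `windows` corresponds to one call to the price function, and the results are stacked afterwards.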
load(file = "../data/forex_trade_prediction_data.Rdata")
head(dataH4)
## complete volume time_stamp open high low close
## 1 TRUE 6670 2020-01-01 17:00:00 1.45644 1.45686 1.45545 1.45574
## 2 TRUE 6650 2020-01-01 21:00:00 1.45574 1.45690 1.45520 1.45536
## 3 TRUE 15590 2020-01-02 01:00:00 1.45536 1.45597 1.45374 1.45430
## 4 TRUE 18915 2020-01-02 05:00:00 1.45430 1.45578 1.45370 1.45477
## 5 TRUE 27514 2020-01-02 09:00:00 1.45476 1.45504 1.45095 1.45168
## 6 TRUE 11810 2020-01-02 13:00:00 1.45166 1.45190 1.44990 1.45076
head(dataM15)
## complete volume time_stamp open high low close
## 1 TRUE 45 2020-01-01 17:00:00 1.45644 1.45686 1.45579 1.45600
## 2 TRUE 21 2020-01-01 17:15:00 1.45609 1.45631 1.45590 1.45605
## 3 TRUE 101 2020-01-01 17:30:00 1.45608 1.45616 1.45590 1.45614
## 4 TRUE 317 2020-01-01 17:45:00 1.45615 1.45644 1.45608 1.45642
## 5 TRUE 1444 2020-01-01 18:00:00 1.45638 1.45638 1.45562 1.45602
## 6 TRUE 280 2020-01-01 18:15:00 1.45602 1.45626 1.45596 1.45608
First, we determine how many candlesticks moved by 75 pips in both directions. For any such candlestick, we would not be able to tell in which direction the price reached 75 pips first.
num4h <- dataH4 %>% dplyr::summarise(both = sum((open - low) > 0.0075 & (high - open) > 0.0075))
num15m <- dataM15 %>% dplyr::summarise(both = sum((open - low) > 0.0075 & (high - open) > 0.0075))
There were 5 candlesticks in the 4-hour timeframe and 0 candlesticks in the 15-minute timeframe that moved more than 75 pips in both directions. The 15-minute data will be used to derive the target, since none of its candlesticks moved by 75 pips in both directions. This avoids ambiguity and assumptions about candlestick movement. A rolling function will be applied to create the 3-category target over the next 24-hour window (96 15-minute candlesticks correspond to 24 hours).
data_target = dataM15
# Relative to the current open price, identify the index of the first candle that reached >75 pips higher (if any)
data_target$index_at_incr <- runner::runner(x = dataM15,
f = function(x){which((x$high > (x$open[1] + 0.0075))[-1])[1]},
lag = -96, k = 97, na_pad = T)
# Relative to the current open price, identify the index of the first candle that reached >75 pips lower (if any)
data_target$index_at_decr <- runner::runner(x = dataM15,
f = function(x){which((x$low < (x$open[1] - 0.0075))[-1])[1]},
lag = -96, k = 97, na_pad = T)
# Adding the price after 24 hours
data_target$last_price <- runner::runner(x = dataM15,
f = function(x){x$open[97]},
lag = -96, k = 97, na_pad = T)
# Putting together the 3-category variable: 0 if price did not move >75 pips in either direction, 1 if price moved up first by >75 pips, 2 if price moved down first by >75 pips.
data_target = data_target %>%
dplyr::mutate(target = dplyr::case_when(
(!is.na(index_at_incr) & index_at_incr < index_at_decr) | (!is.na(index_at_incr) & is.na(index_at_decr)) ~ 1,
(!is.na(index_at_decr) & index_at_decr < index_at_incr) | (!is.na(index_at_decr) & is.na(index_at_incr)) ~ 2,
dplyr::row_number() < nrow(data_target) - 95 ~ 0 # otherwise 0, except for the last 96 rows, which stay NA because there is not enough future data to determine the target
))
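The labeling rule can be illustrated for a single entry point without the rolling window. This is a simplified base R sketch, not the code used above; `label_entry()` and the example prices are hypothetical.

```r
# Label one entry point: 1 = upper limit reached first, 2 = lower limit first,
# 0 = neither limit reached. `high` and `low` are the candles AFTER the entry.
label_entry <- function(entry, high, low, threshold = 0.0075) {
  up   <- which(high > entry + threshold)[1]  # first candle above the upper limit
  down <- which(low  < entry - threshold)[1]  # first candle below the lower limit
  if (is.na(up) && is.na(down)) return(0)
  if (is.na(down)) return(1)
  if (is.na(up)) return(2)
  if (up < down) return(1)
  if (down < up) return(2)
  0  # both limits hit in the same candle: order unknown, treated as no signal
}

# Upper limit (1.4575) is hit in candle 2, lower limit (1.4425) only in candle 3.
ex_up   <- label_entry(1.4500, high = c(1.4520, 1.4580, 1.4590),
                               low  = c(1.4490, 1.4510, 1.4400))
# Only the lower limit is hit.
ex_down <- label_entry(1.4500, high = c(1.4510, 1.4505),
                               low  = c(1.4420, 1.4495))
# Neither limit is hit.
ex_flat <- label_entry(1.4500, high = 1.4510, low = 1.4490)
```

In the full data set the same rule is applied to every 15-minute candle, looking 96 candles ahead.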
Adding continuous variables, including the parabolic stop and reverse (SAR), average directional index (ADX), moving average convergence divergence (MACD), relative strength index (RSI) and exponential moving average (EMA) indicators, as well as dummy variables for the London, New York and Tokyo trading sessions.
data_features = dataH4
sar.2 = TTR::SAR(data_features[, c("high","low")], accel = c(0.02, 0.2))
adx.14 = TTR::ADX(data_features[, c("high","low","close")], 14)[, 4]
data_features <- data_features %>%
dplyr::mutate(direction = dplyr::if_else(open <= close, "bull", "bear"),
sar.2 = sar.2,
sar.2_diff = sar.2 - close,
adx.14 = adx.14,
macd.12.24 = TTR::MACD(close, 12, 24, 9)[, "macd"],
macd.12.24_signal = TTR::MACD(close, 12, 24, 9)[, "signal"],
macd.12.24_status = dplyr::if_else(macd.12.24 >= macd.12.24_signal, "positive", "negative"),
rsi.12 = TTR::RSI(close, 12),
ema.120 = TTR::EMA(close, 120),
ema.120_diff = ema.120 - close,
target_time_stamp = dplyr::lead(time_stamp), # the target corresponds to the next time stamp
hour = lubridate::hour(time_stamp),
day = weekdays(time_stamp),
session_london = ifelse(dplyr::between(hour, 3, 12), 1, 0),
session_newyork = ifelse(dplyr::between(hour, 8, 17), 1, 0),
session_tokyo = ifelse(dplyr::between(hour, 19, 24) | dplyr::between(hour, 0, 4), 1, 0)
) %>%
dplyr::ungroup()
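To see how the session dummies behave, the hour ranges from the code above can be applied to a few hypothetical candle hours (local Canada/Eastern time). The helper `in_range()` is an illustrative stand-in for the range check.

```r
# Hypothetical 4-hour candle hours in local time.
hour <- c(1, 5, 9, 13, 17, 21)

# Inclusive range check returning a 0/1 dummy.
in_range <- function(x, lo, hi) as.integer(x >= lo & x <= hi)

sessions <- data.frame(
  hour            = hour,
  session_london  = in_range(hour, 3, 12),
  session_newyork = in_range(hour, 8, 17),
  # The Tokyo range wraps around midnight: 19:00 onwards or up to 04:00.
  session_tokyo   = as.integer(hour >= 19 | hour <= 4)
)
sessions
```

The sessions overlap, so a candle can belong to more than one of them (the 09:00 candle falls in both London and New York), which is why three separate dummies are used rather than one categorical variable.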
Joining together the target and predictors. Trading signals will be based on the 4-hour time frame.
forexdata <- data_features %>%
dplyr::left_join(data_target %>% dplyr::select(time_stamp, target, entry_price = open, last_price), by = c("target_time_stamp" = "time_stamp")) %>%
dplyr::mutate(target = as.factor(target)) %>%
na.omit()
# Test the last 500 candles
test <- tail(forexdata, 500)
# Train on everything except the last 500 candles (the test set) and a further 6 candles (corresponding to 24 hours), whose targets could not yet be determined in real time. Leaving this gap between the train and test sets prevents the model from cheating by being supplied with future information.
train <- head(forexdata, nrow(forexdata) - 500 - 6)
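The effect of the gap is easiest to see on a toy example. The sketch below uses 30 time-ordered rows and a hypothetical test size of 10 instead of 500.

```r
# Toy data: 30 time-ordered rows standing in for the 4-hour candles.
toy <- data.frame(id = 1:30)

n_test <- 10  # hypothetical test size (500 in the code above)
gap    <- 6   # six 4-hour candles = 24 hours

test_toy  <- tail(toy, n_test)                     # rows 21 to 30
train_toy <- head(toy, nrow(toy) - n_test - gap)   # rows 1 to 14

# Rows 15 to 20 belong to neither set: their targets depend on prices
# that overlap with the start of the test period.
range(train_toy$id)
range(test_toy$id)
```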
Specifying the model formula.
mod_formula <- formula(target ~ close + direction + sar.2_diff + adx.14 + macd.12.24_status + rsi.12 + ema.120_diff +
session_london + session_newyork + session_tokyo)
Running the random forest model on the training set.
rfmodel <- randomForest::randomForest(mod_formula, data = train,
sampsize = min(table(train$target))) # restrict sampsize to address class imbalance
Obtaining the predictions, confusion matrix and other metrics from the test set.
predicted <- predict(rfmodel, test)
confusion <- caret::confusionMatrix(predicted, test$target)
confusion
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2
## 0 74 49 29
## 1 110 74 75
## 2 35 24 30
##
## Overall Statistics
##
## Accuracy : 0.356
## 95% CI : (0.314, 0.3997)
## No Information Rate : 0.438
## P-Value [Acc > NIR] : 0.9999
##
## Kappa : 0.0343
##
## Mcnemar's Test P-Value : 7.11e-11
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2
## Sensitivity 0.3379 0.5034 0.2239
## Specificity 0.7224 0.4759 0.8388
## Pos Pred Value 0.4868 0.2857 0.3371
## Neg Pred Value 0.5833 0.6971 0.7470
## Prevalence 0.4380 0.2940 0.2680
## Detection Rate 0.1480 0.1480 0.0600
## Detection Prevalence 0.3040 0.5180 0.1780
## Balanced Accuracy 0.5302 0.4897 0.5313
Calculating the proportion of correctly and incorrectly predicted values for targets 1 and 2. The main focus is to assess opened trades, which corresponds to target classes 1 and 2. No trades would have been taken if target class 0 was predicted, and missing winning trades is a smaller concern compared to taking losing trades.
summary1 <- as.data.frame(confusion$table) %>%
dplyr::group_by(Prediction) %>%
dplyr::arrange(Prediction) %>%
dplyr::mutate(Perc = round(Freq/sum(Freq)*100, 0))
summary1
## # A tibble: 9 × 4
## # Groups: Prediction [3]
## Prediction Reference Freq Perc
## <fct> <fct> <int> <dbl>
## 1 0 0 74 49
## 2 0 1 49 32
## 3 0 2 29 19
## 4 1 0 110 42
## 5 1 1 74 29
## 6 1 2 75 29
## 7 2 0 35 39
## 8 2 1 24 27
## 9 2 2 30 34
buys <- summary1 %>% dplyr::filter(Prediction == 1)
sells <- summary1 %>% dplyr::filter(Prediction == 2)
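The same percentages can be reproduced directly from the confusion matrix with base R. The counts below are copied from the random forest output above.

```r
# Rows = predicted class, columns = actual class.
cm <- matrix(c( 74, 49, 29,
               110, 74, 75,
                35, 24, 30),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = c("0", "1", "2"),
                             Reference  = c("0", "1", "2")))

# Share of each actual outcome within each predicted class, in percent.
row_perc <- round(prop.table(cm, margin = 1) * 100)

row_perc["1", ]  # buys:  42 closed after 24 hours, 29 won, 29 lost
row_perc["2", ]  # sells: 39 closed after 24 hours, 27 lost, 34 won
```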
If EUR/CAD was bought every time the model predicted target class 1, 29% of trades would have won 75 pips, 29% of trades would have lost 75 pips, and 42% of trades would have closed after 24 hours. Similarly, if EUR/CAD was sold every time the model predicted target class 2, 34% of trades would have won 75 pips, 27% of trades would have lost 75 pips, and 39% of trades would have closed after 24 hours.
Although the positive predictive values are low for target classes 1 and 2, this does not necessarily mean that the model is unprofitable, because a trade closed after 24 hours could end in either a gain or a loss. Profitability can be further assessed using the price after 24 hours.
summary2 <- test %>%
  dplyr::mutate(predicted = predicted,
                pips = dplyr::case_when(
                  predicted == 1 & target == 1 ~ 75,
                  predicted == 1 & target == 0 ~ (last_price - entry_price)*10000, # convert the price difference to pips
                  predicted == 1 & target == 2 ~ -75,
                  predicted == 2 & target == 2 ~ 75,
                  predicted == 2 & target == 0 ~ (entry_price - last_price)*10000, # convert the price difference to pips
                  predicted == 2 & target == 1 ~ -75,
                  predicted == 0 ~ 0),
                win = ifelse(pips > 0, pips, NA),
                loss = ifelse(pips < 0, pips, NA)
  )
pips_total <- round(sum(summary2$pips), 0)
wisloss_ratio <- round(sum(summary2$win, na.rm = T)/abs(sum(summary2$loss, na.rm = T)), 2)
Taking trades based on the random forest model, the trades that hit a 75-pip limit would have netted a gain of 375 pips (104 winners against 99 losers, a win/loss ratio of 1.05). The 145 trades that closed after 24 hours add their own gains and losses, each of at most 75 pips, to this total.
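The pip tally can be checked on a handful of hypothetical trades. The key step is converting the price difference into pips for trades closed by the 24-hour rule; the sketch assumes one pip equals 0.0001, and `trade_pips()` and the prices are illustrative only.

```r
# Hypothetical trades: predicted class, actual class, entry and 24-hour prices.
trades <- data.frame(
  predicted   = c(1, 1, 1, 2, 2, 0),
  target      = c(1, 2, 0, 2, 0, 1),
  entry_price = c(1.4500, 1.4500, 1.4500, 1.4500, 1.4500, 1.4500),
  last_price  = c(1.4600, 1.4400, 1.4530, 1.4400, 1.4520, 1.4600)
)

pip_size <- 0.0001  # assumption: one pip is 0.0001 for this pair

trade_pips <- function(predicted, target, entry_price, last_price) {
  timed <- (last_price - entry_price) / pip_size  # result of a buy closed after 24 hours
  ifelse(predicted == 0, 0,                                   # no trade taken
  ifelse(target == 0, ifelse(predicted == 1, timed, -timed),  # closed after 24 hours
  ifelse(predicted == target, 75, -75)))                      # a 75-pip limit was hit
}

trades$pips <- with(trades, trade_pips(predicted, target, entry_price, last_price))
total_pips <- sum(trades$pips)
win_loss   <- sum(trades$pips[trades$pips > 0]) / abs(sum(trades$pips[trades$pips < 0]))
```

Without the division by `pip_size`, the timed exits would contribute values around 0.003 instead of 30 and effectively vanish from the total.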
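Preparing the data as inputs for the neural network model: numeric predictors are standardized, categorical predictors are dummy-coded, and the target is one-hot encoded.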
numvars <- c("close", "sar.2_diff", "adx.14", "rsi.12", "ema.120_diff")
catvars <- c("direction", "macd.12.24_status", "session_london", "session_newyork", "session_tokyo")
x_test_num <- test %>% dplyr::select(all_of(numvars)) %>% scale()
x_test_cat <- test %>% dplyr::select(all_of(catvars)) %>% fastDummies::dummy_cols(remove_first_dummy = T, remove_selected_columns = T)
x_test <- cbind(x_test_num, x_test_cat) %>% as.matrix()
y_test <- keras::to_categorical(test$target, 3)
## Loaded Tensorflow version 2.9.0-dev20220320
x_train_num <- train %>% dplyr::select(all_of(numvars)) %>% scale()
x_train_cat <- train %>% dplyr::select(all_of(catvars)) %>% fastDummies::dummy_cols(remove_first_dummy = T, remove_selected_columns = T)
x_train <- cbind(x_train_num, x_train_cat) %>% as.matrix()
y_train <- keras::to_categorical(train$target, 3)
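The code above standardizes the test set with its own means and standard deviations. A common alternative, sketched below on simulated data, is to estimate the scaling parameters on the training set only and reuse them for the test set, so that both sets are on the same scale. The matrices here are hypothetical stand-ins for the numeric predictors.

```r
set.seed(1)
# Simulated numeric predictors (hypothetical values).
train_num <- matrix(rnorm(40, mean = 50, sd = 10), ncol = 2,
                    dimnames = list(NULL, c("rsi.12", "adx.14")))
test_num  <- matrix(rnorm(10, mean = 55, sd = 10), ncol = 2,
                    dimnames = list(NULL, c("rsi.12", "adx.14")))

# Estimate the scaling on the training set only.
train_scaled <- scale(train_num)
centers <- attr(train_scaled, "scaled:center")
scales  <- attr(train_scaled, "scaled:scale")

# Apply the training means and standard deviations to the test set.
test_scaled <- scale(test_num, center = centers, scale = scales)
```

With this approach the test columns will generally not have mean 0 and standard deviation 1, which is expected.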
Running the neural network model on the training set and plotting the learning curves.
model <- keras::keras_model_sequential()
model %>%
keras::layer_dense(units = 128, activation = 'sigmoid', input_shape = ncol(x_train), kernel_regularizer = keras::regularizer_l2(l = 0.001)) %>%
keras::layer_dense(units = 64, activation = 'relu', kernel_regularizer = keras::regularizer_l2(l = 0.001)) %>%
keras::layer_dense(units = 64, activation = 'relu', kernel_regularizer = keras::regularizer_l2(l = 0.001)) %>%
keras::layer_dense(units = 3, activation = 'softmax')
model %>% keras::compile(
loss = 'categorical_crossentropy', # three mutually exclusive classes with a softmax output
optimizer = keras::optimizer_rmsprop(),
metrics = c('accuracy')
)
fit <- model %>% keras::fit(
x_train, y_train,
epochs = 100,
batch_size = 30,
validation_split = 0.2,
verbose=0
)
plot(fit)
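A softmax output over three mutually exclusive classes is usually paired with categorical cross-entropy. To make the quantity concrete, it can be computed by hand in base R; the probabilities below are hypothetical, not model output.

```r
# One-hot targets for three observations (classes 0, 1 and 2).
y_true <- rbind(c(1, 0, 0),
                c(0, 1, 0),
                c(0, 0, 1))

# Hypothetical softmax outputs; each row sums to 1.
y_prob <- rbind(c(0.7, 0.2, 0.1),
                c(0.3, 0.4, 0.3),
                c(0.2, 0.3, 0.5))

# Categorical cross-entropy: average over rows of -sum(y * log(p)).
categorical_crossentropy <- function(y, p) mean(-rowSums(y * log(p)))

loss <- categorical_crossentropy(y_true, y_prob)
loss
```

Only the probability assigned to the true class enters the loss, so here it reduces to the mean of -log(0.7), -log(0.4) and -log(0.5).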
Obtaining the predictions, confusion matrix and other metrics from the test set.
predicted <- model %>% predict(x_test) %>% keras::k_argmax() %>% as.array()
# Fix the factor levels so that classes the model never predicts still appear in the table
confusion <- caret::confusionMatrix(factor(as.vector(predicted), levels = 0:2), test$target)
confusion
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2
## 0 195 143 121
## 1 0 0 0
## 2 24 4 13
##
## Overall Statistics
##
## Accuracy : 0.416
## 95% CI : (0.3724, 0.4606)
## No Information Rate : 0.438
## P-Value [Acc > NIR] : 0.8501
##
## Kappa : -0.014
##
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2
## Sensitivity 0.8904 0.000 0.09701
## Specificity 0.0605 1.000 0.92350
## Pos Pred Value 0.4248 NaN 0.31707
## Neg Pred Value 0.4146 0.706 0.73638
## Prevalence 0.4380 0.294 0.26800
## Detection Rate 0.3900 0.000 0.02600
## Detection Prevalence 0.9180 0.000 0.08200
## Balanced Accuracy 0.4755 0.500 0.51026
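Turning class probabilities into labels with a fixed set of factor levels keeps all three classes in the table even when one of them is never predicted. A minimal base R sketch with hypothetical probabilities:

```r
# Hypothetical class probabilities for four observations (columns = classes 0, 1, 2).
probs <- rbind(c(0.6, 0.3, 0.1),
               c(0.2, 0.3, 0.5),
               c(0.5, 0.1, 0.4),
               c(0.1, 0.2, 0.7))

# max.col() returns 1-based column indices; subtract 1 to get the labels 0, 1, 2.
pred <- max.col(probs, ties.method = "first") - 1

# Fixing the levels keeps class 1 in the table even though it is never predicted.
pred_factor <- factor(pred, levels = 0:2)
table(pred_factor)
```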
Calculating the proportion of correctly and incorrectly predicted values for targets 1 and 2.
summary1 <- as.data.frame(confusion$table) %>%
dplyr::group_by(Prediction) %>%
dplyr::arrange(Prediction) %>%
dplyr::mutate(Perc = round(Freq/sum(Freq)*100, 0))
summary1
## # A tibble: 9 × 4
## # Groups: Prediction [3]
## Prediction Reference Freq Perc
## <fct> <fct> <int> <dbl>
## 1 0 0 195 42
## 2 0 1 143 31
## 3 0 2 121 26
## 4 1 0 0 NaN
## 5 1 1 0 NaN
## 6 1 2 0 NaN
## 7 2 0 24 59
## 8 2 1 4 10
## 9 2 2 13 32
buys <- summary1 %>% dplyr::filter(Prediction == 1)
sells <- summary1 %>% dplyr::filter(Prediction == 2)
The model did not predict target class 1 at all. If EUR/CAD was sold every time the model predicted target class 2, 32% of trades would have won 75 pips, 10% of trades would have lost 75 pips, and 59% of trades would have closed after 24 hours.
summary2 <- test %>%
dplyr::mutate(predicted = predicted,
pips = dplyr::case_when(
predicted == 1 & target == 1 ~ 75,
predicted == 1 & target == 0 ~ (last_price - entry_price)*10000,
predicted == 1 & target == 2 ~ -75,
predicted == 2 & target == 2 ~ 75,
predicted == 2 & target == 0 ~ (entry_price - last_price)*10000,
predicted == 2 & target == 1 ~ -75,
predicted == 0 ~ 0),
win = ifelse(pips > 0, pips, NA),
loss = ifelse(pips < 0, pips, NA)
)
pips_total <- round(sum(summary2$pips), 0)
wisloss_ratio <- round(sum(summary2$win, na.rm = T)/abs(sum(summary2$loss, na.rm = T)), 2)
Taking trades based on the neural network model would have resulted in a total gain of 629 pips with a win/loss ratio of 2.03.
Neither model performed well for trading purposes. Further improvements can be made, such as: