1 Overview

Forex trading involves buying or selling currency. There are various approaches to forex trading, one of which is technical analysis, which involves using technical indicators in the trading strategy. Historical forex prices can be used to assess strategies before trading the live markets, in order to maximize profits and, more importantly, minimize losses.

2 Objective

Build machine learning models to identify profitable buying and selling opportunities for the EUR/USD currency pair using historical forex data sourced from OANDA. Here, we will build a random forest model and a neural network model. The predictors will primarily be technical indicators derived from the price. The target will be a variable with 3 categories:

0: indicating price did not move 75 pips ($0.0075), either up or down, within the next 24 hours
1: indicating price moved up 75 pips first within the next 24 hours
2: indicating price moved down 75 pips first within the next 24 hours

Model performance will be assessed from a trading perspective by quantifying the predictive power and the number of pips gained or lost. We assume a trading strategy where any open trade is closed once 75 pips have been gained, 75 pips have been lost, or 24 hours have passed, whichever comes first.

3 Data preprocessing

3.1 Loading the data

Loading the data from OANDA’s REST API using the previously defined HisPrices() function. The open, high, low and close (OHLC) prices from the 4-hour and 15-minute timeframes will be used.

# Retrieving 4-hour data from 2020 to present
date_interval <- as.character(seq(lubridate::ymd_hm("2020-01-01 0:00"),
                           lubridate::ymd_hm(stringr::str_sub(Sys.time(), 1, 16)),
                           by = "6 months"))
date_interval <- data.frame(from = date_interval, to = dplyr::lead(date_interval, default = as.character(Sys.Date())))
dataH4 <- mapply(function(from, to){HisPrices(accountType = "practice",
                                               instrument = "EUR_CAD",
                                               granularity = "H4",
                                               tzone = "Canada/Eastern",
                                               from = from, to = to)},
                  date_interval$from, date_interval$to,
                  SIMPLIFY = F) %>% dplyr::bind_rows()

# Retrieving 15-minute data from 2020 to present
date_interval <- as.character(seq(lubridate::ymd_hm("2020-01-01 0:00"),
                                  lubridate::ymd_hm(stringr::str_sub(Sys.time(), 1, 16)),
                                  by = "1 month"))
date_interval <- data.frame(from = date_interval, to = dplyr::lead(date_interval, default = as.character(Sys.Date())))
dataM15 <- mapply(function(from, to){HisPrices(accountType = "practice",
                                               instrument = "EUR_CAD",
                                               granularity = "M15",
                                               tzone = "Canada/Eastern",
                                               from = from, to = to)},
                  date_interval$from, date_interval$to,
                  SIMPLIFY = F) %>% dplyr::bind_rows()
save(list = c("dataH4", "dataM15"), file = "../data/forex_trade_prediction_data.Rdata")
load(file = "../data/forex_trade_prediction_data.Rdata")

head(dataH4)

##   complete volume          time_stamp    open    high     low   close
## 1     TRUE   6670 2020-01-01 17:00:00 1.45644 1.45686 1.45545 1.45574
## 2     TRUE   6650 2020-01-01 21:00:00 1.45574 1.45690 1.45520 1.45536
## 3     TRUE  15590 2020-01-02 01:00:00 1.45536 1.45597 1.45374 1.45430
## 4     TRUE  18915 2020-01-02 05:00:00 1.45430 1.45578 1.45370 1.45477
## 5     TRUE  27514 2020-01-02 09:00:00 1.45476 1.45504 1.45095 1.45168
## 6     TRUE  11810 2020-01-02 13:00:00 1.45166 1.45190 1.44990 1.45076

head(dataM15)

##   complete volume          time_stamp    open    high     low   close
## 1     TRUE     45 2020-01-01 17:00:00 1.45644 1.45686 1.45579 1.45600
## 2     TRUE     21 2020-01-01 17:15:00 1.45609 1.45631 1.45590 1.45605
## 3     TRUE    101 2020-01-01 17:30:00 1.45608 1.45616 1.45590 1.45614
## 4     TRUE    317 2020-01-01 17:45:00 1.45615 1.45644 1.45608 1.45642
## 5     TRUE   1444 2020-01-01 18:00:00 1.45638 1.45638 1.45562 1.45602
## 6     TRUE    280 2020-01-01 18:15:00 1.45602 1.45626 1.45596 1.45608
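
The requests above are issued one date chunk at a time. A minimal sketch of how consecutive from/to pairs can be built in base R is shown here; the fixed end date is a hypothetical stand-in for Sys.time(), and the names starts, end_date and intervals are illustrative only.

```r
# Hypothetical fixed end date, used instead of Sys.time() so the result is reproducible
end_date <- as.Date("2021-06-15")

# Chunk start dates, six months apart
starts <- seq(as.Date("2020-01-01"), end_date, by = "6 months")

# Each chunk ends where the next one starts; the last chunk ends at end_date
intervals <- data.frame(from = starts, to = c(starts[-1], end_date))
intervals
```

Each row of intervals then corresponds to one request, in the same way that each row of date_interval feeds one call to HisPrices().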

3.2 Adding the target

First, we count the candlesticks that moved more than 75 pips both above and below their opening price. For any such candlestick, we would not be able to determine in which direction price moved 75 pips first.

num4h <- dataH4 %>% dplyr::summarise(both = sum((open - low) > 0.0075 & (high - open) > 0.0075))
num15m <- dataM15 %>% dplyr::summarise(both = sum((open - low) > 0.0075 & (high - open) > 0.0075))

There were 5 candlesticks in the 4-hour timeframe and 0 candlesticks in the 15-minute timeframe that moved more than 75 pips in both directions. The 15-minute data will therefore be used to derive the target, since none of its candlesticks moved by 75 pips in both directions. This avoids ambiguity and assumptions about candlestick movement. A rolling function will be applied to create the 3-category target over the next 24-hour window (96 15-minute candlesticks correspond to 24 hours).
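
To make the arithmetic concrete, here is a small sketch of the same check on toy candles. The prices are hypothetical (not OANDA data), and pip_size assumes a pair quoted to four decimal places.

```r
# Toy candles (hypothetical prices)
candles <- data.frame(
  open = c(1.4500, 1.4500, 1.4500),
  high = c(1.4520, 1.4590, 1.4580),
  low  = c(1.4490, 1.4495, 1.4420)
)

pip_size  <- 0.0001          # one pip for a pair quoted to 4 decimal places (assumption)
threshold <- 75 * pip_size   # 75 pips expressed in price units

# A candle is ambiguous when it moved more than the threshold both above and below its open
ambiguous <- (candles$open - candles$low) > threshold &
  (candles$high - candles$open) > threshold
sum(ambiguous)

# Number of 15-minute candles in a 24-hour window
candles_per_day <- 24 * 60 / 15
candles_per_day
```

Only the third toy candle exceeds the threshold on both sides of its open, so it is the only one for which the order of the two moves cannot be recovered from OHLC data alone.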

data_target = dataM15

# Relative to the current price, identify the index of the first candle that reached >75 pips higher (if any)
data_target$index_at_incr <- runner::runner(x = dataM15,
                                            f = function(x){which((x$high > (x$open[1] + 0.0075))[-1])[1]}, 
                                            lag = -96, k = 97, na_pad = T)
# Relative to the current price, identify the index of the first candle that reached >75 pips lower (if any)
data_target$index_at_decr <- runner::runner(x = dataM15,
                                            f = function(x){which((x$low < (x$open[1] - 0.0075))[-1])[1]},
                                            lag = -96, k = 97, na_pad = T)
# Adding the price after 24 hours
data_target$last_price <- runner::runner(x = dataM15,
                                            f = function(x){x$open[97]},
                                            lag = -96, k = 97, na_pad = T)
# Putting together the 3 category variables: 0 if price did not move >75 pips in either direction, 1 if price moved up first by >75 pips, 2 if price moved down first by >75 pips.
data_target = data_target %>%
  dplyr::mutate(target = dplyr::case_when(
    (!is.na(index_at_incr) & index_at_incr < index_at_decr) | (!is.na(index_at_incr) & is.na(index_at_decr)) ~ 1,
    (!is.na(index_at_decr) & index_at_decr < index_at_incr) | (!is.na(index_at_decr) & is.na(index_at_incr)) ~ 2,
    dplyr::row_number() < nrow(data_target) - 95 ~ 0 # else is 0, except for the last 96 rows, which should be NA because there won't be enough future data to determine the target
  ))
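
The labelling rule can also be written out for a single forward window in base R. This is a minimal sketch, not the rolling implementation above: label_window is a hypothetical helper, and the prices are toy values.

```r
# Label one forward window: element 1 is the entry candle, the rest are future candles.
# Returns 1 (up first), 2 (down first) or 0 (neither level reached within the window).
label_window <- function(open, high, low, threshold = 0.0075) {
  entry <- open[1]
  up   <- which(high[-1] > entry + threshold)[1]  # first future candle above the upper level
  down <- which(low[-1]  < entry - threshold)[1]  # first future candle below the lower level
  if (is.na(up) && is.na(down)) return(0)
  if (is.na(down) || (!is.na(up) && up < down)) return(1)
  if (is.na(up) || down < up) return(2)
  0  # both levels first reached in the same candle: order unknown, no signal
}

# Toy window: the upper level (1.4575) is crossed in the 2nd future candle,
# the lower level (1.4425) only in the 3rd
open <- c(1.4500, 1.4510, 1.4530, 1.4560)
high <- c(1.4510, 1.4535, 1.4580, 1.4570)
low  <- c(1.4495, 1.4505, 1.4525, 1.4400)
label_window(open, high, low)
```

The final branch mirrors the case_when() above, where a tie between the two indices satisfies neither of the first two conditions and falls through to 0.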

3.3 Adding the predictors

Adding continuous variables, including the parabolic stop and reverse (SAR), average directional index (ADX), moving average convergence divergence (MACD), relative strength index (RSI) and exponential moving average (EMA) indicators, as well as dummy variables for the London, New York and Tokyo trading sessions.

data_features = dataH4
sar.2 = TTR::SAR(data_features[, c("high","low")], accel = c(0.02, 0.2))
adx.14 = TTR::ADX(data_features[, c("high","low","close")], 14)[, 4]
data_features <- data_features %>%
  dplyr::mutate(direction = dplyr::if_else(open <= close, "bull", "bear"),
                sar.2 = sar.2,
                sar.2_diff = sar.2 - close,
                adx.14 = adx.14,
                macd.12.24 = TTR::MACD(close, 12, 24, 9)[, "macd"],
                macd.12.24_signal = TTR::MACD(close, 12, 24, 9)[, "signal"],
                macd.12.24_status = dplyr::if_else(macd.12.24 >= macd.12.24_signal, "positive", "negative"),
                rsi.12 = TTR::RSI(close, 12),
                ema.120 = TTR::EMA(close, 120),
                ema.120_diff = ema.120 - close,
                target_time_stamp = dplyr::lead(time_stamp), # the target corresponds to the next time stamp
                hour = lubridate::hour(time_stamp),
                day = weekdays(time_stamp),
                session_london = ifelse(dplyr::between(hour, 3, 12), 1, 0),
                session_newyork = ifelse(dplyr::between(hour, 8, 17), 1, 0),
                session_tokyo = ifelse(dplyr::between(hour, 19, 24) | dplyr::between(hour, 0, 4), 1, 0)
  ) %>%
  dplyr::ungroup()
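
The session dummies depend only on the hour of the candle in the data's time zone (Canada/Eastern here). A base R sketch of the same hour ranges follows; in_range is a hypothetical stand-in for dplyr::between(), which is also inclusive at both ends.

```r
# Hours of the day in the data's time zone
hour <- 0:23

# Inclusive range check, standing in for dplyr::between()
in_range <- function(x, lo, hi) x >= lo & x <= hi

sessions <- data.frame(
  hour            = hour,
  session_london  = as.integer(in_range(hour, 3, 12)),
  session_newyork = as.integer(in_range(hour, 8, 17)),
  session_tokyo   = as.integer(in_range(hour, 19, 24) | in_range(hour, 0, 4))
)

# The 4-hour candles in the sample output open at 1, 5, 9, 13, 17 and 21
sessions[sessions$hour %in% c(1, 5, 9, 13, 17, 21), ]
```

With these ranges the sessions overlap (London and New York share hours 8 to 12), so a candle can carry more than one session flag.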

3.4 Joining target and predictors

Joining together the target and predictors. Trading signals will be based on the 4-hour time frame.

forexdata <- data_features %>%
  dplyr::left_join(data_target %>% dplyr::select(time_stamp, target, entry_price = open, last_price), by = c("target_time_stamp" = "time_stamp")) %>%
  dplyr::mutate(target = as.factor(target)) %>%
  na.omit()

3.5 Splitting the data into train and test sets

# Test the last 500 candles
test <- tail(forexdata, 500)
# Train on everything except the last 500 candles (the test set) and a further 6 candles (corresponding to 24 hours), whose targets would not yet be known in real time. This gap between the train and test sets prevents the model from being supplied with future information.
train <- head(forexdata, nrow(forexdata) - 500 - 6)
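
The effect of the gap is easiest to see on a toy data frame. In this sketch, toy is a hypothetical stand-in for forexdata with 1000 rows in time order.

```r
# Toy stand-in for forexdata: 1000 rows in time order (hypothetical)
toy <- data.frame(id = 1:1000)

n_test <- 500  # most recent candles held out for testing
gap    <- 6    # six 4-hour candles = 24 hours, the length of the target window

test_toy  <- tail(toy, n_test)
train_toy <- head(toy, nrow(toy) - n_test - gap)

range(train_toy$id)  # first and last training row
range(test_toy$id)   # first and last test row
```

Rows 495 to 500 belong to neither set: their 24-hour target windows reach into the test period, so using them for training would leak information from it.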

4 Model fitting

Specifying the model formula.

mod_formula <- formula(target ~ close + direction + sar.2_diff + adx.14 + macd.12.24_status + rsi.12 + ema.120_diff +
                         session_london + session_newyork + session_tokyo)

4.1 Random forest model

Running the random forest model on the training set.

rfmodel <- randomForest::randomForest(mod_formula, data = train,
                                      sampsize = min(table(train$target))) # restrict sampsize to address class imbalance

Obtaining the predictions, confusion matrix and other metrics from the test set.

predicted <- predict(rfmodel, test)
confusion <- caret::confusionMatrix(predicted, test$target)
confusion

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2
##          0  74  49  29
##          1 110  74  75
##          2  35  24  30
## 
## Overall Statistics
##                                          
##                Accuracy : 0.356          
##                  95% CI : (0.314, 0.3997)
##     No Information Rate : 0.438          
##     P-Value [Acc > NIR] : 0.9999         
##                                          
##                   Kappa : 0.0343         
##                                          
##  Mcnemar's Test P-Value : 7.11e-11       
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.3379   0.5034   0.2239
## Specificity            0.7224   0.4759   0.8388
## Pos Pred Value         0.4868   0.2857   0.3371
## Neg Pred Value         0.5833   0.6971   0.7470
## Prevalence             0.4380   0.2940   0.2680
## Detection Rate         0.1480   0.1480   0.0600
## Detection Prevalence   0.3040   0.5180   0.1780
## Balanced Accuracy      0.5302   0.4897   0.5313

Calculating the proportion of correctly and incorrectly predicted values for targets 1 and 2. The main focus is to assess opened trades, which correspond to predicted target classes 1 and 2. No trade would have been taken when target class 0 was predicted, and missing winning trades is a smaller concern than taking losing trades.

summary1 <- as.data.frame(confusion$table) %>%
  dplyr::group_by(Prediction) %>%
  dplyr::arrange(Prediction) %>% 
  dplyr::mutate(Perc = round(Freq/sum(Freq)*100, 0))
summary1

## # A tibble: 9 × 4
## # Groups:   Prediction [3]
##   Prediction Reference  Freq  Perc
##   <fct>      <fct>     <int> <dbl>
## 1 0          0            74    49
## 2 0          1            49    32
## 3 0          2            29    19
## 4 1          0           110    42
## 5 1          1            74    29
## 6 1          2            75    29
## 7 2          0            35    39
## 8 2          1            24    27
## 9 2          2            30    34

buys <- summary1 %>% dplyr::filter(Prediction == 1)
sells <- summary1 %>% dplyr::filter(Prediction == 2)

If EUR/USD was bought every time the model predicted target class 1, 29% of trades would have won 75 pips, 29% of trades would have lost 75 pips, and 42% of trades would have closed after 24 hours. Similarly, if EUR/USD was sold every time the model predicted target class 2, 34% of trades would have won 75 pips, 27% of trades would have lost 75 pips, and 39% of trades would have closed after 24 hours.

Although the positive predictive values are low for target classes 1 and 2, this does not necessarily mean that the model is unprofitable, because closing a trade after 24 hours could result in either a gain or a loss. Profitability can be further assessed using the closing price after 24 hours.

summary2 <- test %>% 
  dplyr::mutate(predicted = predicted,
                pips = dplyr::case_when(
                  predicted == 1 & target == 1 ~ 75,
                  predicted == 1 & target == 0 ~ (last_price - entry_price)*10000, # price change converted to pips
                  predicted == 1 & target == 2 ~ -75,
                  predicted == 2 & target == 2 ~ 75,
                  predicted == 2 & target == 0 ~ (entry_price - last_price)*10000, # price change converted to pips
                  predicted == 2 & target == 1 ~ -75,
                  predicted == 0 ~ 0),
                win = ifelse(pips > 0, pips, NA),
                loss = ifelse(pips < 0, pips, NA)
                )
pips_total <- round(sum(summary2$pips), 0)
winloss_ratio <- round(sum(summary2$win, na.rm = TRUE)/abs(sum(summary2$loss, na.rm = TRUE)), 2)

Counting the trades that reached the 75-pip profit or loss level, taking trades based on the random forest model would have resulted in a total gain of 375 pips with a win/loss ratio of 1.05. Trades closed after 24 hours contribute their price change over that window, converted to pips, in addition to this figure.
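
As a cross-check, the 375 pips and the 1.05 ratio can be reproduced from the random forest confusion matrix alone, by counting only trades in which price reached the 75-pip level in either direction. The sketch below re-enters the matrix by hand; the names cm, wins and losses are illustrative.

```r
# Random forest confusion matrix from the output above (rows = prediction, columns = reference)
cm <- matrix(c( 74, 49, 29,
               110, 74, 75,
                35, 24, 30),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = c("0", "1", "2"),
                             Reference  = c("0", "1", "2")))

wins   <- cm["1", "1"] + cm["2", "2"]  # predicted direction reached 75 pips first
losses <- cm["1", "2"] + cm["2", "1"]  # opposite direction reached 75 pips first

pips_from_targets  <- (wins - losses) * 75
ratio_from_targets <- round(wins / losses, 2)
c(pips = pips_from_targets, ratio = ratio_from_targets)
```

The match indicates that trades closed at the 24-hour mark add essentially nothing to these two figures, which is what happens when their price change is left in price units (on the order of 0.001) instead of being multiplied by 10000 to express it in pips.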

4.2 Neural network model

Preparing the data further into proper model inputs.

numvars <- c("close", "sar.2_diff", "adx.14", "rsi.12", "ema.120_diff")
catvars <- c("direction", "macd.12.24_status", "session_london", "session_newyork", "session_tokyo")

x_test_num <- test %>% dplyr::select(all_of(numvars)) %>% scale()
x_test_cat <- test %>% dplyr::select(all_of(catvars)) %>% fastDummies::dummy_cols(remove_first_dummy = T, remove_selected_columns = T)
x_test <- cbind(x_test_num, x_test_cat) %>% as.matrix()
y_test <- keras::to_categorical(test$target, 3)

## Loaded Tensorflow version 2.9.0-dev20220320

x_train_num <- train %>% dplyr::select(all_of(numvars)) %>% scale()
x_train_cat <- train %>% dplyr::select(all_of(catvars)) %>% fastDummies::dummy_cols(remove_first_dummy = T, remove_selected_columns = T)
x_train <- cbind(x_train_num, x_train_cat) %>% as.matrix()
y_train <- keras::to_categorical(train$target, 3)
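
The code above scales the train and test predictors separately, each with its own mean and standard deviation. A common alternative, shown here only as a sketch on hypothetical toy values, is to reuse the training-set centre and spread for the test set so that both are on the same scale.

```r
# Toy numeric predictors (hypothetical values)
train_num <- data.frame(rsi = c(30, 50, 70), adx = c(10, 20, 30))
test_num  <- data.frame(rsi = c(40, 60),     adx = c(15, 35))

# Scale the training set and keep the parameters scale() stored as attributes
train_scaled <- scale(train_num)
centers <- attr(train_scaled, "scaled:center")
sds     <- attr(train_scaled, "scaled:scale")

# Apply the training-set centre and spread to the test set
test_scaled <- scale(test_num, center = centers, scale = sds)
test_scaled
```

Applied to this document, that would mean passing the attributes of x_train_num to scale() when building x_test_num; the results reported below were not produced that way.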

Running the neural network model on the training set and plotting the learning curves.

model <- keras::keras_model_sequential()
model %>%
  keras::layer_dense(units = 128, activation = 'sigmoid', input_shape = ncol(x_train), kernel_regularizer = keras::regularizer_l2(l = 0.001)) %>%
  keras::layer_dense(units = 64, activation = 'relu', kernel_regularizer = keras::regularizer_l2(l = 0.001)) %>%
  keras::layer_dense(units = 64, activation = 'relu', kernel_regularizer = keras::regularizer_l2(l = 0.001)) %>%
  keras::layer_dense(units = 3, activation = 'softmax')

model %>% keras::compile(
  loss = 'binary_crossentropy',
  optimizer = keras::optimizer_rmsprop(),
  metrics = c('accuracy')
)
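
The model above is compiled with 'binary_crossentropy'. For a three-class softmax output with one label per observation, 'categorical_crossentropy' is the loss more commonly used in Keras. The sketch below compares the two quantities by hand for a single observation; the probabilities are hypothetical, and the results reported in this document were obtained with the binary loss as written.

```r
# One observation whose true class is 1, one-hot encoded over classes 0, 1, 2
y_true <- c(0, 1, 0)
# Hypothetical softmax output for that observation (sums to 1)
y_prob <- c(0.2, 0.7, 0.1)

# Categorical cross-entropy: only the probability of the true class contributes
categorical_ce <- -sum(y_true * log(y_prob))

# Binary cross-entropy: each output is scored as its own yes/no problem, then averaged
binary_ce <- -mean(y_true * log(y_prob) + (1 - y_true) * log(1 - y_prob))

round(c(categorical = categorical_ce, binary = binary_ce), 4)
```

The two losses weight the same prediction differently, so training with one or the other can lead to different fitted models.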

fit <- model %>% keras::fit(
  x_train, y_train,
  epochs = 100,
  batch_size = 30,
  validation_split = 0.2,
  verbose=0
)

plot(fit)

Obtaining the predictions, confusion matrix and other metrics from the test set.

predicted <- model %>% predict(x_test) %>% keras::k_argmax() %>% as.array()
confusion <- caret::confusionMatrix(as.factor(predicted), as.factor(test$target))

## Warning in levels(reference) != levels(data): longer object length is not a multiple of shorter object length

## Warning in confusionMatrix.default(as.factor(predicted), as.factor(test$target)): Levels are not in the same order for reference and data. Refactoring data to match.

confusion

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2
##          0 195 143 121
##          1   0   0   0
##          2  24   4  13
## 
## Overall Statistics
##                                           
##                Accuracy : 0.416           
##                  95% CI : (0.3724, 0.4606)
##     No Information Rate : 0.438           
##     P-Value [Acc > NIR] : 0.8501          
##                                           
##                   Kappa : -0.014          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2
## Sensitivity            0.8904    0.000  0.09701
## Specificity            0.0605    1.000  0.92350
## Pos Pred Value         0.4248      NaN  0.31707
## Neg Pred Value         0.4146    0.706  0.73638
## Prevalence             0.4380    0.294  0.26800
## Detection Rate         0.3900    0.000  0.02600
## Detection Prevalence   0.9180    0.000  0.08200
## Balanced Accuracy      0.4755    0.500  0.51026

Calculating the proportion of correctly and incorrectly predicted values for targets 1 and 2.

summary1 <- as.data.frame(confusion$table) %>%
  dplyr::group_by(Prediction) %>%
  dplyr::arrange(Prediction) %>% 
  dplyr::mutate(Perc = round(Freq/sum(Freq)*100, 0))
summary1

## # A tibble: 9 × 4
## # Groups:   Prediction [3]
##   Prediction Reference  Freq  Perc
##   <fct>      <fct>     <int> <dbl>
## 1 0          0           195    42
## 2 0          1           143    31
## 3 0          2           121    26
## 4 1          0             0   NaN
## 5 1          1             0   NaN
## 6 1          2             0   NaN
## 7 2          0            24    59
## 8 2          1             4    10
## 9 2          2            13    32

buys <- summary1 %>% dplyr::filter(Prediction == 1)
sells <- summary1 %>% dplyr::filter(Prediction == 2)

The model did not predict target class 1 for any observation. If EUR/USD was sold every time the model predicted target class 2, 32% of trades would have won 75 pips, 10% of trades would have lost 75 pips, and 59% of trades would have closed after 24 hours.

summary2 <- test %>% 
  dplyr::mutate(predicted = predicted,
                pips = dplyr::case_when(
                  predicted == 1 & target == 1 ~ 75,
                  predicted == 1 & target == 0 ~ (last_price - entry_price)*10000,
                  predicted == 1 & target == 2 ~ -75,
                  predicted == 2 & target == 2 ~ 75,
                  predicted == 2 & target == 0 ~ (entry_price - last_price)*10000,
                  predicted == 2 & target == 1 ~ -75,
                  predicted == 0 ~ 0),
                win = ifelse(pips > 0, pips, NA),
                loss = ifelse(pips < 0, pips, NA)
                )
pips_total <- round(sum(summary2$pips), 0)
winloss_ratio <- round(sum(summary2$win, na.rm = TRUE)/abs(sum(summary2$loss, na.rm = TRUE)), 2)

Taking trades based on the neural network model would have resulted in a total gain of 629 pips with a win/loss ratio of 2.03.

5 Concluding Remarks

Neither model performed well for trading purposes. Further improvements could be made, such as:

Explore other predictors or combinations of predictors: not only technical indicators, but also economic indicators such as interest rates and consumer price indices
Explore other target definitions
Explore other specifications and parameters for the random forest and neural network models
Explore other machine learning models, such as long short-term memory (LSTM) networks
Explore the use of more data

EURUSD Trade Prediction

Kim G

2022-07-05