This article demonstrates semi-automatic algorithmic trading in R: predicting whether the price trend of a stock will rise or fall, in order to support trading decisions.
#> Observations: 3,747
#> Variables: 10
#> $ price.open <dbl> 1700.0, 1712.5, 1725.0, 1700.0, 1725.0, 1712.5,...
#> $ price.high <dbl> 1725.0, 1725.0, 1725.0, 1725.0, 1725.0, 1712.5,...
#> $ price.low <dbl> 1687.5, 1700.0, 1700.0, 1687.5, 1662.5, 1687.5,...
#> $ price.close <dbl> 1700.0, 1725.0, 1712.5, 1725.0, 1712.5, 1712.5,...
#> $ volume <dbl> 103252000, 34434000, 49406000, 83410000, 264260...
#> $ price.adjusted <dbl> 1147.050, 1163.919, 1155.485, 1163.919, 1155.48...
#> $ ref.date <date> 2005-04-01, 2005-04-04, 2005-04-05, 2005-04-06...
#> $ ticker <chr> "BBCA.JK", "BBCA.JK", "BBCA.JK", "BBCA.JK", "BB...
#> $ ret.adjusted.prices <dbl> NA, 0.014706009, -0.007245967, 0.007298854, -0....
#> $ ret.closing.prices <dbl> NA, 0.014705882, -0.007246377, 0.007299270, -0....
#> price.open price.high price.low price.close
#> 8 8 8 8
#> volume price.adjusted ref.date ticker
#> 8 8 0 0
#> ret.adjusted.prices ret.closing.prices
#> 16 16
# graph visualization
tail(stock_data,7*4*6) %>%
ggplot(aes(x = ref.date, y = price.close)) +
geom_candlestick(aes(open = price.open,
high = price.high,
low = price.low,
close = price.close)) +
# exponential moving averages (12 and 26 periods)
geom_ma(ma_fun = EMA, n = 12, colour = "blue", lwd = 0.7) +
geom_ma(ma_fun = EMA, n = 26, colour = "red", lwd = 0.7) +
labs(title = "BBCA.JK Stock 2005 - 2020",
x = "Date",
y = "Price Close") +
theme_tq()
To predict the trend of an asset, we can use technical indicators. A technical indicator is a data series that summarizes information about an asset, derived by applying a particular formula to the asset's price data, i.e. OHLCV (open, high, low, close, volume). This article uses three commonly used technical indicators: MACD, SAR, and RSI.
When the difference MACD line - Signal line reaches zero, the trend of the MACD line tends to change. This difference is also called the MACD oscillator.
# calculate MACD(12,26,9)
stock_ti <- stock_data %>%
tq_mutate(
select = price.close,
mutate_fun = MACD,
nFast = 12,
nSlow = 26,
nSig = 9, # signal
maType = EMA,
percent = TRUE
) %>%
mutate(i_MACD = macd - signal) %>%
select(-macd, -signal)
# MACD plot
tail(stock_ti, 7*4*6) %>%
ggplot(aes(x = ref.date, y = i_MACD)) +
geom_line(aes(y = i_MACD))
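To make the MACD(12,26,9) calculation more concrete, here is a minimal base-R sketch of the percentage MACD line, its signal line, and their difference. The helpers `ema()` and `macd_oscillator()` and the simulated prices are illustrative assumptions, not part of TTR or tidyquant; TTR's own `MACD()` may seed the averages and handle missing values differently, so the numbers need not match exactly.

```r
# Hypothetical helper: EMA seeded with the simple mean of the first n values
ema <- function(x, n) {
  stopifnot(length(x) > n)
  alpha <- 2 / (n + 1)
  out <- rep(NA_real_, length(x))
  out[n] <- mean(x[1:n])
  for (i in (n + 1):length(x)) {
    out[i] <- alpha * x[i] + (1 - alpha) * out[i - 1]
  }
  out
}

# Hypothetical helper: percentage MACD line, signal line, and their difference
macd_oscillator <- function(close, n_fast = 12, n_slow = 26, n_sig = 9) {
  fast <- ema(close, n_fast)
  slow <- ema(close, n_slow)
  macd <- 100 * (fast / slow - 1)            # percent gap between the two EMAs
  signal <- rep(NA_real_, length(close))
  valid <- which(!is.na(macd))
  signal[valid] <- ema(macd[valid], n_sig)   # EMA of the MACD line
  data.frame(macd = macd, signal = signal, oscillator = macd - signal)
}

set.seed(1)
close <- 1700 + cumsum(rnorm(100, sd = 10))  # made-up prices
res <- macd_oscillator(close)
tail(res, 3)
```

The `oscillator` column plays the role of `i_MACD` above: positive values mean the MACD line sits above its signal line.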
# calculate SAR(0.02, 0.2)
stock_ti <- stock_ti %>%
tq_mutate(
select = c(price.high, price.low),
mutate_fun = SAR,
accel = c(0.02, 0.2),
col_rename = "sar") %>%
mutate(i_SAR = 100 * (price.close - sar) / sar) %>%
select(-sar)
# calculate RSI(14)
stock_ti <- stock_ti %>%
tq_mutate(
select = price.close,
mutate_fun = RSI,
n = 14, # calculation period
maType = EMA,
col_rename = "i_RSI"
)
tail(stock_ti, 7*4*6) %>%
ggplot(aes(x = ref.date, y = i_RSI)) +
geom_line()
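As a rough illustration of what RSI measures, the sketch below computes it as 100 - 100 / (1 + average gain / average loss) over the last `n` price changes. It uses plain arithmetic means rather than the EMA smoothing passed to `RSI()` above, so it is a simplified variant and not the same calculation; `rsi_simple()` and the price vectors are hypothetical.

```r
# Hypothetical helper: RSI based on simple averages of gains and losses
rsi_simple <- function(close, n = 14) {
  change <- diff(close)
  stopifnot(length(change) >= n)
  gain <- pmax(change, 0)    # upward moves, 0 otherwise
  loss <- pmax(-change, 0)   # downward moves as positive numbers
  out <- rep(NA_real_, length(close))
  for (i in n:length(change)) {
    avg_gain <- mean(gain[(i - n + 1):i])
    avg_loss <- mean(loss[(i - n + 1):i])
    # the i-th change ends at price i + 1; no losses in the window maps to 100
    out[i + 1] <- if (avg_loss == 0) 100 else 100 - 100 / (1 + avg_gain / avg_loss)
  }
  out
}

rising <- rsi_simple(1:20)    # every change is a gain
falling <- rsi_simple(20:1)   # every change is a loss
tail(rising, 1)
tail(falling, 1)
```

A steadily rising series pushes the value to the top of the 0-100 range and a steadily falling one to the bottom, which is what the overbought and oversold thresholds rely on.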
Portfolio performance for each asset can be measured in various ways, a few of which are shown below:
# daily return data
stock_met <- stock_ti %>%
tq_mutate(
select = price.close,
mutate_fun = dailyReturn,
col_rename = "returnDaily",
type = "log"
) %>%
drop_na()
# cumulative return
stock_met %>%
tq_transmute(
select = returnDaily,
mutate_fun = Return.cumulative
)
#> returnDaily
#> Cumulative Return 6.403238
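One detail worth checking in this step: `dailyReturn()` is called with `type = "log"`, whereas compounding with `prod(1 + r) - 1` is the rule for simple (arithmetic) returns. The base-R sketch below, using made-up prices, shows how the two kinds of return aggregate and what happens when log returns are compounded as if they were simple ones.

```r
price <- c(100, 110, 99, 120)   # made-up prices

simple_ret <- price[-1] / price[-length(price)] - 1
log_ret <- diff(log(price))

# reference value: total return from first to last price
true_total <- price[length(price)] / price[1] - 1

total_from_simple <- prod(1 + simple_ret) - 1   # simple returns compound
total_from_log <- exp(sum(log_ret)) - 1         # log returns add up
mixed <- prod(1 + log_ret) - 1                  # log returns compounded as simple

c(true = true_total, simple = total_from_simple,
  log = total_from_log, mixed = mixed)
```

The first three values agree, while the mixed version comes out lower, so a cumulative figure produced that way is best read as an approximation.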
# annualized return
stock_met %>%
tq_performance(
Ra = returnDaily,
performance_fun = Return.annualized,
scale = 252,
geometric = TRUE
)
# sharpe ratio
stock_met %>%
tq_performance(
Ra = returnDaily,
performance_fun = SharpeRatio.annualized,
scale = 252,
geometric = TRUE
)
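For readers who want to see the arithmetic behind these two metrics, here is a small base-R sketch under the assumption of simple daily returns, 252 trading days per year, and a zero risk-free rate. The functions `annualized_return()` and `annualized_sharpe()` are hypothetical stand-ins, not the PerformanceAnalytics implementations, which may differ in detail.

```r
# Hypothetical helper: geometric annualized return
annualized_return <- function(r, scale = 252) {
  prod(1 + r)^(scale / length(r)) - 1
}

# Hypothetical helper: annualized return divided by annualized volatility
annualized_sharpe <- function(r, scale = 252, rf = 0) {
  excess <- r - rf
  annualized_return(excess, scale) / (sd(excess) * sqrt(scale))
}

set.seed(7)
r <- rnorm(504, mean = 0.0005, sd = 0.01)   # two years of made-up daily returns

annualized_return(r)
annualized_sharpe(r)
```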
There are many algorithmic trading strategies; here we use a long-short strategy, which positions us in line with the current trend:
long position: buying an asset and holding it in the hope that its price rises. When the price is high enough, the asset is sold for a profit.
short position: selling an asset in the hope that its price falls. When the price is low enough, the asset is bought back for a profit.
We can encode each position as a multiplier [^1]:
If we go long, multiply the return by 1, so a rising price gives a profit and a falling price gives a loss.
If we go short, multiply the return by -1, so a falling price gives a profit and a rising price gives a loss.
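The encoding above can be sketched in a few lines of base R. The return and position vectors are made-up values for illustration only.

```r
daily_return <- c(0.02, -0.01, 0.03, -0.04)   # made-up asset returns
position <- c(1, 1, -1, -1)                   # 1 = long, -1 = short

# long days keep the sign of the return, short days flip it
strategy_return <- daily_return * position
strategy_return
```

On the last day the asset falls by 4% while we are short, so the strategy records a 4% gain; on the third day it rises by 3% while we are short, so the strategy records a 3% loss.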
After picking the base strategy, we can prepare trend predictors from the dataset using feature engineering.
# trade signals
stock_met <- stock_met %>%
mutate(
signalMACD = ifelse(i_MACD >= 0, 1, -1) %>%
lag(default = 0),
signalRSI = ifelse(i_RSI <= 10, 1, ifelse(i_RSI > 90, -1, 0)) %>% lag(default = 0),
signalSAR = ifelse(i_SAR >= 0, 1, -1) %>% lag(default = 0)
)
# lag() is applied because each signal is used to predict the trend of the following day. default = 0 sets the first day to 0 (hold signal): we don't trade on the first day, since there is no prior information.
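The effect of the lag can be seen on a tiny example. This base-R sketch uses a hypothetical `lag1()` helper in place of `dplyr::lag()` and made-up indicator values.

```r
# Hypothetical helper: shift a vector one step forward, filling the first slot
lag1 <- function(x, default = 0) c(default, x[-length(x)])

indicator <- c(-0.5, 0.3, 0.8, -0.2, 0.1)      # made-up indicator values
raw_signal <- ifelse(indicator >= 0, 1, -1)    # signal known at each day's close
traded_signal <- lag1(raw_signal)              # signal actually usable that day

data.frame(indicator, raw_signal, traded_signal)
```

Each day's traded signal is the previous day's raw signal, so the return of a given day is never multiplied by a signal computed from that same day's price.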
# calculate cumulative return based on defined rule
stock_met <- stock_met %>%
mutate(
returnMACD = returnDaily * signalMACD,
returnSAR = returnDaily * signalSAR,
returnRSI = returnDaily * signalRSI,
cumRetMACD = cumprod(1 + returnMACD) - 1,
cumRetSAR = cumprod(1 + returnSAR) - 1,
cumRetRSI = cumprod(1 + returnRSI) - 1
)
# preparing basic feature and target
stockTrain <- stock_met %>%
select(ref.date, returnDaily, matches("i_")) %>%
mutate(trend = ifelse(returnDaily >= 0, "Up", "Down") %>% as.factor()) %>%
select(-returnDaily)
# prepare time aspect features
stockTrain <- stockTrain %>%
mutate(
timeDays = day(ref.date) %>% as.factor(),
timeMonths = month(ref.date) %>% as.factor(),
timeWeekdays = weekdays(ref.date) %>% as.factor(),
timeYears = year(ref.date) %>% as.numeric()
) %>%
select(-ref.date)
head(stockTrain)
# readjust using lag, then drop all observation containing NA
stockTrain <- stockTrain %>%
mutate_at(vars(-trend), dplyr::lag) %>%
drop_na()
head(stockTrain)
# split train:test
trainRow <- 1:(nrow(stockTrain) - 28)
# reserve a test dataset
stockTest <- stockTrain %>% dplyr::slice(-trainRow)
# adjust train dataset observation
stockTrain <- stockTrain %>% dplyr::slice(trainRow)
Calculating using trend signal from technical indicators:
# prepare trend dataset
trendData <- stock_met %>%
select(returnDaily, matches("signal")) %>%
mutate(trend = ifelse(returnDaily >= 0, "Up", "Down") %>% as.factor()) %>%
mutate_at(vars(contains("signal")), dplyr::lag) %>%
mutate_at(vars(contains("signal")), ~ as.factor(ifelse(. >= 0, "Up", "Down"))) %>%
rename_at(vars(contains("signal")), ~ sub("signal", "trend", .)) %>%
select(-returnDaily) %>%
drop_na()
# SAR signal prediction vs actual trend
confusionMatrix(
trendData %>% pull(trendSAR),
trendData %>% pull(trend))
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Down Up
#> Down 593 945
#> Up 880 1279
#>
#> Accuracy : 0.5064
#> 95% CI : (0.4901, 0.5226)
#> No Information Rate : 0.6016
#> P-Value [Acc > NIR] : 1.0000
#>
#> Kappa : -0.0222
#>
#> Mcnemar's Test P-Value : 0.1341
#>
#> Sensitivity : 0.4026
#> Specificity : 0.5751
#> Pos Pred Value : 0.3856
#> Neg Pred Value : 0.5924
#> Prevalence : 0.3984
#> Detection Rate : 0.1604
#> Detection Prevalence : 0.4160
#> Balanced Accuracy : 0.4888
#>
#> 'Positive' Class : Down
#>
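To show where the headline statistics come from, the sketch below rebuilds them in base R from the four counts in the table above, with "Down" as the positive class. It is only a hand calculation of the standard definitions, not a replacement for `confusionMatrix()`.

```r
# counts copied from the confusion matrix above (rows = prediction, cols = reference)
cm <- matrix(c(593, 880, 945, 1279), nrow = 2,
             dimnames = list(Prediction = c("Down", "Up"),
                             Reference = c("Down", "Up")))

accuracy <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["Down", "Down"] / sum(cm[, "Down"])   # share of actual Down caught
specificity <- cm["Up", "Up"] / sum(cm[, "Up"])         # share of actual Up caught
pos_pred_value <- cm["Down", "Down"] / sum(cm["Down", ])
no_info_rate <- max(colSums(cm)) / sum(cm)              # always guessing the majority

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, pos_pred_value = pos_pred_value,
        no_info_rate = no_info_rate), 4)
```

Because the accuracy is below the no-information rate, always predicting the majority class would have scored higher than this signal.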
Using Decision Tree model:
# all predictors
treeMod <- ctree(
trend ~ .,
data = stockTrain
)
# using only MACD & SAR indicators; the control values below relax the stopping rules so the tree can grow as deep as possible
treeMACDSAR <- ctree(
trend ~ i_MACD + i_SAR,
data = stockTrain,
control = ctree_control(
mincriterion = 0,
minsplit = 0,
minbucket = 0
)
)
Using Random Forest algorithm:
# training environment for caret (random forest)
# initial window and horizon
initialWindow <- 252
horizon <- 21
# the total number of windows
windowNumber <- nrow(stockTrain) - (initialWindow + horizon) + 1
# set total length to windowNumber + 1
seeds <- vector(mode = "list", length = windowNumber + 1)
# set up training control
trControl <- trainControl(
method = "timeslice",
initialWindow = initialWindow,
horizon = horizon,
fixedWindow = TRUE,
classProbs = TRUE,
summaryFunction = twoClassSummary,
seeds = seeds,
allowParallel = TRUE
)
# random forest grid for mtry
rfGrid <- expand.grid(
mtry = seq(from = 2, to = ncol(stockTrain) - 1, by = 2)
)
# set seeds for random forest grid
for(i in 1:windowNumber) seeds[[i]] <- 1:nrow(rfGrid)
seeds[[(windowNumber + 1)]] <- 1
# register parallel processing
cl <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cl)
# train random forest
rfMod <- train(
y = stockTrain %>% pull(trend),
x = stockTrain %>% select(-trend) %>% as.data.frame(),
method = "rf",
tuneGrid = rfGrid,
metric = "ROC",
trControl = trControl %>% list_modify(seeds = seeds),
preProc = c("center", "scale")
)
# stop parallel processing
stopCluster(cl)
registerDoSEQ()
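The rolling resampling configured above (fixed training window, fixed horizon, origin moved one step at a time) can be pictured with a small base-R sketch. `rolling_windows()` is a hypothetical helper written for illustration and is not caret's own code; the toy sizes are made up.

```r
# Hypothetical helper: index sets for fixed-window rolling-origin resampling
rolling_windows <- function(n, initial_window, horizon) {
  n_windows <- n - (initial_window + horizon) + 1
  lapply(seq_len(n_windows), function(i) {
    list(
      train = i:(i + initial_window - 1),
      test = (i + initial_window):(i + initial_window + horizon - 1)
    )
  })
}

# toy example: 10 observations, train on 5, validate on the next 2
slices <- rolling_windows(n = 10, initial_window = 5, horizon = 2)
length(slices)
slices[[1]]
slices[[length(slices)]]

# number of windows for the sizes used in this article
3669 - (252 + 21) + 1
```

With 3,669 training rows, a 252-day window and a 21-day horizon, this formula gives 3,397 windows, which matches the `windowNumber` computed above and explains the long training time.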
#> $everything
#> user system elapsed
#> 43.64 36.90 1595.47
#>
#> $final
#> user system elapsed
#> 2.28 0.40 13.39
#>
#> $prediction
#> [1] NA NA NA
#> Random Forest
#>
#> 3669 samples
#> 7 predictor
#> 2 classes: 'Down', 'Up'
#>
#> Pre-processing: centered (4), scaled (4), ignore (3)
#> Resampling: Rolling Forecasting Origin Resampling (21 held-out with a fixed window)
#> Summary of sample sizes: 252, 252, 252, 252, 252, 252, ...
#> Resampling results across tuning parameters:
#>
#> mtry ROC Sens Spec
#> 2 0.5439342 0.1106527 0.9062007
#> 4 0.5405758 0.1824732 0.8376693
#> 6 0.5391590 0.2237950 0.8019825
#>
#> ROC was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 2.
Using test data:
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Down Up
#> Down 0 0
#> Up 18 10
#>
#> Accuracy : 0.3571
#> 95% CI : (0.1864, 0.5593)
#> No Information Rate : 0.6429
#> P-Value [Acc > NIR] : 0.9995
#>
#> Kappa : 0
#>
#> Mcnemar's Test P-Value : 6.151e-05
#>
#> Sensitivity : 0.0000
#> Specificity : 1.0000
#> Pos Pred Value : NaN
#> Neg Pred Value : 0.3571
#> Prevalence : 0.6429
#> Detection Rate : 0.0000
#> Detection Prevalence : 0.0000
#> Balanced Accuracy : 0.5000
#>
#> 'Positive' Class : Down
#>
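Rearranging the whole flow from the start:
# stockNew is assumed to already hold the raw price data, with the same columns as stock_data
stockNew <- stockNew %>%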
na.omit() %>%
tq_mutate(
select = price.close,
mutate_fun = dailyReturn,
col_rename = "returnDaily",
type = "log"
) %>%
tq_mutate(
select = price.close,
mutate_fun = MACD,
nFast = 12,
nSlow = 26,
nSig = 9,
maType = EMA,
percent = TRUE
) %>%
mutate(i_MACD = macd - signal) %>%
select(-macd, -signal) %>%
tq_mutate(
select = c(price.high, price.low),
mutate_fun = SAR,
accel = c(0.02, 0.2),
col_rename = "sar"
) %>%
mutate(i_SAR = 100 * (price.close - sar) / sar) %>%
select(-sar) %>%
tq_mutate(
select = price.close,
mutate_fun = RSI,
n = 14,
maType = EMA,
col_rename = "i_RSI"
) %>%
drop_na() %>%
mutate(
timeDays = day(ref.date) %>% as.factor(),
timeMonths = month(ref.date) %>% as.factor(),
timeWeekdays = weekdays(ref.date) %>% as.factor(),
timeYears = year(ref.date) %>% as.numeric()
) %>%
mutate(
trendrf = predict(rfMod, .), # rf
trendtreeMod = predict(treeMod,.), # d.tree
trendtreeMACDSAR = predict(treeMACDSAR, .) # d.tree
) %>%
mutate(
signalMACD = ifelse(i_MACD >= 0, 1, -1) %>% lag(default = 0),
signalSAR = ifelse(i_SAR >= 0, 1, -1) %>% lag(default = 0),
signalRSI = ifelse(i_RSI <= 10, 1, ifelse(i_RSI > 90, -1, 0)) %>% lag(default = 0),
signalrf = ifelse(trendrf == "Up", 1, -1) %>% lag(default = 0),
signaltreeMod = ifelse(trendtreeMod == "Up", 1, -1) %>% lag(default = 0),
signaltreeMACDSAR = ifelse(trendtreeMACDSAR == "Up", 1, -1) %>% lag(default = 0),
returnMACD = returnDaily * signalMACD,
returnSAR = returnDaily * signalSAR,
returnRSI = returnDaily * signalRSI,
returnrf = returnDaily * signalrf,
returntreeMod = returnDaily * signaltreeMod,
returntreeMACDSAR = returnDaily * signaltreeMACDSAR,
cumRetDaily = cumprod(1 + returnDaily) - 1,
cumRetMACD = cumprod(1 + returnMACD) - 1,
cumRetSAR = cumprod(1 + returnSAR) - 1,
cumRetRSI = cumprod(1 + returnRSI) - 1,
cumRetrf = cumprod(1 + returnrf) - 1,
cumRettreeMod = cumprod(1 + returntreeMod) - 1,
cumRettreeMACDSAR = cumprod(1 + returntreeMACDSAR) - 1)
#> returnDaily returnMACD returnSAR returnRSI returnrf
#> Cumulative Return 6.403238 -0.9596904 -0.8299039 0.1874724 2.753114e+20
#> returntreeMod returntreeMACDSAR
#> Cumulative Return 6.403238 3.124661e+20
# convert to long table
stockNewLong <- stockNew %>%
gather(key = algorithms, value = returnAlgo, matches("return")) %>%
group_by(algorithms)
# annualized return
stockNewLong %>%
tq_performance(
Ra = returnAlgo,
performance_fun = Return.annualized,
scale = 252,
geometric = TRUE
)
# sharpe ratio
stockNewLong %>%
tq_performance(
Ra = returnAlgo,
performance_fun = SharpeRatio.annualized,
scale = 252,
geometric = TRUE
)
Confusion matrices for the full-feature decision tree and the random forest:
# select trend predictions data only
trendData_n <- stockNew %>%
mutate(trend = ifelse(returnDaily >= 0, "Up", "Down") %>% as.factor()) %>%
select(trend, matches("trend")) %>%
mutate_at(vars(contains("trend"), -trend), dplyr::lag) %>%
drop_na()
# confusion matrix for d.tree with full feature
confusionMatrix(
trendData_n %>% pull(trendtreeMod),
trendData_n %>% pull(trend)
)
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Down Up
#> Down 0 0
#> Up 1473 2224
#>
#> Accuracy : 0.6016
#> 95% CI : (0.5856, 0.6174)
#> No Information Rate : 0.6016
#> P-Value [Acc > NIR] : 0.5072
#>
#> Kappa : 0
#>
#> Mcnemar's Test P-Value : <2e-16
#>
#> Sensitivity : 0.0000
#> Specificity : 1.0000
#> Pos Pred Value : NaN
#> Neg Pred Value : 0.6016
#> Prevalence : 0.3984
#> Detection Rate : 0.0000
#> Detection Prevalence : 0.0000
#> Balanced Accuracy : 0.5000
#>
#> 'Positive' Class : Down
#>
# confusion matrix for random forest
confusionMatrix(
trendData_n %>% pull(trendrf),
trendData_n %>% pull(trend)
)
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Down Up
#> Down 1455 0
#> Up 18 2224
#>
#> Accuracy : 0.9951
#> 95% CI : (0.9923, 0.9971)
#> No Information Rate : 0.6016
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.9898
#>
#> Mcnemar's Test P-Value : 6.151e-05
#>
#> Sensitivity : 0.9878
#> Specificity : 1.0000
#> Pos Pred Value : 1.0000
#> Neg Pred Value : 0.9920
#> Prevalence : 0.3984
#> Detection Rate : 0.3936
#> Detection Prevalence : 0.3936
#> Balanced Accuracy : 0.9939
#>
#> 'Positive' Class : Down
#>
Based on the cumulative returns above, the random forest or the MACD-SAR decision tree model can be used, although those returns are computed on the same data the models were trained on.
An out-of-sample test is needed for a more reliable model evaluation. This random forest model uses the latest data as its sample data, so it cannot be evaluated on out-of-sample data. To allow this, train on older data for the asset: for example, use data from before 2017 as the sample (train-validation) data and the 2017-2020 data as the out-of-sample (test) data set.
If you do, obtain the test data set separately and apply the same pre-processing to it before predicting. You can follow the rearranged flow above, using the test data set in place of the stock data used in this article.
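A date-based split along those lines might look like the following base-R sketch. The simulated data frame, the column values and the exact cutoff date are assumptions for illustration; with real data, `stock` would be the pre-processed price table.

```r
set.seed(42)
dates <- seq(as.Date("2015-01-01"), as.Date("2019-12-31"), by = "day")
stock <- data.frame(
  ref.date = dates,
  returnDaily = rnorm(length(dates), sd = 0.01)   # made-up returns
)

cutoff <- as.Date("2017-01-01")   # hypothetical boundary between the two sets

in_sample <- stock[stock$ref.date < cutoff, ]     # train-validation data
out_sample <- stock[stock$ref.date >= cutoff, ]   # held back for the final test

range(in_sample$ref.date)
range(out_sample$ref.date)
```

Keeping the split strictly chronological means no observation in the test set precedes any observation used for training.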
I hope this explanation helps you!
Go to algotech.netlify.com for more technical articles on data science and algorit.ma for awesome Algoritma Data Science Academy in Jakarta, Indonesia.