This article demonstrates semi-automatic algorithmic trading in R: predicting whether a stock's price trend will go up or down, to support trading decisions.

Libraries

pacman::p_load(bizdays, 
               caret, 
               doParallel, 
               parallel, 
               partykit, 
               randomForest, 
               ROCR, 
               tidyquant, 
               tidyverse, 
               timeDate, 
               BatchGetSymbols)

Obtain Stock Data

# tickers
tickers <- c("BBCA.JK") # BCA

# dates
first_date <- "2005-04-01" # about the last 15 years
last_date <- Sys.Date()

# additional
thresh_bad_data <- 0.95 # remove asset with >5% NA
bench_ticker <- "^GSPC" # benchmark asset -> main stock index
cache_folder <- "data/BGS_Cache" # cache folder

l_out <- BatchGetSymbols(tickers = tickers,
                         first.date = first_date,
                         last.date = last_date,
                         bench.ticker = bench_ticker,
                         thresh.bad.data = thresh_bad_data,
                         cache.folder = cache_folder) # use the cache folder defined above

stock_data <- l_out$df.tickers
df_control <- l_out$df.control

# save data
# write.csv(stock_data, "data/stock_data.csv", row.names = FALSE)

Read Data

# read data
stock_data <- read_csv("data/stock_data.csv")
glimpse(stock_data)

#> Observations: 3,747
#> Variables: 10
#> $ price.open          <dbl> 1700.0, 1712.5, 1725.0, 1700.0, 1725.0, 1712.5,...
#> $ price.high          <dbl> 1725.0, 1725.0, 1725.0, 1725.0, 1725.0, 1712.5,...
#> $ price.low           <dbl> 1687.5, 1700.0, 1700.0, 1687.5, 1662.5, 1687.5,...
#> $ price.close         <dbl> 1700.0, 1725.0, 1712.5, 1725.0, 1712.5, 1712.5,...
#> $ volume              <dbl> 103252000, 34434000, 49406000, 83410000, 264260...
#> $ price.adjusted      <dbl> 1147.050, 1163.919, 1155.485, 1163.919, 1155.48...
#> $ ref.date            <date> 2005-04-01, 2005-04-04, 2005-04-05, 2005-04-06...
#> $ ticker              <chr> "BBCA.JK", "BBCA.JK", "BBCA.JK", "BBCA.JK", "BB...
#> $ ret.adjusted.prices <dbl> NA, 0.014706009, -0.007245967, 0.007298854, -0....
#> $ ret.closing.prices  <dbl> NA, 0.014705882, -0.007246377, 0.007299270, -0....

colSums(is.na(stock_data))

#>          price.open          price.high           price.low         price.close 
#>                   8                   8                   8                   8 
#>              volume      price.adjusted            ref.date              ticker 
#>                   8                   8                   0                   0 
#> ret.adjusted.prices  ret.closing.prices 
#>                  16                  16

# data cleaning
stock_data <- stock_data %>% 
  na.omit()

head(stock_data)

# graph visualization 
tail(stock_data,7*4*6) %>% 
  ggplot(aes(x = ref.date, y = price.close)) +
  geom_candlestick(aes(open = price.open, 
                       high = price.high, 
                       low = price.low, 
                       close = price.close)) +
  
  # exponential moving averages (12 and 26 periods)
  geom_ma(ma_fun = EMA, n = 12, colour = "blue", lwd = 0.7) +
  geom_ma(ma_fun = EMA, n = 26, colour = "red", lwd = 0.7) +
  labs(title = "BBCA.JK Stock, Last 168 Trading Days",
       x = "Date",
       y = "Price Close") +
  theme_tq()

Data Pre-processing

To predict the trend of an asset, we can use technical indicators. A technical indicator is a data series that summarizes information about an asset, derived by applying a particular formula to the asset's price data, i.e. OHLCV (open, high, low, close, volume). Three commonly used technical indicators are:

Moving Average Convergence/Divergence (MACD)
- consists of the MACD line and the Signal line.
- a positive MACD value suggests a possible upward trend, and a negative value a possible downward trend.
- when the difference (MACD line - Signal line) reaches zero, the trend of the MACD line tends to change. This difference is also called the MACD oscillator.

# calculate MACD(12,26,9)
stock_ti <- stock_data %>%
  tq_mutate(
    select = price.close,
    mutate_fun = MACD,
    nFast = 12,
    nSlow = 26,
    nSig = 9, # signal
    maType = EMA,
    percent = TRUE
  ) %>%
  mutate(i_MACD = macd - signal) %>%
  select(-macd, -signal)

# MACD plot
tail(stock_ti, 7*4*6) %>% 
  ggplot(aes(x = ref.date, y = i_MACD)) +
  geom_line(aes(y = i_MACD))
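To make the MACD oscillator less of a black box, here is a minimal base-R sketch of the same idea on synthetic prices. The helper `ema()` and the simulated `price` series are my own assumptions for illustration; TTR's `EMA` and `MACD` may differ in details such as how the average is seeded.

```r
# Hand-rolled EMA (hypothetical helper): seeded with the simple mean of the
# first n values, then smoothed with alpha = 2 / (n + 1). Leading values are NA.
ema <- function(x, n) {
  out <- rep(NA_real_, length(x))
  out[n] <- mean(x[1:n])
  alpha <- 2 / (n + 1)
  for (i in (n + 1):length(x)) {
    out[i] <- alpha * x[i] + (1 - alpha) * out[i - 1]
  }
  out
}

set.seed(1)
price <- 100 + cumsum(rnorm(120))   # synthetic closing prices (assumption)

# MACD line in percent form: relative gap between the fast and slow EMA
macd_line <- 100 * (ema(price, 12) / ema(price, 26) - 1)

# Signal line: 9-period EMA of the MACD line, computed on its non-NA part
valid  <- which(!is.na(macd_line))
signal <- rep(NA_real_, length(price))
signal[valid] <- ema(macd_line[valid], 9)

# Oscillator: the quantity stored as i_MACD in this article
oscillator <- macd_line - signal

# The first usable value appears at index 26 + 9 - 1 = 34
which(!is.na(oscillator))[1]
```

The long warm-up period is why the first rows of an indicator column are `NA` and have to be dropped before modelling.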

Stop and Reverse (SAR)
- a.k.a. the Parabolic SAR indicator.
- tracks trend changes using an algorithm applied to the high and low price series.

# calculate SAR(0.02, 0.2)
stock_ti <- stock_ti %>%
  tq_mutate(
    select = c(price.high, price.low),
    mutate_fun = SAR,
    accel = c(0.02, 0.2),
    col_rename = "sar") %>% 
  mutate(i_SAR = 100 * (price.close - sar) / sar) %>% 
  select(-sar)

tail(stock_ti, 7*4*6) %>% 
  ggplot(aes(x = ref.date, y = i_SAR)) +
  geom_line()

Relative Strength Index (RSI)
- a momentum oscillator that focuses on detecting overbought and oversold points.
- based on “momentum” rather than trend shifts.
- uses information from price changes, specifically the gains and losses.

stock_ti <- stock_ti %>%
  tq_mutate(
    select = price.close,
    mutate_fun = RSI,
    n = 14, # calculation period
    maType = EMA,
    col_rename = "i_RSI"
  )

tail(stock_ti, 7*4*6) %>% 
  ggplot(aes(x = ref.date, y = i_RSI)) +
  geom_line()
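The RSI formula itself is short: RSI = 100 - 100 / (1 + average gain / average loss). The sketch below uses plain means over the last `n` price changes instead of the EMA smoothing used above, so its values will not match `TTR::RSI` exactly. The function name `rsi_simple` is hypothetical.

```r
# Simplified RSI (assumption: simple means instead of EMA smoothing)
rsi_simple <- function(price, n = 14) {
  change <- diff(price)
  gain   <- pmax(change, 0)    # upward moves, 0 otherwise
  loss   <- pmax(-change, 0)   # downward moves as positive numbers
  out    <- rep(NA_real_, length(price))
  for (i in n:length(change)) {
    avg_gain <- mean(gain[(i - n + 1):i])
    avg_loss <- mean(loss[(i - n + 1):i])
    out[i + 1] <- 100 - 100 / (1 + avg_gain / avg_loss)
  }
  out
}

# A price that only rises has no losses, so RSI sits at its maximum
rsi_simple(1:15)[15]
#> [1] 100

# A price that only falls has no gains, so RSI sits at its minimum
rsi_simple(15:1)[15]
#> [1] 0
```

Values near 100 mean recent moves were almost all gains (overbought), and values near 0 mean almost all losses (oversold).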

Portfolio performance for each asset can be measured in various ways, a few of which are described below:

Cumulative Return: the total return a portfolio accumulates over a period of time.

# daily return data
stock_met <- stock_ti %>%
  tq_mutate(
    select = price.close,
    mutate_fun = dailyReturn,
    col_rename = "returnDaily",
    type = "log"
  ) %>%
  drop_na()

# cumulative return
stock_met %>% 
  tq_transmute(
    select = returnDaily,
    mutate_fun = Return.cumulative
  )

#>                   returnDaily
#> Cumulative Return    6.403238
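One caveat when reading this number: `returnDaily` was computed with `type = "log"`, while compounding with `prod(1 + r) - 1` assumes simple returns. The toy example below (made-up prices, base R only) sketches how the two kinds of return should be accumulated and what happens when they are mixed.

```r
price <- c(100, 110, 99, 120)   # toy prices (assumption)

simple_ret <- price[-1] / price[-length(price)] - 1
log_ret    <- diff(log(price))

# True total return: 120 / 100 - 1 = 0.2
prod(1 + simple_ret) - 1    # compounding simple returns recovers 0.2
exp(sum(log_ret)) - 1       # summing log returns, then converting, also gives 0.2

# Compounding log returns as if they were simple returns understates the result
prod(1 + log_ret) - 1
```

For small daily moves the gap is tiny per day, but it accumulates over thousands of observations, so the cumulative figures in this article are best read as approximations.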

Annualized Return: the constant yearly return that, compounded over the whole period, would give the same total return as the investment.

# annualized return
stock_met %>%
  tq_performance(
    Ra = returnDaily,
    performance_fun = Return.annualized,
    scale = 252,
    geometric = TRUE
  )

Sharpe Ratio: compares the return we earn with the risk we must bear to earn it (return per unit of volatility).

stock_met %>%
  tq_performance(
    Ra = returnDaily,
    performance_fun = SharpeRatio.annualized,
    scale = 252,
    geometric = TRUE
  )
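As a rough sketch of what the two performance measures above compute, assuming a risk-free rate of 0 and 252 trading days per year: the helper names `annualized_return` and `sharpe_annualized` are my own and are not part of PerformanceAnalytics, whose functions offer more options.

```r
# Geometric annualized return: compound all daily returns,
# then rescale the exponent to one year of trading days.
annualized_return <- function(r, scale = 252) {
  prod(1 + r)^(scale / length(r)) - 1
}

# Annualized Sharpe ratio with a risk-free rate of 0 (assumption):
# annualized return divided by annualized standard deviation.
sharpe_annualized <- function(r, scale = 252) {
  annualized_return(r, scale) / (sd(r) * sqrt(scale))
}

set.seed(42)
r <- rnorm(504, mean = 0.0005, sd = 0.01)   # two hypothetical years of daily returns

annualized_return(r)
sharpe_annualized(r)

# Sanity check: a steady +0.1% per day annualizes to 1.001^252 - 1
annualized_return(rep(0.001, 504))
```

A higher Sharpe ratio means more return for each unit of volatility taken on, which makes strategies with different risk levels comparable.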

There are many strategies for algorithmic trading; here we use a long-short strategy, which positions us with the current trend:

long position: buying an asset and holding it in the hope that its price rises. When the price is high enough, the asset is sold for a profit.
short position: selling an asset in the hope that its price falls. When the price is low enough, the asset is bought back for a profit.

We can encode each position as a multiplier on the return [^1]:

If we go long, multiply the return by 1, so a rising price gives a profit and a falling price gives a loss.
If we go short, multiply the return by -1, so a falling price gives a profit and a rising price gives a loss.
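A tiny worked example of this encoding, using made-up returns and positions (both vectors are assumptions for illustration):

```r
daily_return <- c(0.02, -0.01, 0.03, -0.02)   # hypothetical asset returns
position     <- c(1, 1, -1, -1)               # 1 = long, -1 = short

# The strategy return is the asset return multiplied by the position
strategy_return <- daily_return * position
strategy_return
#> [1]  0.02 -0.01 -0.03  0.02
```

On day 3 the asset gained 3% while we were short, so the strategy lost 3%; on day 4 the asset fell 2% while we were short, so the strategy gained 2%.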

After choosing the base strategy, we use feature engineering to prepare trend predictors from the dataset.

# trade signals
stock_met <- stock_met %>%
  mutate(
    signalMACD = ifelse(i_MACD >= 0, 1, -1) %>% 
      lag(default = 0),
    signalRSI = ifelse(i_RSI <= 10, 1, ifelse(i_RSI > 90, -1, 0)) %>% lag(default = 0),
    signalSAR = ifelse(i_SAR >= 0, 1, -1) %>% lag(default = 0)
  )

# The signals are lagged because they are used to predict the trend of the following day. The default argument sets the first day's value to 0 (hold signal): we do not trade on the first day because we have no prior information.
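To see why the lag matters, consider a deliberately naive signal on made-up returns. Everything in this sketch is hypothetical; `c(0, head(x, -1))` is a base-R stand-in for `lag(x, default = 0)`.

```r
daily_return <- c(0.01, -0.02, 0.03, -0.01, 0.02)   # hypothetical returns

# A signal that is only known at the close of each day.
# For illustration it is simply the sign of that day's return.
signal_today <- ifelse(daily_return >= 0, 1, -1)

# Without a lag every trade is "right" by construction (look-ahead bias)
sum(daily_return * signal_today)
#> [1] 0.09

# Shift by one day, padding the first day with 0 (no position)
signal_lagged <- c(0, head(signal_today, -1))
sum(daily_return * signal_lagged)
#> [1] -0.08
```

The unlagged version looks profitable only because it trades on information that was not yet available.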

# calculate cumulative return based on defined rule
stock_met <- stock_met %>%
  mutate(
    returnMACD = returnDaily * signalMACD,
    returnSAR = returnDaily * signalSAR,
    returnRSI = returnDaily * signalRSI,
    cumRetMACD = cumprod(1 + returnMACD) - 1,
    cumRetSAR = cumprod(1 + returnSAR) - 1,
    cumRetRSI = cumprod(1 + returnRSI) - 1
  )

# preparing basic feature and target
stockTrain <- stock_met %>%
  select(ref.date, returnDaily, matches("i_")) %>%
  mutate(trend = ifelse(returnDaily >= 0, "Up", "Down") %>% as.factor()) %>%
  select(-returnDaily)

# prepare time aspect features
stockTrain <- stockTrain %>%
  mutate(
    timeDays = day(ref.date) %>% as.factor(),
    timeMonths = month(ref.date) %>% as.factor(),
    timeWeekdays = weekdays(ref.date) %>% as.factor(),
    timeYears = year(ref.date) %>% as.numeric()
  ) %>%
  select(-ref.date)

head(stockTrain)

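# readjust using lag, so that each row's features come from the previous day,
# then drop all observations containing NA
stockTrain <- stockTrain %>%
  mutate_at(vars(-trend), dplyr::lag) %>% # funs() is deprecated; pass the function directly
  drop_na()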

head(stockTrain)

Cross-validation

# split train:test
trainRow <- 1:(nrow(stockTrain) - 28)

# reserve a test dataset
stockTest <- stockTrain %>% dplyr::slice(-trainRow)

# adjust train dataset observation
stockTrain <- stockTrain %>% dplyr::slice(trainRow)

As a baseline, evaluate the trend signals from the technical indicators on their own:

# prepare trend dataset
trendData <- stock_met %>%
  select(returnDaily, matches("signal")) %>%
  mutate(trend = ifelse(returnDaily >= 0, "Up", "Down") %>% as.factor()) %>%
  mutate_at(vars(contains("signal")), dplyr::lag) %>%
  mutate_at(vars(contains("signal")), ~ as.factor(ifelse(. >= 0, "Up", "Down"))) %>%
  rename_at(vars(contains("signal")), ~ sub("signal", "trend", .)) %>%
  select(-returnDaily) %>%
  drop_na()

# SAR signal prediction vs actual trend
confusionMatrix(
  trendData %>% pull(trendSAR),
  trendData %>% pull(trend))

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction Down   Up
#>       Down  593  945
#>       Up    880 1279
#>                                           
#>                Accuracy : 0.5064          
#>                  95% CI : (0.4901, 0.5226)
#>     No Information Rate : 0.6016          
#>     P-Value [Acc > NIR] : 1.0000          
#>                                           
#>                   Kappa : -0.0222         
#>                                           
#>  Mcnemar's Test P-Value : 0.1341          
#>                                           
#>             Sensitivity : 0.4026          
#>             Specificity : 0.5751          
#>          Pos Pred Value : 0.3856          
#>          Neg Pred Value : 0.5924          
#>              Prevalence : 0.3984          
#>          Detection Rate : 0.1604          
#>    Detection Prevalence : 0.4160          
#>       Balanced Accuracy : 0.4888          
#>                                           
#>        'Positive' Class : Down            
#>
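The headline statistics in this output can be reproduced by hand from the four cell counts, which helps when reading the later confusion matrices. A minimal base-R sketch (the object name `cm` is my own):

```r
# Cell counts copied from the confusion matrix above
cm <- matrix(c(593, 880, 945, 1279), nrow = 2,
             dimnames = list(Prediction = c("Down", "Up"),
                             Reference  = c("Down", "Up")))

accuracy    <- sum(diag(cm)) / sum(cm)                 # correct / all
nir         <- max(colSums(cm)) / sum(cm)              # always guess the majority class
sensitivity <- cm["Down", "Down"] / sum(cm[, "Down"])  # "Down" is the positive class
specificity <- cm["Up", "Up"] / sum(cm[, "Up"])

round(c(accuracy, nir, sensitivity, specificity), 4)
#> [1] 0.5064 0.6016 0.4026 0.5751
```

Because the accuracy (0.5064) is below the no-information rate (0.6016), this signal does worse than always predicting "Up".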

Model Fitting

Using Decision Tree model:

# all predictors
treeMod <- ctree(
  trend ~ .,
  data = stockTrain
)

# using only MACD & SAR indicator
treeMACDSAR <- ctree(
  trend ~ i_MACD + i_SAR,
  data = stockTrain,
  control = ctree_control(
    mincriterion = 0,
    minsplit = 0,
    minbucket = 0
  )
)

Using Random Forest algorithm:

# training environment for caret (random forest)

# initial window and horizon
initialWindow <- 252
horizon <- 21

# the total number of windows
windowNumber <- nrow(stockTrain) - (initialWindow + horizon) + 1

# set total length to windowNumber + 1
seeds <- vector(mode = "list", length = windowNumber + 1)

# set up training control
trControl <- trainControl(
  method = "timeslice",
  initialWindow = initialWindow,
  horizon = horizon,
  fixedWindow = TRUE,
  classProbs = TRUE,
  summaryFunction = twoClassSummary,
  seeds = seeds,
  allowParallel = TRUE
)

# random forest grid for mtry
rfGrid <- expand.grid(
  mtry = seq(from = 2, to = ncol(stockTrain) - 1, by = 2)
)

# set seeds for random forest grid
for(i in 1:windowNumber) seeds[[i]] <- 1:nrow(rfGrid)
seeds[[(windowNumber + 1)]] <- 1

# register parallel processing
cl <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cl)

# train random forest
rfMod <- train(
  y = stockTrain %>% pull(trend),
  x = stockTrain %>% select(-trend) %>% as.data.frame(),
  method = "rf",
  tuneGrid = rfGrid,
  metric = "ROC",
  trControl = trControl %>% list_modify(seeds = seeds),
  preProc = c("center", "scale")
) 

# stop parallel processing
stopCluster(cl)
registerDoSEQ()

# save random forest model
saveRDS(rfMod, "mods/rfMod.RDS")

# read random forest model
rfMod <- readRDS("mods/rfMod.RDS")

# elapsed time for random forest
rfMod$times

#> $everything
#>    user  system elapsed 
#>   43.64   36.90 1595.47 
#> 
#> $final
#>    user  system elapsed 
#>    2.28    0.40   13.39 
#> 
#> $prediction
#> [1] NA NA NA

# print cross-validated model
rfMod

#> Random Forest 
#> 
#> 3669 samples
#>    7 predictor
#>    2 classes: 'Down', 'Up' 
#> 
#> Pre-processing: centered (4), scaled (4), ignore (3) 
#> Resampling: Rolling Forecasting Origin Resampling (21 held-out with a fixed window) 
#> Summary of sample sizes: 252, 252, 252, 252, 252, 252, ... 
#> Resampling results across tuning parameters:
#> 
#>   mtry  ROC        Sens       Spec     
#>   2     0.5439342  0.1106527  0.9062007
#>   4     0.5405758  0.1824732  0.8376693
#>   6     0.5391590  0.2237950  0.8019825
#> 
#> ROC was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 2.
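The `windowNumber` formula used when building the seeds follows from how rolling-origin resampling slides a fixed training window forward one row at a time. The sketch below builds such windows by hand for a small hypothetical dataset of 300 rows; it only illustrates the indexing and is not caret's own code.

```r
n_obs         <- 300   # hypothetical number of rows
initialWindow <- 252   # training rows per window
horizon       <- 21    # held-out rows per window

# One window per possible starting row
starts  <- 1:(n_obs - initialWindow - horizon + 1)
windows <- lapply(starts, function(s) {
  list(train = s:(s + initialWindow - 1),
       test  = (s + initialWindow):(s + initialWindow + horizon - 1))
})

length(windows)
#> [1] 28
range(windows[[1]]$train)
#> [1]   1 252
range(windows[[28]]$test)
#> [1] 280 300
```

With thousands of rows there are thousands of windows, and every `mtry` value is refit on each one, which is where most of the elapsed training time goes.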

Model Evaluation

Using test data:

# predict test data
confusionMatrix(
  rfMod %>% predict(stockTest),
  stockTest %>% pull(trend)
)

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction Down Up
#>       Down    0  0
#>       Up     18 10
#>                                           
#>                Accuracy : 0.3571          
#>                  95% CI : (0.1864, 0.5593)
#>     No Information Rate : 0.6429          
#>     P-Value [Acc > NIR] : 0.9995          
#>                                           
#>                   Kappa : 0               
#>                                           
#>  Mcnemar's Test P-Value : 6.151e-05       
#>                                           
#>             Sensitivity : 0.0000          
#>             Specificity : 1.0000          
#>          Pos Pred Value :    NaN          
#>          Neg Pred Value : 0.3571          
#>              Prevalence : 0.6429          
#>          Detection Rate : 0.0000          
#>    Detection Prevalence : 0.0000          
#>       Balanced Accuracy : 0.5000          
#>                                           
#>        'Positive' Class : Down            
#>

Rearranging the whole flow from the start:

# read data
stockNew <- read_csv("data/stock_data.csv")
head(stockNew)

stockNew <- stockNew %>%
  na.omit() %>% 
  
  tq_mutate(
    select = price.close,
    mutate_fun = dailyReturn,
    col_rename = "returnDaily",
    type = "log"
  ) %>%
  
  tq_mutate(
    select = price.close,
    mutate_fun = MACD,
    nFast = 12,
    nSlow = 26,
    nSig = 9,
    maType = EMA,
    percent = TRUE
  ) %>%
  mutate(i_MACD = macd - signal) %>%
  select(-macd, -signal) %>%
  
  tq_mutate(
    select = c(price.high, price.low),
    mutate_fun = SAR,
    accel = c(0.02, 0.2),
    col_rename = "sar"
  ) %>%
  mutate(i_SAR = 100 * (price.close - sar) / sar) %>%
  select(-sar) %>%
  
  tq_mutate(
    select = price.close,
    mutate_fun = RSI,
    n = 14,
    maType = EMA,
    col_rename = "i_RSI"
  ) %>%
  
  drop_na() %>%
  
  mutate(
    timeDays = day(ref.date) %>% as.factor(),
    timeMonths = month(ref.date) %>% as.factor(),
    timeWeekdays = weekdays(ref.date) %>% as.factor(),
    timeYears = year(ref.date) %>% as.numeric()
  ) %>%
  
  mutate(
    trendrf = predict(rfMod, .), # rf
    trendtreeMod = predict(treeMod,.), # d.tree
    trendtreeMACDSAR = predict(treeMACDSAR, .) # d.tree
  ) %>%
  
  mutate(
    signalMACD = ifelse(i_MACD >= 0, 1, -1) %>% lag(default = 0),
    signalSAR = ifelse(i_SAR >= 0, 1, -1) %>% lag(default = 0),
    signalRSI = ifelse(i_RSI <= 10, 1, ifelse(i_RSI > 90, -1, 0)) %>% lag(default = 0),
    signalrf = ifelse(trendrf == "Up", 1, -1) %>% lag(default = 0),
    signaltreeMod = ifelse(trendtreeMod == "Up", 1, -1) %>% lag(default = 0),
    signaltreeMACDSAR = ifelse(trendtreeMACDSAR == "Up", 1, -1) %>% lag(default = 0),
    
    returnMACD = returnDaily * signalMACD,
    returnSAR = returnDaily * signalSAR,
    returnRSI = returnDaily * signalRSI,
    returnrf = returnDaily * signalrf,
    returntreeMod = returnDaily * signaltreeMod,
    returntreeMACDSAR = returnDaily * signaltreeMACDSAR,
    
    cumRetDaily = cumprod(1 + returnDaily) - 1,
    cumRetMACD = cumprod(1 + returnMACD) - 1,
    cumRetSAR = cumprod(1 + returnSAR) - 1,
    cumRetRSI = cumprod(1 + returnRSI) - 1,
    cumRetrf = cumprod(1 + returnrf) - 1,
    cumRettreeMod = cumprod(1 + returntreeMod) - 1,
    cumRettreeMACDSAR = cumprod(1 + returntreeMACDSAR) - 1)

stockNew %>%
  tq_transmute(
    select = matches("return"),
    mutate_fun = Return.cumulative
  )

#>                   returnDaily returnMACD  returnSAR returnRSI     returnrf
#> Cumulative Return    6.403238 -0.9596904 -0.8299039 0.1874724 2.753114e+20
#>                   returntreeMod returntreeMACDSAR
#> Cumulative Return      6.403238      3.124661e+20

# convert to long table
stockNewLong <- stockNew %>%
  gather(key = algorithms, value = returnAlgo, matches("return")) %>%
  group_by(algorithms)

# annualized return
stockNewLong %>%
  tq_performance(
    Ra = returnAlgo,
    performance_fun = Return.annualized,
    scale = 252,
    geometric = TRUE
  )

# sharpe ratio
stockNewLong %>%
  tq_performance(
    Ra = returnAlgo,
    performance_fun = SharpeRatio.annualized,
    scale = 252,
    geometric = TRUE
  )

Confusion matrices for the full-feature decision tree and the random forest:

# select trend predictions data only
trendData_n <- stockNew %>%
  mutate(trend = ifelse(returnDaily >= 0, "Up", "Down") %>% as.factor()) %>%
  select(trend, matches("trend")) %>%
  mutate_at(vars(contains("trend"), -trend), dplyr::lag) %>%
  drop_na()

# confusion matrix for d.tree with full feature
confusionMatrix(
  trendData_n %>% pull(trendtreeMod),
  trendData_n %>% pull(trend)
)

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction Down   Up
#>       Down    0    0
#>       Up   1473 2224
#>                                           
#>                Accuracy : 0.6016          
#>                  95% CI : (0.5856, 0.6174)
#>     No Information Rate : 0.6016          
#>     P-Value [Acc > NIR] : 0.5072          
#>                                           
#>                   Kappa : 0               
#>                                           
#>  Mcnemar's Test P-Value : <2e-16          
#>                                           
#>             Sensitivity : 0.0000          
#>             Specificity : 1.0000          
#>          Pos Pred Value :    NaN          
#>          Neg Pred Value : 0.6016          
#>              Prevalence : 0.3984          
#>          Detection Rate : 0.0000          
#>    Detection Prevalence : 0.0000          
#>       Balanced Accuracy : 0.5000          
#>                                           
#>        'Positive' Class : Down            
#>

# confusion matrix for random forest
confusionMatrix(
  trendData_n %>% pull(trendrf),
  trendData_n %>% pull(trend)
)

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction Down   Up
#>       Down 1455    0
#>       Up     18 2224
#>                                           
#>                Accuracy : 0.9951          
#>                  95% CI : (0.9923, 0.9971)
#>     No Information Rate : 0.6016          
#>     P-Value [Acc > NIR] : < 2.2e-16       
#>                                           
#>                   Kappa : 0.9898          
#>                                           
#>  Mcnemar's Test P-Value : 6.151e-05       
#>                                           
#>             Sensitivity : 0.9878          
#>             Specificity : 1.0000          
#>          Pos Pred Value : 1.0000          
#>          Neg Pred Value : 0.9920          
#>              Prevalence : 0.3984          
#>          Detection Rate : 0.3936          
#>    Detection Prevalence : 0.3936          
#>       Balanced Accuracy : 0.9939          
#>                                           
#>        'Positive' Class : Down            
#>

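By these numbers, the random forest or the MACD-SAR decision tree could be used. Keep in mind, though, that almost every row scored here was also used to fit the models, so the 0.9951 accuracy is an in-sample figure; on the 28 held-out test rows the random forest reached only 0.3571.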

An out-of-sample test is needed for a more reliable model evaluation. This random forest model was trained on data up to the most recent observations, so almost no unseen data is left for out-of-sample prediction. One way around this is to train on older data: for example, use data from before 2017 as the sample (train-validation) data and the 2017-2020 data as the out-of-sample (test) set.

If you do, obtain the test data set separately and apply the same pre-processing to it before predicting. You can follow the rearranged flow and use the test data set instead of the stock data I used in this article.
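A minimal sketch of such a date-based split, on a tiny made-up data frame. The column names mirror the ones used in this article, but the `toy` data and the cutoff date are assumptions.

```r
set.seed(7)
toy <- data.frame(
  ref.date = seq(as.Date("2016-12-26"), by = "day", length.out = 10),
  trend    = sample(c("Up", "Down"), 10, replace = TRUE)
)

cutoff <- as.Date("2017-01-01")   # assumed split date

# Everything before the cutoff is used for training and validation,
# everything from the cutoff onward is kept aside as out-of-sample data.
train_val     <- toy[toy$ref.date <  cutoff, ]
out_of_sample <- toy[toy$ref.date >= cutoff, ]

nrow(train_val)
#> [1] 6
nrow(out_of_sample)
#> [1] 4
```

Splitting by date rather than at random keeps the test period strictly after the training period, which matches how the model would be used in practice.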

I hope this explanation helps you!

Go to algotech.netlify.com for more technical articles on data science and algorit.ma for awesome Algoritma Data Science Academy in Jakarta, Indonesia.

Stock Trend Prediction

Nabiilah Ardini Fauziyyah

27/4/2020