This article demonstrates semi-automatic algorithmic trading using R: predicting upward or downward trends in the stock market to support trading decisions.



Read Data

#> Observations: 3,747
#> Variables: 10
#> $ price.open          <dbl> 1700.0, 1712.5, 1725.0, 1700.0, 1725.0, 1712.5,...
#> $ price.high          <dbl> 1725.0, 1725.0, 1725.0, 1725.0, 1725.0, 1712.5,...
#> $ price.low           <dbl> 1687.5, 1700.0, 1700.0, 1687.5, 1662.5, 1687.5,...
#> $ price.close         <dbl> 1700.0, 1725.0, 1712.5, 1725.0, 1712.5, 1712.5,...
#> $ volume              <dbl> 103252000, 34434000, 49406000, 83410000, 264260...
#> $ price.adjusted      <dbl> 1147.050, 1163.919, 1155.485, 1163.919, 1155.48...
#> $ ref.date            <date> 2005-04-01, 2005-04-04, 2005-04-05, 2005-04-06...
#> $ ticker              <chr> "BBCA.JK", "BBCA.JK", "BBCA.JK", "BBCA.JK", "BB...
#> $ ret.adjusted.prices <dbl> NA, 0.014706009, -0.007245967, 0.007298854, -0....
#> $ ret.closing.prices  <dbl> NA, 0.014705882, -0.007246377, 0.007299270, -0....
#>          price.open          price.high           price.low         price.close 
#>                   8                   8                   8                   8 
#>              volume      price.adjusted            ref.date              ticker 
#>                   8                   8                   0                   0 
#> ret.adjusted.prices  ret.closing.prices 
#>                  16                  16
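
The second output above looks like a per-column count of missing values. A minimal sketch of how such a count can be produced in base R, using a small made-up data frame (the object name `stock_toy` and its values are hypothetical, not the article's data):

```r
# Toy data frame standing in for the downloaded price data (values are made up).
stock_toy <- data.frame(
  price.close = c(1700, NA, 1712.5, 1725),
  volume      = c(103252000, NA, 49406000, 83410000),
  ref.date    = as.Date(c("2005-04-01", "2005-04-04", "2005-04-05", "2005-04-06"))
)

# Count missing values per column.
na_count <- colSums(is.na(stock_toy))
print(na_count)

# Drop incomplete rows before computing indicators.
stock_clean <- na.omit(stock_toy)
```

Rows with missing prices are dropped rather than imputed here, which keeps the indicator calculations free of invented prices.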

Data Pre-processing

To predict the trend of an asset, we can use technical indicators. A technical indicator is a data series that summarizes information about an asset, derived by applying a particular formula to the asset's price data, i.e. OHLCV (open, high, low, close, volume). Three commonly used technical indicators are:

  • Moving Average Convergence/Divergence (MACD)
    • consists of a MACD line and a signal line.
    • a positive MACD value indicates a possible upward trend in the future, and vice versa.
    • when the difference (MACD line - signal line) reaches zero, the trend of the MACD line tends to change. This difference is also called the MACD oscillator.

  • Stop and Reverse (SAR)
    • a.k.a. the Parabolic SAR indicator.
    • tracks trend changes using an algorithm applied to the high and low price series.

  • Relative Strength Index (RSI)
    • a momentum oscillator that focuses on detecting overbought and oversold points.
    • based on “momentum” rather than trend shifts.
    • uses information from price changes, particularly the gains and losses.
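
To make the definitions above concrete, here is a minimal base-R sketch of the MACD oscillator and an EMA-smoothed RSI. The helper names (`ema`, `macd_oscillator`, `rsi`) are illustrative assumptions, and the EMA seeding is a simplification, so the values will not exactly match those of a dedicated indicator library. Parabolic SAR is omitted because its recursion is considerably longer.

```r
# Exponential moving average, seeded with the first observation
# (a simplification; libraries may seed the average differently).
ema <- function(x, n) {
  alpha <- 2 / (n + 1)
  out <- numeric(length(x))
  out[1] <- x[1]
  for (i in seq_along(x)[-1]) {
    out[i] <- alpha * x[i] + (1 - alpha) * out[i - 1]
  }
  out
}

# MACD line = fast EMA - slow EMA; signal line = EMA of the MACD line;
# oscillator = MACD line - signal line.
macd_oscillator <- function(close, n_fast = 12, n_slow = 26, n_sig = 9) {
  macd <- ema(close, n_fast) - ema(close, n_slow)
  signal <- ema(macd, n_sig)
  data.frame(macd = macd, signal = signal, oscillator = macd - signal)
}

# RSI = 100 - 100 / (1 + average gain / average loss), here with EMA smoothing.
# The first element is NA because the first price has no preceding change.
rsi <- function(close, n = 14) {
  change <- diff(close)
  gain <- ema(pmax(change, 0), n)
  loss <- ema(pmax(-change, 0), n)
  c(NA, 100 - 100 / (1 + gain / loss))
}

# First closing prices from the data shown earlier.
close <- c(1700, 1725, 1712.5, 1725, 1712.5, 1712.5)
indicators <- cbind(macd_oscillator(close), rsi = rsi(close))
print(indicators)
```

With only gains and no losses in the window, this RSI evaluates to 100, the "overbought" extreme; a real series needs far more than six observations before the 12-, 26- and 14-period averages become meaningful.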

Portfolio performance for each asset can be measured in various ways, a few of which are listed below:

  • Cumulative Return: the total return accumulated by a portfolio over a period of time.
#>                   returnDaily
#> Cumulative Return    6.403238
  • Annualized Return: the return an investment would earn each year, obtained by scaling the total return to a one-year period.
  • Sharpe Ratio: a measure that compares the return earned against the risk taken to earn it.
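
A minimal base-R sketch of these three measures, assuming a vector of simple periodic returns, 252 trading days per year, and a zero risk-free rate by default. The function names and these defaults are illustrative assumptions, not the article's own code.

```r
# Total compounded return over the whole period.
cumulative_return <- function(r) {
  prod(1 + r) - 1
}

# Geometric annualization: scale the compounded growth to one year.
annualized_return <- function(r, periods_per_year = 252) {
  prod(1 + r)^(periods_per_year / length(r)) - 1
}

# Annualized Sharpe ratio: mean excess return per unit of volatility.
# `rf` is the per-period risk-free rate (assumed zero by default).
sharpe_ratio <- function(r, rf = 0, periods_per_year = 252) {
  excess <- r - rf
  mean(excess) / sd(excess) * sqrt(periods_per_year)
}

# Made-up returns for illustration only.
toy_returns <- c(0.10, -0.05, 0.02)
cumulative_return(toy_returns)
annualized_return(toy_returns)
sharpe_ratio(toy_returns)
```

The cumulative return compounds the returns (1.10 × 0.95 × 1.02 − 1) rather than adding them, so a loss after a gain is applied to the larger balance.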

There are many algorithmic trading strategies; here we use a long-short strategy, which essentially means positioning ourselves according to the current trend:

  • long position: buying an asset and holding it in the expectation that its price will rise. When the price is high enough, the asset is sold for a profit.

  • short position: selling an asset in the expectation that its price will fall. When the price is low enough, the asset can be bought back for a profit.

We can encode each position numerically [^1]:

  • If we go long, multiply the return by 1, so a rising price gives a profit and a falling price gives a loss.

  • If we go short, multiply the return by -1, so a falling price gives a profit and a rising price gives a loss.
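
The encoding above amounts to an element-wise product of the asset's return and the position held. A minimal sketch with made-up numbers (the vector names are illustrative):

```r
# Daily returns of the asset (made-up values).
daily_return <- c(0.01, -0.02, 0.015, -0.005)

# Position held each day: 1 = long, -1 = short.
position <- c(1, 1, -1, -1)

# Strategy return: long keeps the sign of the return, short flips it.
strategy_return <- daily_return * position
print(strategy_return)
```

On the third day the price rises by 1.5% while the position is short, so the strategy loses 1.5%; on the fourth day the price falls by 0.5% and the short position gains 0.5%.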

After picking the base algorithm, we can use feature engineering to prepare the trend predictors in the dataset.
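
As one example of such feature engineering, calendar features can be derived from the trading date. The sketch below uses base R only; the helper name `calendar_features` is hypothetical, while the column names follow those used in the article's pipeline.

```r
# Derive calendar features from a vector of trading dates.
calendar_features <- function(dates) {
  data.frame(
    timeDays     = factor(as.integer(format(dates, "%d"))),  # day of month
    timeMonths   = factor(as.integer(format(dates, "%m"))),  # month number
    timeWeekdays = factor(weekdays(dates)),                  # locale-dependent day name
    timeYears    = as.numeric(format(dates, "%Y"))           # year as a number
  )
}

# First trading dates from the data shown earlier.
dates <- as.Date(c("2005-04-01", "2005-04-04", "2005-04-05"))
features <- calendar_features(dates)
print(features)
```

Day, month and weekday are stored as factors so a tree-based model treats them as categories, while the year stays numeric.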

Cross-validation

Calculated using the trend signal from the technical indicators:

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction Down   Up
#>       Down  593  945
#>       Up    880 1279
#>                                           
#>                Accuracy : 0.5064          
#>                  95% CI : (0.4901, 0.5226)
#>     No Information Rate : 0.6016          
#>     P-Value [Acc > NIR] : 1.0000          
#>                                           
#>                   Kappa : -0.0222         
#>                                           
#>  Mcnemar's Test P-Value : 0.1341          
#>                                           
#>             Sensitivity : 0.4026          
#>             Specificity : 0.5751          
#>          Pos Pred Value : 0.3856          
#>          Neg Pred Value : 0.5924          
#>              Prevalence : 0.3984          
#>          Detection Rate : 0.1604          
#>    Detection Prevalence : 0.4160          
#>       Balanced Accuracy : 0.4888          
#>                                           
#>        'Positive' Class : Down            
#> 
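
The headline statistics in this output can be reproduced by hand from the four cell counts, which helps when reading the later matrices. A minimal base-R sketch (variable names are illustrative; "Down" is the positive class, as stated in the output):

```r
# Cell counts copied from the confusion matrix above
# (rows = prediction, columns = reference).
cm <- matrix(
  c(593, 880, 945, 1279),
  nrow = 2,
  dimnames = list(Prediction = c("Down", "Up"), Reference = c("Down", "Up"))
)

n <- sum(cm)
accuracy    <- sum(diag(cm)) / n                       # correct predictions / all
sensitivity <- cm["Down", "Down"] / sum(cm[, "Down"])  # true Down found
specificity <- cm["Up", "Up"] / sum(cm[, "Up"])        # true Up found
no_info_rate <- max(colSums(cm)) / n                   # always guess the majority class
balanced_accuracy <- (sensitivity + specificity) / 2

# Cohen's kappa: agreement beyond what chance alone would produce.
expected_agreement <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected_agreement) / (1 - expected_agreement)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, no_info_rate = no_info_rate,
        balanced_accuracy = balanced_accuracy, kappa = cohen_kappa), 4)
```

Because the accuracy (0.5064) is below the no-information rate (0.6016), the indicator signal on its own does worse than always predicting "Up".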

Model Fitting

Using a decision tree model:

Using the random forest algorithm:

#> $everything
#>    user  system elapsed 
#>   43.64   36.90 1595.47 
#> 
#> $final
#>    user  system elapsed 
#>    2.28    0.40   13.39 
#> 
#> $prediction
#> [1] NA NA NA
#> Random Forest 
#> 
#> 3669 samples
#>    7 predictor
#>    2 classes: 'Down', 'Up' 
#> 
#> Pre-processing: centered (4), scaled (4), ignore (3) 
#> Resampling: Rolling Forecasting Origin Resampling (21 held-out with a fixed window) 
#> Summary of sample sizes: 252, 252, 252, 252, 252, 252, ... 
#> Resampling results across tuning parameters:
#> 
#>   mtry  ROC        Sens       Spec     
#>   2     0.5439342  0.1106527  0.9062007
#>   4     0.5405758  0.1824732  0.8376693
#>   6     0.5391590  0.2237950  0.8019825
#> 
#> ROC was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 2.
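
The resampling line in this output describes rolling forecasting origin resampling: a fixed 252-observation training window followed by a 21-observation hold-out, slid forward through time. The base-R sketch below only illustrates how such index slices are laid out; the function name and the `step` argument are assumptions, and caret's own `createTimeSlices()` serves this purpose in practice.

```r
# Build train/test index pairs for a fixed-width rolling window.
# `step` controls how far the window moves between slices (assumed 1 here).
rolling_origin_slices <- function(n, window = 252, horizon = 21, step = 1) {
  last_start <- n - window - horizon + 1
  stopifnot(last_start >= 1)
  starts <- seq(1, last_start, by = step)
  lapply(starts, function(s) {
    list(
      train = s:(s + window - 1),
      test  = (s + window):(s + window + horizon - 1)
    )
  })
}

slices <- rolling_origin_slices(n = 300)
length(slices)           # number of train/test pairs
range(slices[[1]]$train) # first training window
range(slices[[1]]$test)  # first hold-out window
```

Each hold-out window lies strictly after its training window, so the model is never validated on observations older than the ones it was trained on.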

Model Evaluation

Using test data:

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction Down Up
#>       Down    0  0
#>       Up     18 10
#>                                           
#>                Accuracy : 0.3571          
#>                  95% CI : (0.1864, 0.5593)
#>     No Information Rate : 0.6429          
#>     P-Value [Acc > NIR] : 0.9995          
#>                                           
#>                   Kappa : 0               
#>                                           
#>  Mcnemar's Test P-Value : 6.151e-05       
#>                                           
#>             Sensitivity : 0.0000          
#>             Specificity : 1.0000          
#>          Pos Pred Value :    NaN          
#>          Neg Pred Value : 0.3571          
#>              Prevalence : 0.6429          
#>          Detection Rate : 0.0000          
#>    Detection Prevalence : 0.0000          
#>       Balanced Accuracy : 0.5000          
#>                                           
#>        'Positive' Class : Down            
#> 

Rearranging the flow from the start:

# Packages assumed by this pipeline: dplyr/tidyr verbs, tidyquant's tq_mutate(),
# and lubridate's date helpers.
library(tidyverse)
library(tidyquant)
library(lubridate)

stockNew <- stockNew %>%
  na.omit() %>%

  # daily log return of the closing price
  tq_mutate(
    select = price.close,
    mutate_fun = dailyReturn,
    col_rename = "returnDaily",
    type = "log"
  ) %>%

  # MACD oscillator: MACD line minus signal line
  tq_mutate(
    select = price.close,
    mutate_fun = MACD,
    nFast = 12,
    nSlow = 26,
    nSig = 9,
    maType = EMA,
    percent = TRUE
  ) %>%
  mutate(i_MACD = macd - signal) %>%
  select(-macd, -signal) %>%

  # Parabolic SAR, expressed as the percentage distance of the close from the SAR
  tq_mutate(
    select = c(price.high, price.low),
    mutate_fun = SAR,
    accel = c(0.02, 0.2),
    col_rename = "sar"
  ) %>%
  mutate(i_SAR = 100 * (price.close - sar) / sar) %>%
  select(-sar) %>%

  # 14-period RSI with EMA smoothing
  tq_mutate(
    select = price.close,
    mutate_fun = RSI,
    n = 14,
    maType = EMA,
    col_rename = "i_RSI"
  ) %>%

  # drop the warm-up rows where the indicators are still NA
  drop_na() %>%

  # calendar features
  mutate(
    timeDays = day(ref.date) %>% as.factor(),
    timeMonths = month(ref.date) %>% as.factor(),
    timeWeekdays = weekdays(ref.date) %>% as.factor(),
    timeYears = year(ref.date) %>% as.numeric()
  ) %>%

  # trend predictions from the fitted models
  mutate(
    trendrf = predict(rfMod, .), # random forest
    trendtreeMod = predict(treeMod, .), # decision tree
    trendtreeMACDSAR = predict(treeMACDSAR, .) # decision tree on MACD and SAR
  ) %>%

  mutate(
    # position signals (1 = long, -1 = short, 0 = no position), lagged by one
    # day so that each day's return is earned with the previous day's signal
    signalMACD = ifelse(i_MACD >= 0, 1, -1) %>% lag(default = 0),
    signalSAR = ifelse(i_SAR >= 0, 1, -1) %>% lag(default = 0),
    signalRSI = ifelse(i_RSI <= 10, 1, ifelse(i_RSI > 90, -1, 0)) %>% lag(default = 0),
    signalrf = ifelse(trendrf == "Up", 1, -1) %>% lag(default = 0),
    signaltreeMod = ifelse(trendtreeMod == "Up", 1, -1) %>% lag(default = 0),
    signaltreeMACDSAR = ifelse(trendtreeMACDSAR == "Up", 1, -1) %>% lag(default = 0),

    # strategy return = daily return * position
    returnMACD = returnDaily * signalMACD,
    returnSAR = returnDaily * signalSAR,
    returnRSI = returnDaily * signalRSI,
    returnrf = returnDaily * signalrf,
    returntreeMod = returnDaily * signaltreeMod,
    returntreeMACDSAR = returnDaily * signaltreeMACDSAR,

    # cumulative returns; note that returnDaily is a log return, so compounding
    # it with cumprod(1 + r) treats it as a simple return (an approximation)
    cumRetDaily = cumprod(1 + returnDaily) - 1,
    cumRetMACD = cumprod(1 + returnMACD) - 1,
    cumRetSAR = cumprod(1 + returnSAR) - 1,
    cumRetRSI = cumprod(1 + returnRSI) - 1,
    cumRetrf = cumprod(1 + returnrf) - 1,
    cumRettreeMod = cumprod(1 + returntreeMod) - 1,
    cumRettreeMACDSAR = cumprod(1 + returntreeMACDSAR) - 1)
#>                   returnDaily returnMACD  returnSAR returnRSI     returnrf
#> Cumulative Return    6.403238 -0.9596904 -0.8299039 0.1874724 2.753114e+20
#>                   returntreeMod returntreeMACDSAR
#> Cumulative Return      6.403238      3.124661e+20
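
The core of the flow above is three steps: turn an indicator into a position signal, lag that signal by one day, and compound the resulting strategy returns. A minimal base-R sketch with made-up numbers (the helper `lag_signal` and all values are illustrative, not taken from the article's data):

```r
# Shift a signal one step forward in time, filling the first day with `default`.
lag_signal <- function(signal, default = 0) {
  c(default, head(signal, -1))
}

indicator    <- c(0.5, -0.2, 0.3, 0.1)      # e.g. an oscillator value per day
daily_return <- c(0.01, -0.02, 0.03, 0.01)  # asset return per day

# Long (1) when the indicator is non-negative, short (-1) otherwise,
# then lagged so each day trades on the previous day's signal.
signal <- lag_signal(ifelse(indicator >= 0, 1, -1))

strategy_return <- daily_return * signal
cum_return <- cumprod(1 + strategy_return) - 1
print(data.frame(signal, strategy_return, cum_return))
```

Lagging matters: without it, each day's position would be chosen using an indicator value that already includes that day's closing price.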

Confusion matrices for rf and treeMACDSAR:

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction Down   Up
#>       Down    0    0
#>       Up   1473 2224
#>                                           
#>                Accuracy : 0.6016          
#>                  95% CI : (0.5856, 0.6174)
#>     No Information Rate : 0.6016          
#>     P-Value [Acc > NIR] : 0.5072          
#>                                           
#>                   Kappa : 0               
#>                                           
#>  Mcnemar's Test P-Value : <2e-16          
#>                                           
#>             Sensitivity : 0.0000          
#>             Specificity : 1.0000          
#>          Pos Pred Value :    NaN          
#>          Neg Pred Value : 0.6016          
#>              Prevalence : 0.3984          
#>          Detection Rate : 0.0000          
#>    Detection Prevalence : 0.0000          
#>       Balanced Accuracy : 0.5000          
#>                                           
#>        'Positive' Class : Down            
#> 
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction Down   Up
#>       Down 1455    0
#>       Up     18 2224
#>                                           
#>                Accuracy : 0.9951          
#>                  95% CI : (0.9923, 0.9971)
#>     No Information Rate : 0.6016          
#>     P-Value [Acc > NIR] : < 2.2e-16       
#>                                           
#>                   Kappa : 0.9898          
#>                                           
#>  Mcnemar's Test P-Value : 6.151e-05       
#>                                           
#>             Sensitivity : 0.9878          
#>             Specificity : 1.0000          
#>          Pos Pred Value : 1.0000          
#>          Neg Pred Value : 0.9920          
#>              Prevalence : 0.3984          
#>          Detection Rate : 0.3936          
#>    Detection Prevalence : 0.3936          
#>       Balanced Accuracy : 0.9939          
#>                                           
#>        'Positive' Class : Down            
#> 

Either the random forest model or the MACD-SAR decision tree model can be used.

An out-of-sample test may be needed for a more reliable model evaluation. This random forest model uses the latest data as its sample data, so it cannot be evaluated on out-of-sample data. This can be addressed by using older data for the asset: for example, use data from before 2017 as the sample (train-validation) data and the 2017-2020 data as the out-of-sample (test) data set.

If you do, the test data set should be obtained separately, and you should apply the same pre-processing to it before predicting. You can follow the rearranged flow above, using the test data set in place of the stock data I used in this article.
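
A minimal sketch of such a date-based split in base R, assuming a data frame with a `ref.date` column like the one in this article. The helper name `split_by_date`, the exact cutoff date, and the toy prices are illustrative assumptions.

```r
# Split a data frame into in-sample and out-of-sample parts by date.
split_by_date <- function(data, date_col = "ref.date",
                          cutoff = as.Date("2017-01-01")) {
  is_train <- data[[date_col]] < cutoff
  list(
    train = data[is_train, , drop = FALSE],   # before the cutoff
    test  = data[!is_train, , drop = FALSE]   # on or after the cutoff
  )
}

# Toy data around the cutoff (prices are made up).
toy <- data.frame(
  ref.date    = as.Date(c("2016-12-29", "2016-12-30", "2017-01-02", "2017-01-03")),
  price.close = c(15400, 15500, 15450, 15600)
)

parts <- split_by_date(toy)
nrow(parts$train)
nrow(parts$test)
```

Splitting by date rather than at random keeps every test observation later than every training observation, which is what an out-of-sample test of a trading model requires.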

I hope this explanation helps you!

Go to algotech.netlify.com for more technical articles on data science and algorit.ma for awesome Algoritma Data Science Academy in Jakarta, Indonesia.