DMwR In Class - Predicting Stock Market Returns

Introduction

Stock market trading is an application area with large potential for data mining, as huge amounts of data are available. Still, some researchers claim that it is impossible to make money out of forecasting future prices (the efficient markets hypothesis). The goal of trading is to maintain a portfolio of assets through buy and sell orders, and to achieve a profit with it.

We will use data mining in a slightly more specific trading context. We will trade a single security - the S&P 500 market index. Given historical price data of this security and an initial capital, we will try to maximize our profit over a future testing period by means of trading actions - buy, sell, hold.

Our trading strategy will base the decisions on the results of a data mining process. This process will try to forecast the future evolution of prices based on historical data. The overall evaluation criteria will be the profit/loss resulting from the trading actions.

This case study will provide information on:

  • Time series modeling
  • Handling regime shifts with windowing mechanisms
  • Support vector machines
  • Multivariate adaptive regression splines
  • Evaluating time series models with the Monte Carlo method
  • Several new evaluation statistics related either to the prediction of rare events or to financial trading performance

We will use a data set of daily OHLC quotes of the S&P 500 index, available in the package DMwR2.

Load Libraries

library(xts)
library(DMwR2)
library(quantmod)
library(TTR)
library(performanceEstimation) 
library(nnet)
library(e1071)
library(kernlab)
library(earth)
library(randomForest)

Import Data

data(GSPC,package="DMwR2") 
first(GSPC)
##            GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume
## 1970-01-02     92.06     93.54    91.79         93     8050000
##            GSPC.Adjusted
## 1970-01-02            93
last(GSPC)
##            GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume
## 2016-01-25   1906.28   1906.28  1875.97    1877.08  4401380000
##            GSPC.Adjusted
## 2016-01-25       1877.08

Note that price data can also be downloaded directly from the web, as the following quantmod example shows.

CompanySymbol = 'MSFT' # changing the ticker dynamically changes the chart
getSymbols.google(CompanySymbol,from='2010-01-01',env=.GlobalEnv) # get stock price data
## [1] "MSFT"
plot(Cl(get(CompanySymbol)),main=paste("Closing prices of", CompanySymbol)) # Cl gets the closing prices

getFinancials("AAPL") # get financial data
## [1] "AAPL.f"
viewFin(AAPL.f)
## Annual Balance Sheet for AAPL
##                                              2016-09-24 2015-09-26
## Cash & Equivalents                                   NA         NA
## Short Term Investments                         58554.00   30212.00
## Cash and Short Term Investments                67155.00   41601.00
## Accounts Receivable - Trade, Net               15754.00   16849.00
## Receivables - Other                                  NA         NA
## Total Receivables, Net                         29299.00   30343.00
## Total Inventory                                 2132.00    2349.00
## Prepaid Expenses                                     NA         NA
## Other Current Assets, Total                     8283.00   15085.00
## Total Current Assets                          106869.00   89378.00
## Property/Plant/Equipment, Total - Gross        61245.00   49257.00
## Accumulated Depreciation, Total               -34235.00  -26786.00
## Goodwill, Net                                   5414.00    5116.00
## Intangibles, Net                                3206.00    3893.00
## Long Term Investments                         170430.00  164065.00
## Other Long Term Assets, Total                   8757.00    5422.00
## Total Assets                                  321686.00  290345.00
## Accounts Payable                               37294.00   35490.00
## Accrued Expenses                               20951.00   24169.00
## Notes Payable/Short Term Debt                   8105.00    8499.00
## Current Port. of LT Debt/Capital Leases         3500.00    2500.00
## Other Current liabilities, Total                9156.00    9952.00
## Total Current Liabilities                      79006.00   80610.00
## Long Term Debt                                 75427.00   53329.00
## Capital Lease Obligations                            NA         NA
## Total Long Term Debt                           75427.00   53329.00
## Total Debt                                     87032.00   64328.00
## Deferred Income Tax                            26019.00   24062.00
## Minority Interest                                    NA         NA
## Other Liabilities, Total                       12985.00   12989.00
## Total Liabilities                             193437.00  170990.00
## Redeemable Preferred Stock, Total                    NA         NA
## Preferred Stock - Non Redeemable, Net                NA         NA
## Common Stock, Total                            31251.00   27416.00
## Additional Paid-In Capital                           NA         NA
## Retained Earnings (Accumulated Deficit)        96364.00   92284.00
## Treasury Stock - Common                              NA         NA
## Other Equity, Total                              596.00   -1117.00
## Total Equity                                  128249.00  119355.00
## Total Liabilities & Shareholders' Equity     321686.00  290345.00
## Shares Outs - Common Stock Primary Issue             NA         NA
## Total Common Shares Outstanding                 5336.17    5578.75
##                                              2014-09-27 2013-09-28
## Cash & Equivalents                                   NA         NA
## Short Term Investments                         14845.00   31841.00
## Cash and Short Term Investments                25077.00   40546.00
## Accounts Receivable - Trade, Net               17460.00   13102.00
## Receivables - Other                                  NA         NA
## Total Receivables, Net                         27219.00   20641.00
## Total Inventory                                 2111.00    1764.00
## Prepaid Expenses                                     NA         NA
## Other Current Assets, Total                    14124.00   10335.00
## Total Current Assets                           68531.00   73286.00
## Property/Plant/Equipment, Total - Gross        39015.00   28519.00
## Accumulated Depreciation, Total               -18391.00  -11922.00
## Goodwill, Net                                   4616.00    1577.00
## Intangibles, Net                                4142.00    4179.00
## Long Term Investments                         130162.00  106215.00
## Other Long Term Assets, Total                   3764.00    5146.00
## Total Assets                                  231839.00  207000.00
## Accounts Payable                               30196.00   22367.00
## Accrued Expenses                                7689.00    4782.00
## Notes Payable/Short Term Debt                   6308.00       0.00
## Current Port. of LT Debt/Capital Leases              NA         NA
## Other Current liabilities, Total               19255.00   16509.00
## Total Current Liabilities                      63448.00   43658.00
## Long Term Debt                                 28987.00   16960.00
## Capital Lease Obligations                            NA         NA
## Total Long Term Debt                           28987.00   16960.00
## Total Debt                                     35295.00   16960.00
## Deferred Income Tax                            20259.00   16489.00
## Minority Interest                                    NA         NA
## Other Liabilities, Total                        7598.00    6344.00
## Total Liabilities                             120292.00   83451.00
## Redeemable Preferred Stock, Total                    NA         NA
## Preferred Stock - Non Redeemable, Net                NA         NA
## Common Stock, Total                            23313.00   19764.00
## Additional Paid-In Capital                           NA         NA
## Retained Earnings (Accumulated Deficit)        87152.00  104256.00
## Treasury Stock - Common                              NA         NA
## Other Equity, Total                             -282.00    -296.00
## Total Equity                                  111547.00  123549.00
## Total Liabilities & Shareholders' Equity     231839.00  207000.00
## Shares Outs - Common Stock Primary Issue             NA         NA
## Total Common Shares Outstanding                 5866.16    6294.37
## attr(,"col_desc")
## [1] "As of 2016-09-24" "As of 2015-09-26" "As of 2014-09-27"
## [4] "As of 2013-09-28"

Define the Predictive Task

The trading strategies we will use assume that we can obtain a prediction of the tendency of the market in the next few days. Based on this prediction, we will place orders that will be profitable if the tendency is confirmed in the future.

Let us assume that if prices vary by more than p%, we consider the movement worthwhile in terms of trading (e.g., it covers transaction costs). In this context, we want our prediction models to forecast whether this margin is attainable in the next k days. This means that predicting a particular quote for a specific future time t+k might not be the best idea. In effect, what we want is a prediction of the overall dynamics of the price in the next k days, and this is not captured by a particular price at a specific time.

For instance, the closing price at time t+k may represent a variation much lower than p%, but it could have been preceded by a period of prices representing variations much higher than p% within the window t···t+k. So, what we want in effect is to have a good prediction of the overall tendency of the prices in the next k days.

We will describe a variable, calculated with the quotes data, that can be seen as an indicator (a value) of the tendency in the next k days. The value of this indicator should be related to the confidence we have that the target margin p will be attainable in the next k days.

The idea is that positive variations will lead us to buy, while negative variations will trigger sell actions. The indicator we are proposing summarizes the tendency as a single value: positive for upward tendencies and negative for downward price tendencies.

The following function implements the proposed indicator:

T.ind <- function(quotes,tgt.margin=0.025,n.days=10) {
   v <- apply(HLC(quotes),1,mean) # function HLC() extracts the High, Low, and Close quotes
   v[1] <- Cl(quotes)[1]           
   
   r <- matrix(NA,ncol=n.days,nrow=NROW(quotes))
   for(x in 1:n.days) r[,x] <- Next(Delt(v,k=x),x)
   
   x <- apply(r,1,function(x) 
                  sum(x[x > tgt.margin | x < -tgt.margin]))
   if (is.xts(quotes)) xts(x,time(quotes)) else x
}
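
Reading the code above, the indicator for day t sums the i-days-ahead relative variations of the average daily price that exceed the target margin in either direction. A sketch of the underlying formula, where P̄ denotes the mean of the High, Low, and Close quotes, p is tgt.margin, and k is n.days:

T_t = \sum_{i=1}^{k} V_i \cdot \mathbf{1}\{\,|V_i| > p\,\}, \qquad V_i = \frac{\bar{P}_{t+i} - \bar{P}_t}{\bar{P}_t}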

Let us inspect the values of the T Indicator.

data(GSPC,package="DMwR2") 
avgPrice <- function(p) apply(HLC(p), 1, mean)
addAvgPrice <- newTA(FUN=avgPrice, col=1, legend='AvgPrice')
addT.ind <- newTA(FUN=T.ind, col='red', legend='tgtRet')
candleChart(last(GSPC,'3 months'), theme='white', TA=c(addAvgPrice(on=1), addT.ind()))

Choosing the Predictors

The main assumption behind trying to forecast the future behavior of financial markets is that it is possible to do so by observing the past behavior of the market. More precisely, the assumption is that if in the past behavior p was frequently followed by behavior f, then when we observe p again we can be confident that f will follow.

We are approximating the future behavior using our T indicator. We need to decide how to describe the recent past behavior of the prices. We will try to collect a series of indicators that capture the recent dynamics of the prices.

Obvious candidates are the recent prices of the asset. We will focus on the closing prices, more precisely on the arithmetic h-day returns.

Additional information can be given by calculating relevant statistics on the recent evolution of the prices. Technical indicators are numeric summaries that reflect some properties of the price time series. We will select an illustrative set of technical indicators calculated with the recent prices and use them as predictors for our models. Package TTR implements a large collection of technical indicators.

Auxiliary functions (technical indicators) we will use to obtain the predictors:

myATR <- function(x) ATR(HLC(x))[,'atr'] # Average True Range, measures volatility of series  
mySMI <- function(x) SMI(HLC(x))[, "SMI"] #  Stochastic Momentum Index 
myADX <- function(x) ADX(HLC(x))[,'ADX'] # Welles Wilder's Directional Movement Index 
myAroon <- function(x) aroon(cbind(Hi(x),Lo(x)))$oscillator # Identify starting trends
myBB <- function(x) BBands(HLC(x))[, "pctB"] # Bollinger Bands
myChaikinVol <- function(x) Delt(chaikinVolatility(cbind(Hi(x),Lo(x))))[, 1] # Chaikin Volatility
myCLV <- function(x) EMA(CLV(HLC(x)))[, 1] # Close Location Value 
myEMV <- function(x) EMV(cbind(Hi(x),Lo(x)),Vo(x))[,2] # Arms' Ease of Movement Value 
myMACD <- function(x) MACD(Cl(x))[,2] # Moving Average Convergence Divergence
myMFI <- function(x) MFI(HLC(x), Vo(x)) # Money Flow Index
mySAR <- function(x) SAR(cbind(Hi(x),Cl(x))) [,1] # Parabolic Stop-and-Reverse
myVolat <- function(x) volatility(OHLC(x),calc="garman")[,1] # Garman-Klass volatility estimator
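
Each of these wrappers simply reduces the (often multi-column) output of the corresponding TTR function to the single series we will use as a predictor. A quick sanity check of one of them, assuming GSPC is already loaded (output omitted; the initial values are NA until the indicator has enough history):

head(na.omit(myATR(GSPC)))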

Let us create the objects with the data we will use. We start with the full model and then use a random forest to estimate the importance of the variables involved in the prediction task. This importance can be estimated by calculating the percentage increase in the error of the random forest when each variable is removed in turn.

data.model <- specifyModel(T.ind(GSPC) ~ Delt(Cl(GSPC),k=1:10) + 
                             myATR(GSPC) + 
                             mySMI(GSPC) + 
                             myADX(GSPC) + 
                             myAroon(GSPC) + 
                             myBB(GSPC) + 
                             myChaikinVol(GSPC) + 
                             myCLV(GSPC) + 
                             CMO(Cl(GSPC)) + 
                             EMA(Delt(Cl(GSPC))) + 
                             myEMV(GSPC) + 
                             myVolat(GSPC) + 
                             myMACD(GSPC) + 
                             myMFI(GSPC) + 
                             RSI(Cl(GSPC)) + 
                             mySAR(GSPC) + 
                             runMean(Cl(GSPC)) + 
                             runSD(Cl(GSPC)))
set.seed(1234) 
rf <- buildModel(data.model,method='randomForest', 
                 training.per=c("1995-01-01","2005-12-30"), 
                 ntree=1000, 
                 importance=TRUE) 
varImpPlot(rf@fitted.model, type = 1) # Type 2 shows ranking based on decrease in node impurity 

imp <- importance(rf@fitted.model, type = 1) 
rownames(imp)[which(imp > 30)] 
##  [1] "myATR.GSPC"      "mySMI.GSPC"      "myADX.GSPC"     
##  [4] "myAroon.GSPC"    "myEMV.GSPC"      "myVolat.GSPC"   
##  [7] "myMACD.GSPC"     "myMFI.GSPC"      "mySAR.GSPC"     
## [10] "runMean.Cl.GSPC" "runSD.Cl.GSPC"

We use 30 as the threshold, as there seems to be a clear difference in the scores of the features above and below this value. We then specify the pared-down model. This gives us a quantmod object (data.model) containing the specification of the data we plan to use with our predictive models. The data has as target the value of the T indicator and as predictors a series of variables that resulted from a feature selection process.

data.model <- specifyModel(T.ind(GSPC) ~ myATR(GSPC) + 
                             mySMI(GSPC) + 
                             myADX(GSPC) + 
                             myAroon(GSPC) + 
                             myEMV(GSPC) + 
                             myVolat(GSPC) + 
                             myMACD(GSPC) + 
                             myMFI(GSPC) + 
                             mySAR(GSPC) + 
                             runMean(Cl(GSPC)) + 
                             runSD(Cl(GSPC)))

Prediction Tasks

We explore two alternative ways of obtaining predictions of the correct trading signal. The first is a multiple regression task: we use the T value as the target variable and forecast it from the predictors. The second alternative is a classification task that consists of predicting the signals directly; that is, the target variable is the “correct” signal for day d. Again using the T indicator and the same thresholds, we obtain the signal of each day of the available historical data by calculating the T value over the following 10 days and applying the thresholds to decide on the signal. The target variable in this second task is nominal.

# Regression
Tdata.train <- as.data.frame(modelData(data.model, data.window=c('1970-01-02','2005-12-30'))) # convert to data frame
Tdata.eval <- na.omit(as.data.frame(modelData(data.model, data.window=c('2006-01-01','2016-01-25')))) 
Tform <- as.formula('T.ind.GSPC ~ .') # the formula to be used in models

# Classification
buy.thr <- 0.1 
sell.thr <- -0.1 
Tdata.trainC <- cbind(Signal=trading.signals(Tdata.train[["T.ind.GSPC"]], 
                                             buy.thr,sell.thr), 
                      Tdata.train[,-1]) 
Tdata.evalC <- cbind(Signal=trading.signals(Tdata.eval[["T.ind.GSPC"]], 
                                            buy.thr,sell.thr), 
                     Tdata.eval[,-1]) 
TformC <- as.formula("Signal ~ .")

head(Tdata.train)
##            T.ind.GSPC myATR.GSPC mySMI.GSPC myADX.GSPC myAroon.GSPC
## 1970-02-18 0.15215888   1.757594 -27.544696   48.67410          -30
## 1970-02-19 0.05307516   1.757766 -22.066236   45.56942          -30
## 1970-02-20 0.05120619   1.765782 -16.483777   42.76333          -25
## 1970-02-24 0.00000000   1.756084 -11.374860   39.93884          -20
## 1970-02-25 0.00000000   1.822792  -4.722482   37.88671           85
## 1970-02-26 0.00000000   1.835450   0.825322   35.98116           85
##              myEMV.GSPC myVolat.GSPC myMACD.GSPC myMFI.GSPC mySAR.GSPC
## 1970-02-18 0.0002233277    0.2144511   -2.124947   68.64115   85.02000
## 1970-02-19 0.0002792395    0.2151897   -1.976305   75.99052   85.02000
## 1970-02-20 0.0001116343    0.2181275   -1.818074   75.65063   85.16720
## 1970-02-24 0.0002632325    0.2184983   -1.658710   74.89682   85.38157
## 1970-02-25 0.0003729626    0.2296881   -1.478420   75.28083   85.66384
## 1970-02-26 0.0003360702    0.2291771   -1.299327   74.16197   86.07746
##            runMean.Cl.GSPC runSD.Cl.GSPC
## 1970-02-18          86.583     0.4582108
## 1970-02-19          86.769     0.5230780
## 1970-02-20          86.939     0.6298933
## 1970-02-24          87.037     0.7129286
## 1970-02-25          87.362     0.9422276
## 1970-02-26          87.558     1.0431437
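
Before moving to the evaluation of the models, it is worth checking how unevenly the three signals are distributed; a quick check (output not shown) is:

summary(Tdata.trainC$Signal) # counts per signal; hold ("h") days should dominate by a wide margin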

Evaluating the Predictions

Our prediction task means that the models will forecast a value for the T indicator. We will map these predicted values into a trading signal: T < sell.threshold generates a sell signal (s), T > buy.threshold generates a buy signal (b), and a hold signal (h) is generated otherwise. Function trading.signals() in package DMwR2 does that mapping for us. The result is a factor with the possible values “b”, “s”, or “h”.
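
A minimal sketch of this mapping, using the same ±0.1 thresholds adopted below and some hypothetical predicted values:

trading.signals(c(-0.15, 0.02, 0.30), 0.1, -0.1) # should yield the factor: s h b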

Because the signal is a nominal variable, evaluating it through a standard error rate might seem natural; however, the classes are strongly imbalanced (hold signals dominate), so this is not a good idea. We will instead use the precision/recall setup, together with the F-measure to unify both into a single score (see the formula after the following list). We will also evaluate our trials using trading-related metrics such as:

  • Economic results factors such as:
    • net balance over the trading period
    • percentage return over this period
    • excess return over buy-and-hold
  • Risk-related metrics such as:
    • the Sharpe ratio
    • the maximum draw-down
  • Characteristics of the trades such as:
    • number of trades
    • average return per trade
    • percentage of profitable trades
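
For reference, the F-measure that unifies precision and recall into a single score is

F_\beta = \frac{(\beta^2 + 1) \cdot \mathrm{Prec} \cdot \mathrm{Rec}}{\beta^2 \cdot \mathrm{Prec} + \mathrm{Rec}}

where β controls the relative importance of recall with respect to precision (β = 1 gives the balanced F1 score).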

Performance Estimation

The usual techniques for model evaluation revolve around resampling. Examples of resampling-based methods include the holdout method, cross-validation, and the bootstrap.

For time series, note that any form of resampling changes the natural order of the data, so we need special handling to address the time dependency. The general guideline is not to “forget” the time tags of the observations and never to evaluate a model on data that precedes the data used to train it.

A possible method is to divide the existing data into two time windows: past data (observations up to a time t) and “future” data (observations after t). We can then use one of three learn-test alternatives: fixed learning window, growing window, and sliding window. A fixed learning window obtains a single model with the available “training” data, which is applied to the whole test period. With a growing window, every test case obtains a new model using all data available until then. With a sliding window, every test case obtains a new model using only the most recent previous observations of the time series, as sketched below.
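
A rough sketch of the difference between the three schemes, in terms of which rows would be used to train the model that predicts test row i (hypothetical helper functions, not part of any package; m is the initial training size):

fixed.rows   <- function(i, m) 1:m               # one model, fit once, for the whole test period
growing.rows <- function(i, m) 1:(i - 1)         # re-learn on all data seen so far
sliding.rows <- function(i, m) (i - m):(i - 1)   # re-learn on the most recent m observations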

An illustrative example using Monte Carlo estimation with SVMs (fixed vs. sliding window) is provided below. It uses the “theil” metric, which is typical of time series tasks: it compares the model against the naive baseline that predicts the last observed value of the target variable, and is calculated as sum( (t_i - p_i)^2 ) / sum( (t_i - t_{i-1})^2 ), where the t_i are the true values and the p_i the predictions. Values below 1 mean the model outperforms this baseline.

exp <- performanceEstimation( 
  PredTask(Tform, Tdata.train[1:2000,], 'SP500'), 
  c(Workflow('standardWF', wfID="standSVM", # fixed window
             learner='svm',learner.pars=list(cost=10,gamma=0.01)), 
    Workflow('timeseriesWF', wfID="slideSVM", # time series workflow
             type="slide", relearn.step=90, # sliding window, slide every 90 rows 
             learner='svm',learner.pars=list(cost=10,gamma=0.01)) 
    ), 
  EstimationTask(metrics="theil", # see the note on the "theil" metric above
                 method=MonteCarlo(nReps=10,szTrain=0.5,szTest=0.25)))
## 
## 
## ##### PERFORMANCE ESTIMATION USING  MONTE CARLO  #####
## 
## ** PREDICTIVE TASK :: SP500
## 
## ++ MODEL/WORKFLOW :: standSVM 
## Task for estimating  theil  using
## 10  repetitions Monte Carlo Simulation using: 
##   seed =  1234 
##   train size =  0.5 x NROW(DataSet) 
##   test size =  0.25 x NROW(DataSet) 
## Repetition  1 
##   start test =  1005 ; test size =  500 
## Repetition  2 
##   start test =  1057 ; test size =  500 
## Repetition  3 
##   start test =  1115 ; test size =  500 
## Repetition  4 
##   start test =  1254 ; test size =  500 
## Repetition  5 
##   start test =  1305 ; test size =  500 
## Repetition  6 
##   start test =  1311 ; test size =  500 
## Repetition  7 
##   start test =  1312 ; test size =  500 
## Repetition  8 
##   start test =  1318 ; test size =  500 
## Repetition  9 
##   start test =  1329 ; test size =  500 
## Repetition  10 
##   start test =  1428 ; test size =  500 
## 
## 
## 
## ++ MODEL/WORKFLOW :: slideSVM 
## Task for estimating  theil  using
## 10  repetitions Monte Carlo Simulation using: 
##   seed =  1234 
##   train size =  0.5 x NROW(DataSet) 
##   test size =  0.25 x NROW(DataSet) 
## Repetition  1 
##   start test =  1005 ; test size =  500 
## Repetition  2 
##   start test =  1057 ; test size =  500 
## Repetition  3 
##   start test =  1115 ; test size =  500 
## Repetition  4 
##   start test =  1254 ; test size =  500 
## Repetition  5 
##   start test =  1305 ; test size =  500 
## Repetition  6 
##   start test =  1311 ; test size =  500 
## Repetition  7 
##   start test =  1312 ; test size =  500 
## Repetition  8 
##   start test =  1318 ; test size =  500 
## Repetition  9 
##   start test =  1329 ; test size =  500 
## Repetition  10 
##   start test =  1428 ; test size =  500
summary(exp)
## 
## == Summary of a  Monte Carlo Performance Estimation Experiment ==
## 
## Task for estimating  theil  using
## 10  repetitions Monte Carlo Simulation using: 
##   seed =  1234 
##   train size =  0.5 x NROW(DataSet) 
##   test size =  0.25 x NROW(DataSet) 
## 
## * Predictive Tasks ::  SP500
## * Workflows  ::  standSVM, slideSVM 
## 
## -> Task:  SP500
##   *Workflow: standSVM 
##              theil
## avg     0.99348693
## std     0.01973701
## med     1.00372563
## iqr     0.01936481
## min     0.95419701
## max     1.01218863
## invalid 0.00000000
## 
##   *Workflow: slideSVM 
##              theil
## avg     0.99348693
## std     0.01973701
## med     1.00372563
## iqr     0.01936481
## min     0.95419701
## max     1.01218863
## invalid 0.00000000
plot(exp)

We are now ready to proceed to build the prediction models. We consider three different model types: artificial neural networks, support vector machines, and multivariate adaptive regression splines.

Artificial Neural Networks

Artificial neural networks (ANNs) are frequently used in financial forecasting because of their ability to deal with highly nonlinear problems. The package nnet is one of several packages in R that implement feed-forward neural networks. This type of neural network is among the most frequently used and is also the type we will apply.

ANNs are formed by a set of computing units (the neurons) linked to each other. Each neuron executes two consecutive calculations: a linear combination of its inputs, followed by a nonlinear computation of the result to obtain its output value that is then fed to other neurons in the network. Each of the neuron connections has an associated weight. Constructing an artificial neural network consists of establishing an architecture for the network and then using an algorithm to find the weights of the connections between the neurons.

Feed-forward artificial neural networks have their neurons organized in layers. The first layer contains the input neurons of the network; the training observations are presented to the network through these input neurons. The final layer produces the predictions of the network for any case presented at its input neurons. In between, we usually have one or more “hidden” layers of neurons. Weight updating algorithms, such as the back-propagation method, try to obtain the connection weights that optimize a certain error criterion, that is, they try to ensure that the network outputs are in accordance with the cases presented to the model. This is accomplished by an iterative process: the training cases are repeatedly presented at the input nodes of the network; the predictions are obtained at the output nodes and the respective prediction error is calculated; and the weights are then updated to try to reduce this error. The process is repeated until some convergence criterion is met.
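
In symbols, a neuron with inputs x_j, connection weights w_j, and bias b computes

o = \sigma\Big( b + \sum_j w_j x_j \Big), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}

where the logistic sigmoid σ shown here is the activation used in the hidden units of nnet.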

Feed-forward ANNs with one hidden layer can be easily obtained in R using a function of the package nnet. The networks obtained by this function can be used for both classification and regression problems and thus are applicable to both our prediction tasks.

ANNs are known to be sensitive to the different scales of the variables used in a prediction problem. In this context, it makes sense to transform the data before giving them to the network, in order to avoid possible negative impacts on performance. In our case we will standardize the data so that all variables have a mean of zero and a standard deviation of one. The function scale() can be used to carry out this transformation.

Regression Task

set.seed(1234)
norm.data <- data.frame(T.ind.GSPC=Tdata.train[[1]],scale(Tdata.train[,-1]))
nn <- nnet(Tform, norm.data[1:1000, ], size = 5, decay = 0.01, 
           maxit = 1000, linout = TRUE, trace = FALSE)
preds <- predict(nn, norm.data[1001:2000, ]) 

By default, the function nnet() sets the initial weights of the links between neurons to random values in the interval [-0.5, 0.5]. This means that two successive runs of the function with exactly the same arguments can actually lead to different solutions; fixing the random seed, as we do above, avoids this.

In this illustrative example we have used the first 1,000 cases to obtain the network and tested the model on the following 1,000. After normalizing our training data, we call the function nnet() to obtain the model. The first two parameters are the usual ones of any modeling function in R: the functional form of the model specified by a formula, and the training sample used to obtain the model. We have also used some of the parameters of the nnet() function. The parameter size specifies how many nodes the hidden layer will have. There is no magic recipe for which value to use here; one usually tries several values to observe the network behavior, although a value smaller than the number of predictors is a frequent choice. The parameter decay controls the weight updating rate of the back-propagation algorithm; again, trial and error is your best friend here. Finally, the parameter maxit controls the maximum number of iterations the weight convergence process is allowed to use, while the linout=TRUE setting tells the function that we are handling a regression problem. The trace=FALSE setting avoids some of the output of the function regarding the optimization process. The function predict() can be used to obtain the predictions of the neural network for a set of test data.

Let us evaluate the results of the ANN for predicting the correct signals for the test set. We do this by transforming the numeric predictions into signals and then evaluating them using the statistics described above.

sigs.nn <- trading.signals(preds,0.1,-0.1) 
true.sigs <- trading.signals(Tdata.train[1001:2000, "T.ind.GSPC"], 0.1, -0.1) 
sigs.PR(sigs.nn,true.sigs)
##     precision    recall
## s   0.2809917 0.1931818
## b   0.3108108 0.2857143
## s+b 0.2973978 0.2373887

Function trading.signals() transforms numeric predictions into signals, given the buy and sell thresholds, respectively. The function sigs.PR(), also from the DMwR2 package, obtains a matrix with the precision and recall scores of the two types of events, and overall. These scores show that the performance of the ANN is not brilliant. In effect, we get rather low precision scores and also rather uninteresting recall values. The latter are less serious, as they basically mean lost opportunities, not costs. On the contrary, low precision scores mean that the model gave wrong signals rather frequently. If these signals are used for trading, this may lead to serious losses of money.

ANNs can also be used for classification tasks. For these problems the main difference in terms of network topology is that instead of a single output unit, we will have as many output units as there are values of the target variable (sometimes known as the classes). Each of these output units will produce a probability estimate of the respective class value. This means that for each test case, an ANN can produce a set of probability values, one for each possible class value. The use of the nnet() function for these tasks is very similar to its use for regression problems. The following code illustrates this, using our training data:

Classification Task

set.seed(1234) 
norm.data <- data.frame(Signal=Tdata.trainC$Signal,scale(Tdata.trainC[,-1])) 
nn <- nnet(Signal ~ ., norm.data[1:1000, ], size = 10, decay = 0.01, 
           maxit = 1000, trace = FALSE)
preds <- predict(nn, norm.data[1001:2000, ], type = "class") 

The type=“class” argument is used to obtain a single class label for each test case instead of a set of probability estimates. With the network predictions we can calculate the model precision and recall as follows:

sigs.PR(preds, norm.data[1001:2000, 1])
##     precision    recall
## s   0.3607595 0.3238636
## b   0.3267974 0.3105590
## s+b 0.2250804 0.2077151

The precision and recall scores are similar to the ones obtained in the regression task.

Support Vector Machines

Support vector machines (SVMs) are modeling tools that, like ANNs, can be applied to both regression and classification tasks. SVMs are available in R through several packages, for instance kernlab and e1071.

The basic idea behind SVMs is to map the original data into a new, high-dimensional space, where it is possible to apply linear models to obtain a separating hyperplane, for example, separating the classes of the problem in the case of binary classification tasks. The mapping of the original data into this new space is carried out with the help of the so-called kernel functions. These functions are interesting because, when applied to any pair of cases in the original space, they produce a value that is equal to the dot product of these cases in the new (and potentially very large) space. Dot products play an essential role in the optimization process that SVMs use to obtain their solution, but calculating them in the high-dimensional spaces where SVMs project the data is computationally expensive. In this context, the equivalence provided by kernel functions is very useful because it allows us to avoid these extra computation costs.
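
As a concrete illustration, the radial basis function (Gaussian) kernel, which is the default kernel in both packages (with slightly different parameter naming: gamma in e1071, sigma in kernlab), can be sketched in a couple of lines:

rbf <- function(x, y, gamma = 0.01) exp(-gamma * sum((x - y)^2)) # RBF kernel value
rbf(c(1, 2), c(1.5, 2.5)) # similarity of two hypothetical cases in the implicit feature space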

Regression Task

set.seed(1234)
sv <- svm(Tform, Tdata.train[1:1000, ], gamma = 0.001, cost = 100) # default gamma is 1/(number of predictors)
s.preds <- predict(sv, Tdata.train[1001:2000, ])
sigs.svm <- trading.signals(s.preds, 0.1, -0.1) 
true.sigs <- trading.signals(Tdata.train[1001:2000, "T.ind.GSPC"], 0.1, -0.1)
sigs.PR(sigs.svm, true.sigs)
##     precision      recall
## s       0.375 0.017045455
## b         NaN 0.000000000
## s+b     0.375 0.008902077

As we can observe, the SVM model achieves a considerably better precision score than the ANN, although with a much lower recall.

Classification Task

ksv <- ksvm(Signal ~ ., Tdata.trainC[1:1000, ], C = 10) 
ks.preds <- predict(ksv, Tdata.trainC[1001:2000, ]) 
sigs.PR(ks.preds, Tdata.trainC[1001:2000, 1])
##     precision    recall
## s   0.2386364 0.2386364
## b   0.3421053 0.1614907
## s+b 0.2698413 0.2017804

Note that we have used the C parameter of the ksvm() function of package kernlab to specify a different cost of constraint violations, which by default is 1. Apart from this we have used the default parameter values, which for classification involves, for instance, using the radial basis kernel.

Multivariate Adaptive Regression Splines

Multivariate adaptive regression splines (MARS) are an example of an additive regression model. The main idea behind generalized additive models is that a complex function may be decomposed in an additive way such that each term has a simpler form. The main advantage of this decomposition lies in the fact that additive models are generally considered very interpretable, as you can easily understand the contribution of each term towards the predictions of the model. MARS can be useful when we face complex nonlinear relationships between predictors and target, especially in high dimensions. With MARS, we essentially build piecewise linear relationships between predictors and target by segmenting the predictor variables. The main disadvantages of MARS are: (1) computational complexity, since it requires recursive partitioning; (2) a potential for overfitting; and (3) the hinge functions make the resulting models harder to implement directly.

MARS builds models in two phases: the forward and backward passes. The forward pass starts with an intercept (the mean of the target) and iteratively adds new basis function terms until a certain termination criterion is met. The backward pass then iteratively removes terms, using a (generalized) cross-validation criterion to compare and select among the alternatives. The basis functions are so-called hinge functions; a sketch follows.
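
A minimal sketch of the mirrored pair of hinge functions that MARS considers at a candidate knot c (these appear as h(x-c) and h(c-x) in the earth() summaries shown below):

h        <- function(x, knot) pmax(0, x - knot)  # hinge: zero below the knot, linear above it
h.mirror <- function(x, knot) pmax(0, knot - x)  # mirrored hinge: linear below the knot, zero above

The fitted model is then an intercept plus a weighted sum of such terms.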

Let us take a look at an example.

data(Boston, package="MASS") 
sp <- sample(1:nrow(Boston),as.integer(0.7*nrow(Boston))) 
tr <- Boston[sp,] 
ts <- Boston[-sp,] 
mars <- earth(medv ~ .,tr, pmethod = "forward") # backward (default) produces lower Rsq 
preds <- predict(mars,ts) 
(mae <- mean(abs(ts$medv - preds)))
## [1] 2.65609
plot(mars)

summary(mars)
## Call: earth(formula=medv~., data=tr, pmethod="forward")
## 
##                 coefficients
## (Intercept)       -27.532039
## h(crim-4.55587)    -0.482866
## h(crim-18.811)      0.543873
## h(4.93-indus)       1.183864
## h(nox-0.624)     -100.095279
## h(nox-0.668)       93.198652
## h(rm-6.474)         6.592849
## h(dis-1.4118)      54.525616
## h(2.3817-dis)      56.952016
## h(dis-2.3817)     -55.192033
## h(3-rad)           -1.811984
## h(rad-3)            0.373438
## h(277-tax)          0.024191
## h(tax-277)         -0.014355
## h(18.6-ptratio)     0.725101
## h(ptratio-18.6)    -0.841909
## h(389.25-black)    -0.007657
## h(black-389.25)    -0.162592
## h(6.29-lstat)       2.671068
## h(lstat-6.29)      -0.426354
## 
## Selected 20 of 24 terms, and 10 of 13 predictors
## Termination condition: Reached nk 27
## Importance: rm, lstat, ptratio, black, dis, nox, tax, crim, rad, ...
## Number of terms at each degree of interaction: 1 19 (additive model)
## GCV 13.26437    RSS 3717.958    GRSq 0.8405412    RSq 0.8730244

The following code applies the function earth() to the financial regression task we have been using:

Regression Task

e <- earth(Tform, Tdata.train[1:1000, ]) 
e.preds <- predict(e, Tdata.train[1001:2000, ]) 
sigs.e <- trading.signals(e.preds, 0.1, -0.1) 
true.sigs <- trading.signals(Tdata.train[1001:2000, "T.ind.GSPC"], 0.1, -0.1) 
sigs.PR(sigs.e, true.sigs)
##     precision    recall
## s   0.2894737 0.2500000
## b   0.3504274 0.2546584
## s+b 0.3159851 0.2522255
summary(e)
## Call: earth(formula=Tform, data=Tdata.train[1:1000,])
## 
##                           coefficients
## (Intercept)                  0.5241811
## h(myATR.GSPC-2.56817)        1.2724353
## h(-61.825-mySMI.GSPC)        0.0594203
## h(myADX.GSPC-40.6215)       -0.0104803
## h(50.657-myADX.GSPC)        -0.0025279
## h(myADX.GSPC-50.657)         0.0823181
## h(0.204717-myVolat.GSPC)    -0.5105416
## h(myVolat.GSPC-0.204717)    -5.6100523
## h(myVolat.GSPC-0.271459)     5.6725474
## h(mySAR.GSPC-74.7031)       -0.0693496
## h(87.944-mySAR.GSPC)        -0.0575104
## h(mySAR.GSPC-87.944)         0.0800171
## h(runMean.Cl.GSPC-79.265)    0.2074780
## h(81.942-runMean.Cl.GSPC)    0.1058493
## h(runMean.Cl.GSPC-81.942)   -0.2185753
## 
## Selected 15 of 18 terms, and 6 of 11 predictors
## Termination condition: Reached nk 23
## Importance: myVolat.GSPC, runMean.Cl.GSPC, myATR.GSPC, mySMI.GSPC, ...
## Number of terms at each degree of interaction: 1 14 (additive model)
## GCV 0.01470628    RSS 13.86568    GRSq 0.3536668    RSq 0.38939

You may get more detailed information on the importance assigned by the algorithm to the variables (listed in decreasing order of importance) using the function evimp(). Note that MARS is only applicable to regression problems, so we do not show an example for the classification task.

evimp(e, trim=FALSE)
##                      nsubsets   gcv    rss
## myVolat.GSPC               13 100.0  100.0
## runMean.Cl.GSPC            13 100.0  100.0
## myATR.GSPC                 12  96.7   96.4
## mySMI.GSPC                 11  81.5   82.4
## mySAR.GSPC                  7  44.1   47.7
## myADX.GSPC                  5  57.8>  58.5>
## myAroon.GSPC-unused         0   0.0    0.0
## myEMV.GSPC-unused           0   0.0    0.0
## myMACD.GSPC-unused          0   0.0    0.0
## myMFI.GSPC-unused           0   0.0    0.0
## runSD.Cl.GSPC-unused        0   0.0    0.0

The Trading Set up

There are many ways of setting up our trading strategies. One possible strategy could be as follows:

  • All decisions will be taken at the end of the day.
  • If at the end of the day we have a prediction for a low value of T
    • If we already have an open position, we ignore this signal
    • If we do not have a position, we open a short position by issuing a sell order
    • When this order is carried out in the future at a price p we immediately post two extra orders:
      • A buy limit order with a limit price of p - t% (t% is our target profit) with a deadline of 10 days
      • A buy stop order with a limit price of p + l% (l% is our maximum accepted loss)
  • If the models forecast a high value of T
    • If we already have an open position, we ignore this signal
    • If we do not have a position, we open a long position by issuing a buy order
    • When this order is carried out in the future at a price p we immediately post two extra orders:
      • A sell limit order with a limit price of p + t% (t% is our target profit) with a deadline of 10 days
      • A sell stop order with a limit price of p - l% (l% is our maximum accepted loss)

A second possible strategy is similar to the first, with two exceptions: we will always open new positions (as long as we have enough money), even if we already have an open position; and we will wait forever (unless the limit loss fires) for positions to reach the target profit.

The two trading strategies are implemented by functions policy.1() and policy.2() below:

policy.1 <- function(signals,market,opened.pos,money,
                     bet=0.2,hold.time=10,
                     exp.prof=0.025, max.loss= 0.05
)
{
  d <- NROW(market) # this is the ID of today
  orders <- NULL
  nOs <- NROW(opened.pos)
  # nothing to do!
  if (!nOs && signals[d] == 'h') return(orders)
  
  # First let's check if we can open new positions
  # i) long positions
  if (signals[d] == 'b' && !nOs) {
    quant <- round(bet*money/market[d,'Close'],0)
    if (quant > 0) 
      orders <- rbind(orders,
                      data.frame(order=c(1,-1,-1),order.type=c(1,2,3), 
                                 val = c(quant,
                                         market[d,'Close']*(1+exp.prof),
                                         market[d,'Close']*(1-max.loss)
                                 ),
                                 action = c('open','close','close'),
                                 posID = c(NA,NA,NA)
                      )
      )
    
    # ii) short positions  
  } else if (signals[d] == 's' && !nOs) {
    # this is the money we would already need to spend to buy back
    # the stocks of currently opened short positions
    need2buy <- sum(opened.pos[opened.pos[,'pos.type']==-1,
                               "N.stocks"])*market[d,'Close']
    quant <- round(bet*(money-need2buy)/market[d,'Close'],0)
    if (quant > 0)
      orders <- rbind(orders,
                      data.frame(order=c(-1,1,1),order.type=c(1,2,3), 
                                 val = c(quant,
                                         market[d,'Close']*(1-exp.prof),
                                         market[d,'Close']*(1+max.loss)
                                 ),
                                 action = c('open','close','close'),
                                 posID = c(NA,NA,NA)
                      )
      )
  }
  
  # Now let's check if we need to close positions
  # because their holding time is over
  if (nOs) 
    for(i in 1:nOs) {
      if (d - opened.pos[i,'Odate'] >= hold.time)
        orders <- rbind(orders,
                        data.frame(order=-opened.pos[i,'pos.type'],
                                   order.type=1,
                                   val = NA,
                                   action = 'close',
                                   posID = rownames(opened.pos)[i]
                        )
        )
    }
  
  orders
}
policy.2 <- function(signals,market,opened.pos,money,
                     bet=0.2,exp.prof=0.025, max.loss= 0.05
)
{
  d <- NROW(market) # this is the ID of today
  orders <- NULL
  nOs <- NROW(opened.pos)
  # nothing to do!
  if (!nOs && signals[d] == 'h') return(orders)
  
  # First let's check if we can open new positions
  # i) long positions
  if (signals[d] == 'b') {
    quant <- round(bet*money/market[d,'Close'],0)
    if (quant > 0) 
      orders <- rbind(orders,
                      data.frame(order=c(1,-1,-1),order.type=c(1,2,3), 
                                 val = c(quant,
                                         market[d,'Close']*(1+exp.prof),
                                         market[d,'Close']*(1-max.loss)
                                 ),
                                 action = c('open','close','close'),
                                 posID = c(NA,NA,NA)
                      )
      )
    
    # ii) short positions  
  } else if (signals[d] == 's') {
    # this is the money already committed to buy stocks
    # because of currently opened short positions
    need2buy <- sum(opened.pos[opened.pos[,'pos.type']==-1,
                               "N.stocks"])*market[d,'Close']
    quant <- round(bet*(money-need2buy)/market[d,'Close'],0)
    if (quant > 0)
      orders <- rbind(orders,
                      data.frame(order=c(-1,1,1),order.type=c(1,2,3), 
                                 val = c(quant,
                                         market[d,'Close']*(1-exp.prof),
                                         market[d,'Close']*(1+max.loss)
                                 ),
                                 action = c('open','close','close'),
                                 posID = c(NA,NA,NA)
                      )
      )
  }
  
  orders
}

The function has four parameters that we can use to tune this strategy. These are the bet parameter, which specifies the percentage of our current money that we will invest each time we open a new position; the exp.prof parameter, which indicates the profit margin we wish for our positions and is used when posting the limit orders; the max.loss parameter, which indicates the maximum loss we are willing to admit before we close the position, and is used in stop orders; and the hold.time parameter, which indicates the number of days we are willing to wait for the position to reach the profit margin. If the holding time is reached without achieving the desired margin, the position is closed.

Notice that whenever we open a new position, we send three orders back to the simulator: a market order to open the position, a limit order to specify our target profit margin, and a stop order to limit our losses.

A Trading Simulator

Function trading.simulator() in package DMwR2 implements a simulation of a trading market. It accepts several arguments controlling this simulation, namely:

  • The market quotes during the simulation period
  • The signals issued for this period
  • A trading policy function that will transform these signals into market orders
  • The parameters to be passed to this trading policy function
  • The cost of each transaction (defaults to 5 Eur)
  • The initial capital available for trading (defaults to 1M Eur)

The result of the function is an object of class tradeRecord that can be used for obtaining trading evaluation metrics and also for the visualization of the trading performance during the simulation period.

An illustrative example, applying an SVM to the first trading strategy, is shown below.

data(GSPC, package="DMwR2") # Train and test periods used in this illustration 
start <- 1
len.tr <- 1000 # first 1000 for training models 
len.ts <- 500 # next 500 for testing them 
tr <- start:(start+len.tr-1)
ts <- (start+len.tr):(start+len.tr+len.ts-1) 
# The market quotes during this "testing period" 
# This will be used by the simulator 
# Note: you need the training data created previously! 
date <- rownames(Tdata.train[start+len.tr,]) 
market <- GSPC[paste(date,'/',sep='')][1:len.ts]

s <- svm(Tform,Tdata.train[tr,],cost=10,gamma=0.01) 
p <- predict(s,Tdata.train[ts,]) 
sig <- trading.signals(p,0.1,-0.1) # predictions to signals 
## now using the simulated trader 
t1 <- trading.simulator(market,sig, 'policy.1', # the policy function name
                        list(exp.prof=0.05,bet=0.2,hold.time=30)) 
t1
## 
## Object of class tradeRecord with slots:
## 
##   trading: <xts object with a numeric  500 x 5  matrix>
##   positions: <numeric  8 x 7  matrix>
##   init.cap :  1e+06 
##   trans.cost :  5 
##   policy.func :  policy.1 
##   policy.pars : <list with  3  elements>
summary(t1)
## 
## == Summary of a Trading Simulation with  500  days ==
## 
## Trading policy function :  policy.1 
## Policy function parameters:
##   exp.prof  =  0.05 
##   bet  =  0.2 
##   hold.time  =  30 
## 
## Transaction costs :  5 
## Initial Equity    :  1e+06 
## Final Equity      :  1019713   Return :  1.97 %
## Number of trading positions:  8 
## 
## Use function "tradingEvaluation()" for further stats on this simulation.
plot(t1,market,theme='white',name='SP500')

## Rentability =  1.971254 %

Let’s try the second strategy.

t2 <- trading.simulator(market,sig,'policy.2',list(exp.prof=0.05,bet=0.3))
summary(t2)
## 
## == Summary of a Trading Simulation with  500  days ==
## 
## Trading policy function :  policy.2 
## Policy function parameters:
##   exp.prof  =  0.05 
##   bet  =  0.3 
## 
## Transaction costs :  5 
## Initial Equity    :  1e+06 
## Final Equity      :  1152332   Return :  15.23 %
## Number of trading positions:  37 
## 
## Use function "tradingEvaluation()" for further stats on this simulation.
tradingEvaluation(t2)
##     NTrades       NProf    PercProf          PL         Ret   RetOverBH 
##       37.00       26.00       70.27   152332.27       15.23        8.38 
##       MaxDD SharpeRatio     AvgProf     AvgLoss       AvgPL     MaxProf 
##    67492.27        0.06        4.99       -4.89        2.05        5.26 
##     MaxLoss 
##       -5.00
plot(t2,market,theme='white',name='SP500')

## Rentability =  15.23323 %

Model Selection

Time series problems like the one we are addressing bring new challenges in terms of obtaining reliable estimates of our evaluation metrics. This is caused by the fact that all data observations have an attached time tag that imposes an ordering among them. This ordering means that standard cross-validation, which shuffles the observations randomly, should not be applied to time series problems.

However, to ensure statistical reliability, it is important to repeat the estimation process several times under different conditions, preferably randomly selected. Given a time series dataset spanning from time t to time t+N, we should ensure that the sum of the training and test set sizes is smaller than N, so that we are able to randomly generate different experimental scenarios with the data that was provided. In terms of experimental methodology, we will use a Monte Carlo experiment to obtain reliable estimates of our evaluation metrics. Monte Carlo methods rely on random sampling to obtain their results.

Let’s see how to select and compare different trials of this problem. The plan is to split the data into two partitions (Tdata.train, covering 1970-2005; and Tdata.eval, covering 2006-2016). We will perform model selection and comparison looking only at Tdata.train. After we select the “best” model, we will use it to trade on Tdata.eval - the final evaluation.

The function performanceEstimation() from package performanceEstimation will be used for model selection. The different steps we have described to address this problem include many parameters and alternatives, and exploring and comparing all possibilities would require far too much computation power. We will illustrate the comparison methodology by selecting a few of these alternatives: in terms of modeling we will consider a few variants of SVMs, and we will also consider some alternatives in terms of trading policies.

Our approach to solve the trading task consists of the following steps:

  • Obtain a model and its predictions for the T indicator
  • Transform these prediction into trading signals
  • Trade using these signals and a certain trading policy

This workflow will have to be executed for different train and test partitions within the Monte Carlo simulation, each time collecting the respective trading results. As this is a specific workflow, we will write a function implementing it. The evaluation workflow function is given below.

tradingEval <- function(trueSigs,predSigs,tradeRec,...) 
{
    ## Signals evaluation
    st <- sigs.PR(predSigs,trueSigs)
    dim(st) <- NULL
    names(st) <- paste(rep(c('prec','rec'),each=3),c('s','b','sb'),sep='.')
    
    ## Trading record evaluation
    tradRes <- tradingEvaluation(tradeRec)
    return(c(st,tradRes))
}

The trading workflow function is given below.

tradingWF <- function(form,train,test, quotes,
                      learner,learner.pars=NULL,
                      learn.test.type='fixed',relearn.step=20,
                      b.t,s.t,
                      policy,policy.pars,
                      trans.cost=5,init.cap=1e+06)
{
  ## model and predictions for test set
  if (learn.test.type == 'fixed') {
    m <- do.call(learner,c(list(form,train),learner.pars))
    preds <- predict(m,test)
  } else {  # either slide or growing window strategies
    data <- rbind(train,test)
    n <- NROW(data)
    train.size <- NROW(train)
    sts <- seq(train.size+1,n,by=relearn.step)
    
    preds <- vector()
    for(s in sts) {
      tr <- if (learn.test.type=='slide') data[(s-train.size):(s-1),] else data[1:(s-1),]
      ts <- data[s:min((s+relearn.step-1),n),]
      
      m <- do.call(learner,c(list(form,tr),learner.pars))
      preds <- c(preds,predict(m,ts))
    }    
  } 
  
  ## trading signals corresponding to preds
  predSigs <- trading.signals(preds,b.t,s.t)
  tgtName <- all.vars(form)[1]
  trueSigs <- trading.signals(test[,tgtName],b.t,s.t)

  ## trading record from trading with the signals
  date <- rownames(test)[1]
  market <- get(quotes)[paste(date,"/",sep='')][1:length(preds),]
  tradeRec <- trading.simulator(market,predSigs,
                                   policy.func=policy,policy.pars=policy.pars,
                                   trans.cost=trans.cost,init.cap=init.cap)
  
  return(list(trueSigs=trueSigs,predSigs=predSigs,tradeRec=tradeRec))
}
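
As a minimal, hypothetical illustration of this workflow outside of performanceEstimation() (a single fixed-window run reusing the objects created earlier; the parameter values are only examples):

res <- tradingWF(Tform, Tdata.train[1:1000, ], Tdata.train[1001:1500, ],
                 quotes='GSPC', learner='svm',
                 learner.pars=list(cost=10, gamma=0.01),
                 b.t=0.1, s.t=-0.1,
                 policy='policy.1',
                 policy.pars=list(exp.prof=0.05, bet=0.2, hold.time=30))
tradingEval(res$trueSigs, res$predSigs, res$tradeRec)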

Note that the full performance estimation code takes a long time to run, so we will not run it here. The code is given below; the results can be found in the book.

# mc.res <- performanceEstimation(
#   PredTask(Tform,Tdata.train),
#   workflowVariants('tradingWF',
#                    quotes='GSPC',
#                    learner='svm',
#                    pred.target="indicator",
#                    learner.pars=list(cost=c(1,10),gamma=0.01),
#                    b.t=c(0.01,0.05),
#                    s.t=c(-0.01,-0.05),
#                    policy='policy.2',
#                    policy.pars=list(bet=c(0.2,0.5),
#                                     exp.prof=0.05,max.loss=0.05)
#                    ), ## Monte Carlo repeating 5 times: 10y for training and 5y for testing
#   EstimationTask(method=MonteCarlo(nReps=5,szTrain=2540,szTest=1270,seed=1234),
#                  evaluator="tradingEval") )
# mc.res2 <- subset(mc.res,
#                   metrics=c("PercProf", "Ret", "MaxDD"),
#                   partial=FALSE)
# topPerformers(mc.res2, maxs=c(TRUE,TRUE,FALSE))
# plot(subset(mc.res2,metrics="Ret"))
# getWorkflow("tradingWF.v5",mc.res)
# summary(subset(mc.res2, workflows="tradingWF.v5"))

Alvin Eng

13 August 2017