Pair trading using Neural network models

Introduction

In this exercise, I will use the time series data of two commodies, iron ore and steel, from Chinese Future market as my target. Because these two product are expected to be highly correlated, I would use models to capture their unexpected price deviation and generate trading signals based on those unexpectations. To select the most suitable model, I will compare the performance of three models, ETS model, ARIMA model, and Neural Network model.

Data Preparation

I download iron contract price from the website of Dalian commodity Exchange, http://www.dce.com.cn/, and steel price from Shanghai Futures Exchange, http://www.shfe.com.cn/.

rm(list = ls())
cat("\f")

packages <- c("fpp2", "seasonal", "readxl", "stargazer", "rugarch")
for (i in 1:length(packages)){
  if(!packages[i] %in% installed.packages()){
    install.packages(packages[i])
  }
  library(packages[i], character.only = T)
}

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

## -- Attaching packages ---------------------------------------------- fpp2 2.4 --

## v ggplot2   3.3.5     v fma       2.4  
## v forecast  8.15      v expsmooth 2.3

##

## 
## Please cite as:

##  Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.

##  R package version 5.2.2. https://CRAN.R-project.org/package=stargazer

## 载入需要的程辑包：parallel

## 
## 载入程辑包：'rugarch'

## The following object is masked from 'package:stats':
## 
##     sigma

rm(packages)

setwd("C:/Users/admin/Desktop/study/hw3")

steel <- read.delim("steel.txt")
iron <- read.delim("iron.txt")

head(steel)

##         日期 开盘 最高 最低 收盘  成交量  持仓量 结算价
## 1 2013/08/19 3802 3843 3785 3812 1892324 1552992   3814
## 2 2013/08/20 3810 3828 3792 3796 1517702 1548418   3811
## 3 2013/08/21 3796 3807 3786 3789 1150250 1559268   3797
## 4 2013/08/22 3777 3799 3762 3769 1747080 1579348   3777
## 5 2013/08/23 3772 3820 3771 3817 2136050 1553756   3803
## 6 2013/08/26 3827 3848 3802 3813 2048358 1536606   3822

head(iron)

##         日期 开盘 最高 最低 收盘 成交量 持仓量 结算价
## 1 2013/10/18  978  984  962  977 300818  65042    975
## 2 2013/10/21  976  977  960  969  95432  68130    967
## 3 2013/10/22  963  966  948  948  91296  76004    956
## 4 2013/10/23  949  953  936  939 116490  81162    944
## 5 2013/10/24  938  939  916  926 169028  87872    928
## 6 2013/10/25  928  936  917  918 104060  86472    927

Data Visualization

Several data points are zero and NA, which could be due to technique errors. Therefore, I delete rows containing NAs and replace zeros with the values of previous data. The plot looks better now.

steel.adj.price <- ts(na.omit(steel$结算价), start = as.Date("2014-01-01")
                      , end = as.Date("2020-11-30"))
iron.adj.price <- ts(na.omit(iron$结算价), start = as.Date("2014-01-01")
                     , end = as.Date("2020-11-30"))

autoplot(steel.adj.price, series = "steel")

steel.adj.price[which.min(steel.adj.price)] <- steel.adj.price[which.min(steel.adj.price)-1]

autoplot(steel.adj.price, series = "steel")

autoplot(iron.adj.price, series = "iron")

for(i in which(iron.adj.price == 0)) {
  iron.adj.price[i] <- iron.adj.price[i-1]
}
autoplot(iron.adj.price, series = "iron")

Model selection

Before model selection, I need to check the stationarity of the data. I first difference the log return and use these two log return value to generate a deviation log return, which is the target I would use models to forecast. If the forecast deviation is higher than actual one or lower than the acutal, trading signal will be generated correspondingly.

steel.return <- log10(steel.adj.price / lag(steel.adj.price, 1))
iron.return <- log10(iron.adj.price / lag(iron.adj.price, 1))


ggAcf(steel.return)

ggAcf(iron.return)

steel.iron.return.diff <- steel.return - iron.return

plot(steel.iron.return.diff)

ggAcf(steel.iron.return.diff)

To better evaluate the performance of each model, I also download 100-day price data starting from the end of the training data as my test data. I will compare the accuracy of each model to select the best model. The accuracy test shows that neutral net work models are the best model.

test.steel <-  ts(na.omit(steel$结算价), start = as.Date("2020-11-30") )

test.iron <-  ts(na.omit(iron$结算价), start = as.Date("2020-11-30")
                 , end = as.Date("2020-11-30") + 101 )

test.return.diff <- log10(test.steel / lag(test.steel, 1)) - log10(test.iron / lag(test.iron, 1))

ets.fit <- ets(steel.iron.return.diff)
checkresiduals(ets.fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,N)
## Q* = 117.46, df = 8, p-value < 2.2e-16
## 
## Model df: 2.   Total lags used: 10

arima.fit <- auto.arima(steel.iron.return.diff)
checkresiduals(arima.fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,1) with zero mean
## Q* = 16.512, df = 9, p-value = 0.05693
## 
## Model df: 1.   Total lags used: 10

nt.fit <- nnetar(steel.iron.return.diff)
checkresiduals(nt.fit)

## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.

accuracy(forecast(ets.fit, h = 100), test.return.diff)

##                         ME        RMSE         MAE      MPE     MAPE      MASE
## Training set  5.710209e-06 0.011220361 0.007531200      Inf      Inf 0.7825206
## Test set     -7.635275e-04 0.005377468 0.003643917 100.2441 100.6359 0.3786170
##                   ACF1 Theil's U
## Training set 0.1954390        NA
## Test set     0.1406888  1.004771

accuracy(forecast(arima.fit, h = 100), test.return.diff)

##                         ME        RMSE         MAE     MPE    MAPE      MASE
## Training set -6.893402e-05 0.010990489 0.007366322     NaN     Inf 0.7653891
## Test set     -8.388981e-04 0.005385542 0.003641728 99.7332 99.7332 0.3783895
##                      ACF1 Theil's U
## Training set -0.002049981        NA
## Test set      0.140082087  1.005565

accuracy(forecast(nt.fit, h = 100), test.return.diff)

##                         ME        RMSE         MAE      MPE     MAPE      MASE
## Training set -2.237983e-05 0.010165180 0.007130760      NaN      Inf 0.7409134
## Test set     -1.042753e-03 0.005459424 0.003687927 99.70581 104.5555 0.3831897
##                    ACF1 Theil's U
## Training set 0.03720757        NA
## Test set     0.14896713  1.022907

Forecast and Interpretation

When the forecast log return deviation is higher than the actual one, I interpret that the actual deviation might be underestimated, so I should buy the expensive one and short the cheap one. Therefore, positive log return of steel and negative log return of iron should be added up to generate daily return. When situation is inverse, trading signals are also inverse. The profit of 100 days trading is shown below. We can see a downward profit plot, showing this type of trading algorithms is flawed. I think potential improvement could be setting up a suitable threshold. Only when there is an extreme deviation, I trigger the signal. However, to find the best threshold needs optimazation methods, such as corss validation, which will be my future attemps to fix this model.

fnt <- forecast(nt.fit, h = 100) 
autoplot(fnt)

profit <- rep(NA, 101)

for(i in 2:100){
  profit[1] = 0
  if(fnt$mean[i] > test.return.diff[i]){
    profit[i] =  profit[i-1] + test.return.diff[i]
    
  } else {
    profit[i] = profit[i-1] - test.return.diff[i]
  }
}

plot(profit)

Pair trading using Neural network models

Hao Fu

2021/12/12