i. Arrival of trades
We suppose that during the trading time of a day, within the interval \([t,t+\tau]\), the number of trades (denoted \(N(\tau)\)) follows a Poisson process with arrival rate \(\lambda\): \[P[N(\tau)=k]=\frac{e^{-\lambda\tau}({\lambda\tau})^k}{k!}\] ii. Volume of each trade
The volume of each trade \(V_i\) follows an exponential distribution with intensity \(\frac{1}{\beta}\) (so \(E(V_i)=\beta\)). That is: \[V_i \sim Exp\left(\frac{1}{\beta}\right)\] iii. Sign of trade
The side of a trade, buy or sell, follows a Bernoulli distribution. The sign \(S_i\) is \(1\) for a buy and \(-1\) for a sell: \[P(S_i=1)=p,\qquad P(S_i=-1)=1-p\] iv. Further assumptions
1. The three distributions are independent of each other.
2. The volumes are uncorrelated: \(Cov(V_i,V_j)=0\) for any \(i \neq j\).
3. The correlation between trade signs decays geometrically with lag (long memory): \[corr(S_i,S_j)=\rho^{|i-j|}\] v. Imbalance of trade
Within a time interval \([t,t+\tau]\), define \[IMB=\sum_{i=1}^{N(\tau)} S_i \times V_i\] Here three things are random: \(N(\tau)\), \(S_i\), and \(V_i\).
The imbalance within an interval has economic meaning: it reflects buying or selling power, because it is the net demand for the stock, and by the basic principle of economics, the higher the demand, the higher the price; it will therefore be useful for our strategy. More interestingly, it somehow reflects human trading behavior.
\(IMB\) can be calculated from the real data, and its moments can be used to estimate the parameters.
\(E(IMB)\) and \(Var(IMB)\) can be expressed in terms of the unknown parameters. Since there are three sources of randomness, conditional expectation (the laws of total expectation and total variance) should be used to compute these two moments. \[E(IMB)=E[E(IMB|N(\tau)=k)]\] \[Var(IMB)=E[Var(IMB|N(\tau)=k)]+Var(E(IMB|N(\tau)=k))\]
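For example, the first moment follows in one step from the independence assumptions: conditional on \(N(\tau)=k\), \[E(IMB|N(\tau)=k)=\sum_{i=1}^{k}E(S_i)E(V_i)=k(2p-1)\beta,\] and taking the expectation over \(N(\tau)\) gives \(E(IMB)=\lambda\tau(2p-1)\beta\). The variance is handled the same way, with the cross terms \(Cov(S_iV_i,S_jV_j)=4p(1-p)\beta^2\rho^{|i-j|}\) producing the \(\rho\)-dependent bracket below.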
Carrying the calculation through, the final results are as follows.
\[E(IMB)=\lambda \tau (2p-1) \beta\] \[Var(IMB)=2\lambda \tau {\beta}^2 + 8{\beta}^2(1-p)p[\frac{\rho e^{\lambda \tau (\rho-1)}}{(1-\rho)^2}+\frac{\rho \lambda \tau}{1-\rho}-\frac{\rho}{(1-\rho)^2}]\]
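The formulas can be sanity-checked by simulation. Below is a minimal Monte Carlo sketch, assuming \(p=0.5\) so that a \(\pm 1\) sign sequence with \(corr(S_i,S_j)=\rho^{|i-j|}\) can be generated by a symmetric two-state Markov chain with stay probability \((1+\rho)/2\); all parameter values here are arbitrary illustrations, not estimates.
set.seed(1)
lambda <- 0.8; tau <- 300; beta <- 1; rho <- 0.6; p <- 0.5
sim.imb <- function() {
  n <- rpois(1, lambda * tau)            # number of trades in [t, t+tau]
  if (n == 0) return(0)
  v <- rexp(n, rate = 1 / beta)          # volumes with mean beta
  s <- numeric(n)
  s[1] <- sample(c(-1, 1), 1)            # stationary start since p = 0.5
  if (n > 1) for (i in 2:n)              # keep the previous sign with prob (1+rho)/2
    s[i] <- if (runif(1) < (1 + rho) / 2) s[i - 1] else -s[i - 1]
  sum(s * v)
}
imbs <- replicate(5000, sim.imb())
mean(imbs)  # should be near lambda*tau*(2p-1)*beta = 0 here
var(imbs)   # should be near the analytical value computed below
2*lambda*tau*beta^2 + 8*beta^2*(1-p)*p*
  (rho*exp(lambda*tau*(rho-1))/(1-rho)^2 + rho*lambda*tau/(1-rho) - rho/(1-rho)^2)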
The data used are for SH600519 (Kweichow Moutai) and SH601398 (ICBC). Since the same methods are applied to both stocks, only one example is shown in this article, because the other is similar.
The data are from http://yucezhe.com/ . For each stock there are a trade dataset that records the executed trades and a quote dataset that records the market quotes.
The time span is roughly a whole year, and the total data size is about 300-500 million observations once everything is read into R. There is one csv file per trading day, so in total there are hundreds of csv files to read in and combine.
The following code reads this batch of files into R:
library(magrittr)
#if(file.exists("D:\\mirco_structure")==F)
#{dir.create("D:\\mirco_structure")}
data.path<-"D:\\mirco_structure"
setwd(data.path)
file<-dir(data.path)
file<-file[grep(pattern = ".*SH.*",x = file)]
file<-sapply(X = file,function(k)paste(data.path,k,sep = "\\"))
Maotai.path<-file[grep(pattern = "SH600519",x = file)]#get all the csv file paths
ICBC.path<-file[grep(pattern = "SH601398",x = file)]#get all the csv file paths
#600519=== Maotai===
Maotai.csv<-dir(Maotai.path)
#quote csv path
Maotai.quote.path<-Maotai.csv[grep(pattern = "quote.*",x = Maotai.csv)]
Maotai.quote.path<-sapply(X = Maotai.quote.path,function(k)paste(Maotai.path,k,sep = "\\"))
#trade csv path
Maotai.trade.path<-Maotai.csv[grep(pattern = "trade.*",x = Maotai.csv)]
Maotai.trade.path<-sapply(X = Maotai.trade.path,function(k)paste(Maotai.path,k,sep = "\\"))
#601398=== ICBC===
ICBC.csv<-dir(ICBC.path)
#quote csv path
ICBC.quote.path<-ICBC.csv[grep(pattern = "quote.*",x = ICBC.csv)]
ICBC.quote.path<-sapply(X = ICBC.quote.path,function(k)paste(ICBC.path,k,sep = "\\"))
#trade csv path
ICBC.trade.path<-ICBC.csv[grep(pattern = "trade.*",x = ICBC.csv)]
ICBC.trade.path<-sapply(X = ICBC.trade.path,function(k)paste(ICBC.path,k,sep = "\\"))
#============================ importing data =========================================================
t1<-Sys.time()
if(!is.null(Maotai.quote.path))
Maotai.quote.data<-lapply(X = Maotai.quote.path,
function(i) read.csv(file = i,header = T,stringsAsFactors = F));
if(!is.null(Maotai.trade.path))
Maotai.trade.data<-lapply(X = Maotai.trade.path,
function(i) read.csv(file = i,header = T,stringsAsFactors = F));
if(!is.null(ICBC.quote.path))
ICBC.quote.data<-lapply(X = ICBC.quote.path,
function(i) read.csv(file = i,header = T,stringsAsFactors = F));
if(!is.null(ICBC.trade.path))
ICBC.trade.data<-lapply(X = ICBC.trade.path,
function(i) read.csv(file = i,header = T,stringsAsFactors = F));
duration<-Sys.time()-t1#about 3 minutes
print(duration)
It takes less than three minutes. Every day is an element of the list; to combine them, we use the data.table package function rbindlist(), and then save the results as RDS files for convenient importing later.
library(data.table)
mt_quote<-rbindlist(Maotai.quote.data)
mt_trade<-rbindlist(Maotai.trade.data)
icbc_quote<-rbindlist(ICBC.quote.data)
icbc_trade<-rbindlist(ICBC.trade.data)
# RDS compresses the dataset, so it is much more convenient to read later and occupies less memory
saveRDS(object = mt_quote,"mt_quote.RDS")
saveRDS(object = mt_trade,"mt_trade.RDS")
saveRDS(object = icbc_quote,"icbc_quote.RDS")
saveRDS(object = icbc_trade,"icbc_trade.RDS")
#load dataset
mt_quote<-readRDS("D:\\mirco_structure\\mt_quote.RDS")
mt_trade<-readRDS("D:\\mirco_structure\\mt_trade.RDS")
Quote data
head(mt_quote)
## X date time price volume turnover ntrade BS acc_volume
## 1: 1 20130104 91503 0 0 0 0 0
## 2: 2 20130104 91507 0 0 0 0 0
## 3: 3 20130104 91513 0 0 0 0 0
## 4: 4 20130104 91519 0 0 0 0 0
## 5: 5 20130104 91533 0 0 0 0 0
## 6: 6 20130104 91543 0 0 0 0 0
## acc_turnover AskPrice1 AskPrice2 AskPrice3 AskPrice4 AskPrice5
## 1: 0 209.02 0 0 0 0
## 2: 0 212.50 0 0 0 0
## 3: 0 212.50 0 0 0 0
## 4: 0 212.66 0 0 0 0
## 5: 0 212.66 0 0 0 0
## 6: 0 213.20 0 0 0 0
## AskPrice6 AskPrice7 AskPrice8 AskPrice9 AskPrice10 AskVolume1
## 1: 0 0 0 0 0 1600
## 2: 0 0 0 0 0 4600
## 3: 0 0 0 0 0 4600
## 4: 0 0 0 0 0 4600
## 5: 0 0 0 0 0 4700
## 6: 0 0 0 0 0 7700
## AskVolume2 AskVolume3 AskVolume4 AskVolume5 AskVolume6 AskVolume7
## 1: 0 0 0 0 0 0
## 2: 406 0 0 0 0 0
## 3: 906 0 0 0 0 0
## 4: 106 0 0 0 0 0
## 5: 6 0 0 0 0 0
## 6: 1556 0 0 0 0 0
## AskVolume8 AskVolume9 AskVolume10 BidPrice1 BidPrice2 BidPrice3
## 1: 0 0 0 209.02 0 0
## 2: 0 0 0 212.50 0 0
## 3: 0 0 0 212.50 0 0
## 4: 0 0 0 212.66 0 0
## 5: 0 0 0 212.66 0 0
## 6: 0 0 0 213.20 0 0
## BidPrice4 BidPrice5 BidPrice6 BidPrice7 BidPrice8 BidPrice9 BidPrice10
## 1: 0 0 0 0 0 0 0
## 2: 0 0 0 0 0 0 0
## 3: 0 0 0 0 0 0 0
## 4: 0 0 0 0 0 0 0
## 5: 0 0 0 0 0 0 0
## 6: 0 0 0 0 0 0 0
## BidVolume1 BidVolume2 BidVolume3 BidVolume4 BidVolume5 BidVolume6
## 1: 1600 300 0 0 0 0
## 2: 4600 0 0 0 0 0
## 3: 4600 0 0 0 0 0
## 4: 4600 0 0 0 0 0
## 5: 4700 0 0 0 0 0
## 6: 7700 0 0 0 0 0
## BidVolume7 BidVolume8 BidVolume9 BidVolume10
## 1: 0 0 0 0
## 2: 0 0 0 0
## 3: 0 0 0 0
## 4: 0 0 0 0
## 5: 0 0 0 0
## 6: 0 0 0 0
colnames(mt_quote)
## [1] "X" "date" "time" "price"
## [5] "volume" "turnover" "ntrade" "BS"
## [9] "acc_volume" "acc_turnover" "AskPrice1" "AskPrice2"
## [13] "AskPrice3" "AskPrice4" "AskPrice5" "AskPrice6"
## [17] "AskPrice7" "AskPrice8" "AskPrice9" "AskPrice10"
## [21] "AskVolume1" "AskVolume2" "AskVolume3" "AskVolume4"
## [25] "AskVolume5" "AskVolume6" "AskVolume7" "AskVolume8"
## [29] "AskVolume9" "AskVolume10" "BidPrice1" "BidPrice2"
## [33] "BidPrice3" "BidPrice4" "BidPrice5" "BidPrice6"
## [37] "BidPrice7" "BidPrice8" "BidPrice9" "BidPrice10"
## [41] "BidVolume1" "BidVolume2" "BidVolume3" "BidVolume4"
## [45] "BidVolume5" "BidVolume6" "BidVolume7" "BidVolume8"
## [49] "BidVolume9" "BidVolume10"
date and time record when each new quote happened; they will be used later.
AskPrice1 and BidPrice1 will be used because they capture the bid-ask spread.
AskVolume1 and BidVolume1 will be used since different weights can be calculated from them.
Trade data
head(mt_trade)
## X date id_trade time sign BS price ntrade
## 1: 1 20130104 0 92500000 NA B 212 100
## 2: 2 20130104 1 92500000 NA B 212 100
## 3: 3 20130104 2 92500000 NA B 212 400
## 4: 4 20130104 3 92500000 NA B 212 100
## 5: 5 20130104 4 92500000 NA B 212 100
## 6: 6 20130104 5 92500000 NA B 212 75
colnames(mt_trade)
## [1] "X" "date" "id_trade" "time" "sign" "BS"
## [7] "price" "ntrade"
date and time record when each new trade happened and will be used later. Here time is on a slightly different scale from the quote data and needs to be adjusted below (see the sketch after this list).
BS indicates the direction of the trade: whether it was sell-initiated or buy-initiated.
price is the price of the actual trade.
ntrade is the traded amount of the corresponding trade, namely the volume; it should be normalized, since otherwise the values would be too large to work with.
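Concretely, trade timestamps are in HHMMSSmmm (three extra millisecond digits) while quote timestamps are in HHMMSS, so integer division by 1000 puts them on the same scale. A minimal sketch with made-up timestamps:
trade.time <- c(92500000, 145959500)   # 09:25:00.000 and 14:59:59.500
trade.time %/% 1000                    # 92500 145959 -- now on the quote time scale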
The data.table package is extremely useful here: the data are fairly large, and data.table's modify-by-reference feature prevents R from copying everything, which makes data manipulation much faster, as the small example below shows. All the remaining data processing is done with data.table commands, which resemble SQL and data frames. A quick reference is here: https://s3.amazonaws.com/assets.datacamp.com/img/blog/data+table+cheat+sheet.pdf
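A tiny illustration of the by-reference update (a toy table, not part of our dataset):
library(data.table)
dt <- data.table(x = 1:5)
dt[, y := x * 2]   # adds column y in place; no copy of dt is made
dt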
A function to return the time intervals used to separate the data for different \(\tau\). This function will be used many times in the later analysis, mainly for labelling the data and cutting it into pieces.
trauch<-function(t){
library(chron)
trauch.time <- t / (60 * 24) # t minutes expressed as a fraction of a day
time.index1<-seq(from = times(as.numeric(times('9:30:00'))+trauch.time), to = times('11:30:00'), by = trauch.time)
time.index2<-seq(from = times(as.numeric(times('13:00:00'))+trauch.time), to = times('15:00:00'), by = trauch.time)
time.index<-c(time.index1,time.index2)
c.time<-as.character(time.index)
c.time<-sub(x = c.time,pattern = ":",replacement = "")
c.time<-substr(x = c.time,start = 1,stop = 4)
n.time<-as.numeric(c.time)*100000 # HHMM scaled to the trade-data time format
return(n.time)
}
daily.interval<-trauch(5)
daily.interval
## [1] 93500000 94000000 94500000 95000000 95500000 100000000 100500000
## [8] 101000000 101500000 102000000 102500000 103000000 103500000 104000000
## [15] 104500000 105000000 105500000 110000000 110500000 111000000 111500000
## [22] 112000000 112500000 113000000 130500000 131000000 131500000 132000000
## [29] 132500000 133000000 133500000 134000000 134500000 135000000 135500000
## [36] 140000000 140500000 141000000 141500000 142000000 142500000 143000000
## [43] 143500000 144000000 144500000 145000000 145500000 150000000
Data cleansing using data.table
mt_trade.clean<-mt_trade[time>=93000000,] #9:30 a.m. market open
meanV<-mt_trade.clean[,mean(ntrade)] #354.1357
norm.mt.trade<-mt_trade.clean[,normalV:=ntrade/meanV]#normalize the volume
norm.mt.trade<-norm.mt.trade[,period:=findInterval(x = time,vec = daily.interval,rightmost.closed = F,all.inside = F)]
setkey(norm.mt.trade,date,period)
#the per-trade imb, before being summed up
norm.mt.trade<-norm.mt.trade[,imb:=ifelse(BS=="B",normalV,-normalV)]
Since \(E(N)=\lambda \tau\), we use the average number of trades during the interval \([t,t+\tau]\) to estimate \(\lambda \tau\).
poisson.count<-norm.mt.trade[,.N,by="date,period"]#number of arrivals in each day-period cell
lambda<-sum(poisson.count$N)/(247*48*300)#247 trading days x 48 periods x 300 seconds; 0.796505
Since all trade volumes are normalized, the first-moment estimate of \(\beta\) will always be 1.
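For reference, a minimal sketch of what the first-moment estimate looks like on raw, un-normalized volumes (since \(E(V)=\beta\) for \(V \sim Exp(\frac{1}{\beta})\), the sample mean is the moment estimator):
beta.raw <- mt_trade.clean[, mean(ntrade)]  # the same meanV as above, ~354.14
norm.mt.trade[, mean(normalV)]              # after normalization this is 1 by construction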
By the law of large numbers, we use the proportion of buys among all trades to estimate \(p\).
norm.mt.trade[,.N,by=BS]#count BS,has 8 NA
## BS N
## 1: B 1393155
## 2: S 1439129
## 3: 8
p<-1393155/(1393155+1439129) #p=0.4918839
e<-c("B"=1393155,"S"=1439129)
chisq.test(e) #significance test of whether the probabilities are equal
##
## Chi-squared test for given probabilities
##
## data: e
## X-squared = 746.26, df = 1, p-value < 2.2e-16
As shown above, \(p\) is close to 0.5. At first glance this might be interpreted as the buy and sell probabilities not differing "that much". However, at the 99% confidence level the chi-square test rejects the null hypothesis of equal probabilities, which means that throughout the whole year the probability of a sell trade is higher than that of a buy trade.
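As a quick check of the reported statistic: under the null hypothesis of equal probabilities, the expected count in each cell is \((1393155+1439129)/2=1416142\), so \[\chi^2=\frac{(1393155-1416142)^2}{1416142}+\frac{(1439129-1416142)^2}{1416142}\approx 746.26,\] which matches the output above.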
We tried two methods to estimate \(\rho\). First, we invert the \(Var(IMB)\) equation: the analytical expression for \(Var(IMB)\) contains a \(\rho\) term, and \(Var(IMB)\) itself can be computed from the raw data. The second method estimates \(\rho\) directly from the autocorrelation of the buy-sell sign sequence at various lags.
var<-21129.55
lambda<-0.796505
tau<-300
beta<-1
p<-0.4918839
#objective: absolute gap between the analytical Var(IMB) and the empirical variance
relation<-function(rho){
x<-2*lambda*tau*beta^2+8*beta^2*(1-p)*p*(rho*exp(lambda*tau*(rho-1))/(1-rho)^2+rho*lambda*tau/(1-rho)-rho/(1-rho)^2);
return (abs(x-var))
}
s<-optim(par = 0.5,fn = relation,method = "BFGS")
s$par
## [1] 0.9826307
The \(\rho\) calculated from \(Var(IMB)\) is 0.9826307, which indicates that the correlation between two trades is strongly positive. Since a buy trade takes the value 1 and a sell trade takes the value -1, this correlation validates the clustering (i.e. herding) effect of trades: a buy trade is more likely to be followed by another buy trade, and a sell trade by another sell trade. As the distance between two trades grows, the influence of the past trade on the current trade decays exponentially.
For our second method of estimating \(\rho\), we use the acf() function in R to calculate the autocorrelation up to the 200th lag. Apparently it also exhibits a "long-memory" feature.
norm.mt.trade[,sign:=ifelse(BS=="B",1,-1)]
t<-acf(x = norm.mt.trade$sign,lag.max = 200)
Taking the \(k\)-th root of the lag-\(k\) autocorrelation gives one approximation of \(\rho\):
vv<-sapply(2:200,function(y) t$acf[y]^(1/(y-1)))
plot(vv,pch=20,ylab = "rho",xlab = "#iteration",main = "Take root",col="salmon3")
Dividing each autocorrelation by that of the previous lag gives another approximation of \(\rho\):
ss<-sapply(3:201,function(y) t$acf[y]/t$acf[y-1])
plot(ss,col="red3",pch=20,ylab = "rho",xlab = "#iteration",main = "Lag-divide")
The lag-1 correlation is 0.5597272. Since \(corr(S_i,S_j)=\rho^{|i-j|}\) and we now have the sequence of \(\rho\) powers at different lags, we can either take roots of the powers or divide each power by the previous one. Both indicate that \(\rho\) approaches 1 very closely. The differences between the plots, and the accuracy of the estimation, may result from our assumption about the correlation structure, or even the Poisson-arrival assumption, being somewhat compromised in the data. The good news is that, whichever method we use, \(\rho\) is strongly positive and close to 1, which supports the empirical finding of herding of trades in the market.
We cut the data into 16 sub-sessions, each lasting 15 minutes. To cut them, we can use the labels we assigned before to separate the data. E.g. the first session:
norm.mt.trade1<-subset(norm.mt.trade,period %in% 0:2)
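The other fifteen sub-sessions are built the same way. Equivalently, the whole list can be constructed in one pass (a sketch, assuming each 15-minute sub-session covers three consecutive 5-minute period labels, i.e. periods 0-2, 3-5, ..., 45-47):
sep.norm <- lapply(0:15, function(k) subset(norm.mt.trade, period %in% (3*k):(3*k + 2)))
names(sep.norm) <- as.character(1:16)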
Or combine the manually created tables into a list of sub-session trades:
sep.norm<-list("1"=norm.mt.trade1,"2"=norm.mt.trade2,"3"=norm.mt.trade3,
"4"=norm.mt.trade4,"5"=norm.mt.trade5,"6"=norm.mt.trade6,
"7"=norm.mt.trade7,"8"=norm.mt.trade8,"9"=norm.mt.trade9,
"10"=norm.mt.trade10,"11"=norm.mt.trade11,"12"=norm.mt.trade12,
"13"=norm.mt.trade13,"14"=norm.mt.trade14,"15"=norm.mt.trade15,
"16"=norm.mt.trade16)
Then we need to normalize each sub-session. First calculate the mean volume of each sub-session:
meanV.sep<-sapply(sep.norm, function(y) mean(y$ntrade))
e.g. Normalize the first sub-session
norm.mt.trade1[,normalV:=ntrade/meanV.sep[1]]
Then calculate the signed imbalance of each trade:
norm.mt.trade1[,imb:=ifelse(BS=="B",normalV,-normalV)]
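The same two steps can be applied to every sub-session in one loop (a sketch; data.table's := updates each element of the list in place):
for (k in seq_along(sep.norm)) {
  sep.norm[[k]][, normalV := ntrade / meanV.sep[k]]             # renormalize within the sub-session
  sep.norm[[k]][, imb := ifelse(BS == "B", normalV, -normalV)]  # signed volume of each trade
}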
1. Probability of a trade in sub-sessions
The probability of a buy trade varies around 0.5, but whether it can be taken as equal to the probability of a sell trade requires rigorous statistical testing, so we apply the chi-square test to the data.
#possibility===
p.sep<-sapply(sep.norm,function(y){
t<-y[,.N,by=BS]
return (t[1,N]/(t[1,N]+t[2,N]))}
)
chi.sep<-sapply(sep.norm,function(y){
t<-y[,.N,by=BS]
ch<-chisq.test(c(t[1,N],t[2,N]))
return (ch[3]) #element 3 of the htest object is the p-value
})
barplot(height =p.sep,xlab="sub-session",ylab = "p",axes = T,main = "SH600519 (Mao Tai)",col = "steelblue1")
as.vector(chi.sep)
## $`1.p.value`
## [1] 0.7924044
##
## $`2.p.value`
## [1] 4.787328e-06
##
## $`3.p.value`
## [1] 0.593348
##
## $`4.p.value`
## [1] 8.017145e-18
##
## $`5.p.value`
## [1] 2.890268e-05
##
## $`6.p.value`
## [1] 8.386993e-29
##
## $`7.p.value`
## [1] 1.767337e-06
##
## $`8.p.value`
## [1] 0.2886933
##
## $`9.p.value`
## [1] 8.106206e-94
##
## $`10.p.value`
## [1] 5.35036e-49
##
## $`11.p.value`
## [1] 1.51585e-37
##
## $`12.p.value`
## [1] 1.417616e-48
##
## $`13.p.value`
## [1] 4.794043e-32
##
## $`14.p.value`
## [1] 2.154209e-06
##
## $`15.p.value`
## [1] 2.154209e-06
##
## $`16.p.value`
## [1] 0.0001069616
Only sub-sessions 1, 3, and 8 fail to reject the null hypothesis that the numbers of buys and sells are the same; all other sessions reject it. Around the daily market open and the noon break, the probability of a buy versus a sell trade is not so different. The reason may be the lack of information at market opening, so that a buy or sell trend has not yet formed and traders' behavior exhibits more randomness at the beginning. As for noon, it might be a "take a break" effect: trade volume and arrivals are low around this time, and those still trading just minutes before lunchtime may not hold a strong view on buying or selling their positions.
2. Lambda (arrival rate of trades) in sub-sessions
lambda.sep<-sapply(sep.norm,function(y){
t<-y[,.N,by="date,period"]
return (sum(t[,N])/(247*3*300))#247 days x 3 five-minute periods x 300 seconds
}
)
lambda.sep
## 1 2 3 4 5 6 7
## 1.2189249 1.1444130 0.9624696 0.8536212 0.7565407 0.7205983 0.6911606
## 8 9 10 11 12 13 14
## 0.6457850 0.7049393 0.6556770 0.6800045 0.6817814 0.6755286 0.7571660
## 15 16
## 0.7571660 0.9042510
barplot(height =lambda.sep,xlab="sub-session",ylab = "lambda",axes = T,main = "SH600519 (Mao Tai)",col = "burlywood1")
As expected, the Poisson intensity exhibits a U-like shape, with the second peak lower than the first. This coincides with empirical studies finding that the frequency of trades is highest at market opening: trading is most active then and volume clusters at the very beginning. From then on it gradually decreases until the noon break, and when the market re-opens in the afternoon it climbs up slowly and reaches another peak at market close, because many unfinished orders of the day must be completed before the close.
3. Beta in sub-sessions
beta<-sapply(sep.norm,function(y){
return (mean(y$normalV))
}
)
beta
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Since we normalized the volumes within each sub-session, beta is always 1.
4. Rho and lag-1 correlation in sub-sessions
rho.lag1.sep<-sapply(sep.norm,function(y) {
x<-shift(y$sign,n=1,fill = NA,type = "lead")#sign sequence shifted by one trade
x<-na.omit(x)
y<-y$sign
y<-y[1:length(x)]
return (cor(x,y))#lag-1 sample correlation of the signs
})
plot(rho.lag1.sep,main="lag-1 correlation",pch=20,col="blue3")
Use \(Var(IMB)\) to estimate \(\rho\):
#Var(imb)===
varimb<-sapply(sep.norm,function(y) {
t<-y[,sum(imb),by="date,period"]
return (var(t[,V1]))
})
parameter<-cbind(p.sep,lambda.sep,beta,varimb)
#rho===
corr<-c()
for(i in 1:nrow(parameter)){
p<-parameter[i,1]
lambda<-parameter[i,2]
tau<-300
beta<-parameter[i,3]
var<-parameter[i,4]
#same objective as before: gap between analytical and empirical Var(IMB)
relation<-function(rho){
x<-2*lambda*tau*beta^2+8*beta^2*(1-p)*p*(rho*exp(lambda*tau*(rho-1))/(1-rho)^2+rho*lambda*tau/(1-rho)-rho/(1-rho)^2);
return (abs(x-var))
}
t<-optim(par = 0.5,fn = relation,method = "BFGS")
corr<-c(corr,t$par)
}
plot(x=1:16,y=corr,xlab="sub-session",ylab = "correlation",axes = T,main = "SH600519 (Mao Tai)",col = "seagreen2",ylim = c(0.95,1),pch=16)
All sub-session estimates of the correlation are around 0.95 to 1, which implies a strong clustering effect in the sign of trades throughout the entire day. The lag-1 correlation shows something of a downward trend, but remains above 0.5.
As a validation of the theory, we model the relationship between imbalance and return as a power function: \[Return=\beta^{+} \sigma_{return} |IMB|^{\gamma^{+}} \quad \text{if } IMB>0\] \[Return=\beta^{-} \sigma_{return} |IMB|^{\gamma^{-}} \quad \text{if } IMB<0\] Return here is calculated as \(\frac{midquote(t+\tau)-midquote(t)}{midquote(t)}\), and we use a volume-weighted definition of the mid-quote, since it reflects the effect of volume on the real power of the buy or sell side: \[midquote=\frac{bid_1 \times asksize_1 +ask_1 \times bidsize_1}{asksize_1+bidsize_1}\]
interval<-trauch(5)
mt_trade.clean<-mt_trade[time>=93000000,]
mt_trade.clean[,period:=findInterval(x = time,vec = interval,rightmost.closed = F,all.inside = F)]#assign daily session index
mt_quote.clean<-mt_quote[time<=150000,]
mt_quote.clean<-mt_quote.clean[,.(X,date,time,price,volume,AskPrice1,AskVolume1,BidPrice1,BidVolume1)]#keep useful variable
#common key
setkey(mt_quote.clean,date)
setkey(mt_trade.clean,date)
mt_quote.clean[,weight_mid:=ifelse(BidPrice1==0,AskPrice1,(BidPrice1*AskVolume1+AskPrice1*BidVolume1)/(AskVolume1+BidVolume1))]
mt_trade.clean[, interval :=
findInterval(time/1000,mt_quote.clean[.BY][, time],rightmost.closed = T,all.inside = T)
,by = date]
mt_trade.clean<-mt_trade.clean[BS!=" ",]
meanV<-mt_trade.clean[,mean(ntrade)] #354.1357
mt_trade.clean<-mt_trade.clean[,normalV:=ntrade/meanV]
mt_trade.clean<-mt_trade.clean[,imb:=ifelse(BS=="B",normalV,-normalV)]
mt_trade.clean[,weight_mid:=mt_quote.clean[.BY][interval,weight_mid],by=date]#merge
mt_trade.lm<-mt_trade.clean[,.(weight_mid,IMB=sum(imb)),by="date,period"]
mt_trade.lm[,group:=date*100+period]#combine two keys into one and grouping data with hash style key
setkey(mt_trade.lm,group)
mt_trade.lm<-mt_trade.lm[, .SD[c(1,.N)], by=group]#select first and last one during each period each day
#calculating return
mt.model<-mt_trade.lm[,ret:=c(NA,(weight_mid[2]-weight_mid[1])/weight_mid[1]),by=group]
mt.model<-mt.model[!is.na(ret),]#drop the NA rows created when computing returns
sigma<-sd(mt.model$ret*100)
pos.data<-mt.model[IMB>0,]
neg.data<-mt.model[IMB<0,]
imb.relation<-function(IMB,gamma,beta){
beta*sigma*abs(IMB)^gamma
}
pos.fit<-nls(100*ret~imb.relation(IMB,gamma,beta),data = pos.data,start = list(gamma=0.2,beta=0.3),trace = F)
neg.fit<-nls(100*ret~imb.relation(IMB,gamma,beta),data = neg.data,start = list(gamma=0.2,beta=-0.3),trace = F)
The estimated parameters and the fitted plot follow. Note that an nls fit stores its estimates as coefficients, retrieved with coef(); $par is NULL for an nls object.
coef(pos.fit)
coef(neg.fit)
mt.points<-mt.model[IMB!=0,.(IMB,ret)]
plot(x=mt.points$IMB,y = mt.points$ret*100,pch=20,xlab="Imbalance",ylab="Return",main="imbalance and return relation")
imb<-mt.model[IMB!=0,IMB]
t <- -1500:3000 # grid of imbalance values for the fitted curve
lines(x=t,y=ifelse(t>0,predict(pos.fit,list(IMB=t)),predict(neg.fit,list(IMB=t))),type="p",col="red2",pch=20)
Or, to make it better looking, use ggplot2 instead:
library(ggplot2)
mt.points[, fitted := ifelse(IMB < 0,
                             predict(neg.fit, list(IMB = IMB)),
                             predict(pos.fit, list(IMB = IMB)))]
ggplot(data = mt.points, aes(x = IMB, y = ret*100)) +
  geom_point() +
  geom_line(aes(y = fitted), color = "red") +
  xlab("IMB") + ylab("Return")
The relationship above can be exploited by identifying the IMB within the tranche times we cut before, using the historical data above, and then deciding whether or not to buy; i.e. we place our buy orders based on the historical IMB and select which interval to flash in and out of. Of course, in the A-share market the T+1 rule limits the strategy, so further improvement is needed; we have not adapted it to A-share constraints yet. A toy illustration of the signal is sketched below.
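A minimal sketch of the idea (the 90th-percentile cutoff is a hypothetical choice, not a tuned parameter, and this only restates the contemporaneous IMB-return relation fitted above):
threshold <- quantile(mt.model$IMB, 0.9)  # hypothetical cutoff for a "strong buy pressure" period
mt.model[, buy.signal := IMB > threshold]
mt.model[buy.signal == TRUE, mean(ret)]   # average return in the flagged periods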
Interestingly enough, there is a side discovery that validates behavioral finance theory. Note the asymmetry in the plot above? To explain this asymmetric phenomenon, we introduce Prospect Theory (https://en.wikipedia.org/wiki/Prospect_theory); its S-shaped value function, steeper for losses than for gains, conveys the intuition.
Some corollaries that can be applied in our analysis:
1. Losses hurt more than gains feel good (loss aversion).
2. Investors tend to behave risk-averse when holding gains.
3. Investors tend to behave risk-seeking when suffering losses.
Other revised versions such as Cumulative Prospect Theory and Rank-Dependent Expected Utility Theory can also help explain why asymmetric effect exists.
On the other hand, empirical studies can also help explain our statistical result. Empirical researchers find an asymmetric mean-reversion effect in stock markets: a stock price is more likely to drop further after going down and less likely to increase further after going up. Hence the asymmetric effect shown above.
Interesting! That is all for this project for now.