Copyright © 2018 T Leitch & J Liew
Recent developments
- Big data and machine learning are tools now readily applied in hedge funds
- Social media sites such as Twitter, StockTwits, Estimize, and LinkedIn are full of untapped information
- Recently we've seen strong activity in research employing such data
- Recently, the relationship between Google Trends and stock prices has been investigated
- Tweets and crowd-sourced earnings
- Tweet sentiments and IPOs
- Tweet sentiments and earnings, in regressions
- Social media tweet sentiment as the "Sixth" factor
The “Sixth” Factor

- Liew and Budavari (2016) identify a "sixth" factor in a sample of U.S. equities
- Data cover Jan 4, 2012 to Oct 30, 2015
- Factor based on stock characteristics generated by users' self-assessed "bullish" or "bearish" tweets
- Significance shown by time-series regression coefficients with t-statistics > 2.0
User-identified sentiment

- $Cash-tags, invented by StockTwits, associate tweets with a given stock ticker
- A label bar allows users to tag their sentiment as either Bullish or Bearish
- Linking sentiments to prices allows for factor-model existence research
Sentiments of AAPL over time…


- Percentage increases in adjusted R-squared for the CAPM, FF3, and FF5 models
- ~500 stocks
IPOs and Tweet Sentiments
Also included…
- Background
- What is Statistical Arbitrage?
- Where does Statistical Arbitrage fit in the context of the Hedge Fund Industry?
- Who are the Stat Arb (Quant) Players?
- Coverage
- Different Types of Stat Arb Strategies
- Top Ten List
- Mean-Reversion Overview
- Z-Score
- Co-Integration
- Grouping Stocks
- Mean-Reversion Case Study
- The “Secret-Sauce”
- Avellaneda and Lee (2008)
- Case Study: Mean-Reversion Model Implemented
- Optional Homework
Attempts at defining Stat Arb
“…refers to highly technical short-term mean-reversion strategies involving large numbers of securities (hundreds to thousands, depending on the amount of risk capital), very short holding periods (measured in days to seconds), and substantial computational, trading, and IT infrastructure” – Andrew Lo
http://www.alphasimplex.com/
“In the context of hedge funds, a style of management that employs complex statistical models that try to capture small abnormalities in a security’s intraday return.” – Campbell R. Harvey
http://www.duke.edu/~charvey/
An attempt to profit from pricing inefficiencies that are identified through the use of mathematical models. Statistical arbitrage attempts to profit from the likelihood that prices will trend toward a historical norm. Unlike pure arbitrage, statistical arbitrage is not riskless. – InvestorWords
Quantitative investment approach characterized by low use of discretion, high use of technology, research, data, and statistics/financial models.
Where does Statistical Arbitrage fit in the context of the hedge fund industry?
Hedge Fund Industry: $2.9T in 2014



Statistical Arbitrage exists within Equity Market Neutral (2.2%) and is also a component of Multi-Strategy (14.1%), Managed Futures (3.9%), and some Global Macro (17.6%)
A relatively small part of the hedge fund industry compared to Long/Short (22%) and Event Driven (26%)
Statistical Arbitrage (quant) is probably less than 5-10% of the overall industry
Different Types of Stat Arb Strategies
Top Ten Lists: Delusions and Characteristics for Success
- I’m different from all the others who have tried
- Teach me, I can learn the secret-sauce
- It’s easy to make money
- I’ll get better by reading that next paper/book
- I’m a damn-good programmer
- I think I have a clue, I know how markets behave, I can do this…
- More complexity the better
- There exists a perfect money-making model
- My ideas are good enough to make money
- My idea is unique
Top 10 Successful Characteristics for Stat Arb
- Actively try to be different
- Teach others and you will master the ability to create your own secret-sauce
- It’s hard to make money so don’t get arrogant, be thankful you have a shot
- Assume what you read is already arb’ed away, but keep reading to help generate the next idea/ adjustment/ extension
- Constantly improve your skill set
- Assume you don’t have a clue and always be a constant student of the market, you can do it!
- Keep it simple, really understand what you’re capturing
- The perfect money-making machine is dynamic
- Constantly refine your ideas
- Keep generating new ideas
Mean-Reversion/Counter-Trend Strategies
Z-Score
Z-Score: \(\frac{x_{i}-\mu}{\sigma}\)
\(x_{i}\) : ith observation
\(\mu\) : mean
\(\sigma\) : Standard dev.
Maps data to a common scale

Winsorize outliers by pushing them back to −3 or +3, respectively.
Why do you want to do this?
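A minimal sketch in Python (the ±3 winsorization band matches the bullets above; the function name is illustrative):

```python
import numpy as np

def zscore_winsorized(x, clip=3.0):
    """Map data to a common scale, then winsorize outliers by
    pushing them back to -clip or +clip (here the assumed band of 3)."""
    z = (x - np.mean(x)) / np.std(x)
    return np.clip(z, -clip, clip)
```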
What is Co-Integration?
Background info on Co-Integration
- Where does it come from?
- An econometric framework that allows for fluctuations from market equilibrium. Deviations are expected in the short run, but in the long run economic forces will prevail. When testing a given economic theory or model against data, co-integration provides a useful framework for determining whether the data support the theory or model.
- Common definitions:
- Two (or more) time series are co-integrated if their difference results in white noise.
- If two (or more) time series are themselves non-stationary but a linear combination of them is stationary, the series are said to be co-integrated.
- A stationary process is a stochastic process whose joint probability distribution does not change with shifts in time or space, i.e., the moments of the distribution (mean, variance, etc.) are well-defined.
- A non-stationary process is a stochastic process whose joint probability distribution changes over time, i.e., the moments of the distribution are not well-behaved: the mean or variance goes to infinity as time increases. Ex. a random walk.
- Two (or more) time series that have unit roots and are integrated of order 1, I(1), can sometimes be combined by a linear transformation into a series integrated of a lower order, I(0).
- A unit root results in a non-stationary process
- I(d): integrated of order d, whereby d differences will result in a stationary process
- How do we test for it?
- Dickey-Fuller test for stationarity
- Augmented Dickey-Fuller test for stationarity in the presence of serial correlation
- Others
- How do we make an index from this?
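A minimal sketch of the Augmented Dickey-Fuller test (assuming statsmodels is available; the AR(1) "spread" below is simulated purely for illustration):

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

# Simulated stationary AR(1) spread -- illustrative data only
rng = np.random.default_rng(0)
spread = np.zeros(500)
for t in range(1, 500):
    spread[t] = 0.9 * spread[t - 1] + rng.normal()

stat, pvalue, *rest = adfuller(spread)  # H0: the series has a unit root
print(f"ADF statistic = {stat:.2f}, p-value = {pvalue:.4f}")
# A small p-value rejects the unit root, i.e., the spread looks stationary
```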
Co-Integration Procedure
- Step 1: Take all unique two-stock "pairs" from the universe
- e.g., all S&P 500 stocks. How many pairs? ((500*500) − 500) / 2 = 124,750 {computationally feasible}
- Step 2: Run a regression of stock 1 on stock 2 to get the beta, or hedge ratio
- Step 3: Compute the spread and test whether it has a unit root using the Augmented Dickey-Fuller test; store the p-values
- Step 4: Compute the z-score of the spread
- Step 5: Back-test the z-scores employing open long/short, close long/short position parameters
- Step 6: Create a dynamic portfolio of the 100 best "pairs" using historical PnL (net of costs)
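A minimal sketch of Steps 1-4 in Python (`prices`, a dict of ticker -> price array, is an assumed input format):

```python
import numpy as np
from itertools import combinations
from statsmodels.tsa.stattools import adfuller

def pair_stats(prices):
    """Steps 1-4 above: returns (ticker1, ticker2, ADF p-value, z-score)
    for every unique pair, sorted so the best candidates come first."""
    results = []
    for t1, t2 in combinations(prices, 2):                     # Step 1: unique pairs
        p1, p2 = prices[t1], prices[t2]
        X = np.column_stack([np.ones_like(p2), p2])
        (alpha, beta), *_ = np.linalg.lstsq(X, p1, rcond=None) # Step 2: hedge ratio
        spread = p1 - (alpha + beta * p2)                      # Step 3: spread...
        pvalue = adfuller(spread)[1]                           # ...and unit-root p-value
        z = (spread[-1] - spread.mean()) / spread.std()        # Step 4: z-score
        results.append((t1, t2, pvalue, z))
    return sorted(results, key=lambda r: r[2])
```

Steps 5-6 (back-testing and selecting the 100 best pairs) would sit on top of these statistics.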
Are KO and PEP Co-Integrated?
Ex. KO and PEP
“100 Best Pairs” project.

How to Group Stocks?
Grouping a set of observations into subsets
- Clustering
- GICS Sectors
- Correlations
Algorithm
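A minimal sketch of one such grouping algorithm: hierarchical clustering on correlation distance (assuming scipy is available; the distance choice and group count are assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_stocks(returns, n_groups=10):
    """Group stocks by return co-movement; returns is a (T x N) array.
    Distance = 1 - correlation, so similar stocks end up close together."""
    corr = np.corrcoef(returns.T)                 # N x N correlation matrix
    dist = np.clip(1.0 - corr, 0.0, 2.0)          # guard tiny negative fp errors
    iu = np.triu_indices_from(dist, k=1)          # condensed form for linkage
    Z = linkage(dist[iu], method="average")
    return fcluster(Z, t=n_groups, criterion="maxclust")  # group label per stock
```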
Stock Sectors Example

Mean-Reversion Framework
Step 1: Define the “residual”
Consider two stocks: stock A, and stock B (or ETFs, PCs, EW, ERW, MV, linear regression, etc…)
- “residual” = Return from Stock A – Return from Stock B [Classic Pairs]
- “residual” = Return from Stock A – Return from the ETF to which Stock A belongs
- “residual” = Return from Stock A – Return from PCs of the group to which Stock A belongs
- “residual” = Return from Stock A – Return from the EW portfolio to which Stock A belongs
- “residual” = Return from Stock A – Return from the ERW portfolio to which Stock A belongs
- “residual” = Return from Stock A – [α + β*r(ETF)] (Avellaneda and Lee)
- Or in general,
- “residual” = Return from Stock A – Return from “Middle”
Step 2: Sum the "residuals" over a window to get X (e.g., 60 days). Why do this?
Step 3: Z-score the X's to determine whether the signal is too far away and will revert back
Step 4: Examine performance net of costs (study robustness of optimal parameters)
Avellaneda and Lee (2008) Framework
Step 1: Identify the "middle"; use ETFs (and PCA)
\[R_{n}^{S} = \beta_{0}+\beta R_{n}^{I} + \epsilon_{n},\quad n=1,2,\ldots,60\]
Step 2: Identify the auxiliary process by summing the residuals:
\[X_{k}=\sum^{k}_{j=1} \epsilon_{j},\quad k=1,2,\ldots,60\]
\[X_{n+1} = a + bX_{n}+\zeta_{n+1},\quad n=1,\ldots,59\]
Step 3: Estimate the Ornstein-Uhlenbeck process parameters:
\[dX_{i}(t)=\kappa_{i}(m_{i}-X_{i}(t))\,dt + \sigma_{i}\,dW_{i}(t),\quad \kappa_{i}>0\]
\[\kappa = -\log(b)\cdot 252,\qquad m=\frac{a}{1-b}\]
\[\sigma=\sqrt{\frac{\mathrm{Var}(\zeta)\cdot 2\kappa}{1-b^{2}}},\qquad \sigma_{eq}=\sqrt{\frac{\mathrm{Var}(\zeta)}{1-b^{2}}}\]
Step 4: Compute S-Scores (Z-Scores):
\[ s=\frac{X(t)-m}{\sigma_{eq}}\]
S-Score Calculation
- Residual based on linear regression (OLS)
- Residual = return(JPM) – (a + b*return(XLF)), or EW, EW-R…
- Build a time series of the cumulative sum of residuals over a window (e.g., 60 days)
- Estimate the OU process parameters by fitting the cumulative-residual series in an AR(1) linear regression
- Estimate a and b; calculate the OU parameters as in Avellaneda and Lee (2008)
- Calculate the S-Score
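A minimal sketch of the full calculation in Python (the 60-day stock and ETF return inputs are assumed to be numpy arrays, and the guard on b is an added safety check):

```python
import numpy as np

def s_score(stock_ret, etf_ret):
    """S-score following the steps above: OLS residuals, cumulative sum,
    AR(1) fit, then OU parameters as in Avellaneda and Lee (2008)."""
    # Step 1: regress stock returns on ETF returns -> residuals
    X = np.column_stack([np.ones_like(etf_ret), etf_ret])
    beta, *_ = np.linalg.lstsq(X, stock_ret, rcond=None)
    eps = stock_ret - X @ beta
    # Step 2: auxiliary process X_k = cumulative sum of residuals
    Xk = np.cumsum(eps)
    # AR(1): X_{n+1} = a + b*X_n + zeta
    Z = np.column_stack([np.ones(len(Xk) - 1), Xk[:-1]])
    (a, b), *_ = np.linalg.lstsq(Z, Xk[1:], rcond=None)
    if not (0.0 < b < 1.0):
        return np.nan                          # no mean reversion detected
    zeta = Xk[1:] - Z @ np.array([a, b])
    # Step 3: OU parameters implied by the AR(1) fit
    m = a / (1.0 - b)
    sigma_eq = np.sqrt(np.var(zeta) / (1.0 - b**2))
    # Step 4: s-score of the latest observation
    return (Xk[-1] - m) / sigma_eq
```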
Matching JPM vs XLF…


Flip the S-score…
(only for this example, error in doc)


Practical Implications
Signals to Positions
How to map signals to positions
Signals in their most primitive form are "buy," "sell," or "do nothing"; in integers: 1, -1, 0.
Additionally, signals may exist in the continuum of [-3, 3], obtained from some form of z-score methodology
- The challenge is to convert the signals into positions, as we need to trade positions
One way is to trade a fixed number of shares, e.g., 100 (or contracts, etc.); other ways include risk-based, market-cap-based, and optimized sizing
For example, all trades are represented by 100 shares to buy/sell, so signal-to-position is a simple linear function:
ex. (-1 signal) * (100 shares) = -100 [trade this quantity]
Alternatively, positions can be derived through a monotonic transformation of the signal space [-3, 3] to position space, such as the sketch below:
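For instance, a minimal sketch of one such monotonic map (the tanh scaling and the 1,000-share cap are assumptions, as the slide's original formula is not shown here):

```python
import numpy as np

def signal_to_position(signal, max_shares=1000):
    """Monotonically map a signal in [-3, 3] to a share position in
    [-max_shares, max_shares]; tanh saturates smoothly at the extremes."""
    s = np.clip(signal, -3.0, 3.0)
    return int(round(max_shares * np.tanh(s)))
```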
Forecasting Regressions
First, what are some of the typical variables that help forecast asset price movements?
- Ex. VIX, MOVE, corporate bond spreads, term-spreads, emerging market spreads, long-term yields, dividend yields, earnings yields, historical volatility, bid-ask spreads, liquidity, macro variables, corporate actions, news, the “middle,” flows, activity, trend, reversion, etc. For equities see: http://www.alphasimplex.com/pdfs/JPM2008_Final.pdf
- How can we alter this regression to include multi-horizon forecasts?
- How can we include a sense of competition among the factors? Static vs. dynamic?
- Contemporaneous relationships are important too; how can we include them in a forecasting regression framework?
How can we balance additional complexity with realizable PnL benefits?
\[r_{t+1}=\alpha_{t}+b_{1t}f_{1t}+b_{2t}f_{2t}+\cdots+b_{kt}f_{kt}+\epsilon_{t}\]
\(r_{t+1}\): return of the asset to be traded
\(f_{jt}\): forecasting variable j
\(b_{jt}\): sensitivity to factor j
\(\epsilon_{t}\): residual
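A minimal sketch of this regression run on a rolling window (the window length and the use of plain OLS are assumptions):

```python
import numpy as np

def rolling_forecast(r, F, window=252):
    """At each date t, regress next-period returns on k forecasting
    variables over the trailing window, then forecast r_{t+1}.
    r: (T,) asset returns; F: (T, k) factor values (VIX, spreads, ...)."""
    T, k = F.shape
    preds = np.full(T, np.nan)
    for t in range(window, T - 1):
        X = np.column_stack([np.ones(window), F[t - window:t]])
        y = r[t - window + 1:t + 1]          # r_{s+1} aligned with f_s
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        preds[t + 1] = np.concatenate([[1.0], F[t]]) @ coef
    return preds
```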
How to Employ: Principal Components Analysis (PCA)
PCA is a mathematical procedure that transforms data into orthogonal components. The top components are typically employed to determine time series of residuals and thus measure deviations that are "too far away" and hence predicted to revert back.
A nice way to understand the structure of your data. Typically, for stock data the first principal component resembles a long-only "pseudo-market" portfolio, and the second component looks like a long/short portfolio. Note that PCA assumes the structure is constant over the period examined; financial data is time-stamped, so periods of non-traditional behavior become problematic.
Matlab – svd(); C++ - pca.h
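In Python, a minimal SVD-based PCA sketch (the two-component choice is an assumption):

```python
import numpy as np

def pca_components(returns, n=2):
    """PCA via SVD on demeaned returns (T x N). For stock data the first
    component's loadings typically share one sign, resembling a long-only
    'pseudo-market' portfolio; the second often looks long/short."""
    X = returns - returns.mean(axis=0)
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    loadings = Vt[:n]                         # per-stock weights of top components
    variance_share = S[:n]**2 / (S**2).sum()  # fraction of variance explained
    return loadings, variance_share
```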
Independent Component Analysis (ICA)
Example: Intra-Day Mean-Reversion Model
Outline
Present results from our mean-reversion model that trades nine liquid ETFs with no overnight positions, employing 1-minute bar data
- Current markets examined:
- (1) SPY (S&P 500), (2) DIA (Dow Jones), (3) QQQQ (Nasdaq 100), (4) MDY (MidCap 400), (5) SMH (Semiconductor), (6) IWM (Russell 2000), (7) EWJ (MSCI Japan), (8) OIH (Oil Services), (9) BBH (Biotech)
- Idea: To provide liquidity to markets when a market gets pushed too far away from the “middle”
- Over the long run, markets revert back to normal conditions and alpha can be harvested over time
- Alpha is robust to time-of-day (and markets)
Confirms prior knowledge of the efficacy of mean-reversion across other markets
Data from 6/30/2008 to 4/9/2009: One minute closing prices per market
Proposal
- Trade Mean-Reversion across the following portfolios:
Implementation Issues
- Lessons learned on back-testing assumptions:
- A fill rate of 100% is a very unrealistic assumption
- Market impact is very different when you actually trade, even with the most seemingly innocuous size constraints ("simulation" vs. "live")
- Risk management is very important for passive execution
- Reflections:
- Observed real-time PnL behaved as anticipated: mean-reverting environments yielded profitable days and trending environments did not
Example: Ultra-high frequency “2xSPY ≈ SSO”
Introduction
- Simple pure arbitrage with active execution, SPY vs. SSO (intraday, SSO is equivalent to 2x SPY performance)
- Inputs are bid and ask quotes (and size); arb the following events, roughly speaking:
- (1) SPY Bid > (2) SSO Ask
- (3) SSO Bid > (4) SPY Ask
- The complete arb basket consists of 4 legs, ((1)&(2)) and ((3)&(4)); the order of execution may vary
- Cost assumptions: $1 ticket charge and $0.003/share
- Ex. Complete Arb Basket of 1,000 shrs of SPY vs 1,470 shrs of SSO
= $4 (four tickets) + 2* $0.003 * (1,000 shrs + 1,470 shrs)
= $18.82 per trade
Brief Description of ETFs

- “SPY” – SPDR S&P500
- Avg Vol. 142M
- Price 112.42 (as of 2:51pm 12/30/2009)

- “SSO” – ProShares Ultra S&P500
- Avg Vol. 17.6 M
- Price 38.95 (as of 2:52pm 12/30/2009)
“ProShares Ultra S&P500 (the Fund) seeks daily investment results that correspond to twice (200%) the daily performance of the S&P 500 Index.”
Example of Raw Quotes Data
Create Dynamic Hedge Ratio (DHR) for 1,000 Shares of SPY
DHR = (1,000 * lag_SPY / 2) / lag_SSO
SPY/SSO Arb Algorithm
- Step 1: Compute Indicator
- Diff_Arb1 = 2 * DHR * (SSO Bid) – 1,000 * (SPY Ask)
- Diff_Arb2 = 1,000 * (SPY Bid) – 2 * DHR * (SSO Ask)
- Step 2:
- If (Diff_Arb1 > λ), then trade
- If (Diff_Arb2 > λ’), then trade
- Assume sub-millisecond execution
- Step 3: Limit Trading to One Arb Basket
- max book size = (+/-)1,000 shrs of SPY and (-/+) DHR shrs of SSO
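A minimal sketch of Steps 1-3 in Python (the threshold value `lam` is an assumption; in practice it must clear the ~$18.82 round-trip cost computed earlier):

```python
def arb_signal(spy_bid, spy_ask, sso_bid, sso_ask, lag_spy, lag_sso, lam=20.0):
    """Compute the SPY/SSO arb indicators and return a trade direction,
    limited to one arb basket (1,000 SPY shares vs. DHR SSO shares)."""
    # Dynamic hedge ratio: SSO shares hedging 1,000 shares of SPY
    dhr = (1000 * lag_spy / 2) / lag_sso
    diff_arb1 = 2 * dhr * sso_bid - 1000 * spy_ask   # SSO rich: sell SSO, buy SPY
    diff_arb2 = 1000 * spy_bid - 2 * dhr * sso_ask   # SPY rich: sell SPY, buy SSO
    if diff_arb1 > lam:
        return ("SELL_SSO_BUY_SPY", dhr)
    if diff_arb2 > lam:
        return ("SELL_SPY_BUY_SSO", dhr)
    return (None, dhr)
```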
How long does it last?

- All times above 1 second are capped at 1 second; the peak is under 5 milliseconds
Histogram of the Quote Size
Statistics Bid/Ask Size
- Data from 9:30 am to 11:30 am


Conclusion
- This arb printed money in the past, but now it's too crowded; it's a speed race…
- Need to execute on the order of ~20 microseconds
- Everyone co-locates their programs on exchange servers, and even adjusts hardware for an edge
- Similar arbs currently work, but you need to really understand the ETF product/rebalance process
Secret-Sauce Steps
- Gather Data
- What do you want to trade? Stocks, bonds, futures, currencies, options, etc. Gathering and cleaning the data can be very time-consuming, but it is the first step
- Look at Data
- Always look at your data using equity lines; understand your data, fix gaps, forward-fill NaNs; other tools include summary stats and PCA. Do you need to transform the data, e.g., to trade equal risk per position? Does it co-move? Can you see periods of high correlation in the tails? Entropy?
- Determine Model Type: MR or MO
- What kind of alpha do you want to extract? Is it trend/momentum (MO) or mean-reversion (MR)? What inputs are going to be used to predict or build signals? Just prices, trades, bid/ask, size, volume, open interest, time-of-day, trade-time, block-orders, transaction order, etc.
- If MR: Define Residuals
- How do you define the "residual"? Is it based on a benchmark: equally weighted, equal-risk weighted, MV, regressions, PCA, others? Is the thesis, more precisely, that the cumulative residuals, or some function of the residuals, are mean-reverting?
- Generate Signal
- Typically, the core model is determined by some measure of "near" versus "far away." Cumulative residuals are a good starting point, but improvements can be made by thinking about how distance is measured/defined. Z-scores, S-scores, and winsorizing help with signal construction and combination.
- Signals into Positions
- Ultimately you need to convert information from signals into actual positions, as only positions are traded. Be mindful of the underlying liquidity. Once again, look at your signals and positions over time; this tells you whether your model is fast or slow in terms of turnover.
- Positions to historical PnL net of costs
- Try to get to a Sharpe ratio as fast as possible. Comparative statistics are easy to compute if your code runs quickly. Examine PnL sensitivities to at most one or two parameters at a time; if you look at too many parameters you will lose intuition about the behavior of your model.
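A minimal sketch of this last step (the 5 bps linear cost per unit of turnover is an assumed figure):

```python
import numpy as np

def pnl_net_of_costs(positions, returns, cost_per_turnover=0.0005):
    """Daily PnL of a position series net of a linear transaction cost;
    positions[t] is held over period t+1's return."""
    pos = np.asarray(positions, dtype=float)
    ret = np.asarray(returns, dtype=float)
    gross = pos[:-1] * ret[1:]               # position times next-period return
    turnover = np.abs(np.diff(pos))          # amount traded each period
    net = gross - cost_per_turnover * turnover
    sharpe = np.sqrt(252) * net.mean() / net.std()
    return net, sharpe
```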
Strategies and Some Tickers you could use
- Macro
- Managed Futures
- FX
- Carry trade: DBV
- Outright: UUP, FXE
- Emerging Markets
- EEM
- VWO
- Numerous country and region specific funds
- Event Driven
- Momentum/Value
- Value
- Mergers
- Convertibles
- Distressed
- Stat arb