Copyright © 2018 T Leitch & J Liew
Recent developments
- Big data and machine learning are tools now readily applied in hedge funds
- Social media sites such as Twitter, StockTwits, Estimize, and LinkedIn are full of untapped information
- Recently we've seen strong activity in research employing such data
- Recently, the relationship between Google Trends and stock prices has been investigated
- Tweets and crowd-sourced earnings
- Tweet sentiments and IPOs
- Tweet sentiments and earnings, in regressions
- Social media tweet sentiment as the "Sixth" factor
The “Sixth” Factor

- Liew and Budavari (2016) identify a "sixth" factor in a sample of U.S. equities
- Data cover Jan 4, 2012 to Oct 30, 2015
- Factor based on stock characteristics generated by users' self-assessed "bullish" or "bearish" tweets
- Significance shown by time-series regression coefficients with t-statistics > 2.0
User-identified sentiment

- $Cash-tags, invented by StockTwits, associate tweets with a given stock ticker
- A label bar allows users to tag their sentiment as either Bullish or Bearish
- Linking sentiments to prices allows for factor-model existence research
Sentiments of AAPL over time…


- Percentage increases in adjusted R-squared for the CAPM, FF3, and FF5 models
- ~500 stocks
IPOs and Tweet Sentiments
Also included…
- Background
- What is Statistical Arbitrage?
- Where does Statistical Arbitrage fit in the context of the Hedge Fund Industry?
- Who are the Stat Arb (Quant) Players?
- Coverage
- Different Types of Stat Arb Strategies
- Top Ten List
- Mean-Reversion Overview
- Z-Score
- Co-Integration
- Grouping Stocks
- Mean-Reversion Case Study
- The “Secret-Sauce”
- Avellaneda and Lee (2008)
- Case Study: Mean-Reversion Model Implemented
- Optional Homework
Attempts at defining Stat Arb
“…refers to highly technical short-term mean-reversion strategies involving large numbers of securities (hundreds to thousands, depending on the amount of risk capital), very short holding periods (measured in days to seconds), and substantial computational, trading, and IT infrastructure” – Andrew Lo
http://www.alphasimplex.com/
“In the context of hedge funds, a style of management that employs complex statistical models that try to capture small abnormalities in a security’s intraday return.” – Campbell R. Harvey
http://www.duke.edu/~charvey/
An attempt to profit from pricing inefficiencies that are identified through the use of mathematical models. Statistical arbitrage attempts to profit from the likelihood that prices will trend toward a historical norm. Unlike pure arbitrage, statistical arbitrage is not riskless. – InvestorWords
Quantitative investment approach characterized by low use of discretion, high use of technology, research, data, and statistics/financial models.
Where does Statistical Arbitrage fit in the context of the hedge fund industry?
Hedge Fund Industry: $2.9T in 2014



Statistical Arbitrage exists within Equity Market Neutral (2.2%) and is also a component of Multi-Strategy (14.1%), Managed Futures (3.9%), and some Global Macro (17.6%)
A relatively small part of the hedge fund industry compared to Long/Short (22%) and Event Driven (26%)
Statistical Arbitrage (quant) is probably less than 5-10% of the overall industry
Different Types of Stat Arb Strategies
Top Ten Lists: Delusions and Characteristics for Success
- I’m different from all the others who have tried
- Teach me, I can learn the secret-sauce
- It’s easy to make money
- I’ll get better by reading that next paper/book
- I’m a damn-good programmer
- I think I have a clue, I know how markets behave, I can do this…
- More complexity the better
- There exists a perfect money-making model
- My ideas are good enough to make money
- My idea is unique
Top 10 Successful Characteristics for Stat Arb
- Actively try to be different
- Teach others and you will master the ability to create your own secret-sauce
- It’s hard to make money so don’t get arrogant, be thankful you have a shot
- Assume what you read is already arb’ed away, but keep reading to help generate the next idea/ adjustment/ extension
- Constantly improve your skill set
- Assume you don’t have a clue and always be a constant student of the market, you can do it!
- Keep it simple, really understand what you’re capturing
- The perfect money-making machine is dynamic
- Constantly refine your ideas
- Keep generating new ideas
Mean-Reversion/Counter-Trend Strategies
Z-Score
Z-Score: \(\frac{x_{i}-\mu}{\sigma}\)
\(x_{i}\) : ith observation
\(\mu\) : mean
\(\sigma\) : Standard dev.
Maps data to a common scale

Winsorize outliers by pushing them back to −3 or +3, respectively.
Why do you want to do this?
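A minimal sketch in Python (the ±3 winsorization band matches the bullets above; the function name is illustrative):

```python
import numpy as np

def zscore_winsorized(x, clip=3.0):
    """Map data to a common scale, then winsorize outliers by
    pushing them back to -clip or +clip (here the assumed band of 3)."""
    z = (x - np.mean(x)) / np.std(x)
    return np.clip(z, -clip, clip)
```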
What is Co-Integration?
Background info on Co-Integration
- Where does it come from?
- An econometric framework that allows for fluctuations from market equilibrium. Deviations are expected in the short run, but in the long run economic forces will prevail. When testing a given economic theory or model against data, co-integration provides a useful framework for determining whether the data support the theory or model.
- Common definitions:
- Two (or more) time series are co-integrated if their difference results in white noise.
- If two (or more) time series are themselves non-stationary but a linear combination of them is stationary, the series are said to be co-integrated.
- A stationary process is a stochastic process whose joint probability distribution does not change with shifts in time or space, i.e., the moments of the distribution (mean, variance, etc.) are well-defined.
- A non-stationary process is a stochastic process whose joint probability distribution changes over time, i.e., the moments of the distribution are not well-behaved: the mean or variance goes to infinity as time increases. Ex. a random walk.
- Two (or more) time series that have unit roots and are integrated of order 1, I(1), can sometimes be combined by a linear transformation into a series integrated of a lower order, I(0).
- A unit root results in a non-stationary process
- I(d): integrated of order d, whereby d differences will result in a stationary process
- How do we test for it?
- Dickey-Fuller test for stationarity
- Augmented Dickey-Fuller test for stationarity in the presence of serial correlation
- Others
- How do we make an index from this?
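A minimal sketch of the Augmented Dickey-Fuller test (assuming statsmodels is available; the AR(1) "spread" below is simulated purely for illustration):

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

# Simulated stationary AR(1) spread -- illustrative data only
rng = np.random.default_rng(0)
spread = np.zeros(500)
for t in range(1, 500):
    spread[t] = 0.9 * spread[t - 1] + rng.normal()

stat, pvalue, *rest = adfuller(spread)  # H0: the series has a unit root
print(f"ADF statistic = {stat:.2f}, p-value = {pvalue:.4f}")
# A small p-value rejects the unit root, i.e., the spread looks stationary
```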
Co-Integration Procedure
- Step 1: Take all unique two-stock "pairs" from the universe
- e.g., all S&P 500 stocks. How many pairs? ((500*500) − 500) / 2 = 124,750 {computationally feasible}
- Step 2: Run a regression of stock 1 on stock 2 to get the beta, or hedge ratio
- Step 3: Compute the spread and test whether it has a unit root using the Augmented Dickey-Fuller test; store the p-values
- Step 4: Compute the z-score of the spread
- Step 5: Back-test the z-scores employing open long/short, close long/short position parameters
- Step 6: Create a dynamic portfolio of the 100 best "pairs" using historical PnL (net of costs)
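A minimal sketch of Steps 1-4 in Python (`prices`, a dict of ticker -> price array, is an assumed input format):

```python
import numpy as np
from itertools import combinations
from statsmodels.tsa.stattools import adfuller

def pair_stats(prices):
    """Steps 1-4 above: returns (ticker1, ticker2, ADF p-value, z-score)
    for every unique pair, sorted so the best candidates come first."""
    results = []
    for t1, t2 in combinations(prices, 2):                     # Step 1: unique pairs
        p1, p2 = prices[t1], prices[t2]
        X = np.column_stack([np.ones_like(p2), p2])
        (alpha, beta), *_ = np.linalg.lstsq(X, p1, rcond=None) # Step 2: hedge ratio
        spread = p1 - (alpha + beta * p2)                      # Step 3: spread...
        pvalue = adfuller(spread)[1]                           # ...and unit-root p-value
        z = (spread[-1] - spread.mean()) / spread.std()        # Step 4: z-score
        results.append((t1, t2, pvalue, z))
    return sorted(results, key=lambda r: r[2])
```

Steps 5-6 (back-testing and selecting the 100 best pairs) would sit on top of these statistics.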
Are KO and PEP Co-Integrated?
Ex. KO and PEP
“100 Best Pairs” project.

How to Group Stocks?
Grouping a set of observations into subsets
- Clustering
- GICS Sectors
- Correlations
Algorithm
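A minimal sketch of one such grouping algorithm: hierarchical clustering on correlation distance (assuming scipy is available; the distance choice and group count are assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_stocks(returns, n_groups=10):
    """Group stocks by return co-movement; returns is a (T x N) array.
    Distance = 1 - correlation, so similar stocks end up close together."""
    corr = np.corrcoef(returns.T)                 # N x N correlation matrix
    dist = np.clip(1.0 - corr, 0.0, 2.0)          # guard tiny negative fp errors
    iu = np.triu_indices_from(dist, k=1)          # condensed form for linkage
    Z = linkage(dist[iu], method="average")
    return fcluster(Z, t=n_groups, criterion="maxclust")  # group label per stock
```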
Stock Sectors Example

Mean-Reversion Framework
Step 1: Define the “residual”
Consider two stocks: stock A, and stock B (or ETFs, PCs, EW, ERW, MV, linear regression, etc…)
- “residual” = Return from Stock A – Return from Stock B [Classic Pairs]
- “residual” = Return from Stock A – Return from the ETF to which Stock A belongs
- “residual” = Return from Stock A – Return from PCs of the group to which Stock A belongs
- “residual” = Return from Stock A – Return from the EW portfolio to which Stock A belongs
- “residual” = Return from Stock A – Return from the ERW portfolio to which Stock A belongs
- “residual” = Return from Stock A – [α + β*r(ETF)] (Avellaneda and Lee)
- Or in general,
- “residual” = Return from Stock A – Return from “Middle”
Step 2: Sum the "residuals" over a window to get X (e.g., 60 days). Why do this?
Step 3: Z-score the X's to determine whether the signal is too far away and will revert back
Step 4: Examine performance net of costs (study robustness of optimal parameters)
Avellaneda and Lee (2008) Framework
Step 1: Identify the "middle"; use ETFs (and PCA)
\[R_{n}^{S} = \beta_{0}+\beta R_{n}^{I} + \epsilon_{n},\quad n=1,2,\ldots,60\]
Step 2: Identify the auxiliary process by summing the residuals:
\[X_{k}=\sum^{k}_{j=1} \epsilon_{j},\quad k=1,2,\ldots,60\]
\[X_{n+1} = a + bX_{n}+\zeta_{n+1},\quad n=1,\ldots,59\]
Step 3: Estimate the Ornstein-Uhlenbeck process parameters:
\[dX_{i}(t)=\kappa_{i}(m_{i}-X_{i}(t))\,dt + \sigma_{i}\,dW_{i}(t),\quad \kappa_{i}>0\]
\[\kappa = -\log(b)\cdot 252,\qquad m=\frac{a}{1-b}\]
\[\sigma=\sqrt{\frac{\mathrm{Var}(\zeta)\cdot 2\kappa}{1-b^{2}}},\qquad \sigma_{eq}=\sqrt{\frac{\mathrm{Var}(\zeta)}{1-b^{2}}}\]
Step 4: Compute S-Scores (Z-Scores):
\[ s=\frac{X(t)-m}{\sigma_{eq}}\]
S-Score Calculation
- Residual based on linear regression (OLS)
- Residual = return(JPM) – (a + b*return(XLF)), or EW, EW-R…
- Build a time series of the cumulative sum of residuals over a window (e.g., 60 days)
- Estimate the OU process parameters by fitting the cumulative-residual series in an AR(1) linear regression
- Estimate a and b; calculate the OU parameters as in Avellaneda and Lee (2008)
- Calculate the S-Score
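A minimal sketch of the full calculation in Python (the 60-day stock and ETF return inputs are assumed to be numpy arrays, and the guard on b is an added safety check):

```python
import numpy as np

def s_score(stock_ret, etf_ret):
    """S-score following the steps above: OLS residuals, cumulative sum,
    AR(1) fit, then OU parameters as in Avellaneda and Lee (2008)."""
    # Step 1: regress stock returns on ETF returns -> residuals
    X = np.column_stack([np.ones_like(etf_ret), etf_ret])
    beta, *_ = np.linalg.lstsq(X, stock_ret, rcond=None)
    eps = stock_ret - X @ beta
    # Step 2: auxiliary process X_k = cumulative sum of residuals
    Xk = np.cumsum(eps)
    # AR(1): X_{n+1} = a + b*X_n + zeta
    Z = np.column_stack([np.ones(len(Xk) - 1), Xk[:-1]])
    (a, b), *_ = np.linalg.lstsq(Z, Xk[1:], rcond=None)
    if not (0.0 < b < 1.0):
        return np.nan                          # no mean reversion detected
    zeta = Xk[1:] - Z @ np.array([a, b])
    # Step 3: OU parameters implied by the AR(1) fit
    m = a / (1.0 - b)
    sigma_eq = np.sqrt(np.var(zeta) / (1.0 - b**2))
    # Step 4: s-score of the latest observation
    return (Xk[-1] - m) / sigma_eq
```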
Matching JPM vs XLF…


Flip the S-score…
(only for this example, error in doc)


Practical Implications
Signals to Positions
How to map signals to positions
Signals in their most primitive form are "buy," "sell," or "do nothing"; in integers: 1, -1, 0.
Additionally, signals may exist in the continuum of [-3, 3], obtained from some form of z-score methodology
- The challenge is to convert the signals into positions, as we need to trade positions
One way is to trade a fixed number of shares, e.g., 100 (or contracts, etc.); other ways include risk-based, market-cap-based, and optimized sizing
For example, all trades are represented by 100 shares to buy/sell, so signal-to-position is a simple linear function:
ex. (-1 signal) * (100 shares) = -100 [trade this quantity]
Alternatively, positions can be derived through a monotonic transformation of the signal space [-3, 3] to position space, such as the sketch below:
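For instance, a minimal sketch of one such monotonic map (the tanh scaling and the 1,000-share cap are assumptions, as the slide's original formula is not shown here):

```python
import numpy as np

def signal_to_position(signal, max_shares=1000):
    """Monotonically map a signal in [-3, 3] to a share position in
    [-max_shares, max_shares]; tanh saturates smoothly at the extremes."""
    s = np.clip(signal, -3.0, 3.0)
    return int(round(max_shares * np.tanh(s)))
```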
Forecasting Regressions
First, what are some of the typical variables that help forecast asset price movements?
- Ex. VIX, MOVE, corporate bond spreads, term-spreads, emerging market spreads, long-term yields, dividend yields, earnings yields, historical volatility, bid-ask spreads, liquidity, macro variables, corporate actions, news, the “middle,” flows, activity, trend, reversion, etc. For equities see: http://www.alphasimplex.com/pdfs/JPM2008_Final.pdf
- How can we alter this regression to include multi-horizon forecasts?
- How can we include a sense of competition among the factors? Static vs. dynamic?
- Contemporaneous relationships are important too; how can we include them in a forecasting regression framework?
How can we balance additional complexity with realizable PnL benefits?
\[r_{t+1}=\alpha_{t}+b_{1t}f_{1t}+b_{2t}f_{2t}+\cdots+b_{kt}f_{kt}+\epsilon_{t}\]
\(r_{t+1}\): return of the asset to be traded
\(f_{jt}\): forecasting variable j
\(b_{jt}\): sensitivity to factor j
\(\epsilon_{t}\): residual
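A minimal sketch of this regression run on a rolling window (the window length and the use of plain OLS are assumptions):

```python
import numpy as np

def rolling_forecast(r, F, window=252):
    """At each date t, regress next-period returns on k forecasting
    variables over the trailing window, then forecast r_{t+1}.
    r: (T,) asset returns; F: (T, k) factor values (VIX, spreads, ...)."""
    T, k = F.shape
    preds = np.full(T, np.nan)
    for t in range(window, T - 1):
        X = np.column_stack([np.ones(window), F[t - window:t]])
        y = r[t - window + 1:t + 1]          # r_{s+1} aligned with f_s
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        preds[t + 1] = np.concatenate([[1.0], F[t]]) @ coef
    return preds
```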
How to Employ: Principal Components Analysis (PCA)
PCA is a mathematical procedure that transforms data into orthogonal components. The top components are typically employed to determine time series of residuals and thus measure deviations that are "too far away" and hence predicted to revert back.
A nice way to understand the structure of your data. Typically, for stock data the first principal component resembles a long-only "pseudo-market" portfolio, and the second component looks like a long/short portfolio. Note that PCA assumes the structure is constant over the period examined; financial data is time-stamped, so periods of non-traditional behavior become problematic.
Matlab – svd(); C++ - pca.h
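In Python, a minimal SVD-based PCA sketch (the two-component choice is an assumption):

```python
import numpy as np

def pca_components(returns, n=2):
    """PCA via SVD on demeaned returns (T x N). For stock data the first
    component's loadings typically share one sign, resembling a long-only
    'pseudo-market' portfolio; the second often looks long/short."""
    X = returns - returns.mean(axis=0)
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    loadings = Vt[:n]                         # per-stock weights of top components
    variance_share = S[:n]**2 / (S**2).sum()  # fraction of variance explained
    return loadings, variance_share
```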
Independent Component Analysis (ICA)
Example: Intra-Day Mean-Reversion Model
Outline
Present results from our mean-reversion model that trades nine liquid ETFs with no overnight positions, employing 1-minute bar data
- Current markets examined:
- (1) SPY (S&P 500), (2) DIA (Dow Jones), (3) QQQQ (Nasdaq 100), (4) MDY (MidCap 400), (5) SMH (Semiconductor), (6) IWM (Russell 2000), (7) EWJ (MSCI Japan), (8) OIH (Oil Services), (9) BBH (Biotech)
- Idea: To provide liquidity to markets when a market gets pushed too far away from the “middle”
- Over the long run, markets revert back to normal conditions and alpha can be harvested over time
- Alpha is robust to time-of-day (and markets)
Confirms prior knowledge of the efficacy of mean-reversion across other markets
Data from 6/30/2008 to 4/9/2009: One minute closing prices per market
Proposal
- Trade Mean-Reversion across the following portfolios:
Implementation Issues
- Lessons learned on back-testing assumptions:
- A fill rate of 100% is a very unrealistic assumption
- Market impact is very different when you actually trade, even with the most seemingly innocuous size constraints ("simulation" vs. "live")
- Risk management is very important for passive execution
- Reflections:
- Observed real-time PnL behaved as anticipated: mean-reverting environments yielded profitable days and trending environments did not
Example: Ultra-high frequency “2xSPY ≈ SSO”
Introduction
- Simple pure arbitrage with active execution, SPY vs. SSO (intraday, SSO is equivalent to 2x SPY performance)
- Inputs are bid and ask quotes (and size); arb the following events, roughly speaking:
- (1) SPY Bid > (2) SSO Ask
- (3) SSO Bid > (4) SPY Ask
- The complete arb basket consists of 4 legs, ((1)&(2)) and ((3)&(4)); the order of execution may vary
- Cost assumptions: $1 ticket charge and $0.003/share
- Ex. Complete Arb Basket of 1,000 shrs of SPY vs 1,470 shrs of SSO
= $4 (four tickets) + 2* $0.003 * (1,000 shrs + 1,470 shrs)
= $18.82 per trade
Brief Description of ETFs

- “SPY” – SPDR S&P500
- Avg Vol. 142M
- Price 112.42 (as of 2:51pm 12/30/2009)

- “SSO” – ProShares Ultra S&P500
- Avg Vol. 17.6 M
- Price 38.95 (as of 2:52pm 12/30/2009)
“ProShares Ultra S&P500 (the Fund) seeks daily investment results that correspond to twice (200%) the daily performance of the S&P 500 Index.”
Example of Raw Quotes Data
Create Dynamic Hedge Ratio (DHR) for 1,000 Shares of SPY
DHR = (1,000 * lag_SPY / 2) / lag_SSO
SPY/SSO Arb Algorithm
- Step 1: Compute Indicator
- Diff_Arb1 = 2 * DHR * (SSO Bid) – 1,000 * (SPY Ask)
- Diff_Arb2 = 1,000 * (SPY Bid) – 2 * DHR * (SSO Ask)
- Step 2:
- If (Diff_Arb1 > λ), then trade
- If (Diff_Arb2 > λ’), then trade
- Assume sub-millisecond execution
- Step 3: Limit Trading to One Arb Basket
- max book size = (+/-)1,000 shrs of SPY and (-/+) DHR shrs of SSO
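A minimal sketch of Steps 1-3 in Python (the threshold value `lam` is an assumption; in practice it must clear the ~$18.82 round-trip cost computed earlier):

```python
def arb_signal(spy_bid, spy_ask, sso_bid, sso_ask, lag_spy, lag_sso, lam=20.0):
    """Compute the SPY/SSO arb indicators and return a trade direction,
    limited to one arb basket (1,000 SPY shares vs. DHR SSO shares)."""
    # Dynamic hedge ratio: SSO shares hedging 1,000 shares of SPY
    dhr = (1000 * lag_spy / 2) / lag_sso
    diff_arb1 = 2 * dhr * sso_bid - 1000 * spy_ask   # SSO rich: sell SSO, buy SPY
    diff_arb2 = 1000 * spy_bid - 2 * dhr * sso_ask   # SPY rich: sell SPY, buy SSO
    if diff_arb1 > lam:
        return ("SELL_SSO_BUY_SPY", dhr)
    if diff_arb2 > lam:
        return ("SELL_SPY_BUY_SSO", dhr)
    return (None, dhr)
```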
How long does it last?

- All times above 1 second are capped at 1 second; the peak is under 5 milliseconds
Histogram of the Quote Size
Statistics Bid/Ask Size
- Data from 9:30 am to 11:30 am


Conclusion
- This arb printed money in the past, but now it's too crowded; it's a speed race…
- Need to execute on the order of ~20 microseconds
- Everyone co-locates their programs on exchange servers, and even adjusts hardware for an edge
- Similar arbs currently work, but you need to really understand the ETF product/rebalance process
Secret-Sauce Steps
- Gather Data
- What do you want to trade? Stocks, bonds, futures, currencies, options, etc. Gathering and cleaning the data can be very time-consuming, but it is the first step
- Look at Data
- Always look at your data using equity lines; understand your data, fix gaps, forward-fill NaNs; other tools include summary stats and PCA. Do you need to transform the data, e.g., to trade equal risk per position? Does it co-move? Can you see periods of high correlation in the tails? Entropy?
- Determine Model Type: MR or MO
- What kind of alpha do you want to extract? Is it trend/momentum (MO) or mean-reversion (MR)? What inputs are going to be used to predict or build signals? Just prices, trades, bid/ask, size, volume, open interest, time-of-day, trade-time, block-orders, transaction order, etc.
- If MR: Define Residuals
- How do you define the "residual"? Is it based on a benchmark: equally weighted, equal-risk weighted, MV, regressions, PCA, others? Is the thesis, more precisely, that the cumulative residuals, or some function of the residuals, are mean-reverting?
- Generate Signal
- Typically, the core model is determined by some measure of "near" versus "far away." Cumulative residuals are a good starting point, but improvements can be made by thinking about how distance is measured/defined. Z-scores, S-scores, and winsorizing help with signal construction and combination.
- Signals into Positions
- Ultimately you need to convert information from signals into actual positions, as only positions are traded. Be mindful of the underlying liquidity. Once again, look at your signals and positions over time; this tells you whether your model is fast or slow in terms of turnover.
- Positions to historical PnL net of costs
- Try to get to a Sharpe ratio as fast as possible. Comparative statistics are easy to compute if your code runs quickly. Examine PnL sensitivities to at most one or two parameters at a time; if you look at too many parameters you will lose intuition about the behavior of your model.
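A minimal sketch of this last step (the 5 bps linear cost per unit of turnover is an assumed figure):

```python
import numpy as np

def pnl_net_of_costs(positions, returns, cost_per_turnover=0.0005):
    """Daily PnL of a position series net of a linear transaction cost;
    positions[t] is held over period t+1's return."""
    pos = np.asarray(positions, dtype=float)
    ret = np.asarray(returns, dtype=float)
    gross = pos[:-1] * ret[1:]               # position times next-period return
    turnover = np.abs(np.diff(pos))          # amount traded each period
    net = gross - cost_per_turnover * turnover
    sharpe = np.sqrt(252) * net.mean() / net.std()
    return net, sharpe
```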
Strategies and Some Tickers you could use
- Macro
- Managed Futures
- FX
- Carry trade: DBV
- Outright: UUP, FXE
- Emerging Markets
- EEM
- VWO
- Numerous country and region specific funds
- Event Driven
- Momentum/Value
- Value
- Mergers
- Convertibles
- Distressed
- Stat arb