1 Spurious regression

When we want to examine the relationship between two non-stationary variables by running a regression model, we have the risk to end up with a non-valid - spuious - regression. Before we understand why a regression model can be spurious, we start with and example using 2 real-world variables.

Install the wbstats package using RStudio. This package was written by the World Bank, and it has functions to download data of all countries around the world. The function wb downloads hundreds of time-series variables that the World Bank tracks for all countries. If you want to know more about this package, you can check its documentation in the cran web site (https://cran.r-project.org/web/packages/wbstats/wbstats.pdf)

We will download the infant mortality rate and the exports for Mexico. It is supposed that these variables have nothing in common, so we would not expect a significant relationship.

library(wbstats)
# Mexico - Infant mortality
infantm<-wb_data(indicator = c("SP.DYN.IMRT.IN"), 
      country="MEX", start_date = 1980, end_date = 2020)
# Mexico - Export value
exports<-wb_data(indicator = c("TX.VAL.MRCH.XD.WD"), 
      country="MEX", start_date = 1980, end_date = 2020)

The wb function brings a data frame with the requested data. We can plot the data to have an idea of these 2 variables:

plot.ts(infantm$SP.DYN.IMRT.IN)

plot.ts(exports$TX.VAL.MRCH.XD.WD)

Now run a regression using these series. Report the result of the regression. Did you find significant relationship? is your result what you expected?

m1 <- lm(exports$TX.VAL.MRCH.XD.WD ~ infantm$SP.DYN.IMRT.IN)
summary(m1)
## 
## Call:
## lm(formula = exports$TX.VAL.MRCH.XD.WD ~ infantm$SP.DYN.IMRT.IN)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.842 -41.039  -4.786  39.072  73.624 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            280.3572    15.5425   18.04  < 2e-16 ***
## infantm$SP.DYN.IMRT.IN  -6.3166     0.5205  -12.13 8.14e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41.48 on 39 degrees of freedom
## Multiple R-squared:  0.7906, Adjusted R-squared:  0.7852 
## F-statistic: 147.3 on 1 and 39 DF,  p-value: 8.139e-15

Research about spurious regression, and explain in which cases you can end up with a spurious regression

Are the series cointegrated? Is the regression spurious or valid? Run the corresponding test.

INTERESTINGLY, ACCORDING TO THIS REGRESSION, THE INFANT MORTALITY INDEX IS SIGNIFICANTLY RELATED TO THE NUMBER OF EXPORTS IN MEXICO. THIS DOES NOT MAKE SENSE SINCE BOTH PHENOMENA ARE NOT RELATED AT ALL. THIS REGRESSION SUGGESTS THAT WHEN THE EXPORTS GO UP, IN MEXICO THE NUMBER OF INFANT DEATHS DECREASES SINGIFICANTLY.

WHEN YOU RUN A REGRESSION AND THE DEPENDENT AND THE INDEPENDENT VARIABLES ARE NON-STATIONARY, MOST OF THE TIME THE RESULT WILL SHOW THAT BOTH ARE STRONGLY AND SIGNIFICANTLY RELATED (NEGATIVE OR POSITIVE). HOWEVER, THIS RELATIONSHIP MIGHT NOT BE TRUE!! IT MIGHT BE SPURIOUS.

THE ONLY CASE THAT THIS TYPE OF REGRESSION IS NOT SPURIOUS OR VALID IS WHEN BOTH VARIABLES ARE COINTEGRATED. ONLY IF THE 2 NON-STATIONARY SERIES ARE COINTEGRATED, THEN THE REGRESSION BETWEEN THEM WILL BE VALID. IF THEY ARE NOT COINTEGRATED, THEN THE REGRESSION WILL BE SPURIOUS.

WHAT IS COINTEGRATION?????? WE RUN THE TEST TO CHECK WHETHER BOTH SERIES ARE COINTEGRATED:

library(tseries)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
adf.test(m1$residuals, k = 0)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  m1$residuals
## Dickey-Fuller = -2.1794, Lag order = 0, p-value = 0.503
## alternative hypothesis: stationary

SINCE THE P-VALUE OF THE TEST IS >0.05, THEN WE CANNOT REJECT THE NULL HYPOTHESES THAT STATES THAT THE ERRORS IS A STATIONARY SERIES. THEN, SINCE THE ERRORS ARE NOT STATIONARY, THEN WE CONCLUDE THAT THESE 2 NON-STATIONARY SERIES ARE NOT COINTEGRATED. THEN, SINCE THEY ARE NOT COINTEGRATED, THE REGRESSION BETWEEN MORTALITY RATE AND EXPORTS IS NOT VALID; IT IS A SPURIOUS REGRESSION

2 Cointegration between Financial series

Using daily data of Mexico IPCyC market index and the S&P 500, examine whether two series are cointegrated. Generate an index for each instrument. To do these indexes, create a variable that represents how 1.00 peso or 1.00 dollar invested in each instrument would be changing over time.

  1. From Jan 1, 2015 to Oct 2, 2017.

  2. From Oct 3, 2017 to Feb 28, 2018

For both cases, run a cointegration test and INTERPRET your results.

  1. From Jan 1, 2015 to Oct 2, 2017:

Downloading the stock prices:

library(quantmod)
getSymbols(Symbols=c("^MXX","^GSPC"), peridiocity = "daily",
                     from="2015-01-01", to="2017-10-02")
## [1] "^MXX"  "^GSPC"
data = merge(MXX,GSPC)
data = na.omit(data)

I create an index of $1.00 invested in each instrument.

I get the first index of both instruments:

firstmxx = as.numeric(MXX$MXX.Adjusted[1])
firstus = as.numeric(GSPC$GSPC.Adjusted[1])

Then, I create the indexes for $1.00:

invmxx = data$MXX.Adjusted / firstmxx
invus = data$GSPC.Adjusted / firstus 

plot(invmxx$MXX.Adjusted)

plot(invus$GSPC.Adjusted)

Now I check whether invmxx and invus are cointegrated.

I run a regression with both non-stationary variables: Mexico investment index and US investment index. In this case I set the Mexico index as the dependent variable, but I can run the regression interchanging the dependent variable.

reg1 = lm(invmxx$MXX.Adjusted ~ invus$GSPC.Adjusted )
summary(reg1)
## 
## Call:
## lm(formula = invmxx$MXX.Adjusted ~ invus$GSPC.Adjusted)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.06656 -0.01835  0.00244  0.01944  0.06810 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          0.35668    0.01383   25.79   <2e-16 ***
## invus$GSPC.Adjusted  0.69852    0.01311   53.29   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02702 on 672 degrees of freedom
## Multiple R-squared:  0.8086, Adjusted R-squared:  0.8083 
## F-statistic:  2840 on 1 and 672 DF,  p-value: < 2.2e-16

Before I do the cointegration test, I will examine the error (residuals) of this regression according to the regression equation:

  • Errors = (Real invmxx - Predicted invxx)

  • Predicted mxx = 0.357 + 0.6978 x (Real invus)

  • Errors = (Real invmxx - 0.357 + 0.6978 x (Real invus))

We can see that Errors is basically a “scaled” distance from the Mexican index to the US index. It is a linear combination of both indexes. Then, if this “scaled” distance between the series stays within a certain range over time, then we can say that both series are cointegrated. The condition is that the errors series must be a STATIONARY series in order to conclude that both NON-STATIONARY series are cointegrated.

Now we run the test for cointegration:

library(tseries)
adf.test(reg1$residuals, k = 0)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  reg1$residuals
## Dickey-Fuller = -3.5752, Lag order = 0, p-value = 0.03508
## alternative hypothesis: stationary

Since p-value<0.05, then we conclude that the residuals is a stationary variable, so both indexes are COINTEGRATED; we can say that in this period, the Financial Mexican Market is COINTEGRATED with the US Financial Market.

Then, we can rely on the regression results. Then, we can say that for each $1 gained in the US market, then in the Mexican market we gain on average $69.85 cents. There is a positive and significance relationship between the US and the Mexican market.

  1. From Oct 3, 2017 to Feb 28, 2018

Downloading the stock prices:

getSymbols(Symbols=c("^MXX","^GSPC"), peridiocity = "daily",
                     from="2017-10-03", to="2018-02-28")
## [1] "^MXX"  "^GSPC"
data = merge(MXX,GSPC)
data = na.omit(data)

I create an index of $1.00 invested in each instrument.

I get the first index of both instruments:

firstmxx = as.numeric(MXX$MXX.Adjusted[1])
firstus = as.numeric(GSPC$GSPC.Adjusted[1])

Then, I create the indexes for $1.00:

invmxx = data$MXX.Adjusted / firstmxx
invus = data$GSPC.Adjusted / firstus 

plot(invmxx$MXX.Adjusted)

plot(invus$GSPC.Adjusted)

Now I check whether invmxx and invus are cointegrated.

I start running the regression with these non-stationary variables:

reg1 = lm(invmxx$MXX.Adjusted ~ invus$GSPC.Adjusted )
summary(reg1)
## 
## Call:
## lm(formula = invmxx$MXX.Adjusted ~ invus$GSPC.Adjusted)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.037454 -0.014008 -0.002567  0.017749  0.040777 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          0.79716    0.05928  13.447  < 2e-16 ***
## invus$GSPC.Adjusted  0.16206    0.05640   2.874  0.00501 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02006 on 95 degrees of freedom
## Multiple R-squared:  0.07997,    Adjusted R-squared:  0.07028 
## F-statistic: 8.257 on 1 and 95 DF,  p-value: 0.005008
library(tseries)
adf.test(reg1$residuals, k = 0)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  reg1$residuals
## Dickey-Fuller = -2.0418, Lag order = 0, p-value = 0.5592
## alternative hypothesis: stationary

Since the p-value of the test is >0.05, then we cannot reject the null hypothesis that states that the residuals are stationary. Since the residuals is a non-stationary variable, then for this period, the Mexican market index is NOT COINTEGRATED with the US market.

3 Holding return of a portfolio of 2 stocks

Once we have an idea about cointegration, we will review how to form a portfolio of 2 assets and calculate its holding return over time. The holding period return of an asset from day 1 to day N (HPR) can be calculated with any of the following ways:

  1. HPR = (Price of the stock at day N / Price of the stock at day 1) - 1
  2. HPR = exp (Sum of all continuous compounded returns from day 1 to day N) - 1

The expected return of a portfolio of 2 assets is estimated as a weighted average of the expected stock returns:

\[E[R_p]=w_1 ∗E[R_1]+w_2 ∗E(R_2)\]

The holding return of a portfolio of 2 assets for a specific period of time is estimated as a weighted average of the holding returns of the assets: \[HPR_p =w_1 ∗HPR_1 +w_2 ∗HPR_2\]

Download daily prices from CEMEX and ALFA from Jan 1, 2015 to Dec 31, 2017.

getSymbols(c("ALFAA.MX", "CEMEXCPO.MX"), from = "2015-01-01", to = "2017-12-31", source = "yahoo")
## [1] "ALFAA.MX"    "CEMEXCPO.MX"

Calculate their holding period returns of both stocks

HPR_cemex <- (Ad(CEMEXCPO.MX) / as.numeric(CEMEXCPO.MX[1,6])) - 1
plot(HPR_cemex)  

HPR_cemex2 <- (Ad(CEMEXCPO.MX) / as.numeric(CEMEXCPO.MX$CEMEXCPO.MX.Adjusted[1])) - 1

HPR_alfa <- (Ad(ALFAA.MX) / as.numeric(ALFAA.MX[1,6])) - 1
plot(HPR_alfa)

Create a portfolio 1 assigning 30% to CEMEX and 70% for ALFA and calculate the HPR of the portfolio:

HPR_port1 <- 0.7 * HPR_alfa + 0.3 * HPR_cemex
plot(HPR_port1)

Create a portfolio 2 assigning -100% to CEMEX and +200% to ALFA and calculate the HPR of this portfolio:

HPR_port2 <- 2 * HPR_alfa - 1 * HPR_cemex

What does a negative sign mean a portfolio? Briefly explain with the previous example.

The negative weight for CEMEX means that I have a short position in CEMEX and the positive weight means that I have a long position in Alfa. A short position means that I borrow money from this stock and increase the amount of my investment. In this case, I am borrowing money equal to my original investment by shorting CEMEX. Then, I will have 200% of money, and invest this in Alfa.

Then, I hope that the CEMEX price go down in order to make a profit from the shorting strategy, and I hope that Alfa price go up in order to make a profit from the long position of Alfa. Let’s see how much money I would be mad with this portfolio:

plot(HPR_port2)

This is a not good result for a portfolio since the final HPR is very negative!

4 CHALLENGE: Statistical arbitrage

Using the CEMEX and the ALFAA daily price series from Jan 1, 2015 to Dec 31, 2017, examine whether these two series are cointegrated.

stocks <- cbind(Ad(ALFAA.MX), Ad(CEMEXCPO.MX))
colnames(stocks) <- c("alfa", "cemex")
reg1 <- lm(alfa ~ cemex, data = stocks)
summary(reg1)
## 
## Call:
## lm(formula = alfa ~ cemex, data = stocks)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9180 -1.0153  0.5667  1.8439  5.4945 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 42.80633    0.51679   82.83   <2e-16 ***
## cemex       -1.04424    0.03811  -27.40   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.777 on 752 degrees of freedom
## Multiple R-squared:  0.4996, Adjusted R-squared:  0.4989 
## F-statistic: 750.8 on 1 and 752 DF,  p-value: < 2.2e-16

The beta1 is negative and significant. This result is valid ONLY if the prices are COINTEGRATED. Let’s check for cointegration:

# I check whether the errors are stationary:
plot(reg1$residuals, type = "l")

adf.test(reg1$residuals, k=0)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  reg1$residuals
## Dickey-Fuller = -3.8726, Lag order = 0, p-value = 0.01537
## alternative hypothesis: stationary

Since the p-value of the test<0.05, then the errors are STATIONARY, so the stock prices are COINTEGRATED. Then, the previous regression is VALID.

Assume you are in December 31, 2017. If the series are cointegrated, that means that the residual of the regression between these series is a stationary series. Then, if this is the case, what can you do to take advantage in financial trading? If you were to invest from Jan 1, 2018 to Feb 28, 2018 in a portfolio of these stocks, which weights would you assign? Calculate the HPR of your portfolio from Jan 1st to Feb 28, 2018.

Let’s do an analysis to decide which portfolio I can construct to take advantage of the cointegration between the prices:

We have to be aware that both prices move in an opposite direction most of the time since the beta 1 coefficient of the regression is negative. The errors might increase in the following days. The error is equal to the real price of ALFA - predicted price of ALFA:

p_alfa = b0 + b1*p_cemex + er1

Predicted_price_alfa = E[p_alfa] = b0 + b1*p_cemex

p_alfa = Predicted_price_alfa + er1

er1 = p_alfa - Predicted_price_alfa

er1 = real price of Alfa - predicted Alfa price

er1 = p_alfa - (42.8063265 + -1.0442364 * pcemex)

We see that at the end of the series the errors are negative:

tail(reg1$residuals)
## 2017-12-21 2017-12-22 2017-12-26 2017-12-27 2017-12-28 2017-12-29 
##  -6.461918  -6.807727  -6.399748  -6.389845  -6.598647  -6.431657

When the error is negative the predicted price is higher than the real price. The er1 was very negative at the end of 2017, so it is very likely that this error will go to zero or positive for the first days of 2018. For this to happen, the predicted price should decrease or the Alfa price should go up. The Alfa predicted price goes down if Cemex price goes up since the beta is negative. Looking at the performance of both stocks, it is more likely that CEMEX will not go up, so it is likely that Alfa will go up.

The predicted Alfa price can decrease only if the price of Cemex increases since b1 is negative. Since the beta of the regression is negative, then it is expected that the price of Alfa will go up. Then, I could design a portfolio shorting Cemex 100%, and going long for Alfa 200% for 2018:

I will short Cemex by -1, and assign +2 to Alfa and see how much $ I would have made for the first 2 months of 2018:

getSymbols(c("ALFAA.MX", "CEMEXCPO.MX"), from = "2018-01-01", to = "2018-12-31", source = "yahoo")
## [1] "ALFAA.MX"    "CEMEXCPO.MX"
HPR_cemex <- (Ad(CEMEXCPO.MX) / as.numeric(CEMEXCPO.MX[1,6])) - 1
plot(HPR_cemex)  

HPR_alfa <- (Ad(ALFAA.MX) / as.numeric(ALFAA.MX[1,6])) - 1
plot(HPR_alfa)

HPR_port1 <- 2 * HPR_alfa - 1 * HPR_cemex
plot(subset(HPR_port1, index(HPR_port1) <= "2018-03-01"))

Feb 28 is the day 41, so:

print(paste("The HPR of this strategy at the end of Feb 2018 would have been:", HPR_port1[41,]))
## [1] "The HPR of this strategy at the end of Feb 2018 would have been: 0.210380643707167"

I would have made 21.03% in only 2 months! Although this sounds a very good strategy, this is a VERY RISKY strategy in the real world. You can also lose a lot of money with this strategy, so you have to be careful. It is always recommended to do this strategy using stocks that are cointegrated due to the relationship between them. Now, if I had hold this portfolio for the whole year 2018, I would have got the following HPR:

plot(HPR_port1)

print(paste("The HPR of this strategy at the end of 2018 would have been:", last(HPR_port1)))
## [1] "The HPR of this strategy at the end of 2018 would have been: 0.458814383466793"

I would have made almost 50% of return after 1 year!! Remember, that this is a VERY RISKY strategy! you can also lose a lot of money!

This is called statistical arbitrage, and some technical analysts use this technique to find arbitrage opportunities (be careful, this is not a safe strategy to follow).