1 Spurious regression

When we examine the relationship between two non-stationary variables with a regression model, we risk ending up with an invalid (spurious) regression. Before we look at why a regression model can be spurious, we start with an example using 2 real-world variables.

Install the wbstats package using RStudio. This package gives access to the data published by the World Bank, and it has functions to download data for all countries around the world. The function wb_data downloads any of the hundreds of time-series variables (indicators) that the World Bank tracks for all countries. If you want to know more about this package, you can check its documentation on the CRAN website (https://cran.r-project.org/web/packages/wbstats/wbstats.pdf).

We will download the infant mortality rate and the exports for Mexico. These variables are supposed to have nothing in common, so we would not expect a significant relationship between them.

I load the package and download both series:

# Set the working directory
setwd("~/Desktop/Financial Econometrics II ")
library(wbstats)
# Mexico - Infant mortality
infantm<-wb_data(indicator = c("SP.DYN.IMRT.IN"), 
      country="MEX", start_date = 1980, end_date = 2020)
# Mexico - Export value
exports<-wb_data(indicator = c("TX.VAL.MRCH.XD.WD"), 
      country="MEX", start_date = 1980, end_date = 2020)

The wb_data function returns a data frame with the requested data. We can plot the data to get an idea of how these 2 variables behave:

plot.ts(infantm$SP.DYN.IMRT.IN)

plot.ts(exports$TX.VAL.MRCH.XD.WD)

Now run a regression model using these series. Report the result of the regression.

m1 <- lm(exports$TX.VAL.MRCH.XD.WD ~ infantm$SP.DYN.IMRT.IN)

summary(m1)
## 
## Call:
## lm(formula = exports$TX.VAL.MRCH.XD.WD ~ infantm$SP.DYN.IMRT.IN)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.331 -39.723  -7.456  37.848  77.194 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            275.1783    16.0408   17.16  < 2e-16 ***
## infantm$SP.DYN.IMRT.IN  -6.1847     0.5327  -11.61  4.6e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41.65 on 38 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.7801, Adjusted R-squared:  0.7743 
## F-statistic: 134.8 on 1 and 38 DF,  p-value: 4.602e-14

Did you find a significant relationship? Is your result what you expected?

According to this model, infant mortality is negatively and SIGNIFICANTLY related to exports in Mexico. The absolute t-value of beta 1 (11.61) is much higher than 2, indicating that the relationship is statistically VERY significant. Taken at face value, the model says that the lower the infant mortality, the higher the exports in Mexico.

Nevertheless, there is no logical explanation that connects infant mortality with the exports of Mexico. This is a weird finding, since it is hard to imagine a relationship between the two or an effect of infant mortality on exports.

  • The rule of thumb: before you conclude that one variable is related to another, if they are time-series variables and you run a regression, you have to make sure that the variables are cointegrated. If they are not cointegrated, then the regression is not valid, or SPURIOUS.

Research about spurious regression, and explain in which cases you can end up with a spurious regression. In statistics, a spurious relationship or spurious correlation is a mathematical relationship in which two or more events or variables are associated but not causally related, due to either coincidence or the presence of a certain third, unseen factor (referred to as a “common response variable”, “confounding factor”, or “lurking variable”).

TEACHER'S NOTES: When you run a regression between 2 non-stationary variables, it is very likely that the results will not be reliable (SPURIOUS REGRESSION). It will be a valid regression ONLY IF the 2 non-stationary variables are cointegrated.
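A small simulation can make this note concrete. The sketch below is not part of the original exercise: it assumes two independent random walks (so there is no true relationship by construction) and counts how often the slope has an absolute t-value above 2, first in levels and then in first differences. The function name `spurious_t`, the sample size of 100 and the 500 replications are illustrative choices.

```r
# Illustrative simulation: regress one random walk on another, independent one.
set.seed(123)

spurious_t <- function(n = 100) {
  x <- cumsum(rnorm(n))  # non-stationary random walk
  y <- cumsum(rnorm(n))  # second random walk, independent of x
  # t-value of the slope when regressing the levels (non-stationary series)
  t_levels <- summary(lm(y ~ x))$coefficients[2, "t value"]
  # t-value of the slope when regressing the first differences (stationary series)
  t_diffs <- summary(lm(diff(y) ~ diff(x)))$coefficients[2, "t value"]
  c(levels = t_levels, diffs = t_diffs)
}

# 2 x 500 matrix: one column of t-values per simulated pair of series
tvals <- replicate(500, spurious_t())

# Share of simulations in which the slope looks "significant" (|t| > 2)
reject_rate <- rowMeans(abs(tvals) > 2)
print(reject_rate)
```

If the note holds, the share for the regression in levels should be far above the nominal 5%, while the share for the differenced (stationary) series should stay close to it.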

2 non-stationary variables are considered cointegrated only if the residuals of the regression between them are a stationary variable.

To better understand this concept we have to remember the “residuals” concept.

If X and Y are non-stationary variables, the regression model is

$$y_t = b_0 + b_1 x_t + \varepsilon_t$$

where $b_1$ is the slope and $\varepsilon_t$ is the error (residual). The expected value of $y_t$ is the regression line:

$$E[y_t] = b_0 + b_1 x_t$$

The residual (error) is then the difference between the real value of $y_t$ and its expected value:

$$\varepsilon_t = y_t - E[y_t] = y_t - (b_0 + b_1 x_t)$$

In an OLS regression with an intercept, the residuals always sum to 0; the best line to represent all the dots is the one that minimizes the sum of the squared residuals.

The errors can be seen as a scaled distance between Y and X, where the scaling comes from the betas of the regression. If the error series is STATIONARY, then we can consider that Y is COINTEGRATED WITH X.
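The residual concept and the stationarity check can be sketched with simulated data. This is an illustration under stated assumptions, not the procedure used below with `adf.test`: `x` is a simulated random walk, `y` is constructed to share its trend, and the last step is a simplified Dickey-Fuller regression without constant, trend or extra lags.

```r
# Illustrative data: y is built to be cointegrated with x (an assumption of this sketch).
set.seed(42)
n <- 300
x <- cumsum(rnorm(n))          # non-stationary random walk
y <- 2 + 0.5 * x + rnorm(n)    # y follows x plus a stationary error

fit <- lm(y ~ x)

# Residual = actual value of y minus its value on the regression line
e <- y - fitted(fit)

# With an intercept in the model, the OLS residuals sum to (numerically) zero
resid_sum <- sum(e)

# Simplified Dickey-Fuller regression on the residuals:
#   diff(e)_t = rho * e_(t-1) + u_t
# A very negative t-value for rho points toward stationary residuals.
de <- diff(e)
e_lag <- e[-n]
df_fit <- lm(de ~ 0 + e_lag)
df_stat <- summary(df_fit)$coefficients["e_lag", "t value"]

print(c(resid_sum = resid_sum, df_stat = df_stat))
```

Because the residuals come from an estimated regression, the usual Dickey-Fuller critical values are only a rough guide here; the sketch is meant to show the mechanics, not to replace a formal test.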

Are the series cointegrated? Is the regression spurious or valid? Run the corresponding test

library(tseries)
adf.test(m1$residuals, k=0)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  m1$residuals
## Dickey-Fuller = -1.5881, Lag order = 0, p-value = 0.7351
## alternative hypothesis: stationary

Since the p-value > 0.05, we cannot REJECT the null hypothesis, which states that the residual series is NON-STATIONARY. In other words, we have no evidence that the errors (residuals) are stationary, so we treat them as a NON-STATIONARY variable; with this test I can now say that the two series ARE NOT COINTEGRATED.

Since the two series are NOT COINTEGRATED, the result of the regression between them is SPURIOUS (NOT RELIABLE).

TEACHER'S NOTES: We can usually transform a non-stationary variable into a stationary variable: most of the time, with the first difference of the log of the variable (the continuously compounded percentage change), the series becomes stationary (the seasonal difference can also work).
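The transformation mentioned in this note can be sketched as follows. The price series is simulated (an assumption of this example, not data from the exercise); the point is only to show that the first difference of the log equals the log of the ratio of consecutive values.

```r
# Hypothetical, simulated price series that trends over time (non-stationary)
set.seed(7)
n <- 250
price <- 100 * exp(cumsum(rnorm(n, mean = 0.0005, sd = 0.01)))

# First difference of the log: continuously compounded return
r <- diff(log(price))

# Same quantity written as the log of the price ratio P_t / P_(t-1)
r_alt <- log(price[-1] / price[-n])

# Both versions agree; note that one observation is lost by differencing
print(all.equal(r, r_alt))
print(length(r))
```

The differenced series `r` is the kind of variable that can normally be used in a regression without a cointegration test.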

If X and Y are STATIONARY variables, then we DO NOT NEED TO TEST FOR COINTEGRATION: the regression between 2 stationary variables is always free of the spurious-regression problem.

2 Cointegration between Financial series

Using daily data of the Mexican IPyC market index and the S&P 500, examine whether the two series are cointegrated. Generate an index for each instrument. To build these indexes, create a variable that represents how 1.00 peso or 1.00 dollar invested in each instrument would change over time. Do this for two periods:

From Jan 1, 2015 to Oct 2, 2017.

From Oct 3, 2017 to Feb 28, 2018

2.1 Cointegration From Jan 1, 2015 to Oct 2, 2017

I load the packages and then download the data from Jan 1, 2015 to Oct 2, 2017.

library(quantmod)
library(tseries)
getSymbols(c("^MXX", "^GSPC"), periodicity = "daily", from = "2015-01-01", to = "2017-10-02")
## [1] "^MXX"  "^GSPC"
# Keep only the days on which both markets traded (holidays differ between the two countries)
data <- na.omit(merge(MXX, GSPC))

Then I get the first value of both instruments:

firstmxx = as.numeric(MXX$MXX.Adjusted[1])
firstusa = as.numeric(GSPC$GSPC.Adjusted[1])

Then I create the indexes, which show how $1.00 invested on the first day changes over time:

invmxx <- data$MXX.Adjusted / firstmxx
invusa <- data$GSPC.Adjusted / firstusa
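The index construction can be checked by hand with a few made-up prices. The numbers below are hypothetical and only illustrate the arithmetic of dividing every price by the first one.

```r
# Hypothetical adjusted closing prices (not real market data)
prices <- c(50, 51, 49.5, 52.5)

# Value over time of 1.00 invested at the first price
index <- prices / prices[1]
print(index)
```

The index starts at 1.00 by construction, and a value of 1.05 on the last day means that 1.00 invested on the first day is worth 1.05 at the end.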

To see how $1.00 invested in each index behaves over time, I plot both indexes:

plot(invmxx)

plot(invusa)

Now with these two variables I check whether invmxx and invusa are cointegrated:

m1 <- lm(invmxx$MXX.Adjusted ~ invusa$GSPC.Adjusted)
summary(m1)
## 
## Call:
## lm(formula = invmxx$MXX.Adjusted ~ invusa$GSPC.Adjusted)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.06656 -0.01835  0.00244  0.01944  0.06810 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           0.35668    0.01383   25.79   <2e-16 ***
## invusa$GSPC.Adjusted  0.69852    0.01311   53.29   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02702 on 672 degrees of freedom
## Multiple R-squared:  0.8086, Adjusted R-squared:  0.8083 
## F-statistic:  2840 on 1 and 672 DF,  p-value: < 2.2e-16

Before interpreting this model, I test the residuals to check whether they behave like a stationary variable:

adf.test(m1$residuals, k=0)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  m1$residuals
## Dickey-Fuller = -3.5752, Lag order = 0, p-value = 0.03508
## alternative hypothesis: stationary

Since the p-value < 0.05, I can conclude that the residuals are a stationary variable, so both indexes are COINTEGRATED; in this period of time the Mexican financial market is COINTEGRATED with the USA financial market.

Then I can rely on the coefficients of my model, so I can conclude that for each 1.00 gained by the USA market index, the Mexican market index gains on average about 0.70 (69.852 cents), because there is a positive and significant relationship between the USA and the Mexican markets.

2.2 Cointegration From Oct 3, 2017 to Feb 28, 2018

I download the data

getSymbols(c("^MXX", "^GSPC"), periodicity = "daily", from = "2017-10-03", to = "2018-02-28")
## [1] "^MXX"  "^GSPC"
data_1 = na.omit(merge (MXX, GSPC))

Then I construct the indexes for this period:

firstmxx_1 = as.numeric(MXX$MXX.Adjusted[1])
firstusa_1 = as.numeric(GSPC$GSPC.Adjusted[1])
# Dividing each price by the initial price gives the value over time of 1.00 invested on the first day
invmxx_1 <- data_1$MXX.Adjusted / firstmxx_1
invusa_1 <- data_1$GSPC.Adjusted / firstusa_1

To see how $1.00 invested in each index behaves over time, I plot both indexes:

plot(invmxx_1)

plot(invusa_1)

Now with these two variables I check whether invmxx_1 and invusa_1 are cointegrated:

m2 <- lm(invmxx_1$MXX.Adjusted ~ invusa_1$GSPC.Adjusted)
summary(m2)
## 
## Call:
## lm(formula = invmxx_1$MXX.Adjusted ~ invusa_1$GSPC.Adjusted)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.037454 -0.014008 -0.002567  0.017749  0.040777 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             0.79716    0.05928  13.447  < 2e-16 ***
## invusa_1$GSPC.Adjusted  0.16206    0.05640   2.874  0.00501 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02006 on 95 degrees of freedom
## Multiple R-squared:  0.07997,    Adjusted R-squared:  0.07028 
## F-statistic: 8.257 on 1 and 95 DF,  p-value: 0.005008

Before interpreting this model, I test the residuals to check whether they behave like a stationary variable:

adf.test(m2$residuals, k=0)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  m2$residuals
## Dickey-Fuller = -2.0418, Lag order = 0, p-value = 0.5592
## alternative hypothesis: stationary

Since the p-value > 0.05, we cannot REJECT the null hypothesis, which states that the residual series is NON-STATIONARY. In other words, we have no evidence that the errors (residuals) are stationary, so we treat them as a NON-STATIONARY variable; with this test I can now say that the two series ARE NOT COINTEGRATED.

Since the two series are NOT COINTEGRATED, the result of the regression between them is SPURIOUS (NOT RELIABLE).

3 Holding return of a portfolio of 2 stocks

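Download daily prices from CEMEX and ALFA from Jan 1, 2015 to Dec 31, 2017. Note that the call below uses to = "2017-12-01", so the results reported in this section are based on data up to Dec 1, 2017.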

getSymbols(c("CEMEXCPO.MX", "ALFAA.MX"), periodicity = "daily", from = "2015-01-01", to = "2017-12-01")
## [1] "CEMEXCPO.MX" "ALFAA.MX"

First I calculate the daily continuously compounded returns:

CEMEXCPO.MX$stockreturn <- log(CEMEXCPO.MX$CEMEXCPO.MX.Adjusted / (lag(CEMEXCPO.MX$CEMEXCPO.MX.Adjusted,1)))
ALFAA.MX$stockreturn <- log(ALFAA.MX$ALFAA.MX.Adjusted / (lag(ALFAA.MX$ALFAA.MX.Adjusted,1)))

I calculate the holding period return (HPR) of both stocks:

# Calculation for the CEMEX stock 
n <- (nrow(CEMEXCPO.MX))
price0 <- as.numeric(CEMEXCPO.MX$CEMEXCPO.MX.Adjusted[1])
pricen <- as.numeric(CEMEXCPO.MX$CEMEXCPO.MX.Adjusted[n])
HPRCEMEX <- ((pricen /price0)-1)*100
print(paste("HPR of CEMEX = ", HPRCEMEX))
## [1] "HPR of CEMEX =  9.26812986347962"
# Calculation for the ALFAA stock 
n_1 <- (nrow(ALFAA.MX))
price0_1 <- as.numeric(ALFAA.MX$ALFAA.MX.Adjusted[1])
pricen_1 <- as.numeric(ALFAA.MX$ALFAA.MX.Adjusted[n_1])
HPRALFAA <- ((pricen_1 /price0_1)-1 )*100
print(paste("HPR of ALFAA = ", HPRALFAA))
## [1] "HPR of ALFAA =  -34.9480177258728"
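The holding period return can also be recovered from the continuously compounded returns, because the daily log returns add up to the log of the total price ratio. The sketch below uses hypothetical prices to show that both routes give the same number.

```r
# Hypothetical prices (not the CEMEX or ALFA series)
prices <- c(20, 21, 19, 22)

# Daily continuously compounded returns
r <- diff(log(prices))

# HPR (in %) computed directly from the first and last price
hpr_prices <- (prices[length(prices)] / prices[1] - 1) * 100

# HPR (in %) computed from the sum of the log returns
hpr_returns <- (exp(sum(r)) - 1) * 100

print(c(hpr_prices = hpr_prices, hpr_returns = hpr_returns))
```

Both values equal 10% here, since the price goes from 20 to 22.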

Now I create portfolio 1, assigning 30% to CEMEX and 70% to ALFA, and calculate the HPR of the portfolio:

p1 <- HPRCEMEX*0.3 + HPRALFAA*0.7 
print(paste("HPR of my portfolio 1 ", p1))
## [1] "HPR of my portfolio 1  -21.6831734490671"

Create a portfolio 2 assigning -100% to CEMEX and +200% to ALFA and calculate the HPR of this portfolio

p2 <- HPRCEMEX*-1 + HPRALFAA*2
print(paste("HPR of my portfolio 2 ", p2))
## [1] "HPR of my portfolio 2  -79.1641653152253"

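What does a negative sign mean in a portfolio? Briefly explain with the previous example. The negative sign means that I am selling one stock short in order to get leverage and buy more of the other stock. In portfolio 2, I short CEMEX for an amount equal to 100% of my capital and use the proceeds, together with my own capital, to buy ALFA for 200% of my capital.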
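The two portfolio calculations above follow the same weighted-sum rule, which can be written as one small helper. The function name `portfolio_hpr` is hypothetical (it is not part of any package), and the HPR values are the ones printed earlier in this section.

```r
# Hypothetical helper: HPR of a buy-and-hold portfolio from initial weights and stock HPRs
portfolio_hpr <- function(weights, hprs) {
  # Weights may be negative (short positions) but must add up to 100% of the capital
  stopifnot(length(weights) == length(hprs),
            isTRUE(all.equal(sum(weights), 1)))
  sum(weights * hprs)
}

# HPRs (in %) reported above for the 2015-2017 period
hprs <- c(CEMEX = 9.26812986347962, ALFA = -34.9480177258728)

p1 <- portfolio_hpr(c(0.3, 0.7), hprs)  # 30% CEMEX, 70% ALFA
p2 <- portfolio_hpr(c(-1, 2), hprs)     # short 100% CEMEX, long 200% ALFA

print(c(p1 = p1, p2 = p2))
```

In portfolio 2, the short position loses money because CEMEX went up, and the leveraged long position doubles the loss on ALFA, which is why the result is so much worse than in portfolio 1.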

4 CHALLENGE: Statistical arbitrage

Using the CEMEX and the ALFAA daily price series from Jan 1, 2015 to Dec 31, 2017, examine whether these two series are cointegrated. Assume you are on December 31, 2017. If the series are cointegrated, that means that the residual of the regression between these series is a stationary series.

dataset<- merge(ALFAA.MX$ALFAA.MX.Adjusted, CEMEXCPO.MX$CEMEXCPO.MX.Adjusted)
reg1 <- lm( dataset$ALFAA.MX.Adjusted  ~ dataset$CEMEXCPO.MX.Adjusted)
summary(reg1)
## 
## Call:
## lm(formula = dataset$ALFAA.MX.Adjusted ~ dataset$CEMEXCPO.MX.Adjusted)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.675 -1.102  0.458  1.692  5.347 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  42.87757    0.47587   90.10   <2e-16 ***
## dataset$CEMEXCPO.MX.Adjusted -1.03010    0.03512  -29.33   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.556 on 733 degrees of freedom
## Multiple R-squared:   0.54,  Adjusted R-squared:  0.5393 
## F-statistic: 860.4 on 1 and 733 DF,  p-value: < 2.2e-16

Before interpreting this model, I test the residuals to check whether they behave like a stationary variable:

adf.test(reg1$residuals, k=0)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  reg1$residuals
## Dickey-Fuller = -3.8521, Lag order = 0, p-value = 0.01639
## alternative hypothesis: stationary

Then, if this is the case, what can you do to take advantage in financial trading? Since the p-value < 0.05, the residuals behave like a stationary series, so the two stocks are cointegrated. In my opinion, if we know that these stocks are cointegrated, then we can assume that the residual of the model tends to return to its mean of zero. If the residual becomes unusually high, the dependent variable is high relative to the value predicted from the independent variable, so we can expect it to fall back toward the regression line. If the residual becomes unusually low, we can expect the dependent variable to rise back toward the regression line.
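One common way to turn this idea into a trading rule is to standardize the residual and act only when it is far from zero. The sketch below is an assumption-laden illustration, not the method of this exercise: the two series are simulated, the threshold of 2 standard deviations is an arbitrary choice, and the z-score uses the whole sample, which a real strategy could not do in real time.

```r
# Simulated cointegrated pair (hypothetical data, not CEMEX or ALFA)
set.seed(11)
n <- 500
x <- 20 + cumsum(rnorm(n, sd = 0.2))                              # random walk
spread_true <- as.numeric(arima.sim(model = list(ar = 0.9), n = n, sd = 0.5))
y <- 5 + 1.2 * x + spread_true                                    # y tracks x plus a stationary spread

# Step 1: estimate the long-run relationship
fit <- lm(y ~ x)

# Step 2: standardize the residual (mean 0, standard deviation 1)
z <- as.numeric(scale(residuals(fit)))

# Step 3: trading signal based on an assumed threshold of 2
#   -1: residual unusually high -> bet that y falls back relative to x
#   +1: residual unusually low  -> bet that y rises back relative to x
#    0: no position
signal <- ifelse(z > 2, -1, ifelse(z < -2, 1, 0))

print(table(signal))
```

The signal is zero most of the time by design; positions are opened only when the residual is far from its mean, which is where mean reversion is expected to pay off.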

rm(list = ls())
getSymbols("CEMEXCPO.MX", periodicity = "daily", from = "2018-01-01", to = "2018-02-28")
## [1] "CEMEXCPO.MX"
getSymbols("ALFAA.MX", periodicity = "daily", from = "2018-01-01", to = "2018-02-28")
## [1] "ALFAA.MX"

Calculate the HPR of your portfolio from Jan 1st to Feb 28, 2018.

# Calculation for the CEMEX stock 
n <- (nrow(CEMEXCPO.MX))
price0 <- as.numeric(CEMEXCPO.MX$CEMEXCPO.MX.Adjusted[1])
pricen <- as.numeric(CEMEXCPO.MX$CEMEXCPO.MX.Adjusted[n])
HPRCEMEX <- ((pricen /price0)-1)*100
print(paste("HPR of CEMEX = ", HPRCEMEX))
## [1] "HPR of CEMEX =  -15.7929656158527"
# Calculation for the ALFAA stock 
n_1 <- (nrow(ALFAA.MX))
price0_1 <- as.numeric(ALFAA.MX$ALFAA.MX.Adjusted[1])
pricen_1 <- as.numeric(ALFAA.MX$ALFAA.MX.Adjusted[n_1])
HPRALFAA <- ((pricen_1 /price0_1)-1 )*100
print(paste("HPR of ALFAA = ", HPRALFAA))
## [1] "HPR of ALFAA =  1.18954999293077"

If you were to invest from Jan 1, 2018 to Feb 28, 2018 in a portfolio of these stocks, which weights would you assign? I would assign a weight of +200% to ALFAA and a weight of -100% to CEMEX.

PEduardo <- HPRALFAA*2+ HPRCEMEX*-1
print(paste("HPR of my portfolio = ", PEduardo))
## [1] "HPR of my portfolio =  18.1720656017143"