In this workshop we introduce the concept of cointegration of time-series variables.
When we want to examine the relationship between two non-stationary variables by running a regression model, we have the risk to end up with a non-valid - spuious - regression. Before we understand why a regression model can be spurious, we start with and example using 2 real-world variables.
Install the wbstats package using RStudio. This package was written by the World Bank, and it has functions to download data of all countries around the world. The function wb downloads hundreds of time-series variables that the World Bank tracks for all countries. If you want to know more about this package, you can check its documentation in the cran web site (https://cran.r-project.org/web/packages/wbstats/wbstats.pdf)
We will download the infant mortality rate and the exports for Mexico. It is supposed that these variables have nothing in common, so we would not expect a significant relationship.
library(wbstats)
# Mexico - Infant mortality
infantm<-wb_data(indicator = c("SP.DYN.IMRT.IN"),
country="MEX", start_date = 1980, end_date = 2020)
# Mexico - Export value
exports<-wb_data(indicator = c("TX.VAL.MRCH.XD.WD"),
country="MEX", start_date = 1980, end_date = 2020)
The wb function brings a data frame with the requested data. We can plot the data to have an idea of these 2 variables:
plot.ts(infantm$SP.DYN.IMRT.IN)
plot.ts(exports$TX.VAL.MRCH.XD.WD)
Now run a regression using these series. Report the result of the regression. Did you find significant relationship? is your result what you expected?
m1 <- lm(exports$TX.VAL.MRCH.XD.WD ~ infantm$SP.DYN.IMRT.IN)
summary(m1)
Call:
lm(formula = exports$TX.VAL.MRCH.XD.WD ~ infantm$SP.DYN.IMRT.IN)
Residuals:
Min 1Q Median 3Q Max
-49.842 -41.039 -4.786 39.072 73.624
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 280.3572 15.5425 18.04 < 2e-16 ***
infantm$SP.DYN.IMRT.IN -6.3166 0.5205 -12.13 8.14e-15 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 41.48 on 39 degrees of freedom
Multiple R-squared: 0.7906, Adjusted R-squared: 0.7852
F-statistic: 147.3 on 1 and 39 DF, p-value: 8.139e-15
hist(infantm$SP.DYN.IMRT.IN)
hist(exports$TX.VAL.MRCH.XD.WD)
cor(infantm$SP.DYN.IMRT.IN, exports$TX.VAL.MRCH.XD.WD)
[1] -0.889162
cor.test(infantm$SP.DYN.IMRT.IN, exports$TX.VAL.MRCH.XD.WD)
Pearson's product-moment correlation
data: infantm$SP.DYN.IMRT.IN and exports$TX.VAL.MRCH.XD.WD
t = -12.135, df = 39, p-value = 8.139e-15
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.9397443 -0.8004851
sample estimates:
cor
-0.889162
scatter.smooth(infantm$SP.DYN.IMRT.IN, exports$TX.VAL.MRCH.XD.WD)
AFTER A QUICK TEST WE CAN SEE THAT THERE IS A STRONG CORRELATION BETWEEN THE TWO VARIABLES. ACCORDING TO THE RESULTS FROM THE MODEL, THERE IS A NEGATIVELY AND SIGNIFCANT RELATIONSHIP, AND I THINK THIS IS ACCURATE SINCE THE HIGHER THE EXPORTS THE LOWER THE IMPACT ON CHILD MORTALITY, ALTHOUGH IMPORTS MIGHT BE OTHER STORY.
m1 <- lm(exports$TX.VAL.MRCH.XD.WD ~ infantm$SP.DYN.IMRT.IN)
summary(m1)
Call:
lm(formula = exports$TX.VAL.MRCH.XD.WD ~ infantm$SP.DYN.IMRT.IN)
Residuals:
Min 1Q Median 3Q Max
-49.842 -41.039 -4.786 39.072 73.624
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 280.3572 15.5425 18.04 < 2e-16 ***
infantm$SP.DYN.IMRT.IN -6.3166 0.5205 -12.13 8.14e-15 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 41.48 on 39 degrees of freedom
Multiple R-squared: 0.7906, Adjusted R-squared: 0.7852
F-statistic: 147.3 on 1 and 39 DF, p-value: 8.139e-15
Research about spurious regression, and explain in which cases you can end up with a spurious regression
RESEARCH BROUGHT TO ME THE REALIZATION THAT I FOUND A WORD FOR SOMETHING I’VE SOMETIMES ENCOUNTERED. A SPURIOUS RELATIONSHIP IS A MATHEMATICAL RELATION IN WHICH TWO OR MORE VARIABLES ARE ASSOCIATED BUT NOT REALLY RELATED, AND THIS CAN BE DUE TO COINICIDENCE OR AN UNSEEN FACTOR.
Are the series cointegrated? Is the regression spurious or valid? Run the corresponding test
library(tseries)
adf.test(m1$residuals, k = 0)
Augmented Dickey-Fuller Test
data: m1$residuals
Dickey-Fuller = -2.1794, Lag order = 0, p-value = 0.503
alternative hypothesis: stationary
Using daily data of Mexico IPCyC market index and the S&P 500, examine whether two series are cointegrated. Generate an index for each instrument. To do these indexes, create a variable that represents how 1.00 peso or 1.00 dollar invested in each instrument would be changing over time.
From Jan 1, 2015 to Oct 2, 2017.
From Oct 3, 2017 to Feb 28, 2018
For both cases, run a cointegration test and INTERPRET your results.
library(quantmod)
Loading required package: xts
Loading required package: zoo
Attaching package: ‘zoo’
The following objects are masked from ‘package:base’:
as.Date, as.Date.numeric
Loading required package: TTR
library(tseries)
getSymbols(Symbols<-c("^MXX", "^GSPC"), periodicity= "daily", from = "2015-01-01", to = "2017-10-02")
‘getSymbols’ currently uses auto.assign=TRUE by default, but will
use auto.assign=FALSE in 0.5-0. You will still be able to use
‘loadSymbols’ to automatically load data. getOption("getSymbols.env")
and getOption("getSymbols.auto.assign") will still be checked for
alternate defaults.
This message is shown once per session and may be disabled by setting
options("getSymbols.warning4.0"=FALSE). See ?getSymbols for details.
[1] "^MXX" "^GSPC"
data = na.omit(merge (MXX, GSPC))
firstmxx = as.numeric(MXX$MXX.Adjusted[1])
firstusa = as.numeric(GSPC$GSPC.Adjusted[1])
invmxx <- data$MXX.Adjusted / firstmxx
invusa <- data$GSPC.Adjusted / firstusa
plot(invmxx)
plot(invusa)
m1 <- lm(invmxx$MXX.Adjusted ~ invusa$GSPC.Adjusted)
summary(m1)
Call:
lm(formula = invmxx$MXX.Adjusted ~ invusa$GSPC.Adjusted)
Residuals:
Min 1Q Median 3Q Max
-0.06656 -0.01835 0.00244 0.01944 0.06810
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.35668 0.01383 25.79 <2e-16 ***
invusa$GSPC.Adjusted 0.69852 0.01311 53.29 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.02702 on 672 degrees of freedom
Multiple R-squared: 0.8086, Adjusted R-squared: 0.8083
F-statistic: 2840 on 1 and 672 DF, p-value: < 2.2e-16
adf.test(m1$residuals, k=0)
Augmented Dickey-Fuller Test
data: m1$residuals
Dickey-Fuller = -3.5752, Lag order = 0, p-value =
0.03508
alternative hypothesis: stationary
BASED OFF THE P VALUE, WE CAN TELL THAT THE RESIDUALS ARE FROM AS TATIONARY VARIABLE, AND THUS BOTH INDEXES ARE COINTEGRATED. ALSO, BASED OFF THE COEFICCIENTS, WE CAN TELL THAT FOR EACH PESO EARNED IN THE US MARKET, THERE IS A PROFIT OF 70 CENTS.
getSymbols(Symbols<-c("^MXX", "^GSPC"), periodicity= "daily", from = "2017-10-03", to = "2018-02-28")
[1] "^MXX" "^GSPC"
data_1 = na.omit(merge (MXX, GSPC))
firstmxx_1 = as.numeric(MXX$MXX.Adjusted[1])
firstusa_1 = as.numeric(GSPC$GSPC.Adjusted[1])
# When I divided a price with its original price I assigned 1.00 peso
invmxx_1 <- data_1$MXX.Adjusted / firstmxx_1
invusa_1 <- data_1$GSPC.Adjusted / firstusa_1
plot(invmxx_1)
plot(invusa_1)
m2 <- lm(invmxx_1$MXX.Adjusted ~ invusa_1$GSPC.Adjusted)
summary(m2)
Call:
lm(formula = invmxx_1$MXX.Adjusted ~ invusa_1$GSPC.Adjusted)
Residuals:
Min 1Q Median 3Q Max
-0.037454 -0.014008 -0.002567 0.017749 0.040777
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.79716 0.05928 13.447 < 2e-16 ***
invusa_1$GSPC.Adjusted 0.16206 0.05640 2.874 0.00501 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.02006 on 95 degrees of freedom
Multiple R-squared: 0.07997, Adjusted R-squared: 0.07028
F-statistic: 8.257 on 1 and 95 DF, p-value: 0.005008
adf.test(m2$residuals, k=0)
Augmented Dickey-Fuller Test
data: m2$residuals
Dickey-Fuller = -2.0418, Lag order = 0, p-value = 0.5592
alternative hypothesis: stationary
THESE SERIES ARE NOT COINTEGRATED, AND WE CAN TELL THAT THE ANALYSIS CAME DOWN AS SPURIOUS.
Once we have an idea about cointegration, we will review how to form a portfolio of 2 assets and calculate its holding return over time. The holding period return of an asset from day 1 to day N (HPR) can be calculated with any of the following ways:
HPR = (Price of the stock at day N / Price of the stock at day 1) - 1
HPR = exp ( Sum of all continuous compounded returns from day 1 to day N) - 1
The expected return of a portfolio of 2 assets is estimated as a weighted average of the expected stock returns:
E[Rp]=w1∗E[R1]+w2∗E(R2)
The holding return of a portfolio of 2 assets for a specific period of time is estimated as a weighted average of the holding returns of the assets:
HPRp=w1∗HPR1+w2∗HPR2
Download daily prices from CEMEX and ALFA from Jan 1, 2015 to Dec 31, 2017.
Calculate their holding period returns of both stocks
Create a portfolio 1 assigning 30% to CEMEX and 70% for ALFA and calculate the HPR of the portfolio
Create a portfolio 2 assigning -100% to CEMEX and +200% to ALFA and calcuate the HPR of this portfolio
getSymbols(Symbols<-c("CEMEXCPO.MX", "ALFAA.MX"), periodicity= "daily", from = "2015-01-01", to = "2017-12-01")
[1] "CEMEXCPO.MX" "ALFAA.MX"
First I calculate the daily continously compunded return
CEMEXCPO.MX$stockreturn <- log(CEMEXCPO.MX$CEMEXCPO.MX.Adjusted / (lag(CEMEXCPO.MX$CEMEXCPO.MX.Adjusted,1)))
ALFAA.MX$stockreturn <- log(ALFAA.MX$ALFAA.MX.Adjusted / (lag(ALFAA.MX$ALFAA.MX.Adjusted,1)))
n <- (nrow(CEMEXCPO.MX))
price0 <- as.numeric(CEMEXCPO.MX$CEMEXCPO.MX.Adjusted[1])
pricen <- as.numeric(CEMEXCPO.MX$CEMEXCPO.MX.Adjusted[n])
HPRCEMEX <- ((pricen /price0)-1)*100
print(paste("HPR of CEMEX = ", HPRCEMEX))
[1] "HPR of CEMEX = 9.26812986347962"
n_1 <- (nrow(ALFAA.MX))
price0_1 <- as.numeric(ALFAA.MX$ALFAA.MX.Adjusted[1])
pricen_1 <- as.numeric(ALFAA.MX$ALFAA.MX.Adjusted[n])
HPRALFAA <- ((pricen_1 /price0_1)-1 )*100
print(paste("HPR of ALFAA = ", HPRALFAA))
[1] "HPR of ALFAA = -34.9480141260153"
p1 <- HPRCEMEX*0.3 + HPRALFAA*0.7
print(paste("HPR of portfolio 1 ", p1))
[1] "HPR of portfolio 1 -21.6831709291668"
p2 <- HPRCEMEX*-1 + HPRALFAA*2
print(paste("HPR of portfolio 2 ", p2))
[1] "HPR of portfolio 2 -79.1641581155102"
What does a negative sign mean a portfolio? Briefly explain with the previous example.
THIS NEGATIVE SIGN SHOWS THAT ON ONE HAND WE ARE SHORT SELLING A STOCK TO GAIN LEVERAGE AND PURCHASE MORE STOCKS FROM A DIFFERENT INSTRUMENT.
Using the CEMEX and the ALFAA daily price series from Jan 1, 2015 to Dec 31, 2017, examine whether these two series are cointegrated. Assume you are in December 31, 2017. If the series are cointegrated that means that the residual of the regression between these series is a stationary series. Then, if this is the case, what can you do to take advantage in financial trading? FROM MY POV, BASED ON THE COINTEGRATION WE CAN ONLY ASSUME THAT IF THE RESIDUALS COMMING FROM A MODEL SHOW AN INCREASE THEN THE DEPENDENT VARIABLE WILL SKYROCKET AND THE INDEPENDENT VARIABLE WILL TANK. IF THE RESIDUALS SHOW A DECREASE THEN WE CAN INFER THAT THE INDEPENDENT VARIABLE WIL SKYROCKET AND THE DEPENDENT VARIABLE WILL TANK.
If you were to invest from Jan 1, 2018 to Feb 28, 2018 in a portfolio of these stocks, which weights would you assign? I WOULD INVEST HEAVILY INTO ALFAA AND SHORT MY POSITION OF CEMEX.
This is called statistical arbitrage, and some technical analysts use this technique to find arbitrage opportunities (be careful, this is not a safe strategy to follow)
dataset<- merge(ALFAA.MX$ALFAA.MX.Adjusted, CEMEXCPO.MX$CEMEXCPO.MX.Adjusted)
reg1 <- lm( dataset$ALFAA.MX.Adjusted ~ dataset$CEMEXCPO.MX.Adjusted)
summary(reg1)
Call:
lm(formula = dataset$ALFAA.MX.Adjusted ~ dataset$CEMEXCPO.MX.Adjusted)
Residuals:
Min 1Q Median 3Q Max
-8.6538 -1.0994 0.4569 1.6883 5.3340
Coefficients:
Estimate Std. Error t value
(Intercept) 42.77131 0.47469 90.10
dataset$CEMEXCPO.MX.Adjusted -1.02755 0.03503 -29.33
Pr(>|t|)
(Intercept) <2e-16 ***
dataset$CEMEXCPO.MX.Adjusted <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.55 on 733 degrees of freedom
Multiple R-squared: 0.54, Adjusted R-squared: 0.5393
F-statistic: 860.4 on 1 and 733 DF, p-value: < 2.2e-16
adf.test(reg1$residuals, k=0)
Augmented Dickey-Fuller Test
data: reg1$residuals
Dickey-Fuller = -3.8521, Lag order = 0, p-value =
0.01639
alternative hypothesis: stationary
rm(list = ls())
getSymbols(("CEMEXCPO.MX"), periodicity= "daily", from = "2018-01-01", to = "2018-02-28")
[1] "CEMEXCPO.MX"
getSymbols(("ALFAA.MX"), periodicity= "daily", from = "2018-01-01", to = "2018-02-28")
[1] "ALFAA.MX"
n <- (nrow(CEMEXCPO.MX))
price0 <- as.numeric(CEMEXCPO.MX$CEMEXCPO.MX.Adjusted[1])
pricen <- as.numeric(CEMEXCPO.MX$CEMEXCPO.MX.Adjusted[n])
HPRCEMEX <- ((pricen /price0)-1)*100
print(paste("HPR of CEMEX = ", HPRCEMEX))
[1] "HPR of CEMEX = -15.7929645407336"
n_1 <- (nrow(ALFAA.MX))
price0_1 <- as.numeric(ALFAA.MX$ALFAA.MX.Adjusted[1])
pricen_1 <- as.numeric(ALFAA.MX$ALFAA.MX.Adjusted[n])
HPRALFAA <- ((pricen_1 /price0_1)-1 )*100
print(paste("HPR of ALFAA = ", HPRALFAA))
[1] "HPR of ALFAA = 1.18957542276574"
Calculate the HPR of your portfolio from Jan 1st to Feb 28, 2018.
PSTEFAN <- HPRALFAA*2+ HPRCEMEX*-1
print(paste("HPR of my portfolio = ", PSTEFAN))
[1] "HPR of my portfolio = 18.1721153862651"