When we examine the relationship between two non-stationary variables by running a regression model, we risk ending up with an invalid (spurious) regression. Before explaining why a regression model can be spurious, we start with an example using 2 real-world variables.
Install the wbstats package using RStudio. This package gives access to the World Bank's open data, and it has functions to download data for all countries around the world. The function wb_data downloads the time-series indicators that the World Bank tracks for each country; there are hundreds of them. If you want to know more about this package, you can check its documentation on the CRAN web site (https://cran.r-project.org/web/packages/wbstats/wbstats.pdf).
We will download the infant mortality rate and the exports for Mexico. These variables should have nothing in common, so we would not expect a significant relationship between them.
I load the package and download the data:
# Set the working directory
setwd("~/Desktop/Financial Econometrics II ")
library(wbstats)
# Mexico - Infant mortality
infantm<-wb_data(indicator = c("SP.DYN.IMRT.IN"),
country="MEX", start_date = 1980, end_date = 2020)
# Mexico - Export value
exports<-wb_data(indicator = c("TX.VAL.MRCH.XD.WD"),
country="MEX", start_date = 1980, end_date = 2020)
The wb_data function returns a data frame with the requested data. We can plot the series to get an idea of how these 2 variables behave:
plot.ts(infantm$SP.DYN.IMRT.IN)
plot.ts(exports$TX.VAL.MRCH.XD.WD)
Now run a regression model using these series. Report the result of the regression.
m1 <- lm(exports$TX.VAL.MRCH.XD.WD ~ infantm$SP.DYN.IMRT.IN)
summary(m1)
##
## Call:
## lm(formula = exports$TX.VAL.MRCH.XD.WD ~ infantm$SP.DYN.IMRT.IN)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.331 -39.723 -7.456 37.848 77.194
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 275.1783 16.0408 17.16 < 2e-16 ***
## infantm$SP.DYN.IMRT.IN -6.1847 0.5327 -11.61 4.6e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 41.65 on 38 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.7801, Adjusted R-squared: 0.7743
## F-statistic: 134.8 on 1 and 38 DF, p-value: 4.602e-14
Did you find a significant relationship? Is your result what you expected?
According to this model, infant mortality is negatively and SIGNIFICANTLY related to exports in Mexico. The absolute t-value of beta 1 (11.61) is much higher than 2, indicating that the relationship is statistically VERY STRONG. Taken at face value, the model says that the lower the infant mortality, the higher the exports in Mexico.
Nevertheless, there is no logical explanation that connects infant mortality with the exports of Mexico. This is a weird finding, since it is hard to imagine a relationship between the two or an effect of infant mortality on exports.
Research about spurious regression, and explain in which cases you can end up with a spurious regression. In statistics, a spurious relationship or spurious correlation is a mathematical relationship in which two or more events or variables are associated but not causally related, due to either coincidence or the presence of a certain third, unseen factor (referred to as a “common response variable”, “confounding factor”, or “lurking variable”).
NOTES OF TEACHER: When you run a regression between 2 non-stationary variables, it is very likely that the results will not be reliable (SPURIOUS REGRESSION). It will be a valid regression ONLY IF the 2 non-stationary variables are cointegrated.
2 non-stationary variables are considered cointegrated only if the residuals of the regression between them are a stationary series.
To better understand this concept, we have to remember the concept of “residuals”.
If X and Y are non-stationary variables, the expected value of Y is the regression line, where $b_1$ is the slope:
$$E[y_t] = b_0 + b_1 x_t$$
The actual value of Y adds an error (residual) term:
$$y_t = b_0 + b_1 x_t + \varepsilon_t$$
Then the residual, or error, is the difference between the real value of Y and its expected value:
$$\varepsilon_t = y_t - E[y_t]$$
With an intercept in the model, the residuals of the best-fitting line through all the dots sum to 0.
The errors are a scaled distance between Y and X, because of the betas of the regression. If the error series is STATIONARY, then we can consider that Y is COINTEGRATED WITH X.
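The residual concept can be checked on simulated data. The sketch below is only an illustration under stated assumptions: the series, the seed, and the coefficients 2 and 0.5 are made up, and the lag-1 autocorrelation is used as a rough, informal indicator of persistence rather than as a formal stationarity test.

```r
# Simulated data (assumption): these are NOT the World Bank series used above
set.seed(42)
n <- 300
x <- cumsum(rnorm(n))                  # non-stationary random walk
y_coint <- 2 + 0.5 * x + rnorm(n)      # built from x plus stationary noise
y_indep <- cumsum(rnorm(n))            # random walk generated independently of x

fit_coint <- lm(y_coint ~ x)
fit_indep <- lm(y_indep ~ x)

# A residual is the actual value minus the value on the regression line
res_manual <- y_coint - fitted(fit_coint)
same_residuals <- isTRUE(all.equal(unname(res_manual),
                                   unname(residuals(fit_coint))))

# With an intercept, OLS residuals sum to zero (up to rounding error)
res_sum <- sum(residuals(fit_coint))

# Lag-1 autocorrelation of the residuals: low when they revert to the mean
# quickly, close to 1 when they wander like a random walk
rho_coint <- acf(residuals(fit_coint), plot = FALSE)$acf[2]
rho_indep <- acf(residuals(fit_indep), plot = FALSE)$acf[2]

print(same_residuals)
print(round(res_sum, 10))
print(round(c(rho_coint = rho_coint, rho_indep = rho_indep), 3))
```

In this setup, the residuals of the pair built from the same random walk should show little persistence, while the residuals of the unrelated pair should stay highly persistent.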
Are the series cointegrated? Is the regression spurious or valid? Run the corresponding test
library(tseries)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
adf.test(m1$residuals, k=0)
##
## Augmented Dickey-Fuller Test
##
## data: m1$residuals
## Dickey-Fuller = -1.5881, Lag order = 0, p-value = 0.7351
## alternative hypothesis: stationary
Since the p-value > 0.05, we cannot REJECT the null hypothesis, which states that the residual series is NON-STATIONARY. In other words, we have no evidence that the errors (residuals) are stationary, so we treat them as a NON-STATIONARY variable; with this test I can now say that both series ARE NOT COINTEGRATED.
Since both series are NOT COINTEGRATED, the result of the regression between them is SPURIOUS (NOT RELIABLE).
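To see why a significant t-value cannot be trusted when the series are not cointegrated, a small Monte Carlo experiment may help. This is a sketch on simulated data, not on the Mexican series: the function name `spurious_pvalue`, the sample size, the number of simulations, and the seed are all arbitrary choices made for illustration.

```r
# Simulated experiment (assumption): pairs of INDEPENDENT random walks
set.seed(1)
n_obs  <- 100    # observations per simulated series
n_sims <- 500    # number of simulated pairs

# Hypothetical helper: p-value of the slope when regressing one random
# walk on another random walk that is unrelated to it
spurious_pvalue <- function(n) {
  x <- cumsum(rnorm(n))
  y <- cumsum(rnorm(n))
  summary(lm(y ~ x))$coefficients[2, 4]   # row 2 = slope, column 4 = Pr(>|t|)
}

p_values <- replicate(n_sims, spurious_pvalue(n_obs))

# Share of simulations where the slope looks "significant" at the 5% level.
# If the usual t-test were valid here, this share would be close to 0.05.
rejection_rate <- mean(p_values < 0.05)
print(rejection_rate)
```

The series are unrelated by construction, so a rejection rate far above 5% shows that the usual t-test is not reliable for regressions in levels between non-stationary series.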
NOTES OF THE TEACHER: We can usually transform a non-stationary variable into a stationary variable: most of the time, with the first difference of the log of the variable (the continuously compounded % change), the series becomes stationary (also with the seasonal difference).
If X and Y are STATIONARY variables, then we DO NOT NEED TO TEST FOR COINTEGRATION: THE REGRESSION BETWEEN 2 STATIONARY VARIABLES WILL ALWAYS BE VALID.
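The transformation mentioned in these notes can be sketched on a simulated price series. The price path, its drift and volatility, and the seed are assumptions made only for illustration; the lag-1 autocorrelation is again an informal indicator of persistence, not a formal test.

```r
# Simulated price (assumption): a random walk in logs, not real market data
set.seed(7)
n <- 500
log_price <- cumsum(rnorm(n, mean = 0.0005, sd = 0.01))
price <- 100 * exp(log_price)

# First difference of the log = continuously compounded return
cc_return <- diff(log(price))

# Persistence before and after the transformation
rho_price  <- acf(price, plot = FALSE)$acf[2]
rho_return <- acf(cc_return, plot = FALSE)$acf[2]
print(round(c(rho_price = rho_price, rho_return = rho_return), 3))
```

The level of the price should be highly persistent, while the differenced log series should show almost no persistence, which is the behavior expected from a stationary series.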
Using daily data of the Mexican market index (IPyC) and the S&P 500, examine whether the two series are cointegrated. Generate an index for each instrument. To build these indexes, create a variable that represents how 1.00 peso or 1.00 dollar invested in each instrument would change over time. Do this for two periods:
From Jan 1, 2015 to Oct 2, 2017.
From Oct 3, 2017 to Feb 28, 2018
Loading the package and then downloading the data from Jan 1, 2015 to Oct 2, 2017.
library(quantmod)
## Loading required package: xts
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: TTR
library(tseries)
getSymbols(Symbols = c("^MXX", "^GSPC"), periodicity = "daily", from = "2015-01-01", to = "2017-10-02")
## 'getSymbols' currently uses auto.assign=TRUE by default, but will
## use auto.assign=FALSE in 0.5-0. You will still be able to use
## 'loadSymbols' to automatically load data. getOption("getSymbols.env")
## and getOption("getSymbols.auto.assign") will still be checked for
## alternate defaults.
##
## This message is shown once per session and may be disabled by setting
## options("getSymbols.warning4.0"=FALSE). See ?getSymbols for details.
## [1] "^MXX" "^GSPC"
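# Keep only the dates on which both markets traded (holidays differ between the two countries)
data = na.omit(merge (MXX, GSPC))
Then I get the first value of both instruments:
# Take the first value from the merged data so that both indexes start at exactly 1.00
firstmxx = as.numeric(data$MXX.Adjusted[1])
firstusa = as.numeric(data$GSPC.Adjusted[1])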
Then I create the indexes for $1.00
invmxx <- data$MXX.Adjusted / firstmxx
invusa <- data$GSPC.Adjusted / firstusa
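The index construction is just each price divided by the first price. A tiny worked example with made-up prices (the numbers below are hypothetical, not market data) shows how to read the result:

```r
# Hypothetical prices (assumption), only to illustrate the calculation
prices <- c(50.0, 55.0, 52.5, 60.0)

# Value over time of 1.00 invested at the first price
index <- prices / prices[1]
print(index)                      # 1.00 1.10 1.05 1.20

# The same index read as a cumulative % change since the first day
pct_change <- (index - 1) * 100
print(pct_change)
```

An index value of 1.20 means that 1.00 invested on the first day is worth 1.20 on that day, a cumulative gain of about 20%.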
To see how the 2 indexes of $1.00 behave over time, I plot them:
plot(invmxx)
plot(invusa)
Now, with these two variables, I check whether invmxx and invusa are cointegrated:
m1 <- lm(invmxx$MXX.Adjusted ~ invusa$GSPC.Adjusted)
summary(m1)
##
## Call:
## lm(formula = invmxx$MXX.Adjusted ~ invusa$GSPC.Adjusted)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.06656 -0.01835 0.00244 0.01944 0.06810
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.35668 0.01383 25.79 <2e-16 ***
## invusa$GSPC.Adjusted 0.69852 0.01311 53.29 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02702 on 672 degrees of freedom
## Multiple R-squared: 0.8086, Adjusted R-squared: 0.8083
## F-statistic: 2840 on 1 and 672 DF, p-value: < 2.2e-16
Before interpreting this model, I test its residuals to check whether they behave like a stationary variable:
adf.test(m1$residuals, k=0)
##
## Augmented Dickey-Fuller Test
##
## data: m1$residuals
## Dickey-Fuller = -3.5752, Lag order = 0, p-value = 0.03508
## alternative hypothesis: stationary
Since the p-value < 0.05, I can conclude that the residuals are a stationary variable, so both indexes are COINTEGRATED; in this period of time the Mexican financial market is COINTEGRATED with the USA financial market.
Then I can rely on the coefficients of my model, so I can conclude that for each 1.00 gained in the USA market index, I gain on average about 0.70 (69.852 cents) in the Mexican market index, because there is a positive and significant relationship between the USA and the Mexican market.
For the second period, from Oct 3, 2017 to Feb 28, 2018, I download the data:
getSymbols(Symbols = c("^MXX", "^GSPC"), periodicity = "daily", from = "2017-10-03", to = "2018-02-28")
## [1] "^MXX" "^GSPC"
data_1 = na.omit(merge (MXX, GSPC))
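Then I build the $1.00 indexes for this period:
# Take the first value from the merged data so that both indexes start at exactly 1.00
firstmxx_1 = as.numeric(data_1$MXX.Adjusted[1])
firstusa_1 = as.numeric(data_1$GSPC.Adjusted[1])
# Dividing each price by the first price shows the value over time of 1.00 invested on the first day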
invmxx_1 <- data_1$MXX.Adjusted / firstmxx_1
invusa_1 <- data_1$GSPC.Adjusted / firstusa_1
To see how the 2 indexes of $1.00 behave over time, I plot them:
plot(invmxx_1)
plot(invusa_1)
Now, with these two variables, I check whether invmxx_1 and invusa_1 are cointegrated:
m2 <- lm(invmxx_1$MXX.Adjusted ~ invusa_1$GSPC.Adjusted)
summary(m2)
##
## Call:
## lm(formula = invmxx_1$MXX.Adjusted ~ invusa_1$GSPC.Adjusted)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.037454 -0.014008 -0.002567 0.017749 0.040777
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.79716 0.05928 13.447 < 2e-16 ***
## invusa_1$GSPC.Adjusted 0.16206 0.05640 2.874 0.00501 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02006 on 95 degrees of freedom
## Multiple R-squared: 0.07997, Adjusted R-squared: 0.07028
## F-statistic: 8.257 on 1 and 95 DF, p-value: 0.005008
Before interpreting this model, I test its residuals to check whether they behave like a stationary variable:
adf.test(m2$residuals, k=0)
##
## Augmented Dickey-Fuller Test
##
## data: m2$residuals
## Dickey-Fuller = -2.0418, Lag order = 0, p-value = 0.5592
## alternative hypothesis: stationary
Since the p-value > 0.05, we cannot REJECT the null hypothesis, which states that the residual series is NON-STATIONARY. In other words, we have no evidence that the errors (residuals) are stationary, so we treat them as a NON-STATIONARY variable; with this test I can now say that both series ARE NOT COINTEGRATED.
Since both series are NOT COINTEGRATED, the result of the regression between them is SPURIOUS (NOT RELIABLE).
Download daily prices from CEMEX and ALFA from Jan 1, 2015 to Dec 31, 2017.
# Note: the end date passed here is 2017-12-01, not Dec 31, 2017 as in the exercise statement
getSymbols(Symbols = c("CEMEXCPO.MX", "ALFAA.MX"), periodicity = "daily", from = "2015-01-01", to = "2017-12-01")
## [1] "CEMEXCPO.MX" "ALFAA.MX"
First I calculate the daily continuously compounded returns:
CEMEXCPO.MX$stockreturn <- log(CEMEXCPO.MX$CEMEXCPO.MX.Adjusted / (lag(CEMEXCPO.MX$CEMEXCPO.MX.Adjusted,1)))
ALFAA.MX$stockreturn <- log(ALFAA.MX$ALFAA.MX.Adjusted / (lag(ALFAA.MX$ALFAA.MX.Adjusted,1)))
I calculate the Holding period return of both stocks
# Calculation for the CEMEX stock
n <- (nrow(CEMEXCPO.MX))
price0 <- as.numeric(CEMEXCPO.MX$CEMEXCPO.MX.Adjusted[1])
pricen <- as.numeric(CEMEXCPO.MX$CEMEXCPO.MX.Adjusted[n])
HPRCEMEX <- ((pricen /price0)-1)*100
print(paste("HPR of CEMEX = ", HPRCEMEX))
## [1] "HPR of CEMEX = 9.26812986347962"
# Calculation for the ALFAA stock
n_1 <- (nrow(ALFAA.MX))
price0_1 <- as.numeric(ALFAA.MX$ALFAA.MX.Adjusted[1])
pricen_1 <- as.numeric(ALFAA.MX$ALFAA.MX.Adjusted[n_1])
HPRALFAA <- ((pricen_1 /price0_1)-1 )*100
print(paste("HPR of ALFAA = ", HPRALFAA))
## [1] "HPR of ALFAA = -34.9480177258728"
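The holding period return can also be recovered from the daily continuously compounded returns, because the log returns add up over time. The sketch below uses made-up prices (an assumption, not the CEMEX or ALFA data) to show that both routes give the same number.

```r
# Hypothetical prices (assumption), only to illustrate the identity
prices <- c(20, 21, 19.5, 22)

# Daily continuously compounded returns: log(P_t / P_{t-1})
cc_returns <- diff(log(prices))

# HPR from the first and last price, as a percentage
hpr_from_prices <- (prices[length(prices)] / prices[1] - 1) * 100

# HPR from the sum of the continuously compounded returns
hpr_from_returns <- (exp(sum(cc_returns)) - 1) * 100

print(c(hpr_from_prices, hpr_from_returns))   # both equal 10
```

The sum of the log returns equals log(last price / first price), so exponentiating the sum gives back the price ratio.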
Now I create a portfolio 1 assigning 30% to CEMEX and 70% for ALFA and calculate the HPR of the portfolio
p1 <- HPRCEMEX*0.3 + HPRALFAA*0.7
print(paste("HPR of my portfolio 1 ", p1))
## [1] "HPR of my portfolio 1 -21.6831734490671"
Create a portfolio 2 assigning -100% to CEMEX and +200% to ALFA and calculate the HPR of this portfolio
p2 <- HPRCEMEX*-1 + HPRALFAA*2
print(paste("HPR of my portfolio 2 ", p2))
## [1] "HPR of my portfolio 2 -79.1641653152253"
What does a negative sign mean in a portfolio? Briefly explain with the previous example. The negative sign means that I am selling one stock short: in portfolio 2, for each 1.00 peso of my own money I short-sell 1.00 peso of CEMEX and use the proceeds as leverage to buy 2.00 pesos of ALFA.
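The two portfolio calculations follow the same rule: the portfolio HPR is the weighted sum of the individual HPRs, and the weights must add up to 100% even when one of them is negative. A minimal sketch, where `portfolio_hpr` is a hypothetical helper written for this illustration and the HPR values are the ones printed above:

```r
# Hypothetical helper: weighted sum of holding period returns (in %)
portfolio_hpr <- function(weights, hprs) {
  # The weights must match the assets and add up to 1 (100% of own capital)
  stopifnot(length(weights) == length(hprs),
            abs(sum(weights) - 1) < 1e-12)
  sum(weights * hprs)
}

# HPRs (in %) taken from the output above
hprs <- c(CEMEX = 9.26812986347962, ALFA = -34.9480177258728)

p1 <- portfolio_hpr(c(0.3, 0.7), hprs)   # long-only portfolio
p2 <- portfolio_hpr(c(-1, 2), hprs)      # short CEMEX, leveraged long ALFA
print(round(c(p1 = p1, p2 = p2), 2))     # -21.68 -79.16
```

In portfolio 2 the short position loses because CEMEX rose by about 9.27%, and the doubled ALFA position loses twice its fall of about 34.95%, which is why the total loss is so large.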
Using the CEMEX and the ALFAA daily price series from Jan 1, 2015 to Dec 31, 2017, examine whether these two series are cointegrated. Assume you are on December 31, 2017. If the series are cointegrated, that means that the residuals of the regression between these series are a stationary series.
dataset<- merge(ALFAA.MX$ALFAA.MX.Adjusted, CEMEXCPO.MX$CEMEXCPO.MX.Adjusted)
reg1 <- lm( dataset$ALFAA.MX.Adjusted ~ dataset$CEMEXCPO.MX.Adjusted)
summary(reg1)
##
## Call:
## lm(formula = dataset$ALFAA.MX.Adjusted ~ dataset$CEMEXCPO.MX.Adjusted)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.675 -1.102 0.458 1.692 5.347
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.87757 0.47587 90.10 <2e-16 ***
## dataset$CEMEXCPO.MX.Adjusted -1.03010 0.03512 -29.33 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.556 on 733 degrees of freedom
## Multiple R-squared: 0.54, Adjusted R-squared: 0.5393
## F-statistic: 860.4 on 1 and 733 DF, p-value: < 2.2e-16
Before interpreting this model, I test its residuals to check whether they behave like a stationary variable:
adf.test(reg1$residuals, k=0)
##
## Augmented Dickey-Fuller Test
##
## data: reg1$residuals
## Dickey-Fuller = -3.8521, Lag order = 0, p-value = 0.01639
## alternative hypothesis: stationary
Then, if this is the case, what can you do to take advantage in financial trading? Since the p-value < 0.05, the residuals are stationary, so these stocks are cointegrated. In my opinion, this means that the residual of the model tends to return to its mean of 0. If the residual becomes unusually high, the dependent variable (ALFA) is expensive relative to the value predicted from the independent variable (CEMEX), so we can expect the gap to close and position the portfolio for ALFA to fall relative to that predicted value. If the residual becomes unusually low, we can expect the opposite and position the portfolio for ALFA to rise relative to its predicted value.
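One common way to act on a cointegrated pair is to treat the regression residual as a mean-reverting spread and trade only when it is far from 0. The sketch below is a hedged illustration on simulated prices: the series, the AR(1) coefficient of 0.8, the threshold of 2 standard deviations, and the -1/0/+1 signal coding are all assumptions, not the method used in this exercise.

```r
# Simulated pair (assumption): y is tied to x through a stationary spread
set.seed(99)
n <- 500
x <- 20 + cumsum(rnorm(n, sd = 0.2))                            # random-walk price
spread <- as.numeric(arima.sim(model = list(ar = 0.8), n = n))  # stationary AR(1)
y <- 40 - 1.0 * x + spread                                      # cointegrated with x

# Step 1: regression in levels; the residual estimates the spread
fit <- lm(y ~ x)

# Step 2: standardize the residual so that it is measured in standard deviations
z <- as.numeric(scale(residuals(fit)))

# Step 3: hypothetical trading rule
#   -1 = residual unusually high, bet that it falls back toward 0
#   +1 = residual unusually low, bet that it rises back toward 0
#    0 = no position
threshold <- 2
signal <- ifelse(z > threshold, -1, ifelse(z < -threshold, 1, 0))
print(table(signal))
```

The rule only makes sense while the residual stays stationary, so the cointegration test should be repeated on new data before relying on it.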
rm(list = ls())
getSymbols(("CEMEXCPO.MX"), periodicity= "daily", from = "2018-01-01", to = "2018-02-28")
## [1] "CEMEXCPO.MX"
getSymbols(("ALFAA.MX"), periodicity= "daily", from = "2018-01-01", to = "2018-02-28")
## [1] "ALFAA.MX"
Calculate the HPR of your portfolio from Jan 1st to Feb 28, 2018.
# Calculation for the CEMEX stock
n <- (nrow(CEMEXCPO.MX))
price0 <- as.numeric(CEMEXCPO.MX$CEMEXCPO.MX.Adjusted[1])
pricen <- as.numeric(CEMEXCPO.MX$CEMEXCPO.MX.Adjusted[n])
HPRCEMEX <- ((pricen /price0)-1)*100
print(paste("HPR of CEMEX = ", HPRCEMEX))
## [1] "HPR of CEMEX = -15.7929656158527"
# Calculation for the ALFAA stock
n_1 <- (nrow(ALFAA.MX))
price0_1 <- as.numeric(ALFAA.MX$ALFAA.MX.Adjusted[1])
pricen_1 <- as.numeric(ALFAA.MX$ALFAA.MX.Adjusted[n_1])
HPRALFAA <- ((pricen_1 /price0_1)-1 )*100
print(paste("HPR of ALFAA = ", HPRALFAA))
## [1] "HPR of ALFAA = 1.18954999293077"
If you were to invest from Jan 1, 2018 to Feb 28, 2018 in a portfolio of these stocks, which weights would you assign? I will assign a weight of +200% to ALFAA and a weight of -100% to CEMEX.
PEduardo <- HPRALFAA*2+ HPRCEMEX*-1
print(paste("HPR of my portfolio = ", PEduardo))
## [1] "HPR of my portfolio = 18.1720656017143"