Abstract
In this workshop we learn the concepts of stationarity, unitary root and cointegration of time-series variables.You will work in RStudio. You can go to the RStudio web site and follow instructions to download the R software and the free version of RStudio. It is strongly recommended to have the latest version of R and RStudio. Once you are in RStudio, do the following.
Create an R Notebook document (File -> New File -> R Notebook), where you have to write whatever is asked in this workshop. More specifically, you have to:
Replicate all the R Code along with its output.
You have to respond to the challenges of the workshop and any other question asked in the workshop.
An y QUESTION or any INTERPRETATION you need to do will be written in CAPITAL LETTERS. For ANY QUESTION or INTERPRETATION, you have to RESPOND IN CAPITAL LETTERS right after the question.
You have to keep saving your .Rmd file, and ONLY SUBMIT the .html version of your .Rmd file. Pay attention in class to know how to generate an html file from your .Rmd.
In our previous workshop we learned about the random walk model applied to finance. We learned that the logarithm of the stock prices behaves similar to a random walk model with a positive drift (\(\phi_0>0\)).
As we learned, a random walk model can be expressed in the following equation:
\[ Y_t = φ_0 + Y_{t−1} + ε_t \]
Following this model we did a Monte Carlo simulation for the logarithm of the S&P500, and we found that most of the time the simulations of a random walk model behaves similar to the real log of the S&P500 index.
We learned that if \(\phi_0>0\), then the series will have a growing trend over time. We also learned that the higher the standard deviation of the daily shock, the more likely that the series will radically move up and down. The standard deviation of the shock is actually the daily volatility of the daily returns of the series.
An important insight about the random walk model is that the random walk model is a non-stationary series. What does this mean?
In short, a stationary series is a series that its mean in any time period is about the same (no matter which time periods we select), and its standard deviation is about the same for any time period. There is another condition of stationary related to auto-covariance of the series, but this condition must hold true for strong stationary series. For this class we do not learn about strong stationary series.
Then, the random-walk model will always behave as a non-stationary series.
Let’s analyze what part of the random walk equation makes the series stationary. We can re-write the previous random walk equation as follows:
\[ Y_t = φ_0 + \phi_1*Y_{t−1} + ε_t \] If the coefficient \(\phi_{1}=1\), then this equation becomes the random walk equation. When \(\phi_{1}=1\) the \(Y_t\) series becomes non-stationary.
Interestingly, if \(\phi_{1}<1\), then \(Y_t\) becomes a stationary series.
Let’s work with an example.
We will work with an online dataset that contains sales data of different consumer products.
Download the excel file from a web site:
download.file("http://www.apradie.com/ec2004/salesbrands3.xlsx", "salesbrands3.xlsx", mode="wb")
This dataset has sales information of several brands of consumer products that belong to one category (for example, the category can be Cereals or Candies). The content of this dataset is very similar to a real-world category. Consumer products usually show some seasonality and trend over time. We will learn how to handle these features later.
You have to install the readxl package. This package reads data from Excel files. We load this package:
library(readxl)
Now, we import the Excel file into an R dataframe:
<- read_excel("salesbrands3.xlsx", sheet = "Resultado")
sales_brands
# Always familiarize with the data
head(sales_brands, 5)
## # A tibble: 5 × 15
## date qbrand1t…¹ sales…² qbran…³ sales…⁴ qbran…⁵ sales…⁶ qbran…⁷
## <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2015-01-01 00:00:00 107. 60781. 195. 60701. 293. 75196. 186.
## 2 2015-02-01 00:00:00 106. 60411. 216. 68289. 494. 105011. 149.
## 3 2015-03-01 00:00:00 88.8 53175. 220. 62530. 357. 90903. 145.
## 4 2015-04-01 00:00:00 86.7 50862. 171. 50576. 291. 72853. 161.
## 5 2015-05-01 00:00:00 87.2 51014. 177. 54327. 318. 75277. 141.
## # … with 7 more variables: salesbrand4 <dbl>, qbrand5tons <dbl>,
## # salesbrand5 <dbl>, qbrand6tons <dbl>, salesbrand6 <dbl>, tonstotal <dbl>,
## # salestotal <dbl>, and abbreviated variable names ¹qbrand1tons,
## # ²salesbrand1, ³qbrand2tons, ⁴salesbrand2, ⁵qbrand3tons, ⁶salesbrand3,
## # ⁷qbrand4tons
As you can see, we have imported a monthly dataset where the variable date shows the first day of the month in a yyyy-mm-dd format.
In order to perform some time-series analysis, it is recommended to use xts objects. We will construct a xts object using the xts() method. The objects of xts class are composed by 2 objects: the core data and the index (class Date, POSIX, etc).
# Load library xts. Install the package if you haven't already.
library(xts)
# We save the date column in a new dataset:
<- sales_brands$date
month
# We delete the date column from the sales data. The date is the column 1 of the dataset:
<- sales_brands[,c(-1)]
sales_brands # As you see, when you specify c(-1) in the column index that means that you do not want column 1!
# Now we create an xts dataset using the previous datasets:
<- xts(x = sales_brands, order.by = month)
sales.xts
head(sales.xts, 5)
## qbrand1tons salesbrand1 qbrand2tons salesbrand2 qbrand3tons
## 2015-01-01 106.62 60780.72 195.46 60700.51 293.10
## 2015-02-01 105.82 60411.47 216.09 68289.15 494.46
## 2015-03-01 88.82 53174.89 220.15 62529.54 356.88
## 2015-04-01 86.69 50861.69 170.74 50575.68 291.07
## 2015-05-01 87.23 51014.23 177.17 54327.48 318.09
## salesbrand3 qbrand4tons salesbrand4 qbrand5tons salesbrand5
## 2015-01-01 75196.35 185.91 48228.51 28.94 2295.50
## 2015-02-01 105010.90 149.30 39922.14 24.66 1973.43
## 2015-03-01 90902.84 144.65 39788.22 23.75 1903.24
## 2015-04-01 72853.34 161.43 42692.59 21.00 1684.28
## 2015-05-01 75277.10 141.09 37167.52 23.44 1883.41
## qbrand6tons salesbrand6 tonstotal salestotal
## 2015-01-01 338.38 55561.26 1148.41 302762.8
## 2015-02-01 284.52 51469.74 1274.85 327076.8
## 2015-03-01 251.34 50395.48 1085.59 298694.2
## 2015-04-01 230.29 49124.61 961.22 267792.2
## 2015-05-01 250.45 46906.63 997.47 266576.4
# Do you notice any difference between data frames and xts objects?
#delete month variable
rm(month)
We can have an overview of the behavior of sales of these consumer products using the plot command. Let’s focus on brand 2 and 6. To see volume sales (the number of tons sold) of brand 2:
plot(sales.xts$qbrand2tons)
It seems that this product has a growing trend over time, except for the last year.
We can plot sales of brand 6
plot(sales.xts$qbrand6tons)
This brand has a clear growing trend over time. As we can see, many consumer products have a growing or a declining trend. And other products might seem to be stationary over time, with sales over an average over time.
A series with a growing or declining trend is non-stationary. Most of the variables that represent sales of any product can be treated as non-stationary. Then, a rule of thumb when we work with business variables such as volume sales, value sales, stock prices, cost of good sold, etc is to transform the series into a stationary series.
When you work with sales data, I strongly recommend to transform the sales data by applying the natural logarithm. With the log of sales, if we take the first difference of the series we are calculating the monthly % change (in continuously compounded). This first difference of the log will become a stationary series most of the time when you analyze real sales data.
Then we calculate the logarithm of volume sales for the whole dataset:
<- log(sales.xts) ln_sales
The new object has the same structure as sales.xts but the columns contain the natural logarithm of the volume sales for all products.
Now, we can work with the first difference month-by-month using the diff() function for each variable. For example, to graph the first difference of the log of volume sales for both brands :
= diff(ln_sales$qbrand2tons)
sales_monthly_growth_q2 = diff(ln_sales$qbrand6tons)
sales_monthly_growth_q6
plot(sales_monthly_growth_q2, type = "l", col = "red")
lines(sales_monthly_growth_q6, type = "l", col = "blue")
The diff() function calculates the first difference as:
diff(ln_sales\(qbrand2tons) = ln_sales\)qbrand2tons - lag(ln_sales$qbrand2tons, 1)
Remember that the dollar sign operator ($) is used to access/refer the columns of a data frame or xts dataset.
We can see that the 2 series look like stationary since their means tend to be in the same level for any time period.
Another transformation with monthly data is annual percentage change, which can be calculated as the difference between the log value of the series and its log value 12 months ago. We can calculate this difference using the diff() function with the lag argument set to 12:
= diff(ln_sales$qbrand2tons,lag = 12)
sales_annual_growth_q2 = diff(ln_sales$qbrand6tons,lag = 12)
sales_annual_growth_q6
plot(sales_annual_growth_q2, type = "l", col = "red")
lines(sales_annual_growth_q6, type = "l", col = "blue")
In this case, it is difficult to evaluate graphically whether both series are stationary. We need to do a statistical test to check whether a series can be treated as stationary or not. Some series might not look stationary for our eyes, but we need to run statistical tests to decide whether to treat the series stationary or not.
There is a statistical way to evaluate whether these series are stationary. We need a statistical confirmation in addition to the visual tools. The Dicky-Fuller test is one of the most used statistical tests for stationary. This test is called a unit root test. The Dicky-Fuller test assumes that the series is non stationary. In other words, it assumes that the series follows a random walk with a unit root (\(\phi1=1\)).
You need to install the package tseries to do the following analysis.
We can run the Dicky-Fuller test for the sales annual % growth of product 2 as follows:
library(tseries)
adf.test(na.omit(sales_annual_growth_q2), alternative = "stationary")
##
## Augmented Dickey-Fuller Test
##
## data: na.omit(sales_annual_growth_q2)
## Dickey-Fuller = -3.574, Lag order = 3, p-value = 0.04701
## alternative hypothesis: stationary
The null hypothesis of the Dicky-Fuller test is that the series is non-stationary. Actually, the null-hypothesis of this test is that the \(\phi_1=1\), which implies that the series is non-stationary. This is the reason why the Dicky-Fuller test is called a unit-root test.
As any statistical test, if the p-value of the test is less than 0.05, then we have statistical evidence (at the 95% confidence level) to REJECT the null-hypothesis.
The Dicky Fuller test reports a p-value less than 0.05, indicating that we can reject the null hypothesis that states a unit root for these series (\(\phi_1 = 1\)). In other words, since the p-value is less than 0.05 we can say that there is statistical evidence to conclude that the sales annual % change of product 2 can be treated as stationary.
RUN AND INTERPRET A DICKY-FULLER TEST FOR THE ANNUAL % DIFFERENCE OF BRAND 6, AND PROVIDE AN INTERPRETATION
I RUN THE adf.test FUNCTION FOR THE DICKY-FULLER TEST FOR THE ANNUAL % DIFFERENCE OF BRAND 6 (sales_annual_growth_q6):
adf.test(na.omit(sales_annual_growth_q6), alternative = "stationary")
##
## Augmented Dickey-Fuller Test
##
## data: na.omit(sales_annual_growth_q6)
## Dickey-Fuller = -2.3423, Lag order = 3, p-value = 0.4387
## alternative hypothesis: stationary
SINCE THE P-VALUE>0.05 THEN WE CANNOT REJECT THE NULL HYPOTHESIS THAT STATES THAT THE SERIES IS NON-STATIONARY.
HOWEVER, WE CAN RE-RUN THIS SAME TEST BUT INDICATING THE PARAMETER LAG=1. THIS OPTION IS IMPORTANT. WITH THIS OPTION WE ARE INDICATING TO RUN THE AUGMENTED DICKY-FULLER TEST WITH 1 LAG.
WE STRONGLY RECOMMEND TO ALWAYS RUN THE AUGMENTED DICKY-FULLER TEST WITH LAGS=1 FOR MONTHLY OR QUARTERLY DATA. LET’S SEE HOW TO RUN THIS AUGMENTED DICKY-FULLER TEST:
adf.test(na.omit(sales_annual_growth_q6), alternative = "stationary",k=1)
##
## Augmented Dickey-Fuller Test
##
## data: na.omit(sales_annual_growth_q6)
## Dickey-Fuller = -3.923, Lag order = 1, p-value = 0.02227
## alternative hypothesis: stationary
NOW WE GOT AN OPPOSITE RESULT! WE GOT A PVALUE<0.05, INDICATING THAT WE CAN TREAT THE SALES ANNUAL GROWTH OF PRODUCT 6 AS STATIONARY!
FOR ALL THE DICKY-FULLER TEST FOR THIS COURSE, USE THE PARAMETER k=1 !
When we want to examine the relationship between two non-stationary variables by running a regression model, we have the risk to end up with a non-valid - spurious - regression. Before we understand why a regression model can be spurious, we start with and example using 2 real-world variables.
Install the wbstats package using RStudio. This package was written by the World Bank, and it has functions to download data of all countries around the world. The function wb downloads hundreds of time-series variables that the World Bank tracks for all countries. If you want to know more about this package, you can check its documentation in the cran web site (https://cran.r-project.org/web/packages/wbstats/wbstats.pdf)
We will download the infant mortality rate and the exports for Mexico. It is supposed that these variables have nothing in common, so we would not expect a significant relationship.
library(wbstats)
## Warning: package 'wbstats' was built under R version 4.2.3
# Mexico - Infant mortality
<-wb_data(indicator = c("SP.DYN.IMRT.IN"),
infantmcountry="MEX", start_date = 1980, end_date = 2020)
# Mexico - Export value
<-wb_data(indicator = c("TX.VAL.MRCH.XD.WD"),
exportscountry="MEX", start_date = 1980, end_date = 2020)
The wb function brings a data frame with the requested data. We can plot the data to have an idea of these 2 variables:
plot.ts(infantm$SP.DYN.IMRT.IN)
plot.ts(exports$TX.VAL.MRCH.XD.WD)
Now run a regression using these series. Report the result of the regression. Did you find significant relationship? is your result what you expected?
<- lm(exports$TX.VAL.MRCH.XD.WD ~ infantm$SP.DYN.IMRT.IN)
m1 summary(m1)
##
## Call:
## lm(formula = exports$TX.VAL.MRCH.XD.WD ~ infantm$SP.DYN.IMRT.IN)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.20 -41.43 -5.09 38.80 73.81
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 280.2244 15.5961 17.97 < 2e-16 ***
## infantm$SP.DYN.IMRT.IN -6.3214 0.5231 -12.08 9.28e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 41.62 on 39 degrees of freedom
## Multiple R-squared: 0.7892, Adjusted R-squared: 0.7838
## F-statistic: 146 on 1 and 39 DF, p-value: 9.283e-15
INTERESTINGLY, ACCORDING TO THIS REGRESSION, THE INFANT MORTALITY INDEX IS SIGNIFICANTLY RELATED TO THE NUMBER OF EXPORTS IN MEXICO. THIS DOES NOT MAKE SENSE SINCE BOTH PHENOMENA ARE NOT RELATED AT ALL. THIS REGRESSION SUGGESTS THAT WHEN THE EXPORTS GO UP, IN MEXICO THE NUMBER OF INFANT DEATHS DECREASES SINGIFICANTLY.
WHEN YOU RUN A REGRESSION WHERE THE DEPENDENT AND THE INDEPENDENT VARIABLES ARE NON-STATIONARY, MOST OF THE TIME THE RESULT WILL SHOW THAT BOTH ARE STRONGLY AND SIGNIFICANTLY RELATED (NEGATIVE OR POSITIVE). HOWEVER, THIS RELATIONSHIP MIGHT NOT BE TRUE!! IT MIGHT BE SPURIOUS.
THE ONLY CASE THAT THIS TYPE OF REGRESSION IS NOT SPURIOUS OR VALID IS WHEN BOTH VARIABLES ARE COINTEGRATED. ONLY IF THE 2 NON-STATIONARY SERIES ARE COINTEGRATED, THEN THE REGRESSION BETWEEN THEM WILL BE VALID. IF THEY ARE NOT COINTEGRATED, THEN THE REGRESSION WILL BE SPURIOUS.
THEN, WHAT IS COINTEGRATION??????
I EXPLAIN THAT IN THE FOLLOWING SECTION.
RESEARCH ABOUT SPURIOUS REGRESSION. With your words briefly explain in which cases you can end up with a spurious regression
When two non-stationary series are strongly related in the long run, then we say that both series are cointegrated. The question is how strong this long-term relationship needs to be in order to consider 2 non-stationary series as cointegrated.
To do a cointegration statistical test of 2 non-stationary series, we need to do the following:
Run a linear regression of the 2 non-stationary series
Get the residuals of this regression
Do a Dicky-Fuller test to check whether the residuals (errors) of the previous regression is a stationary series. If the p-value<0.05, then we can say that there is statistical evidence to consider the 2 non-stationary series as cointegrated.
If 2 non-stationary series are NOT cointegrated, then its linear regression becomes a spurious regression. In other words, when 2 non-stationary series are not cointegrated, the result of a linear regression using these 2 variables will not be reliable or valid.
Are the infant mortality and export series cointegrated? Is the regression spurious or valid? Run and interpret the corresponding test
Here is a code to help you respond this question.
library(tseries)
# The errors/residual of the regression are stored in the residuals
# attribute of the m1 model:
adf.test(m1$residuals, alternative = "stationary",k=1)
##
## Augmented Dickey-Fuller Test
##
## data: m1$residuals
## Dickey-Fuller = -2.3019, Lag order = 1, p-value = 0.4548
## alternative hypothesis: stationary
SINCE THE P-VALUE OF THE TEST IS >0.05, THEN WE CANNOT REJECT THE NULL HYPOTHESES THAT STATES THAT THE ERRORS IS A NON-STATIONARY SERIES. THEN, SINCE THE ERRORS ARE NOT STATIONARY, THEN WE CONCLUDE THAT THESE 2 NON-STATIONARY SERIES ARE NOT COINTEGRATED. THEN, SINCE THEY ARE NOT COINTEGRATED, THE REGRESSION BETWEEN MORTALITY RATE AND EXPORTS IS NOT VALID; IT IS A SPURIOUS REGRESSION!
Using daily data of Mexico IPCyC market index and the S&P 500, examine whether two series are cointegrated. Generate an index for each instrument. To do these indexes, create a variable that represents how 1.00 peso or 1.00 dollar invested in each instrument would be changing over time.
From Jan 1, 2015 to Oct 2, 2017.
From Oct 3, 2017 to Feb 28, 2018
HINT: for creating the $1.00 index. For the first period, you can do the following process:
Downloading and merge the stock prices:
library(quantmod)
getSymbols(Symbols=c("^MXX","^GSPC"), peridiocity = "daily",
from="2015-01-01", to="2017-10-02")
## [1] "^MXX" "^GSPC"
#Putting the adjusted prices into one dataset:
= Ad(merge(MXX,GSPC))
data = na.omit(data) data
You can create an index of $1.00 invested in each instrument instead of using the original indexes for each market. To do this, you just have to divide any value of the index by its first value.
I get the first value for each instruments:
= as.numeric(MXX$MXX.Adjusted[1])
firstmxx = as.numeric(GSPC$GSPC.Adjusted[1]) firstus
Then, create the indexes for $1.00. Just divide each value by its first value:
= data$MXX.Adjusted / firstmxx
invmxx = data$GSPC.Adjusted / firstus invus
We can merge both datasets into 1, and plot both:
<- merge(invus,invmxx)
inv plot(inv)
I can see that they move very closely, but I need to do a cointegration test.
I run a linear regression using the invus as independent as the invmxx as dependent. This make sense since the Mexican financial market used to be affected by the US market (and not the opposite):
= lm(invmxx$MXX.Adjusted ~ invus$GSPC.Adjusted )
reg1 = summary(reg1)
sreg1 sreg1
##
## Call:
## lm(formula = invmxx$MXX.Adjusted ~ invus$GSPC.Adjusted)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.06656 -0.01835 0.00244 0.01944 0.06810
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.35668 0.01383 25.79 <2e-16 ***
## invus$GSPC.Adjusted 0.69852 0.01311 53.29 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02702 on 672 degrees of freedom
## Multiple R-squared: 0.8086, Adjusted R-squared: 0.8083
## F-statistic: 2840 on 1 and 672 DF, p-value: < 2.2e-16
You can do a similar process for the 2nd period.
For both time periods, run a cointegration test and INTERPRET your results.
We will do the cointegration test. Before, I will examine the error (residuals) of this regression according to the regression equation:
We can see that Errors is basically a “scaled” distance from the Mexican index to the US index. It is a linear combination of both indexes. Then, if this “scaled” distance between the series stays within a certain range over time, then we can say that both series are cointegrated. The condition is that the errors series must be a STATIONARY series in order to conclude that both NON-STATIONARY series are cointegrated.
We can see the errors over time:
<-reg1$residuals[,1]
errorsplot(errors)
Now we run the test for cointegration:
library(tseries)
adf.test(reg1$residuals, k = 1)
##
## Augmented Dickey-Fuller Test
##
## data: reg1$residuals
## Dickey-Fuller = -3.9394, Lag order = 1, p-value = 0.01203
## alternative hypothesis: stationary
Since p-value<0.05, then we conclude that the residuals is a stationary variable, so both indexes are COINTEGRATED in this time period; we can say that between Jan 2015 and Oct 2017 the Financial Mexican Market is COINTEGRATED with the US Financial Market.
Then, we can rely on the regression results. Then, we can say that for each $1 gained in the US market, then in the Mexican market we gain on average about $69.8516229 cents. There is a positive and significance relationship between the US and the Mexican market.
NOW I DO THE SAME FOR THE SECOND PERIOD:
Downloading the stock prices:
getSymbols(Symbols=c("^MXX","^GSPC"), peridiocity = "daily",
from="2017-10-03", to="2018-02-28")
## [1] "^MXX" "^GSPC"
= merge(MXX,GSPC)
data2 = na.omit(data2) data2
I create an index of $1.00 invested in each instrument.
I get the first index of both instruments:
= as.numeric(MXX$MXX.Adjusted[1])
firstmxx2 = as.numeric(GSPC$GSPC.Adjusted[1]) firstus2
Then, I create the indexes for $1.00:
= data2$MXX.Adjusted / firstmxx2
invmxx2 = data2$GSPC.Adjusted / firstus2
invus2
<-merge(invmxx2,invus2)
inv2plot(inv2)
Unlike the previous period, in this period it seems that the both markets are not quite following each other. Let’s do the cointegration test:
Now I check whether invmxx and invus are cointegrated:
= lm(invmxx2$MXX.Adjusted ~ invus2$GSPC.Adjusted )
reg2 = summary(reg2)
sreg2 sreg2
##
## Call:
## lm(formula = invmxx2$MXX.Adjusted ~ invus2$GSPC.Adjusted)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.037454 -0.014008 -0.002567 0.017749 0.040777
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.79716 0.05928 13.447 < 2e-16 ***
## invus2$GSPC.Adjusted 0.16206 0.05640 2.874 0.00501 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02006 on 95 degrees of freedom
## Multiple R-squared: 0.07997, Adjusted R-squared: 0.07028
## F-statistic: 8.257 on 1 and 95 DF, p-value: 0.005008
The errors of this regression look as follows:
<- reg2$residuals[,1]
errors2 plot(errors2)
The errors series does not look like a stationary variable. Let’s do the Augmented Dicky-Fuller test for the errors:
library(tseries)
adf.test(reg2$residuals,k=1)
##
## Augmented Dickey-Fuller Test
##
## data: reg2$residuals
## Dickey-Fuller = -2.0866, Lag order = 1, p-value = 0.5407
## alternative hypothesis: stationary
Since the p-value of the test is >0.05, then we cannot reject the null hypothesis that states that the residuals are NON-stationary. Since the residuals is a non-stationary variable, then for this period, the Mexican market index is NOT COINTEGRATED with the US market. Also, for this period, the regression result of both indexes are not reliable!
In the cases where 2 non-stationary variables are NOT COINTEGRATED, and I want to know what type of relationship there exist between them, I can first convert the variable into STATIONARY variables, and then RUN THE REGRESSION using the stationary transformed variable.
In Finance and Economics, more than 95% of the non-stationary variables such as prices and indexes, can easily be transformed to stationary when we apply the first difference of their natural logarithm.
Remember that the first difference of the log price is the % change or return of the variable in continuously compounded returns!
Then, for this 2nd period, if I want to see how the Mexican market is related to the US financial market, I can run the regression using the daily returns for both indexes:
= diff(log(invmxx2))
rmxx = diff(log(invus2))
rus
= lm(rmxx$MXX.Adjusted ~ rus$GSPC.Adjusted )
reg3 = summary(reg3)
sreg3 sreg3
##
## Call:
## lm(formula = rmxx$MXX.Adjusted ~ rus$GSPC.Adjusted)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.022724 -0.003596 0.000494 0.003757 0.013417
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0009480 0.0006821 -1.390 0.168
## rus$GSPC.Adjusted 0.4696191 0.0901944 5.207 1.13e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.006643 on 94 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.2238, Adjusted R-squared: 0.2156
## F-statistic: 27.11 on 1 and 94 DF, p-value: 1.128e-06
Since I run a linear regression using returns, and always the returns of financial series will be stationary, then the regression result will be VALID!
I can interpret this regression as follows:
The daily returns of the Mexican index is significantly and positively related to the daily returns of the S&P500 index. For +1% change in the S&P500 daily returns, it is expected that the daily return of the Mexican index will move in about 0.4696191%.