1 General directions for this Workshop

You will work in RStudio. You can go to the RStudio web site and follow the instructions to download the R software and the free version of RStudio. It is strongly recommended to have the latest versions of R and RStudio. Once you are in RStudio, do the following.

Create an R Notebook document (File -> New File -> R Notebook), where you will write everything that is asked in this workshop. More specifically, you have to:

  • Replicate all the R Code along with its output.

  • Respond to the challenges of the workshop and any other questions asked in the workshop.

Any QUESTION or any INTERPRETATION you need to do will be written in CAPITAL LETTERS. For ANY QUESTION or INTERPRETATION, you have to RESPOND IN CAPITAL LETTERS right after the question.

  • It is STRONGLY RECOMMENDED that you write your OWN NOTES in your Workshop as if this were your personal notebook. Your own workshop/notebook will be very helpful for your further study.

You have to keep saving your .Rmd file, and ONLY SUBMIT the .html version of your .Rmd file. Pay attention in class to learn how to generate an html file from your .Rmd.

2 Stationarity of time series variables

2.1 Introduction

In our previous workshop we learned about the random walk model applied to finance. We learned that the logarithm of stock prices behaves similarly to a random walk with a positive drift (\(\phi_0>0\)).

As we learned, a random walk model can be expressed in the following equation:

\[ Y_t = \phi_0 + Y_{t-1} + \varepsilon_t \]

Following this model, we ran a Monte Carlo simulation for the logarithm of the S&P500, and we found that most of the time the simulated random walk series behave similarly to the real log of the S&P500 index.

We learned that if \(\phi_0>0\), then the series will have a growing trend over time. We also learned that the higher the standard deviation of the daily shock, the more likely it is that the series will move up and down radically. The standard deviation of the shock is actually the daily volatility of the returns of the series.
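To make this concrete, here is a minimal simulation sketch (my own illustration, with arbitrary drift and volatility values):

set.seed(123)
# Simulate 250 days of a random walk with positive drift:
phi0 <- 0.0005                  # positive drift (illustrative value)
sigma <- 0.01                   # daily volatility of the shock (illustrative value)
shocks <- rnorm(250, mean = 0, sd = sigma)
Y <- cumsum(phi0 + shocks)      # Y_t = phi0 + Y_(t-1) + shock_t, starting at 0
plot(Y, type = "l")

If you re-run this sketch with a larger sigma, the simulated series will move up and down more radically.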

An important insight about the random walk model is that the random walk model is a non-stationary series. What does this mean?

In short, a stationary series is one whose mean is about the same in any time period (no matter which time periods we select), and whose standard deviation is also about the same for any time period. There is another condition for stationarity related to the auto-covariance of the series, but that condition concerns strongly stationary series, which we do not cover in this class.
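In symbols, the 2 conditions we use in this class are that the mean and the standard deviation of the series do not depend on time:

\[ E[Y_t] = \mu \quad \text{and} \quad SD(Y_t) = \sigma \quad \text{for all } t \]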

Therefore, the random walk model will always behave as a non-stationary series.

Let’s analyze what part of the random walk equation makes the series non-stationary. We can re-write the previous random walk equation as follows:

\[ Y_t = \phi_0 + \phi_1 Y_{t-1} + \varepsilon_t \]

If the coefficient \(\phi_1=1\), then this equation becomes the random walk equation. When \(\phi_1=1\), the \(Y_t\) series becomes non-stationary.

Interestingly, if \(|\phi_1|<1\), then \(Y_t\) becomes a stationary series.
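Here is a quick illustrative comparison (a sketch of my own, with an arbitrary value \(\phi_1=0.5\)): feeding the same shocks into both equations shows a series that keeps returning to its mean next to one that wanders away.

set.seed(123)
shocks <- rnorm(250)
ar1 <- numeric(250)    # phi1 = 0.5 -> stationary
rw  <- numeric(250)    # phi1 = 1   -> non-stationary (random walk)
for (t in 2:250) {
  ar1[t] <- 0.5 * ar1[t - 1] + shocks[t]
  rw[t]  <- 1.0 * rw[t - 1]  + shocks[t]
}
plot(rw, type = "l", col = "blue", ylab = "Y")
lines(ar1, col = "red")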

Let’s work with an example.

2.2 Data collection and data management

We will work with an online dataset that contains sales data of different consumer products.

Download the Excel file from the web site:

download.file("http://www.apradie.com/ec2004/salesbrands3.xlsx", "salesbrands3.xlsx", mode="wb")

This dataset has sales information of several brands of consumer products that belong to one category (for example, the category can be Cereals or Candies). The content of this dataset is very similar to a real-world category. Consumer products usually show some seasonality and trend over time. We will learn how to handle these features later.

You have to install the readxl package. This package reads data from Excel files. We load this package:

library(readxl)

Now, we import the Excel file into an R dataframe:

sales_brands <- read_excel("salesbrands3.xlsx", sheet = "Resultado")

# Always familiarize yourself with the data
head(sales_brands, 5)
## # A tibble: 5 x 15
##   date                qbrand1tons salesbrand1 qbrand2tons salesbrand2
##   <dttm>                    <dbl>       <dbl>       <dbl>       <dbl>
## 1 2015-01-01 00:00:00       107.       60781.        195.      60701.
## 2 2015-02-01 00:00:00       106.       60411.        216.      68289.
## 3 2015-03-01 00:00:00        88.8      53175.        220.      62530.
## 4 2015-04-01 00:00:00        86.7      50862.        171.      50576.
## 5 2015-05-01 00:00:00        87.2      51014.        177.      54327.
## # ... with 10 more variables: qbrand3tons <dbl>, salesbrand3 <dbl>,
## #   qbrand4tons <dbl>, salesbrand4 <dbl>, qbrand5tons <dbl>, salesbrand5 <dbl>,
## #   qbrand6tons <dbl>, salesbrand6 <dbl>, tonstotal <dbl>, salestotal <dbl>

As you can see, we have imported a monthly dataset where the variable date shows the first day of the month in a yyyy-mm-dd format.

In order to perform time-series analysis, it is recommended to use xts objects. We will construct an xts object using the xts() function. Objects of the xts class are composed of 2 parts: the core data and the index (of class Date, POSIXct, etc.).

# Load library xts. Install the package if you haven't already.
library(xts)

# We save the date column in a separate vector:
month <- sales_brands$date

# We delete the date column from the sales data. The date is the column 1 of the dataset: 
sales_brands <- sales_brands[,c(-1)]
# As you see, when you specify c(-1) in the column index that means that you do not want column 1! 

# Now we create an xts dataset using the previous datasets: 
sales.xts <- xts(x = sales_brands, order.by = month)

head(sales.xts, 5)
##            qbrand1tons salesbrand1 qbrand2tons salesbrand2 qbrand3tons
## 2015-01-01      106.62    60780.72      195.46    60700.51      293.10
## 2015-02-01      105.82    60411.47      216.09    68289.15      494.46
## 2015-03-01       88.82    53174.89      220.15    62529.54      356.88
## 2015-04-01       86.69    50861.69      170.74    50575.68      291.07
## 2015-05-01       87.23    51014.23      177.17    54327.48      318.09
##            salesbrand3 qbrand4tons salesbrand4 qbrand5tons salesbrand5
## 2015-01-01    75196.35      185.91    48228.51       28.94     2295.50
## 2015-02-01   105010.90      149.30    39922.14       24.66     1973.43
## 2015-03-01    90902.84      144.65    39788.22       23.75     1903.24
## 2015-04-01    72853.34      161.43    42692.59       21.00     1684.28
## 2015-05-01    75277.10      141.09    37167.52       23.44     1883.41
##            qbrand6tons salesbrand6 tonstotal salestotal
## 2015-01-01      338.38    55561.26   1148.41   302762.8
## 2015-02-01      284.52    51469.74   1274.85   327076.8
## 2015-03-01      251.34    50395.48   1085.59   298694.2
## 2015-04-01      230.29    49124.61    961.22   267792.2
## 2015-05-01      250.45    46906.63    997.47   266576.4
# Do you notice any difference between data frames and xts objects?

#delete month variable
rm(month)

2.3 Stationary vs Non-stationary

We can have an overview of the behavior of sales of these consumer products using the plot command. Let’s focus on brands 2 and 6. To see volume sales (the number of tons sold) of brand 2:

plot(sales.xts$qbrand2tons)

It seems that this product has a growing trend over time, except for the last year.

We can plot sales of brand 6:

plot(sales.xts$qbrand6tons)

This brand has a clear growing trend over time. As we can see, many consumer products have a growing or a declining trend. Other products might seem to be stationary over time, with sales fluctuating around an average.

A series with a growing or declining trend is non-stationary. Most variables that represent sales of a product can be treated as non-stationary. Thus, a rule of thumb when we work with business variables such as volume sales, value sales, stock prices, cost of goods sold, etc., is to transform the series into a stationary series.

When you work with sales data, I strongly recommend transforming the sales data by applying the natural logarithm. With the log of sales, taking the first difference of the series gives the monthly % change (continuously compounded). This first difference of the log will be a stationary series most of the time when you analyze real sales data.
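As a quick numeric check (with illustrative numbers, not the workshop data): if volume sales go from 100 to 105 tons, the first difference of the log is log(105/100) ≈ 0.0488, a monthly growth of about 4.88% in continuously compounded terms.

s <- c(100, 105, 103)    # illustrative monthly sales levels
diff(log(s))             # cc growth rates: log(s_t / s_(t-1))
log(s[2:3] / s[1:2])     # the same numbers computed directly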

Now we calculate the natural logarithm for the whole dataset:

ln_sales <- log(sales.xts)

The new object has the same structure as sales.xts, but the columns now contain the natural logarithm of each original column (volume and value sales for all brands).

Now, we can work with the first difference month-by-month using the diff() function for each variable. For example, to graph the first difference of the log of volume sales for both brands:

sales_monthly_growth_q2 = diff(ln_sales$qbrand2tons)
sales_monthly_growth_q6 = diff(ln_sales$qbrand6tons)

plot(sales_monthly_growth_q2, type = "l", col = "red")

lines(sales_monthly_growth_q6, type = "l", col = "blue")

The diff() function calculates the first difference as:

diff(ln_sales$qbrand2tons) = ln_sales$qbrand2tons - lag(ln_sales$qbrand2tons, 1)

Remember that the dollar sign operator ($) is used to access/refer to the columns of a data frame or xts dataset.
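You can verify this equivalence yourself (a quick sanity check; note that for xts objects lag() returns the previous period's value, via lag.xts):

head(diff(ln_sales$qbrand2tons), 3)
head(ln_sales$qbrand2tons - lag(ln_sales$qbrand2tons, 1), 3)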

We can see that the 2 series look stationary, since their means tend to stay at about the same level in any time period.

Another transformation with monthly data is the annual percentage change, which can be calculated as the difference between the log value of the series and its log value 12 months before. We can calculate this difference using the diff() function with the lag argument set to 12:

sales_annual_growth_q2 = diff(ln_sales$qbrand2tons,lag = 12)
sales_annual_growth_q6 = diff(ln_sales$qbrand6tons,lag = 12)

plot(sales_annual_growth_q2, type = "l", col = "red")

lines(sales_annual_growth_q6, type = "l", col = "blue")

In this case, it is difficult to evaluate graphically whether both series are stationary. Some series might not look stationary to our eyes, so we need to run a statistical test to decide whether each series can be treated as stationary or not.

There is a statistical way to evaluate whether these series are stationary; we need statistical confirmation in addition to the visual tools. The Dickey-Fuller test is one of the most widely used statistical tests for stationarity. This test is called a unit-root test because it assumes that the series is non-stationary; in other words, it assumes that the series follows a random walk with a unit root (\(\phi_1=1\)).

You need to install the package tseries to do the following analysis.

We can run the Dickey-Fuller test for the sales annual % growth of product 2 as follows:

library(tseries)

adf.test(na.omit(sales_annual_growth_q2), alternative = "stationary")
## 
##  Augmented Dickey-Fuller Test
## 
## data:  na.omit(sales_annual_growth_q2)
## Dickey-Fuller = -3.574, Lag order = 3, p-value = 0.04701
## alternative hypothesis: stationary

The null hypothesis of the Dickey-Fuller test is that the series is non-stationary. More precisely, the null hypothesis of this test is that \(\phi_1=1\), which implies that the series is non-stationary. This is the reason why the Dickey-Fuller test is called a unit-root test.

As with any statistical test, if the p-value of the test is less than 0.05, then we have statistical evidence (at the 95% confidence level) to REJECT the null hypothesis.

The Dickey-Fuller test reports a p-value less than 0.05, indicating that we can reject the null hypothesis that the series has a unit root (\(\phi_1 = 1\)). In other words, since the p-value is less than 0.05, we can say that there is statistical evidence to conclude that the annual % change of sales of product 2 can be treated as stationary.

3 CHALLENGE 1

RUN A DICKEY-FULLER TEST FOR THE ANNUAL % DIFFERENCE OF BRAND 6, AND PROVIDE AN INTERPRETATION.
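As a starting point (a sketch only; the interpretation is up to you), the test follows the same pattern as for brand 2:

adf.test(na.omit(sales_annual_growth_q6), alternative = "stationary")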

4 Spurious regression

When we want to examine the relationship between two non-stationary variables by running a regression model, we run the risk of ending up with a non-valid - spurious - regression. Before we understand why a regression model can be spurious, we start with an example using 2 real-world variables.

Install the wbstats package using RStudio. This package provides functions to download World Bank data for all countries around the world. The function wb_data downloads any of the hundreds of time-series indicators that the World Bank tracks for all countries. If you want to know more about this package, you can check its documentation on the CRAN web site (https://cran.r-project.org/web/packages/wbstats/wbstats.pdf)

We will download the infant mortality rate and the export value index for Mexico. These variables are supposed to have nothing in common, so we would not expect a significant relationship.

library(wbstats)
## Warning: package 'wbstats' was built under R version 4.1.3
# Mexico - Infant mortality
infantm<-wb_data(indicator = c("SP.DYN.IMRT.IN"), 
      country="MEX", start_date = 1980, end_date = 2020)

# Mexico - Export value
exports<-wb_data(indicator = c("TX.VAL.MRCH.XD.WD"), 
      country="MEX", start_date = 1980, end_date = 2020)

The wb_data function returns a data frame with the requested data. We can plot the data to get an idea of these 2 variables:

plot.ts(infantm$SP.DYN.IMRT.IN)

plot.ts(exports$TX.VAL.MRCH.XD.WD)

Now run a regression using these series. Report the result of the regression. Did you find a significant relationship? Is your result what you expected?

m1 <- lm(exports$TX.VAL.MRCH.XD.WD ~ infantm$SP.DYN.IMRT.IN)
summary(m1)
## 
## Call:
## lm(formula = exports$TX.VAL.MRCH.XD.WD ~ infantm$SP.DYN.IMRT.IN)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.842 -41.039  -4.786  39.072  73.624 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            280.3572    15.5425   18.04  < 2e-16 ***
## infantm$SP.DYN.IMRT.IN  -6.3166     0.5205  -12.13 8.14e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41.48 on 39 degrees of freedom
## Multiple R-squared:  0.7906, Adjusted R-squared:  0.7852 
## F-statistic: 147.3 on 1 and 39 DF,  p-value: 8.139e-15

4.1 CHALLENGE 2 - QUESTION:

RESEARCH ABOUT SPURIOUS REGRESSION. In your own words, briefly explain in which cases you can end up with a spurious regression.

5 Cointegration

When two non-stationary series are strongly related in the long run, then we say that both series are cointegrated. The question is how strong this long-term relationship needs to be in order to consider 2 non-stationary series as cointegrated.

To run a cointegration test on 2 non-stationary series, we need to do the following:

  1. Run a linear regression of the 2 non-stationary series

  2. Get the residuals of this regression

  3. Do a Dickey-Fuller test to check whether the residuals (errors) of the previous regression form a stationary series. If the p-value<0.05, then we can say that there is statistical evidence to consider the 2 non-stationary series as cointegrated.

If 2 non-stationary series are NOT cointegrated, then their linear regression becomes a spurious regression. In other words, when 2 non-stationary series are not cointegrated, the result of a linear regression using these 2 variables will not be reliable or valid.
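Here is a minimal sketch of the 3 steps using simulated data (my own illustration: 2 random walks built from the same stochastic trend, so they are cointegrated by construction):

library(tseries)
set.seed(1)
trend <- cumsum(rnorm(200))                 # common stochastic trend
x <- trend + rnorm(200, sd = 0.5)           # non-stationary series 1
y <- 2 * trend + rnorm(200, sd = 0.5)       # non-stationary series 2
step1 <- lm(y ~ x)                          # 1. regression of the 2 series
res <- step1$residuals                      # 2. residuals of the regression
adf.test(res, alternative = "stationary")   # 3. unit-root test on the residuals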

Are the infant mortality and export series cointegrated? Is the regression spurious or valid? Run and interpret the corresponding test.

Here is some code to help you respond to this question.

library(tseries)
# The errors/residuals of the regression are stored in the residuals
#   attribute of the m1 model:
adf.test(m1$residuals, alternative = "stationary")
## 
##  Augmented Dickey-Fuller Test
## 
## data:  m1$residuals
## Dickey-Fuller = -2.5699, Lag order = 3, p-value = 0.3494
## alternative hypothesis: stationary

6 CHALLENGE 3: Cointegration between Financial series

Using daily data of the Mexican IPCyC market index and the S&P 500, examine whether the two series are cointegrated. Generate an index for each instrument: create a variable that represents how 1.00 peso or 1.00 dollar invested in each instrument would change over time.

  1. From Jan 1, 2015 to Oct 2, 2017.

  2. From Oct 3, 2017 to Feb 28, 2018

HINT: to create the $1.00 index for the first period, you can follow this process:

Download and merge the stock prices:

library(quantmod)
getSymbols(Symbols=c("^MXX","^GSPC"), periodicity = "daily",
                     from="2015-01-01", to="2017-10-02")
## [1] "^MXX"  "^GSPC"
#Putting the adjusted prices into one dataset:
data = Ad(merge(MXX,GSPC))
data = na.omit(data)

You can create an index of $1.00 invested in each instrument instead of using the original indexes for each market. To do this, you just have to divide each value of the index by its first value.

I get the first value for each instrument:

firstmxx = as.numeric(MXX$MXX.Adjusted[1])
firstus = as.numeric(GSPC$GSPC.Adjusted[1])

Then, create the indexes for $1.00. Just divide each value by its first value:

invmxx = data$MXX.Adjusted / firstmxx
invus = data$GSPC.Adjusted / firstus 

We can merge both datasets into 1, and plot both:

inv <- merge(invus,invmxx)
plot(inv)

You can do a similar process for the 2nd period.
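For example, here is a sketch for the second period (same steps, with the dates from the challenge):

getSymbols(Symbols=c("^MXX","^GSPC"), periodicity = "daily",
                     from="2017-10-03", to="2018-02-28")
data2 = na.omit(Ad(merge(MXX,GSPC)))
invmxx2 = data2$MXX.Adjusted / as.numeric(data2$MXX.Adjusted[1])
invus2 = data2$GSPC.Adjusted / as.numeric(data2$GSPC.Adjusted[1])
inv2 <- merge(invus2, invmxx2)
plot(inv2)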

For both time periods, run a cointegration test and INTERPRET your results.

7 Workshop submission

The grade of this Workshop will be the following:

  • Complete (100%): If you submit an ORIGINAL and COMPLETE HTML file with all the activities, with your notes, and with your OWN RESPONSES to questions
  • Incomplete (75%): If you submit an ORIGINAL HTML file with ALL the activities but you did NOT RESPOND to the questions, and/or you did not do all the activities and responded to only some of the questions.
  • Very Incomplete (10%-70%): If you complete from 10% to 75% of the workshop, or you completed more but parts of your work are a copy-paste from other workshops.
  • Not submitted (0%)

Remember that you have to submit your .html file through Canvas BEFORE NEXT CLASS.