1 General directions for this Workshop

You will work in RStudio. Create an R Notebook document (File -> New File -> R Notebook), where you have to write whatever is asked in this workshop.

At the beginning of the R Notebook write Workshop 2 - Time Series and your name (as we did in previous workshop).

You have to replicate all the steps explained in this workshop, and ALSO you have to do whatever is asked. Any QUESTION or any STEP you need to do will be written in CAPITAL LETTERS. For ANY QUESTION, you have to RESPOND IN CAPITAL LETTERS right after the question. It is STRONGLY RECOMMENDED that you write your OWN NOTES as if this were your notebook. Your own workshop/notebook will be very helpful for your further study.

You have to keep saving your .Rmd file, and ONLY SUBMIT the .html version of your .Rmd file. Pay attention in class to know how to generate an html file from your .Rmd.

2 Hand-on activity

We will work with an online dataset that contains sales data of different consumer products. The data was created using real products, but the names of the products were changed due to privacy issues.

rm(list=ls())
# To avoid scientific notation for numbers: 
options(scipen=999)

Download the excel file from a web site:

setwd("C:/Users/L03142143/Desktop/WS/WS2")
download.file("http://www.apradie.com/ec2004/salesbrands3.xlsx", "salesbrands3.xlsx", mode="wb")

This dataset has sales information of several brands of consumer products that belong to one category (for example, the category can be Cereals or Candies). The content of this dataset is very similar to a real-world category. Consumer products usually show some seasonality and trend over time. We will learn how to handle these features later.

Also, you have to install the readxl package. This package reads data from Excel files. We load this package:

library(readxl)

Now, we import the Excel file into an R dataframe:

sales_brands <- read_excel("salesbrands3.xlsx", sheet = "Resultado")

# Always familiarize with the data
head(sales_brands, 5)

## # A tibble: 5 x 15
##   date                qbrand1tons salesbrand1 qbrand2tons salesbrand2
##   <dttm>                    <dbl>       <dbl>       <dbl>       <dbl>
## 1 2015-01-01 00:00:00       107.       60781.        195.      60701.
## 2 2015-02-01 00:00:00       106.       60411.        216.      68289.
## 3 2015-03-01 00:00:00        88.8      53175.        220.      62530.
## 4 2015-04-01 00:00:00        86.7      50862.        171.      50576.
## 5 2015-05-01 00:00:00        87.2      51014.        177.      54327.
## # ... with 10 more variables: qbrand3tons <dbl>, salesbrand3 <dbl>,
## #   qbrand4tons <dbl>, salesbrand4 <dbl>, qbrand5tons <dbl>, salesbrand5 <dbl>,
## #   qbrand6tons <dbl>, salesbrand6 <dbl>, tonstotal <dbl>, salestotal <dbl>

As you can see, we have imported a monthly dataset where the variable date shows the first day of the month in a yyyy-mm-dd format.

In order to perform some time-series analysis, it is recommended to use xts objects. We will construct a xts object using the xts() method. The objects of xts class are composed by 2 objects: the core data and the index (class Date, POSIX, etc).

# Load library xts. Install the package if you haven't already.
library(xts)

# We will start by turning sales_brands into a data frame 
# because data frames are easier to manipulate
sales_brands.df <- as.data.frame(sales_brands)

# Divide the index and the core data
fecha <- sales_brands.df$date
sales_brands.df <- sales_brands.df[,-1]

# Join both objects using xts() and assign the resulting object to sales.xts
sales.xts <- xts(x = sales_brands.df, order.by = fecha)

head(sales.xts, 5)

##            qbrand1tons salesbrand1 qbrand2tons salesbrand2 qbrand3tons
## 2015-01-01      106.62    60780.72      195.46    60700.51      293.10
## 2015-02-01      105.82    60411.47      216.09    68289.15      494.46
## 2015-03-01       88.82    53174.89      220.15    62529.54      356.88
## 2015-04-01       86.69    50861.69      170.74    50575.68      291.07
## 2015-05-01       87.23    51014.23      177.17    54327.48      318.09
##            salesbrand3 qbrand4tons salesbrand4 qbrand5tons salesbrand5
## 2015-01-01    75196.35      185.91    48228.51       28.94     2295.50
## 2015-02-01   105010.90      149.30    39922.14       24.66     1973.43
## 2015-03-01    90902.84      144.65    39788.22       23.75     1903.24
## 2015-04-01    72853.34      161.43    42692.59       21.00     1684.28
## 2015-05-01    75277.10      141.09    37167.52       23.44     1883.41
##            qbrand6tons salesbrand6 tonstotal salestotal
## 2015-01-01      338.38    55561.26   1148.41   302762.8
## 2015-02-01      284.52    51469.74   1274.85   327076.8
## 2015-03-01      251.34    50395.48   1085.59   298694.2
## 2015-04-01      230.29    49124.61    961.22   267792.2
## 2015-05-01      250.45    46906.63    997.47   266576.4

# Do you notice any difference between data frames and xts objects?

#delete variable fecha
rm(fecha)

3 Stationary vs Non-stationary

We can have an overview of the behaviour of sales of these consumer products using the plot command. Let’s focus on brand 2 and 6. To see volume sales (the number of tons sold) of brand 2:

plot(sales.xts$qbrand2tons)

It seems that this product has a growing trend over time, except for the last year.

We can graph consumer brand 6

plot(sales.xts$qbrand6tons)

This brand has a clear growing trend over time. As we can see, many consumer products have a growing or a declining trend. And other products might seem to be stationary over time, with sales over an average over time.

A series with a growing or declining trend is non-stationary. Most of the variables that represent sales of any product can be treated as non-stationary. Then, a rule of thumb when we work with business variables such as volume sales, value sales, cost of good sold, etc is to transform the series into a stationary series.

When you work with sales data, I strongly recommend to create the first difference of the logarithm of the variable. This is actually its percentage change (in continuously compounded). If you do this process to most of real-world sales variable the result will be stationary most of the time (about 95% of the variables). The other reason why it is recommended to do this transformation is that the first difference of the log of a variable is actually the % change of the variable from one period to the next (in continuously compounded percentage). Then, we can do this transformation as follows:

We first create the natural log of each sales variable (in tons). We can apply the log function to the xts object to get the log of all series:

ln_sales <- log(sales.xts)

The new object has the same structure as sales.xts but the columns contain the natural logarithm of the data. This is due to R vectorization, a key feature of R.

Now, I can work with the first difference month-by-month using the diff() function for each variable. For example, to graph the first difference of the log of volume sales for both brands :

plot(diff(ln_sales$qbrand2tons), type = "l", col = "red")

lines(diff(ln_sales$qbrand6tons), type = "l", col = "blue")

The first difference of the log value in R is diff(ln_sales$qbrand2tons). The diff() function generates on the fly this first difference as:

diff(ln_sales$qbrand2tons) = ln_sales$qbrand2tons - lag(ln_sales$qbrand2tons, 1)

Remember that the dollar sign operator ($) is used to access/refer to the columns of a data frame or xts object.

We can see now that the 2 series look like stationary since their means tend to be in the same level.

R performs the first difference calculation temporarily to do a graph, and then, this variable is not saved in the dataset or environment. In other words, we do not need to generate the first difference as another variable (column), but we can if we need to.

Another transformation with monthly data is annual percentage change. In this case, this is the difference between the log value of the series and its value 12 periods ago. We can check this difference using the diff() function with the lag argument set to 12:

plot(diff(ln_sales$qbrand2tons, lag = 12), type = "l", col = "red")

lines(diff(ln_sales$qbrand6tons, lag = 12), type = "l", col = "blue")

In this case, the series for both brands do not look stationary. But we need to do a statistical test to check whether a series can be treated as stationary or not. Some series might not look stationary for our eyes, but we need to run statistical tests to decide whether to treat the series stationary or not. There is a statistical way to evaluate whether these series are stationary. We need a statistical confirmation in addition to the visual tools. The Dicky-Fuller test is one of the most used statistical tests for stationary. This test is called a unit root test. The Dicky Fuller test assumes that the series is non stationary. In other words, it assumes that the series follows a random walk with a unit root (phi1=1). We can run this test in R as.

You need to install the package tseries to do the following analysis.

library(tseries)
adf.test(na.omit(diff(ln_sales$qbrand2tons, lag = 12)), alternative = "stationary")

## 
##  Augmented Dickey-Fuller Test
## 
## data:  na.omit(diff(ln_sales$qbrand2tons, lag = 12))
## Dickey-Fuller = -3.574, Lag order = 3, p-value = 0.04701
## alternative hypothesis: stationary

The adf.test() function from the tseries package will do a Dickey-Fuller test if we set lags (k) equal to 0. The k parameter is equal to the number of $\phi$ in our ARMA model. Use the default value of k. Use ?adf.test to read about this function.

This is another way to do the same Dickey Fuller test:

s12.lnbrand2tons <- na.omit(diff(ln_sales$qbrand2tons, lag = 12))
adf.test(s12.lnbrand2tons, alternative = "stationary")

## 
##  Augmented Dickey-Fuller Test
## 
## data:  s12.lnbrand2tons
## Dickey-Fuller = -3.574, Lag order = 3, p-value = 0.04701
## alternative hypothesis: stationary

The Dicky Fuller test reports a p-value less than 0.05, indicating that we can reject the null hypothesis that states a unit root for these series ($\phi_1 = 1$). In other words, for the series with a Dicky-Fuller p-value less than 0.05 we can say that there is statistical evidence to conclude that the differenced series (s12.lnbrand2tons) is stationary.

Then for the variable s12.lnbrand2tons we can say that there is statistical evidence to say that the series is stationary.

If we try the same for brand 6 we will find that its seasonal difference (s12.lnbrand6tons) is also stationary. Try this by yourself.

For the case of monthly times series variables like sales of products or sales of a company, I strongly recommend to work with the seasonal difference of the log of sales (s12.lnbrand2tons), which is the annual percentage change month by month. The reason is because most of consumer products have a seasonal pattern. We can do this ONLY if the seasonal difference of the log is STATIONARY.

For these 2 products, then, I will use the seasonal difference of the log of sales.

4 Spurious regression

When we want to examine the relationship between two non-stationary variables by running a regression model, we have the risk to end up with a non-valid - spurious - regression. Before we understand why a regression model can be spurious, we start with and example using 2 real-world variables.

Install the wbstats package using RStudio. This package was written by the World Bank, and it has functions to download data of all countries around the world. The function wb downloads hundreds of time-series variables that the World Bank tracks for all countries. If you want to know more about this package, you can check its documentation in the cran web site (https://cran.r-project.org/web/packages/wbstats/wbstats.pdf)

We will download the infant mortality rate and the exports for Mexico. It is supposed that these variables have nothing in common, so we would not expect a significant relationship.

library(wbstats)

## Warning: package 'wbstats' was built under R version 4.0.5

# Mexico - Infant mortality
infantm<-wb_data(indicator = c("SP.DYN.IMRT.IN"), 
      country="MEX", start_date = 1980, end_date = 2020)

# Mexico - Export value
exports<-wb_data(indicator = c("TX.VAL.MRCH.XD.WD"), 
      country="MEX", start_date = 1980, end_date = 2020)

The wb function brings a data frame with the requested data. We can plot the data to have an idea of these 2 variables:

plot.ts(infantm$SP.DYN.IMRT.IN)

plot.ts(exports$TX.VAL.MRCH.XD.WD)

Now run a regression using these series. Report the result of the regression. Did you find significant relationship? is your result what you expected?

m1 <- lm(exports$TX.VAL.MRCH.XD.WD ~ infantm$SP.DYN.IMRT.IN)
summary(m1)

## 
## Call:
## lm(formula = exports$TX.VAL.MRCH.XD.WD ~ infantm$SP.DYN.IMRT.IN)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.331 -39.723  -7.456  37.848  77.194 
## 
## Coefficients:
##                        Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)            275.1783    16.0408   17.16 < 0.0000000000000002 ***
## infantm$SP.DYN.IMRT.IN  -6.1847     0.5327  -11.61    0.000000000000046 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41.65 on 38 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.7801, Adjusted R-squared:  0.7743 
## F-statistic: 134.8 on 1 and 38 DF,  p-value: 0.00000000000004602

Research about spurious regression, and explain in which cases you can end up with a spurious regression

Are the series cointegrated? Is the regression spurious or valid? Run the corresponding test

library(tseries)
adf.test(m1$residuals, k = 0)

## 
##  Augmented Dickey-Fuller Test
## 
## data:  m1$residuals
## Dickey-Fuller = -1.5881, Lag order = 0, p-value = 0.7351
## alternative hypothesis: stationary

5 Cointegration between Financial series

Using daily data of Mexico IPCyC market index and the S&P 500, examine whether two series are cointegrated. Generate an index for each instrument. To do these indexes, create a variable that represents how 1.00 peso or 1.00 dollar invested in each instrument would be changing over time.

From Jan 1, 2015 to Oct 2, 2017.
From Oct 3, 2017 to Feb 28, 2018

For both cases, run a cointegration test and INTERPRET your results.

6 Workshop submission

The grade of this Workshop will be the following:

Complete (100%): If you submit an ORIGINAL and COMPLETE HTML file with all the activities, with your notes, and with your OWN RESPONSES to questions
Incomplete (75%): If you submit an ORIGINAL HTML file with ALL the activities but you did NOT RESPOND to the questions and/or you did not do all activities and respond to some of the questions.
Very Incomplete (10%-70%): If you complete from 10% to 75% of the workshop or you completed more but parts of your work is a copy-paste from other workshops.
Not submitted (0%)

Remember that you have to submit your .html file through Canvas BEFORE NEXT CLASS.

Workshop 2, Time Series

Alberto Dorantes, Ph.D. & Felipe A. Pérez S., Ph.D.

Apr 28, 2021