Abstract
The main purpose of this workshop is to illustrate the foundations of Autoregressive Moving Average (ARMA) models. We will apply the ARMA model to a dataset of sales of real consumer products. You will work in RStudio.
At the beginning of the R Notebook write Workshop 2 - Financial Econometrics II and your name (as we did in previous workshop).
Create an R Notebook document (File -> New File -> R Notebook), where you have to write whatever is asked in this workshop. More specifically, you have to:
Replicate all the R Code along with its output.
You have to do whatever is asked in the workshop. This can include responses to specific questions and/or exercises/challenges.
Any QUESTION or any STEP you need to do will be written in CAPITAL LETTERS. For ANY QUESTION, you have to RESPOND IN CAPITAL LETTERS right after the question.
You have to keep saving your .Rmd file, and ONLY SUBMIT the .html version of your .Rmd file. Pay attention in class to know how to generate an html file from your .Rmd.
The main purpose of this workshop is to illustrate what Autoregressive Moving Average (ARMA) models are and how they are constructed. This methodology is part of the Box-Jenkins models for time series, which provide a series of tools to understand past behavior and predict future values of time series. Given a time series \(Y_{t}\), the ARMA model can be denoted as ARMA(p,q), where p is the number of autoregressive terms and q refers to the number of moving-average terms.
In general, autoregressive models assume that there is some kind of autocorrelation phenomenon in the time series. The concept of autocorrelation refers to the correlation of a random variable with its past and future values. This is the building block behind these models.
In time-series Econometrics we can talk about autocorrelation in the dependent variable (also called the endogenous variable). This means that the future values of an economic or financial variable can be explained, at least in part, by its previous values. This is the foundation of what econometricians call an Autoregressive Model, also denoted as an AR model. The autoregressive notation of the stochastic model defined by \(Y_{t}\) can be described as follows:
\[ Y_{t}=\phi_{0}+\phi_{1}Y_{t-1}+\phi_{2}Y_{t-2}+...+\phi_{p}Y_{t-p}+\epsilon_{t} \]
where \(\phi_{1},...,\phi_{p}\) are the parameters of the model and \(\epsilon_{t}\) refers to the error term or random shock. We can express the same equation as follows:
\[ Y_{t}=\phi_{0}+\sum_{k=1}^{p}\phi_{k}Y_{t-k}+\epsilon_{t} \]
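To build intuition about what an AR process looks like, here is a minimal sketch (not part of the workshop data) that simulates an AR(2) series using the arima.sim() function from base R. The coefficients 0.6 and 0.2 are arbitrary illustrative values:
# Simulate 200 observations of an AR(2) process with phi1 = 0.6 and phi2 = 0.2
set.seed(123)
ar_sim <- arima.sim(model = list(ar = c(0.6, 0.2)), n = 200)
plot(ar_sim, main = "Simulated AR(2) series")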
In order to apply this model, we have to show that the series \(Y_{t}\) is stationary. We will learn later how to test statistically whether a series can be considered stationary. If the series is not stationary, we can calculate its first difference, and most of the time the resulting series will be stationary.
In linear regression models, the method used to estimate the regression coefficients is Ordinary Least Squares (OLS). For ARMA models, the preferred method to estimate the autoregressive coefficients is Maximum Likelihood (ML). The ML method works with iterations, while OLS has an analytic solution (in other words, a formula). Fortunately, R can do this estimation for you.
In a traditional linear regression model using time-series variables, we assume that the errors are not correlated with each other. If this assumption is violated (the errors are autocorrelated), we can use autoregressive models that include autoregressive terms so that the remaining errors are uncorrelated (independent errors, also called white noise).
Now I will explain what a moving-average model is.
The AR(p) model is a process with long-term memory. Why? Because a shock today will have an impact on all future values of the series. Reading the mathematical equation, this can be interpreted as follows. The value of the series tomorrow depends on the value of today, after applying the filter \(\phi_1\). Then, the value of tomorrow already has information about the value of today. And the value of tomorrow will impact the value of 2 days from now. Therefore, the value of today will have an impact on the whole future of the series.
However, the impact of today's value will diminish exponentially according to \((\phi_1)^N\), where N is the number of periods in the future. Since \(|\phi_1|<1\), the larger the exponent N (that is, the further the period in the future), the smaller the effect.
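The following minimal sketch illustrates this decay numerically, assuming a hypothetical \(\phi_1 = 0.6\):
# Effect of a shock today on the series 1, 2, ..., 12 periods ahead, with phi1 = 0.6
phi1 <- 0.6
round(phi1^(1:12), 4)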
If we want to model a series with a short-term memory, then we have to think in a different equation:
\[ Y_{t}=\theta_{0}+\theta_{1}\epsilon_{t-1}+\theta_{2}\epsilon_{t-2}+...+\theta_{q}\epsilon_{t-q}+\epsilon_{t} \]
This is called a “Moving-Average” model with q terms. In this case, the value of today is not directly affected by its own previous values, but by past random shocks. In the case of q=1, this process does not have long memory since the value of today is only affected by the shock of yesterday.
The more q terms included in the model, the longer its memory. In most cases, real-world series only require a few q terms.
In the MA(q) model, \(\theta_{0}\), … \(\theta_{q}\) are the parameters of the model, \(\epsilon_{t-k}\) refers to past external errors/shocks, and the current external shock is defined as \(\epsilon_{t}\), so:
\[ Y_{t}=\theta_{0}+\epsilon_{t}+\sum_{k=1}^{q}\theta_{k}\epsilon_{t-k} \]
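As a minimal sketch (again with an arbitrary illustrative coefficient, \(\theta_1 = 0.7\)), we can simulate an MA(1) process with arima.sim() and look at its autocorrelations; only the most recent shock carries over, so the memory is short:
# Simulate 200 observations of an MA(1) process with theta1 = 0.7
set.seed(123)
ma_sim <- arima.sim(model = list(ma = 0.7), n = 200)
acf(ma_sim)  # the autocorrelation should cut off after lag 1, as expected for an MA(1)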
We can interpret an MA model as a random process where the current value of the series is influenced by the current random shock, and by some percentages of previous random q shocks/errors. The percentages are actually the values of the coefficients \(\theta\).
In Finance, when we want to model the behavior of market returns, we know that there are thousands of events that affect how the market returns move today. These thousands of events are basically summarized in a “random shock” whose magnitude and direction (positive or negative effect) nobody knows in advance.
Now I will put together both the AR and MA models into one, the ARMA model.
Now we can combine both AR(p) terms and MA(q) terms into ONE model, that is called ARMA(p,q):
\[ Y_{t}=\phi_{0}+\sum_{k=1}^{p}\phi_{k}Y_{t-k}+ \sum_{k=1}^{q}\theta_{k}\epsilon_{t-k} + \epsilon_{t} \]
We combine AR and MA terms to model Y when Y exhibits both short-term and long-term memory. The long-term memory is modeled with the AR terms, while the short-term memory is modeled with the MA terms. We will practice with an example to illustrate this theoretical model.
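Here is a minimal sketch (with arbitrary illustrative coefficients) that simulates an ARMA(1,1) process combining one AR term and one MA term:
# Simulate 200 observations of an ARMA(1,1) process with phi1 = 0.6 and theta1 = 0.4
set.seed(123)
arma_sim <- arima.sim(model = list(ar = 0.6, ma = 0.4), n = 200)
plot(arma_sim, main = "Simulated ARMA(1,1) series")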
We will work with an online dataset that contains sales data of different consumer products. The data was created using real products, but the names of the products were changed due to privacy issues.
Download the excel file from a web site:
download.file("http://www.apradie.com/ec2004/salesbrands3.xlsx", "salesbrands3.xlsx", mode="wb")
This dataset has sales information of several brands of consumer products that belong to one category (for example, the category can be Cereals or Candies). The content of this dataset is very similar to a real-world category. Consumer products usually show some seasonality and trend over time. We will learn how to handle these features later.
Also, you have to install the readxl package. This package reads data from Excel files. We load this package:
library(readxl)
Now, we import the Excel file into an R dataframe:
sales_brands <- read_excel("salesbrands3.xlsx", sheet = "Resultado")
# Always familiarize yourself with the data
head(sales_brands, 5)
## # A tibble: 5 x 15
## date qbrand1tons salesbrand1 qbrand2tons salesbrand2
## <dttm> <dbl> <dbl> <dbl> <dbl>
## 1 2015-01-01 00:00:00 107. 60781. 195. 60701.
## 2 2015-02-01 00:00:00 106. 60411. 216. 68289.
## 3 2015-03-01 00:00:00 88.8 53175. 220. 62530.
## 4 2015-04-01 00:00:00 86.7 50862. 171. 50576.
## 5 2015-05-01 00:00:00 87.2 51014. 177. 54327.
## # ... with 10 more variables: qbrand3tons <dbl>, salesbrand3 <dbl>,
## # qbrand4tons <dbl>, salesbrand4 <dbl>, qbrand5tons <dbl>, salesbrand5 <dbl>,
## # qbrand6tons <dbl>, salesbrand6 <dbl>, tonstotal <dbl>, salestotal <dbl>
As you can see, we have imported a monthly dataset where the variable date shows the first day of the month in a yyyy-mm-dd format.
In order to perform time-series analysis, it is recommended to use xts objects. We will construct an xts object using the xts() function. Objects of the xts class are composed of 2 parts: the core data and the index (of class Date, POSIXct, etc.).
# Load library xts. Install the package if you haven't already.
library(xts)
# We will start by turning sales_brands into a data frame
# because data frames are easier to manipulate
sales_brands.df <- as.data.frame(sales_brands)

# Divide the index and the core data:
# I get the date in a new variable fecha.
# The first column of sales_brands.df is the date.
fecha <- sales_brands.df$date

# I drop this column since I copied it to fecha:
sales_brands.df <- sales_brands.df[,-1]

# Now I join both objects using xts() and assign the resulting object to sales.xts
sales.xts <- xts(x = sales_brands.df, order.by = fecha)
head(sales.xts, 5)
## qbrand1tons salesbrand1 qbrand2tons salesbrand2 qbrand3tons
## 2015-01-01 106.62 60780.72 195.46 60700.51 293.10
## 2015-02-01 105.82 60411.47 216.09 68289.15 494.46
## 2015-03-01 88.82 53174.89 220.15 62529.54 356.88
## 2015-04-01 86.69 50861.69 170.74 50575.68 291.07
## 2015-05-01 87.23 51014.23 177.17 54327.48 318.09
## salesbrand3 qbrand4tons salesbrand4 qbrand5tons salesbrand5
## 2015-01-01 75196.35 185.91 48228.51 28.94 2295.50
## 2015-02-01 105010.90 149.30 39922.14 24.66 1973.43
## 2015-03-01 90902.84 144.65 39788.22 23.75 1903.24
## 2015-04-01 72853.34 161.43 42692.59 21.00 1684.28
## 2015-05-01 75277.10 141.09 37167.52 23.44 1883.41
## qbrand6tons salesbrand6 tonstotal salestotal
## 2015-01-01 338.38 55561.26 1148.41 302762.8
## 2015-02-01 284.52 51469.74 1274.85 327076.8
## 2015-03-01 251.34 50395.48 1085.59 298694.2
## 2015-04-01 230.29 49124.61 961.22 267792.2
## 2015-05-01 250.45 46906.63 997.47 266576.4
# Do you notice any difference between data frames and xts objects?
#delete variable fecha
rm(fecha)
We can have an overview of the behaviour of sales of these consumer products using the plot command. Let’s focus on brands 2 and 6. To see the volume sales (the number of tons sold) of brand 2:
plot(sales.xts$qbrand2tons)
It seems that this product has a growing trend over time, except for the last year.
We can graph consumer brand 6:
plot(sales.xts$qbrand6tons)
This brand has a clear growing trend over time. As we can see, many consumer products have a growing or a declining trend. Other products might seem to be stationary over time, with sales fluctuating around an average.
A series with a growing or declining trend is non-stationary. Most variables that represent sales of a product can be treated as non-stationary. Then, a rule of thumb when we work with business variables such as volume sales, value sales, cost of goods sold, etc., is to transform the series into a stationary series.
When you work with sales data, I strongly recommend creating the first difference of the logarithm of the variable. This is actually its percentage change (in continuously compounded terms). If you apply this transformation to real-world sales variables, the result will be stationary most of the time (in roughly 95% of cases). Another reason to use this transformation is that the first difference of the log of a variable is the % change of the variable from one period to the next (as a continuously compounded percentage). We can do this transformation as follows:
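As a tiny numeric check with hypothetical values: if volume sales go from 100 to 110 tons, the first difference of the log is the continuously compounded percentage change:
# Continuously compounded % change from 100 to 110 tons
log(110) - log(100)   # ~0.0953, i.e. about 9.53%, close to the simple 10% change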
We first create the natural log of each sales variable (in tons). We can apply the log function to the xts object to get the log of all series:
ln_sales <- log(sales.xts)
The new object has the same structure as sales.xts but the columns contain the natural logarithm of the data. This is due to R vectorization, a key feature of R.
Now, I can work with the first difference month-by-month using the diff() function for each variable. For example, to graph the first difference of the log of volume sales for both brands:
plot(diff(ln_sales$qbrand2tons), type = "l", col = "red")
lines(diff(ln_sales$qbrand6tons), type = "l", col = "blue")
The first difference of the log value in R is diff(ln_sales$qbrand2tons). The diff() function generates this first difference on the fly as:
diff(ln_sales$qbrand2tons) = ln_sales$qbrand2tons - lag(ln_sales$qbrand2tons, 1)
Remember that the dollar sign operator ($) is used to access/refer to the columns of a data frame or xts object.
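As a quick check (a sketch using the objects created above), we can verify that the first difference equals the series minus its own value one period before:
# Compare diff() with the manual subtraction of the lagged series
head(merge(diff(ln_sales$qbrand2tons),
           ln_sales$qbrand2tons - lag(ln_sales$qbrand2tons, 1)))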
We can see now that the 2 series look stationary since their means tend to stay at the same level.
R performs the first-difference calculation temporarily to make the graph, and this variable is not saved in the dataset or environment. In other words, we do not need to generate the first difference as another variable (column), but we can if we need to, as shown in the sketch below.
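A minimal sketch of how you could keep the first difference as a new object (the name dln.qbrand2 is just illustrative):
# Save the monthly % change (continuously compounded) of brand 2 as a new xts object
dln.qbrand2 <- diff(ln_sales$qbrand2tons)
head(dln.qbrand2)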
Another useful transformation for monthly data is the annual percentage change. In this case, this is the difference between the log value of the series and its value 12 periods (months) ago. We can compute this difference using the diff() function with the lag argument set to 12:
# I create a variable for the seasonal difference of the log of sales, which
# is the annual % change of sales (in continuously compounded):
s12.lnqbrand2 = diff(ln_sales$qbrand2tons, lag = 12)
# I plot this new variable, the annual % change of sales for product 2:
plot(s12.lnqbrand2, type = "l", col = "red")
# I do the same for product 6:
s12.lnqbrand6 = diff(ln_sales$qbrand6tons, lag = 12)
# I plot this new variable, the annual % change of sales for product 6:
plot(s12.lnqbrand6, type = "l", col = "blue")
In this case, the series for both brands do not clearly look stationary. Some series might not look stationary to our eyes, so, in addition to the visual inspection, we need to run a statistical test to decide whether the series can be treated as stationary.
The Dickey-Fuller test is one of the most widely used statistical tests for stationarity. It is called a unit-root test. Under its null hypothesis, the series is non-stationary; in other words, it assumes that the series follows a random walk with a unit root (\(\phi_1=1\)). We can run this test in R using the adf.test() function from the tseries package.
You need to install the package tseries to do the following analysis.
library(tseries)
# Before running the Dicky-Fuller test, we need to drop NA (missing) values of
# the variable:
s12.lnqbrand2 = na.omit(s12.lnqbrand2)
adf.test(s12.lnqbrand2, alternative = "stationary", k = 0)
##
## Augmented Dickey-Fuller Test
##
## data: s12.lnqbrand2
## Dickey-Fuller = -3.5079, Lag order = 0, p-value = 0.05421
## alternative hypothesis: stationary
The adf.test() function from the tseries package performs a simple Dickey-Fuller test when we set the number of lags (k) equal to 0.
The Dickey-Fuller test reports a p-value. In this case, it is barely greater than 0.05 (0.05421). Since it is so close, we will treat it as if it were 0.05, so we have roughly 95% confidence (1 minus the p-value) to reject the null hypothesis that the series has a unit root (\(\phi_1 = 1\)). In other words, when the Dickey-Fuller p-value is less than or equal to 0.05, there is statistical evidence to conclude that the differenced series (s12.lnqbrand2) is stationary.
Then, for the variable s12.lnqbrand2, we can say that there is statistical evidence that the series is stationary.
If we try the same for brand 6, we will find that its seasonal difference (s12.lnqbrand6) is also stationary. Try this by yourself; a sketch of the same steps is shown below.
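A minimal sketch of the same steps for brand 6 (run it yourself and interpret the p-value):
# Drop NA values created by the seasonal difference, then run the Dickey-Fuller test
s12.lnqbrand6 = na.omit(s12.lnqbrand6)
adf.test(s12.lnqbrand6, alternative = "stationary", k = 0)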
For monthly time-series variables such as sales of products or sales of a company, I strongly recommend working with the seasonal difference of the log of sales (s12.lnqbrand2), which is the annual percentage change month by month. The reason is that most consumer products have a seasonal pattern. We can do this ONLY if the seasonal difference of the log is STATIONARY.
For these 2 products, then, I will use the seasonal difference of the log of sales.
In time series analysis, a correlogram is a chart of correlation statistics, also known as an autocorrelation plot. This is a plot of the sample autocorrelations of the dependent variable \(Y_t\) against the time lags. In R, the command to compute these autocorrelations is acf2() from the astsa package, which gives you both the autocorrelation (ACF) and the partial autocorrelation (PACF) plots. Let’s examine the autocorrelations of brand 2.
You need to install the astsa package to do auto-correlation plots.
library(astsa)
acf2(s12.lnqbrand2)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## ACF 0.59 0.51 0.35 0.21 0.05 -0.04 -0.13 -0.09
## PACF 0.59 0.26 -0.04 -0.11 -0.15 -0.06 -0.05 0.11
The acf2() function gives you the autocorrelogram, which is the autocorrelations in a graph where it is easy to identify which lags of the variable have a significant autocorrelation.
The X axis represents the lag numbers, while the Y axis is the level of autocorrelation. The vertical lines are the levels of autocorrelation between the current value of \(Y\) (\(Y_t\)) and its own lags. The dotted blue line delimits the 95% confidence interval for the autocorrelations. If an autocorrelation goes outside of the 95% C.I., then the autocorrelation of the variable \(Y_t\) with the corresponding lag \(Y_{t-k}\) (\(AutoCorr(Y_t, Y_{t-k})\)) is statistically different from zero. We can see that only lag 1 (\(Y_{t-1}\)) and lag 2 (\(Y_{t-2}\)) are significantly (positively) autocorrelated with the current value of the variable, since these autocorrelations are significantly greater than zero.
What is the meaning of the first significant lag? In this case, it means that the value of \(Y\) at any point in time \(t\) has an average autocorrelation equal to \(\phi_1\) (if the model ends up being an AR(1)) with its own value at \(t-1\). The autocorrelation of \(Y_t\) with its previous value \(Y_{t-1}\) can be calculated manually as a correlation, which is equal to their covariance divided by the product of their standard deviations. Recall the formula for the correlation between 2 variables Y and X:
\[CORR(Y,X)=\frac{COV(Y,X)}{SD(Y)*SD(X)}\] In the case of “Autocorrelation”, we only have one variable \(Y\), so we are looking at the correlation of \(Y_t\) with its own previous value \(Y_{t-1}\). Then, applying the correlation formula between \(Y_t\) and \(Y_{t-1}\) (instead of \(X\)), we can calculate this autocorrelation as:
\[AUTOCORR(Y_t,Y_{t-1})=\frac{AUTOCOV(Y_t,Y_{t-1})}{SD(Y_t)*SD(Y_{t-1})}\] Since \(Y\) is assumed to be a stationary series, the standard deviation of \(Y_t\) is the same as the standard deviation of \(Y_{t-1}\). Then:
\[AUTOCORR(Y_t,Y_{t-1})=\frac{AUTOCOV(Y_t,Y_{t-1})}{VAR(Y_t)} = \frac{\gamma\left(1\right)}{\sigma^{2}_y}\] Fortunately, R does the calculation of all the autocorrelations of \(Y_t\) with its own lags, and these autocorrelations are plotted using the acf2 command (as seen above).
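As a quick manual check (a sketch using the series created above), the lag-1 autocorrelation reported by acf2() should be close to the correlation between the series and its own first lag:
# Compute the lag-1 autocorrelation by hand and compare it with the ACF value (~0.59)
# (acf uses a slightly different normalization, so the numbers may differ slightly)
y <- as.numeric(s12.lnqbrand2)
n <- length(y)
cor(y[2:n], y[1:(n - 1)])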
Unlike the ACF plot, the partial autocorrelation (PACF) plot shows the autocorrelation of each lag with the variable after removing the effect of the lower-order autocorrelations. The autocorrelation and partial autocorrelation between \(Y_{t}\) and \(Y_{t-1}\) are the same, since there are no lower-order autocorrelations. However, the autocorrelation between \(Y_{t}\) and \(Y_{t-2}\) and the partial autocorrelation between \(Y_{t}\) and \(Y_{t-2}\) will be different, since the partial autocorrelation is calculated AFTER the autocorrelation between \(Y_{t}\) and \(Y_{t-1}\) is taken into consideration.
In other words, the PACF plot shows the amount of autocorrelation between \(Y_{t}\) and its lag k \((Y_{t-k})\) that is not explained by the lower-order autocorrelations (those between \(Y_{t}\) and the intermediate lags \(Y_{t-1},...,Y_{t-k+1}\)).
We see in the ACF plot that the lag 1 and lag 2 autocorrelations are positive and significant, and the following lag autocorrelations are also positive but not significant. We see that the autocorrelation diminishes with each lag. In the partial autocorrelation (PACF) plot we can see that ONLY the autocorrelation of lag 1 is significant and positive, and the rest of the partial autocorrelations immediately drop close to zero. This is a classical AR(1) signature with only the first lag. We do not need to include lag 2 since the PACF indicates including only lag 1.
Then we can see that the partial autocorrelation plot (PACF) helps us to identify the number of AR terms. According to this PACF, only lag 1 is autocorrelated with the current value AFTER considering the effect of the lower-order autocorrelations of lags 2, 3, 4, etc. In other words, the annual % change in sales of the current month seems to be positively correlated with the annual % change in sales 1 month ago.
A first-order autoregressive model using only lag 1 can be expressed as follows for a stochastic process \(Y_t\):
\[Y_t = \phi_0 + \phi_1 Y_{t-1} +\epsilon_t\]
The process is stationary only if \(\mid\phi_1\mid < 1\)
I can run an AR model in R for brand 2 to test whether the absolute value of \(\phi_1\) is lower than 1. In R, we can estimate AR, MA, ARMA and ARIMA models with the same function: Arima() from the forecast package. Of course, the function's arguments have to be adjusted for each model. So, let's run an AR model with lag 1 for brand 2.
You need to install the forecast package.
# Load package forecast
library(forecast)
# Use function Arima
m1 <- Arima(s12.lnqbrand2, order = c(1,0,0))
summary(m1)
## Series: s12.lnqbrand2
## ARIMA(1,0,0) with non-zero mean
##
## Coefficients:
## ar1 mean
## 0.6266 0.0515
## s.e. 0.1236 0.0355
##
## sigma^2 estimated as 0.008241: log likelihood=41.95
## AIC=-77.9 AICc=-77.27 BIC=-72.69
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.003930423 0.08859139 0.06879971 -52.73237 144.4614 0.7041704
## ACF1
## Training set -0.2125151
According to the result, we can see that the coefficient \(\phi_1\) is positive, its absolute value is less than one, and it is significant since its estimate (0.6266) is much larger than its standard error (0.1236), which implies a p-value less than 0.05. In this case we can say that the autocorrelation is positive at the 95% confidence level.
We can write the estimated first-order equation for brand 2 in deviation-from-mean form (note that the 0.0515 reported by Arima() is the estimated mean of the series, not the intercept):
s12.lnqbrand2(t) - 0.0515 = 0.6266 * ( s12.lnqbrand2(t-1) - 0.0515 ) + error(t)
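A short sketch (using the m1 object estimated above) of how to recover the implied intercept \(\phi_0\) from the reported constant, assuming the constant shown in the output is the series mean:
# The implied intercept is phi0 = mean * (1 - phi1)
est  <- coef(m1)
phi1 <- est["ar1"]
mu   <- est[length(est)]   # the constant/mean term is the last coefficient
phi0 <- mu * (1 - phi1)    # roughly 0.0515 * (1 - 0.6266), about 0.019
phi0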
Now we can use this model to do predictions. We will learn how to do predictions in the following workshop.
QUESTION: INTERPRET THE COEFFICIENTS OF THIS AR(1) MODEL
Design an ARMA model for brand 1 using the annual % growth. Run the model and interpret it.
If you want to sharpen your knowledge about Autoregressive Moving Average models and how to code them in R, do the following course from Datacamp: ARIMA models in R.
Go to the course site and carefully re-read the Notes a) Introduction to time-series financial models and re-read the theory explained in this workshop about the Introduction to ARMA Models.
Go to Canvas and respond Quiz 3. You will have 3 attempts. Questions in this Quiz are related to concepts of the readings related to this Workshop.
The grade of this Workshop will be the following:
Remember that you have to submit your .html file through Canvas BEFORE NEXT CLASS.