1 Hand-on activity

We will work with an online dataset that contains sales data of different consumer products. The data was created using real products, but the names of the products were changed due to privacy issues.

Download the excel file from a web site:

download.file("http://www.apradie.com/ec2004/salesbrands3.xlsx", "salesbrands3.xlsx", mode="wb")

This dataset has sales information of several brands of consumer products that belong to one category (for example, the category can be Cereals or Candies). The content of this dataset is very similar to a real-world category. Consumer products usually show some seasonality and trend over time. We will learn how to handle these features later.

Also, you have to install the readxl package. This package reads data from Excel files. We load this package:

library(readxl)

Now, we import the Excel file into an R dataframe:

sales_brands <- read_excel("salesbrands3.xlsx", sheet = "Resultado")

# Always familiarize with the data
head(sales_brands, 5)
## # A tibble: 5 x 15
##   date                qbrand1tons salesbrand1 qbrand2tons salesbrand2
##   <dttm>                    <dbl>       <dbl>       <dbl>       <dbl>
## 1 2015-01-01 00:00:00       107.       60781.        195.      60701.
## 2 2015-02-01 00:00:00       106.       60411.        216.      68289.
## 3 2015-03-01 00:00:00        88.8      53175.        220.      62530.
## 4 2015-04-01 00:00:00        86.7      50862.        171.      50576.
## 5 2015-05-01 00:00:00        87.2      51014.        177.      54327.
## # ... with 10 more variables: qbrand3tons <dbl>, salesbrand3 <dbl>,
## #   qbrand4tons <dbl>, salesbrand4 <dbl>, qbrand5tons <dbl>, salesbrand5 <dbl>,
## #   qbrand6tons <dbl>, salesbrand6 <dbl>, tonstotal <dbl>, salestotal <dbl>

As you can see, we have imported a monthly dataset where the variable date shows the first day of the month in a yyyy-mm-dd format.

In order to perform some time-series analysis, it is recommended to use xts objects. We will construct a xts object using the xts() method. The objects of xts class are composed by 2 objects: the core data and the index (class Date, POSIX, etc).

# Load library xts. Install the package if you haven't already.
library(xts)

# We will start by turning sales_brands into a data frame 
# because data frames are easier to manipulate
sales_brands.df <- as.data.frame(sales_brands)

# Divide the index and the core data:
# I get the date in a new variable fecha:
fecha <- sales_brands.df$date
# The first column of the sales_brand.df is the date. 
#  I drop this column since I copied to fecha: 
sales_brands.df <- sales_brands.df[,-1]

# Now I join both objects using xts() and assign the resulting object to sales.xts
sales.xts <- xts(x = sales_brands.df, order.by = fecha)

head(sales.xts, 5)
##            qbrand1tons salesbrand1 qbrand2tons salesbrand2 qbrand3tons
## 2015-01-01      106.62    60780.72      195.46    60700.51      293.10
## 2015-02-01      105.82    60411.47      216.09    68289.15      494.46
## 2015-03-01       88.82    53174.89      220.15    62529.54      356.88
## 2015-04-01       86.69    50861.69      170.74    50575.68      291.07
## 2015-05-01       87.23    51014.23      177.17    54327.48      318.09
##            salesbrand3 qbrand4tons salesbrand4 qbrand5tons salesbrand5
## 2015-01-01    75196.35      185.91    48228.51       28.94     2295.50
## 2015-02-01   105010.90      149.30    39922.14       24.66     1973.43
## 2015-03-01    90902.84      144.65    39788.22       23.75     1903.24
## 2015-04-01    72853.34      161.43    42692.59       21.00     1684.28
## 2015-05-01    75277.10      141.09    37167.52       23.44     1883.41
##            qbrand6tons salesbrand6 tonstotal salestotal
## 2015-01-01      338.38    55561.26   1148.41   302762.8
## 2015-02-01      284.52    51469.74   1274.85   327076.8
## 2015-03-01      251.34    50395.48   1085.59   298694.2
## 2015-04-01      230.29    49124.61    961.22   267792.2
## 2015-05-01      250.45    46906.63    997.47   266576.4
# Do you notice any difference between data frames and xts objects?

#delete variable fecha
rm(fecha)

2 Stationary vs Non-stationary

We can have an overview of the behavior of sales of these consumer products using the plot command. Let’s focus on brand 2 and 6. To see volume sales (the number of tons sold) of brand 2:

plot(sales.xts$qbrand2tons)

It seems that this product has a growing trend over time, except for the last year.

We can graph consumer brand 6

plot(sales.xts$qbrand6tons)

This brand has a clear growing trend over time. As we can see, many consumer products have a growing or a declining trend. And other products might seem to be stationary over time, with sales over an average over time.

A series with a growing or declining trend is non-stationary. Most of the variables that represent sales of any product can be treated as non-stationary. Then, a rule of thumb when we work with business variables such as volume sales, value sales, cost of good sold, etc is to transform the series into a stationary series.

When you work with sales data, I strongly recommend to create the first difference of the logarithm of the variable. This is actually its percentage change (in continuously compounded). If you do this process to most of real-world sales variable the result will be stationary most of the time (about 95% of the variables). The other reason why it is recommended to do this transformation is that the first difference of the log of a variable is actually the % change of the variable from one period to the next (in continuously compounded percentage). Then, we can do this transformation as follows:

We first create the natural log of each sales variable (in tons). We can apply the log function to the xts object to get the log of all series:

ln_sales <- log(sales.xts)

The new object has the same structure as sales.xts but the columns contain the natural logarithm of the data. This is due to R vectorization, a key feature of R.

Now, I can work with the first difference month-by-month using the diff() function for each variable. For example, to graph the first difference of the log of volume sales for both brands :

plot(diff(ln_sales$qbrand2tons), type = "l", col = "red")
lines(diff(ln_sales$qbrand6tons), type = "l", col = "blue")

The first difference of the log value in R is diff(ln_sales$qbrand2tons). The diff() function generates on the fly this first difference as:

diff(ln_sales\(qbrand2tons) = ln_sales\)qbrand2tons - lag(ln_sales$qbrand2tons, 1)

Remember that the dollar sign operator ($) is used to access/refer to the columns of a data frame or xts object.

We can see now that the 2 series look like stationary since their means tend to be in the same level.

R performs the first difference calculation temporarily to do a graph, and then, this variable is not saved in the dataset or environment. In other words, we do not need to generate the first difference as another variable (column), but we can if we need to.

Another transformation with monthly data is annual percentage change. In this case, this is the difference between the log value of the series and its value 12 periods ago. We can check this difference using the diff() function with the lag argument set to 12:

# I create a variable for the seasonal difference of the log of sales, which
#   is the annual % change of sales (in continuously compounded):
s12.lnqbrand2 = diff(ln_sales$qbrand2tons, lag = 12)

# I plot this new variable, the annual % change of sales for product 2:
plot(s12.lnqbrand2, type = "l", col = "red")
# I do the same for product 6:
s12.lnqbrand6 = diff(ln_sales$qbrand6tons, lag = 12)
# I plot this new variable, the annual % change of sales for product 2:
plot(s12.lnqbrand6, type = "l", col = "blue")
lines(diff(ln_sales$qbrand6tons, lag = 12), type = "l", col = "blue")

In this case, the series for both brands do not look stationary. But we need to do a statistical test to check whether a series can be treated as stationary or not. Some series might not look stationary for our eyes, but we need to run statistical tests to decide whether to treat the series stationary or not. There is a statistical way to evaluate whether these series are stationary. We need a statistical confirmation in addition to the visual tools.

The Dicky-Fuller test is one of the most used statistical tests for stationary. This test is called a unit root test. The Dicky Fuller test assumes that the series is non stationary. In other words, it assumes that the series follows a random walk with a unit root (phi1=1). We can run this test in R using the adf.test function from the tseries package.

You need to install the package tseries to do the following analysis.

library(tseries)
# Before running the Dicky-Fuller test, we need to drop NA (missing) values of 
#   the variable:
s12.lnqbrand2 = na.omit(s12.lnqbrand2)
adf.test(s12.lnqbrand2, alternative = "stationary",k=0)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  s12.lnqbrand2
## Dickey-Fuller = -3.5079, Lag order = 0, p-value = 0.05421
## alternative hypothesis: stationary

The adf.test() function from the tseries package will do a Dickey-Fuller test if we set lags (k) equal to 0.

The Dicky Fuller test reports a p-value. In this case, it is barely greater than 0.05. In this case, we will consider the pvalue to be 0.05, so we will have 95% confidence (1-pvalue) to reject the null hypothesis that states a unit root for these series (\(\phi_1 = 1\)). In other words, for the series with a Dicky-Fuller p-value less or equal than 0.05 we can say that there is statistical evidence to conclude that the differenced series (s12.lnbrand2tons) is stationary.

Then for the variable s12.lnbrand2tons we can say that there is statistical evidence to say that the series is stationary.

If we try the same for brand 6 we will find that its seasonal difference (s12.lnbrand6tons) is also stationary.

Try this by yourself.

# We generate the seasonal difference of the log for brand 6
s12.lnqbrand6 = na.omit(s12.lnqbrand6)
adf.test(s12.lnqbrand6, alternative = "stationary",k=0)
## Warning in adf.test(s12.lnqbrand6, alternative = "stationary", k = 0): p-value
## smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  s12.lnqbrand6
## Dickey-Fuller = -5.3934, Lag order = 0, p-value = 0.01
## alternative hypothesis: stationary

Since p-value of the test is <0.01, then we also reject the null hypothesis that states that the series is NOT stationary; in other words, we can also treat the seasonal difference of the log price of brand 6 as stationary; we can also say that the annual % growth of sales of brand 6 can be treated as stationary.

For the case of monthly times series variables like sales of products or sales of a company, I strongly recommend to work with the seasonal difference of the log of sales (s12.lnbrand2tons), which is the annual percentage change month by month. The reason is because most of consumer products have a seasonal pattern. We can do this ONLY if the seasonal difference of the log is STATIONARY.

For these 2 products, then, I will use the seasonal difference of the log of sales.

3 Autocorrelations (AC) and Partial-autocorrelations (PAC)

In time series analysis, a correlogram is an chart of correlation statistics also known as an autocorrelation plot. This is a plot of the sample autocorrelations of the dependent variable \(Y_t\) and the time lags. In R, the command to compute the autocorrelations is acf2() from the astsa package, which gives you both, the autocorrelation (ACF) figures and the partial-autocorrelations (PACF), as well. Let’s examine the autocorrelations of brand 2.

You need to install the astsa package to do auto-correlation plots.

library(astsa)
acf2(s12.lnqbrand2)

##      [,1] [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]
## ACF  0.59 0.51  0.35  0.21  0.05 -0.04 -0.13 -0.09
## PACF 0.59 0.26 -0.04 -0.11 -0.15 -0.06 -0.05  0.11

The acf2() function gives you the autocorrelogram, which is the autocorrelations in a graph where it is easy to identify which lags of the variable have a significant autocorrelation.

The X axis represents the lag numbers, while the Y axis is the level of autocorrelation. The vertical lines are the levels of autocorrelations between the current value of \(Y\) (\(Y_t\)) and its own lags. The dotted, blue line delimits the 95% confidence interval for the autocorrelations. Then, if the autocorrelation goes out of the 95% C.I., then the autocorrelation of the variable \(Y_t\) with its corresponding lag \(Y_{t-1}\) (\(AutoCorr(Y_t, Y_{t-1})\)) is statistically different than zero. We can see that only the lag 1 (\(Y_{t-1}\)) and lag 2 (\(Y_{t-2}\)) is significantly autocorrelated (positively) with the current value of the variable since this autocorrelation is significantly greater than zero.

What is the meaning of the first significant lag? In this case, this means that the value of \(Y\) at any point in time \(t\) has an average autocorrelation equal to \(\phi_1\) (if the model ends up being an AR(1)) with its own value at \(t-1\). The autocorrelation of \(Y_t\) with its previous value \(Y_{t-1}\) can be calculated manually as the correlation, wich is equal to their covariance divided by the product of their standard deviation. Remembering the formula for correlation between 2 variables Y and X:

\[CORR(Y,X)=\frac{COV(Y,X)}{SD(Y)*SD(X)}\] In the case of “Autocorrelation”, we only have one variable \(Y\), so we are looking at the correlation of \(Y_t\) with its own previous value \(Y_{t-1}\). Then, applying the correlation formula between \(Y_t\) and \(Y_{t-1}\) (instead of \(X\)), we can calculate this autocorrelation as:

\[AUTOCORR(Y_t,Y_{t-1})=\frac{AUTOCOV(Y_t,Y_{t-1})}{SD(Y_t)*SD(Y_{t-1})}\] Since \(Y\) is supposed to be stationary series, then the standard deviation of \(Y_t\) has to be the same as the standard deviation of \(Y_{t-1}\). Then:

\[AUTOCORR(Y_t,Y_{t-1})=\frac{AUTOCOV(Y_t,Y_{t-1})}{VAR(Y_t)} = \frac{\gamma\left(1\right)}{\sigma^{2}_y}\] Fortunately, R does the calculation of all the autocorrelations of \(Y_t\) with its own lags, and these autocorrelations are plotted using the acf2 command (as seen above).

Unlike the ACF plot, the partial-autocorrelation (PACF) plot shows the autocorrelation of the lag with the variable after considering the effect of lower-order autocorrelations. The autocorrelation and partial autocorrelation between \(Y_{t}\) and \(Y_{t-1}\) are the same, since there is no lower-order autocorrelations. The autocorrelation between \(Y_{t}\) and \(Y_{t-1}\) and the partial autocorrelation between \(Y_{t}\) and \(Y_{t-1}\) will be different since the partial autocorrelation is calculated AFTER the autocorrelation between \(Y_{t}\) and \(Y_{t-1}\) is taken into consideration.

In other words, the PACF plot shows the amount of autocorrelation between \(Y_{t}\) and its lag K \((Y_{t-k})\) that is not explained by lower-order autocorrelations (\(Y_{t}\) with \((Y_{t-k})\) and all the corresponding autocorrelations until \(Y_{t-1}\)).

We see that in the ac plot lag 1 and 2 autocorrelations are positive and significant, and the following lag autocorrelations are also positive but not significant. We see that for each lag the autocorrelation diminishes. In the partial autocorrelation (PACF) we can see that ONLY the autocorrelation of lag 1 is significant and positive, and immediatly the rest of the correlation go close to zero suddenly. This is a classical AR(1) signature with only the first lag. We do not need to include LAG 2 since the pac indicates to include only LAG1.

Then we can see that the partial autocorrelation plot (PACF) helps us to identify the # of AR terms. Then, according to this PACF we see that only lag 1 is autocorrelated with the current value AFTER considering the effect of the lower-level autocorreations of lags 2,3,4, etc. In other words, the annual %change in sales of the current month seems to be positively correlated with the annual % change in sales 1 months ago.

4 Running an AR(1) model

A first order autoregressive model using only the lag 1 can be expresed as follows for an stochastic process \(Y_t\):

\[Y_t = \phi_0 + \phi_1 Y_{t-1} +\epsilon_t\]

The process is stationary only if \(\mid\phi_1\mid < 1\)

We can run an AR model in R for brand 2 to test whether the absolute value of \(\phi_1\) is lower than 1. In R there are several ways to run ARIMA models. We will use the Arima function from the forecast package.

Let’s run an AR(1) model for brand 2.

You need to install the forecast package.

# Load package forecast
library(forecast)

# Use function Arima
m1 <- Arima(s12.lnqbrand2, order = c(1,0,0))
summary(m1)
## Series: s12.lnqbrand2 
## ARIMA(1,0,0) with non-zero mean 
## 
## Coefficients:
##          ar1    mean
##       0.6266  0.0515
## s.e.  0.1236  0.0355
## 
## sigma^2 = 0.008241:  log likelihood = 41.95
## AIC=-77.9   AICc=-77.27   BIC=-72.69
## 
## Training set error measures:
##                       ME       RMSE        MAE       MPE     MAPE      MASE
## Training set 0.003930423 0.08859139 0.06879971 -52.73237 144.4614 0.7041704
##                    ACF1
## Training set -0.2125151

According to the result, we can see that the coefficient \(\phi_1\) is positive, its absolute value is less than one, and it is significant since its p-value is less than 0.05. In these cases we can say that the autocorrelation is positive at the 95% confidence level.

We can write the equation of first order models for brand 2 as follows:

s12.lnbrand2tons(t) = 0.0515 + 0.6266 * s12.lnbrand2tons(t-1)+ error(t)

QUESTION: INTERPRET THE COEFFICIENTS OF THIS AR(1) MODEL

The phi1 coefficient is 0.6266. We first have to remember the variable we are modelling. In this case we are modelling the product 2 annual % growth of sales (month by month). Then, we can say that the annual % growth of product 2 is positively and significantly related with its own annual % growth of one month ago. The phi1 coefficient plays the role of a filter. I this case, we can say that about 62.66% of the annual growth of one month ago is passed to the annual growth of this month.

In the case of phi0, in the AR(1) model is the expected annual %growth of sales of this month if the annual % growth of sales of previous month is zero. In this case, phi0=0.0515, so if last month sales did not grow, then the expected annual % sales growth of this month will be equal to 5.15.

Looking at the AR(1) equation, we see that the annual % growth of this month is determined by: a) phi0, b)phi1 times annual growth of last month, and c) random shock/error of the month.

Then, we can say that if last month sales grew 20%, that growth will impact the annual growth of this month in about (0.20)(0.6266).

Following the AR(1) equation, in this case, the expected annual % growth of this month will be phi0 + (phi1)(s12.lnbrand2tons(t-1)), which is equal to 0.0514784+ 0.20(0.6266284) = 0.1768041. I did not include the error term here since the expected value of the error is zero since the mean of the error is expected to be zero.

Now we can use this model to do predictions. If we want to do a forecast for 12 months, we use the forecast() function.

forecast(m1, h = 12)
##         Point Forecast       Lo 80     Hi 80      Lo 95     Hi 95
## 3628801     0.02233683 -0.09400135 0.1386750 -0.1555871 0.2002607
## 3715201     0.03321748 -0.10407454 0.1705095 -0.1767526 0.2431875
## 3801601     0.04003561 -0.10465689 0.1847281 -0.1812525 0.2613237
## 3888001     0.04430804 -0.10318887 0.1918049 -0.1812690 0.2698851
## 3974401     0.04698526 -0.10159836 0.1955689 -0.1802538 0.2742243
## 4060801     0.04866289 -0.10034527 0.1976711 -0.1792254 0.2765512
## 4147201     0.04971414 -0.09946040 0.1988887 -0.1784286 0.2778569
## 4233601     0.05037288 -0.09886693 0.1996127 -0.1778697 0.2786155
## 4320001     0.05078567 -0.09847977 0.2000511 -0.1774961 0.2790675
## 4406401     0.05104433 -0.09823117 0.2003198 -0.1772528 0.2793415
## 4492801     0.05120642 -0.09807303 0.2004859 -0.1770968 0.2795096
## 4579201     0.05130799 -0.09797302 0.2005890 -0.1769976 0.2796136

The first column (Point Forecast) is the mean forecast for the following 12 months of the variable, which is the annual % growth of sales for product 2 month by month.

Since the prediction is in natural log and is a seasonal difference, I have to come back to the original scale, so I need to create the forecast variable in tons, not in log of tons:

5 Challenge

Design an ARMA model for the brand 1 using the annual % growth. Do a forecast for the next 12 months for the brand using this model.

# Construct variable of annual % growth
s12.lnbrand1tons <- na.omit(diff(ln_sales$qbrand1tons, lag = 12))

# First check for stationarity
adf.test(s12.lnbrand1tons, alternative = "s",k=0)
## Warning in adf.test(s12.lnbrand1tons, alternative = "s", k = 0): p-value smaller
## than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  s12.lnbrand1tons
## Dickey-Fuller = -4.6658, Lag order = 0, p-value = 0.01
## alternative hypothesis: stationary

The p-value of this test is < 0.05, so we can reject the null hypothesis that states that the series is a Random Walk. Then, we can treat the seasonal difference as stationary

Since the series is stationary, we can continue running the ACF and PACF plots to check the significance of the autocorrelations of the lags

acf2(s12.lnbrand1tons)

##      [,1] [,2] [,3]  [,4] [,5] [,6] [,7]  [,8]
## ACF  0.63 0.48 0.39  0.20 0.22 0.31 0.38  0.34
## PACF 0.63 0.15 0.06 -0.16 0.16 0.24 0.20 -0.10

This is a clear AR signature because the autocorrelations (from ACF) dies out gradually and the partial autocorrelations (from PACF) cuts off sharply after lag 1

ar1_brand1tons <- Arima(s12.lnbrand1tons, order = c(1,0,0))
summary(ar1_brand1tons)
## Series: s12.lnbrand1tons 
## ARIMA(1,0,0) with non-zero mean 
## 
## Coefficients:
##          ar1    mean
##       0.6330  0.0717
## s.e.  0.1178  0.0244
## 
## sigma^2 = 0.003824:  log likelihood = 58.07
## AIC=-110.14   AICc=-109.51   BIC=-104.92
## 
## Training set error measures:
##                        ME       RMSE        MAE       MPE     MAPE      MASE
## Training set -0.001543299 0.06034798 0.04710515 -113.7927 194.7673 0.5835665
##                    ACF1
## Training set -0.1170992

Both coefficients, phi0 and phi1 are positive. The absolute value of \(\phi_1\) is less than 1. The p-values of the coefficients are not displayed in the output, but we can have a rough estimate looking at the standard errors (s.e.) of both coefficients.

For phi1=0.633, and its standard error = 0.1178. Then, if we divide the coefficient by its standard error, we get the t-value of the coefficient. Then, in this case, t = 5.3735144. Since the t-value>2, then we can say that phi1 is significantly bigger than zero.

The equation of this AR(1) model is:

s12.lnbrand1tons(t) = 0.0717 + 0.633 * s12.lnbrand1tons(t-1)+ error(t)

Now we can do the forecast for 12 months:

forecast(ar1_brand1tons, h = 12)
##         Point Forecast       Lo 80     Hi 80       Lo 95     Hi 95
## 3628801     0.03454252 -0.04470642 0.1137915 -0.08665828 0.1557433
## 3715201     0.04818881 -0.04560119 0.1419788 -0.09525062 0.1916282
## 3801601     0.05682637 -0.04219201 0.1558448 -0.09460918 0.2082619
## 3888001     0.06229360 -0.03874359 0.1633308 -0.09222945 0.2168167
## 3974401     0.06575415 -0.03608063 0.1675889 -0.08998871 0.2214970
## 4060801     0.06794453 -0.03420804 0.1700971 -0.08828435 0.2241734
## 4147201     0.06933096 -0.03294866 0.1716106 -0.08709222 0.2257541
## 4233601     0.07020851 -0.03212196 0.1725390 -0.08629244 0.2267095
## 4320001     0.07076397 -0.03158687 0.1731148 -0.08576813 0.2272961
## 4406401     0.07111555 -0.03124345 0.1734745 -0.08542903 0.2276601
## 4492801     0.07133809 -0.03102418 0.1737004 -0.08521149 0.2278877
## 4579201     0.07147894 -0.03088463 0.1738425 -0.08507264 0.2280305

We can get the forecast in the original measure, volume sales of product 2. We need to do some data management since the forecast is in annual cc % growth.

# Create a Date object with the dates of the forecast periods
date_index <- seq.Date(from=as.Date("2019-07-01"), by="months", length.out = 12)

# Construct an xts object with the prediction and date_index
hat_s12.lnb1q <- xts(forecast(ar1_brand1tons, h = 12)$mean, order.by = date_index)

# Merge original ln of tons with seasonal difference prediction
aux <- merge(ln_sales$qbrand1tons, hat_s12.lnb1q)

# Obtain prediction of ln of tons
aux$hat_lnbrand1tons <- lag(aux$qbrand1tons, 12) + aux$hat_s12.lnb1q

# Merge original with predicted ln of tons
lnbrand1tons<- merge(ln_sales$qbrand1tons, aux$hat_lnbrand1tons)

# Return to original measure
hat_brand1tons <- exp(lnbrand1tons)

plot(hat_brand1tons)

Fortunately, we will avoid doing this transformation if we set up the Arima model in a different way. We will learn this in the next workshop.