Abstract
This is a solution for Workshop 4. Not all of the workshop is displayed; only the sections where students needed to work on an exercise or respond to questions. Go to the course site (Material) and download Dr. Nau's slides about ARIMA/SARIMA models. Read the slides and focus on his recommendations for selecting the values of p, d, q, P, D, Q for any ARIMA/SARIMA model. Download the following dataset:
download.file("http://www.apradie.com/ec2004/salesfab2.xlsx", "salesfabs2.xlsx", mode="wb")
Now I read the data from the Excel file and construct an xts dataset. I also load the needed libraries.
library(readxl)
library(xts)
library(zoo)
library(tseries)
library(forecast)
sales_brands <- read_xlsx("salesfabs2.xlsx", sheet = "Sheet1")
# I create an xts object to handle the data and dates in a better way.
# I construct a date variable as a sequence of dates using the month column:
date <- seq(as.Date(sales_brands$month[1]),
            by = "month",
            length.out = nrow(sales_brands))
# I delete the first column since it has the month, and I will construct
# an xts dataset with the index equal to the month:
sales_brands <- sales_brands[,-1]
# I create the xts object:
sales_brands <- xts(x = sales_brands, frequency = "monthly", order.by = date)
# I keep only the observations up to October 2017:
sales_brands <- sales_brands["/2017-10-1",]
This dataset contains national monthly consumer sales of a category of products. Sales volumes are in tons (the qfab# columns) and prices are in the pfab# columns; the dataset includes a few products. You have to do the following:
Generate the log variables for price (pfab1) and sales in volume (qfab1) for product 1.
Now you have to work on a model for qfab1, sales in tons of product 1:
Graph the natural log of sales in tons of this product. WHAT DO YOU OBSERVE about the series? Does it look stationary or non-stationary? Does it show seasonality?
I get the natural logarithm of all values:
lnsales <- log(sales_brands)
Now I graph log of sales for product 1:
plot(lnsales$qfab1)
It seems that the series is not stationary and has seasonality: the mean of the series changes over time, and there are clear peaks in the same periods of the year.
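A complementary (optional) way to visualize the seasonal pattern is a month plot, which groups the observations by calendar month. This is a quick sketch beyond the required workshop steps, assuming the series starts in January 2012 (the start date used later in the forecast section):
# monthplot() (base R) needs a ts object, so I convert the xts column first:
monthplot(ts(as.numeric(lnsales$qfab1), start = c(2012, 1), frequency = 12))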
I start checking whether the seasonal difference of the log is stationary.
If the series is not stationary, check whether the seasonal difference of the log looks stationary (s12.lnqfab1). Do a time-series plot, then run and INTERPRET the Dickey-Fuller test.
I try the seasonal difference of the series:
s12.lnqfab1 <- diff(lnsales$qfab1, lag = 12)
# I drop the NA values created by the differencing:
s12.lnqfab1 <- na.omit(s12.lnqfab1)
plot(s12.lnqfab1)
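Before testing, it helps to note what this transformed variable means: the seasonal (12-month) difference of the log is approximately the annual percentage growth of sales, since
\[ s12.lnqfab1_t = \log(qfab1_t) - \log(qfab1_{t-12}) \approx \frac{qfab1_t - qfab1_{t-12}}{qfab1_{t-12}} \]
This is the interpretation I will use later when reading the model coefficients.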
Now I will run the Dickey-Fuller test on this variable:
adf.test(s12.lnqfab1, k = 0)
## Warning in adf.test(s12.lnqfab1, k = 0): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: s12.lnqfab1
## Dickey-Fuller = -4.8046, Lag order = 0, p-value = 0.01
## alternative hypothesis: stationary
# I had to drop the NA terms for the Dickey-Fuller test to be performed
I see that the series is now stationary, since the p-value is less than 0.05: I can reject the null hypothesis that the series has a unit root (is non-stationary).
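As an optional robustness check (my addition, not required by the workshop), I can re-run the test letting tseries choose its default lag order instead of forcing k = 0; if the series is truly stationary, the conclusion should not change:
# Re-run the ADF test with the default lag order chosen automatically:
adf.test(s12.lnqfab1)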
According to Dr. Nau's slides and what you learned in this workshop, calibrate the best ARIMA/SARIMA model.
I will make the ACF and PACF graphs to see the level of autocorrelation of the variable at specific lags:
# I load the astsa library to run the acf and pacf plots:
library(astsa)
acf2(s12.lnqfab1,max.lag = 24)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## ACF 0.41 0.47 0.31 0.17 0.01 0.04 -0.13 -0.28 -0.29 -0.48 -0.24 -0.49 -0.15
## PACF 0.41 0.37 0.05 -0.13 -0.19 0.05 -0.10 -0.29 -0.14 -0.26 0.26 -0.30 0.21
## [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
## ACF -0.23 -0.15 -0.14 -0.03 -0.06 0.17 0.17 0.14 0.28 0.09 0.20
## PACF -0.04 -0.08 -0.09 -0.18 0.00 0.21 -0.11 0.04 -0.09 0.07 -0.08
In the ACF plot we see that the autocorrelation between Lag 1 and the variable is about 0.41, and the autocorrelation between Lag 2 and the variable is about 0.47.
After Lag 2 the autocorrelations decay gradually, and after Lag 3 they are no longer significant.
However, in the partial autocorrelation plot, Lag 2 has a partial autocorrelation of about 0.37, meaning that Lag 2 is correlated with the current value in 0.37 AFTER accounting for the correlation between Lag 1 and the current value of the variable.
In the PACF, after the 2nd lag the partial autocorrelations drop drastically. This pattern in the ACF and PACF is a typical AR signature; in this case we select 2 AR terms (p=2), since the PACF shows the first 2 partial autocorrelations as positive and significant.
When we find an AR signature, we start by setting p according to the positive and significant partial autocorrelations in the PACF, and set the remaining parameters equal to zero (q=0, P=0, and Q=0). If there are other significant lags that are not part of the AR signature, we ignore them for now. In this case, we can see in the ACF and PACF plots that lag 12 is significant and negative, but we can ignore this for now since it might show up again when we check the errors of the first model.
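As an optional cross-check of the manual procedure (my addition, using the forecast package rather than Dr. Nau's slides), auto.arima() can search for a specification automatically; forcing one seasonal difference (D = 1) and the log transformation (lambda = 0) makes its search comparable to our manual setup:
# Convert to ts so auto.arima knows the monthly frequency, then search:
auto.arima(ts(as.numeric(sales_brands$qfab1), start = c(2012, 1), frequency = 12),
           D = 1, lambda = 0)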
Then I start by adding only 2 AR terms (p=2) and will later check whether the model errors look like white noise or whether I need to add more terms.
# Load the lmtest library
library(lmtest)
model1 <- Arima(sales_brands$qfab1, order = c(2,0,0), # p=2; d=0; q=0
                seasonal = list(order = c(0,1,0), period = 12), # P=0; D=1; Q=0
                include.constant = TRUE, # include the drift or phi0
                lambda = 0) # use the log transformation before running the model
# I check the model coefficients and their corresponding p-values:
coeftest(model1)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.2575482 0.1204736 2.1378 0.0325331 *
## ar2 0.3573285 0.1201494 2.9740 0.0029391 **
## drift 0.0083612 0.0022295 3.7502 0.0001767 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
I see that both AR coefficients are positive and significant as expected according to the PACF.
Now I can check whether the residuals of this model behave like a white noise:
What is a white noise series? RESPOND WITH YOUR WORDS
A white noise series is a time series that has no autocorrelation at any lag, its mean is 0, and it is stationary. In other words, a white noise series cannot be explained by any of its own past values or errors, since it is totally unpredictable.
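To illustrate this definition with a quick simulation (my own toy example, not workshop data): a series of independent normal draws is white noise by construction, so its ACF and PACF should show no significant autocorrelation at any lag.
# Simulate 100 independent shocks with mean 0 and check their (P)ACF:
set.seed(123)
wn <- rnorm(100, mean = 0, sd = 1)
acf2(wn, max.lag = 12)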
The residuals series of a model (error series) is the difference between a real value of the variable and its predicted value using the model.
The residuals of the model are already calculated and stored in the Arima model R object in the residuals attribute:
# I get the residuals of the model and assign them to a temporary variable:
qfab1_res <- model1$residuals
I get the ACF and PACF plots of these residuals to check for white noise, or whether the residuals still have significant autocorrelations at certain lags:
acf2(qfab1_res,max.lag = 12)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
## ACF -0.02 0.04 0.16 -0.04 -0.11 0.15 -0.02 -0.14 -0.13 -0.29 0.06 -0.37
## PACF -0.02 0.04 0.16 -0.04 -0.12 0.13 0.00 -0.13 -0.20 -0.31 0.14 -0.39
The Lag 10 and Lag 12 autocorrelations are negative and significant. The rest of the lag autocorrelations are not.
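A formal complement to eyeballing the (P)ACF (optional; my addition beyond the workshop steps) is the Ljung-Box test, which tests the joint null hypothesis that the first m residual autocorrelations are zero:
# Ljung-Box test on the first 12 residual autocorrelations;
# fitdf = 2 discounts the 2 AR parameters estimated in model1:
Box.test(qfab1_res, lag = 12, type = "Ljung-Box", fitdf = 2)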
For monthly economic or sales data, according to my personal experience, I only pay attention to the autocorrelations of lags 1, 2, 3, 6 and 12. I call these the important LAGS. Most monthly economic and sales data might have important autocorrelations with their past values of 1, 2, 3, 6 or 12 months ago. I do not pay attention to autocorrelations beyond lag 12 (for monthly data).
Then in this case we will consider the significant autocorrelation of LAG 12 in the ARIMA-SARIMA model.
The question is, how do I include this term in the model? According to Dr. Nau's recommendations, if the lag is negative, appears in both the ACF and PACF, and does not follow an AR signature, then we start by including a SEASONAL MA term (Q=1). Why a seasonal MA term? Because the autocorrelation is with Lag 12 and our data is monthly.
Then I modify the model and run it:
model2 <- Arima(sales_brands$qfab1, order = c(2,0,0), # p=2; d=0; q=0
                seasonal = list(order = c(0,1,1), period = 12), # P=0; D=1; Q=1
                include.constant = TRUE, # include phi0 in the model
                lambda = 0) # use log transformation for the original variable
# I check the pvalues of the coefficients:
coeftest(model2)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 0.3231767 0.1278035 2.5287 0.011449 *
## ar2 0.2247712 0.1326734 1.6942 0.090233 .
## sma1 -0.4272312 0.1349052 -3.1669 0.001541 **
## drift 0.0084352 0.0011842 7.1233 1.053e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The seasonal MA term is negative and significant, as expected. Then I get the residuals of this model and check whether the residual series looks like white noise.
qfab1_res2 <- model2$residuals
I get the ACF and PACF graphs to check whether these errors behave like a white noise:
acf2(qfab1_res2,max.lag = 12)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
## ACF 0 0.03 0.12 -0.1 -0.13 0.15 -0.02 -0.18 -0.07 -0.3 0.07 -0.07
## PACF 0 0.03 0.12 -0.1 -0.14 0.15 0.01 -0.18 -0.14 -0.3 0.18 -0.09
Now the residual series looks like white noise, since none of the important lags are significant. Then I keep this model and finish the calibration process.
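Note that the forecast package also offers checkresiduals(), which produces the residual time plot, the residual ACF, and a Ljung-Box test in a single call; an optional shortcut for the manual inspection done above:
# One-call residual diagnostics for the final model:
checkresiduals(model2)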
INTERPRETATION of the final model:
First, I have to check which is the variable of study (the dependent variable). In this case I am modeling the annual difference of the log of volume sales of product 1 (s12.lnqfab1), which is approximately the annual % growth of sales of product 1, month by month. Then, I can interpret the model as follows:
The AR(1) term has a coefficient (phi1) equal to 0.3231767, and it is significantly greater than zero since its p-value < 0.05. Then I can say that the annual % growth of product 1 sales is positively and significantly related to its own annual % growth of the previous month. In other words, for each 1 percentage-point increase in the annual % growth of the previous month, the current annual % growth of sales is expected to increase by about 0.32 percentage points.
The AR(2) term has a coefficient (phi2) equal to 0.2247712; its p-value is about 0.09, so it is significant only at the 10% level (not at the 5% level). Then I can say that the annual % growth of product 1 sales is positively (and marginally significantly) related to its own annual % growth of 2 months ago, AFTER considering the autocorrelation of Lag 1 with the current annual % growth. For each 1 percentage-point increase in the annual % growth of 2 months ago, the current annual % growth of sales is expected to increase by about 0.22 percentage points, after considering the annual % growth of the previous month (Lag 1).
The constant phi0 (or drift) is positive and significant. This means that the sales of product 1 have a systematic growing tendency over time: the drift coefficient is the monthly slope of the log of sales, so sales grow on average by about 0.84% per month.
INTERPRETATION of the seasonal MA term:
The annual % growth of sales of product 1 is negatively related to the shock (error) of 12 months ago: when the shock of 12 months ago is negative, the annual % growth of the current month is expected to be pushed upward.
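To summarize, the final model can be written in equation form. This is a sketch in my own notation, with \(y_t = \log(qfab1_t)\) and \(w_t = y_t - y_{t-12}\) the seasonal difference:
\[ w_t = c + \phi_1 w_{t-1} + \phi_2 w_{t-2} + \varepsilon_t + \Theta_1 \varepsilon_{t-12} \]
where \(\varepsilon_t\) is a white-noise error, \(c\) is a constant related to the drift term, and the estimates are \(\phi_1 \approx 0.3232\), \(\phi_2 \approx 0.2248\), and \(\Theta_1 \approx -0.4272\).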
With this model I do a forecast of sales of product 1 for the next 24 months. For the last model I recommend converting the original variable to a ts R object to get a better visualization of the forecast:
# Convert the xts to a ts dataset. I indicate the first month of the series
# and also the number of periods in the year (frequency):
qfab1.ts <- ts(coredata(sales_brands$qfab1), start = c(2012,1), frequency = 12)
# I re-run the last calibrated model with this ts object:
model2 <- Arima(qfab1.ts, order = c(2,0,0), # p=2; d=0; q=0
                seasonal = list(order = c(0,1,1), period = 12), # P=0; D=1; Q=1
                include.constant = TRUE, # include phi0 in the model
                lambda = 0) # use log transformation for the original variable
# I store the information of the forecast in forecast_qfab1:
forecast_qfab1 <- forecast(model2, h = 24)
# I plot the forecast:
autoplot(forecast_qfab1)
We can see the forecast of sales volume, which is stored as an attribute in the forecast_qfab1 object:
forecast_qfab1$mean
##           Jan      Feb      Mar      Apr      May      Jun      Jul      Aug
## 2017
## 2018 588.8096 605.7687 523.5945 354.1505 349.6611 421.9181 340.2731 336.1421
## 2019 653.3505 671.6545 580.1115 392.2160 387.1279 467.0388 376.6140 372.0106
##           Sep      Oct      Nov      Dec
## 2017                   287.0625 340.1268
## 2018 331.8119 310.0792 319.3057 378.2420
## 2019 367.1976 343.1346
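Besides the point forecasts stored in the mean attribute, the forecast object also stores the 80% and 95% prediction intervals, which quantify the uncertainty around each point forecast:
# Lower and upper bounds of the 80% and 95% prediction intervals:
head(forecast_qfab1$lower)
head(forecast_qfab1$upper)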
The Akaike Information Criterion (AIC) is a measure of how well the predictions of the model fit the actual data, penalized by the number of parameters. The lower the AIC value, the better the model.
Its formula:
\[ AIC = 2k - 2\log(L) \]
where \(k\) is the number of parameters estimated in the model and \(L\) is the maximized value of the likelihood function of the model.
In this case the AIC is negative (this happens when the maximized log-likelihood is positive). We can use the AIC when we have 2 competing ARIMA/SARIMA models, and select the model with the lowest (most negative) AIC.
AIC is already calculated by Arima and can be shown as follows:
model2$aic
## [1] -125.0668
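To apply this criterion here (an optional extra step), we can compare the AIC of the 2 models calibrated above and confirm which one fits better:
# Compare the AIC of the two candidate models (lower is better):
c(model1 = model1$aic, model2 = model2$aic)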