Tools used

We use the software “R” and take advantage of its libraries for forecast modeling. The most prominent libraries are forecast and lubridate.

The IDE that we use is RStudio, version 1.1.383.

STEP 1: CASE A,B

We are given one Excel file with 4 worksheets, corresponding to cases A, B, C and D respectively. CSV is a convenient format for analyzing data in R, so we save each worksheet as a separate file: A.csv, B.csv, C.csv, D.csv.

The value columns of the data sets contain no empty or missing values. The only exception is the “Sales Date” column; since we only need the columns that contain data, we handle this in R.

STEP 2: CASE A,B

First, we need to construct a proper data frame from which we can build the time series in R. For this step we use the “fread” command (from the data.table package), which reads each csv file into a data frame. We then create the main plot.
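A minimal sketch of this step is shown here. The real A.csv is not reproduced in this report, so the block writes a synthetic stand-in first; the column names (“Sales Date”, “Sales”), the date range 2015-01-01 to 2017-12-31 and the yearly frequency of 365 are assumptions made for illustration, not the actual layout of the file.

```r
library(data.table)

# Hypothetical stand-in for A.csv: three years of synthetic daily sales.
csv_path <- tempfile(fileext = ".csv")
dates <- seq(as.Date("2015-01-01"), as.Date("2017-12-31"), by = "day")
set.seed(1)
sales <- 900 + 100 * sin(2 * pi * seq_along(dates) / 7) +
  rnorm(length(dates), sd = 50)
fwrite(data.table(`Sales Date` = dates, Sales = round(sales)), csv_path)

# Read the file, keeping only the columns that carry data.
casea <- fread(csv_path, select = c("Sales Date", "Sales"))
setnames(casea, c("date", "sales"))
casea[, date := as.Date(date)]

# Daily series with an assumed yearly cycle, starting on day 1 of 2015.
casea.ts.daily <- ts(casea$sales, start = c(2015, 1), frequency = 365)

# Main plot of the series.
plot(casea.ts.daily, main = "Case A: daily sales", ylab = "Sales")
```

With the real file, only `csv_path` and the column names would need to change.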

We have constructed the time series for case A. As we can see, there does not seem to be a trend; nevertheless, we need to consider that our data might not be stationary. We also have seasonality, since the data provided consists of day-to-day samples of sales.

One way to assess whether our data is stationary is to construct a correlogram, which shows the linear relationship between lagged values.
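A correlogram can be sketched with base R's acf(). The series below is synthetic (a weekly cycle plus a slowly wandering level) and only stands in for the sales data; the choice of 20 lags is likewise an assumption.

```r
set.seed(42)
n <- 3 * 365

# Hypothetical daily series: weekly cycle plus a slow random-walk level.
x <- 900 + 100 * sin(2 * pi * seq_len(n) / 7) + cumsum(rnorm(n, sd = 10))
series <- ts(x, frequency = 7)

# Autocorrelations for lags 0 to 20 (plot = FALSE returns the values).
ac <- acf(series, lag.max = 20, plot = FALSE)
plot(ac, main = "Correlogram")

# Approximate 95% bounds under the null hypothesis of zero autocorrelation.
bound <- qnorm(0.975) / sqrt(n)

# Number of lags (excluding lag 0) that fall outside the bounds.
sum(abs(ac$acf[-1]) > bound)
```

Bars that stay outside the bounds and die out slowly are a hint that the series may need differencing.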

The data seems to be strongly correlated at every lag. The values also seem to trend downward. Positive values indicate persistence in the fluctuations around the mean. The values lie above the confidence interval for the null hypothesis that the autocorrelation is equal to zero. This also indicates that the series is not just random noise, which we can confirm through the Box-Ljung test:

## 
##  Box-Ljung test
## 
## data:  casea.ts.daily
## X-squared = 14033, df = 20, p-value < 2.2e-16

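This confirms that the data is not white noise: with a p-value that is nearly zero, we reject the null hypothesis that the autocorrelations are zero. The test by itself does not establish stationarity; a slowly decaying correlogram is typical of a non-stationary series, which is consistent with the first difference in the ARIMA(2,1,3) errors fitted in STEP 4.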
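The test output above has the format printed by base R's Box.test(). A minimal sketch follows, with a synthetic series standing in for casea.ts.daily; lag = 20 matches the df = 20 reported above, and everything else is an assumption.

```r
set.seed(1)
n <- 3 * 365

# Hypothetical daily series with a weekly cycle.
series <- ts(900 + 100 * sin(2 * pi * seq_len(n) / 7) + rnorm(n, sd = 50),
             frequency = 7)

# Null hypothesis: the first 20 autocorrelations are all zero (white noise).
lb <- Box.test(series, lag = 20, type = "Ljung-Box")
lb

# A small p-value rejects white noise; it says nothing about stationarity.
lb$p.value < 0.05
```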

STEP 3: CASE A,B

Since we are dealing with multiple seasonal cycles, we selected the three models that fitted best during the model-building process:

a. TBATS model (Exponential smoothing state space model with Box-Cox transformation, ARMA errors, Trend and Seasonal components)

b. Exponential Smoothing with Seasonal Decomposition of Time Series

c. Dynamic harmonic regression with multiple seasonal periods (using Fourier terms): regression with ARIMA errors
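The three models listed above can be sketched with the forecast package as follows. The data is synthetic, and the seasonal periods (7 and 365.25), the number of Fourier terms K and the horizon h are illustrative assumptions, not the settings used for the results in this report.

```r
library(forecast)

set.seed(123)
n <- 3 * 365
t <- seq_len(n)

# Hypothetical daily sales with a weekly and a yearly cycle.
y <- 900 + 100 * sin(2 * pi * t / 7) + 150 * sin(2 * pi * t / 365.25) +
  rnorm(n, sd = 50)
sales <- msts(y, seasonal.periods = c(7, 365.25), start = c(2015, 1))

h <- 30  # forecast horizon in days (assumption)

# a. TBATS
fit_tbats <- tbats(sales)
fc_tbats <- forecast(fit_tbats, h = h)

# b. STL decomposition, with ETS on the seasonally adjusted series
fc_stl <- stlf(sales, method = "ets", h = h)

# c. Dynamic harmonic regression: Fourier terms as regressors, ARIMA errors
K <- c(2, 3)  # Fourier pairs for the weekly and yearly periods (assumption)
fit_dhr <- auto.arima(sales, xreg = fourier(sales, K = K), seasonal = FALSE)
fc_dhr <- forecast(fit_dhr, xreg = fourier(sales, K = K, h = h))
```

Each forecast object can be plotted with autoplot() or plot() to compare the fits visually.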

STEP 4: CASE A,B

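The best model is selected by testing its accuracy and by checking whether the residuals left after modelling are only random noise. A Ljung-Box test on the residuals of the constructed models checks the latter. In both outputs below the p-value is far below 0.05, so the test rejects the white-noise hypothesis: some autocorrelation remains in the residuals of both models.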

## 
##  Ljung-Box test
## 
## data:  Residuals from TBATS
## Q* = 901.44, df = 722, p-value = 5.497e-06
## 
## Model df: 8.   Total lags used: 730

## 
##  Ljung-Box test
## 
## data:  Residuals from Regression with ARIMA(2,1,3) errors
## Q* = 882.48, df = 699, p-value = 2.67e-06
## 
## Model df: 31.   Total lags used: 730
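The outputs above have the format printed by checkresiduals() from the forecast package. The same kind of check can be sketched with base R alone, where the `fitdf` argument plays the role of "Model df". The AR(1) series and model below are hypothetical and serve only to show the mechanics.

```r
set.seed(7)

# Hypothetical AR(1) series standing in for the sales data.
y <- arima.sim(model = list(ar = 0.6), n = 500)

fit <- arima(y, order = c(1, 0, 0))
res <- residuals(fit)

# fitdf = number of estimated ARMA coefficients, so df = lag - fitdf.
lb <- Box.test(res, lag = 20, type = "Ljung-Box", fitdf = 1)
lb

# p-value above 0.05: no evidence against white-noise residuals.
# p-value below 0.05: autocorrelation remains in the residuals.
lb$p.value > 0.05
```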

Our criterion for selecting the best model is out-of-sample accuracy: we fit each model on a training set covering “2015-01-01” to “2016-12-31”, so that its accuracy can be tested on the last year of data:

TBATS Accuracy

##                      ME      RMSE       MAE        MPE      MAPE      MASE
## Training set -0.6499263  79.46867  63.03877 -0.8577902  7.201265 0.2546117
## Test set     12.8293535 307.92196 251.85367 -3.9707173 22.599857 1.0172293
##                    ACF1 Theil's U
## Training set 0.04764549        NA
## Test set     0.94437800   2.71139

STL Exponential Smoothing Accuracy

##                        ME      RMSE       MAE        MPE      MAPE
## Training set  -0.06857261  61.92235  48.48242 -0.4495234  5.619985
## Test set     -19.49817680 349.61689 293.41426 -7.9566827 27.266790
##                  MASE       ACF1 Theil's U
## Training set 0.195819 0.06252689        NA
## Test set     1.185091 0.94011716  3.295303
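Tables of this kind come from accuracy() in the forecast package. A minimal sketch of the train/test procedure follows, assuming the data runs from 2015-01-01 to 2017-12-31 (1096 days, of which the first 731 form the training set); the series itself is synthetic and only the weekly cycle is modelled to keep the example fast.

```r
library(forecast)

set.seed(2018)
n <- 1096  # 2015-01-01 to 2017-12-31 (assumed range)
t <- seq_len(n)

# Hypothetical daily sales with a weekly cycle.
y <- 900 + 100 * sin(2 * pi * t / 7) + rnorm(n, sd = 50)
sales <- ts(y, frequency = 7)

# Training set: first 731 days (2015 and 2016); test set: remaining 365 days.
train <- subset(sales, end = 731)
test <- subset(sales, start = 732)

# Fit on the training set only and forecast over the test period.
fc <- stlf(train, method = "ets", h = length(test))

# One row for the training set, one for the test set.
acc <- accuracy(fc, test)
acc[, c("RMSE", "MAE", "MAPE", "MASE")]
```

The "Test set" row is the one that matters for model selection, since it measures error on data the model has not seen.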

RMSE (the square root of the mean squared error) and MAE (mean absolute error) are of similar magnitude for the two models; on the test set TBATS is somewhat lower on both (RMSE 307.9 vs 349.6, MAE 251.9 vs 293.4). The final choice depends on how confident we are in capturing the sales values, and on that basis we select “Exponential Smoothing with Seasonal Decomposition of Time Series”. In the forecast plot the shaded areas are prediction intervals for the sales: with the default shading of the forecast package, the dark area is the 80% interval and the lighter area is the 95% interval.
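For reference, the two error measures can be computed by hand as below. The numbers are made up for illustration and are not taken from the sales data.

```r
# Hypothetical actual and predicted values.
actual <- c(100, 120, 90, 110)
predicted <- c(95, 130, 100, 105)

e <- actual - predicted  # forecast errors

rmse <- sqrt(mean(e^2))  # penalises large errors more heavily
mae <- mean(abs(e))      # average size of the errors

c(RMSE = rmse, MAE = mae)
```

RMSE is never smaller than MAE, and a large gap between the two points to a few large errors.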

We then apply the same model to CASE B:

CASES C AND D

For cases C and D, we are dealing with count data. One method that delivered good results was the TBATS model. These are the results:


Forecasting Assessment

Jaime Paz

January 28, 2018

PROBLEM