“We would like you to come up with a plan on how you would build a forecast engine on the data provided. There are four cases: A, B, C, and D. Cases A and B are from a similar sample, whereas C and D are from another sample”.
Our plan consists of the following steps:
Prepare the data. First, it is important to collect the data correctly, so the first step is to study the data sets and assess whether they are in the correct format.
Testing the model. This consists of applying criteria to assess whether we have constructed a good model.
Regarding tools, we will use the software “R” and take advantage of the libraries it offers for forecast modeling. The most prominent libraries are forecast and lubridate.
The development environment we will use is RStudio, version 1.1.383.
We are given one Excel file with four worksheets, corresponding to cases A, B, C, and D respectively. CSV is a convenient format for analyzing data in R, so we will export each worksheet to its own file: A.csv, B.csv, C.csv, D.csv.
There are no empty or missing values in the data columns of the data sets. The only exception is the “Sales Date” column; since we only need the columns that contain data, we will handle this in R.
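The check described above can be sketched in base R as follows. This is only an illustration under assumptions: the data frame, its column names (`SalesDate`, `Sales`, `Empty`), and its values are hypothetical stand-ins for an imported worksheet, and dropping entirely empty columns is one possible reading of "we only need the columns with data".

```r
# Hypothetical stand-in for one imported worksheet; real column names may differ.
case_a <- data.frame(
  SalesDate = c("2015-01-01", "2015-01-02", "2015-01-03"),
  Sales     = c(120, 135, 128),
  Empty     = c(NA, NA, NA),  # an entirely empty column, as a spreadsheet export might contain
  stringsAsFactors = FALSE
)

# Count missing values in each column.
na_counts <- colSums(is.na(case_a))
print(na_counts)

# Keep only the columns that contain at least one value.
case_a_clean <- case_a[, na_counts < nrow(case_a), drop = FALSE]
print(names(case_a_clean))
```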
First, we need to construct a data frame from which we can build the time series in R. For this step, we use the “fread” command (from the data.table package), which reads each CSV file into a data frame. After this, we proceed to create the main plot:
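A minimal sketch of this step is shown below. It writes a small synthetic CSV so that it runs on its own; in the report the file would be A.csv. The column names `SalesDate` and `Sales`, the synthetic values, and the choice of `frequency = 365` are assumptions, not taken from the original data. The object name `casea.ts.daily` follows the test output shown later in this report.

```r
library(data.table)  # provides fread()

# Synthetic stand-in for A.csv (assumed columns: SalesDate, Sales).
set.seed(1)
dates <- seq(as.Date("2015-01-01"), by = "day", length.out = 730)
sales <- 800 + 100 * sin(2 * pi * seq_along(dates) / 7) + rnorm(730, sd = 30)
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(SalesDate = dates, Sales = round(sales)),
          csv_path, row.names = FALSE)

# Read the file and make sure the date column has class Date.
casea <- fread(csv_path)
casea$SalesDate <- as.Date(casea$SalesDate)

# Build a daily time series starting on the first day of 2015.
# frequency = 365 is an assumption (one yearly cycle of daily observations).
casea.ts.daily <- ts(casea$Sales, start = c(2015, 1), frequency = 365)

# Main plot of the series.
plot(casea.ts.daily, main = "Case A: daily sales", xlab = "Year", ylab = "Sales")
```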
We have constructed the time series for case A. As we can see, there does not seem to be a trend; nevertheless, we need to consider that our data might not be stationary. There is seasonality as well, which is expected since the data consist of day-to-day sales samples.
One way to check whether our data are stationary is to construct a correlogram, which shows the linear relationship between lagged values.
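As an illustration, a correlogram can be produced with base R's `acf()`. The series below is synthetic (a random walk plus a weekly cycle), and the lag window of 28 is an arbitrary choice for the sketch, not the report's setting.

```r
set.seed(42)
# Synthetic series: a random walk plus a weekly cycle (not the report's data).
x <- ts(cumsum(rnorm(365)) + 10 * sin(2 * pi * (1:365) / 7), frequency = 7)

# Compute and draw the correlogram for the first 28 lags.
corr <- acf(x, lag.max = 28, plot = TRUE)

# Approximate 95% bounds under the null hypothesis of zero autocorrelation.
bound <- qnorm(0.975) / sqrt(length(x))

# Flag the lags (excluding lag 0) whose autocorrelation lies outside the bounds.
significant <- abs(corr$acf[-1]) > bound
print(which(significant))
```

Autocorrelations that stay outside the bounds and decay slowly are typical of a persistent, possibly non-stationary series.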
The data appear strongly correlated at every lag, and the autocorrelations decrease gradually as the lag grows. Positive values indicate persistence in the fluctuations around the mean. The values lie outside the confidence bounds for the null hypothesis that the autocorrelation is zero. This indicates that the series is not just random noise, which we can confirm with the Ljung-Box test:
##
## Box-Ljung test
##
## data: casea.ts.daily
## X-squared = 14033, df = 20, p-value < 2.2e-16
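This confirms that the data are not white noise: the p-value is close to zero, so we reject the null hypothesis that the autocorrelations are all zero. Note that this result does not by itself show that the series is stationary; a slowly decaying correlogram is usually a sign that differencing may be needed.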
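The test output above has the form produced by `Box.test()` from base R. A sketch on synthetic data is given below; `lag = 20` matches the `df = 20` reported above, while the two example series are assumptions made purely for illustration.

```r
set.seed(7)
# Two synthetic series: pure white noise, and a persistent AR(1) process.
noise      <- rnorm(500)
persistent <- arima.sim(model = list(ar = 0.8), n = 500)

# Ljung-Box test with 20 lags on each series.
bt_noise      <- Box.test(noise, lag = 20, type = "Ljung-Box")
bt_persistent <- Box.test(persistent, lag = 20, type = "Ljung-Box")

print(bt_noise)
print(bt_persistent)
```

A very small p-value, as for the persistent series, rejects the hypothesis that the series is white noise.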
Since we are dealing with multiple seasonal cycles, we selected the three models that fitted best during the model-building process:
We select the best model by testing its accuracy and by checking whether its residuals behave like random noise after the modelling. We check the latter with a Ljung-Box test on the residuals of each fitted model. In the output below, both p-values are far below 0.05, which means the residuals still contain some autocorrelation that the models have not captured:
##
## Ljung-Box test
##
## data: Residuals from TBATS
## Q* = 901.44, df = 722, p-value = 5.497e-06
##
## Model df: 8. Total lags used: 730
##
## Ljung-Box test
##
## data: Residuals from Regression with ARIMA(2,1,3) errors
## Q* = 882.48, df = 699, p-value = 2.67e-06
##
## Model df: 31. Total lags used: 730
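Output of this form is printed by `checkresiduals()` from the forecast package. The sketch below applies it to a TBATS model fitted on a synthetic series with a weekly cycle; the data, the series length, and the lag used in the manual test are assumptions for illustration only.

```r
library(forecast)

set.seed(123)
n <- 210
# Synthetic daily series with a weekly cycle (not the report's data).
y <- ts(500 + 50 * sin(2 * pi * (1:n) / 7) + rnorm(n, sd = 10), frequency = 7)

# Fit a TBATS model; parallel fitting is switched off to keep the sketch simple.
fit <- tbats(y, use.parallel = FALSE)

# Print the Ljung-Box test on the residuals; plot = FALSE skips the diagnostic plots.
checkresiduals(fit, plot = FALSE)

# The same kind of test done by hand, without the degrees-of-freedom
# correction that checkresiduals() applies for the fitted parameters.
lb <- Box.test(residuals(fit), lag = 14, type = "Ljung-Box")
print(lb$p.value)
```

A large p-value suggests residuals that are indistinguishable from white noise; a small one suggests structure left in the residuals.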
Our criterion for selecting the best model is to fit each model on a training set, which runs from “2015-01-01” to “2016-12-31”, so that accuracy can be tested on the last year of data:
TBATS Accuracy
##                      ME      RMSE       MAE        MPE      MAPE      MASE       ACF1 Theil's U
## Training set -0.6499263  79.46867  63.03877 -0.8577902  7.201265 0.2546117 0.04764549        NA
## Test set     12.8293535 307.92196 251.85367 -3.9707173 22.599857 1.0172293 0.94437800   2.71139
STL Exponential Smoothing Accuracy
##                       ME      RMSE       MAE        MPE      MAPE     MASE       ACF1 Theil's U
## Training set  -0.06857261  61.92235  48.48242 -0.4495234  5.619985 0.195819 0.06252689        NA
## Test set     -19.49817680 349.61689 293.41426 -7.9566827 27.266790 1.185091 0.94011716  3.295303
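Accuracy tables like the ones above are returned by `accuracy()` from the forecast package when it is given a forecast and a held-out test set. The sketch below reproduces the procedure on a synthetic series: the data, the weekly frequency, and the 336/84 split are assumptions, not the report's 2015 to 2016 training window. It also recomputes RMSE and MAE by hand to show what the columns mean.

```r
library(forecast)

set.seed(2015)
n <- 420
# Synthetic daily series with a weekly cycle (not the report's data).
y <- ts(800 + 120 * sin(2 * pi * (1:n) / 7) + rnorm(n, sd = 40), frequency = 7)

# Hold out the last 84 observations as the test set.
train <- window(y, end = time(y)[336])
test  <- window(y, start = time(y)[337])

# Fit both candidate models on the training set and forecast the test period.
fc_stl   <- stlf(train, h = length(test), method = "ets")
fc_tbats <- forecast(tbats(train, use.parallel = FALSE), h = length(test))

# Accuracy on the training set and on the test set.
acc_stl   <- accuracy(fc_stl, test)
acc_tbats <- accuracy(fc_tbats, test)
print(acc_stl[, c("RMSE", "MAE")])
print(acc_tbats[, c("RMSE", "MAE")])

# RMSE and MAE on the test set, computed directly from the forecast errors.
err <- as.numeric(test) - as.numeric(fc_stl$mean)
rmse_manual <- sqrt(mean(err^2))
mae_manual  <- mean(abs(err))
```

The test-set rows are the relevant ones for model selection, since the training-set rows only measure in-sample fit.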
RMSE (root mean squared error) and MAE (mean absolute error) are of similar magnitude for both models. The choice therefore depends on how confident we are in capturing the actual sales values, and on that basis the best approximation is “Exponential Smoothing with Seasonal Decomposition of Time Series” (STL). In the forecast plot, the dark area represents the 80% prediction interval and the lighter area represents the 95% prediction interval.
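The two shaded bands correspond to the `level` argument of the forecast functions. A sketch using `stlf()` with exponential smoothing on a synthetic series follows; the data and the 14-day horizon are assumptions chosen only for illustration.

```r
library(forecast)

set.seed(99)
# Synthetic daily series with a weekly cycle (not the report's data).
y <- ts(800 + 120 * sin(2 * pi * (1:280) / 7) + rnorm(280, sd = 40), frequency = 7)

# STL decomposition plus exponential smoothing, with 80% and 95% prediction intervals.
fc <- stlf(y, h = 14, method = "ets", level = c(80, 95))

# Interval widths: column 1 holds the 80% bounds, column 2 the 95% bounds.
width_80 <- as.numeric(fc$upper[, 1] - fc$lower[, 1])
width_95 <- as.numeric(fc$upper[, 2] - fc$lower[, 2])

# The plot draws the 80% band in a darker shade than the wider 95% band.
plot(fc)
```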
We then apply the same model to case B:
For cases C and D, we are dealing with count data. One method that delivered good results was the TBATS model. These are the results:
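A minimal sketch of fitting TBATS to count data is given below. The Poisson-distributed series, the single weekly seasonal period, and the 14-day horizon are assumptions for illustration; the report's actual settings for cases C and D are not shown in the source.

```r
library(forecast)

set.seed(2024)
n <- 280
# Synthetic daily counts with a weekly cycle (not the report's data).
lambda <- 20 + 8 * sin(2 * pi * (1:n) / 7)
counts <- msts(rpois(n, lambda), seasonal.periods = 7)

# Fit TBATS and forecast two weeks ahead.
fit_c <- tbats(counts, use.parallel = FALSE)
fc_c  <- forecast(fit_c, h = 14)

print(round(as.numeric(fc_c$mean), 1))
plot(fc_c)
```

TBATS produces real-valued forecasts, so for count data the point forecasts are best read as expected counts rather than as integers.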