Introduction

As part of the Forecasting II class of the Master in Management at HEC Lausanne, we had the opportunity to put into practice the theoretical notions learned in the first part of the course (Forecasting I, prof. Wilhelm) by analyzing the sales data of a Walmart store and building a forecast. The data were extracted from the M5 forecasting competition (a Kaggle competition).

The M Forecasting competitions are a series of events which test the forecasting ability of many different methods and models. This year, the competition focuses on predicting sales data provided by Walmart for several stores in the US. For more information on the competition:
http://kaggle.com/c/m5-forecasting-accuracy

In our project, we will use the M5 - Accuracy competition data to formulate adequate time series models. Important aspects of the competition are Exploratory Data Analysis, Modelling, and Forecast Evaluation:
https://mofc.unic.ac.cy/m5-competition/

Description of the data

We use the data provided by the competition, which consist of four CSV files containing daily information:

Datasets
calendar.csv
sales_train_validation.csv
sample_submission.csv
sell_prices.csv

The scope of the research is to forecast 28 days of sales for one Walmart store in Wisconsin, US. The analyzed time window covers the period from February 2011 to April 2016. We also have access to the sell prices.

Exploratory Data Analysis

To begin with, we analyze the raw data in order to choose the most suitable forecasting model.

Since the sales_train_validation.csv dataset is structured as sales per day for each individual item, we have to aggregate the items to get a better picture of the overall sales per day.
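As an illustration, a minimal sketch of this aggregation, assuming the item-level data have been read into sales_train_validation and that the daily sales columns follow the M5 naming d_1, d_2, … (both are assumptions about the data preparation):

library(dplyr)

# Sum the unit sales over all items for each day. Selecting the daily
# columns by their "d_" prefix is an assumption about the file layout.
daily_sales <- sales_train_validation %>%
  select(starts_with("d_")) %>%
  colSums()

sales <- ts(daily_sales)  # aggregated store-level daily sales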

Eliminating the level shift

As we can see in the graph above, there is an abrupt level shift on 2012-10-26. One assumption could be that the Walmart store went through a refurbishment (slow decrease in sales) before fully reopening to the public (sales premium). Since future forecasts are based on past observations, this situation could harm our model's accuracy. We therefore decided to eliminate it by raising the pre-shift sales level. The steps applied to eliminate this difference were the following (a code sketch follows the table):

Step Procedure
1 Split the times series: before and after the level shift
2 Calculate the regression slope for each part
3 Compute the difference between both slopes
4 Update our data including the computed difference
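One way to implement this correction is sketched below: sales is the aggregated daily series and shift_idx, the position of 2012-10-26 in that series, is a hypothetical value used only for illustration. The pre-shift part is raised so that the two regression lines meet at the break.

# 1. Split the series at the level shift (hypothetical index).
shift_idx <- 610
pre  <- 1:shift_idx
post <- (shift_idx + 1):length(sales)

# 2. Fit a linear trend to each part.
fit_pre  <- lm(sales[pre]  ~ pre)
fit_post <- lm(sales[post] ~ post)

# 3. Difference between the two fitted lines at the break point.
gap <- predict(fit_post, newdata = data.frame(post = shift_idx + 1)) -
       predict(fit_pre,  newdata = data.frame(pre  = shift_idx + 1))

# 4. Raise the pre-shift observations by the computed difference.
sales_corrected <- sales
sales_corrected[pre] <- sales[pre] + gap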

After performing the correction, we can observe that the pre-shift and post-shift regression lines cross at the level shift. Note that the time series now looks more stable and smooth. We can therefore start diving deeper into the analysis.

Overview

To better visualize the store structure, we began by plotting the number of products per department. This allowed us to see that the FOODS_3 department holds the highest number of products.
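A possible way to obtain such a count, assuming the item-level table is stored in sales_train_validation with the M5 column names item_id and dept_id:

library(dplyr)
library(ggplot2)

# Count the distinct items in each department and plot the result.
sales_train_validation %>%
  distinct(item_id, dept_id) %>%
  count(dept_id) %>%
  ggplot(aes(x = dept_id, y = n)) +
  geom_col() +
  labs(x = "Department", y = "Number of products")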

Time series

To be able to forecast the data, we will create time series and look for specific characteristics that can be identified, such as seasonality and trend.

The graph above clearly shows a seasonal pattern in the daily sales, which repeats throughout the years. Additionally, we can once again notice the strong decrease in sales in 2012, which we attribute to a refurbishment. One can also observe that sales increase over the years, suggesting an upward trend. To confirm the latter, we decided to view the data from a weekly perspective.

The results clearly highlight an unequal sales volume across weekdays. Moreover, we observe that the weekends (Saturday and Sunday, here in green) are the days with the highest sales. This makes sense, since these are the days that working people usually reserve for grocery shopping.

Lastly, in order to confirm our previous assumptions, we plotted the yearly sales.

Indeed, we can see an upward trend that is almost linear. Additionally, we focused only on the data up to 2015, since the 2016 data are incomplete (only the first four months are available) and could bias this graph.

We then investigated whether special events impacted the sales of the store.

As we can see, the sales peaks do not correspond to the special events. However, there is a clear increase in sales in the days preceding each red line. One can argue that people usually buy gifts a couple of days before an event (and not on the day itself). Still, those peaks are not abnormally high compared to the others. Therefore, we conclude that special events are not very important for Walmart's sales, which could make sense as people usually turn to premium brands for gifts.

Sales per Department

To dive deeper into the data, we decided to conduct an analysis per department, creating a time series for each of them. We will focus on the weekly sales.

These graphs capture and confirm the weekly seasonality, which is present within each department as well. Moreover, we can observe that FOODS_2 has no data before the end of 2012. One can assume that the shop underwent a renovation, explaining the emergence of a third food department.

Another important insight from these graphs is that each department seems to follow its own trend. For example, FOODS_3 has seen an important increase in sales and variance through the years, and FOODS_2 appears to have an increasing trend at the end of the considered window, whereas the HOBBIES departments show a more or less constant variance (despite a few peaks) and stagnating sales. These results highlight the importance of considering hierarchical modelling for the forecasting: since departments seem to follow different patterns, it would make sense to fit separate models at the lower levels.

ACF & PACF of the Overall Sales

The autocorrelation function measures the strength of the relationship between observations separated by k lags. The formula of the autocorrelation function is:

\[ r_{k} = \frac{\sum\limits_{t=k+1}^T (y_{t}-\bar{y})(y_{t-k}-\bar{y})} {\sum\limits_{t=1}^T (y_{t}-\bar{y})^2}, \]

The partial autocorrelation function measures the correlation between two observations separated by k lags, after removing the effect of the observations in between.

We focused on the analysis of the overall sales, since the goal of this project is to forecast at that level.

The ACF plot clearly highlights the weekly seasonal pattern present in the dataset, since the same peaks appear at every multiple of lag 7. Furthermore, the PACF plot shows some correlation between the different lags given the observations in between. Therefore, one can conclude that this time series is not stationary.
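A minimal sketch of these diagnostics, assuming sales holds the aggregated daily series:

library(forecast)

ggAcf(sales)    # autocorrelation function of the aggregated daily sales
ggPacf(sales)   # partial autocorrelation function

# Portmanteau test for autocorrelation (Box-Pierce by default).
Box.test(sales)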

#> 
#>  Box-Pierce test
#> 
#> data:  sales
#> X-squared = 628, df = 1, p-value <2e-16

The Box-Pierce test confirms the results of the autocorrelation analysis. The p-value is very low, allowing us to reject H0, namely that the data are independently distributed. In other words, we reject the hypothesis that the observations are uncorrelated. This supports the lack of stationarity of the time series.

Modelling

In the following section, we will focus on transforming the data into a hierarchical form, with the “hts” package. As mentioned earlier, the data are organized in 4 levels:

Levels overview
Level 0 : Store
Level 1 : Category
Level 2 : Department
Level 3 : Items

Moreover, in order to evaluate the reliability of our forecasts, we will divide the data into training and test sets.
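As an illustration, a simple way to hold out the last 28 days of the aggregated series as a test set (the object names train_sales and test_sales are hypothetical):

h <- 28                      # forecast horizon required by the competition
n <- length(sales)

# Keep all but the last 28 days for training; hold the rest out for testing.
train_sales <- ts(sales[1:(n - h)])
test_sales  <- sales[(n - h + 1):n]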

Bottom-up approach

To forecast the hierarchical time series, we decided to focus on the bottom-up approach and then compare our results to an aggregate model based on total sales. This methodology generates forecasts at the lowest level and then adds them up to recreate the forecasts at the upper levels of the hierarchy. Thus, we begin by forecasting at the item level and then build it up to the store level (level 0).

The main reason for choosing this methodology is that little information is lost through aggregation, since the forecasts are built at the most granular level.

ETS and ARIMA models seem appropriate for such time series. Therefore, we will compare a “normal” ARIMA fitted on the aggregate (Autoarima) with its hierarchical counterpart (ARIMA), in addition to the hierarchical exponential smoothing (ETS) model.

In the following image we can visualize the hierarchy.


At any point in time, the bottom level contains \(m = 3049\) series (one per item), where the quantity \(m\) is necessarily larger than the quantity \(n\) of series at level 2 (departments, \(n=7\)).

  A) For the bottom level, we use the following notation to express the total as the sum of the 3049 SKUs:

\[Y_t= Y_{F_1-1}+...+Y_{F_2-1}+...+Y_{HH_1-1}+...+Y_{HH_2-1}+...+Y_{H_1-1}+...+Y_{H_2-1}+...+Y_{H_2-515}\]

  B) For level 2, we have 7 equations (one per department). Here are the equations for the first 3 (FOODS) departments:

\[Y_{F_1}= Y_{F_1-1}+Y_{F_1-2}+Y_{F_1-3}+Y_{F_1-4}+...\]

\[Y_{F_2}= Y_{F_2-1}+Y_{F_2-2}+Y_{F_2-3}+.... \]

\[Y_{F_3}= Y_{F_3-1}+Y_{F_3-2}+Y_{F_3-3}+... \]

  C) For level 1:

for the Foods category with 3 departments \[Y_{Food}= Y_{F_1}+Y_{F_2}+Y_{F_3} \]

for the Household category with 2 departments \[Y_{HH}= Y_{HH_1}+Y_{HH_2}\]

and, last, the Hobbies category \[Y_{Hobbies}= Y_{H_1}+Y_{H_2}\]

  D) Finally, at the top level, we have \[Y_t= Y_{Food}+Y_{HH}+Y_{Hobbies}\]

Then, we constructed an \(n\times m\) summing matrix, denoted \(S\), which shows how the bottom level is aggregated:

\[ Y_t= Sb_t\]

We then created the time series and established the hierarchical structure, so as to capture the information from the bottom to the top level.
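A minimal sketch of how such a structure can be built with the hts package. The matrix bts (one column per item, ordered by category and department) and the vector items_per_dept are hypothetical names describing the assumed data preparation:

library(hts)

# bts: a (days x 3049) matrix of item-level daily sales, with columns ordered
# so that they match the node structure below.
# items_per_dept: number of items in each of the 7 departments (sums to 3049).
wi1.h <- hts(
  bts,
  nodes = list(
    3,               # level 1: 3 categories (FOODS, HOBBIES, HOUSEHOLD)
    c(3, 2, 2),      # level 2: number of departments per category
    items_per_dept   # level 3: number of items per department
  )
)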

Models

Hierarchical Arima - Bottom Up model

# Bottom-up forecasts: fit an ARIMA model to each bottom-level series, then
# aggregate the 28-day forecasts up the hierarchy.
wi1.bu.arima <- forecast(
  wi1.h,
  h = 28,
  method = "bu",
  fmethod = "arima",
  keep.fitted = TRUE,   # keep in-sample fitted values for the residual analysis
  stepwise = FALSE      # passed on to auto.arima() for a more thorough model search
)
agg_arima <- aggts(wi1.bu.arima)  # forecasts aggregated at every level of the hierarchy

Hierarchical Exponential smoothing - Bottom Up model

# Bottom-up forecasts with exponential smoothing (ETS) models at the item level.
wi1.bu.ets <- forecast(
  wi1.h,
  h = 28,
  method = "bu",
  fmethod = "ets",
  keep.fitted = TRUE   # keep in-sample fitted values for the residual analysis
)
agg_ets <- aggts(wi1.bu.ets)  # forecasts aggregated at every level of the hierarchy

Arima model on total sales

# Automatic ARIMA fitted directly on the aggregated (store-level) training sales.
wi1.bu.autoarima <- auto.arima(auto.arima.training)

Arima models by category

# Category-level ARIMA models: aggregate the training items of each category
# and fit auto.arima() on the resulting daily series.
# Columns 7:1891 hold the daily unit sales (the first six columns are identifiers).
train_1 <- training %>%
  filter(cat_id == "HOBBIES")
auto.arima.training_1 <- ts(colSums(train_1[, 7:1891]))
wi1.bu.autoarima_1 <- auto.arima(auto.arima.training_1)
wi1.bu.autoarima_1 <- forecast(wi1.bu.autoarima_1, h = 28)

train_2 <- training %>%
  filter(cat_id == "HOUSEHOLD")
auto.arima.training_2 <- ts(colSums(train_2[, 7:1891]))
wi1.bu.autoarima_2 <- auto.arima(auto.arima.training_2)
wi1.bu.autoarima_2 <- forecast(wi1.bu.autoarima_2, h = 28)

train_3 <- training %>%
  filter(cat_id == "FOODS")
auto.arima.training_3 <- ts(colSums(train_3[, 7:1891]))
wi1.bu.autoarima_3 <- auto.arima(auto.arima.training_3)
wi1.bu.autoarima_3 <- forecast(wi1.bu.autoarima_3, h = 28)

Diagnostics

Now that the models have been built, we are going to compare different loss functions in order to find the model that minimises them best. This will allow us to pick the best one for forecasting. The metrics chosen are listed below (a computation sketch follows the list):

  • The mean square error (MSE): \[\text{MSE} = \text{mean}(e_{t}^2)\]

  • The root mean square error (RMSE): \[\text{RMSE} = \sqrt{\text{MSE}}\]

  • The mean absolute percentage error (MAPE): \[\text{MAPE} = \text{mean}(|P_{t}|)\] where \[ P_{t} = 100e_{t}/y_t \]

  • The mean absolute error (MAE): \[\text{MAE} = \text{mean}(|e_{t}|)\]
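As an illustration, these metrics can be computed from the 28-day point forecasts and the held-out test values as sketched below (the object names fc and test_sales are hypothetical):

# Forecast errors over the 28-day test window (fc is a forecast object,
# test_sales the held-out observations).
e <- as.numeric(test_sales) - as.numeric(fc$mean)

mse  <- mean(e^2)
rmse <- sqrt(mse)
mape <- mean(abs(100 * e / as.numeric(test_sales)))
mae  <- mean(abs(e))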

methods   arima      ets        autoarima   autoarima_2
MSE       8.54e+05   8.74e+05   3.31e+05    3.66e+05
RMSE      9.24e+02   9.35e+02   5.75e+02    6.05e+02
MAPE      2.19e-01   2.28e-01   1.02e-01    1.10e-01
MAE       2.14e-01   2.28e-01   1.03e-01    1.13e-01

Overall, one can notice that the Autoarima model (without hierarchical time series) is the best, since it minimises all the metrics (it has the lowest value for every one of them). Among the hierarchical models, the ARIMA one performs better than the ETS one. This result is quite surprising, as we were expecting the hierarchical ARIMA to be the most accurate.

On an additional note, the Autoarima was built both at the store level (autoarima) and at the category level (autoarima_2). As we can see, the overall model still performs better. The Autoarima was also computed at the SKU level; however, that model was even less accurate and was therefore disregarded.

This can be explained by the fact that the product sales (at the lowest levels) are in some way correlated. The disadvantage of the hierarchical models is that they do not seem to capture the whole picture correctly. As our dataset contains more than 3000 products, there are more than 3000 individual models, each one with its own error distribution. Therefore, when we aggregate them to recreate the total sales, we are summing all these deviations, which could explain this mixed result. However, hierarchical models could definitely be more attractive if we were working with several shops showing different sales patterns at the department level.

Residuals Analysis

We will review the residuals in order to assess whether the models captured the data patterns. For the analysis to be sound, the residuals must have the following characteristics:

Characteristics
1 The average of the residuals must be zero.
2 They must have no correlation between them.
3 Constant variance (recommended)
4 Normally distributed (recommended)

If a model's residuals do not meet the first two properties, the model should be improved.
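A minimal sketch of how these checks can be run for the Autoarima model; residuals.autoarima is assumed to hold its in-sample residuals (analogous calls apply to the other two models):

library(forecast)

# Plot the residuals over time, their ACF and a histogram, and run a
# Ljung-Box test on the fitted model.
checkresiduals(wi1.bu.autoarima)

# Portmanteau test of residual autocorrelation over 12 lags.
Box.test(residuals.autoarima, lag = 12, type = "Ljung-Box")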

Hierarchical Arima - Bottom Up model

#> 
#>  Box-Ljung test
#> 
#> data:  residuals.arima
#> X-squared = 79, df = 12, p-value = 6e-12

Hierarchical Exponential smoothing - Bottom Up model

#> 
#>  Box-Ljung test
#> 
#> data:  residuals.ets
#> X-squared = 79, df = 12, p-value = 6e-12

Arima model on total sales

#> 
#>  Box-Ljung test
#> 
#> data:  residuals.autoarima
#> X-squared = 33, df = 12, p-value = 0.001

For the three models, one can observe that the normality assumption is not fully satisfied. More precisely, the residuals follow similar distributions that are right-skewed. Additionally, their distributions are wider than what would be expected under a Normal distribution.

Moreover, the ACF plots reveal that the residual autocorrelations are significantly different from 0 at some lags. Once again, the Autoarima performs better than the hierarchical models, which show significant seasonality in their lags. We reach the same conclusion by looking at the residuals themselves, which are far from stationary.

The ARIMA models (hierarchical and normal) also have residuals fluctuating around 0, but with large bounds. This is better than the ETS model, whose average residual is far from 0. Additionally, when computing the Ljung-Box test on the residuals of each model, all p-values are extremely small. This allows us to reject the H0 hypothesis, i.e. the hypothesis that the residuals are uncorrelated. In other words, the residuals of these models are all correlated. The p-value of the Autoarima model is the highest, implying that it is doing a slightly better job than the other two, even though it remains unsatisfactory.

This analysis suggests that all three models are of poor quality. However, the best among them remains the Autoarima model; we will therefore forecast using this model.

Forecasting

Plotting the Results

Since we uncovered that the best model is the non-hierarchical ARIMA, we are going to do the forecast using that model.

As we can see, the non-hierarchical model is indeed doing a good job in predicting. It is not replicating the time series but actually forecasting. The use of this model seems appropriate to forecast the sales of this shop.

The fact that we are only predicting 28 days ahead does not allow many comments on the longer-term behaviour, but the range of the predictions (between 2797.555 and 4998.776) is narrower than that of the observed series, and there is a decreasing trend towards the end.

To confirm that the non-hierarchical model is actually doing a better job, we also computed the predictions of the two other models.

As we can see from these graphs, both the hierarchical ARIMA and the ETS do a poor job at forecasting this time series. Indeed, their predictions follow an almost flat line centered around 4’000 in both cases, which is not representative at all of the observations in the time series.

Prediction Intervals

Now that we have confirmed that the Autoarima model is doing a better job in predicting, we decided to compute its 95% prediction intervals. We focused this analysis exclusively on the chosen model.
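For instance, the intervals can be obtained directly from the fitted model with the forecast package, as sketched below (assuming wi1.bu.autoarima still holds the fitted auto.arima model):

library(forecast)
library(ggplot2)

# 28-day ahead forecast with a 95% prediction interval.
fc_autoarima <- forecast(wi1.bu.autoarima, h = 28, level = 95)
autoplot(fc_autoarima)   # point forecasts with the interval band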

As we can see, the prediction intervals are quite large, which implies high uncertainty regarding the predictions. This makes sense since, as we concluded previously, even if the Autoarima does a better job than the others, it is still not doing a great one.

Discussion and Conclusion

The goal of this project was to provide a 28-day forecast (point forecast and probabilistic forecast for 95% prediction intervals) for the aggregated unit sales of a Walmart Store, located in Wisconsin, US.

To begin, we conducted an Exploratory Data Analysis. The latter revealed that the store could be divided into 7 departments: 3 for the FOODS category, 2 for the HOBBIES one and 2 for the HOUSEHOLD one. Additionally, by plotting the daily sales, we observed a level shift in the dataset, which we assumed to be caused by a renovation. To improve the quality of the forecast, we decided to correct this level shift.

Furthermore, this analysis uncovered a seasonal pattern in the dataset, as well as an increasing trend over the years. The weekend also appears to be the period with the highest sales across all years. The seasonal pattern can be observed in the aggregated sales as well as in the sales of each department, and the analysis of the ACF and PACF plots of the overall sales confirmed it. A Box-Pierce test also corroborated the lack of stationarity of the time series.

Then, regarding the modelling, we transformed the time series into a hierarchical one. This allowed us to compare three models: the ARIMA, the hierarchical ARIMA and the ETS models, the last two being based on hierarchical time series. We also created training and test sets to be able to evaluate the reliability of our forecasts. For the models based on hierarchical time series, we chose the bottom-up approach.

The diagnostics revealed that the three models were of rather poor quality: the normality assumption for the residuals was not satisfied and the residuals presented patterns. However, among all of them, the non-hierarchical ARIMA model appeared to be the best one, since it minimised all the loss metrics considered (MSE, RMSE, MAPE and MAE). Thus, we decided to conduct our forecast using this model.

By computing the predictions, we indeed realized that the Autoarima (5, 1, 1) did a better job than the other two models, since it was actually forecasting and not just replicating the values of the time series. Moreover, the range of the predictions appears to be smaller than the range of the observations. On the other hand, the hierarchical ARIMA and ETS models predicted values centered around 4’000, following an almost flat line, which confirmed that they were very poor models. Regarding the prediction intervals, we focused the analysis only on the best model, the Autoarima one. It pointed out that the range of uncertainty was quite large, which makes sense since the model is not doing the best job.

The results of this report were quite surprising to us, as we would have imagined that the hierarchical models would perform better than the simpler Autoarima. However, we believe that the bottom-up approach does a better job at forecasting at lower, more granular levels, and that it had trouble capturing the bigger picture. Conversely, the non-hierarchical ARIMA seems better suited to provide more general results at an aggregated level, which is the level of interest of this project.

This report suggests that the models should definitely be refined in order to provide better results. Further research could include trying the top-down approach, which would probably provide better forecasts at an aggregated level. Another approach that could be tested is to manually remove the deterministic part of the time series (trend + seasonality) and build a stochastic model on the residuals.