Photo by Fabio Bracht (@bracht) on Unsplash.

1 Introduction

Welcome to an extensive Exploratory Data Analysis for the 5th Makridakis forecasting competition (M5)! This notebook will grow over the coming days and weeks into a deep dive into all the relevant aspects of this challenge. Here’s all you need to know to get started:

Some Background: the Makridakis competitions (or M-competitions), organised by forecasting expert Spyros Makridakis, aim to provide a better understanding and advancement of forecasting methodology by comparing the performance of different methods in solving a well-defined, real-world problem. The first M-competition was held in 1982. The fourth competition (M4) ran in 2018 and featured “100,000 time series and 61 forecasting methods” (source in link). According to forecasting researcher and practitioner Rob Hyndman, the M-competitions “have had an enormous influence on the field of forecasting. They focused attention on what models produced good forecasts, rather than on the mathematical properties of those models”. This empirical approach is very similar to Kaggle’s trademark way of having the best machine learning algorithms engage in intense competition on diverse datasets. M5 is the first M-competition to be held on Kaggle.

The goal: We have been challenged to predict sales data provided by the retail giant Walmart 28 days into the future. This competition will run in 2 tracks: in addition to forecasting the values themselves in the Forecasting competition, we are simultaneously tasked with estimating the uncertainty of our predictions in the Uncertainty Distribution competition. Both competitions will have the same 28-day forecast horizon.

The data: We are working with 42,840 hierarchical time series. The data were obtained in the 3 US states of California (CA), Texas (TX), and Wisconsin (WI). “Hierarchical” here means that data can be aggregated on different levels: item level, department level, product category level, store level, and state level. The sales information covers the period from Jan 2011 to June 2016. In addition to the sales numbers, we are also given corresponding data on prices, promotions, and holidays. Note that we have been warned that most of the time series contain zero values.

The data comprises 3,049 individual products from 3 categories and 7 departments, sold in 10 stores in 3 states. The hierarchical aggregation captures the combinations of these factors: for instance, there is 1 time series for total sales, 3 time series for sales per state, and so on. The most granular level, sales per product per store, accounts for the largest number of series (30,490).

The training data comes in the shape of 3 separate files:

  • sales_train_validation.csv: the daily unit sales per item and store, i.e. the time series we are asked to forecast.

  • sell_prices.csv: the weekly selling price of each item in each store.

  • calendar.csv: the dates together with related features such as weekday, month, events, and SNAP flags.

The metrics: The point forecasts of the Forecasting track are evaluated with a Weighted Root Mean Squared Scaled Error (WRMSSE), while the Uncertainty Distribution track is scored with a Weighted Scaled Pinball Loss (WSPL) over a set of prediction quantiles. In both cases, the individual series are weighted by their recent dollar sales.

Need some inspiration? Beyond this notebook, I recommend checking out other recent Kaggle time series competitions such as Wiki Traffic Forecasting, Favorita Grocery Sales Forecasting, Recruit Restaurant, or ASHRAE, and reading the Kernels and top-ranking write-ups. Also, note that a few years ago Kaggle ran two Walmart recruiting competitions with the goal to forecast sales: Recruiting 1 and Recruiting 2. Last but not least, there was a relevant Playground competition a few years ago.

Let’s get started!

2 Preparations

2.1 Load libraries

We load a range of libraries for general data wrangling and general visualisation together with more specialised tools for time series forecasting.
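The exact setup chunk is not rendered here; the following is a minimal sketch of such a block, inferred from the package warnings below and the functions used later in the notebook (e.g. theme_tufte from ggthemes, plot_layout from patchwork):

# general data wrangling & visualisation
library(tidyverse)    # dplyr, tidyr, ggplot2, stringr, ...
library(lubridate)    # date handling (wday, month, floor_date)
library(vroom)        # fast reading of large csv files
library(skimr)        # compact data summaries

# more specialised visualisation tools
library(ggthemes)     # theme_tufte
library(patchwork)    # arranging multiple ggplots in one layout
library(plotly)       # interactive figures via ggplotly
library(crosstalk)    # linking interactive widgets
library(alluvial)
library(ggforce)
library(ggridges)
library(GGally)
library(wesanderson)  # colour palettes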

## Warning: package 'vroom' was built under R version 3.6.3
## Warning: package 'skimr' was built under R version 3.6.3
## Warning: package 'alluvial' was built under R version 3.6.3
## Warning: package 'ggforce' was built under R version 3.6.3
## Warning: package 'ggridges' was built under R version 3.6.3
## Warning: package 'GGally' was built under R version 3.6.3
## Warning: package 'wesanderson' was built under R version 3.6.3
## Warning: package 'rlang' was built under R version 3.6.3

2.2 Helper functions

A short helper function to compute binomial confidence intervals.
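The function itself is not rendered here; below is a minimal sketch of what such a helper could look like, based on base R's binom.test (the original implementation may differ):

# hypothetical sketch of a binomial confidence interval helper
binom_ci <- function(x, n, conf_level = 0.95) {
  # x: number of successes, n: number of trials
  ci <- binom.test(x, n, conf.level = conf_level)$conf.int
  tibble(mean = x / n, lwr = ci[1], upr = ci[2])
}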

3 Quick Look: File structure and content

As a first step, let’s have a quick look at the data sets using the head, summary, and glimpse tools where appropriate.
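For reference, here is a minimal sketch of how the three files might be read and inspected (file names as in the M5 data description; the original reading code is not rendered here, and the paths may differ):

# read the three data files (paths are placeholders)
train    <- vroom("sales_train_validation.csv", delim = ",")
prices   <- vroom("sell_prices.csv", delim = ",")
calendar <- vroom("calendar.csv", delim = ",")

glimpse(calendar)   # compact look at columns and types
summary(prices)     # numerical summary of the price table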

3.1 Training sales data

Here are the first 10 columns and rows of our training sales data:

id item_id dept_id cat_id store_id state_id d_1 d_2 d_3 d_4
HOBBIES_1_001_CA_1_validation HOBBIES_1_001 HOBBIES_1 HOBBIES CA_1 CA 0 0 0 0
HOBBIES_1_002_CA_1_validation HOBBIES_1_002 HOBBIES_1 HOBBIES CA_1 CA 0 0 0 0
HOBBIES_1_003_CA_1_validation HOBBIES_1_003 HOBBIES_1 HOBBIES CA_1 CA 0 0 0 0
HOBBIES_1_004_CA_1_validation HOBBIES_1_004 HOBBIES_1 HOBBIES CA_1 CA 0 0 0 0
HOBBIES_1_005_CA_1_validation HOBBIES_1_005 HOBBIES_1 HOBBIES CA_1 CA 0 0 0 0
HOBBIES_1_006_CA_1_validation HOBBIES_1_006 HOBBIES_1 HOBBIES CA_1 CA 0 0 0 0
HOBBIES_1_007_CA_1_validation HOBBIES_1_007 HOBBIES_1 HOBBIES CA_1 CA 0 0 0 0
HOBBIES_1_008_CA_1_validation HOBBIES_1_008 HOBBIES_1 HOBBIES CA_1 CA 12 15 0 0
HOBBIES_1_009_CA_1_validation HOBBIES_1_009 HOBBIES_1 HOBBIES CA_1 CA 2 0 7 3
HOBBIES_1_010_CA_1_validation HOBBIES_1_010 HOBBIES_1 HOBBIES CA_1 CA 0 0 1 0

We find:

  • There is one column each for the IDs of item, department, category, store, and state; plus a general ID that is a combination of the other IDs plus a flag for validation.

  • The sales per date are encoded as columns starting with the prefix d_. Those are the number of units sold per day (not the total amount of dollars).

  • We already see that there are quite a lot of zero values.

This data set has too many columns and rows to display them all:

## [1]  1919 30490

All IDs are marked as “validation”. This refers to the initial validation period of the competition, before we ultimately predict a different 28-day period for the final evaluation.
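One way to confirm this, assuming the train tibble from the reading sketch above (hypothetical; not necessarily the original code):

train %>% 
  distinct(id) %>% 
  mutate(dset = str_extract(id, "[^_]+$")) %>% 
  count(dset)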

## # A tibble: 1 x 2
##   dset           n
##   <chr>      <int>
## 1 validation 30490

3.2 Sales prices

This data set gives us the weekly selling price per item and store:

store_id item_id wm_yr_wk sell_price
CA_1 HOBBIES_1_001 11325 9.58
CA_1 HOBBIES_1_001 11326 9.58
CA_1 HOBBIES_1_001 11327 8.26
CA_1 HOBBIES_1_001 11328 8.26
CA_1 HOBBIES_1_001 11329 8.26
CA_1 HOBBIES_1_001 11330 8.26
CA_1 HOBBIES_1_001 11331 8.26
CA_1 HOBBIES_1_001 11332 8.26
CA_1 HOBBIES_1_001 11333 8.26
CA_1 HOBBIES_1_001 11334 8.26
##    store_id           item_id             wm_yr_wk       sell_price     
##  Length:6841121     Length:6841121     Min.   :11101   Min.   :  0.010  
##  Class :character   Class :character   1st Qu.:11247   1st Qu.:  2.180  
##  Mode  :character   Mode  :character   Median :11411   Median :  3.470  
##                                        Mean   :11383   Mean   :  4.411  
##                                        3rd Qu.:11517   3rd Qu.:  5.840  
##                                        Max.   :11621   Max.   :107.320

We find:

  • We have the store_id and item_id to link this data to our training and validation data; the wm_yr_wk column additionally links it to the calendar (see the sketch after these bullets).

  • Prices range from $0.01 to a bit more than $107.
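
A hedged sketch of how the weekly prices could be expanded to daily granularity and attached to the sales data, assuming the prices and calendar tibbles from the reading sketch above (hypothetical; not necessarily the original code):

# expand weekly prices to one row per day via the calendar's d / wm_yr_wk mapping
daily_prices <- prices %>% 
  left_join(calendar %>% select(d, wm_yr_wk), by = "wm_yr_wk")

# a long-format sales table with a d column could then be joined like this:
# sales_long %>% left_join(daily_prices, by = c("store_id", "item_id", "d"))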

3.3 Calendar

The calendar data gives us date features such as weekday, month, or year, alongside 2 sets of event features (name and type) and a binary SNAP food stamp flag for each of the 3 states:

date wm_yr_wk weekday wday month year d event_name_1 event_type_1 event_name_2 event_type_2 snap_CA snap_TX snap_WI
2011-01-29 11101 Saturday 1 1 2011 d_1 NA NA NA NA 0 0 0
2011-01-30 11101 Sunday 2 1 2011 d_2 NA NA NA NA 0 0 0
2011-01-31 11101 Monday 3 1 2011 d_3 NA NA NA NA 0 0 0
2011-02-01 11101 Tuesday 4 2 2011 d_4 NA NA NA NA 1 1 0
2011-02-02 11101 Wednesday 5 2 2011 d_5 NA NA NA NA 1 0 1
2011-02-03 11101 Thursday 6 2 2011 d_6 NA NA NA NA 1 1 1
2011-02-04 11101 Friday 7 2 2011 d_7 NA NA NA NA 1 0 0
2011-02-05 11102 Saturday 1 2 2011 d_8 NA NA NA NA 1 1 1
## Observations: 1,969
## Variables: 14
## $ date         <date> 2011-01-29, 2011-01-30, 2011-01-31, 2011-02-01, 2011-...
## $ wm_yr_wk     <dbl> 11101, 11101, 11101, 11101, 11101, 11101, 11101, 11102...
## $ weekday      <chr> "Saturday", "Sunday", "Monday", "Tuesday", "Wednesday"...
## $ wday         <dbl> 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, ...
## $ month        <dbl> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ year         <dbl> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, ...
## $ d            <chr> "d_1", "d_2", "d_3", "d_4", "d_5", "d_6", "d_7", "d_8"...
## $ event_name_1 <chr> NA, NA, NA, NA, NA, NA, NA, NA, "SuperBowl", NA, NA, N...
## $ event_type_1 <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Sporting", NA, NA, NA...
## $ event_name_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ event_type_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ snap_CA      <dbl> 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, ...
## $ snap_TX      <dbl> 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, ...
## $ snap_WI      <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, ...
##       date               wm_yr_wk       weekday               wday      
##  Min.   :2011-01-29   Min.   :11101   Length:1969        Min.   :1.000  
##  1st Qu.:2012-06-04   1st Qu.:11219   Class :character   1st Qu.:2.000  
##  Median :2013-10-09   Median :11337   Mode  :character   Median :4.000  
##  Mean   :2013-10-09   Mean   :11347                      Mean   :3.997  
##  3rd Qu.:2015-02-13   3rd Qu.:11502                      3rd Qu.:6.000  
##  Max.   :2016-06-19   Max.   :11621                      Max.   :7.000  
##      month             year           d             event_name_1      
##  Min.   : 1.000   Min.   :2011   Length:1969        Length:1969       
##  1st Qu.: 3.000   1st Qu.:2012   Class :character   Class :character  
##  Median : 6.000   Median :2013   Mode  :character   Mode  :character  
##  Mean   : 6.326   Mean   :2013                                        
##  3rd Qu.: 9.000   3rd Qu.:2015                                        
##  Max.   :12.000   Max.   :2016                                        
##  event_type_1       event_name_2   event_type_2      snap_CA      
##  Length:1969        Mode:logical   Mode:logical   Min.   :0.0000  
##  Class :character   NA's:1969      NA's:1969      1st Qu.:0.0000  
##  Mode  :character                                 Median :0.0000  
##                                                   Mean   :0.3301  
##                                                   3rd Qu.:1.0000  
##                                                   Max.   :1.0000  
##     snap_TX          snap_WI      
##  Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000  
##  Mean   :0.3301   Mean   :0.3301  
##  3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000

We find:

  • The calendar has all the relevant dates, weekdays, and months, plus the binary SNAP flags per state and the event columns. Note that event_name_2 and event_type_2 were parsed as logical columns here because they are (almost) entirely NA.

4 Visual Overview: Interactive time series plots

We will start our visual exploration by investigating a number of time series plots on different aggregation levels. Here is a short helper function to transform our wide data into a long format with a dates column in date format:
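(A sketch of such a helper is given below, assuming min_date holds the first calendar date, 2011-01-29, and that the daily columns are named d_1, d_2, …; the original implementation may differ.)

min_date <- as.Date("2011-01-29")   # first day in the calendar

# hypothetical helper name; turns the wide d_* columns into (dates, sales) rows
extract_ts <- function(df) {
  df %>% 
    pivot_longer(starts_with("d_"), names_to = "dates", values_to = "sales") %>% 
    mutate(dates = as.integer(str_remove(dates, "d_")),
           dates = min_date + dates - 1)
}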

4.1 Individual item-level time series - random sample

Here we will sample 50 random time series from our training sample to get an idea about the data.

In the following plot, you can select the id of a time series (which is a concatenation of the item and store IDs) to display only the graphs that you’re interested in.

Currently, I don’t see a way to avoid having all time series plotted at the beginning. Select 1 from the input field to begin:

We find:

  • Most time series have pretty low daily counts, alongside the large percentage of zero values we had already noticed. On the one hand, this suggests that spikes are not going to be overly pronounced. On the other hand, it indicates that accurate forecasts will have to deal with quite a lot of noise.

  • Some of our sample time series start in the middle of the time range, and some have long gaps in between. This is an additional challenge.

Notes:

  • I have assigned one of 7 colour-blind friendly colours from the ColorBrewer “Set2” palette to each time series (cycling through the selection sequentially). Those will typically provide good contrast for a reasonable number of simultaneous plots, but occasionally two time series may end up with the same colour.

  • I got the inspiration to use the crosstalk package for building this plot from the excellent dashboard kernel by Saba Tavoosi. Make sure to check it out; a rough sketch of the general pattern follows below.
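
Here is a rough, hypothetical sketch of the crosstalk + ggplotly pattern behind such a filterable plot (not the original implementation; see the linked kernel for a complete example). It assumes the train tibble and the extract_ts helper sketched above:

# sample 50 series, reshape to long format, and share them with crosstalk
set.seed(4321)
sample_long <- train %>% 
  sample_n(50) %>% 
  extract_ts()

shared <- SharedData$new(sample_long, key = ~id)

bscols(
  widths = c(3, 9),
  filter_select("ts_id", "Time series ID:", shared, ~id),
  ggplotly(
    ggplot(shared, aes(dates, sales, col = id)) +
      geom_line() +
      theme(legend.position = "none")
  )
)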

4.2 All aggregate sales

After peeking at some of the individual time series and finding a lot of zero values, we will now do some aggregation to get some decent statistics.

First off, here we plot the aggregate time series over all items, stores, categories, departments, and states. This is an interactive plot and you can use the usual plotly tools (upper right corner) to zoom, pan, and scale the view.
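A rough sketch of how such a figure could be built, reusing the hypothetical extract_ts helper from above (not necessarily the original code):

gg <- train %>% 
  summarise_at(vars(starts_with("d_")), sum) %>%   # total sales per day across everything
  extract_ts() %>% 
  ggplot(aes(dates, sales)) +
  geom_line(col = "blue") +
  labs(x = "Date", y = "Sales", title = "All aggregate sales")

ggplotly(gg, dynamicTicks = TRUE)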

Fig. 3: All aggregate sales

We find:

  • The sales are generally going up, which I suppose is good news for Walmart. We can make out some yearly seasonality, and a dip at Christmas, which is the only day of the year when the stores are closed.

  • Zooming in, we can see a strong weekly seasonality, plus possibly some additional overlapping patterns with periods shorter than a year.

  • The most recent 2016 sales numbers appear to grow a bit faster than in previous years.

4.3 Sales per State

To get a bit more out of these big picture views we will look at the sales per state on a monthly aggregate level. This is another interactive ggplotly graph:
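(A possible construction, again hypothetical and reusing the sketched helpers from above: aggregate per state, roll the daily totals up to months, and hand the ggplot object to ggplotly.)

state_monthly <- train %>% 
  group_by(state_id) %>% 
  summarise_at(vars(starts_with("d_")), sum) %>% 
  extract_ts() %>% 
  mutate(month = floor_date(dates, unit = "month")) %>% 
  group_by(state_id, month) %>% 
  summarise(sales = sum(sales)) %>% 
  ungroup()

gg <- state_monthly %>% 
  ggplot(aes(month, sales, col = state_id)) +
  geom_line() +
  labs(x = "Month", y = "Sales", col = "State", title = "Monthly sales per state")

ggplotly(gg, dynamicTicks = TRUE)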

Fig. 4: Monthly sales per state

We find:

  • California (CA) sells more items in general, while Wisconsin (WI) slowly caught up to Texas (TX) and eventually surpassed it in the final months of our training data.

  • CA has pronounced dips in 2013 and 2015 that appear to be present in the other states as well, just less severe. These dips and peaks don’t appear to always occur (see 2012) but they might primarily reflect the yearly seasonality we noticed already.

4.4 Sales per Store & Category

There are 10 stores, 4 in CA and 3 each in TX and WI, and 3 categories: Foods, Hobbies, and Household. Here we’re switching up our visualisation strategy a bit by including all of those in the same, non-interactive layout. We will again use monthly aggregates to keep the plots clean.

This visual uses a facetted view for the stores (1 facet per state) and the great patchwork package to arrange the individual plots, including a count of all category occurrences:

Fig. 5: Monthly sales per store and per category

We find:

  • “Foods” is the most common category, followed by “Household”, which is still quite a bit above “Hobbies”. The number of “Household” rows is closer to the number of “Foods” rows than the corresponding sales figures are, indicating that more units are sold per “Foods” item than per “Household” item.

  • In terms of stores, we see that the TX stores are quite close together in sales, with “TX_3” rising from the level of “TX_1” to the level of “TX_2” over the course of our training data. The WI stores “WI_1” and “WI_2” show a curious jump in sales in 2012, while “WI_3” shows a long dip over several years.

  • The CA stores are relatively well separated in sales volume. Note “CA_2”, which declines to the “CA_4” level in 2015, only to recover and jump up to “CA_1” levels later in the year.

4.5 Sales per Department

Our data has 7 departments, 3 for “FOODS” and 2 each for “HOBBIES” and “HOUSEHOLD”. Together with the 3 states those are 21 levels. Let’s see whether we can fit them all in a single facet grid plot.

Fig. 5: Monthly sales per department and state

We find:

  • “FOODS_3” is clearly driving the majority of “FOODS” category sales in all states. “FOODS_2” is picking up a bit towards the end of the time range, especially in “WI”.

  • Similarly, “HOUSEHOLD_1” is clearly outselling “HOUSEHOLD_2”. “HOBBIES_1” is on a higher average sales level than “HOBBIES_2”, but both are not showing much development over time.

4.6 Seasonalities - global

Moving on from the time series views, at least for the moment, we are changing up our visuals to study seasonalities. Here is a heat map that combines the weekly and yearly seasonalities.

Because of the general increasing trend in sales, we’re not looking at absolute sales values. Instead, we aim to model this trend using a smoothed (LOESS) fit, which we then subtract from the data. The heat map shows the relative changes. Note that I removed the Christmas dips because they would be distracting for the purpose of this plot:

Fig. 6: Heat map of weekly and yearly seasonalities (relative sales)

We find:

  • The weekly pattern is strong, with Sat and Sun standing out prominently. Also Monday seems to benefit a bit from the weekend effect.

  • The months of Nov and Dec show clear dips, while the summer months May, Jun, and Jul suggest a milder secondary dip. Certain holidays, like the 4th of July, might somewhat influence these patterns; but over 5 years they should average out reasonably well.

  • I already tweaked the smoothing fit parameters a bit, but they could probably be optimised further. Still, it looks quite good for the first try.

Let’s stay with the seasonalities for a little while longer and go one or two levels deeper.

Next, we’ll look at the weekly and monthly patterns on a state level. We’re using the same smoothing approach which is shown in the upper panels for reference. Here, I will scale each individual time series by its global mean to allow for a better comparison between the three states.

The colours are the same throughout the layout; see the upper panels for reference:

Fig. 7: Weekly and monthly seasonality per state (relative sales)

We find:

  • After scaling, the weekday vs weekend pattern is very similar for all 3 states, except for an interesting downturn in Sunday sales in WI.

  • The monthly seasonalities are indeed complex. There is a dip in the winter months and a second, generally shallower dip around May. WI is again the odd state out: it sells notably less in the summer compared to TX and especially CA; so much so that the Feb/Aug ratio is inverted for WI vs CA/TX.

The logical next step is to look at the Seasonalities per Category, since you wouldn’t necessarily expect food and household item shopping to follow the same patterns. I will also break up this view by state, following the insights we just saw above.

Again, we are subtracting the trend and scaling by the global mean:

# sum by cat + state, pivoting dates
foo <- train %>%
  group_by(cat_id, state_id) %>% 
  summarise_at(vars(starts_with("d_")), sum) %>% 
  ungroup() %>% 
  select(ends_with("id"), starts_with("d_")) %>%  
  pivot_longer(starts_with("d_"), names_to = "dates", values_to = "sales") %>%
  mutate(dates = as.integer(str_remove(dates, "d_"))) %>% 
  mutate(dates = min_date + dates - 1)

# fit loess and subtract
bar <- foo %>% 
  filter(!str_detect(as.character(dates), "-12-25")) %>% 
  group_by(cat_id, state_id) %>% 
  mutate(loess = predict(loess(sales ~ as.integer(dates - min(dates)) + 1, span = 2/3, degree = 1)),
         mean_sales = mean(sales)) %>% 
  mutate(sales_rel = (sales - loess)/mean_sales)

# bar %>% 
#   ggplot(aes(dates, sales, col = cat_id)) +
#   geom_line() +
#   geom_line(aes(dates, loess), col = "black") +
#   facet_grid(cat_id ~ state_id) +
#   theme_tufte() +
#   theme(legend.position = "none") +
#   labs(x = "", y = "Sales", title = "Sales per State with Seasonalities")

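# weekly seasonality panel: total relative (detrended, mean-scaled) sales per weekday, by category and state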
p1 <- bar %>% 
  ungroup() %>%
  mutate(wday = wday(dates, label = TRUE, week_start = 1)) %>% 
  group_by(cat_id, state_id, wday) %>% 
  summarise(sales = sum(sales_rel)) %>%
  unite(col = id, ends_with("id"), remove = FALSE) %>%  
  ggplot(aes(wday, sales, group = id, col = state_id)) +
  geom_line(size = 1.5) +
  theme_tufte() +
  facet_wrap(~cat_id, scales = "free", nrow = 3) +
  theme(legend.position = "top") +
  labs(x = "", y = "Relative Sales", title = "Weekly Seasonality and", col = "State")
  
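# monthly seasonality panel: total relative sales per calendar month, by category and state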
p2 <- bar %>% 
  mutate(month = month(dates, label = TRUE)) %>% 
  group_by(cat_id, state_id, month) %>% 
  summarise(sales = sum(sales_rel)) %>%
  unite(col = id, ends_with("id"), remove = FALSE) %>% 
  ggplot(aes(month, sales, group = id, col = state_id)) +
  geom_line(size = 1.5) +
  theme_tufte() +
  facet_wrap(~cat_id, scales = "free_y", nrow = 3) +
  theme(legend.position = "none") +
  labs(x = "", y = "Relative Sales", title = "Monthly Seasonality by Category & State", col = "State")

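# patchwork layout: weekly panel (A) takes 2/5 of the width, monthly panel (B) the remaining 3/5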
layout <- "
AABBB
"

p1 + p2 + plot_layout(design = layout)
Fig. 8: Weekly and monthly seasonality per category and state (relative sales)

We find:

  • The weekly patterns for the “foods” category are very close for all 3 states, while WI shows some deviations for “hobbies” and “household”. We also see Wisconsin’s characteristic Sunday dip for all 3 categories.

  • The monthly patterns show some interesting signatures: For “foods”, CA and TX are pretty close, but WI shows that inverted summer/winter ratio. In contrast, the 3 states are much more similar to each other in the “hobbies” category. And when it comes to “household” items, CA doesn’t seem to sell as many of them during the first 3 months of the year, but slightly more in the summer, compared to WI and TX.

Before we look at the additional explanatory variables, here is a visual representation of the split between training data vs validation data (public leaderboard) vs evaluation data (eventual private leaderboard):

Fig. 9: Training vs validation vs evaluation date ranges

  • This shows the 5+ years of training data together with the 28-day validation and evaluation ranges.

  • The training range goes from 2011-01-29 to 2016-04-24. The validation range spans the following 28 days from 2016-04-25 to 2016-05-22. This is what we are predicting for the initial public leaderboard. Eventually, we will be given the ground truth for this period to train our model for the final evaluation.

  • The evaluation range goes from 2016-05-23 to 2016-06-19; another 28 days. This is what the ultimate scoring will be based on.


To be continued