Photo by Fabio Bracht (@bracht) on Unsplash.

1 Introduction

Welcome to an extensive Exploratory Data Analysis for the 5th Makridakis forecasting competition (M5)! This notebook will grow over the coming days and weeks into a deep dive into all the relevant aspects of this challenge. Here’s all you need to know to get started:

Some Background: the Makridakis competitions (or M-competitions), organised by forecasting expert Spyros Makridakis, aim to provide a better understanding and advancement of forecasting methodology by comparing the performance of different methods in solving a well-defined, real-world problem. The first M-competition was held in 1982. The fourth competition (M4) ran in 2018 and featured “100,000 time series and 61 forecasting methods” (source in link). According to forecasting researcher and practitioner Rob Hyndman, the M-competitions “have had an enormous influence on the field of forecasting. They focused attention on what models produced good forecasts, rather than on the mathematical properties of those models”. This empirical approach is very similar to Kaggle’s trademark way of having the best machine learning algorithms engage in intense competition on diverse datasets. M5 is the first M-competition to be held on Kaggle.

The goal: We have been challenged to predict sales data provided by the retail giant Walmart 28 days into the future. This competition will run in 2 tracks: in addition to forecasting the values themselves in the Forecasting competition, we are simultaneously tasked with estimating the uncertainty of our predictions in the Uncertainty Distribution competition. Both competitions will have the same 28-day forecast horizon.

The data: We are working with 42,840 hierarchical time series. The data were obtained in the 3 US states of California (CA), Texas (TX), and Wisconsin (WI). “Hierarchical” here means that data can be aggregated on different levels: item level, department level, product category level, store level, and state level. The sales information covers the period from Jan 2011 to June 2016. In addition to the sales numbers, we are also given corresponding data on prices, promotions, and holidays. Note that we have been warned that most of the time series contain zero values.

The data comprises 3,049 individual products from 3 categories and 7 departments, sold in 10 stores in 3 states. The hierarchical aggregation captures the combinations of these factors: for instance, there is 1 time series for total sales, 3 time series for sales per state, and so on. The most granular level, sales per product per store, accounts for the largest number of series (30,490).

The training data comes in the shape of 3 separate files:

  • sales_train_validation.csv: the daily unit sales per item and store, i.e. the time series we are asked to forecast.

  • sell_prices.csv: the weekly selling price of each item in each store.

  • calendar.csv: the dates together with related features such as weekday, month, events, and SNAP flags.

The metrics: The point forecasts of the Forecasting track are evaluated with a Weighted Root Mean Squared Scaled Error (WRMSSE), while the Uncertainty Distribution track is scored with a Weighted Scaled Pinball Loss (WSPL) over a set of prediction quantiles. In both cases, the individual series are weighted by their recent dollar sales.

Need some inspiration? Beyond this notebook, I recommend checking out other recent Kaggle time series competitions such as Wiki Traffic Forecasting, Favorita Grocery Sales Forecasting, Recruit Restaurant, or ASHRAE, and reading the Kernels and top-ranking write-ups. Also, note that a few years ago Kaggle ran two Walmart recruiting competitions with the goal to forecast sales: Recruiting 1 and Recruiting 2. Last but not least, there was a relevant Playground competition a few years ago.

Let’s get started!

2 Preparations

2.1 Load libraries

We load a range of libraries for general data wrangling and general visualisation together with more specialised tools for time series forecasting.
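The exact setup chunk is not rendered here; the following is a minimal sketch of such a block, inferred from the package warnings below and the functions used later in the notebook (e.g. theme_tufte from ggthemes, plot_layout from patchwork):

# general data wrangling & visualisation
library(tidyverse)    # dplyr, tidyr, ggplot2, stringr, ...
library(lubridate)    # date handling (wday, month, floor_date)
library(vroom)        # fast reading of large csv files
library(skimr)        # compact data summaries

# more specialised visualisation tools
library(ggthemes)     # theme_tufte
library(patchwork)    # arranging multiple ggplots in one layout
library(plotly)       # interactive figures via ggplotly
library(crosstalk)    # linking interactive widgets
library(alluvial)
library(ggforce)
library(ggridges)
library(GGally)
library(wesanderson)  # colour palettes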

## Warning: package 'vroom' was built under R version 3.6.3
## Warning: package 'skimr' was built under R version 3.6.3
## Warning: package 'alluvial' was built under R version 3.6.3
## Warning: package 'ggforce' was built under R version 3.6.3
## Warning: package 'ggridges' was built under R version 3.6.3
## Warning: package 'GGally' was built under R version 3.6.3
## Warning: package 'wesanderson' was built under R version 3.6.3
## Warning: package 'rlang' was built under R version 3.6.3

2.2 Helper functions

A short helper function to compute binomial confidence intervals.
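The function itself is not rendered here; below is a minimal sketch of what such a helper could look like, based on base R's binom.test (the original implementation may differ):

# hypothetical sketch of a binomial confidence interval helper
binom_ci <- function(x, n, conf_level = 0.95) {
  # x: number of successes, n: number of trials
  ci <- binom.test(x, n, conf.level = conf_level)$conf.int
  tibble(mean = x / n, lwr = ci[1], upr = ci[2])
}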

3 Quick Look: File structure and content

As a first step, let’s have a quick look at the data sets using the head, summary, and glimpse tools where appropriate.
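For reference, here is a minimal sketch of how the three files might be read and inspected (file names as in the M5 data description; the original reading code is not rendered here, and the paths may differ):

# read the three data files (paths are placeholders)
train    <- vroom("sales_train_validation.csv", delim = ",")
prices   <- vroom("sell_prices.csv", delim = ",")
calendar <- vroom("calendar.csv", delim = ",")

glimpse(calendar)   # compact look at columns and types
summary(prices)     # numerical summary of the price table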

3.1 Training sales data

Here are the first 10 columns and rows of our training sales data:

id item_id dept_id cat_id store_id state_id d_1 d_2 d_3 d_4
HOBBIES_1_001_CA_1_validation HOBBIES_1_001 HOBBIES_1 HOBBIES CA_1 CA 0 0 0 0
HOBBIES_1_002_CA_1_validation HOBBIES_1_002 HOBBIES_1 HOBBIES CA_1 CA 0 0 0 0
HOBBIES_1_003_CA_1_validation HOBBIES_1_003 HOBBIES_1 HOBBIES CA_1 CA 0 0 0 0
HOBBIES_1_004_CA_1_validation HOBBIES_1_004 HOBBIES_1 HOBBIES CA_1 CA 0 0 0 0
HOBBIES_1_005_CA_1_validation HOBBIES_1_005 HOBBIES_1 HOBBIES CA_1 CA 0 0 0 0
HOBBIES_1_006_CA_1_validation HOBBIES_1_006 HOBBIES_1 HOBBIES CA_1 CA 0 0 0 0
HOBBIES_1_007_CA_1_validation HOBBIES_1_007 HOBBIES_1 HOBBIES CA_1 CA 0 0 0 0
HOBBIES_1_008_CA_1_validation HOBBIES_1_008 HOBBIES_1 HOBBIES CA_1 CA 12 15 0 0
HOBBIES_1_009_CA_1_validation HOBBIES_1_009 HOBBIES_1 HOBBIES CA_1 CA 2 0 7 3
HOBBIES_1_010_CA_1_validation HOBBIES_1_010 HOBBIES_1 HOBBIES CA_1 CA 0 0 1 0

We find:

  • There is one column each for the IDs of item, department, category, store, and state; plus a general ID that is a combination of the other IDs plus a flag for validation.

  • The sales per date are encoded as columns starting with the prefix d_. Those are the number of units sold per day (not the total amount of dollars).

  • We already see that there are quite a lot of zero values.

This data set has too many columns and rows to display them all:

## [1]  1919 30490

All IDs are marked as “validation”. This refers to the initial validation period of the competition, before we ultimately predict a different 28-day period for the final evaluation.
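One way to confirm this, assuming the train tibble from the reading sketch above (hypothetical; not necessarily the original code):

train %>% 
  distinct(id) %>% 
  mutate(dset = str_extract(id, "[^_]+$")) %>% 
  count(dset)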

## # A tibble: 1 x 2
##   dset           n
##   <chr>      <int>
## 1 validation 30490

3.2 Sales prices

This data set gives us the weekly selling price per item and store:

store_id item_id wm_yr_wk sell_price
CA_1 HOBBIES_1_001 11325 9.58
CA_1 HOBBIES_1_001 11326 9.58
CA_1 HOBBIES_1_001 11327 8.26
CA_1 HOBBIES_1_001 11328 8.26
CA_1 HOBBIES_1_001 11329 8.26
CA_1 HOBBIES_1_001 11330 8.26
CA_1 HOBBIES_1_001 11331 8.26
CA_1 HOBBIES_1_001 11332 8.26
CA_1 HOBBIES_1_001 11333 8.26
CA_1 HOBBIES_1_001 11334 8.26
##    store_id           item_id             wm_yr_wk       sell_price     
##  Length:6841121     Length:6841121     Min.   :11101   Min.   :  0.010  
##  Class :character   Class :character   1st Qu.:11247   1st Qu.:  2.180  
##  Mode  :character   Mode  :character   Median :11411   Median :  3.470  
##                                        Mean   :11383   Mean   :  4.411  
##                                        3rd Qu.:11517   3rd Qu.:  5.840  
##                                        Max.   :11621   Max.   :107.320

We find:

  • We have the store_id and item_id to link this data to our training and validation data; the wm_yr_wk column additionally links it to the calendar (see the sketch after these bullets).

  • Prices range from $0.01 to a bit more than $107.
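
A hedged sketch of how the weekly prices could be expanded to daily granularity and attached to the sales data, assuming the prices and calendar tibbles from the reading sketch above (hypothetical; not necessarily the original code):

# expand weekly prices to one row per day via the calendar's d / wm_yr_wk mapping
daily_prices <- prices %>% 
  left_join(calendar %>% select(d, wm_yr_wk), by = "wm_yr_wk")

# a long-format sales table with a d column could then be joined like this:
# sales_long %>% left_join(daily_prices, by = c("store_id", "item_id", "d"))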

3.3 Calendar

The calendar data gives us date features such as weekday, month, or year, alongside 2 sets of event features (name and type) and a binary SNAP food stamp flag for each of the 3 states:

date wm_yr_wk weekday wday month year d event_name_1 event_type_1 event_name_2 event_type_2 snap_CA snap_TX snap_WI
2011-01-29 11101 Saturday 1 1 2011 d_1 NA NA NA NA 0 0 0
2011-01-30 11101 Sunday 2 1 2011 d_2 NA NA NA NA 0 0 0
2011-01-31 11101 Monday 3 1 2011 d_3 NA NA NA NA 0 0 0
2011-02-01 11101 Tuesday 4 2 2011 d_4 NA NA NA NA 1 1 0
2011-02-02 11101 Wednesday 5 2 2011 d_5 NA NA NA NA 1 0 1
2011-02-03 11101 Thursday 6 2 2011 d_6 NA NA NA NA 1 1 1
2011-02-04 11101 Friday 7 2 2011 d_7 NA NA NA NA 1 0 0
2011-02-05 11102 Saturday 1 2 2011 d_8 NA NA NA NA 1 1 1
## Observations: 1,969
## Variables: 14
## $ date         <date> 2011-01-29, 2011-01-30, 2011-01-31, 2011-02-01, 2011-...
## $ wm_yr_wk     <dbl> 11101, 11101, 11101, 11101, 11101, 11101, 11101, 11102...
## $ weekday      <chr> "Saturday", "Sunday", "Monday", "Tuesday", "Wednesday"...
## $ wday         <dbl> 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, ...
## $ month        <dbl> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ year         <dbl> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, ...
## $ d            <chr> "d_1", "d_2", "d_3", "d_4", "d_5", "d_6", "d_7", "d_8"...
## $ event_name_1 <chr> NA, NA, NA, NA, NA, NA, NA, NA, "SuperBowl", NA, NA, N...
## $ event_type_1 <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Sporting", NA, NA, NA...
## $ event_name_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ event_type_2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ snap_CA      <dbl> 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, ...
## $ snap_TX      <dbl> 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, ...
## $ snap_WI      <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, ...
##       date               wm_yr_wk       weekday               wday      
##  Min.   :2011-01-29   Min.   :11101   Length:1969        Min.   :1.000  
##  1st Qu.:2012-06-04   1st Qu.:11219   Class :character   1st Qu.:2.000  
##  Median :2013-10-09   Median :11337   Mode  :character   Median :4.000  
##  Mean   :2013-10-09   Mean   :11347                      Mean   :3.997  
##  3rd Qu.:2015-02-13   3rd Qu.:11502                      3rd Qu.:6.000  
##  Max.   :2016-06-19   Max.   :11621                      Max.   :7.000  
##      month             year           d             event_name_1      
##  Min.   : 1.000   Min.   :2011   Length:1969        Length:1969       
##  1st Qu.: 3.000   1st Qu.:2012   Class :character   Class :character  
##  Median : 6.000   Median :2013   Mode  :character   Mode  :character  
##  Mean   : 6.326   Mean   :2013                                        
##  3rd Qu.: 9.000   3rd Qu.:2015                                        
##  Max.   :12.000   Max.   :2016                                        
##  event_type_1       event_name_2   event_type_2      snap_CA      
##  Length:1969        Mode:logical   Mode:logical   Min.   :0.0000  
##  Class :character   NA's:1969      NA's:1969      1st Qu.:0.0000  
##  Mode  :character                                 Median :0.0000  
##                                                   Mean   :0.3301  
##                                                   3rd Qu.:1.0000  
##                                                   Max.   :1.0000  
##     snap_TX          snap_WI      
##  Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000  
##  Mean   :0.3301   Mean   :0.3301  
##  3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000

We find:

  • The calendar has all the relevant dates, weekdays, and months, plus the binary SNAP flags per state and the event columns. Note that event_name_2 and event_type_2 were parsed as logical columns here because they are (almost) entirely NA.

4 Visual Overview: Interactive time series plots

We will start our visual exploration by investigating a number of time series plots on different aggregation levels. Here is a short helper function to transform our wide data into a long format with a dates column in date format:
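(A sketch of such a helper is given below, assuming min_date holds the first calendar date, 2011-01-29, and that the daily columns are named d_1, d_2, …; the original implementation may differ.)

min_date <- as.Date("2011-01-29")   # first day in the calendar

# hypothetical helper name; turns the wide d_* columns into (dates, sales) rows
extract_ts <- function(df) {
  df %>% 
    pivot_longer(starts_with("d_"), names_to = "dates", values_to = "sales") %>% 
    mutate(dates = as.integer(str_remove(dates, "d_")),
           dates = min_date + dates - 1)
}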

4.1 Individual item-level time series - random sample

Here we will sample 50 random time series from our training sample to get an idea about the data.

In the following plot, you can select the id of a time series (which is a concatenation of the item and store IDs) to display only the graphs that you’re interested in.

Currently, I don’t see a way to avoid having all time series plotted at the beginning. Select 1 from the input field to begin:

We find:

  • Most time series have pretty low daily counts, alongside the large percentage of zero values we had already noticed. On the one hand, this suggests that spikes are not going to be overly pronounced. On the other hand, it indicates that accurate forecasts will have to deal with quite a lot of noise.

  • Some of our sample time series start in the middle of the time range, and some have long gaps in between. This is an additional challenge.

Notes:

  • I have assigned one of 7 colour-blind friendly colours from the ColorBrewer “Set2” palette to each time series (cycling through the selection sequentially). Those will typically provide good contrast for a reasonable number of simultaneous plots, but occasionally two time series may end up with the same colour.

  • I got the inspiration to use the crosstalk package for building this plot from the excellent dashboard kernel by Saba Tavoosi. Make sure to check it out; a rough sketch of the general pattern follows below.
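
Here is a rough, hypothetical sketch of the crosstalk + ggplotly pattern behind such a filterable plot (not the original implementation; see the linked kernel for a complete example). It assumes the train tibble and the extract_ts helper sketched above:

# sample 50 series, reshape to long format, and share them with crosstalk
set.seed(4321)
sample_long <- train %>% 
  sample_n(50) %>% 
  extract_ts()

shared <- SharedData$new(sample_long, key = ~id)

bscols(
  widths = c(3, 9),
  filter_select("ts_id", "Time series ID:", shared, ~id),
  ggplotly(
    ggplot(shared, aes(dates, sales, col = id)) +
      geom_line() +
      theme(legend.position = "none")
  )
)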

4.2 All aggregate sales

After peeking at some of the individual time series and finding a lot of zero values, we will now do some aggregation to get some decent statistics.

First off, here we plot the aggregate time series over all items, stores, categories, departments, and states. This is an interactive plot and you can use the usual plotly tools (upper right corner) to zoom, pan, and scale the view.
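A rough sketch of how such a figure could be built, reusing the hypothetical extract_ts helper from above (not necessarily the original code):

gg <- train %>% 
  summarise_at(vars(starts_with("d_")), sum) %>%   # total sales per day across everything
  extract_ts() %>% 
  ggplot(aes(dates, sales)) +
  geom_line(col = "blue") +
  labs(x = "Date", y = "Sales", title = "All aggregate sales")

ggplotly(gg, dynamicTicks = TRUE)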

Fig. 3: All aggregate sales

We find:

  • The sales are generally going up, which I suppose is good news for Walmart. We can make out some yearly seasonality, and a dip at Christmas, which is the only day of the year when the stores are closed.

  • Zooming in, we can see a strong weekly seasonality, plus possibly some additional overlapping patterns with periods shorter than a year.

  • The most recent 2016 sales numbers appear to grow a bit faster than in previous years.

4.3 Sales per State

To get a bit more out of these big picture views we will look at the sales per state on a monthly aggregate level. This is another interactive ggplotly graph:
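(A possible construction, again hypothetical and reusing the sketched helpers from above: aggregate per state, roll the daily totals up to months, and hand the ggplot object to ggplotly.)

state_monthly <- train %>% 
  group_by(state_id) %>% 
  summarise_at(vars(starts_with("d_")), sum) %>% 
  extract_ts() %>% 
  mutate(month = floor_date(dates, unit = "month")) %>% 
  group_by(state_id, month) %>% 
  summarise(sales = sum(sales)) %>% 
  ungroup()

gg <- state_monthly %>% 
  ggplot(aes(month, sales, col = state_id)) +
  geom_line() +
  labs(x = "Month", y = "Sales", col = "State", title = "Monthly sales per state")

ggplotly(gg, dynamicTicks = TRUE)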

Fig. 4: Monthly sales per state

We find:

  • California (CA) sells more items in general, while Wisconsin (WI) slowly caught up to Texas (TX) and eventually surpassed it in the final months of our training data.

  • CA has pronounced dips in 2013 and 2015 that appear to be present in the other states as well, just less severe. These dips and peaks don’t appear to always occur (see 2012) but they might primarily reflect the yearly seasonality we noticed already.

4.4 Sales per Store & Category

There are 10 stores, 4 in CA and 3 each in TX and WI, and 3 categories: Foods, Hobbies, and Household. Here we’re switching up our visualisation strategy a bit by including all of those in the same, non-interactive layout. We will again use monthly aggregates to keep the plots clean.

This visual uses a facetted view for the stores (1 facet per state) and the great patchwork package to arrange the individual plots, including a count of all category occurrences:

Fig. 5: Monthly sales per store and per category

We find:

  • “Foods” is the most common category, followed by “Household”, which is still quite a bit above “Hobbies”. The number of “Household” rows is closer to the number of “Foods” rows than the corresponding sales figures are, indicating that more units are sold per “Foods” item than per “Household” item.

  • In terms of stores, we see that the TX stores are quite close together in sales, with “TX_3” rising from the level of “TX_1” to the level of “TX_2” over the course of our training data. The WI stores “WI_1” and “WI_2” show a curious jump in sales in 2012, while “WI_3” shows a long dip over several years.

  • The CA stores are relatively well separated in sales volume. Note “CA_2”, which declines to the “CA_4” level in 2015, only to recover and jump up to “CA_1” levels later in the year.

4.5 Sales per Department

Our data has 7 departments, 3 for “FOODS” and 2 each for “HOBBIES” and “HOUSEHOLD”. Together with the 3 states those are 21 levels. Let’s see whether we can fit them all in a single facet grid plot.

Fig. 5: Monthly sales per department and state

We find:

  • “FOODS_3” is clearly driving the majority of “FOODS” category sales in all states. “FOODS_2” is picking up a bit towards the end of the time range, especially in “WI”.

  • Similarly, “HOUSEHOLD_1” is clearly outselling “HOUSEHOLD_2”. “HOBBIES_1” is on a higher average sales level than “HOBBIES_2”, but both are not showing much development over time.

4.6 Seasonalities - global

Moving on from the time series views, at least for the moment, we are changing up our visuals to study seasonalities. Here is a heat map that combines the weekly and yearly seasonalities.

Because of the general increasing trend in sales, we’re not looking at absolute sales values. Instead, we aim to model this trend using a smoothed (LOESS) fit, which we then subtract from the data. The heat map shows the relative changes. Note that I removed the Christmas dips because they would be distracting for the purpose of this plot:

Fig. 6: Heat map of weekly and yearly seasonalities (relative sales)

We find:

  • The weekly pattern is strong, with Sat and Sun standing out prominently. Also Monday seems to benefit a bit from the weekend effect.

  • The months of Nov and Dec show clear dips, while the summer months May, Jun, and Jul suggest a milder secondary dip. Certain holidays, like the 4th of July, might somewhat influence these patterns; but over 5 years they should average out reasonably well.

  • I already tweaked the smoothing fit parameters a bit, but they could probably be optimised further. Still, it looks quite good for the first try.

Let’s stay with the seasonalities for a little while longer and go one or two levels deeper.

Next, we’ll look at the weekly and monthly patterns on a state level. We’re using the same smoothing approach which is shown in the upper panels for reference. Here, I will scale each individual time series by its global mean to allow for a better comparison between the three states.

The colours are the same throughout the layout; see the upper panels for reference:

Fig. 7: Weekly and monthly seasonality per state (relative sales)

We find:

  • After scaling, the weekday vs weekend pattern is very similar for all 3 states, except for an interesting downturn in Sunday sales in WI.

  • The monthly seasonalities are indeed complex. There is a dip in the winter months and a second, generally shallower dip around May. WI is again the odd state out: it sells notably less in the summer compared to TX and especially CA; so much so that the Feb/Aug ratio is inverted for WI vs CA/TX.

The logical next step is to look at the Seasonalities per Category, since you wouldn’t necessarily expect food and household item shopping to follow the same patterns. I will also break up this view by state, following the insights we just saw above.

Again, we are subtracting the trend and scaling by the global mean:

# sum by cat + state, pivoting dates
foo <- train %>%
  group_by(cat_id, state_id) %>% 
  summarise_at(vars(starts_with("d_")), sum) %>% 
  ungroup() %>% 
  select(ends_with("id"), starts_with("d_")) %>%  
  pivot_longer(starts_with("d_"), names_to = "dates", values_to = "sales") %>%
  mutate(dates = as.integer(str_remove(dates, "d_"))) %>% 
  mutate(dates = min_date + dates - 1)

# fit loess and subtract
bar <- foo %>% 
  filter(!str_detect(as.character(dates), "-12-25")) %>% 
  group_by(cat_id, state_id) %>% 
  mutate(loess = predict(loess(sales ~ as.integer(dates - min(dates)) + 1, span = 2/3, degree = 1)),
         mean_sales = mean(sales)) %>% 
  mutate(sales_rel = (sales - loess)/mean_sales)

# bar %>% 
#   ggplot(aes(dates, sales, col = cat_id)) +
#   geom_line() +
#   geom_line(aes(dates, loess), col = "black") +
#   facet_grid(cat_id ~ state_id) +
#   theme_tufte() +
#   theme(legend.position = "none") +
#   labs(x = "", y = "Sales", title = "Sales per State with Seasonalities")

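# weekly seasonality panel: total relative (detrended, mean-scaled) sales per weekday, by category and state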
p1 <- bar %>% 
  ungroup() %>%
  mutate(wday = wday(dates, label = TRUE, week_start = 1)) %>% 
  group_by(cat_id, state_id, wday) %>% 
  summarise(sales = sum(sales_rel)) %>%
  unite(col = id, ends_with("id"), remove = FALSE) %>%  
  ggplot(aes(wday, sales, group = id, col = state_id)) +
  geom_line(size = 1.5) +
  theme_tufte() +
  facet_wrap(~cat_id, scales = "free", nrow = 3) +
  theme(legend.position = "top") +
  labs(x = "", y = "Relative Sales", title = "Weekly Seasonality and", col = "State")
  
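# monthly seasonality panel: total relative sales per calendar month, by category and state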
p2 <- bar %>% 
  mutate(month = month(dates, label = TRUE)) %>% 
  group_by(cat_id, state_id, month) %>% 
  summarise(sales = sum(sales_rel)) %>%
  unite(col = id, ends_with("id"), remove = FALSE) %>% 
  ggplot(aes(month, sales, group = id, col = state_id)) +
  geom_line(size = 1.5) +
  theme_tufte() +
  facet_wrap(~cat_id, scales = "free_y", nrow = 3) +
  theme(legend.position = "none") +
  labs(x = "", y = "Relative Sales", title = "Monthly Seasonality by Category & State", col = "State")

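# patchwork layout: weekly panel (A) takes 2/5 of the width, monthly panel (B) the remaining 3/5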
layout <- "
AABBB
"

p1 + p2 + plot_layout(design = layout)
Fig. 8: Weekly and monthly seasonality per category and state (relative sales)

We find:

  • The weekly patterns for the “foods” category are very close for all 3 states, while WI shows some deviations for “hobbies” and “household”. We also see Wisconsin’s characteristic Sunday dip for all 3 categories.

  • The monthly patterns show some interesting signatures: For “foods”, CA and TX are pretty close, but WI shows that inverted summer/winter ratio. In contrast, the 3 states are much more similar to each other in the “hobbies” category. And when it comes to “household” items, CA doesn’t seem to sell as many of them during the first 3 months of the year, but slightly more in the summer, compared to WI and TX.

Before we look at the additional explanatory variables, here is a visual representation of the split between training data vs validation data (public leaderboard) vs evaluation data (eventual private leaderboard):

Fig. 9: Training vs validation vs evaluation date ranges

  • This shows the 5+ years of training data together with the 28-day validation and evaluation ranges.

  • The training range goes from 2011-01-29 to 2016-04-24. The validation range spans the following 28 days from 2016-04-25 to 2016-05-22. This is what we are predicting for the initial public leaderboard. Eventually, we will be given the ground truth for this period to train our model for the final evaluation.

  • The evaluation range goes from 2016-05-23 to 2016-06-19; another 28 days. This is what the ultimate scoring will be based on.


To be continued