Photo by Fabio Bracht (@bracht) on Unsplash.
Welcome to an extensive Exploratory Data Analysis for the 5th Makridakis forecasting competition (M5)! This notebook will grow over the coming days and weeks into a deep dive into all the relevant aspects of this challenge. Here’s all you need to know to get started:
Some Background: the Makridakis competitions (or M-competitions), organised by forecasting expert Spyros Makridakis, aim to provide a better understanding and advancement of forecasting methodology by comparing the performance of different methods in solving a well-defined, real-world problem. The first M-competition was held in 1982. The fourth competition (M4) ran in 2018 and featured “100,000 time series and 61 forecasting methods” (source in link). According to forecasting researcher and practitioner Rob Hyndman, the M-competitions “have had an enormous influence on the field of forecasting. They focused attention on what models produced good forecasts, rather than on the mathematical properties of those models”. This empirical approach is very similar to Kaggle’s trademark way of having the best machine learning algorithms engage in intense competition on diverse datasets. M5 is the first M-competition to be held on Kaggle.
The goal: We have been challenged to predict sales data provided by the retail giant Walmart 28 days into the future. This competition runs in 2 tracks: in addition to forecasting the values themselves in the Forecasting competition, we are simultaneously tasked with estimating the uncertainty of our predictions in the Uncertainty Distribution competition. Both competitions share the same 28-day forecast horizon.
The data: We are working with 42,840 hierarchical time series. The data were obtained in the 3 US states of California (CA), Texas (TX), and Wisconsin (WI). “Hierarchical” here means that the data can be aggregated on different levels: item level, department level, product category level, and state level. The sales information covers the period from January 2011 to June 2016. In addition to the sales numbers, we are also given corresponding data on prices, promotions, and holidays. Note that we have been warned that most of the time series contain zero values.
The data comprises 3049 individual products from 3 categories and 7 departments, sold in 10 stores in 3 states. The hierarchical aggregation captures the combinations of these factors. For instance, we can create 1 time series for all sales, 3 time series for all sales per state, and so on. The finest aggregation level covers the sales of each of the 3049 individual products in each of the 10 stores, giving 30490 time series.
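To make the aggregation idea concrete, here is a minimal dplyr sketch (assuming a hypothetical long-format table sales_long with state_id, dates, and sales columns; we will build such a long format further below) that rolls the item-level sales up to the 3 state-level series:

# Hypothetical sketch: aggregate item-level sales to one series per state
state_ts <- sales_long %>%
  group_by(state_id, dates) %>%
  summarise(sales = sum(sales)) %>%
  ungroup()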
The training data comes in the shape of 3 separate files:
sales_train.csv: this is our main training data. It has 1 column for each of the 1941 days from 2011-01-29 to 2016-05-22, not including the validation period of 28 days until 2016-06-19. It also includes the IDs for item, department, category, store, and state. The number of rows is 30490, one for each combination of the 3049 items and 10 stores.
sell_prices.csv: the store and item IDs together with the sales price of the item as a weekly average.
calendar.csv: dates together with related features like day-of-the-week, month, year, and 3 binary flags indicating whether the stores in each state allowed purchases with SNAP food stamps on that date (1) or not (0).
The metrics:
The point forecast submissions are evaluated using the Root Mean Squared Scaled Error (RMSSE), which is derived from the Mean Absolute Scaled Error (MASE) that was designed to be scale invariant and symmetric. Like the MASE, the RMSSE is scale invariant and symmetric, and measures the prediction error (i.e. forecast - truth) relative to a “naive forecast” that simply assumes that step i = step i-1. In contrast to the MASE, the RMSSE uses squared errors for both the forecast and the naive benchmark, which better suits the goal of estimating average demand in the presence of many zeros.
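For reference (the authoritative definition is in the competition guide), the RMSSE of a single series with n training days and forecast horizon h = 28 takes roughly this form:

$$\mathrm{RMSSE} = \sqrt{\frac{\tfrac{1}{h}\sum_{t=n+1}^{n+h}\left(Y_t - \hat{Y}_t\right)^2}{\tfrac{1}{n-1}\sum_{t=2}^{n}\left(Y_t - Y_{t-1}\right)^2}}$$

where $Y_t$ are the observed unit sales and $\hat{Y}_t$ our forecasts; the denominator is the mean squared error of the naive one-step-behind forecast over the training period.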
The uncertainty distributions are evaluated using the Weighted Scaled Pinball Loss (WSPL). We are asked to provide the 50%, 67%, 95%, and 99% uncertainty intervals together with the forecasted median.
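Again roughly, and deferring to the competition guide for the exact definition, the pinball loss of a quantile forecast $Q_t(u)$ at level $u$ is

$$\mathrm{PL}_u\left(Y_t, Q_t(u)\right) = \begin{cases} \left(Y_t - Q_t(u)\right)\, u & \text{if } Y_t \ge Q_t(u), \\ \left(Q_t(u) - Y_t\right)\,(1-u) & \text{otherwise.} \end{cases}$$

The scaled version divides this by the mean absolute one-step difference of the training series, $\tfrac{1}{n-1}\sum_{t=2}^{n}\left|Y_t - Y_{t-1}\right|$, and the WSPL averages the result over the forecast horizon, the requested quantiles, and (with weights) all time series.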
Both metrics are computed for each time series and then combined in a weighted average across all time series. The weights are proportional to the sales volume of the item, in dollars, to give more importance to high-selling products. Note that the weights are based on the last 28 days of the training data, and that those dates will be adjusted for the ultimate evaluation data, as confirmed by the organisers.
Need some inspiration? Beyond this notebook, I recommend checking out other recent Kaggle time series competitions such as Wiki Traffic Forecasting, Favorita Grocery Sales Forecasting, Recruit Restaurant, or ASHRAE and reading the Kernels and top-ranking write-ups. Also, note that a few years ago Kaggle ran two Walmart recruiting competitions with the goal of forecasting sales: Recruiting 1 and Recruiting 2. Last but not least, there was a relevant Playground competition a few years ago.
Let’s get started!
We load a range of libraries for general data wrangling and general visualisation together with more specialised tools for time series forecasting.
# General visualisation and data wrangling
library('ggplot2') # visualisation
library('scales') # visualisation
library('patchwork') # visualisation
library('RColorBrewer') # visualisation; provides brewer.pal() used below
library('corrplot') # visualisation
library('dplyr') # data wrangling
library('vroom') # input/output
library('tibble') # data wrangling
library('tidyr') # data wrangling
library('purrr') # data wrangling
library('stringr') # string manipulation
library('forcats') # factor manipulation
library('fuzzyjoin') # data wrangling
# Specific visualisation
library('alluvial') # visualisation
library('ggforce') # visualisation
library('ggridges') # visualisation
library('gganimate') # animations
library('GGally') # visualisation
library('ggthemes') # visualisation
library('wesanderson') # visualisation
library('kableExtra') # display tables
# Date and time series
library('lubridate') # date and time
library('forecast') # time series analysis
#library('prophet') # time series analysis
library('timetk') # time series analysis
# Interactivity
library('crosstalk')
library('plotly')
# Parallel processing
library('foreach') # parallel
library('doParallel') # parallel
A short helper function to compute binomial confidence intervals.
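That helper is not shown in this rendering; here is a minimal sketch of what such a function could look like, assuming a simple normal-approximation interval (the actual implementation may well differ, e.g. by using a Wilson interval):

# Hypothetical sketch: binomial confidence interval via the normal approximation
binom_ci <- function(x, n, conf = 0.95) {
  p <- x / n                         # observed proportion of successes
  z <- qnorm(1 - (1 - conf) / 2)     # normal quantile for the chosen confidence level
  se <- sqrt(p * (1 - p) / n)        # standard error of the proportion
  tibble(p = p,
         lwr = pmax(0, p - z * se),  # clip to the valid [0, 1] range
         upr = pmin(1, p + z * se))
}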
We use R’s new vroom package, which reads data as fast as its onomatopoeic name implies:
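The reading code itself is not shown here; a sketch of how it might look, assuming the files sit in a path directory and are named as in the data description above (the character columns also appear to have been converted to factors afterwards, judging by the glimpse output below, which this sketch omits):

path <- "../input/m5-forecasting-accuracy/"   # assumed location of the competition files

train    <- vroom(str_c(path, "sales_train.csv"), delim = ",")
prices   <- vroom(str_c(path, "sell_prices.csv"), delim = ",")
calendar <- vroom(str_c(path, "calendar.csv"), delim = ",")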
As a first step, let’s have a quick look at the data sets using the head, summary, and glimpse tools where appropriate.
Here are the first 10 columns and rows of our sales training data:
| id | item_id | dept_id | cat_id | store_id | state_id | d_1 | d_2 | d_3 | d_4 |
|---|---|---|---|---|---|---|---|---|---|
| HOBBIES_1_001_CA_1_validation | HOBBIES_1_001 | HOBBIES_1 | HOBBIES | CA_1 | CA | 0 | 0 | 0 | 0 |
| HOBBIES_1_002_CA_1_validation | HOBBIES_1_002 | HOBBIES_1 | HOBBIES | CA_1 | CA | 0 | 0 | 0 | 0 |
| HOBBIES_1_003_CA_1_validation | HOBBIES_1_003 | HOBBIES_1 | HOBBIES | CA_1 | CA | 0 | 0 | 0 | 0 |
| HOBBIES_1_004_CA_1_validation | HOBBIES_1_004 | HOBBIES_1 | HOBBIES | CA_1 | CA | 0 | 0 | 0 | 0 |
| HOBBIES_1_005_CA_1_validation | HOBBIES_1_005 | HOBBIES_1 | HOBBIES | CA_1 | CA | 0 | 0 | 0 | 0 |
| HOBBIES_1_006_CA_1_validation | HOBBIES_1_006 | HOBBIES_1 | HOBBIES | CA_1 | CA | 0 | 0 | 0 | 0 |
| HOBBIES_1_007_CA_1_validation | HOBBIES_1_007 | HOBBIES_1 | HOBBIES | CA_1 | CA | 0 | 0 | 0 | 0 |
| HOBBIES_1_008_CA_1_validation | HOBBIES_1_008 | HOBBIES_1 | HOBBIES | CA_1 | CA | 12 | 15 | 0 | 0 |
| HOBBIES_1_009_CA_1_validation | HOBBIES_1_009 | HOBBIES_1 | HOBBIES | CA_1 | CA | 2 | 0 | 7 | 3 |
| HOBBIES_1_010_CA_1_validation | HOBBIES_1_010 | HOBBIES_1 | HOBBIES | CA_1 | CA | 0 | 0 | 1 | 0 |
We find:
There is one column each for the IDs of item, department, category, store, and state; plus a general ID that combines the other IDs with a “validation” suffix.
The sales per date are encoded as columns starting with the prefix d_. Those are the number of units sold per day (not the total amount of dollars).
We already see that there are quite a lot of zero values.
This data set has too many columns and rows to display them all:
## [1] 1919 30490
All IDs are marked as “validation”. This refers to the initial validation testing period of the competition, before we ultimately predict a different 28-day period.
train %>%
  mutate(dset = if_else(str_detect(id, "validation"), "validation", "training")) %>%
  count(dset)

## # A tibble: 1 x 2
##   dset           n
##   <chr>      <int>
## 1 validation 30490
This data set gives us the weekly price changes per item:
| store_id | item_id | wm_yr_wk | sell_price |
|---|---|---|---|
| CA_1 | HOBBIES_1_001 | 11325 | 9.58 |
| CA_1 | HOBBIES_1_001 | 11326 | 9.58 |
| CA_1 | HOBBIES_1_001 | 11327 | 8.26 |
| CA_1 | HOBBIES_1_001 | 11328 | 8.26 |
| CA_1 | HOBBIES_1_001 | 11329 | 8.26 |
| CA_1 | HOBBIES_1_001 | 11330 | 8.26 |
| CA_1 | HOBBIES_1_001 | 11331 | 8.26 |
| CA_1 | HOBBIES_1_001 | 11332 | 8.26 |
| CA_1 | HOBBIES_1_001 | 11333 | 8.26 |
| CA_1 | HOBBIES_1_001 | 11334 | 8.26 |
## store_id item_id wm_yr_wk sell_price
## TX_2 : 701214 FOODS_1_003: 2820 Min. :11101 Min. : 0.010
## TX_1 : 699796 FOODS_1_006: 2820 1st Qu.:11247 1st Qu.: 2.180
## CA_1 : 698412 FOODS_1_018: 2820 Median :11411 Median : 3.470
## WI_3 : 696094 FOODS_1_019: 2820 Mean :11383 Mean : 4.411
## CA_3 : 693990 FOODS_1_020: 2820 3rd Qu.:11517 3rd Qu.: 5.840
## TX_3 : 691112 FOODS_1_024: 2820 Max. :11621 Max. :107.320
## (Other):2660503 (Other) :6824201
We find:
We have the store_id and item_id to link this data to our training and validation data.
Prices range from $0.01 to a bit more than $107.
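Since the prices are given per week, here is a small sketch (using the object names assumed in the reading sketch above) of how the wm_yr_wk column can link them to actual dates via the calendar data described next, while store_id and item_id link them to the sales training data:

# Sketch: expand the weekly prices to daily rows by joining the calendar on wm_yr_wk
prices_daily <- prices %>%
  left_join(calendar %>% select(date, wm_yr_wk, d), by = "wm_yr_wk")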
The calendar data gives us date features such as weekday, month, or year; alongside 2 different event features and binary SNAP food stamp flags for each of the 3 states:
| date | wm_yr_wk | weekday | wday | month | year | d | event_name_1 | event_type_1 | event_name_2 | event_type_2 | snap_CA | snap_TX | snap_WI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2011-01-29 | 11101 | Saturday | 1 | 1 | 2011 | d_1 | | | | | 0 | 0 | 0 |
| 2011-01-30 | 11101 | Sunday | 2 | 1 | 2011 | d_2 | | | | | 0 | 0 | 0 |
| 2011-01-31 | 11101 | Monday | 3 | 1 | 2011 | d_3 | | | | | 0 | 0 | 0 |
| 2011-02-01 | 11101 | Tuesday | 4 | 2 | 2011 | d_4 | | | | | 1 | 1 | 0 |
| 2011-02-02 | 11101 | Wednesday | 5 | 2 | 2011 | d_5 | | | | | 1 | 0 | 1 |
| 2011-02-03 | 11101 | Thursday | 6 | 2 | 2011 | d_6 | | | | | 1 | 1 | 1 |
| 2011-02-04 | 11101 | Friday | 7 | 2 | 2011 | d_7 | | | | | 1 | 0 | 0 |
| 2011-02-05 | 11102 | Saturday | 1 | 2 | 2011 | d_8 | | | | | 1 | 1 | 1 |
## Rows: 1,969
## Columns: 14
## $ date <fct> 2011-01-29, 2011-01-30, 2011-01-31, 2011-02-01, 2011-0...
## $ wm_yr_wk <int> 11101, 11101, 11101, 11101, 11101, 11101, 11101, 11102...
## $ weekday <fct> Saturday, Sunday, Monday, Tuesday, Wednesday, Thursday...
## $ wday <int> 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, ...
## $ month <int> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ year <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, ...
## $ d <fct> d_1, d_2, d_3, d_4, d_5, d_6, d_7, d_8, d_9, d_10, d_1...
## $ event_name_1 <fct> , , , , , , , , SuperBowl, , , , , , , , ValentinesDay...
## $ event_type_1 <fct> , , , , , , , , Sporting, , , , , , , , Cultural, , , ...
## $ event_name_2 <fct> , , , , , , , , , , , , , , , , , , , , , , , , ,
## $ event_type_2 <fct> , , , , , , , , , , , , , , , , , , , , , , , , ,
## $ snap_CA <int> 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, ...
## $ snap_TX <int> 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, ...
## $ snap_WI <int> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, ...
## date wm_yr_wk weekday wday
## 2011-01-29: 1 Min. :11101 Friday :281 Min. :1.000
## 2011-01-30: 1 1st Qu.:11219 Monday :281 1st Qu.:2.000
## 2011-01-31: 1 Median :11337 Saturday :282 Median :4.000
## 2011-02-01: 1 Mean :11347 Sunday :282 Mean :3.997
## 2011-02-02: 1 3rd Qu.:11502 Thursday :281 3rd Qu.:6.000
## 2011-02-03: 1 Max. :11621 Tuesday :281 Max. :7.000
## (Other) :1963 Wednesday:281
## month year d event_name_1
## Min. : 1.000 Min. :2011 d_1 : 1 :1807
## 1st Qu.: 3.000 1st Qu.:2012 d_10 : 1 LentStart : 6
## Median : 6.000 Median :2013 d_100 : 1 LentWeek2 : 6
## Mean : 6.326 Mean :2013 d_1000 : 1 MemorialDay : 6
## 3rd Qu.: 9.000 3rd Qu.:2015 d_1001 : 1 Mother's day: 6
## Max. :12.000 Max. :2016 d_1002 : 1 NBAFinalsEnd: 6
## (Other):1963 (Other) : 132
## event_type_1 event_name_2 event_type_2 snap_CA
## :1807 :1964 :1964 Min. :0.0000
## Cultural : 37 Cinco De Mayo : 1 Cultural : 4 1st Qu.:0.0000
## National : 52 Easter : 1 Religious: 1 Median :0.0000
## Religious: 55 Father's day : 2 Mean :0.3301
## Sporting : 18 OrthodoxEaster: 1 3rd Qu.:1.0000
## Max. :1.0000
##
## snap_TX snap_WI
## Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000
## Mean :0.3301 Mean :0.3301
## 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000
##
We find:
The calendar has all the relevant dates, weekdays, and months, plus the binary SNAP flags and categorical event columns (name and type).
There are only 5 non-empty rows in the event_name_2 column; i.e. only 5 (out of 1969) instances where there is more than 1 event on a particular day.
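As a quick check (not shown in the original), those second events can be counted directly:

# Count the non-empty entries in event_name_2
calendar %>%
  filter(!is.na(event_name_2), event_name_2 != "") %>%
  count(event_name_2)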
There are no missing values in our sales training data:
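The check itself is not shown in this rendering; a one-liner along these lines would produce the output below:

sum(is.na(train))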
## [1] 0
However, there are a lot of zero values. Here we plot the distribution of the percentage of zeros across all time series:
bar <- train %>%
  select(-contains("id")) %>%
  na_if(0) %>%
  is.na() %>%
  as_tibble() %>%
  mutate(sum = pmap_dbl(select(., everything()), sum)) %>%
  mutate(mean = sum / (ncol(train) - 6)) %>%   # 6 ID columns; the rest are daily sales columns
  select(sum, mean)

bar %>%
  ggplot(aes(mean)) +
  geom_density(fill = "blue") +
  scale_x_continuous(labels = scales::percent) +
  coord_cartesian(xlim = c(0, 1)) +
  theme_hc() +
  theme(axis.text.y = element_blank()) +
  labs(x = "", y = "", title = "Density for percentage of zero values - all time series")

Fig. 1
This means that only a minority of time series have less than 50% zero values. The peak of the density is rather close to 100%.
We will start our visual exploration by investigating a number of time series plots on different aggregation levels. Here is a short helper function to transform our wide data into a long format with a dates column in date format:
extract_ts <- function(df){

  min_date <- date("2011-01-29")

  df %>%
    select(id, starts_with("d_")) %>%
    pivot_longer(starts_with("d_"), names_to = "dates", values_to = "sales") %>%
    mutate(dates = as.integer(str_remove(dates, "d_"))) %>%
    mutate(dates = min_date + dates - 1) %>%
    mutate(id = str_remove(id, "_validation"))
}

set.seed(4321)
foo <- train %>%
  sample_n(50)

ts_out <- extract_ts(foo)

cols <- ts_out %>%
  distinct(id) %>%
  mutate(cols = rep_len(brewer.pal(7, "Set2"), length.out = n_distinct(ts_out$id)))

ts_out <- ts_out %>%
  left_join(cols, by = "id")

pal <- cols$cols %>%
  setNames(cols$id)

Here we will sample 50 random time series from our training sample to get an idea about the data.
In the following plot, you can select the id of a time series (which is a combination of the item ID and the store ID) to display only the graphs that you’re interested in.
Currently, I don’t see a way to avoid having all time series plotted at the beginning. Select 1 from the input field to begin:
shared_ts <- highlight_key(ts_out)
palette(brewer.pal(9, "Set3"))
gg <- shared_ts %>%
ggplot(aes(dates, sales, col = id, group = id)) +
geom_line() +
scale_color_manual(values = pal) +
labs(x = "Date", y = "Sales") +
theme_tufte() +
NULL
filter <- bscols(
filter_select("ids", "Sales over time: Select a time series ID (remove with backspace key, navigate with arrow keys):", shared_ts, ~id, multiple = TRUE),
ggplotly(gg, dynamicTicks = TRUE),
widths = c(12, 12)
)
bscols(filter)