Photo by Fabio Bracht (@bracht) on Unsplash.
Welcome to an extensive Exploratory Data Analysis for the 5th Makridakis forecasting competition (M5)! This notebook will grow over the coming days and weeks into a deep dive into all the relevant aspects of this challenge. Here’s all you need to know to get started:
Some Background: the Makridakis competitions (or M-competitions), organised by forecasting expert Spyros Makridakis, aim to provide a better understanding and advancement of forecasting methodology by comparing the performance of different methods in solving a well-defined, real-world problem. The first M-competition was held in 1982. The fourth competition (M4) ran in 2018 and featured “100,000 time series and 61 forecasting methods” (source in link). According to forecasting researcher and practitioner Rob Hyndman, the M-competitions “have had an enormous influence on the field of forecasting. They focused attention on what models produced good forecasts, rather than on the mathematical properties of those models”. This empirical approach is very similar to Kaggle’s trademark way of having the best machine learning algorithms engage in intense competition on diverse datasets. M5 is the first M-competition to be held on Kaggle.
The goal: We have been challenged to predict sales data provided by the retail giant Walmart 28 days into the future. This competition runs in 2 tracks: in addition to forecasting the values themselves in the Forecasting competition, we are simultaneously tasked with estimating the uncertainty of our predictions in the Uncertainty Distribution competition. Both competitions share the same 28-day forecast horizon.
The data: We are working with 42,840 hierarchical time series. The data were obtained in the 3 US states of California (CA), Texas (TX), and Wisconsin (WI). “Hierarchical” here means that the data can be aggregated on different levels: item level, department level, product category level, and state level. The sales information covers the period from January 2011 to June 2016. In addition to the sales numbers, we are also given corresponding data on prices, promotions, and holidays. Note that we have been warned that most of the time series contain zero values.
The data comprises 3049 individual products from 3 categories and 7 departments, sold in 10 stores in 3 states. The hierarchical aggregation captures the combinations of these factors. For instance, we can create 1 time series for all sales, 3 time series for all sales per state, and so on. The finest aggregation level covers the sales of each of the 3049 individual products in each of the 10 stores, giving 30490 time series.
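To make the aggregation idea concrete, here is a minimal dplyr sketch (assuming a hypothetical long-format table sales_long with state_id, dates, and sales columns; we will build such a long format further below) that rolls the item-level sales up to the 3 state-level series:

# Hypothetical sketch: aggregate item-level sales to one series per state
state_ts <- sales_long %>%
  group_by(state_id, dates) %>%
  summarise(sales = sum(sales)) %>%
  ungroup()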
The training data comes in the shape of 3 separate files:
sales_train.csv: this is our main training data. It has 1 column for each of the 1941 days from 2011-01-29 to 2016-05-22, not including the validation period of 28 days until 2016-06-19. It also includes the IDs for item, department, category, store, and state. The number of rows is 30490, one for each combination of the 3049 items and 10 stores.
sell_prices.csv: the store and item IDs together with the sales price of the item as a weekly average.
calendar.csv: dates together with related features like day-of-the-week, month, year, and 3 binary flags indicating whether the stores in each state allowed purchases with SNAP food stamps on that date (1) or not (0).
The metrics:
The point forecast submissions are evaluated using the Root Mean Squared Scaled Error (RMSSE), which is derived from the Mean Absolute Scaled Error (MASE) that was designed to be scale invariant and symmetric. Like the MASE, the RMSSE is scale invariant and symmetric, and measures the prediction error (i.e. forecast - truth) relative to a “naive forecast” that simply assumes that step i = step i-1. In contrast to the MASE, the RMSSE uses squared errors for both the forecast and the naive benchmark, which better suits the goal of estimating average demand in the presence of many zeros.
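For reference (the authoritative definition is in the competition guide), the RMSSE of a single series with n training days and forecast horizon h = 28 takes roughly this form:

$$\mathrm{RMSSE} = \sqrt{\frac{\tfrac{1}{h}\sum_{t=n+1}^{n+h}\left(Y_t - \hat{Y}_t\right)^2}{\tfrac{1}{n-1}\sum_{t=2}^{n}\left(Y_t - Y_{t-1}\right)^2}}$$

where $Y_t$ are the observed unit sales and $\hat{Y}_t$ our forecasts; the denominator is the mean squared error of the naive one-step-behind forecast over the training period.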
The uncertainty distributions are evaluated using the Weighted Scaled Pinball Loss (WSPL). We are asked to provide the 50%, 67%, 95%, and 99% uncertainty intervals together with the forecasted median.
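Again roughly, and deferring to the competition guide for the exact definition, the pinball loss of a quantile forecast $Q_t(u)$ at level $u$ is

$$\mathrm{PL}_u\left(Y_t, Q_t(u)\right) = \begin{cases} \left(Y_t - Q_t(u)\right)\, u & \text{if } Y_t \ge Q_t(u), \\ \left(Q_t(u) - Y_t\right)\,(1-u) & \text{otherwise.} \end{cases}$$

The scaled version divides this by the mean absolute one-step difference of the training series, $\tfrac{1}{n-1}\sum_{t=2}^{n}\left|Y_t - Y_{t-1}\right|$, and the WSPL averages the result over the forecast horizon, the requested quantiles, and (with weights) all time series.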
Both metrics are computed for each time series and then combined in a weighted average across all time series. The weights are proportional to the sales volume of the item, in dollars, to give more importance to high-selling products. Note that the weights are based on the last 28 days of the training data, and that those dates will be adjusted for the ultimate evaluation data, as confirmed by the organisers.
Need some inspiration? Beyond this notebook, I recommend checking out other recent Kaggle time series competitions such as Wiki Traffic Forecasting, Favorita Grocery Sales Forecasting, Recruit Restaurant, or ASHRAE and reading the Kernels and top-ranking write-ups. Also, note that a few years ago Kaggle ran two Walmart recruiting competitions with the goal of forecasting sales: Recruiting 1 and Recruiting 2. Last but not least, there was a relevant Playground competition a few years ago.
Let’s get started!
We load a range of libraries for general data wrangling and general visualisation together with more specialised tools for time series forecasting.
# General visualisation and data wrangling
library('ggplot2') # visualisation
library('scales') # visualisation
library('patchwork') # visualisation
library('RColorBrewer') # visualisation; provides brewer.pal() used below
library('corrplot') # visualisation
library('dplyr') # data wrangling
library('vroom') # input/output
library('tibble') # data wrangling
library('tidyr') # data wrangling
library('purrr') # data wrangling
library('stringr') # string manipulation
library('forcats') # factor manipulation
library('fuzzyjoin') # data wrangling
# Specific visualisation
library('alluvial') # visualisation
library('ggforce') # visualisation
library('ggridges') # visualisation
library('gganimate') # animations
library('GGally') # visualisation
library('ggthemes') # visualisation
library('wesanderson') # visualisation
library('kableExtra') # display tables
# Date and time series
library('lubridate') # date and time
library('forecast') # time series analysis
#library('prophet') # time series analysis
library('timetk') # time series analysis
# Interactivity
library('crosstalk')
library('plotly')
# Parallel processing
library('foreach') # parallel
library('doParallel') # parallel
A short helper function to compute binomial confidence intervals.
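That helper is not shown in this rendering; here is a minimal sketch of what such a function could look like, assuming a simple normal-approximation interval (the actual implementation may well differ, e.g. by using a Wilson interval):

# Hypothetical sketch: binomial confidence interval via the normal approximation
binom_ci <- function(x, n, conf = 0.95) {
  p <- x / n                         # observed proportion of successes
  z <- qnorm(1 - (1 - conf) / 2)     # normal quantile for the chosen confidence level
  se <- sqrt(p * (1 - p) / n)        # standard error of the proportion
  tibble(p = p,
         lwr = pmax(0, p - z * se),  # clip to the valid [0, 1] range
         upr = pmin(1, p + z * se))
}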
We use R’s new vroom package, which reads data as fast as its onomatopoeic name implies:
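The reading code itself is not shown here; a sketch of how it might look, assuming the files sit in a path directory and are named as in the data description above (the character columns also appear to have been converted to factors afterwards, judging by the glimpse output below, which this sketch omits):

path <- "../input/m5-forecasting-accuracy/"   # assumed location of the competition files

train    <- vroom(str_c(path, "sales_train.csv"), delim = ",")
prices   <- vroom(str_c(path, "sell_prices.csv"), delim = ",")
calendar <- vroom(str_c(path, "calendar.csv"), delim = ",")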
As a first step, let’s have a quick look at the data sets using the head, summary, and glimpse tools where appropriate.
Here are the first 10 columns and rows of our sales training data:
| id | item_id | dept_id | cat_id | store_id | state_id | d_1 | d_2 | d_3 | d_4 |
|---|---|---|---|---|---|---|---|---|---|
| HOBBIES_1_001_CA_1_validation | HOBBIES_1_001 | HOBBIES_1 | HOBBIES | CA_1 | CA | 0 | 0 | 0 | 0 |
| HOBBIES_1_002_CA_1_validation | HOBBIES_1_002 | HOBBIES_1 | HOBBIES | CA_1 | CA | 0 | 0 | 0 | 0 |
| HOBBIES_1_003_CA_1_validation | HOBBIES_1_003 | HOBBIES_1 | HOBBIES | CA_1 | CA | 0 | 0 | 0 | 0 |
| HOBBIES_1_004_CA_1_validation | HOBBIES_1_004 | HOBBIES_1 | HOBBIES | CA_1 | CA | 0 | 0 | 0 | 0 |
| HOBBIES_1_005_CA_1_validation | HOBBIES_1_005 | HOBBIES_1 | HOBBIES | CA_1 | CA | 0 | 0 | 0 | 0 |
| HOBBIES_1_006_CA_1_validation | HOBBIES_1_006 | HOBBIES_1 | HOBBIES | CA_1 | CA | 0 | 0 | 0 | 0 |
| HOBBIES_1_007_CA_1_validation | HOBBIES_1_007 | HOBBIES_1 | HOBBIES | CA_1 | CA | 0 | 0 | 0 | 0 |
| HOBBIES_1_008_CA_1_validation | HOBBIES_1_008 | HOBBIES_1 | HOBBIES | CA_1 | CA | 12 | 15 | 0 | 0 |
| HOBBIES_1_009_CA_1_validation | HOBBIES_1_009 | HOBBIES_1 | HOBBIES | CA_1 | CA | 2 | 0 | 7 | 3 |
| HOBBIES_1_010_CA_1_validation | HOBBIES_1_010 | HOBBIES_1 | HOBBIES | CA_1 | CA | 0 | 0 | 1 | 0 |
We find:
There is one column each for the IDs of item, department, category, store, and state; plus a general ID that combines the other IDs with a “validation” suffix.
The sales per date are encoded as columns starting with the prefix d_. Those are the number of units sold per day (not the total amount of dollars).
We already see that there are quite a lot of zero values.
This data set has too many columns and rows to display them all:
## [1] 1919 30490
All IDs are marked as “validation”. This refers to the initial validation testing period of the competition, before we ultimately predict a different 28-day period.
train %>%
  mutate(dset = if_else(str_detect(id, "validation"), "validation", "training")) %>%
  count(dset)

## # A tibble: 1 x 2
##   dset           n
##   <chr>      <int>
## 1 validation 30490
This data set gives us the weekly price changes per item:
| store_id | item_id | wm_yr_wk | sell_price |
|---|---|---|---|
| CA_1 | HOBBIES_1_001 | 11325 | 9.58 |
| CA_1 | HOBBIES_1_001 | 11326 | 9.58 |
| CA_1 | HOBBIES_1_001 | 11327 | 8.26 |
| CA_1 | HOBBIES_1_001 | 11328 | 8.26 |
| CA_1 | HOBBIES_1_001 | 11329 | 8.26 |
| CA_1 | HOBBIES_1_001 | 11330 | 8.26 |
| CA_1 | HOBBIES_1_001 | 11331 | 8.26 |
| CA_1 | HOBBIES_1_001 | 11332 | 8.26 |
| CA_1 | HOBBIES_1_001 | 11333 | 8.26 |
| CA_1 | HOBBIES_1_001 | 11334 | 8.26 |
## store_id item_id wm_yr_wk sell_price
## TX_2 : 701214 FOODS_1_003: 2820 Min. :11101 Min. : 0.010
## TX_1 : 699796 FOODS_1_006: 2820 1st Qu.:11247 1st Qu.: 2.180
## CA_1 : 698412 FOODS_1_018: 2820 Median :11411 Median : 3.470
## WI_3 : 696094 FOODS_1_019: 2820 Mean :11383 Mean : 4.411
## CA_3 : 693990 FOODS_1_020: 2820 3rd Qu.:11517 3rd Qu.: 5.840
## TX_3 : 691112 FOODS_1_024: 2820 Max. :11621 Max. :107.320
## (Other):2660503 (Other) :6824201
We find:
We have the store_id and item_id to link this data to our training and validation data.
Prices range from $0.01 to a bit more than $107.
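Since the prices are given per week, here is a small sketch (using the object names assumed in the reading sketch above) of how the wm_yr_wk column can link them to actual dates via the calendar data described next, while store_id and item_id link them to the sales training data:

# Sketch: expand the weekly prices to daily rows by joining the calendar on wm_yr_wk
prices_daily <- prices %>%
  left_join(calendar %>% select(date, wm_yr_wk, d), by = "wm_yr_wk")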
The calendar data gives us date features such as weekday, month, or year; alongside 2 different event features and binary SNAP food stamp flags for each of the 3 states:
| date | wm_yr_wk | weekday | wday | month | year | d | event_name_1 | event_type_1 | event_name_2 | event_type_2 | snap_CA | snap_TX | snap_WI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2011-01-29 | 11101 | Saturday | 1 | 1 | 2011 | d_1 | | | | | 0 | 0 | 0 |
| 2011-01-30 | 11101 | Sunday | 2 | 1 | 2011 | d_2 | | | | | 0 | 0 | 0 |
| 2011-01-31 | 11101 | Monday | 3 | 1 | 2011 | d_3 | | | | | 0 | 0 | 0 |
| 2011-02-01 | 11101 | Tuesday | 4 | 2 | 2011 | d_4 | | | | | 1 | 1 | 0 |
| 2011-02-02 | 11101 | Wednesday | 5 | 2 | 2011 | d_5 | | | | | 1 | 0 | 1 |
| 2011-02-03 | 11101 | Thursday | 6 | 2 | 2011 | d_6 | | | | | 1 | 1 | 1 |
| 2011-02-04 | 11101 | Friday | 7 | 2 | 2011 | d_7 | | | | | 1 | 0 | 0 |
| 2011-02-05 | 11102 | Saturday | 1 | 2 | 2011 | d_8 | | | | | 1 | 1 | 1 |
## Rows: 1,969
## Columns: 14
## $ date <fct> 2011-01-29, 2011-01-30, 2011-01-31, 2011-02-01, 2011-0...
## $ wm_yr_wk <int> 11101, 11101, 11101, 11101, 11101, 11101, 11101, 11102...
## $ weekday <fct> Saturday, Sunday, Monday, Tuesday, Wednesday, Thursday...
## $ wday <int> 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, ...
## $ month <int> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ year <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, ...
## $ d <fct> d_1, d_2, d_3, d_4, d_5, d_6, d_7, d_8, d_9, d_10, d_1...
## $ event_name_1 <fct> , , , , , , , , SuperBowl, , , , , , , , ValentinesDay...
## $ event_type_1 <fct> , , , , , , , , Sporting, , , , , , , , Cultural, , , ...
## $ event_name_2 <fct> , , , , , , , , , , , , , , , , , , , , , , , , ,
## $ event_type_2 <fct> , , , , , , , , , , , , , , , , , , , , , , , , ,
## $ snap_CA <int> 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, ...
## $ snap_TX <int> 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, ...
## $ snap_WI <int> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, ...
## date wm_yr_wk weekday wday
## 2011-01-29: 1 Min. :11101 Friday :281 Min. :1.000
## 2011-01-30: 1 1st Qu.:11219 Monday :281 1st Qu.:2.000
## 2011-01-31: 1 Median :11337 Saturday :282 Median :4.000
## 2011-02-01: 1 Mean :11347 Sunday :282 Mean :3.997
## 2011-02-02: 1 3rd Qu.:11502 Thursday :281 3rd Qu.:6.000
## 2011-02-03: 1 Max. :11621 Tuesday :281 Max. :7.000
## (Other) :1963 Wednesday:281
## month year d event_name_1
## Min. : 1.000 Min. :2011 d_1 : 1 :1807
## 1st Qu.: 3.000 1st Qu.:2012 d_10 : 1 LentStart : 6
## Median : 6.000 Median :2013 d_100 : 1 LentWeek2 : 6
## Mean : 6.326 Mean :2013 d_1000 : 1 MemorialDay : 6
## 3rd Qu.: 9.000 3rd Qu.:2015 d_1001 : 1 Mother's day: 6
## Max. :12.000 Max. :2016 d_1002 : 1 NBAFinalsEnd: 6
## (Other):1963 (Other) : 132
## event_type_1 event_name_2 event_type_2 snap_CA
## :1807 :1964 :1964 Min. :0.0000
## Cultural : 37 Cinco De Mayo : 1 Cultural : 4 1st Qu.:0.0000
## National : 52 Easter : 1 Religious: 1 Median :0.0000
## Religious: 55 Father's day : 2 Mean :0.3301
## Sporting : 18 OrthodoxEaster: 1 3rd Qu.:1.0000
## Max. :1.0000
##
## snap_TX snap_WI
## Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000
## Mean :0.3301 Mean :0.3301
## 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000
##
We find:
The calendar has all the relevant dates, weekdays, and months, plus the binary SNAP flags and categorical event columns (name and type).
There are only 5 non-empty rows in the event_name_2 column; i.e. only 5 (out of 1969) instances where there is more than 1 event on a particular day.
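As a quick check (not shown in the original), those second events can be counted directly:

# Count the non-empty entries in event_name_2
calendar %>%
  filter(!is.na(event_name_2), event_name_2 != "") %>%
  count(event_name_2)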
There are no missing values in our sales training data:
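The check itself is not shown in this rendering; a one-liner along these lines would produce the output below:

sum(is.na(train))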
## [1] 0
However, there are a lot of zero values. Here we plot the distribution of the percentage of zeros across all time series:
bar <- train %>%
  select(-contains("id")) %>%
  na_if(0) %>%
  is.na() %>%
  as_tibble() %>%
  mutate(sum = pmap_dbl(select(., everything()), sum)) %>%
  mutate(mean = sum / (ncol(train) - 6)) %>%   # 6 ID columns; the rest are daily sales columns
  select(sum, mean)

bar %>%
  ggplot(aes(mean)) +
  geom_density(fill = "blue") +
  scale_x_continuous(labels = scales::percent) +
  coord_cartesian(xlim = c(0, 1)) +
  theme_hc() +
  theme(axis.text.y = element_blank()) +
  labs(x = "", y = "", title = "Density for percentage of zero values - all time series")

Fig. 1
This means that only a minority of time series have less than 50% zero values. The peak of the density is rather close to 100%.
We will start our visual exploration by investigating a number of time series plots on different aggregation levels. Here is a short helper function to transform our wide data into a long format with a dates column in date format:
extract_ts <- function(df){

  min_date <- date("2011-01-29")

  df %>%
    select(id, starts_with("d_")) %>%
    pivot_longer(starts_with("d_"), names_to = "dates", values_to = "sales") %>%
    mutate(dates = as.integer(str_remove(dates, "d_"))) %>%
    mutate(dates = min_date + dates - 1) %>%
    mutate(id = str_remove(id, "_validation"))
}

set.seed(4321)
foo <- train %>%
  sample_n(50)

ts_out <- extract_ts(foo)

cols <- ts_out %>%
  distinct(id) %>%
  mutate(cols = rep_len(brewer.pal(7, "Set2"), length.out = n_distinct(ts_out$id)))

ts_out <- ts_out %>%
  left_join(cols, by = "id")

pal <- cols$cols %>%
  setNames(cols$id)

Here we will sample 50 random time series from our training sample to get an idea about the data.
In the following plot, you can select the id of a time series (which is a combination of the item ID and the store ID) to display only the graphs that you’re interested in.
Currently, I don’t see a way to avoid having all time series plotted at the beginning. Select 1 from the input field to begin:
shared_ts <- highlight_key(ts_out)
palette(brewer.pal(9, "Set3"))
gg <- shared_ts %>%
ggplot(aes(dates, sales, col = id, group = id)) +
geom_line() +
scale_color_manual(values = pal) +
labs(x = "Date", y = "Sales") +
theme_tufte() +
NULL
filter <- bscols(
filter_select("ids", "Sales over time: Select a time series ID (remove with backspace key, navigate with arrow keys):", shared_ts, ~id, multiple = TRUE),
ggplotly(gg, dynamicTicks = TRUE),
widths = c(12, 12)
)
bscols(filter)