1 Introduction

This is an extensive Exploratory Data Analysis for the Corporación Favorita Grocery Sales Forecasting competition with tidy R and ggplot2.

The aim of this challenge is to produce more accurate forecasts of product sales for the Ecuadorian supermarket chain Corporación Favorita.

The data comes in the shape of multiple files. First, the training data (../input/train.csv) essentially contains the sales by date, store, and item. The test data (../input/test.csv) contains the same features without the sales information, which we are tasked to predict. The train vs test split is based on the date. In addition, some test items are not included in the train data.

The data description further notes the importance of public-sector pay days (on the 15th and the last day of each month), as well as the impact of a major earthquake on April 16, 2016.

2 Preparations

2.1 Load libraries

We load a range of libraries for data wrangling and general visualisation, together with more specialised tools.

# general visualisation
library('ggplot2') # visualisation
library('scales') # visualisation
library('grid') # visualisation
library('gridExtra') # visualisation
library('RColorBrewer') # visualisation
library('corrplot') # visualisation

# general data manipulation
library('dplyr') # data manipulation
library('readr') # input/output
library('data.table') # data manipulation
library('tibble') # data wrangling
library('tidyr') # data wrangling
library('stringr') # string manipulation
library('forcats') # factor manipulation

# specific visualisation
library('ggrepel') # visualisation
library('treemapify') # visualisation
library('ggforce') # visualisation
library('ggridges') # visualization

# specific data manipulation
library('broom') # data wrangling
library('purrr') # functional programming

# Date plus forecast
library('lubridate') # date and time
library('prophet') # time series forecasting
library('xgboost') # gradient boosting
library('caret') # model training
library('timeDate') # date and time

2.2 Helper functions

We use the multiplot function, courtesy of the Cookbook for R, to create multi-panel plots. We also define a brief helper function to compute binomial confidence intervals.

source("multiplot.R")
# function to extract binomial confidence levels
get_binCI <- function(x,n) as.list(setNames(binom.test(x,n)$conf.int, c("lwr", "upr")))
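
For example, called directly, get_binCI returns the exact (Clopper-Pearson) 95% interval as a named list (values quoted approximately):

# usage example: confidence interval for 7 successes out of 20 trials
get_binCI(7, 20)
## $lwr ~0.154
## $upr ~0.592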

2.3 Load data

We use data.table’s fread function to speed up reading in the data. Note that the uncompressed training data is 4.7 GB in size, with about 126 million rows. For the purpose of this exploration, we will only use 10% of this data.

set.seed(1234)
train <- sample_frac(as.tibble(fread('../input/train.csv')),0.1)
test <- as.tibble(fread('../input/test.csv'))
stores <- as.tibble(fread('../input/stores.csv'))
items <- as.tibble(fread('../input/items.csv'))
trans <- as.tibble(fread('../input/transactions.csv'))
oil <- as.tibble(fread('../input/oil.csv'))
holidays <- as.tibble(fread('../input/holidays_events.csv'))

3 Overview: File structure and content

As a first step we will have an overview of the individual data sets using the summary and glimpse tools.

3.1 Training data

summary(train)
##        id                date             store_nbr        item_nbr      
##  Min.   :        2   Length:12549704    Min.   : 1.00   Min.   :  96995  
##  1st Qu.: 31339112   Class :character   1st Qu.:12.00   1st Qu.: 522721  
##  Median : 62741540   Mode  :character   Median :28.00   Median : 959437  
##  Mean   : 62743729                      Mean   :27.46   Mean   : 972741  
##  3rd Qu.: 94139379                      3rd Qu.:43.00   3rd Qu.:1354380  
##  Max.   :125497019                      Max.   :54.00   Max.   :2127114  
##    unit_sales         onpromotion    
##  Min.   :-15372.000   Mode :logical  
##  1st Qu.:     2.000   FALSE:9600389  
##  Median :     4.000   TRUE :781199   
##  Mean   :     8.561   NA's :2168116  
##  3rd Qu.:     9.000                  
##  Max.   : 20748.000
glimpse(train)
## Observations: 12,549,704
## Variables: 6
## $ id          <int> 14269441, 78096733, 76462175, 78232274, 108042332, 8…
## $ date        <chr> "2013-11-20", "2016-05-03", "2016-04-16", "2016-05-0…
## $ store_nbr   <int> 40, 24, 28, 46, 37, 6, 27, 2, 14, 1, 42, 32, 24, 1, …
## $ item_nbr    <int> 502224, 472526, 1339879, 1501547, 1038946, 1971461, …
## $ unit_sales  <dbl> 1.000, 2.000, 4.000, 10.000, 5.000, 4.000, 1.000, 25…
## $ onpromotion <lgl> NA, FALSE, FALSE, TRUE, FALSE, FALSE, NA, FALSE, FAL…

We find:

  • There is a unique id to label our observations. The date feature will need to be transformed to a better format.

  • The store numbers are integers (store_nbr) ranging from 1 to 54. Item numbers (item_nbr) are also encoded as integers. Both of these features will work better when encoded as factors.

  • onpromotion is a logical feature describing whether the item in question was on a special promotion price at that time in that store. This feature contains many NA values.

  • unit_sales is our target feature: How many units of the specific item were sold in that store on that day. Negative values mean that this particular item was returned (source).

3.2 Test data

summary(test)
##        id                date             store_nbr       item_nbr      
##  Min.   :125497040   Length:3370464     Min.   : 1.0   Min.   :  96995  
##  1st Qu.:126339656   Class :character   1st Qu.:14.0   1st Qu.: 805321  
##  Median :127182272   Mode  :character   Median :27.5   Median :1294665  
##  Mean   :127182272                      Mean   :27.5   Mean   :1244798  
##  3rd Qu.:128024887                      3rd Qu.:41.0   3rd Qu.:1730015  
##  Max.   :128867503                      Max.   :54.0   Max.   :2134244  
##  onpromotion    
##  Mode :logical  
##  FALSE:3171867  
##  TRUE :198597   
##                 
##                 
## 
glimpse(test)
## Observations: 3,370,464
## Variables: 5
## $ id          <int> 125497040, 125497041, 125497042, 125497043, 12549704…
## $ date        <chr> "2017-08-16", "2017-08-16", "2017-08-16", "2017-08-1…
## $ store_nbr   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ item_nbr    <int> 96995, 99197, 103501, 103520, 103665, 105574, 105575…
## $ onpromotion <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…

We find:

  • The test data contains the same range of store_nbr. The date-based split described above is not visible in this summary, because date is still formatted as a character column.

  • We already see that there are some item_nbr in the test data that cannot be found in the train data (a quick check follows after this list).

  • Interestingly, there are no NAs in the onpromotion feature here.
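
A quick check of how many test items are absent from our training data (the exact count will depend on the 10% sample):

# item_nbr values present in test but not in the sampled train data
new_items <- setdiff(unique(test$item_nbr), unique(train$item_nbr))
length(new_items)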

3.3 Stores

summary(stores)
##    store_nbr         city              state               type          
##  Min.   : 1.00   Length:54          Length:54          Length:54         
##  1st Qu.:14.25   Class :character   Class :character   Class :character  
##  Median :27.50   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :27.50                                                           
##  3rd Qu.:40.75                                                           
##  Max.   :54.00                                                           
##     cluster      
##  Min.   : 1.000  
##  1st Qu.: 4.000  
##  Median : 8.500  
##  Mean   : 8.481  
##  3rd Qu.:13.000  
##  Max.   :17.000
glimpse(stores)
## Observations: 54
## Variables: 5
## $ store_nbr <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ city      <chr> "Quito", "Quito", "Quito", "Quito", "Santo Domingo", "…
## $ state     <chr> "Pichincha", "Pichincha", "Pichincha", "Pichincha", "S…
## $ type      <chr> "D", "D", "D", "D", "D", "D", "D", "D", "B", "C", "B",…
## $ cluster   <int> 13, 13, 8, 9, 4, 13, 8, 8, 6, 15, 6, 15, 15, 7, 15, 3,…

We find:

  • Stores are identified by a city (e.g. “Quito”) and a state (e.g. “Pichincha”), linked to the train/test data through the store_nbr column. Along with the type of the store, these should be encoded as factors.

  • cluster describes a “grouping of similar stores” (source). We will find out what exactly that means.

3.4 Items

summary(items)
##     item_nbr          family              class        perishable    
##  Min.   :  96995   Length:4100        Min.   :1002   Min.   :0.0000  
##  1st Qu.: 818111   Class :character   1st Qu.:1068   1st Qu.:0.0000  
##  Median :1306198   Mode  :character   Median :2004   Median :0.0000  
##  Mean   :1251436                      Mean   :2170   Mean   :0.2405  
##  3rd Qu.:1904918                      3rd Qu.:2990   3rd Qu.:0.0000  
##  Max.   :2134244                      Max.   :7780   Max.   :1.0000
glimpse(items)
## Observations: 4,100
## Variables: 4
## $ item_nbr   <int> 96995, 99197, 103501, 103520, 103665, 105574, 105575,…
## $ family     <chr> "GROCERY I", "GROCERY I", "CLEANING", "GROCERY I", "B…
## $ class      <int> 1093, 1067, 3008, 1028, 2712, 1045, 1045, 1045, 1045,…
## $ perishable <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,…

We find:

  • The items are grouped into a broad family (e.g. “BREAD/BAKERY”) and an integer class column. Once more, these will be factors.

  • perishable indicates whether the item will go bad over time. It is encoded as an integer but would work better as a logical feature, since the only values are 0 and 1: perishable (e.g. milk) vs not perishable (e.g. DVDs).

  • item_nbr is of course the key column relating this data set to train/test.

3.5 Transactions

summary(trans)
##      date             store_nbr      transactions 
##  Length:83488       Min.   : 1.00   Min.   :   5  
##  Class :character   1st Qu.:13.00   1st Qu.:1046  
##  Mode  :character   Median :27.00   Median :1393  
##                     Mean   :26.94   Mean   :1695  
##                     3rd Qu.:40.00   3rd Qu.:2079  
##                     Max.   :54.00   Max.   :8359
glimpse(trans)
## Observations: 83,488
## Variables: 3
## $ date         <chr> "2013-01-01", "2013-01-02", "2013-01-02", "2013-01-…
## $ store_nbr    <int> 25, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, …
## $ transactions <int> 770, 2111, 2358, 3487, 1922, 1903, 2143, 1874, 3250…

We find:

  • This data set gives us an additional total number of transactions per store_nbr for a given date. This information is only available for the training data.

3.6 Oil

summary(oil)
##      date             dcoilwtico    
##  Length:1218        Min.   : 26.19  
##  Class :character   1st Qu.: 46.41  
##  Mode  :character   Median : 53.19  
##                     Mean   : 67.71  
##                     3rd Qu.: 95.66  
##                     Max.   :110.62  
##                     NA's   :43
glimpse(oil)
## Observations: 1,218
## Variables: 2
## $ date       <chr> "2013-01-01", "2013-01-02", "2013-01-03", "2013-01-04…
## $ dcoilwtico <dbl> NA, 93.14, 92.97, 93.12, 93.20, 93.21, 93.08, 93.81, …

We find that this is a simple time series of the oil price at a given date. There are a few NAs in the price (dcoilwtico) column, and the date feature should be reformatted.
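
For later joins onto daily sales it can be convenient to fill those gaps, for instance by carrying the last observed price forward (a small sketch using tidyr::fill; interpolation would be an alternative):

# carry the last non-missing oil price forward over gaps
oil_filled <- oil %>%
  arrange(date) %>%
  fill(dcoilwtico, .direction = "down")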

3.7 Holidays

summary(holidays)
##      date               type              locale         
##  Length:350         Length:350         Length:350        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##  locale_name        description        transferred    
##  Length:350         Length:350         Mode :logical  
##  Class :character   Class :character   FALSE:338      
##  Mode  :character   Mode  :character   TRUE :12
glimpse(holidays)
## Observations: 350
## Variables: 6
## $ date        <chr> "2012-03-02", "2012-04-01", "2012-04-12", "2012-04-1…
## $ type        <chr> "Holiday", "Holiday", "Holiday", "Holiday", "Holiday…
## $ locale      <chr> "Local", "Regional", "Local", "Local", "Local", "Loc…
## $ locale_name <chr> "Manta", "Cotopaxi", "Cuenca", "Libertad", "Riobamba…
## $ description <chr> "Fundacion de Manta", "Provincializacion de Cotopaxi…
## $ transferred <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…

We find:

  • Holidays and special events also come in the shape of a time series with a date column.

  • There is a type of holiday, a locale indicating whether it applies locally, regionally, or nationally, the region it applies to (locale_name), as well as the name of the holiday in the description feature.

  • transferred is a logical column indicating whether this specific holiday was moved to a different day that year.

3.8 Reformatting features

train <- train %>%
  mutate(date = ymd(date),
         store_nbr = as.factor(store_nbr),
         item_nbr = as.factor(item_nbr))

test <- test %>%
  mutate(date = ymd(date),
         store_nbr = as.factor(store_nbr),
         item_nbr = as.factor(item_nbr))

stores <- stores %>%
  mutate(city = as.factor(city),
         state = as.factor(state),
         type = as.factor(type),
         store_nbr = as.factor(store_nbr))

items <- items %>%
  mutate(family = as.factor(family),
         class = as.factor(class),
         perish = as.logical(perishable),
         item_nbr = as.factor(item_nbr))

trans <- trans %>%
  mutate(date = ymd(date),
         store_nbr = as.factor(store_nbr))

oil <- oil %>%
  mutate(date = ymd(date)) %>%
  rename(oilprice = dcoilwtico)

holidays <- holidays %>%
  mutate(date = ymd(date),
         type = as.factor(type),
         locale = as.factor(locale),
         locale_name = as.factor(locale_name))

4 Individual feature visualisations

In the next step we will get a visual overview of the data by plotting the individual feature distributions for each data set separately. This will be the foundation for our analysis. From here, we will explore the impact of different features on the sales numbers, as well as plan multi-feature comparisons.

But first things first.

4.1 Training data

Here we visualise the training data distributions as histograms (for numerical and date features) or as bar plots (for categorical features). Note the double-logarithmic axes on the unit_sales plot:

p1 <- train %>%
  ggplot(aes(date)) +
  geom_freqpoly(color = "blue", binwidth = 10, size = 1.2) +
  coord_cartesian(ylim = c(38e3, 110e3))

p2 <- train %>%
  ggplot(aes(onpromotion, fill = onpromotion)) +
  geom_bar() +
  theme(legend.position = "none")

p3 <- train %>%
  filter(unit_sales > 0) %>%
  ggplot(aes(unit_sales)) +
  geom_histogram(fill = "red", binwidth = 0.1) +
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Positive unit sales")

p4 <- train %>%
  ggplot(aes(store_nbr, fill = store_nbr)) +
  geom_bar() +
  theme(legend.position = "none")

p5 <- train %>%
  ggplot(aes(item_nbr, fill = item_nbr)) +
  geom_bar() +
  theme(legend.position = "none", axis.text.x = element_blank(), axis.ticks.x = element_blank())

layout <- matrix(c(1,2,3,1,2,3,1,2,3,4,4,4,4,4,4,5,5,5,5,5,5),7,3,byrow=TRUE)
multiplot(p1, p2, p3, p4, p5, layout=layout)
Fig. 1

We find:

  • The number of dates in our data set shows an increasing trend over time, indicating rising sales and/or number of different units that are being sold.

  • Only a small fraction of items are onpromotion, and for a larger fraction this information is not known (NA). The majority of items are not on promotion.

  • The contribution per store_nbr in our data set varies typically by about a factor of 3 (for the down-sampled train data). A single store (store_nbr == 52) has only a small number of entries. This strong variation could correspond to the size of the store.

  • The distribution of item_nbr shows similarly strong variation, between only a few and several thousand entries. This is only a first overview showcasing the variation in this feature, which is why we don’t plot any numbers on the x-axis (the labels would not be readable for so many factor levels).

  • The positive unit_sales distribution peaks below 10 units and declines swiftly towards larger values. Note the double-logarithmic scales. Here we only plot the sales, not the returns (see the quick check below). Note that unit_sales can correspond to an integer value (e.g. a bottle of wine) or a float value (e.g. 0.5 kg of cheese, to go with the wine).
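
As a quick check of how rare returns are in our sample:

# fraction of (sampled) train rows with negative unit_sales, i.e. returns
train %>%
  summarise(return_frac = mean(unit_sales < 0))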

4.2 Stores

p1 <- stores %>%
  ggplot(aes(city, fill = city)) +
  geom_bar() +
  theme(legend.position = "none", axis.text.x  = element_text(angle=45, hjust=1, vjust=0.9))

p2 <- stores %>%
  ggplot(aes(state, fill = state)) +
  geom_bar() +
  theme(legend.position = "none", axis.text.x  = element_text(angle=45, hjust=1, vjust=0.9))

p3 <- stores %>%
  group_by(type) %>%
  count() %>%
  ggplot(aes(type, n, fill = type)) +
  geom_col() +
  geom_text(aes(type, n+0.7, label = sprintf("%i", n))) +
  theme(legend.position = "none")

p4 <- stores %>%
  mutate(cluster = as.factor(cluster)) %>%
  ggplot(aes(cluster, fill = cluster)) +
  geom_bar() +
  theme(legend.position = "none")

layout <- matrix(c(1,2,1,2,1,2,3,4,3,4),5,2,byrow=TRUE)
multiplot(p1, p2, p3, p4, layout=layout)
Fig. 3

We find:

  • The cities fall into four groups, with most of them having only a single store. Six cities have 2 or 3 stores. “Guayaquil” and the capital “Quito” are each in a group of their own with 8 and 18 stores, respectively.

  • The city distribution is reflected in the state distribution as well, with “Pichincha” having 19 stores, “Guayas” 11, and the rest between 1 and 3.

  • Among the types, we see that “D” and “C” are the most frequent ones, with “A” and “B” having similar medium frequency and “E” only accounting for 4 stores.

  • The cluster feature ranges from 1 to 17 (see the quick check below).
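
This can be confirmed directly from the data:

# cluster is an integer code between 1 and 17
range(stores$cluster)
## [1]  1 17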

4.3 Items

p1 <- items %>%
  ggplot(aes(family, fill = family)) +
  geom_bar() +
  theme(legend.position = "none", axis.text.x  = element_text(angle=45, hjust=1, vjust=0.9))

p2 <- items %>%
  ggplot(aes(class, fill = class)) +
  geom_bar() +
  theme(legend.position = "none", axis.text.x = element_blank(), axis.ticks.x = element_blank())

p3 <- items %>%
  group_by(class) %>%
  count() %>%
  filter(n > 70) %>%
  ggplot(aes(class, n, fill = class)) +
  geom_col() +
  theme(legend.position = "none") +
  labs(x = "class - most frequent")

p4 <- items %>%
  ggplot(aes(perish, fill = perish)) +
  geom_bar() +
  theme(legend.position = "none")

layout <- matrix(c(2,3,4,2,3,4,1,1,1,1,1,1,1,1,1),5,3,byrow=TRUE)
multiplot(p1, p2, p3, p4, layout=layout)
Fig. 5

We find:

  • The class feature has a large number of levels, suggesting a detailed sub-division, with large variation between a few and more than 100 items per class (upper left). In the upper middle plot we show the most frequent item classes, with “1016” being the most popular code.

  • About 1/4 of all items are perishable. This status will have a notable impact on our prediction models, because good predictions for perishable items are (understandably) rewarded with a bonus (see the metric sketch after this list).

  • The family of an item is a broad grouping corresponding to an aisle or section of a supermarket (e.g. “DELI” or “PET SUPPLIES”). The dominant role of “GROCERY I” helps us understand why there is a large fraction of perishable items.
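
The competition’s evaluation metric, the Normalized Weighted Root Mean Squared Logarithmic Error (NWRMSLE), implements this bonus by assigning perishable items a weight of 1.25 and all other items a weight of 1.00. A minimal sketch of the metric:

# sketch of the NWRMSLE; negative sales are clipped to zero before the log
nwrmsle <- function(actual, predicted, perishable) {
  w <- ifelse(perishable == 1, 1.25, 1)
  sqrt(sum(w * (log1p(pmax(predicted, 0)) - log1p(pmax(actual, 0)))^2) / sum(w))
}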

4.4 Transactions

Let’s look at the time series of the daily number of transactions per store. We start with the median over all stores. The red line is a smoothing fit:

trans %>%
  group_by(date) %>%
  summarise(med_trans = median(transactions)) %>%
  ggplot(aes(date, med_trans)) +
  geom_line(color = "black") +
  geom_smooth(method = "loess", color = "red", span = 1/5)
Fig. 7

We find that there is a strong spike before Christmas, with corresponding drops when the stores are presumably closed during the holidays. The smoothing fit emphasises slower variations on a time scale of months. Overall, the sales appear to be stable throughout this time range.

The next plot shows the smoothed transactions time series for each individual store:

trans %>%
  ggplot(aes(date, transactions, color = store_nbr)) +
  geom_smooth(method = "loess", span = 1/2, se = FALSE)
Fig. 8

We can see the large contributions of a few stores, together with the average level of around 1000 transactions for the smaller ones. We also see that a few stores opened during this time frame and show declining sales as (presumably) their novelty wears off.

4.5 Oil

Here we plot the oil price over time, together with its weekly changes (the price minus the price 7 days earlier):

p1 <- oil %>%
  ggplot(aes(date, oilprice)) +
  geom_line(color = "black") +
  geom_smooth(method = "loess", color = "red", span = 1/5)

p2 <- oil %>%
  mutate(lag7 = lag(oilprice,7)) %>%
  mutate(diff = oilprice - lag7) %>%
  filter(!is.na(diff)) %>%
  ggplot(aes(date, diff)) +
  geom_line(color = "black") +
  geom_smooth(method = "loess", color = "red", span = 1/5) +
  labs(y = "Weekly variations in Oil price")

layout <- matrix(c(1,2),2,1,byrow=TRUE)
multiplot(p1, p2, layout=layout)
Fig. 9

We find:

  • There are strong, long-term changes in the oil price, with an obvious drop in the second half of 2014. Overlaid on this long-term trend are fluctuations on time scales of weeks and months.

  • The frequent downward trends are visible in the week-to-week variations. The strong dips and rises in the lower plot might have a stronger influence on buying behaviour than the long-term evolution.

4.6 Holidays

Here we use two overview figures to make the individual plots more readable:

p1 <- holidays %>%
  ggplot(aes(type, fill = type)) +
  geom_bar() +
  theme(legend.position = "none")

p2 <- holidays %>%
  ggplot(aes(locale, fill = locale)) +
  geom_bar() +
  theme(legend.position = "none")

p3 <- holidays %>%
  group_by(description) %>%
  count() %>%
  arrange(desc(n)) %>%
  head(12) %>%
  ggplot(aes(description, n)) +
  geom_col(fill = "blue") +
  theme(legend.position = "none") +
  coord_flip() +
  labs(x = "description - most frequent", y = "Frequency")

p4 <- holidays %>%
  ggplot(aes(transferred, fill = transferred)) +
  geom_bar() +
  theme(legend.position = "none")

layout <- matrix(c(1,2,3,4),2,2,byrow=TRUE)
multiplot(p1, p2, p3, p4, layout=layout)
Fig. 10

holidays %>%
  ggplot(aes(locale_name, fill = locale_name)) +
  geom_bar() +
  theme(legend.position = "none", axis.text.x  = element_text(angle=45, hjust=1, vjust=0.9))
Fig. 11

We find:

  • Most special days are of the type “Holiday” and are of either the “Local” or “National” locale. Relatively few “Regional” holidays are present in our data.

  • The large number of national holidays is emphasised in the second plot, which shows the locale_name of the event.

  • The lower left plot lists a few of the most frequent holiday descriptions (i.e. their names). Carnival is clearly important.

  • The majority of days off are not transferred.

The transferred feature works a little differently from what you might assume at first. It is not directly related to type == Transfer, but is assigned to the original holiday day prior to the transfer. This can be seen when we group by transferred and type:

holidays %>%
  count(transferred, type)
## # A tibble: 7 x 3
##   transferred type           n
##   <lgl>       <fct>      <int>
## 1 FALSE       Additional    51
## 2 FALSE       Bridge         5
## 3 FALSE       Event         56
## 4 FALSE       Holiday      209
## 5 FALSE       Transfer      12
## 6 FALSE       Work Day       5
## 7 TRUE        Holiday       12

What this means is that a transferred holiday shows up twice in this data: first on its original date, with a transferred == TRUE flag, indicating that there was no holiday on this day; it was a (quasi) normal working day. The moved holiday instead has type == Transfer and transferred == FALSE on the new date, which is the actual non-working day.
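
Following this logic, one way to derive the dates that were actually non-working days is to drop the holidays that were transferred away, together with the compensatory “Work Day”s (a sketch; “Event” rows are kept here as regular special days):

# actual special (non-working) days
days_off <- holidays %>%
  filter(!transferred, type != "Work Day")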

plot_item_store(1503844,1)
Fig. 13

Here, the plot title contains information about the item classification (class and family) as well as the store type and city. In red, we plot the times when the item was on promotion (for simplicity, NAs are plotted as not-on-promotion values). The dashed blue line is a smoothing (loess) fit with a corresponding gray confidence range, which allows us to judge the long-term variation. Finally, the light-gray curve in the background is the oil price during this time range, arbitrarily scaled to the y-axis range of the plot.
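
The definitions of the plotting helpers used in this section (plot_item_store, plot_item, plot_item_zoom, plot_family_store_zoom) are not reproduced in this excerpt. Based on the description above, a minimal sketch of plot_item_store, omitting the title construction and the oil-price overlay, could look like this:

# hypothetical sketch of the helper; the original also adds a title
# and an oil-price overlay
plot_item_store <- function(itemnbr, storenbr) {
  train %>%
    filter(item_nbr == itemnbr, store_nbr == storenbr) %>%
    mutate(promo = !is.na(onpromotion) & onpromotion) %>%  # NAs count as "not on promotion"
    ggplot(aes(date, unit_sales)) +
    geom_point(aes(color = promo)) +
    geom_smooth(method = "loess", color = "blue", linetype = 2) +
    scale_color_manual(values = c("FALSE" = "black", "TRUE" = "red")) +
    theme(legend.position = "none")
}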

In this example, we see that the unit_sales had strong short-term variations superimposed on a relatively stable long-term trend. Promotions (in red) seem to have a positive impact on the sales numbers in most cases. Keep in mind that we only use 10% of training data values.

This helper function is easily generalised to plot a time series of cumulative unit_sales for a specific item in all stores:

plot_item(1503844)
Fig. 14

Here we see that the long-term trend is much more stable. We also use a connected line instead of data points, which emphasises the short-term variations as well as the gaps in information for this particular product. These gaps might be due to our 10% sampling, but that is statistically unlikely (see the quick estimate below); it is more likely that those time ranges are genuinely missing from our data set.
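
A back-of-the-envelope estimate supports this: if the item was sold in k stores on a given day, the chance that all k corresponding rows are dropped by a 10% sample is 0.9^k, which becomes small quickly:

# probability that a date vanishes completely from a 10% sample,
# given that the item was sold in k stores on that day
k <- c(1, 2, 5, 10, 20)
round(0.9^k, 3)
## [1] 0.900 0.810 0.590 0.349 0.122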

In this plot, the onpromotion flag (red) is set if one of the stores has a promotion for this item on this date.

Because such a busy time series can become difficult to read, we enhance our helper function with a zoom, courtesy of the ggforce package. Here we plot the promotion dates as red data points:

plot_item_zoom(1503844, 20161101, 20170201)
Fig. 15

In the zoom-in region we can clearly see the weekly variations. In this time series the promotion status seems not to have a significant impact. However, it is important to realise that this graph combines sales from all stores and that not many of them will have the same promotions at the same time.

The idea of plotting a single item_nbr can be extended to visualising all (cumulative) sales of a certain family of items. Here we again define a zoom to study short-term variations:

plot_family_store_zoom("GROCERY I", 1, 20160401, 20160601)
Fig. 16

In this plot we clearly see the impact of the strong earthquake on April 16, 2016, as “people rallied in relief efforts donating water and other first need products which greatly affected supermarket sales for several weeks after the earthquake.” (data description)

By comparison, this effect is less pronounced, but still visible, in the sales of e.g. “CLEANING” items:

plot_family_store_zoom("CLEANING", 1, 20160401, 20160601)
Fig. 17

Other groupings by different parameters are easily possible by modifying the helper functions. Feel free to fork this kernel and create tailored visualisations.

In summary: while we could forecast every single item_nbr independently, there is a lot to gain from looking at patterns within groups of items that might show similar behaviour. These relations will be the subject of the following sections.

Next we will have a look at the total sales per date (upper panel) and the mean sales per day of the week vs month of the year as a heatmap in the lower panel:

p1 <- train %>%
  group_by(date) %>%
  summarise(sales = sum(unit_sales)) %>%
  ggplot(aes(date, sales)) +
  geom_line(color = "blue")

p2 <- train %>%
  mutate(wday = wday(date, label = TRUE),
         month = month(date, label = TRUE)) %>%
  mutate(wday = fct_relevel(wday, c("Mon", "Tues", "Wed", "Thurs", "Fri", "Sat", "Sun"))) %>%
  group_by(wday, month) %>%
  summarise(mean_sales = mean(unit_sales)) %>%
  ggplot(aes(month, wday, fill = mean_sales)) +
  geom_tile() +
  labs(x = "Month of the year", y = "Day of the week") +
  scale_fill_distiller(palette = "Spectral")

layout <- matrix(c(1,2),2,1,byrow=TRUE)
multiplot(p1, p2, layout=layout)
Fig. 20

We find:

  • The sales volume rises over time, with notable spikes and hikes especially near Christmas.

  • The weekday and month information contained in the date shows us that weekends have significantly higher average sales numbers. Thursdays, on the other hand, have consistently lower sales, and Tuesdays seem slightly lower as well.

  • The higher level for December is most likely because of Christmas.

4.7 Oil: price changes and sales impact

This section will study the changes in sales vs the fluctuations in the oil price. First, this is an overlay plot of the two time series:

foo <- train %>%
  group_by(date) %>%
  summarise(sales = sum(unit_sales))

# rescale the oil price to the range of the sales for the overlay
oil_back <- oil %>%
  filter(date > min(foo$date)) %>%
  mutate(oilprice = min(foo$sales, na.rm = TRUE) +
           (oilprice - min(oilprice, na.rm = TRUE)) / (max(oilprice, na.rm = TRUE) - min(oilprice, na.rm = TRUE)) *
           (max(foo$sales, na.rm = TRUE) - min(foo$sales, na.rm = TRUE)))

foo %>%
  ggplot(aes(date, sales)) +
  geom_line() +
  geom_line(data = oil_back, aes(date, oilprice), color = "blue") +
  ggtitle("Total sales (black) with oilprice (blue)")
Fig. 28

We find that over time the sales increase while the oil price drops. The large oil price decrease during the 2nd half of 2014 might be related to the lower sales numbers in the 1st half of 2015 (compared to the on average higher level in late 2014). Over the last year, the oil price was relatively stable.

Now we will parametrise these changes by computing the 1-day, 7-day, and 30-day lag for the time series of all sales as well as the oil price, respectively. This way we can compare drops and rises on a daily, weekly, and (approximately) monthly scale between the two parameters. This is what the corresponding correlation coefficients look like:

bar <- foo %>%
  left_join(oil, by = "date") %>%
  mutate(oil1 = oilprice - lag(oilprice,1),
         oil7 = oilprice - lag(oilprice,7),
         oil30 = oilprice - lag(oilprice,30),
         sales1 = sales - lag(sales,1),
         sales7 = sales - lag(sales,7),
         sales30 = sales - lag(sales,30)) %>%
  filter(!is.na(oil30) & !is.na(oil1) & !is.na(sales30) & !is.na(oil7))

bar %>%
  select(-date) %>%
  cor(use="complete.obs", method = "spearman") %>%
  corrplot(type="lower", tl.col = "black", diag=FALSE, method="number")
Fig. 29

In simplest terms: this shows whether two features are connected, so that one changes with a predictable trend if you change the other. The closer this coefficient is to zero, the weaker the correlation; 1 and -1 are the ideal cases of perfect correlation and anti-correlation.
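
As a tiny illustration of the rank-based (Spearman) coefficient used here:

# Spearman correlation measures monotonic association via ranks
x <- 1:20
cor(x, x^3, method = "spearman")  #  1: perfectly monotonic, although non-linear
cor(x, -x, method = "spearman")   # -1: perfect anti-correlation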

We find:

  • There is the expected anti-correlation between the two time series themselves; this just captures their opposite overall trends over time.

  • More interesting is the lack of signal in the bottom middle block, between the sales changes and the oil price changes. This tells us that short-term changes in one time series had no significant impact on the other over the duration of our data.

  • The weak correlations within the sales and oil price indices reflect the overall trend.

This analysis is already getting into feature-engineering territory, which is why we will keep other ideas like this for a later section.

4.8 Promotions for items and stores

The impact of promotions on specific items and stores is best understood by joining the train, items, and stores data sets. Here we first look at the items and their families, with a special focus on perishable items.

This plot shows the median sales statistics by onpromotion status for groups of items. We only include items for which we have both promotion and non-promotion entries, so that we compare like with like. We use boxplots to get an idea of the distribution properties. The colour scheme is shown in the top right corner and is consistent throughout the plot. Except for the top right plot, we remove the NAs in the onpromotion feature:

foo <- train %>%
  group_by(item_nbr, onpromotion) %>%
  summarise(med_sales = median(unit_sales)) %>%
  filter(!is.na(onpromotion)) %>%
  spread(onpromotion, med_sales, fill = NA) %>%
  filter(!is.na(`TRUE`) & !is.na(`FALSE`)) %>%
  gather(`TRUE`, `FALSE`, key = "promo", value = "med_sales") %>%
  ungroup()

bar <- foo %>%
  mutate(item_nbr = as.character(item_nbr)) %>%
  left_join(items %>% mutate(item_nbr = as.character(item_nbr)), by = "item_nbr")

p1 <- foo %>%
  ggplot(aes(promo, med_sales, color = promo)) +
  geom_boxplot() +
  scale_y_log10() +
  theme(legend.position = "none")

p2 <- bar %>%
  ggplot(aes(perish, med_sales, color = promo)) +
  geom_boxplot() +
  scale_y_log10() +
  theme(legend.position = "none")

p3 <- train %>%
  mutate(item_nbr = as.character(item_nbr)) %>%
  left_join(items %>% mutate(item_nbr = as.character(item_nbr)), by = "item_nbr") %>%
  ggplot(aes(perish, fill = onpromotion)) +
  geom_bar(position = "fill")

p4 <- bar %>%
  mutate(family = str_sub(family, start = 1, end = 19)) %>%
  ggplot(aes(reorder(family, -med_sales, FUN = median), med_sales, color = promo)) +
  geom_boxplot() +
  scale_y_log10() +
  theme(legend.position = "none", axis.text.x  = element_text(angle=45, hjust=1, vjust=0.9)) +
  labs(x = "Family (reordered)")

layout <- matrix(c(1,2,3,1,2,3,4,4,4,4,4,4,4,4,4),5,3,byrow=TRUE)
multiplot(p1, p2, p3, p4, layout=layout)
Fig. 30

We find:

  • Promotions lead to somewhat higher sales on average. However, the difference is not large and the boxes overlap considerably.

  • This slight difference is preserved in the perishable vs non-perishable statistics, where we also see the trend for perishable items to have slightly higher median sales.

  • In general, a higher percentage of perishable items has been on promotion than of non-perishable ones. There are also fewer NAs for the perishable items. This particular plot looks at all items, not just the pairwise comparisons.

  • We have again re-ordered the family vs median sales plot by decreasing median sales volume (over all promotion values). This shows an overall trend in median sales, but no really strong promotion signal for any family. However, there are a few notable cases, such as “LADIESWEAR” or “SCHOOL AND OFFICE SUPPLIES”, where the promotion status seems to have at least a certain impact.

4.9 Forecasts

5 Dirty Prophet for Store Number 44 and Item Number 329362
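
As a quick first experiment, we fit a Prophet model to the 60 most recent daily sales of this item/store combination and forecast the next 16 days, i.e. the length of the test horizon. Setting changepoint.prior.scale above its default makes the fitted trend more flexible: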

# the 60 most recent days of sales for item 329362 in store 44
HighestSoldItem = train %>%
  filter(store_nbr == 44) %>%
  filter(item_nbr == 329362) %>%
  arrange(desc(date)) %>%
  select(date, unit_sales) %>%
  head(60)

# prophet expects the history columns to be named ds (date) and y (value)
colnames(HighestSoldItem) = c("ds","y")

m <- prophet(HighestSoldItem, changepoint.prior.scale = 0.1)

# extend the time series 16 days into the future and predict
future <- make_future_dataframe(m, periods = 16, freq = "day")
forecast <- predict(m, future)
predictions = tail(round(forecast$yhat), 16)

plot(m, forecast)

cat("The predictions are ","\n")
## The predictions are
predictions
##  [1] 11 13  9 11 17 24 13 11 13  9 11 17 24 13 11 13
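
The 16 predicted values essentially repeat with a period of 7 days, showing that this short 60-day history is dominated by Prophet's weekly seasonality component.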

5.1 Breaking the forecast into its Prophet components

prophet_plot_components(m, forecast)

5.2 XGBoost predictions
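
For comparison, we train a gradient boosted tree model via caret's xgbTree method on the same item/store combination. The helper functions below derive calendar features from the date, join the oil price and the national holidays, and encode the remaining categorical columns as integers: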

# replace NA with 0 (the joined holiday type is NA on non-holiday dates)
TransformColumn = function(x)
{
  if (is.na(x))
  {
    return (0)
  }
  else
  {
    return (x)
  }
}

GetHighestSoldItem = function(ds, ItemNumber, StoreNumber)
{
  # calendar features derived from the date
  HighestSoldItem = ds %>%
    filter(store_nbr == StoreNumber) %>%
    filter(item_nbr == ItemNumber) %>%
    mutate(year = year(ymd(date))) %>%
    mutate(month = month(ymd(date))) %>%
    mutate(dayOfWeek = wday(date)) %>%
    mutate(day = day(ymd(date)))

  # add the daily oil price
  HighestSoldItem = left_join(HighestSoldItem, oil)

  # add national (non-"Work Day") holidays
  HolidaysNational = holidays %>%
    filter(type != "Work Day") %>%
    filter(locale == "National")

  HighestSoldItem = left_join(HighestSoldItem, HolidaysNational, by = "date")

  # drop holiday details and identifier columns that should not be features
  HighestSoldItem = HighestSoldItem %>%
    select(-locale, -locale_name, -description, -transferred) %>%
    select(-id, -store_nbr, -item_nbr, -date, -onpromotion)

  # non-holiday dates have NA type; treat them as 0
  HighestSoldItem$type = sapply(HighestSoldItem$type, TransformColumn)

  # encode remaining factor/character columns as integers
  features <- colnames(HighestSoldItem)
  for (f in features) {
    if ((class(HighestSoldItem[[f]]) == "factor") || (class(HighestSoldItem[[f]]) == "character")) {
      levels <- unique(HighestSoldItem[[f]])
      HighestSoldItem[[f]] <- as.numeric(factor(HighestSoldItem[[f]], levels = levels))
    }
  }

  return(HighestSoldItem)
}

HighestSoldItem = GetHighestSoldItem(train,329362,44)
## Joining, by = "date"
HighestSoldItemTest = GetHighestSoldItem(test,329362,44)
## Joining, by = "date"
formula = unit_sales ~ .

fitControl <- trainControl(method="cv",number = 5)


PlotImportance = function(importance)
{
  varImportance <- data.frame(Variables = row.names(importance[[1]]),
                              Importance = round(importance[[1]]$Overall, 2))

  # create a rank variable based on importance
  rankImportance <- varImportance %>%
    mutate(Rank = paste0('#', dense_rank(desc(Importance))))

  ggplot(rankImportance, aes(x = reorder(Variables, Importance),
                             y = Importance)) +
    geom_bar(stat = 'identity', colour = "white", fill = "#FFA07A") +
    geom_text(aes(x = Variables, y = 1, label = Rank),
              hjust = 0, vjust = .5, size = 4, colour = 'black',
              fontface = 'bold') +
    labs(x = 'Variables', title = 'Relative Variable Importance') +
    coord_flip() +
    theme_bw()
}


xgbGrid <- expand.grid(nrounds = 500,
                       max_depth = 4,
                       eta = .05,
                       gamma = 0,
                       colsample_bytree = .5,
                       min_child_weight = 1,
                       subsample = 1)

set.seed(13)

XGBModel = train(formula, data = HighestSoldItem,
                 method = "xgbTree", trControl = fitControl,
                 tuneGrid = xgbGrid, na.action = na.pass, metric = "RMSE")


importance = varImp(XGBModel)
PlotImportance(importance)

predictions = round(predict(XGBModel,HighestSoldItemTest,na.action= na.pass))
test_itemNumber = test %>% filter(item_nbr==329362 & store_nbr==44)
test_itemNumber$unit_sales <- predictions
ggplot(data = test_itemNumber, aes(x = date,y = unit_sales)) + geom_line()