Here are the libraries used in this analysis.
library(ggplot2) # graphing
library(dplyr) # data summarization and manipulation
library(tidyr) # reshaping data frame
library(data.table) # fast file reading using data.table::fread
library(kableExtra) # beautify table output
library(lubridate) # date conversion and manipulation
This dataset is obtained from one of Kaggle Competition, Corporación Favorita Grocery Sales Forecasting.
They are available as CSV files.
list.files(path='./data/')
[1] "holidays_events.csv" "items.csv" "oil.csv" "sample_submission.csv" "stores.csv"
[6] "test.csv" "train.csv" "transactions.csv"
All data files are stored in data/ folder. date.table::fread() is used for fast reading as the data exceeds 4.8GB.
Some variables need to be converted to appropriate type:
- Categorical variable that was repsented as integer in original dataset must be converted to factor
- Date variable was imported as text or factor need to be converted to date type
items.df$perishable = as.factor(items.df$perishable)
items.df$class = as.factor(items.df$class)
holidays.df$date = as.Date(holidays.df$date, format = "%Y-%m-%d")
transactions.df$date = as.Date(transactions.df$date, format = "%Y-%m-%d")
train.df[, date := as.Date(fast_strptime(as.character(date), "%Y-%m-%d"))]
This dataset is store metadata. Each row represent a record for a unique store.
Number of columns: 5
Number of rows: 54 (represent total number of stores)
str(stores.df)
Classes data.table and 'data.frame': 54 obs. of 5 variables:
$ store_nbr: int 1 2 3 4 5 6 7 8 9 10 ...
$ city : Factor w/ 22 levels "Ambato","Babahoyo",..: 19 19 19 19 22 19 19 19 19 19 ...
$ state : Factor w/ 16 levels "Azuay","Bolivar",..: 13 13 13 13 15 13 13 13 13 13 ...
$ type : Factor w/ 5 levels "A","B","C","D",..: 4 4 4 4 4 4 4 4 2 3 ...
$ cluster : int 13 13 8 9 4 13 8 8 6 15 ...
- attr(*, ".internal.selfref")=<externalptr>
summary(stores.df)
store_nbr city state type cluster
Min. : 1.00 Quito :18 Pichincha :19 A: 9 Min. : 1.000
1st Qu.:14.25 Guayaquil : 8 Guayas :11 B: 8 1st Qu.: 4.000
Median :27.50 Cuenca : 3 Azuay : 3 C:15 Median : 8.500
Mean :27.50 Santo Domingo: 3 Manabi : 3 D:18 Mean : 8.481
3rd Qu.:40.75 Ambato : 2 Santo Domingo de los Tsachilas: 3 E: 4 3rd Qu.:13.000
Max. :54.00 Latacunga : 2 Cotopaxi : 2 Max. :17.000
(Other) :18 (Other) :13
Stores are distribtued across 22 cities.
stores.df %>% count(city) %>%
ggplot(aes(x=reorder(city,-n), y=n)) + geom_bar(stat='identity') +
labs(title='Stores Distribution by Cities', x='Cities', y='Number of Stores') +
theme(axis.text.x = element_text (angle = (90), hjust = 1, vjust = 0.5))
pt1 = prop.table(table(stores.df$city)) %>% sort(decreasing = T) %>% {.*100} %>% head(2)
Majority of stores are located in Quito (33.3%) and Guayaquil (14.8%).
Stores are distribtued across 16 states.
stores.df %>% count(state) %>%
ggplot(aes(x=reorder(state,-n), y=n)) + geom_bar(stat='identity') +
labs(title='Stores Distribution by States', x='States', y='Number of Stores') +
theme(axis.text.x = element_text (angle = (90), hjust = 1, vjust = 0.5))
stores.df %>% group_by(state) %>% mutate(cities=paste(unique(city), collapse = " - ")) %>%
summarize(n_stores=n(), unique_cities=n_distinct(city),cities=max(cities)) %>%
arrange(-n_stores) %>% kable %>%
kable_styling(bootstrap_options = c("striped", "condensed"))
| state | n_stores | unique_cities | cities |
|---|---|---|---|
| Pichincha | 19 | 2 | Quito - Cayambe |
| Guayas | 11 | 4 | Guayaquil - Daule - Playas - Libertad |
| Azuay | 3 | 1 | Cuenca |
| Manabi | 3 | 2 | Manta - El Carmen |
| Santo Domingo de los Tsachilas | 3 | 1 | Santo Domingo |
| Cotopaxi | 2 | 1 | Latacunga |
| El Oro | 2 | 1 | Machala |
| Los Rios | 2 | 2 | Babahoyo - Quevedo |
| Tungurahua | 2 | 1 | Ambato |
| Bolivar | 1 | 1 | Guaranda |
| Chimborazo | 1 | 1 | Riobamba |
| Esmeraldas | 1 | 1 | Esmeraldas |
| Imbabura | 1 | 1 | Ibarra |
| Loja | 1 | 1 | Loja |
| Pastaza | 1 | 1 | Puyo |
| Santa Elena | 1 | 1 | Salinas |
pt2 = prop.table(table(stores.df$state)) %>% sort(decreasing = T) %>% {.*100} %>% head(2)
Majority of stores are located in state Pichincha (35.2%) and Guayas (20.4%).
Number of columns: 4
Number of rows: 4100 (each row represent single product)
str(items.df)
Classes data.table and 'data.frame': 4100 obs. of 4 variables:
$ item_nbr : int 96995 99197 103501 103520 103665 105574 105575 105576 105577 105693 ...
$ family : Factor w/ 33 levels "AUTOMOTIVE","BABY CARE",..: 13 13 8 13 6 13 13 13 13 13 ...
$ class : Factor w/ 337 levels "1002","1003",..: 65 45 218 18 188 32 32 32 32 23 ...
$ perishable: Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 1 1 ...
- attr(*, ".internal.selfref")=<externalptr>
summary(items.df)
item_nbr family class perishable
Min. : 96995 GROCERY I :1334 1016 : 133 0:3114
1st Qu.: 818111 BEVERAGES : 613 1040 : 110 1: 986
Median :1306198 CLEANING : 446 1124 : 100
Mean :1251436 PRODUCE : 306 1034 : 98
3rd Qu.:1904918 DAIRY : 242 1122 : 81
Max. :2134244 PERSONAL CARE: 153 1072 : 70
(Other) :1006 (Other):3508
There are total 33 product familes, with number of items distributed shown below.
items.df %>% count(family) %>%
ggplot(aes(x=reorder(family,-n), y=n)) + geom_bar(stat='identity') +
labs(title='Items Distribution', x='Family', y='Number of Items') +
theme(axis.text.x = element_text (angle = (90), hjust = 1, vjust = 0.5))
Top 5 Families of Product and Its Percentage of Items
pt3 = prop.table(table(items.df$family) %>% sort(decreasing = T)) *100
kable(pt3[1:5],digits = 2, col.names = c('Family','Percentage')) %>% kable_styling(bootstrap_options = c("striped", "condensed"))
| Family | Percentage |
|---|---|
| GROCERY I | 32.54 |
| BEVERAGES | 14.95 |
| CLEANING | 10.88 |
| PRODUCE | 7.46 |
| DAIRY | 5.90 |
Number of columns: 6
Number of rows: 350 (each row represent a holiday)
Number of unique holidays date: 312
str(holidays.df)
Classes data.table and 'data.frame': 350 obs. of 6 variables:
$ date : Date, format: "2012-03-02" "2012-04-01" "2012-04-12" "2012-04-14" ...
$ type : Factor w/ 6 levels "Additional","Bridge",..: 4 4 4 4 4 4 4 4 4 4 ...
$ locale : Factor w/ 3 levels "Local","National",..: 1 3 1 1 1 1 1 3 1 1 ...
$ locale_name: Factor w/ 24 levels "Ambato","Cayambe",..: 16 3 4 13 20 17 8 11 12 15 ...
$ description: Factor w/ 103 levels "Batalla de Pichincha",..: 26 55 19 7 9 11 5 56 6 25 ...
$ transferred: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
- attr(*, ".internal.selfref")=<externalptr>
summary(holidays.df)
date type locale locale_name description transferred
Min. :2012-03-02 Additional: 51 Local :152 Ecuador :174 Carnaval : 10 Mode :logical
1st Qu.:2013-12-23 Bridge : 5 National:174 Quito : 13 Fundacion de Cuenca : 7 FALSE:338
Median :2015-06-08 Event : 56 Regional: 24 Ambato : 12 Fundacion de Ibarra : 7 TRUE :12
Mean :2015-04-24 Holiday :221 Guaranda : 12 Cantonizacion de Cayambe : 6
3rd Qu.:2016-07-03 Transfer : 12 Latacunga: 12 Cantonizacion de El Carmen: 6
Max. :2017-12-26 Work Day : 5 Riobamba : 12 Cantonizacion de Guaranda : 6
(Other) :115 (Other) :308
The holidays dataset records all holidays from 02 March 2012 till 26 December 2017.
Notes from data source provider:
- Bridge are extended holidays to make up long weekend, they are replaced by Work Day
- Additional are extra holidays given, typically around Christmas (without replacement Work Day)
- Transfer are replacement for original celeberation that were made a normal day
- Event type is National Holiday but not explained
holidays.df %>% count(type, locale) %>%
ggplot(aes(x=reorder(locale,-n), y=n, fill=type)) + geom_bar(stat='identity') +
labs(title='Holidays Distribution by Locale', x='Type', y='Days of Holidays') +
theme(axis.text.x = element_text (angle = (90), hjust = 1, vjust = 0.5))
holidays.df %>% count(locale,type) %>%
ggplot(aes(x=reorder(type,-n), y=n, fill=locale)) + geom_bar(stat='identity') +
labs(title='Holidays Distribution by Type', x='Locale', y='Days of Holidays') +
theme(axis.text.x = element_text (angle = (90), hjust = 1, vjust = 0.5))
There are total 12 holidays which were transferred from original date. These holidays are marked as Transferred as TRUE.
holidays.df %>% filter(transferred == T) %>% kable %>% kable_styling(bootstrap_options = c("striped", "condensed"))
| date | type | locale | locale_name | description | transferred |
|---|---|---|---|---|---|
| 2012-10-09 | Holiday | National | Ecuador | Independencia de Guayaquil | TRUE |
| 2013-10-09 | Holiday | National | Ecuador | Independencia de Guayaquil | TRUE |
| 2014-10-09 | Holiday | National | Ecuador | Independencia de Guayaquil | TRUE |
| 2016-05-24 | Holiday | National | Ecuador | Batalla de Pichincha | TRUE |
| 2016-07-25 | Holiday | Local | Guayaquil | Fundacion de Guayaquil | TRUE |
| 2016-08-10 | Holiday | National | Ecuador | Primer Grito de Independencia | TRUE |
| 2017-01-01 | Holiday | National | Ecuador | Primer dia del ano | TRUE |
| 2017-04-12 | Holiday | Local | Cuenca | Fundacion de Cuenca | TRUE |
| 2017-05-24 | Holiday | National | Ecuador | Batalla de Pichincha | TRUE |
| 2017-08-10 | Holiday | National | Ecuador | Primer Grito de Independencia | TRUE |
| 2017-09-28 | Holiday | Local | Ibarra | Fundacion de Ibarra | TRUE |
| 2017-12-06 | Holiday | Local | Quito | Fundacion de Quito | TRUE |
The above transferred holidays correpsonse to the original holidays as below.
holidays.df %>% filter(type == 'Transfer') %>% kable %>% kable_styling(bootstrap_options = c("striped", "condensed"))
| date | type | locale | locale_name | description | transferred |
|---|---|---|---|---|---|
| 2012-10-12 | Transfer | National | Ecuador | Traslado Independencia de Guayaquil | FALSE |
| 2013-10-11 | Transfer | National | Ecuador | Traslado Independencia de Guayaquil | FALSE |
| 2014-10-10 | Transfer | National | Ecuador | Traslado Independencia de Guayaquil | FALSE |
| 2016-05-27 | Transfer | National | Ecuador | Traslado Batalla de Pichincha | FALSE |
| 2016-07-24 | Transfer | Local | Guayaquil | Traslado Fundacion de Guayaquil | FALSE |
| 2016-08-12 | Transfer | National | Ecuador | Traslado Primer Grito de Independencia | FALSE |
| 2017-01-02 | Transfer | National | Ecuador | Traslado Primer dia del ano | FALSE |
| 2017-04-13 | Transfer | Local | Cuenca | Fundacion de Cuenca | FALSE |
| 2017-05-26 | Transfer | National | Ecuador | Traslado Batalla de Pichincha | FALSE |
| 2017-08-11 | Transfer | National | Ecuador | Traslado Primer Grito de Independencia | FALSE |
| 2017-09-29 | Transfer | Local | Ibarra | Fundacion de Ibarra | FALSE |
| 2017-12-08 | Transfer | Local | Quito | Traslado Fundacion de Quito | FALSE |
This also means holidays with type Transfer had been normal days and its celebration had been moved to holidays with Transferred equals TRUE
Transaction is the records the count of sales transactions for each date, store_nbr combination. Only included for the training data timeframe.
Number of columns: 3
Number of rows: 83488
Structure
str(transactions.df)
Classes data.table and 'data.frame': 83488 obs. of 3 variables:
$ date : Date, format: "2013-01-01" "2013-01-02" "2013-01-02" "2013-01-02" ...
$ store_nbr : int 25 1 2 3 4 5 6 7 8 9 ...
$ transactions: int 770 2111 2358 3487 1922 1903 2143 1874 3250 2940 ...
- attr(*, ".internal.selfref")=<externalptr>
Sample Rows
head(transactions.df) %>% kable %>% kable_styling(bootstrap_options = c("striped", "condensed"))
| date | store_nbr | transactions |
|---|---|---|
| 2013-01-01 | 25 | 770 |
| 2013-01-02 | 1 | 2111 |
| 2013-01-02 | 2 | 2358 |
| 2013-01-02 | 3 | 3487 |
| 2013-01-02 | 4 | 1922 |
| 2013-01-02 | 5 | 1903 |
Summary
summary(transactions.df)
date store_nbr transactions
Min. :2013-01-01 Min. : 1.00 Min. : 5
1st Qu.:2014-03-27 1st Qu.:13.00 1st Qu.:1046
Median :2015-06-08 Median :27.00 Median :1393
Mean :2015-05-20 Mean :26.94 Mean :1695
3rd Qu.:2016-07-14 3rd Qu.:40.00 3rd Qu.:2079
Max. :2017-08-15 Max. :54.00 Max. :8359
Number of columns: 6
Number of rows: 125,497,040
Notes from Datasource Provider:
Structure
str(train.df)
Classes data.table and 'data.frame': 125497040 obs. of 6 variables:
$ id : int 0 1 2 3 4 5 6 7 8 9 ...
$ date : Date, format: "2013-01-01" "2013-01-01" "2013-01-01" "2013-01-01" ...
$ store_nbr : int 25 25 25 25 25 25 25 25 25 25 ...
$ item_nbr : int 103665 105574 105575 108079 108701 108786 108797 108952 111397 114790 ...
$ unit_sales : num 7 1 2 1 1 3 1 1 13 3 ...
$ onpromotion: logi NA NA NA NA NA NA ...
- attr(*, ".internal.selfref")=<externalptr>
Sample Rows
head(train.df) %>% kable %>% kable_styling(bootstrap_options = c("striped", "condensed"))
| id | date | store_nbr | item_nbr | unit_sales | onpromotion |
|---|---|---|---|---|---|
| 0 | 2013-01-01 | 25 | 103665 | 7 | NA |
| 1 | 2013-01-01 | 25 | 105574 | 1 | NA |
| 2 | 2013-01-01 | 25 | 105575 | 2 | NA |
| 3 | 2013-01-01 | 25 | 108079 | 1 | NA |
| 4 | 2013-01-01 | 25 | 108701 | 1 | NA |
| 5 | 2013-01-01 | 25 | 108786 | 3 | NA |
Summary
summary(train.df)