1 Kaggle Mission

In this competition, you will forecast the demand for a product at a particular store in a given week. The dataset consists of 9 weeks of sales transactions in Mexico. Every week, delivery trucks deliver products to the vendors; each transaction consists of sales and returns, where returns are products that went unsold or expired. The demand for a product in a given week is defined as that week's sales minus the following week's returns.

The train/test split is based on time, and so is the public/private leaderboard split.

Things to note:

  • There may be products in the test set that don't exist in the train set. This is expected for inventory data, since new products are introduced all the time; your model should be able to accommodate this.

  • There are duplicate Cliente_IDs in cliente_tabla: one Cliente_ID may map to multiple, very similar NombreCliente values. The NombreCliente field is noisy and not standardized in the raw data, so it is up to you to decide how to clean and use it.

  • The adjusted demand (Demanda_uni_equil) is always >= 0, since demand is either 0 or positive. The reason Venta_uni_hoy - Dev_uni_proxima is sometimes negative is that return records can carry over a few weeks (see the sketch after the data fields below).

File descriptions

  • train.csv — the training set
  • test.csv — the test set
  • sample_submission.csv — a sample submission file in the correct format
  • cliente_tabla.csv — client names (can be joined with train/test on Cliente_ID)
  • producto_tabla.csv — product names (can be joined with train/test on Producto_ID)
  • town_state.csv — town and state (can be joined with train/test on Agencia_ID)

Data fields

  • Semana — Week number (From Thursday to Wednesday)
  • Agencia_ID — Sales Depot ID
  • Canal_ID — Sales Channel ID
  • Ruta_SAK — Route ID (Several routes = Sales Depot)
  • Cliente_ID — Client ID
  • NombreCliente — Client name
  • Producto_ID — Product ID
  • NombreProducto — Product Name
  • Venta_uni_hoy — Sales unit this week (integer)
  • Venta_hoy — Sales this week (unit: pesos)
  • Dev_uni_proxima — Returns unit next week (integer)
  • Dev_proxima — Returns next week (unit: pesos)
  • Demanda_uni_equil — Adjusted Demand (integer) (This is the target you will predict)
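
To make the target concrete, here is a minimal sketch of how the adjusted demand relates to the raw columns (the path, the nrows limit, and the demand_check column are illustrative, not from the original analysis):

library(data.table)
train <- fread("files/train.csv", nrows = 1000) # illustrative path and row limit
# Demand is units sold this week minus units returned next week, floored at
# zero; negative raw differences occur when returns carry over several weeks.
train[, demand_check := pmax(Venta_uni_hoy - Dev_uni_proxima, 0)]
mean(train$demand_check == train$Demanda_uni_equil) # expect (close to) 1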

2 Path of Action

  1. Import the small tables, decide which fields will be used, and import everything needed.
  2. Build a section exploring the data: variable types, distributions, etc.
  3. Explore machine-learning and/or regression approaches.

3 Libraries

library(data.table); library(plyr); library(ggplot2)

4 Files

Print the first 6 rows of every table provided:

fread("/Users/Zach/R/Kaggle/GrupoBimbo2016/files/cliente_tabla.csv", nrows = 6);
##    Cliente_ID                           NombreCliente
## 1:          0                              SIN NOMBRE
## 2:          1                        OXXO XINANTECATL
## 3:          2                              SIN NOMBRE
## 4:          3                               EL MORENO
## 5:          4 SDN SER  DE ALIM  CUERPO SA CIA  DE INT
## 6:          4    SDN SER DE ALIM CUERPO SA CIA DE INT
fread("/Users/Zach/R/Kaggle/GrupoBimbo2016/files/producto_tabla.csv", nrows = 6);
##    Producto_ID                         NombreProducto
## 1:           0                      NO IDENTIFICADO 0
## 2:           9              Capuccino Moka 750g NES 9
## 3:          41 Bimbollos Ext sAjonjoli 6p 480g BIM 41
## 4:          53         Burritos Sincro 170g CU LON 53
## 5:          72    Div Tira Mini Doradita 4p 45g TR 72
## 6:          73      Pan Multigrano Linaza 540g BIM 73
fread("/Users/Zach/R/Kaggle/GrupoBimbo2016/files/town_state.csv", nrows = 6);
##    Agencia_ID                  Town            State
## 1:       1110    2008 AG. LAGO FILT     MÉXICO, D.F.
## 2:       1111 2002 AG. AZCAPOTZALCO     MÉXICO, D.F.
## 3:       1112   2004 AG. CUAUTITLAN ESTADO DE MÉXICO
## 4:       1113    2008 AG. LAGO FILT     MÉXICO, D.F.
## 5:       1114  2029 AG.IZTAPALAPA 2     MÉXICO, D.F.
## 6:       1116  2011 AG. SAN ANTONIO     MÉXICO, D.F.
fread("/Users/Zach/R/Kaggle/GrupoBimbo2016/files/train.csv", nrows = 6);
##    Semana Agencia_ID Canal_ID Ruta_SAK Cliente_ID Producto_ID
## 1:      3       1110        7     3301      15766        1212
## 2:      3       1110        7     3301      15766        1216
## 3:      3       1110        7     3301      15766        1238
## 4:      3       1110        7     3301      15766        1240
## 5:      3       1110        7     3301      15766        1242
## 6:      3       1110        7     3301      15766        1250
##    Venta_uni_hoy Venta_hoy Dev_uni_proxima Dev_proxima Demanda_uni_equil
## 1:             3     25.14               0           0                 3
## 2:             4     33.52               0           0                 4
## 3:             4     39.32               0           0                 4
## 4:             4     33.52               0           0                 4
## 5:             3     22.92               0           0                 3
## 6:             5     38.20               0           0                 5
fread("/Users/Zach/R/Kaggle/GrupoBimbo2016/files/test.csv", nrows = 6);
##    id Semana Agencia_ID Canal_ID Ruta_SAK Cliente_ID Producto_ID
## 1:  0     11       4037        1     2209    4639078       35305
## 2:  1     11       2237        1     1226    4705135        1238
## 3:  2     10       2045        1     2831    4549769       32940
## 4:  3     11       1227        1     4448    4717855       43066
## 5:  4     11       1219        1     1130     966351        1277
## 6:  5     11       1146        4     6601    1741414         972
fread("/Users/Zach/R/Kaggle/GrupoBimbo2016/files/sample_submission.csv", nrows = 6);
##    id Demanda_uni_equil
## 1:  0                 7
## 2:  1                 7
## 3:  2                 7
## 4:  3                 7
## 5:  4                 7
## 6:  5                 7

5 Pick a Product and Client

We’ll definitely need to import the client and product tables, as well as the full training set, so we can look at our selection.
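
The import chunk itself is not echoed in this report (only fread’s progress output appears below); a plausible reconstruction, with the factor conversions assumed from the summaries that follow, is:

grupoPath <- "/Users/Zach/R/Kaggle/GrupoBimbo2016/files/"
# Reconstructed import step (assumed; the actual chunk is not echoed).
# colClasses = "factor" matches the factor-style summaries shown below.
grupoClients  <- read.csv(paste0(grupoPath, "cliente_tabla.csv"), colClasses = "factor")
grupoProducts <- read.csv(paste0(grupoPath, "producto_tabla.csv"), colClasses = "factor")
grupoTrain    <- fread(paste0(grupoPath, "train.csv")) # ~74M rows, ~3 GB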

## (fread progress messages trimmed)
## Read 74180464 rows and 11 (of 11) columns from 2.980 GB file in 00:01:10

Take a look at the data in these tables:

summary(grupoClients)
##    Cliente_ID             NombreCliente   
##  100039 :     2   NO IDENTIFICADO:281670  
##  1001020:     2   LUPITA         :  4863  
##  100153 :     2   MARY           :  3016  
##  1001916:     2   LA PASADITA    :  2426  
##  1002226:     2   LA VENTANITA   :  2267  
##  100241 :     2   LA GUADALUPANA :  1299  
##  (Other):935350   (Other)        :639821
summary(grupoProducts)
##   Producto_ID                                          NombreProducto
##  0      :   1   100pct Whole Wheat 680g MTA ORO 43111         :   1  
##  100    :   1   100pct Whole Wheat 680g ORO 9753              :   1  
##  1031   :   1   12 Tarima MiniMilk Kitty Nav 110g GBI 37157   :   1  
##  1039   :   1   12Granos Multigra TwinPack 1360g MTA ORO 43364:   1  
##  106    :   1   12Granos Multigra TwinPack 1360g TAB ORO 48227:   1  
##  1064   :   1   12Tarima Mini Milk Kitty SnVal 110g GBI 37373 :   1  
##  (Other):2586   (Other)                                       :2586

So, the client data look a little odd (the competition description mentions this). We’ll need to tackle the client duplication issue at some point.
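
Before counting, one hedged cleanup sketch: the two Cliente_ID 4 rows in cliente_tabla above differ only by repeated spaces, so collapsing whitespace and deduplicating already removes some noise. (grupoClientsClean is a hypothetical name; the rest of this analysis keeps using the raw table.)

grupoClientsClean <- grupoClients
# Collapse runs of whitespace in the noisy NombreCliente field, then drop
# exact duplicates; this merges pairs like the two Cliente_ID 4 rows above.
grupoClientsClean$NombreCliente <- gsub("\\s+", " ", trimws(as.character(grupoClientsClean$NombreCliente)))
grupoClientsClean <- unique(grupoClientsClean)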

count(grupoClients$NombreCliente) -> clientCount
clientCount[order(clientCount$freq, decreasing = T),] -> clientCount
clientCount[1:20,] #top 20 clients
##                      x   freq
## 230899 NO IDENTIFICADO 281670
## 181804          LUPITA   4863
## 206532            MARY   3016
## 166729     LA PASADITA   2426
## 168604    LA VENTANITA   2267
## 165117  LA GUADALUPANA   1299
## 269449            ROSY   1245
## 21664             ALEX   1242
## 123289            GABY   1238
## 164220    LA ESCONDIDA   1216
## 247249            PATY   1145
## 164278    LA ESPERANZA   1139
## 135178       HERNANDEZ   1129
## 163533     LA CHIQUITA   1117
## 76697             DANY   1082
## 123592          GARCIA   1062
## 160463       JUQUILITA   1022
## 206417        MARTINEZ    987
## 168163     LA TIENDITA    914
## 463         3 HERMANOS    913
grupoClients[grupoClients$NombreCliente == "LUPITA",][1:10,] 
##     Cliente_ID NombreCliente
## 8            6        LUPITA
## 130       2116        LUPITA
## 260       2472        LUPITA
## 283       2523        LUPITA
## 309       2588        LUPITA
## 320       2605        LUPITA
## 348       2683        LUPITA
## 457       3046        LUPITA
## 652       3556        LUPITA
## 659       3573        LUPITA
nrow(grupoClients[grupoClients$NombreCliente == "LUPITA",]) # number of lupita entries... is lupita a person or business?
## [1] 4863
count(grupoClients$Cliente_ID) -> clientIDCount
clientIDCount[order(clientIDCount$freq, decreasing = T),][1:10,] #top 10 duplicated IDs
##           x freq
## 96   100039    2
## 277 1001020    2
## 432  100153    2
## 530 1001916    2
## 622 1002226    2
## 670  100241    2
## 798  100295    2
## 856 1003165    2
## 966  100361    2
## 990  100369    2
grupoClients[duplicated(grupoClients$Cliente_ID),] -> grupoClientsDupeIDs
nrow(grupoClientsDupeIDs)/nrow(grupoClients) # % of the client IDs being dupes
## [1] 0.005197988
nrow(clientCount)/nrow(grupoClients) # distinct names as a share of all rows: only ~1/3, so names are heavily duplicated
## [1] 0.3326573
grupoClients[grupoClients$NombreCliente == "NO IDENTIFICADO",][1:10,] # making sure the no identification clients have unique ID numbers
##        Cliente_ID   NombreCliente
## 612407    2465096 NO IDENTIFICADO
## 612410    2465098 NO IDENTIFICADO
## 612412    2465102 NO IDENTIFICADO
## 612413    2465104 NO IDENTIFICADO
## 612414    2465105 NO IDENTIFICADO
## 612415    2465106 NO IDENTIFICADO
## 612417    2465109 NO IDENTIFICADO
## 612420    2465112 NO IDENTIFICADO
## 612424    2465116 NO IDENTIFICADO
## 612426    2465120 NO IDENTIFICADO
nrow(grupoClients[grupoClients$NombreCliente == "NO IDENTIFICADO",])/nrow(grupoClients) # % of clients being no identification clients
## [1] 0.3011347

So, we’ve learned:

  1. About 30% of the client rows carry the placeholder name NO IDENTIFICADO rather than a real identification.
  2. Distinct names cover only about a third of the rows, so roughly two thirds of the name entries are shared with at least one other client.
  3. The ID numbers are mostly unique, with only ~0.5% of them duplicated.

5.1 Explore Clients

I want to make a semi-informed decision on which client to choose… Let’s see what we can do about that.

Trying to find out: which is the biggest client? How are order counts distributed across clients? What would make a good candidate to examine?

topClients <- count(grupoTrain$Cliente_ID[sample(1:nrow(grupoTrain), 1000000, replace = F)]) # take a ~1.3% sample (1M of ~74M rows) and count the biggest clients within it

topClients <- merge(x = topClients, y = grupoClients, by.x = "x", by.y = "Cliente_ID", all.x = T)
topClients <- aggregate(freq ~ NombreCliente, topClients, FUN = sum)
topClients <- topClients[order(topClients$freq, decreasing = T),]
topClients$percestimate <- round(topClients$freq/1000000,3)

quantile(topClients$freq)
##     0%    25%    50%    75%   100% 
##      1      1      2      4 179254
summary(topClients$freq) # heavily right-skewed: most client names appear a handful of times, a few dominate
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##      1.00      1.00      2.00      5.65      4.00 179300.00
topClients[1:20,] #top 20 clients, after smashing all of the duplicate IDs together
##          NombreCliente   freq percestimate
## 131022 NO IDENTIFICADO 179254        0.179
## 102109          LUPITA   6241        0.006
## 115901            MARY   3774        0.004
## 93376      LA PASADITA   2812        0.003
## 94640     LA VENTANITA   2390        0.002
## 92241   LA GUADALUPANA   1915        0.002
## 14698             ALEX   1761        0.002
## 91652     LA ESPERANZA   1729        0.002
## 145687 PUEBLA REMISION   1627        0.002
## 69396             GABY   1607        0.002
## 142425            PATY   1601        0.002
## 91161      LA CHIQUITA   1595        0.002
## 153841            ROSY   1513        0.002
## 43120             DANY   1473        0.001
## 93754   LA PROVIDENCIA   1419        0.001
## 75572        HERNANDEZ   1390        0.001
## 69598           GARCIA   1336        0.001
## 91613     LA ESCONDIDA   1316        0.001
## 115829        MARTINEZ   1274        0.001
## 91682       LA ESQUINA   1233        0.001

… Just curious about the “PUEBLA REMISION” entry. “Remisión” here most likely refers to a nota de remisión (a delivery/consignment note), so these look like consignment deliveries in Puebla rather than a named storefront.

grupoTrain <- merge(x = grupoTrain, y = grupoClients, by = "Cliente_ID", sort = F, all.x = T)

grupoTrainSpecial <- grupoTrain[grupoTrain$NombreCliente == "LA PASADITA",]
list("total training rows"=nrow(grupoTrain), "la pasadita rows"=nrow(grupoTrainSpecial))
## $`total training rows`
## [1] 74773833
## 
## $`la pasadita rows`
## [1] 209102
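
Note that the merge grew the table from 74,180,464 to 74,773,833 rows: the duplicate Cliente_IDs in cliente_tabla fan out the left join. A hedged sketch of one way this could have been avoided, applied in place of the merge above and assuming keeping the first name per ID is acceptable (grupoClientsUnique and grupoTrainNoFanout are hypothetical names):

# Deduplicate on Cliente_ID before joining so the row count is preserved
# (at the cost of discarding the alternate name spellings).
grupoClientsUnique <- grupoClients[!duplicated(grupoClients$Cliente_ID), ]
grupoTrainNoFanout <- merge(x = grupoTrain, y = grupoClientsUnique, by = "Cliente_ID", sort = F, all.x = T)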

Moving on to picking a product…

topProductsLaPas <- count(grupoTrainSpecial$Producto_ID)
topProductsLaPas <- merge(x = topProductsLaPas, y = grupoProducts, by.x = "x", by.y = "Producto_ID", all.x = T, sort = F)
sum(duplicated(topProductsLaPas$NombreProducto)) #0 if there's no duplicated products
## [1] 0
topProductsLaPas <- topProductsLaPas[order(topProductsLaPas$freq, decreasing = T),]
topProductsLaPas[1:10,]
##         x freq                         NombreProducto
## 26   1240 6651   Mantecadas Vainilla 4p 125g BIM 1240
## 28   1250 6194          Donas Azucar 4p 105g BIM 1250
## 30   1278 6191                   Nito 1p 62g BIM 1278
## 27   1242 6186 Donitas Espolvoreadas 6p 105g BIM 1242
## 43   2233 6028               Pan Blanco 640g BIM 2233
## 31   1284 5543               Rebanada 2p 55g BIM 1284
## 23   1232 4898              Panque Nuez 255g BIM 1232
## 266 35651 4435             Madalenas 3p 93g BIM 35651
## 2    1109 4194         Pan Blanco Chico 360g BIM 1109
## 406 41938 4124         Mantecadas Nuez 123g BIM 41938
summary(topProductsLaPas$freq) # distribution of order counts per product for this client name
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     5.0    29.0   345.1   218.0  6651.0

Let’s go with the “Pan Blanco 640g BIM 2233”.
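
One housekeeping note: a renaming step runs between here and the next summary without being echoed. A plausible reconstruction, assuming data.table’s setnames and with the character casts inferred from the summary below:

# Reconstructed, un-echoed step: rename the Spanish columns to the English
# names that appear in the summary below.
setnames(grupoTrainSpecial,
         old = c("Semana", "Agencia_ID", "Canal_ID", "Ruta_SAK", "Venta_uni_hoy",
                 "Venta_hoy", "Dev_uni_proxima", "Dev_proxima", "Demanda_uni_equil"),
         new = c("WeekNumber", "SalesDepotID", "SalesChannelID", "RouteID", "SalesUnitWeek",
                 "SalesThisWeek", "ReturnUnitNextWeek", "ReturnsNextWeek", "AdjustedDemand"))
# The ID-like columns show up as character in the summary, implying a cast
# along these lines:
idCols <- c("WeekNumber", "SalesDepotID", "SalesChannelID", "RouteID", "Producto_ID")
grupoTrainSpecial[, (idCols) := lapply(.SD, as.character), .SDcols = idCols]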

grupoTrainSpecial <- grupoTrainSpecial[grupoTrainSpecial$Producto_ID == 2233,] # keep only the chosen product
summary(grupoTrainSpecial)
##    Cliente_ID    WeekNumber        SalesDepotID       SalesChannelID    
##  1000747:   7   Length:6028        Length:6028        Length:6028       
##  100289 :   7   Class :character   Class :character   Class :character  
##  1022730:   7   Mode  :character   Mode  :character   Mode  :character  
##  1040510:   7                                                           
##  1047454:   7                                                           
##  1084758:   7                                                           
##  (Other):5986                                                           
##    RouteID          Producto_ID        SalesUnitWeek    SalesThisWeek    
##  Length:6028        Length:6028        Min.   : 0.000   Min.   :   0.00  
##  Class :character   Class :character   1st Qu.: 1.000   1st Qu.:  19.94  
##  Mode  :character   Mode  :character   Median : 2.000   Median :  39.88  
##                                        Mean   : 4.192   Mean   :  83.60  
##                                        3rd Qu.: 5.000   3rd Qu.:  99.70  
##                                        Max.   :93.000   Max.   :1854.42  
##                                                                          
##  ReturnUnitNextWeek ReturnsNextWeek   AdjustedDemand  
##  Min.   : 0.0000    Min.   :  0.000   Min.   : 0.000  
##  1st Qu.: 0.0000    1st Qu.:  0.000   1st Qu.: 1.000  
##  Median : 0.0000    Median :  0.000   Median : 2.000  
##  Mean   : 0.0574    Mean   :  1.145   Mean   : 4.146  
##  3rd Qu.: 0.0000    3rd Qu.:  0.000   3rd Qu.: 5.000  
##  Max.   :28.0000    Max.   :558.320   Max.   :93.000  
##                                                       
##                 NombreCliente 
##  LA PASADITA           :6028  
##  007                   :   0  
##  056 THE AIRPORT MARKET:   0  
##  06                    :   0  
##  0RLANDO               :   0  
##  1 2 3                 :   0  
##  (Other)               :   0

Plot some things, spot some hypotheses:

plotTable <- grupoTrainSpecial[sample(1:nrow(grupoTrainSpecial), 600),]
qplot(plotTable$SalesUnitWeek, plotTable$SalesThisWeek, alpha = I(0.5)) # I() sets a fixed transparency instead of mapping alpha as an aesthetic

qplot(grupoTrainSpecial$SalesUnitWeek, grupoTrainSpecial$SalesThisWeek)

cor(grupoTrainSpecial$SalesUnitWeek, grupoTrainSpecial$SalesThisWeek) # still almost perfectly linear over the full slice: pesos = units x unit price, and the unit price is nearly constant for a single product, so these two could be collapsed into one feature...
## [1] 0.9999949
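
Following that thought, a hedged feature sketch: the implied unit price, which should be almost constant for a single product (UnitPrice is a hypothetical column name).

# Implied pesos-per-unit price; guard against zero-unit weeks.
grupoTrainSpecial$UnitPrice <- ifelse(grupoTrainSpecial$SalesUnitWeek > 0,
                                      grupoTrainSpecial$SalesThisWeek / grupoTrainSpecial$SalesUnitWeek, NA)
summary(grupoTrainSpecial$UnitPrice)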
qplot(grupoTrainSpecial$ReturnsNextWeek, grupoTrainSpecial$ReturnUnitNextWeek)

cor(grupoTrainSpecial$ReturnsNextWeek, grupoTrainSpecial$ReturnUnitNextWeek) # same
## [1] 0.9999985
qplot(grupoTrainSpecial$SalesThisWeek, grupoTrainSpecial$AdjustedDemand) # ...so adjusted demand can be predicted almost directly from this week's sales? With returns mostly zero here, demand is just the sales units on most rows.
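
A quick numeric check of that hypothesis (a sketch; the summary above already put the return quartiles at zero):

# With returns at zero, adjusted demand equals the raw sales units, so both
# fractions below should be high for this client/product slice.
mean(grupoTrainSpecial$ReturnUnitNextWeek == 0)
mean(grupoTrainSpecial$AdjustedDemand == grupoTrainSpecial$SalesUnitWeek)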