In this competition, you will forecast the demand of a product for a given week, at a particular store. The dataset you are given consists of 9 weeks of sales transactions in Mexico. Every week, delivery trucks deliver products to the vendors. Each transaction consists of sales and returns, where returns are the products that went unsold or expired. The demand for a product in a given week is defined as that week's sales minus the following week's returns.
The train/test split is based on time, as is the public/private leaderboard split.
Things to note:
There may be products in the test set that don’t exist in the train set. This is the expected behavior of inventory data, since new products go on sale all the time. Your model should be able to accommodate this (a quick check appears after these notes).
There are duplicate Cliente_ID’s in cliente_tabla, which means one Cliente_ID may have multiple NombreCliente that are very similar. This is due to the NombreCliente being noisy and not standardized in the raw data, so it is up to you to decide how to clean up and use this information.
The adjusted demand (Demanda_uni_equil) is always >= 0, since demand should be either 0 or a positive value. The reason Venta_uni_hoy - Dev_uni_proxima sometimes comes out negative is that return records can carry over more than one week (also sketched below).
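Two quick sketches of the points above. These are my own illustrative checks, not code from the competition materials; grupoTrain is loaded later in this post, and grupoTest is a hypothetical table made by reading test.csv the same way.
# grupoTest is hypothetical: test.csv read the same way as train.csv
newProducts <- setdiff(unique(grupoTest$Producto_ID), unique(grupoTrain$Producto_ID))
length(newProducts) # expected to be > 0, so the model must handle unseen products
# Adjusted demand is the sales/returns difference clipped at zero:
grupoTrain$demandCheck <- pmax(grupoTrain$Venta_uni_hoy - grupoTrain$Dev_uni_proxima, 0)
mean(grupoTrain$demandCheck == grupoTrain$Demanda_uni_equil) # 1 if the clipping relationship holds row by row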
File descriptions
town_state.csv — town and state (can be joined with train/test on Agencia_ID)
library(data.table); library(plyr); library(ggplot2)
## Warning: package 'plyr' was built under R version 3.2.5
Print the top 6 rows of every table provided:
fread("/Users/Zach/R/Kaggle/GrupoBimbo2016/files/cliente_tabla.csv", nrows = 6);
## Cliente_ID NombreCliente
## 1: 0 SIN NOMBRE
## 2: 1 OXXO XINANTECATL
## 3: 2 SIN NOMBRE
## 4: 3 EL MORENO
## 5: 4 SDN SER DE ALIM CUERPO SA CIA DE INT
## 6: 4 SDN SER DE ALIM CUERPO SA CIA DE INT
fread("/Users/Zach/R/Kaggle/GrupoBimbo2016/files/producto_tabla.csv", nrows = 6);
## Producto_ID NombreProducto
## 1: 0 NO IDENTIFICADO 0
## 2: 9 Capuccino Moka 750g NES 9
## 3: 41 Bimbollos Ext sAjonjoli 6p 480g BIM 41
## 4: 53 Burritos Sincro 170g CU LON 53
## 5: 72 Div Tira Mini Doradita 4p 45g TR 72
## 6: 73 Pan Multigrano Linaza 540g BIM 73
fread("/Users/Zach/R/Kaggle/GrupoBimbo2016/files/town_state.csv", nrows = 6);
## Agencia_ID Town State
## 1: 1110 2008 AG. LAGO FILT MÉXICO, D.F.
## 2: 1111 2002 AG. AZCAPOTZALCO MÉXICO, D.F.
## 3: 1112 2004 AG. CUAUTITLAN ESTADO DE MÉXICO
## 4: 1113 2008 AG. LAGO FILT MÉXICO, D.F.
## 5: 1114 2029 AG.IZTAPALAPA 2 MÉXICO, D.F.
## 6: 1116 2011 AG. SAN ANTONIO MÉXICO, D.F.
fread("/Users/Zach/R/Kaggle/GrupoBimbo2016/files/train.csv", nrows = 6);
## Semana Agencia_ID Canal_ID Ruta_SAK Cliente_ID Producto_ID
## 1: 3 1110 7 3301 15766 1212
## 2: 3 1110 7 3301 15766 1216
## 3: 3 1110 7 3301 15766 1238
## 4: 3 1110 7 3301 15766 1240
## 5: 3 1110 7 3301 15766 1242
## 6: 3 1110 7 3301 15766 1250
## Venta_uni_hoy Venta_hoy Dev_uni_proxima Dev_proxima Demanda_uni_equil
## 1: 3 25.14 0 0 3
## 2: 4 33.52 0 0 4
## 3: 4 39.32 0 0 4
## 4: 4 33.52 0 0 4
## 5: 3 22.92 0 0 3
## 6: 5 38.20 0 0 5
fread("/Users/Zach/R/Kaggle/GrupoBimbo2016/files/test.csv", nrows = 6);
## id Semana Agencia_ID Canal_ID Ruta_SAK Cliente_ID Producto_ID
## 1: 0 11 4037 1 2209 4639078 35305
## 2: 1 11 2237 1 1226 4705135 1238
## 3: 2 10 2045 1 2831 4549769 32940
## 4: 3 11 1227 1 4448 4717855 43066
## 5: 4 11 1219 1 1130 966351 1277
## 6: 5 11 1146 4 6601 1741414 972
fread("/Users/Zach/R/Kaggle/GrupoBimbo2016/files/sample_submission.csv", nrows = 6);
## id Demanda_uni_equil
## 1: 0 7
## 2: 1 7
## 3: 2 7
## 4: 3 7
## 5: 4 7
## 6: 5 7
We’ll definitely need to import the client and product tables, plus the full training data, so we can look at what there is to choose from.
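The read calls themselves weren’t echoed into this output; judging from the fread progress bar below and the factor-style summaries further down, they presumably looked something like this (same paths as in the snippets above):
grupoClients <- read.csv("/Users/Zach/R/Kaggle/GrupoBimbo2016/files/cliente_tabla.csv") # factors, matching the summary() output below
grupoProducts <- read.csv("/Users/Zach/R/Kaggle/GrupoBimbo2016/files/producto_tabla.csv")
grupoTrain <- fread("/Users/Zach/R/Kaggle/GrupoBimbo2016/files/train.csv") # ~74M rows, ~3 GB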
##
Read 74180464 rows and 11 (of 11) columns from 2.980 GB file in 00:01:10
View the data in these guys…
summary(grupoClients)
## Cliente_ID NombreCliente
## 100039 : 2 NO IDENTIFICADO:281670
## 1001020: 2 LUPITA : 4863
## 100153 : 2 MARY : 3016
## 1001916: 2 LA PASADITA : 2426
## 1002226: 2 LA VENTANITA : 2267
## 100241 : 2 LA GUADALUPANA : 1299
## (Other):935350 (Other) :639821
summary(grupoProducts)
## Producto_ID NombreProducto
## 0 : 1 100pct Whole Wheat 680g MTA ORO 43111 : 1
## 100 : 1 100pct Whole Wheat 680g ORO 9753 : 1
## 1031 : 1 12 Tarima MiniMilk Kitty Nav 110g GBI 37157 : 1
## 1039 : 1 12Granos Multigra TwinPack 1360g MTA ORO 43364: 1
## 106 : 1 12Granos Multigra TwinPack 1360g TAB ORO 48227: 1
## 1064 : 1 12Tarima Mini Milk Kitty SnVal 110g GBI 37373 : 1
## (Other):2586 (Other) :2586
So, the client data look a little odd (the competition description mentions this); we’ll need to tackle the client duplication issue at some point.
count(grupoClients$NombreCliente) -> clientCount
clientCount[order(clientCount$freq, decreasing = T),] -> clientCount
clientCount[1:20,] #top 20 clients
## x freq
## 230899 NO IDENTIFICADO 281670
## 181804 LUPITA 4863
## 206532 MARY 3016
## 166729 LA PASADITA 2426
## 168604 LA VENTANITA 2267
## 165117 LA GUADALUPANA 1299
## 269449 ROSY 1245
## 21664 ALEX 1242
## 123289 GABY 1238
## 164220 LA ESCONDIDA 1216
## 247249 PATY 1145
## 164278 LA ESPERANZA 1139
## 135178 HERNANDEZ 1129
## 163533 LA CHIQUITA 1117
## 76697 DANY 1082
## 123592 GARCIA 1062
## 160463 JUQUILITA 1022
## 206417 MARTINEZ 987
## 168163 LA TIENDITA 914
## 463 3 HERMANOS 913
grupoClients[grupoClients$NombreCliente == "LUPITA",][1:10,]
## Cliente_ID NombreCliente
## 8 6 LUPITA
## 130 2116 LUPITA
## 260 2472 LUPITA
## 283 2523 LUPITA
## 309 2588 LUPITA
## 320 2605 LUPITA
## 348 2683 LUPITA
## 457 3046 LUPITA
## 652 3556 LUPITA
## 659 3573 LUPITA
nrow(grupoClients[grupoClients$NombreCliente == "LUPITA",]) # number of lupita entries... is lupita a person or business?
## [1] 4863
count(grupoClients$Cliente_ID) -> clientIDCount
clientIDCount[order(clientIDCount$freq, decreasing = T),][1:10,] #top 10 duplicated IDs
## x freq
## 96 100039 2
## 277 1001020 2
## 432 100153 2
## 530 1001916 2
## 622 1002226 2
## 670 100241 2
## 798 100295 2
## 856 1003165 2
## 966 100361 2
## 990 100369 2
grupoClients[duplicated(grupoClients$Cliente_ID),] -> grupoClientsDupeIDs
nrow(grupoClientsDupeIDs)/nrow(grupoClients) # % of the client IDs being dupes
## [1] 0.005197988
nrow(clientCount)/nrow(grupoClients) # distinct names as a fraction of client rows (so roughly two-thirds of names are shared with another client)
## [1] 0.3326573
grupoClients[grupoClients$NombreCliente == "NO IDENTIFICADO",][1:10,] # making sure the no identification clients have unique ID numbers
## Cliente_ID NombreCliente
## 612407 2465096 NO IDENTIFICADO
## 612410 2465098 NO IDENTIFICADO
## 612412 2465102 NO IDENTIFICADO
## 612413 2465104 NO IDENTIFICADO
## 612414 2465105 NO IDENTIFICADO
## 612415 2465106 NO IDENTIFICADO
## 612417 2465109 NO IDENTIFICADO
## 612420 2465112 NO IDENTIFICADO
## 612424 2465116 NO IDENTIFICADO
## 612426 2465120 NO IDENTIFICADO
nrow(grupoClients[grupoClients$NombreCliente == "NO IDENTIFICADO",])/nrow(grupoClients) # fraction of clients that are unidentified
## [1] 0.3011347
So, we’ve learned: only ~0.5% of Cliente_ID values are duplicated, only ~33% of client names are distinct, and ~30% of all clients are “NO IDENTIFICADO”.
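Before leaning on these names, they could use some standardizing. A purely illustrative first pass, assuming stray and doubled whitespace account for some of the near-duplicates:
grupoClients$NombreCliente <- trimws(gsub("\\s+", " ", as.character(grupoClients$NombreCliente))) # collapse runs of whitespace and trim the ends
# fuzzier cleanup (typos, abbreviations) would take real string matching; left alone here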
I want to make a semi-informed decision about which client to pick, so let’s see what we can do about that.
Trying to find out: which is the biggest client? How are order counts distributed across clients? Which client would make a good candidate to examine?
topClients <- count(grupoTrain$Cliente_ID[sample(1:nrow(grupoTrain), 1000000, replace = F)]) # doing about a 1% sample and counting the biggest clients for that set.
topClients <- merge(x = topClients, y = grupoClients, by.x = "x", by.y = "Cliente_ID", all.x = T)
topClients <- aggregate(freq ~ NombreCliente, topClients, FUN = sum)
topClients <- topClients[order(topClients$freq, decreasing = T),]
topClients$percestimate <- round(topClients$freq/1000000,3)
quantile(topClients$freq)
## 0% 25% 50% 75% 100%
## 1 1 2 4 179254
summary(topClients$freq) #very different from last time
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 1.00 2.00 5.65 4.00 179300.00
topClients[1:20,] #top 20 clients, after smashing all of the duplicate names together
## NombreCliente freq percestimate
## 131022 NO IDENTIFICADO 179254 0.179
## 102109 LUPITA 6241 0.006
## 115901 MARY 3774 0.004
## 93376 LA PASADITA 2812 0.003
## 94640 LA VENTANITA 2390 0.002
## 92241 LA GUADALUPANA 1915 0.002
## 14698 ALEX 1761 0.002
## 91652 LA ESPERANZA 1729 0.002
## 145687 PUEBLA REMISION 1627 0.002
## 69396 GABY 1607 0.002
## 142425 PATY 1601 0.002
## 91161 LA CHIQUITA 1595 0.002
## 153841 ROSY 1513 0.002
## 43120 DANY 1473 0.001
## 93754 LA PROVIDENCIA 1419 0.001
## 75572 HERNANDEZ 1390 0.001
## 69598 GARCIA 1336 0.001
## 91613 LA ESCONDIDA 1316 0.001
## 115829 MARTINEZ 1274 0.001
## 91682 LA ESQUINA 1233 0.001
… Just curious about the “PUEBLA REMISION” entry. Apparently a “remisión” is a delivery note or consignment slip, so these rows are likely consignment sales recorded under Puebla rather than a named storefront.
grupoTrain <- merge(x = grupoTrain, y = grupoClients, by = "Cliente_ID", sort = F, all.x = T)
grupoTrainSpecial <- grupoTrain[grupoTrain$NombreCliente == "LA PASADITA",]
list("total training rows"=nrow(grupoTrain), "la pasadita rows"=nrow(grupoTrainSpecial))
## $`total training rows`
## [1] 74773833
##
## $`la pasadita rows`
## [1] 209102
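One thing to notice before moving on: the merged table has 74,773,833 rows, but train.csv contained 74,180,464, so the duplicate Cliente_IDs in cliente_tabla fanned out the join by roughly 600k rows. Deduplicating first would avoid that; a sketch, assuming keeping the first name per ID is acceptable:
grupoClientsUnique <- grupoClients[!duplicated(grupoClients$Cliente_ID), ] # one row per Cliente_ID
# merging against grupoClientsUnique keeps the original 74,180,464 training rows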
Move on to picking a product…
topProductsLaPas <- count(grupoTrainSpecial$Producto_ID)
topProductsLaPas <- merge(x = topProductsLaPas, y = grupoProducts, by.x = "x", by.y = "Producto_ID", all.x = T, sort = F)
sum(duplicated(topProductsLaPas$NombreProducto)) # 0 if there are no duplicated products
## [1] 0
topProductsLaPas <- topProductsLaPas[order(topProductsLaPas$freq, decreasing = T),]
topProductsLaPas[1:10,]
## x freq NombreProducto
## 26 1240 6651 Mantecadas Vainilla 4p 125g BIM 1240
## 28 1250 6194 Donas Azucar 4p 105g BIM 1250
## 30 1278 6191 Nito 1p 62g BIM 1278
## 27 1242 6186 Donitas Espolvoreadas 6p 105g BIM 1242
## 43 2233 6028 Pan Blanco 640g BIM 2233
## 31 1284 5543 Rebanada 2p 55g BIM 1284
## 23 1232 4898 Panque Nuez 255g BIM 1232
## 266 35651 4435 Madalenas 3p 93g BIM 35651
## 2 1109 4194 Pan Blanco Chico 360g BIM 1109
## 406 41938 4124 Mantecadas Nuez 123g BIM 41938
summary(topProductsLaPas$freq) # product distribution with this store.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 5.0 29.0 345.1 218.0 6651.0
Let’s go with the “Pan Blanco 640g BIM 2233”.
grupoTrainSpecial <- grupoTrainSpecial[grupoTrainSpecial$Producto_ID == 2233, ]
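The summary below shows English column names (WeekNumber, SalesDepotID, and so on) even though no rename was echoed above, so that step evidently happened off-screen. A plausible reconstruction, assuming grupoTrainSpecial is a data.table:
setnames(grupoTrainSpecial,
         old = c("Semana", "Agencia_ID", "Canal_ID", "Ruta_SAK", "Venta_uni_hoy",
                 "Venta_hoy", "Dev_uni_proxima", "Dev_proxima", "Demanda_uni_equil"),
         new = c("WeekNumber", "SalesDepotID", "SalesChannelID", "RouteID", "SalesUnitWeek",
                 "SalesThisWeek", "ReturnUnitNextWeek", "ReturnsNextWeek", "AdjustedDemand"))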
summary(grupoTrainSpecial)
## Cliente_ID WeekNumber SalesDepotID SalesChannelID
## 1000747: 7 Length:6028 Length:6028 Length:6028
## 100289 : 7 Class :character Class :character Class :character
## 1022730: 7 Mode :character Mode :character Mode :character
## 1040510: 7
## 1047454: 7
## 1084758: 7
## (Other):5986
## RouteID Producto_ID SalesUnitWeek SalesThisWeek
## Length:6028 Length:6028 Min. : 0.000 Min. : 0.00
## Class :character Class :character 1st Qu.: 1.000 1st Qu.: 19.94
## Mode :character Mode :character Median : 2.000 Median : 39.88
## Mean : 4.192 Mean : 83.60
## 3rd Qu.: 5.000 3rd Qu.: 99.70
## Max. :93.000 Max. :1854.42
##
## ReturnUnitNextWeek ReturnsNextWeek AdjustedDemand
## Min. : 0.0000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 1.000
## Median : 0.0000 Median : 0.000 Median : 2.000
## Mean : 0.0574 Mean : 1.145 Mean : 4.146
## 3rd Qu.: 0.0000 3rd Qu.: 0.000 3rd Qu.: 5.000
## Max. :28.0000 Max. :558.320 Max. :93.000
##
## NombreCliente
## LA PASADITA :6028
## 007 : 0
## 056 THE AIRPORT MARKET: 0
## 06 : 0
## 0RLANDO : 0
## 1 2 3 : 0
## (Other) : 0
Plot some things, spot some hypotheses:
plotTable <- grupoTrainSpecial[sample(1:nrow(grupoTrainSpecial), 600),]
qplot(plotTable$SalesUnitWeek, plotTable$SalesThisWeek, alpha = I(0.5)) # I() sets the transparency directly instead of mapping it as an aesthetic with a legend
qplot(grupoTrainSpecial$SalesUnitWeek, grupoTrainSpecial$SalesThisWeek)
cor(grupoTrainSpecial$SalesUnitWeek, grupoTrainSpecial$SalesThisWeek) # with the population there's still an almost linear relationship. These could be smashed together into a new feature...
## [1] 0.9999949
qplot(grupoTrainSpecial$ReturnsNextWeek, grupoTrainSpecial$ReturnUnitNextWeek)
cor(grupoTrainSpecial$ReturnsNextWeek, grupoTrainSpecial$ReturnUnitNextWeek) # same
## [1] 0.9999985
qplot(grupoTrainSpecial$SalesThisWeek, grupoTrainSpecial$AdjustedDemand) # ... so, the adjusted demand can easily be predicted by the sales this week?
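The plot suggests as much, but keep the units in mind: AdjustedDemand is in units while SalesThisWeek is in pesos. The apples-to-apples check, one extra line not in the output above, would be:
cor(grupoTrainSpecial$SalesUnitWeek, grupoTrainSpecial$AdjustedDemand) # units vs. units; returns are rare for this product, so expect a value very close to 1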