Project goals recap

In the previous task (https://docs.google.com/document/d/1es4-jWZ6eGhVgdas2JjcZQzaPpQSNCt64C17nDUiGAo/edit?usp=sharing) we did some research to understand Grupo Bimbo’s business and we set the following goals for this project:

  • Reduce waste in Bimbo’s operations: we thought that Bimbo had an opportunity for improvement in its operations due to the fact that clients order products every week based on what they think they are going to sell, and that the week after they can exchange expired products for new ones at no cost in case they ordered too much. We thought we could improve this situation by using weekly sales and refunds to predict optimal inventory, maybe using time series analysis. To do so, we should find the answers to the following questions:

    • How much are these refunds are costing Grupo Bimbo? How much could be saved by correcting it?

    • How better can we predict demand compared to the current system?

    • As seen in the previous task, we could tackle differently products with different expiry date: bread (expires in 7-10 days) to pastries (expire in 6-9 months). Product names are structured: Name + pieces + g + brand + number, can we use this information to categorize them? Will we need further classification to group behaviours?

    • Do different client categories have different behaviours?

    • Does any of them have bigger refund rates? Why?

  • Find growth opportunities for the trade marketing department. Bimbo could be interested in having a dashboard for its trade marketing department that shows them optimal assortment for their clients answering the following questions:

    • How many clients do we have? Who are they? How many products & brands do our clients order on average? Is there room for adding more? If so, which ones?

    • How many units for each product could each client be ordering within its category avoiding returns? Do we have distinct enough categories to calculate this properly?

Data available

train <- readRDS(file = "train.rds")
test <- read_csv("test.csv")
town_state <- read_csv("town_state.csv")
cliente_tabla <- read_csv("cliente_tabla.csv")
producto_tabla <- read_csv("producto_tabla.csv")

To reply all these questions we are given historical data of sales in Mexico in different csv files:

  • train: sales for 7 weeks. The dataset contains 11 variables and over 74 million rows (see description below; some variables have related tables)

  • test & submission: the same columns are given to us, this time without sales and returns for us to predict demand for the following 2 weeks (almost 7 million rows). In test & submission files there are new products and clients

In the train set we have 11 variables:

summary(train)
##      Semana       Agencia_ID       Canal_ID         Ruta_SAK   
##  Min.   :3.00   Min.   : 1110   Min.   : 1.000   Min.   :   1  
##  1st Qu.:4.00   1st Qu.: 1312   1st Qu.: 1.000   1st Qu.:1161  
##  Median :6.00   Median : 1613   Median : 1.000   Median :1286  
##  Mean   :5.95   Mean   : 2537   Mean   : 1.383   Mean   :2115  
##  3rd Qu.:8.00   3rd Qu.: 2036   3rd Qu.: 1.000   3rd Qu.:2802  
##  Max.   :9.00   Max.   :25759   Max.   :11.000   Max.   :9991  
##    Cliente_ID         Producto_ID    Venta_uni_hoy       Venta_hoy       
##  Min.   :2.600e+01   Min.   :   41   Min.   :   0.00   Min.   :     0.0  
##  1st Qu.:3.568e+05   1st Qu.: 1242   1st Qu.:   2.00   1st Qu.:    16.8  
##  Median :1.193e+06   Median :30549   Median :   3.00   Median :    30.0  
##  Mean   :1.802e+06   Mean   :20841   Mean   :   7.31   Mean   :    68.5  
##  3rd Qu.:2.371e+06   3rd Qu.:37426   3rd Qu.:   7.00   3rd Qu.:    56.1  
##  Max.   :2.015e+09   Max.   :49997   Max.   :7200.00   Max.   :647360.0  
##  Dev_uni_proxima    Dev_proxima        Demanda_uni_equil 
##  Min.   :0.0e+00   Min.   :     0.00   Min.   :   0.000  
##  1st Qu.:0.0e+00   1st Qu.:     0.00   1st Qu.:   2.000  
##  Median :0.0e+00   Median :     0.00   Median :   3.000  
##  Mean   :1.3e-01   Mean   :     1.24   Mean   :   7.225  
##  3rd Qu.:0.0e+00   3rd Qu.:     0.00   3rd Qu.:   6.000  
##  Max.   :2.5e+05   Max.   :130760.00   Max.   :5000.000

Semana

The data available is in categorical ordinal format in week number. We are given data of weeks 3 to 9 (mid-January to the end of February) in week number from Thursday to Wednesday, and we want to predict sales for weeks 10 and 11. Demand has evolved per week as follows:

train %>%
  select(Semana, Demanda_uni_equil) %>%
  group_by(Semana) %>%
  summarise(
    demanda = sum(Demanda_uni_equil)
  ) %>%
  ggplot(aes(x = Semana, y = demanda)) + geom_col()

Agencia ID

Warehouse where the products are delivered from (there are 552 different warehouses). Although it might not look useful for our first goal, this variable is the key in a secondary table listing the location of each agencia (and therefore each client), which could be useful for the dashboarding tool. For example, it lets us see that most agencias are located in Estado de México, México DF and Jalisco states, so most business must be done there:

town_state %>%
  select(State, Agencia_ID) %>%
  group_by(State) %>%
  summarise(
    n_agencia = n_distinct(Agencia_ID)
  ) %>%
  
  ggplot(aes(x = reorder(State, - n_agencia), y = n_agencia, fill = State)) + geom_col() + theme(legend.position = "none", axis.text.x = element_text(size = 7, angle = 45, hjust = 1))

Ruta SAK

We are also given Ruta information, identifying route through which a Client is replenished. As this information doesn’t seem to help meeting our goals, we will disregard it for now.

Canal & Client ID

In this dataset we are given client information and we are also already given a Client classification by Canal ID, so we might start replying some of our questions.

  • How many clients do we have?

In the Grupo Bimbo client database we have this amount of clients:

cliente_tabla %>%
  select(Cliente_ID) %>%
  summarise(
    n_clients = n_distinct(Cliente_ID)
  )
##   n_clients
## 1    930500

We have sales data available in the train file for this amount of clients:

train %>%
  select(Cliente_ID) %>%
  summarise(
    n_clients = n_distinct(Cliente_ID)
  )
##   n_clients
## 1    880604

However, if we put train & test files together we see that we’ll have to predict sales for 9.963 new clients:

library(plyr)
full_dataset <- rbind.fill(train, test)

full_dataset %>%
  select(Cliente_ID) %>%
  summarise(
    n_clients = n_distinct(Cliente_ID)
  )
##   n_clients
## 1    890267
detach("package:plyr", unload = TRUE)
  • Who are our clients? To find out we might want to see who are our top clients by sales:
# sales_per_cliente <- train %>%
#   select(Cliente_ID, Canal_ID, Venta_hoy) %>%
#   arrange(Cliente_ID, Canal_ID) %>%
#   group_by(Cliente_ID, Canal_ID) %>%
#   summarise(
#     ventas = sum(Venta_hoy)
#   )
# 
# sales_per_cliente <- merge(cliente_tabla, sales_per_cliente)
# 
# sales_per_cliente <- sales_per_cliente %>%
#   arrange(desc(ventas))
# 
# sales_per_cliente$Canal_ID <- as.factor(sales_per_cliente$Canal_ID)

# saveRDS(sales_per_cliente, file = "sales_per_cliente.rds")
sales_per_cliente <- readRDS(file = "sales_per_cliente.rds")

ggplot(sales_per_cliente[1:20,], aes(x = reorder(NombreCliente, -ventas), y = ventas, fill = Canal_ID)) + geom_col() + theme(axis.text.x = element_text(size = 7, angle = 45, hjust = 1))

If we plot it we see that we have a huge outlier: Puebla Remisión has sales of 154M pesos and is very far away from client nº2 (is buying 22x). This is a very interesting piece of information we want to remember, but we also want to see the sales of the other top clients without this outlier:

ggplot(sales_per_cliente[2:21,], aes(x = reorder(NombreCliente, -ventas), y = ventas, fill = Canal_ID)) + geom_col() + theme(axis.text.x = element_text(size = 7, angle = 45, hjust = 1))

Here we spot other remisiones (canal 5), and big chains in very different industries such as restaurants in canal 8 (McDonalds), supermarkets in canals 2 and 4 (Oxxo, Walmart, Chedraui), and B2C companies in canal 8 (Unilever).

We already have a Client grouping by Canal (we have 9 different Canal ID identified with numbers from 1 to 11, with no further information). This brings us to reply to the questions:

  • Which Canal has more clients? Is it the one that has more sales?
# sales_per_canal <- sales_per_cliente %>%
#   select(Canal_ID, Cliente_ID, ventas) %>%
#   arrange(Canal_ID) %>%
#   group_by(Canal_ID) %>%
#   summarise(
#     n_clients = n_distinct(Cliente_ID),
#     ventas = sum(ventas)
#   )
# 
# saveRDS(sales_per_canal, file = "sales_per_canal.rds")
sales_per_canal <- readRDS(file = "sales_per_canal.rds")

ggplot(sales_per_canal, aes(x = Canal_ID, y = n_clients, fill = Canal_ID)) + geom_col()

ggplot(sales_per_canal, aes(x = Canal_ID, y = ventas, fill = Canal_ID)) + geom_col()

ggplot(sales_per_cliente, aes(x = Canal_ID, y = ventas, fill = Canal_ID)) + geom_boxplot()

ggplot(sales_per_cliente[2:892132,], aes(x = Canal_ID, y = ventas, fill = Canal_ID)) + geom_boxplot()

Product ID

Grupo Bimbo operates through a great number or different brands and products, always within the bakery context, but some quite different from each other in terms of needs they meet and expiry date. In the train file we are only given Product ID, but we are given a secondary table with more details. These details contain:

  • Name of the product

  • product size in grams

  • a 3-character label for brand

  • Product ID again

These labels will be very interesting if we manage to split it into new columns. But first let’s see which are Bimbo’s best sellers:

sales_per_product <- train %>%
  select(Producto_ID, Venta_hoy) %>%
  arrange(Producto_ID) %>%
  group_by(Producto_ID) %>%
  summarise(
    ventas = sum(Venta_hoy)
  )

sales_per_product <- merge(producto_tabla, sales_per_product)

sales_per_product <- sales_per_product %>%
  arrange(desc(ventas))

sales_per_product$Producto_ID <- as.factor(sales_per_product$Producto_ID)

ggplot(sales_per_product[1:20,], aes(x = reorder(NombreProducto, -ventas), y = ventas)) + geom_col() + theme(axis.text.x = element_text(size = 7, angle = 45, hjust = 1))

What does each canal order most?

# sales_per_product_and_canal <- train %>%
#   select(Producto_ID, Canal_ID, Venta_hoy) %>%
#   arrange(Producto_ID, Canal_ID) %>%
#   group_by(Producto_ID, Canal_ID) %>%
#   summarise(
#     ventas = sum(Venta_hoy)
#   )
# 
# sales_per_product_and_canal <- 
#   merge(producto_tabla, sales_per_product_and_canal)
# 
# sales_per_product_and_canal <- sales_per_product_and_canal %>%
#   arrange(desc(ventas))
# 
# sales_per_product_and_canal$Producto_ID <- as.factor(sales_per_product_and_canal$Producto_ID)
# 
# sales_per_product_and_canal$Canal_ID <- as.factor(sales_per_product_and_canal$Canal_ID)
# 
# product_canal_plot <-sales_per_product_and_canal %>%
#   group_by(Canal_ID) %>%
#   top_n(n = 5, wt = ventas)

# saveRDS(product_canal_plot, file = "product_canal_plot.rds")
product_canal_plot <- readRDS(file = "product_canal_plot.rds")
product_canal_plot
## # A tibble: 45 x 4
## # Groups:   Canal_ID [9]
##    Producto_ID NombreProducto                                Canal_ID     ventas
##    <fct>       <chr>                                         <fct>         <dbl>
##  1 2233        Pan Blanco 640g BIM 2233                      1        201558986.
##  2 1146        Pan Integral 675g BIM 1146                    1        124395648.
##  3 2425        Nito 1p 62g Central BIM 2425                  1        107038715.
##  4 1278        Nito 1p 62g BIM 1278                          1         88119472.
##  5 1240        Mantecadas Vainilla 4p 125g BIM 1240          1         85075760.
##  6 34786       Pan Integral 680g TNB BIM 34786               2         65853706.
##  7 34865       Pan Blanco 680g TNB BIM 34865                 2         49879290.
##  8 34210       Super Pan Blanco 740g TNB BIM 34210           2         42777066.
##  9 35145       Wonder 100pct con Ajonjoli 567g MTA WON 35145 2         41451485.
## 10 34206       Medias Noches 8p 340g TNB BIM 34206           2         34264219.
## # ... with 35 more rows
# ggplot(product_canal_plot, aes(x = reorder(NombreProducto, -ventas), y = ventas)) + geom_col() + theme(axis.text.x = element_text(size = 7, angle = 45, hjust = 1)) + facet_wrap(~ Canal_ID, scales = "free")
  • How many products & brands do our clients order on average? Is there room for adding more? If so, which ones?

  • How many units for each product could each client be ordering within its category avoiding returns? Do we have distinct enough categories to calculate this properly?

# summarise(train,
#   Product_count = n_distinct(Producto_ID),
# )
# summarise(producto_tabla,
#   Product_count = n_distinct(Producto_ID),
# )

Ventas, devoluciones & demanda

Every week, there are delivery trucks that deliver products to the vendors. Each transaction consists of sales and returns. Returns are the products that are unsold and expired. For this we are given 5 variables:

  • Venta_uni_hoy: nº of units sold per client per product per week
venta_media_canal <- train %>%
  select(Canal_ID, Venta_uni_hoy) %>%
  group_by(Canal_ID) %>%
  summarise(
    minimo = round(min(Venta_uni_hoy), digits = 0),
    media = round(mean(Venta_uni_hoy), digits = 0),
    mediana = round(median(Venta_uni_hoy), digits = 0),
    maximo = round(max(Venta_uni_hoy), digits = 0)
  )

venta_media_canal <- as.data.frame(t(venta_media_canal))
  • Venta_hoy: same in pesos. Over the 7-week period Grupo Bimbo made this amount of money in sales: see Tableau

  • Dev_uni_proxima: units that are left unsold at the end of the week. We have seen that it can be bigger than Venta_uni, therefore units for refund can be cumulated week after week

  • Dev_proxima: same in pesos

  • Demanda_uni_equil: the label variable, defined as the sales this week subtracted by the return next week