In the previous task (https://docs.google.com/document/d/1es4-jWZ6eGhVgdas2JjcZQzaPpQSNCt64C17nDUiGAo/edit?usp=sharing) we did some research to understand Grupo Bimbo’s business and we set the following goals for this project:
Reduce waste in Bimbo’s operations: we thought that Bimbo had an opportunity for improvement in its operations due to the fact that clients order products every week based on what they think they are going to sell, and that the week after they can exchange expired products for new ones at no cost in case they ordered too much. We thought we could improve this situation by using weekly sales and refunds to predict optimal inventory, maybe using time series analysis. To do so, we should find the answers to the following questions:
How much are these refunds are costing Grupo Bimbo? How much could be saved by correcting it?
How better can we predict demand compared to the current system?
As seen in the previous task, we could tackle differently products with different expiry date: bread (expires in 7-10 days) to pastries (expire in 6-9 months). Product names are structured: Name + pieces + g + brand + number, can we use this information to categorize them? Will we need further classification to group behaviours?
Do different client categories have different behaviours?
Does any of them have bigger refund rates? Why?
Find growth opportunities for the trade marketing department. Bimbo could be interested in having a dashboard for its trade marketing department that shows them optimal assortment for their clients answering the following questions:
How many clients do we have? Who are they? How many products & brands do our clients order on average? Is there room for adding more? If so, which ones?
How many units for each product could each client be ordering within its category avoiding returns? Do we have distinct enough categories to calculate this properly?
train <- readRDS(file = "train.rds")
test <- read_csv("test.csv")
town_state <- read_csv("town_state.csv")
cliente_tabla <- read_csv("cliente_tabla.csv")
producto_tabla <- read_csv("producto_tabla.csv")
To reply all these questions we are given historical data of sales in Mexico in different csv files:
train: sales for 7 weeks. The dataset contains 11 variables and over 74 million rows (see description below; some variables have related tables)
test & submission: the same columns are given to us, this time without sales and returns for us to predict demand for the following 2 weeks (almost 7 million rows). In test & submission files there are new products and clients
In the train set we have 11 variables:
summary(train)
## Semana Agencia_ID Canal_ID Ruta_SAK
## Min. :3.00 Min. : 1110 Min. : 1.000 Min. : 1
## 1st Qu.:4.00 1st Qu.: 1312 1st Qu.: 1.000 1st Qu.:1161
## Median :6.00 Median : 1613 Median : 1.000 Median :1286
## Mean :5.95 Mean : 2537 Mean : 1.383 Mean :2115
## 3rd Qu.:8.00 3rd Qu.: 2036 3rd Qu.: 1.000 3rd Qu.:2802
## Max. :9.00 Max. :25759 Max. :11.000 Max. :9991
## Cliente_ID Producto_ID Venta_uni_hoy Venta_hoy
## Min. :2.600e+01 Min. : 41 Min. : 0.00 Min. : 0.0
## 1st Qu.:3.568e+05 1st Qu.: 1242 1st Qu.: 2.00 1st Qu.: 16.8
## Median :1.193e+06 Median :30549 Median : 3.00 Median : 30.0
## Mean :1.802e+06 Mean :20841 Mean : 7.31 Mean : 68.5
## 3rd Qu.:2.371e+06 3rd Qu.:37426 3rd Qu.: 7.00 3rd Qu.: 56.1
## Max. :2.015e+09 Max. :49997 Max. :7200.00 Max. :647360.0
## Dev_uni_proxima Dev_proxima Demanda_uni_equil
## Min. :0.0e+00 Min. : 0.00 Min. : 0.000
## 1st Qu.:0.0e+00 1st Qu.: 0.00 1st Qu.: 2.000
## Median :0.0e+00 Median : 0.00 Median : 3.000
## Mean :1.3e-01 Mean : 1.24 Mean : 7.225
## 3rd Qu.:0.0e+00 3rd Qu.: 0.00 3rd Qu.: 6.000
## Max. :2.5e+05 Max. :130760.00 Max. :5000.000
The data available is in categorical ordinal format in week number. We are given data of weeks 3 to 9 (mid-January to the end of February) in week number from Thursday to Wednesday, and we want to predict sales for weeks 10 and 11. Demand has evolved per week as follows:
train %>%
select(Semana, Demanda_uni_equil) %>%
group_by(Semana) %>%
summarise(
demanda = sum(Demanda_uni_equil)
) %>%
ggplot(aes(x = Semana, y = demanda)) + geom_col()
Warehouse where the products are delivered from (there are 552 different warehouses). Although it might not look useful for our first goal, this variable is the key in a secondary table listing the location of each agencia (and therefore each client), which could be useful for the dashboarding tool. For example, it lets us see that most agencias are located in Estado de México, México DF and Jalisco states, so most business must be done there:
town_state %>%
select(State, Agencia_ID) %>%
group_by(State) %>%
summarise(
n_agencia = n_distinct(Agencia_ID)
) %>%
ggplot(aes(x = reorder(State, - n_agencia), y = n_agencia, fill = State)) + geom_col() + theme(legend.position = "none", axis.text.x = element_text(size = 7, angle = 45, hjust = 1))
We are also given Ruta information, identifying route through which a Client is replenished. As this information doesn’t seem to help meeting our goals, we will disregard it for now.
In this dataset we are given client information and we are also already given a Client classification by Canal ID, so we might start replying some of our questions.
In the Grupo Bimbo client database we have this amount of clients:
cliente_tabla %>%
select(Cliente_ID) %>%
summarise(
n_clients = n_distinct(Cliente_ID)
)
## n_clients
## 1 930500
We have sales data available in the train file for this amount of clients:
train %>%
select(Cliente_ID) %>%
summarise(
n_clients = n_distinct(Cliente_ID)
)
## n_clients
## 1 880604
However, if we put train & test files together we see that we’ll have to predict sales for 9.963 new clients:
library(plyr)
full_dataset <- rbind.fill(train, test)
full_dataset %>%
select(Cliente_ID) %>%
summarise(
n_clients = n_distinct(Cliente_ID)
)
## n_clients
## 1 890267
detach("package:plyr", unload = TRUE)
# sales_per_cliente <- train %>%
# select(Cliente_ID, Canal_ID, Venta_hoy) %>%
# arrange(Cliente_ID, Canal_ID) %>%
# group_by(Cliente_ID, Canal_ID) %>%
# summarise(
# ventas = sum(Venta_hoy)
# )
#
# sales_per_cliente <- merge(cliente_tabla, sales_per_cliente)
#
# sales_per_cliente <- sales_per_cliente %>%
# arrange(desc(ventas))
#
# sales_per_cliente$Canal_ID <- as.factor(sales_per_cliente$Canal_ID)
# saveRDS(sales_per_cliente, file = "sales_per_cliente.rds")
sales_per_cliente <- readRDS(file = "sales_per_cliente.rds")
ggplot(sales_per_cliente[1:20,], aes(x = reorder(NombreCliente, -ventas), y = ventas, fill = Canal_ID)) + geom_col() + theme(axis.text.x = element_text(size = 7, angle = 45, hjust = 1))
If we plot it we see that we have a huge outlier: Puebla Remisión has sales of 154M pesos and is very far away from client nº2 (is buying 22x). This is a very interesting piece of information we want to remember, but we also want to see the sales of the other top clients without this outlier:
ggplot(sales_per_cliente[2:21,], aes(x = reorder(NombreCliente, -ventas), y = ventas, fill = Canal_ID)) + geom_col() + theme(axis.text.x = element_text(size = 7, angle = 45, hjust = 1))
Here we spot other remisiones (canal 5), and big chains in very different industries such as restaurants in canal 8 (McDonalds), supermarkets in canals 2 and 4 (Oxxo, Walmart, Chedraui), and B2C companies in canal 8 (Unilever).
We already have a Client grouping by Canal (we have 9 different Canal ID identified with numbers from 1 to 11, with no further information). This brings us to reply to the questions:
# sales_per_canal <- sales_per_cliente %>%
# select(Canal_ID, Cliente_ID, ventas) %>%
# arrange(Canal_ID) %>%
# group_by(Canal_ID) %>%
# summarise(
# n_clients = n_distinct(Cliente_ID),
# ventas = sum(ventas)
# )
#
# saveRDS(sales_per_canal, file = "sales_per_canal.rds")
sales_per_canal <- readRDS(file = "sales_per_canal.rds")
ggplot(sales_per_canal, aes(x = Canal_ID, y = n_clients, fill = Canal_ID)) + geom_col()
ggplot(sales_per_canal, aes(x = Canal_ID, y = ventas, fill = Canal_ID)) + geom_col()
ggplot(sales_per_cliente, aes(x = Canal_ID, y = ventas, fill = Canal_ID)) + geom_boxplot()
ggplot(sales_per_cliente[2:892132,], aes(x = Canal_ID, y = ventas, fill = Canal_ID)) + geom_boxplot()
Grupo Bimbo operates through a great number or different brands and products, always within the bakery context, but some quite different from each other in terms of needs they meet and expiry date. In the train file we are only given Product ID, but we are given a secondary table with more details. These details contain:
Name of the product
product size in grams
a 3-character label for brand
Product ID again
These labels will be very interesting if we manage to split it into new columns. But first let’s see which are Bimbo’s best sellers:
sales_per_product <- train %>%
select(Producto_ID, Venta_hoy) %>%
arrange(Producto_ID) %>%
group_by(Producto_ID) %>%
summarise(
ventas = sum(Venta_hoy)
)
sales_per_product <- merge(producto_tabla, sales_per_product)
sales_per_product <- sales_per_product %>%
arrange(desc(ventas))
sales_per_product$Producto_ID <- as.factor(sales_per_product$Producto_ID)
ggplot(sales_per_product[1:20,], aes(x = reorder(NombreProducto, -ventas), y = ventas)) + geom_col() + theme(axis.text.x = element_text(size = 7, angle = 45, hjust = 1))
What does each canal order most?
# sales_per_product_and_canal <- train %>%
# select(Producto_ID, Canal_ID, Venta_hoy) %>%
# arrange(Producto_ID, Canal_ID) %>%
# group_by(Producto_ID, Canal_ID) %>%
# summarise(
# ventas = sum(Venta_hoy)
# )
#
# sales_per_product_and_canal <-
# merge(producto_tabla, sales_per_product_and_canal)
#
# sales_per_product_and_canal <- sales_per_product_and_canal %>%
# arrange(desc(ventas))
#
# sales_per_product_and_canal$Producto_ID <- as.factor(sales_per_product_and_canal$Producto_ID)
#
# sales_per_product_and_canal$Canal_ID <- as.factor(sales_per_product_and_canal$Canal_ID)
#
# product_canal_plot <-sales_per_product_and_canal %>%
# group_by(Canal_ID) %>%
# top_n(n = 5, wt = ventas)
# saveRDS(product_canal_plot, file = "product_canal_plot.rds")
product_canal_plot <- readRDS(file = "product_canal_plot.rds")
product_canal_plot
## # A tibble: 45 x 4
## # Groups: Canal_ID [9]
## Producto_ID NombreProducto Canal_ID ventas
## <fct> <chr> <fct> <dbl>
## 1 2233 Pan Blanco 640g BIM 2233 1 201558986.
## 2 1146 Pan Integral 675g BIM 1146 1 124395648.
## 3 2425 Nito 1p 62g Central BIM 2425 1 107038715.
## 4 1278 Nito 1p 62g BIM 1278 1 88119472.
## 5 1240 Mantecadas Vainilla 4p 125g BIM 1240 1 85075760.
## 6 34786 Pan Integral 680g TNB BIM 34786 2 65853706.
## 7 34865 Pan Blanco 680g TNB BIM 34865 2 49879290.
## 8 34210 Super Pan Blanco 740g TNB BIM 34210 2 42777066.
## 9 35145 Wonder 100pct con Ajonjoli 567g MTA WON 35145 2 41451485.
## 10 34206 Medias Noches 8p 340g TNB BIM 34206 2 34264219.
## # ... with 35 more rows
# ggplot(product_canal_plot, aes(x = reorder(NombreProducto, -ventas), y = ventas)) + geom_col() + theme(axis.text.x = element_text(size = 7, angle = 45, hjust = 1)) + facet_wrap(~ Canal_ID, scales = "free")
How many products & brands do our clients order on average? Is there room for adding more? If so, which ones?
How many units for each product could each client be ordering within its category avoiding returns? Do we have distinct enough categories to calculate this properly?
# summarise(train,
# Product_count = n_distinct(Producto_ID),
# )
# summarise(producto_tabla,
# Product_count = n_distinct(Producto_ID),
# )
Every week, there are delivery trucks that deliver products to the vendors. Each transaction consists of sales and returns. Returns are the products that are unsold and expired. For this we are given 5 variables:
venta_media_canal <- train %>%
select(Canal_ID, Venta_uni_hoy) %>%
group_by(Canal_ID) %>%
summarise(
minimo = round(min(Venta_uni_hoy), digits = 0),
media = round(mean(Venta_uni_hoy), digits = 0),
mediana = round(median(Venta_uni_hoy), digits = 0),
maximo = round(max(Venta_uni_hoy), digits = 0)
)
venta_media_canal <- as.data.frame(t(venta_media_canal))
Venta_hoy: same in pesos. Over the 7-week period Grupo Bimbo made this amount of money in sales: see Tableau
Dev_uni_proxima: units that are left unsold at the end of the week. We have seen that it can be bigger than Venta_uni, therefore units for refund can be cumulated week after week
Dev_proxima: same in pesos
Demanda_uni_equil: the label variable, defined as the sales this week subtracted by the return next week