Food demand forecast EDA + Modelling

1 Introduction

Welcome to an Exploratory Data Analysis for food demand forecasting!

This is a Kaggle machine learning project https://www.kaggle.com/datasets/kannanaikkal/food-demand-forecasting. It is a meal delivery company which operates in multiple cities. They have various fulfillment centers in these cities for dispatching meal orders to their customers. The client wants you to help these centers with demand forecasting for upcoming weeks so that these centers will plan the stock of raw materials accordingly.

The replenishment of majority of raw materials is done on weekly basis and since the raw material is perishable, the procurement planning is of utmost importance. Given the following information, the task is to predict the demand for the next 10 weeks (Weeks: 146-155) for the center-meal combinations in the test set.

Let’s get started!

2 Quicklook: Structure and content

2.1 Store

center_id	city_code	region_code	center_type	op_area
11	679	56	TYPE_A	3.7
13	590	56	TYPE_B	6.7
124	590	56	TYPE_C	4.0
66	648	34	TYPE_A	4.1

2.2 Meal

meal_id	category	cuisine
1885	Beverages	Thai
1993	Beverages	Thai
2539	Beverages	Thai
1248	Beverages	Indian

2.3 Test

id	week	center_id	meal_id	checkout_price	base_price
1028232	146	55	1885	158.11	159.11
1127204	146	55	1993	160.11	159.11
1212707	146	55	2539	157.14	159.14
1082698	146	55	2631	162.02	162.02

2.4 Train

id	week	center_id	meal_id	checkout_price	base_price	num_orders
1379560	1	55	1885	136.83	152.29	177
1466964	1	55	1993	136.83	135.83	270
1346989	1	55	2539	134.86	135.86	189
1338232	1	55	2139	339.50	437.53	54

Findings:
1. store and meal data can be connected to train by center_id and meal_id respectively.
2. The id column in train seems to be unique.

2.5 Check NA/missing values

## `summarise()` has grouped output by 'center_id'. You can override using the
## `.groups` argument.

## # A tibble: 3,597 × 3
## # Groups:   center_id [77]
##    center_id meal_id count
##        <int>   <int> <int>
##  1       101    1571     1
##  2       145    1571     1
##  3       145    2104     1
##  4        24    1248     3
##  5        41    1248     3
##  6        92    2577     3
##  7        97    2956     3
##  8       139    2577     3
##  9        24    2492     4
## 10        39    2956     4
## # … with 3,587 more rows

##                    id                  week             center_id 
##                     0                     0                     0 
##               meal_id        checkout_price            base_price 
##                     0                     0                     0 
## emailer_for_promotion     homepage_featured            num_orders 
##                     0                     0                     0

Findings: 1. There are a lot of missing values.
2. There is no NAs in the data set.

3 Group analysis

3.1 Sales over time

Findings:
1. The majority of the time series has the demand lower than 2500/week.
2. Some series e.g. cm_id 432290 starts with a high peak, which is worth investigating.
3. The overall trend is not clear.

3.2 All aggregated sales

Findings:
1. Overall, the demand is stable with a decreasing from week 109 which suggests the business is in its mature period.
2. The demand is generally stable with two high spikes(week5 and week48) and a dip (week62).
3. I have added the grey dash lines every 52weeks. There might be some seasonality although not clear.

3.3 Sales per region/city

Findings:
1. Regions can be separated in 4 groups:
G1-R56 has a clear higher demand than the rest regions with a weekly avg orders at around 350k; G2-R34/77 ~180k, G3-85 ~60k and rest in G4.
2. Despite of the demand differences, the patterns are similar across the groups, e.g. a surge in week 5, a dip in week 62. This suggests that the demand surge/dip apply to all regions and is not caused by a specific region.
3. The area difference does not seem to correlate to the demand directly.
4. Regions in G4 group has only one distribution center each.
5. Sales per city shows the similar pattern as the sales per region.

3.4 Sales per category/cuisine

Findings:
1. Region 93 has two spikes of Thai food in week 46/98(+52), showing a clear seasonality. it is the only region that the Thai food demand surpluses Indian/Italian food, suggesting there might be more Thai people dwelling in.
2. Overall we can see the demand of Italian>Indian>Thai=continental.

Findings:
1. Rice Beverages/bowl/sandwich/Pizza/Salad are dominating the sales in all regions.
2. Seafood/starters/snacks are of 2nd importance.
3. Some regions dont offer extras/soup. Some region has just started the business in the extra category.
4. It is worth noting that the demands of beverages are quite different across the regions.

4 Explanatory variables

4.1 Variable coorelations

Findings:
1. We see homepage_featured and emalier_for_promotion have positive correlations with the target num_orders
while is, however, negatively correlated with base_price and check_out_price.
2. base_price and check_out_price are highly correlated. homepage_featured and emalier_for_promotion are also
correlated, suggesting both promotions could happen together.

4.2 homepage promotion and email promotion

To show how promotion affects sales, I will take 4 examples. They are from different cuisines and are sold in different regions.

Findings:
1. For Italian and Indian cuisine, the homepage promotions bring a peak in the very week, with demand dropping in the following week.
2. For continental cuisine, we see the promotion last the first 120 weeks, bringing some ups and downs in demand, suggesting the company is promoting the sales of this category. we also see the demand dropped after the promotion.
3. For Thai food, promotions generally increased the demand, which, however, dropped immediately afterwards.

e_pro <- ts %>% 
  filter(!emailer_for_promotion==0) %>% 
  as.tibble() %>% 
  mutate(id=str_c(cm_id, region_code, sep="_")) %>% 
  group_by(id) %>% 
  summarise(sum_orders=sum(num_orders)) %>% 
  select(-2)
foo <- ts %>% 
  mutate(id=str_c(cm_id, region_code, sep="_"))
a <- foo %>% 
    filter(id %in% e_pro$id, region_code==56, cuisine=="Italian") %>% 
    head(1) %>% 
    pull(id)
b <- foo %>% 
    filter(id %in% e_pro$id, region_code==34, cuisine=="Continental") %>% 
    head(1) %>% 
    pull(id)
c <- foo %>% 
    filter(id %in% e_pro$id, region_code==85, cuisine=="Indian") %>% 
    head(1) %>% 
    pull(id)
d <- foo %>% 
    filter(id %in% e_pro$id, region_code==93, cuisine=="Thai") %>% 
    head(1) %>% 
    pull(id)
sample <- tibble(sample=c(a,b,c,d))
p1 <- foo %>% 
  filter(id %in% sample$sample, !is.na(cuisine)) %>% 
  mutate(has_epro=if_else(emailer_for_promotion==1, num_orders, NA_real_)) %>% 
  ggplot(aes(week, num_orders, group=id))+
  geom_line(aes(col=cuisine), na.rm=TRUE)+
  geom_point(aes(week, has_epro), na.rm = TRUE)+
  facet_wrap(~id, ncol=1, scales = "free_y")+
  labs(title="Sales for 3 random meals with email promotion", 
       col="Cuisine")+
  theme(axis.title.x =element_text(size =14), axis.title.y =element_text(size =14),
        plot.title = element_text(size=18))+
  theme_tufte()
ggplotly(p1, dynamicTicks = TRUE)

Findings:
1. For Indian/Italian cuisines, we see similar patterns as the homepage promotion, demand reaching the top in the week and reducing in following weeks.
2. For Continental cuisine, we see a lagged 3 to lagged 4 effect.
3. For Thai food, the demand is generally increased due to promotions but it drops immediately afterwards.

4.3 Price effect

## Picking joint bandwidth of 0.014

## Picking joint bandwidth of 0.0193

## Picking joint bandwidth of 0.0147

## Picking joint bandwidth of 0.0168

Findings:
1. The price is clearly increasing with time for all cuisines(the wave pattern moving towards the right side).
2. The continental and Indian food are priced higher than Italian and Thai food.
3. Continental and Thai food share the similar change pattern that the last peak becomes bimodal overtime. For continental food, the peak at 700 is gaining importance until it reaches the level of the peak at 600. Similarly, for Thai food, 400 group is becoming as important as the 300 group.

4.4 Conclusion

There are 8 regions, 51 citys and 77 fulfillment centers.
Some region has only one center(23, 35, 71, 93), the rest have multiple.
There are a lot of missing values.
The demand pattern is similar in regards to Italian/Indian cuisine but quite different from Thai.
The aggregated demand is stable with a bit decreasing trend.
The homepage/email promotions affect different materials in a different way. The promotion frequencies are also different. But it generally
increased the demand.
Price correlate to the demand in a negitave way.
Region 35/85/93 doesn’t offer soup/extras. Region 71 doesn’t offer soup but just started the extra business.
Most regions prefers Italian>Indian>Continental>Thai while region 93 shows some unique preference towards Thai whose demand is highest in 4
regions.
There are some unusual surges and drops in demands.
Beverage/rice bowls/sandwiches are the top 3 categories.

5 Modeling

5.1 Model the aggregated data

We first have a look at the aggregated data to find the model with the lower error.

.model	MASE	RMSSE
stl	0.8428738	0.8270821
arima	0.7661296	0.7607650
ets	0.8428738	0.8270821

Findings:
1. ARIMA model catches the pattern better with the lower MASE/RMSSE. 2. The models can’t catch some of the peaks and dips, suggesting more predictable need to be considered.

5.2 Model the individual ts

Based on what we have found, the below models are selected:
1. Dynamic regression model with promotions and price as predictables for any full time series.
2. Croston model for intermittent demand.
3. We do modelling and forecast to individual ts and reconcile from bottom up.

5.3 Forecast at different aggregations

Limitations:
1. The outliers need more information to forecast. 2. When the forecast comes to the category level, it doesn’t catch all the patterns. One of the reasons is the intermittent demand materials.
Further studies are required to improve the model performance.