1 Introduction

Welcome to an Exploratory Data Analysis for food demand forecasting!

This is a Kaggle machine learning project https://www.kaggle.com/datasets/kannanaikkal/food-demand-forecasting. It is a meal delivery company which operates in multiple cities. They have various fulfillment centers in these cities for dispatching meal orders to their customers. The client wants you to help these centers with demand forecasting for upcoming weeks so that these centers will plan the stock of raw materials accordingly.

The replenishment of majority of raw materials is done on weekly basis and since the raw material is perishable, the procurement planning is of utmost importance. Given the following information, the task is to predict the demand for the next 10 weeks (Weeks: 146-155) for the center-meal combinations in the test set.

Let’s get started!

2 Quicklook: Structure and content

2.1 Store

center_id city_code region_code center_type op_area
11 679 56 TYPE_A 3.7
13 590 56 TYPE_B 6.7
124 590 56 TYPE_C 4.0
66 648 34 TYPE_A 4.1

2.2 Meal

meal_id category cuisine
1885 Beverages Thai
1993 Beverages Thai
2539 Beverages Thai
1248 Beverages Indian

2.3 Test

id week center_id meal_id checkout_price base_price emailer_for_promotion homepage_featured
1028232 146 55 1885 158.11 159.11 0 0
1127204 146 55 1993 160.11 159.11 0 0
1212707 146 55 2539 157.14 159.14 0 0
1082698 146 55 2631 162.02 162.02 0 0

2.4 Train

id week center_id meal_id checkout_price base_price emailer_for_promotion homepage_featured num_orders
1379560 1 55 1885 136.83 152.29 0 0 177
1466964 1 55 1993 136.83 135.83 0 0 270
1346989 1 55 2539 134.86 135.86 0 0 189
1338232 1 55 2139 339.50 437.53 0 0 54

Findings:
1. store and meal data can be connected to train by center_id and meal_id respectively.
2. The id column in train seems to be unique.

2.5 Check NA/missing values

## `summarise()` has grouped output by 'center_id'. You can override using the
## `.groups` argument.
## # A tibble: 3,597 Ă— 3
## # Groups:   center_id [77]
##    center_id meal_id count
##        <int>   <int> <int>
##  1       101    1571     1
##  2       145    1571     1
##  3       145    2104     1
##  4        24    1248     3
##  5        41    1248     3
##  6        92    2577     3
##  7        97    2956     3
##  8       139    2577     3
##  9        24    2492     4
## 10        39    2956     4
## # … with 3,587 more rows
##                    id                  week             center_id 
##                     0                     0                     0 
##               meal_id        checkout_price            base_price 
##                     0                     0                     0 
## emailer_for_promotion     homepage_featured            num_orders 
##                     0                     0                     0

Findings: 1. There are a lot of missing values.
2. There is no NAs in the data set.

3 Group analysis

3.1 Sales over time

Findings:
1. The majority of the time series has the demand lower than 2500/week.
2. Some series e.g. cm_id 432290 starts with a high peak, which is worth investigating.
3. The overall trend is not clear.

3.2 All aggregated sales

Findings:
1. Overall, the demand is stable with a decreasing from week 109 which suggests the business is in its mature period.
2. The demand is generally stable with two high spikes(week5 and week48) and a dip (week62).
3. I have added the grey dash lines every 52weeks. There might be some seasonality although not clear.

3.3 Sales per region/city

Findings:
1. Regions can be separated in 4 groups:
G1-R56 has a clear higher demand than the rest regions with a weekly avg orders at around 350k; G2-R34/77 ~180k, G3-85 ~60k and rest in G4.
2. Despite of the demand differences, the patterns are similar across the groups, e.g. a surge in week 5, a dip in week 62. This suggests that the demand surge/dip apply to all regions and is not caused by a specific region.
3. The area difference does not seem to correlate to the demand directly.
4. Regions in G4 group has only one distribution center each.
5. Sales per city shows the similar pattern as the sales per region.

3.4 Sales per category/cuisine

Findings:
1. Region 93 has two spikes of Thai food in week 46/98(+52), showing a clear seasonality. it is the only region that the Thai food demand surpluses Indian/Italian food, suggesting there might be more Thai people dwelling in.
2. Overall we can see the demand of Italian>Indian>Thai=continental.

Findings:
1. Rice Beverages/bowl/sandwich/Pizza/Salad are dominating the sales in all regions.
2. Seafood/starters/snacks are of 2nd importance.
3. Some regions dont offer extras/soup. Some region has just started the business in the extra category.
4. It is worth noting that the demands of beverages are quite different across the regions.

4 Explanatory variables

4.1 Variable coorelations

Findings:
1. We see homepage_featured and emalier_for_promotion have positive correlations with the target num_orders
while is, however, negatively correlated with base_price and check_out_price.
2. base_price and check_out_price are highly correlated. homepage_featured and emalier_for_promotion are also
correlated, suggesting both promotions could happen together.

4.2 homepage promotion and email promotion

To show how promotion affects sales, I will take 4 examples. They are from different cuisines and are sold in different regions.

Findings:
1. For Italian and Indian cuisine, the homepage promotions bring a peak in the very week, with demand dropping in the following week.
2. For continental cuisine, we see the promotion last the first 120 weeks, bringing some ups and downs in demand, suggesting the company is promoting the sales of this category. we also see the demand dropped after the promotion.
3. For Thai food, promotions generally increased the demand, which, however, dropped immediately afterwards.

e_pro <- ts %>% 
  filter(!emailer_for_promotion==0) %>% 
  as.tibble() %>% 
  mutate(id=str_c(cm_id, region_code, sep="_")) %>% 
  group_by(id) %>% 
  summarise(sum_orders=sum(num_orders)) %>% 
  select(-2)
foo <- ts %>% 
  mutate(id=str_c(cm_id, region_code, sep="_"))
a <- foo %>% 
    filter(id %in% e_pro$id, region_code==56, cuisine=="Italian") %>% 
    head(1) %>% 
    pull(id)
b <- foo %>% 
    filter(id %in% e_pro$id, region_code==34, cuisine=="Continental") %>% 
    head(1) %>% 
    pull(id)
c <- foo %>% 
    filter(id %in% e_pro$id, region_code==85, cuisine=="Indian") %>% 
    head(1) %>% 
    pull(id)
d <- foo %>% 
    filter(id %in% e_pro$id, region_code==93, cuisine=="Thai") %>% 
    head(1) %>% 
    pull(id)
sample <- tibble(sample=c(a,b,c,d))
p1 <- foo %>% 
  filter(id %in% sample$sample, !is.na(cuisine)) %>% 
  mutate(has_epro=if_else(emailer_for_promotion==1, num_orders, NA_real_)) %>% 
  ggplot(aes(week, num_orders, group=id))+
  geom_line(aes(col=cuisine), na.rm=TRUE)+
  geom_point(aes(week, has_epro), na.rm = TRUE)+
  facet_wrap(~id, ncol=1, scales = "free_y")+
  labs(title="Sales for 3 random meals with email promotion", 
       col="Cuisine")+
  theme(axis.title.x =element_text(size =14), axis.title.y =element_text(size =14),
        plot.title = element_text(size=18))+
  theme_tufte()
ggplotly(p1, dynamicTicks = TRUE)

Findings:
1. For Indian/Italian cuisines, we see similar patterns as the homepage promotion, demand reaching the top in the week and reducing in following weeks.
2. For Continental cuisine, we see a lagged 3 to lagged 4 effect.
3. For Thai food, the demand is generally increased due to promotions but it drops immediately afterwards.

4.3 Price effect

## Picking joint bandwidth of 0.014
## Picking joint bandwidth of 0.0193
## Picking joint bandwidth of 0.0147
## Picking joint bandwidth of 0.0168

Findings:
1. The price is clearly increasing with time for all cuisines(the wave pattern moving towards the right side).
2. The continental and Indian food are priced higher than Italian and Thai food.
3. Continental and Thai food share the similar change pattern that the last peak becomes bimodal overtime. For continental food, the peak at 700 is gaining importance until it reaches the level of the peak at 600. Similarly, for Thai food, 400 group is becoming as important as the 300 group.

4.4 Conclusion

  1. There are 8 regions, 51 citys and 77 fulfillment centers.
  2. Some region has only one center(23, 35, 71, 93), the rest have multiple.
  3. There are a lot of missing values.
  4. The demand pattern is similar in regards to Italian/Indian cuisine but quite different from Thai.
  5. The aggregated demand is stable with a bit decreasing trend.
  6. The homepage/email promotions affect different materials in a different way. The promotion frequencies are also different. But it generally
    increased the demand.
  7. Price correlate to the demand in a negitave way.
  8. Region 35/85/93 doesn’t offer soup/extras. Region 71 doesn’t offer soup but just started the extra business.
  9. Most regions prefers Italian>Indian>Continental>Thai while region 93 shows some unique preference towards Thai whose demand is highest in 4
    regions.
  10. There are some unusual surges and drops in demands.
  11. Beverage/rice bowls/sandwiches are the top 3 categories.

5 Modeling

5.1 Model the aggregated data

We first have a look at the aggregated data to find the model with the lower error.
.model MASE RMSSE
stl 0.8428738 0.8270821
arima 0.7661296 0.7607650
ets 0.8428738 0.8270821

Findings:
1. ARIMA model catches the pattern better with the lower MASE/RMSSE. 2. The models can’t catch some of the peaks and dips, suggesting more predictable need to be considered.

5.2 Model the individual ts

Based on what we have found, the below models are selected:
1. Dynamic regression model with promotions and price as predictables for any full time series.
2. Croston model for intermittent demand.
3. We do modelling and forecast to individual ts and reconcile from bottom up.

5.3 Forecast at different aggregations

Limitations:
1. The outliers need more information to forecast. 2. When the forecast comes to the category level, it doesn’t catch all the patterns. One of the reasons is the intermittent demand materials.
Further studies are required to improve the model performance.