Demand Data Exploration and Forecasting

We have set of data from 2015-2017 regarding location, offering, and customer level entries as well as quantity ordered. We will start with a data exploration to surface insights from this data and then move on to order forecasting.

Key Takeaways

  1. Category 10 should be monitored closely as the order volume is significantly higher than other categories. This is true for both “R” and “C” subtypes.

  2. Current data models suggest Category 10 is stabilizing around a mean weekly order of ~4000. The median order across all other categories is ~1200.

  3. “R” subtype orders appear to be stabilizing to more normal (not skewed) distributions rather than decreasing in volume.

  4. ARIMA forecasts provide “R” subtype predictions with approximatly one-fifth the error (RMSE) as a naive median/IQR model.

Futher Work

  1. Create custom forecasts for each category and subtype. Outliers seem likely to have leverage over the mean but are best addressed relative to specific category.

  2. Explore the variance in subtype “C” across the categories. The orders across categories is are not similar.

  3. Adjust the category 10 subtype “C” forecast to flatten with time.

Exploratory Data Analysis

First, we will get a simple sense of the orders per week broken down by type and subtype.

data %>%
  ggplot(aes(x=WEEK_END_DATE, y=QTY, color=TYPE)) +
  geom_point(alpha=0.5) +
  facet_wrap(~c(SUBTYPE)) +
  labs(title = "Weekly Orders by Type and Subtype", 
       x= "Time", y= "Quantity")

We see above that for subtype “C” the demand for type “E” increases sharply in late 2016. This is also true for subtype “R”, though the effect for “R” is much higher in a absolute sense. There is a unique demand for type “E” products, but we have yet to isolate this from the rest of the data.

Outlying Demand

We will look into the outlying data points and try to separate them from the dataset. Type “E” orders seem to be contained to category 10 and 11.

data %>%
  filter(between(CAT, 8, 13)) %>%
  ggplot(aes(x=WEEK_END_DATE, y=QTY, color=TYPE)) +
  geom_point(alpha=0.5) +
  facet_wrap(~c(CAT)) +
  labs(title = "Weekly Orders by Category and Type", 
       x= "Time", y= "Quantity")

This plot is truncated to categories 7-12 for brevity. We see here that categories 10 and 11 are responsible for the dramatic increases in the demand. We see that this data is further separable into subtypes “C” and “R” (below).

data %>%
  filter(CAT == 10) %>%
  ggplot(aes(x=WEEK_END_DATE, y=QTY, color=TYPE)) +
  geom_point() +
  facet_wrap(~c(SUBTYPE)) +
  labs(title = "Weekly Orders in Category 10 by Type and Subtype", 
       x= "Time", y= "Quantity") +
  theme(legend.position = "none")

Category 10 seems to be experiencing rapid growth for from both “C” and “R” subtypes, with “R” subtypes really driving demand. Further work will need to be done to determine if category, type, subtype (10,E,R) customers are still growing or if demand has stabilized relative to the seasonal effect apparent in the larger data. We will handle this in the 2018 projection section.

If category has to do with the location, then is might be reasonable to infer that category 10 and 11 are new markets with varying degrees of customer uptake across product and customer types. Category 11 seems in need of intervention, whereas category 10, if it holds on current demand, would be the category with peak demand.

Standard Demand

Now that we have a handle on the outlying data from categories 10 and 11, we will focus on the “standard” demand for the remaining categories.

data %>%
  filter(!between(CAT, 10, 11), CAT <= 14, SUBTYPE == "R") %>%
  ggplot(aes(x=WEEK_END_DATE, y=QTY)) +
  geom_point(alpha=0.5, color="#F8766D") +
  facet_wrap(~CAT) +
  labs(title = "Weekly Orders by Category for Subtype \"R\" ", 
       x= "Time",
       y= "Quantity")

data %>%
  filter(!between(CAT, 10, 11), between(CAT, 6, 19), SUBTYPE == "C") %>%
  mutate(WEEK_END_DATE = as.Date(WEEK_END_DATE)) %>%
  ggplot(aes(x=WEEK_END_DATE, y=QTY)) +
  geom_point(alpha=0.5,color="#00BFC4") +
  facet_wrap(~CAT) +
  labs(title = "Weekly Orders by Category for  Subtype \"C\"", 
       x= "Time",
       y= "Quantity") +
  scale_x_date(date_breaks = "1 year", date_labels = "%Y")

Again, the total number of the categories is truncation for brevity. We see that there is a serious difference in orders based on subtype. “R” subtype orders are much more frequent and appear to have a seasonality. “C” subtype orders run the gambit from infrequent to seasonal, to seemingly constant. However, the relatively small size of subtype “C” orders may not greatly influence the total demand.

“R” subtype orders generally have a clear trending component as well as a seasonality. As stated above, this subtype of orders commands the largest portion ordering so it will be important to check for trend and seasonality so as to get the best forecast for 2018.

Subtype Size

Not to belabor this point, but below we see the relative size of subtype orders.

means<- 
  data %>%
  filter(!between(CAT, 10, 11)) %>%
  group_by(SUBTYPE) %>%
  summarise(subtype_mean_qty = mean(QTY))

data %>%
  filter(!between(CAT, 10, 11)) %>%
  group_by(CAT, SUBTYPE) %>%
  summarise(subtype_mean_qty = mean(QTY)) %>%
  ggplot(aes(x=subtype_mean_qty, fill=SUBTYPE)) +
  geom_histogram(bins = 40) +
  facet_wrap(~SUBTYPE, scales = "free") +
  labs(title = "Histogram of Mean Subtype Orders by Subtype",
       x = "Mean Order", 
       y= "Count") +
  theme(legend.position = "none") 

Here we see the relatively low demand based on the subtype. Mean “C” subtype orders are 1.25% of the total mean order share.

means %>%rename("Subtype Mean QTY" = subtype_mean_qty)
SUBTYPE Subtype Mean QTY
C 17.36521
R 1327.65847

Order Aggregation

We have reason to believe that trend and seasonality are at play for the orders of subtype “R”. Subtype “C” may have these components as well, but given it’s small order size we are de-prioritizing that subtype for now. Additionally, we will handle the seasonality and trend of category 10 separately and exclude categories 11, 30, and 31 from this analysis due to their very small size.

If we had more time we could individually assess the trend and seasonality of each category, but for the sake of time we will aggregate the categories. We do see that for most categories, the trend and seasonality (we think!) appear to be consistent. We should compute some metrics about the weekly distribution for insight into how best aggregate the orders. Below is a sample of the weekly demand for a several categories as well as a sample of mean, standard deviation, median, IQR, and skewness.

data %>%
  filter(between(CAT, 12,15), 
         SUBTYPE == "R") %>%
  mutate(WEEK_END_DATE = as.Date(WEEK_END_DATE)) %>%
  ggplot(aes(x=WEEK_END_DATE, y=QTY)) +
  geom_point(alpha=0.5, color="#F8766D") +
  facet_wrap(~CAT) +
  labs(title = "Weekly Orders by Category for Subtype \"R\" ",
       x= "Time", y= "Quantity") +
  scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
  theme(axis.text.x = element_text(angle = 45, hjust=1))

data %>%
  filter(!between(CAT, 10,11), 
         !between(CAT, 30, 31),
         SUBTYPE == "R") %>%
  group_by(WEEK_END_DATE) %>%
  summarise(mean_qty = mean(QTY),
            std_qty = sd(QTY),
            median_qty = median(QTY),
            iqr_qty = IQR(QTY),
            skew_qty = skewness(QTY)) %>%
  slice_sample(n=8) 
WEEK_END_DATE mean_qty std_qty median_qty iqr_qty skew_qty
2015-06-14 1323.923 448.9126 1170 450.5 1.2531740
2016-01-24 1433.556 401.3036 1316 474.5 0.8653915
2017-08-20 986.440 319.9758 883 309.0 1.0933669
2016-03-20 1469.148 439.0923 1350 360.5 1.2252406
2015-10-11 1310.444 420.3735 1191 407.5 1.2647395
2017-04-23 1375.704 443.6113 1262 526.0 0.9546428
2017-03-12 1434.519 444.2556 1305 537.5 1.0046758
2016-03-13 1480.963 459.8912 1352 434.5 1.1515874

The data table of aggregated orders suggests that across the categories we have right-tailed skewed distributions, but nothing that is alarming to practical users (Skew <= 1.3). It seems reasonable to aggregate using the median and IQR rather than mean and standard deviation as median/IQR are more robust for mostly normal distributions that have some right skewness. It also seems to be the case that the skewness of the weekly distribution decrease with time. This suggest that the weekly distributions are becoming more normal and thereby better centered to the mean/median.

ts_data<- 
  data %>%
  filter(!between(CAT, 10,11), 
         !between(CAT, 30, 31),
         SUBTYPE == "R", 
         WEEK_END_DATE > "2015-02-01", 
         WEEK_END_DATE < "2017-11-30") %>%
  group_by(WEEK_END_DATE) %>%
  summarise(median_qty = median(QTY),
            iqr_qty = IQR(QTY))

ts_data %>%
  ggplot(aes(x=WEEK_END_DATE, ymin=(median_qty-(iqr_qty/2)), ymax=(median_qty+(iqr_qty/2)))) +
  geom_ribbon(alpha=0.3, fill="#F8766D") +
  geom_line(aes(x=WEEK_END_DATE, y=median_qty), color="#F8766D") +
  labs(title = "Median Order Quantity All Categories",
       x= "Time",
       y="Median Order",
       subtitle = "\"R\" Subtype")

At a basic, naive level we could predict future demand given the plot above, in that we have an interval that spans a reasonable cross section of the historical data. To that end, we could generate values from sampling a normal distribution with median 1183, IQR 376, and skewness = 1.06 so as to predict order values through 2018. That said, we hope to do much better.

It’s worth noting that this aggregated data set seems less seasonal than the individual category orders. We could be headed into a Simpson’s Paradox like scenario.

Time Series Prediction

First, we will create a time series for weekly data.

library(lubridate)
ts_all<- ts(ts_data$median_qty, 
   freq=365.25/7, 
   start=decimal_date(ymd("2015-02-08")))
ggseasonplot(ts_all) +
  ggtitle("Year Over Year Median Orders") +
  theme(axis.text.x = element_text(angle = 45, hjust=1, vjust=1))

This year to year plot suggests a gradual slowing of “R” subtype orders. 2015 appears to be the highest year and 2017 the lowest.

Decomposition

We will use a basic multiplicative decomposition to separate the trend and seasonal components.

library(seasonal)
ts_all %>% 
  decompose("multiplicative") %>%
  autoplot()

Indeed, the aggregated by median order is not super seasonal. The trend also appears to be leak, which is good in that the trend is downward in orders.

Transformation, Difference, ARIMA Components

Next, we will see if some basic mathematical transformations with work to smooth out this data before modeling. Neither BoxCox nor Log10 transformation provided meaningful smoothing of scaling. Additionally, Unit root test suggest a marginal case for differencing at a value of 1.

library(urca)
ts_all_bc<- ts_all %>% BoxCox(lambda = BoxCox.lambda(ts_all))

ur.kpss(ts_all_bc)
## 
## ####################################### 
## # KPSS Unit Root / Cointegration Test # 
## ####################################### 
## 
## The value of the test statistic is: 0.5222
paste("n difference test: ", ts_all_bc %>% ndiffs())
## [1] "n difference test:  1"

Next we will look at the first difference of the data for insight into auto-regressive and moving average terms.

ggtsdisplay(ts_all %>%
              diff(), main="Single Difference with ACF/PACF")

ACF suggests single moving average term and PACF suggests two auto-regressive terms. We will also include a single seasonal difference to capture the leak seasonal component.

Modeling

Starting with a manual fit of an ARIMA model order (2,0,1) and seasonal (0,1,0) as indicated above.

We will also run a wide search using auto ARIMA because it’s very powerful.

auto_fit <- auto.arima(ts_all,
                         stepwise = FALSE,
                         approximation = FALSE,
                         seasonal = TRUE)
summary(auto_fit)
## Series: ts_all 
## ARIMA(0,1,1)(0,1,0)[52] 
## 
## Coefficients:
##           ma1
##       -0.8090
## s.e.   0.0577
## 
## sigma^2 estimated as 6182:  log likelihood=-543.59
## AIC=1091.18   AICc=1091.31   BIC=1096.27
## 
## Training set error measures:
##                    ME     RMSE      MAE          MPE     MAPE      MASE
## Training set 1.428953 62.47602 35.98865 -0.009685148 3.042215 0.4701811
##                     ACF1
## Training set -0.09922675
checkresiduals(auto_fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,1)(0,1,0)[52]
## Q* = 15.899, df = 28, p-value = 0.9673
## 
## Model df: 1.   Total lags used: 29

The auto fit ARIMA model looks okay. The residuals are well bounded, symmetric, and normal and the ACF is fine. There are some ACF terms beyond the cut-off, but a manual ARIMA model showed no better performance. We will keep the auto fit model, it’s also a simple model, which is always good. It should be noted that the RMSE on this model is 62 orders, which is less than fifth the naive prediction of the \(\pm 376\) orders.

Prediction

Let’s see what happens when we turn this model loose on the future.

auto_fit %>% forecast(h=52) %>% autoplot() + labs(y="Orders")

We also have a sample of the actual predictions from the mean value per day.

preds<- forecast(auto_fit, h=52)

order_df<- 
  as.data.frame(preds$mean) %>% 
  rename("Predicted_Orders"=x) %>%
  mutate(day=1:52) %>%
  mutate(Predicted_Orders = round(Predicted_Orders, 0))
order_df %>%
  top_n(-10)
Predicted_Orders day
1363 1
1289 2
1223 3
945 4
954 5
1471 6
1283 7
1402 8
1393 9
1398 10

Subtype “C” Forecast

We have the time to attempt a forecast of the subtype “C”. We will run through a similar process as above.

data %>%
  filter(between(CAT, 12,15), 
         SUBTYPE == "C") %>%
  mutate(WEEK_END_DATE = as.Date(WEEK_END_DATE)) %>%
  ggplot(aes(x=WEEK_END_DATE, y=QTY)) +
  geom_point(alpha=0.5, color="#F8766D") +
  facet_wrap(~CAT) +
  labs(title = "Weekly Orders by Category for Subtype \"C\" ",
       x= "Time", y= "Quantity") +
  scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
  theme(axis.text.x = element_text(angle = 45, hjust=1))

data %>%
  filter(!between(CAT, 10,11), 
         !between(CAT, 30, 31),
         SUBTYPE == "C") %>%
  group_by(WEEK_END_DATE) %>%
  summarise(mean_qty = mean(QTY),
            std_qty = sd(QTY),
            median_qty = median(QTY),
            iqr_qty = IQR(QTY),
            skew_qty = skewness(QTY)) %>%
  slice_sample(n=5)
WEEK_END_DATE mean_qty std_qty median_qty iqr_qty skew_qty
2017-12-03 20.13333 19.84895 17 19.50 1.7833703
2016-02-21 16.66667 14.50287 17 14.50 -0.0422019
2017-04-09 19.78571 20.11956 13 21.75 1.4143213
2016-12-25 15.08333 14.38091 9 14.75 1.3570483
2017-06-18 21.30769 18.03878 16 22.00 0.9836760

Median and IQR still work best as we have right-tail skewed data.

ts_data_c<- 
  data %>%
  filter(!between(CAT, 10,11), 
         !between(CAT, 30, 31),
         SUBTYPE == "C", 
         WEEK_END_DATE > "2015-02-01", 
         WEEK_END_DATE < "2017-11-30") %>%
  group_by(WEEK_END_DATE) %>%
  summarise(median_qty = median(QTY),
            iqr_qty = IQR(QTY))

ts_data_c %>%
  ggplot(aes(x=WEEK_END_DATE, ymin=(median_qty-(iqr_qty/2)), ymax=(median_qty+(iqr_qty/2)))) +
  geom_ribbon(alpha=0.3, fill="#F8766D") +
  geom_line(aes(x=WEEK_END_DATE, y=median_qty), color="#F8766D") +
  labs(title = "Median Order Quantity All Categories",
       x= "Time",
       y="Median Order",
       subtitle = "\"C\" Subtype")

A naive prediction will likely work out well here.

ts_all_c<- ts(ts_data_c$median_qty, 
   freq=365.25/7, 
   start=decimal_date(ymd("2015-01-04")))
ts_all_bc_c<- ts_all %>% BoxCox(lambda = BoxCox.lambda(ts_all))

ur.kpss(ts_all_c)
## 
## ####################################### 
## # KPSS Unit Root / Cointegration Test # 
## ####################################### 
## 
## The value of the test statistic is: 0.2974
paste("n difference test: ", ts_all_c %>% ndiffs())
## [1] "n difference test:  0"

No differencing needed and Box Cox transformation is unhelpful again here. Next is auto ARIMA modeling.

auto_fit_c <- auto.arima(ts_all_c,
                         stepwise = FALSE,
                         approximation = FALSE,
                         seasonal = TRUE)
summary(auto_fit_c)
## Series: ts_all_c 
## ARIMA(0,0,5) with non-zero mean 
## 
## Coefficients:
##          ma1     ma2      ma3     ma4     ma5     mean
##       0.6116  0.5752  -0.0580  0.2173  0.2664  12.5688
## s.e.  0.1012  0.1226   0.1239  0.1065  0.0972   1.5038
## 
## sigma^2 estimated as 35.92:  log likelihood=-316.18
## AIC=646.36   AICc=647.59   BIC=664.52
## 
## Training set error measures:
##                       ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -0.06086388 5.809038 3.865694 -36.94888 56.95814 0.4884076
##                      ACF1
## Training set -0.009890881
checkresiduals(auto_fit_c)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,5) with non-zero mean
## Q* = 5.0098, df = 14, p-value = 0.9857
## 
## Model df: 6.   Total lags used: 20

Errors look good as do metrics but Ljung test is pretty high.

auto_fit_c %>% forecast(h=52) %>% autoplot() + labs(y="Orders")

Indeed, a naive model has the lowest error overall.

preds_c<- forecast(auto_fit_c, h=52)

order_df_c<- 
  as.data.frame(preds_c$mean) %>% 
  rename("Predicted_Orders"=x) %>%
  mutate(day=1:52) %>%
  mutate(Predicted_Orders = round(Predicted_Orders, 0))
order_df_c %>%
  top_n(-10)
Predicted_Orders day
9 1
8 2
15 3
11 4
11 5
13 6
13 7
13 8
13 9
13 10

Category 10

Now we turn our attention to category 10, the outlier category for both subtypes “R” and “C”.

The amount of data for this category is a bit worrying. Without a sufficiently long enough history it is difficult to determine emerging trends or simple noise. Both “R” and “C” subtype have evidence of this towards the end of the dataset. These predictions should be taken with caution!

cat_10<-
  data %>%
  filter(SUBTYPE == "R", 
         CAT == 10) %>%
  mutate(WEEK_END_DATE = as.Date(WEEK_END_DATE))

ts_10<- ts(cat_10$QTY, 
   freq=365.25/7, 
   start=decimal_date(ymd("2015-01-04")))
ts_10 %>% autoplot() + ggtitle("Category 10 Subtype \"R\" Orders") + labs(y="Orders")

ur.kpss(ts_10)
## 
## ####################################### 
## # KPSS Unit Root / Cointegration Test # 
## ####################################### 
## 
## The value of the test statistic is: 1.3027
paste("n difference test: ", ts_10 %>% ndiffs())
## [1] "n difference test:  1"

Single differencing seems needed.

ts_10_diff <- ts_10 %>% diff()
tsdisplay(ts_10_diff,
          main="Single Difference Plot with ACF/PACF")

Indeed, this appears to be stationary with single difference. Now fit an auto ARIMA model.

auto_10 <- auto.arima(ts_10,
                         stepwise = FALSE,
                         approximation = FALSE,
                         seasonal = TRUE)
summary(auto_10)
## Series: ts_10 
## ARIMA(0,1,0) 
## 
## sigma^2 estimated as 217591:  log likelihood=-491.67
## AIC=985.34   AICc=985.4   BIC=987.51
## 
## Training set error measures:
##                    ME     RMSE      MAE      MPE     MAPE       MASE       ACF1
## Training set 64.54609 462.9192 333.1825 3.944648 13.15499 0.07332245 -0.1083199
checkresiduals(auto_10)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,0)
## Q* = 14.256, df = 13, p-value = 0.356
## 
## Model df: 0.   Total lags used: 13

All of the diagonstic metrics look solid including Ljung. The RMSE is very high, but that might be the result of insufficient data. Let’s look at the predictions.

auto_10 %>% forecast(h=52) %>% autoplot() + labs(y="Orders")

Auto ARIMA seems to thing this is not seasonal, nor trending and should be forecast at the mean value.

cat_10_c<-
  data %>%
  filter(SUBTYPE == "C", 
         CAT == 10) %>%
  mutate(WEEK_END_DATE = as.Date(WEEK_END_DATE))

ts_10_c<- ts(cat_10_c$QTY, 
   freq=365.25/7, 
   start=decimal_date(ymd("2015-01-04")))
ts_10_c %>% autoplot() +labs(title = "Category 10 Subtype \"C\"", y="Orders")

auto_10_c <- auto.arima(ts_10_c,
                         stepwise = FALSE,
                         approximation = FALSE,
                         seasonal = TRUE)
summary(auto_10_c)
## Series: ts_10_c 
## ARIMA(5,1,0) with drift 
## 
## Coefficients:
##           ar1      ar2      ar3      ar4     ar5   drift
##       -0.4190  -0.3966  -0.3901  -0.3216  0.3321  6.4417
## s.e.   0.1381   0.1613   0.1613   0.1681  0.1515  2.6875
## 
## sigma^2 estimated as 2490:  log likelihood=-365.82
## AIC=745.64   AICc=747.47   BIC=761.27
## 
## Training set error measures:
##                      ME     RMSE      MAE       MPE     MAPE       MASE
## Training set -0.4439387 47.33644 33.29081 -13.30435 26.45068 0.07730064
##                    ACF1
## Training set 0.00527075
checkresiduals(auto_10_c)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(5,1,0) with drift
## Q* = 10.285, df = 8, p-value = 0.2456
## 
## Model df: 6.   Total lags used: 14

Again, reasonable metrics and relatively simple models.

auto_10_c %>% forecast(h=52) %>% autoplot() + labs(y="Orders")

Here we have damped seasonal forecast with a trend component. Future work would include adjusting the dampening and drift components to flatten the projection.