BACKGROUND The purpose of this file is to analyze a dataset, determine trends and seasonality, and derive a forecast. The skills applied here come from Diogo Alves de Resende; the analysis below covers only a small portion of what I learned, and if you are interested in learning more, his class is highly recommended. To find his course, go to Udemy.com and search for “Forecasting Models and Time Series for Business in R”. Additional skills and understanding were gathered from Factory Physics for Managers by Ed Pound, Jeffrey Bell, and Mark Spearman, and statistics were refreshed using Khan Academy. The data comes from the Kaggle dataset Forecasts for Product Demand, as used in Charles Gaydon's notebook.
In business we already know that the forecast is always going to be wrong. This does not mean that we cannot improve the forecast. The three rules of forecasting are:
Forecasts are always wrong.
Forecasts are more accurate over shorter horizons.
Forecasts are more accurate for aggregated demand than for individual items.
EXERCISE The store ships various products from a variety of warehouses. Recently it has missed its Perfect Order Fulfillment promise to customers by failing to deliver at the right time. Causes included a poor understanding of the weekly and monthly trends and seasonality at its locations.
They have provided data from 2011 to 2017 and have asked that it be used to determine trends and seasonality and to build a better forecast that the business can leverage in the future.
ASK Develop a list of questions to ask in order to meet the company’s request.
PREPARE The data can be accessed from Kaggle at:
https://www.kaggle.com/code/cgaydon/forecasting-product-demand-with-simple-models/notebook
Load the required packages. If any of these packages are not already installed, install them first with the command install.packages("package").
library(tidyverse)
library(prophet)
library(lubridate)
library(scales)
library(forecast)
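If any of these are missing, they can all be installed in one call, for example:
install.packages(c("tidyverse", "prophet", "lubridate", "scales", "forecast"))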
Next, set the working directory to where the data is saved, then bring the data into the environment pane.
raw_df <- read.csv("Data/Historical Product Demand.csv")
Once the data is loaded, do a bit of exploration to see how the data is built:
str(raw_df)
## 'data.frame': 1048575 obs. of 5 variables:
## $ Product_Code : chr "Product_0993" "Product_0979" "Product_0979" "Product_0979" ...
## $ Warehouse : chr "Whse_J" "Whse_J" "Whse_J" "Whse_J" ...
## $ Product_Category: chr "Category_028" "Category_028" "Category_028" "Category_028" ...
## $ Date : chr "2012/7/27" "2012/1/19" "2012/2/3" "2012/2/9" ...
## $ Order_Demand : chr "100 " "500 " "500 " "500 " ...
This output gives a glimpse of the data and shows which cleaning steps need to occur before it can feed the FB Prophet forecasting model: Date and Order_Demand are both stored as character strings.
PROCESS / EXPLORE Begin by transforming the data types of Date and Order_Demand:
#Order_Demand to numeric. Any entries that are not plain numbers are coerced
#to NA here, so expect a coercion warning.
raw_df$Order_Demand <- as.numeric(raw_df$Order_Demand)
#Turn NA values into a 0 value. This will help with graphing and summarizing
#the data.
raw_df[is.na(raw_df)] <- 0
#Transform the Date Variables
raw_df$Date<- ymd(raw_df$Date)
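Before moving on, a quick sanity check (a sketch) counts any dates that failed to parse and summarizes the now-numeric demand:
sum(is.na(raw_df$Date))
summary(raw_df$Order_Demand)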
Summarize the data by day.
df <- raw_df %>%
  group_by(Date) %>%
  summarise(Order_Demand = sum(Order_Demand))
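Summarizing collapses the table dramatically; a quick check (a sketch) shows how many daily totals remain and the span they cover:
nrow(df)
range(df$Date, na.rm = TRUE)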
The observations went from over 1 million rows to roughly 1.7 thousand daily totals. Graph the data via a bar plot to begin to understand what it looks like.
ggplot(df, aes(x = Date, y = Order_Demand)) +
  geom_bar(stat = "identity")
Looking at this data, a few things need to change in order to have any sort of accurate forecast.
Leading and trailing dates need to be removed, because those partial periods do not align with the rest of the data.
Missing days need to be filled in.
Daily variability is fairly high. To reduce it, sum by month.
#First filter out anything before 1-Jan-2012 or after 31-Dec-2016.
df <- filter(df, Date >= "2012-01-01" & Date < "2017-01-01")
#Fill in the missing dates, then set the NAs that appear to 0.
df <- df %>%
  mutate(Date = as.Date(Date)) %>%
  complete(Date = seq.Date(min(Date), max(Date), by = "day"))
df[is.na(df)] <- 0
#Summarize the data by month; floor_date collapses each date to the first
#day of its month.
df <- df %>%
  group_by(month = floor_date(Date, "month")) %>%
  summarize(Order_Demand = sum(Order_Demand))
#Re-plot the data
ggplot(df, aes(x = month, y = Order_Demand)) +
  geom_bar(stat = "identity", fill = "light blue") +
  labs(title = "Monthly Order Demand",
       caption = "Based on Kaggle Dataset Forecast Demand") +
  xlab("Date") +
  ylab("Order Demand") +
  theme_bw() +
  scale_x_date(date_labels = "%m-%y",
               date_breaks = "2 months") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))
There is much less volatility in the monthly data, and it can now be transformed into a time series object, allowing it to be further analyzed.
y <- ts(data = df$Order_Demand,
        start = c(2012, 1),
        frequency = 12)
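A few base-R calls (a quick sketch) confirm the series is set up as intended:
start(y)     #2012, period 1
end(y)       #2016, period 12
frequency(y) #12 observations per year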
ANALYZE This next portion covers general graphing, training and test sets, and forecasts.
Plot a polar seasonality plot to see if demand is skewed toward any particular month. This will assist with setting up any holidays/regressors.
ggseasonplot(x = y,
             main = "Seasonality Graph",
             polar = TRUE)
The graph shows some patterns: dips in demand during February, May, and August, and the most demand in March, July, and October.
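The same monthly patterns can be cross-checked with a subseries plot from the forecast package, which draws one mini-series per calendar month:
ggsubseriesplot(y) +
  ggtitle("Order Demand by Month")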
Next break down the forecast into the different components of time series data.
decomposition_multiplicative <- decompose(x = y,
                                          type = "multiplicative")
plot(decomposition_multiplicative)
This graph shows the following four components: the observed series, the underlying trend, the repeating seasonal pattern, and the random (remainder) variation left after trend and seasonality are removed.
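The seasonal indices themselves can be read straight off the decomposition object; in a multiplicative decomposition an index of 1.05 means that month runs roughly 5% above the deseasonalized level. A quick sketch:
#Monthly seasonal indices from the decomposition, one per calendar month.
round(decomposition_multiplicative$figure, 3)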
Next set up the training and test set of the data.
#Before setting up the training and test set, rename the columns so they work
#with the FB Prophet Model.
df <- rename(df, ds=month)
df <- rename(df, y=Order_Demand)
training <- df %>% filter(ds <"2016-07-01")
test <- df %>% filter(ds>="2016-07-01")
The training and test sets are created so the model can be fit on one portion of the history and then scored against months it has never seen; this gives an honest measure of forecast accuracy before the model is trusted going forward.
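A quick check (a sketch) confirms the split: 54 monthly observations to train on and 6 held out for evaluation.
nrow(training)
nrow(test)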
Next we will add holidays to the data frame.
#February
df$FebReg <- 0
df$FebReg <- replace(df$FebReg, which(df$ds == "2013-02-01" |
                                      df$ds == "2016-02-01"), 1)
#April
df$AprReg <- 0
df$AprReg <- replace(df$AprReg, which(df$ds == "2012-04-01" |
                                      df$ds == "2016-04-01"), 1)
#August
df$AugReg <- 0
df$AugReg <- replace(df$AugReg, which(df$ds == "2013-08-01" |
                                      df$ds == "2014-08-01" |
                                      df$ds == "2015-08-01"), 1)
#September
df$SepReg <- 0
df$SepReg <- replace(df$SepReg, which(df$ds == "2012-09-01" |
                                      df$ds == "2016-09-01"), 1)
head(df)
## # A tibble: 6 x 6
## ds y FebReg AprReg AugReg SepReg
## <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2012-01-01 74214936 0 0 0 0
## 2 2012-02-01 86711802 0 0 0 0
## 3 2012-03-01 85993881 0 0 0 0
## 4 2012-04-01 74503271 0 1 0 0
## 5 2012-05-01 86094669 0 0 0 0
## 6 2012-06-01 83349146 0 0 0 0
The holiday columns have been added. These flags will be used to build the holiday table that influences the forecasting model.
#February
feb_dates <- subset(df, df$FebReg == 1)
feb_dates <- feb_dates$ds
february <- tibble(holiday = "february",
ds = feb_dates)
#April
apr_dates <- subset(df, df$AprReg == 1)
apr_dates <- apr_dates$ds
april <- tibble(holiday = "april",
ds = apr_dates)
#August
aug_dates <- subset(df, df$AugReg == 1)
aug_dates <- aug_dates$ds
august <- tibble(holiday = "august",
ds = aug_dates)
#September
sep_dates <- subset(df, df$SepReg == 1)
sep_dates <- sep_dates$ds
september <- tibble(holiday = "september",
ds = sep_dates)
#Merge all of the holidays together
holidays <- rbind(february, april, august, september)
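As an aside, the four repetitive blocks above can be collapsed into one pass with purrr (loaded with the tidyverse). This is only a sketch; reg_cols and holidays_alt are hypothetical names, and the result matches the holidays table built above.
#Build one tibble per regressor column and row-bind them together.
reg_cols <- c(february = "FebReg", april = "AprReg",
              august = "AugReg", september = "SepReg")
holidays_alt <- purrr::imap_dfr(reg_cols, function(col, name)
  tibble(holiday = name, ds = df$ds[df[[col]] == 1]))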
Next build the forecasting model.
m <- prophet(yearly.seasonality = TRUE,
             holidays = holidays,
             seasonality.mode = "multiplicative",
             seasonality.prior.scale = 5,
             holidays.prior.scale = .5,
             changepoint.prior.scale = .1,
             changepoint.range = .85)
m <- fit.prophet(m, training)
## Disabling weekly seasonality. Run prophet with weekly.seasonality=TRUE to override this.
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
The power of the Prophet model is the customization available. The seasonality and holiday prior scales tell Prophet how much flexibility to allow those components, changepoint.prior.scale controls how readily the trend is allowed to change, and changepoint.range restricts changepoints to the first 85% of the history.
Next, build the future dataset based on the number of rows in the test data frame, then do a bit of processing.
future = make_future_dataframe(m,
                               periods = nrow(test),
                               freq = "month")
#Change the date type of future
future$ds <- ymd(future$ds)
#Remove NA values from future
future <- future %>% drop_na()
#Forecasting
forecast = predict(m, future)
#Visualization
plot(m, forecast) + add_changepoints_to_plot(m)
This plot breaks the forecast down. The black dots are the actual data points, the dark blue line is the point forecast, and the light blue shaded area marks the upper and lower intervals. The fit is not perfect, but when the training forecast is compared to the actuals, the difference ranges from about 5-20%, with only one month higher than 11%; and when the interval limits are taken into account, the actuals fall within them every time.
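That 5-20% comparison can be reproduced directly; this sketch (fitted_vals and pct_err are hypothetical names) measures the in-sample fitted values against the training actuals:
#Percentage error of the fitted values versus the training actuals.
fitted_vals <- head(forecast$yhat, nrow(training))
pct_err <- 100 * (fitted_vals - training$y) / training$y
summary(abs(pct_err))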
Break down the components of this model.
prophet_plot_components(m, forecast)
After creating the forecast, it is good to see how each component responded to the parameter choices that were made.
Determine the Accuracy of the Forecast.
predictions <- tail(forecast$yhat, nrow(test))
accuracy(predictions, test$y)
## ME RMSE MAE MPE MAPE
## Test set 6318488 8446495 7258097 7.743423 8.862971
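To make the MAPE concrete, it can also be computed by hand; this line should reproduce the value above:
#Mean absolute percentage error, computed directly (about 8.86 here).
mean(abs((test$y - predictions) / test$y)) * 100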
CONCLUSION The model brings us within 10% of actuals in most cases. The MAPE is not the best; below are some thoughts on how to improve it:
Work through the other forecasting models. To learn those, take Diogo's class.
Execute cross-validation to determine the best parameters. FB Prophet has an easy way to do this when working with daily data; with monthly data it becomes more complicated (a sketch of the call follows this list).
Look at breaking out the data by warehouses.
Potentially only forecast the top 20% of products.
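For reference, a Prophet cross-validation call would look roughly like the sketch below (df_cv is a hypothetical name). The horizon must be expressed in fixed units such as days or weeks, so six monthly periods are approximated as 180 days, which is part of why tuning on monthly data is messier.
#Rolling-origin cross validation over the fitted model, then summary metrics.
df_cv <- cross_validation(m, horizon = 180, units = "days")
performance_metrics(df_cv)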
This brings us to the end of forecasting via the FB Prophet model. Had the daily data been more reliable, cross-validation could have been used to run thousands of calculations and output the best parameters. Forecasting is difficult, but utilizing R helps work through the calculations. Thank you for taking the time to read, and drop any questions in the comments section.