library(tidyverse)
library(forecast)
library(prophet)
library(lubridate)

Use Case

In this LBB, we will forecasting sales data using facebook’s open source algorithm, that is prophet. The data that we use is the demand forecasting dataset that has been downloaded from kaggle. Let’s load and investigate the dataset first.

sales <- read.csv("data_input/dive-deeper/train.csv")
glimpse(sales)

## Observations: 730,500
## Variables: 4
## $ date  <fct> 2013-01-01, 2013-01-02, 2013-01-03, 2013-01-04, 2013-01-05, 2...
## $ store <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ item  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ sales <int> 13, 11, 14, 13, 10, 12, 10, 9, 12, 9, 9, 7, 10, 12, 5, 7, 16,...

The dataset has 730.500 observations and 4 variables. The variable description is explained below:
- date: date of transaction occur
- store: store that the transaction occur
- item: item purchased by customer
- sales: sales of transaction

Exploration Data Analysis

Before we modelling the tie series, we can explore our data first by doing several transformation and visualization. Here we want to see if there is any difference in sales generated towards day of week.

sales_clean <- sales %>% 
  mutate(store = as.factor(store)) %>% 
  group_by(date, store) %>% 
  summarise(total = sum(sales)) %>% 
  ungroup() %>% 
  mutate(DoW = wday(date, label = T, abbr = F))
head(sales_clean)

## # A tibble: 6 x 4
##   date       store total DoW    
##   <fct>      <fct> <int> <ord>  
## 1 2013-01-01 1      1316 Tuesday
## 2 2013-01-01 2      1742 Tuesday
## 3 2013-01-01 3      1588 Tuesday
## 4 2013-01-01 4      1423 Tuesday
## 5 2013-01-01 5      1032 Tuesday
## 6 2013-01-01 6      1099 Tuesday

Then, we can visualize it and separate the visualization based on its store.

ggplot(sales_clean, aes(date, total)) +
  geom_point(aes(col = DoW)) +
  facet_wrap(~store, ncol = 2) +
  theme_minimal()

From the visualization above, we can see that for every store there is no significant difference in the pattern of sales generated. Besides, the highest sales generated in weekend (Saturday and Sunday), and the lowest sales generated happen in Monday.

Based on its visualization, now we know that there is a weekly seasonality in our data. Next, we want to observe the specific date on peak season for store 1.

library(plotly)
plot_sales <- ggplot(sales_clean[sales_clean$store == 1, ], aes(date, total)) +
  geom_point(aes(col = DoW)) +
  facet_wrap(~store, ncol = 2) +
  theme_minimal()
ggplotly(plot_sales)

Preparing the data

In this LBB, let’s say we want to forecast the sales on store 1, so we have to filter the data. After that, we also rename the variable date by ds and the variable that we will forecast by y. This should be done since it is the requirement on prophet.

data_prop <- sales_clean %>% 
  filter(store == 1) %>% 
  rename(ds = date,
         y = total)
head(data_prop)

## # A tibble: 6 x 4
##   ds         store     y DoW      
##   <fct>      <fct> <int> <ord>    
## 1 2013-01-01 1      1316 Tuesday  
## 2 2013-01-02 1      1264 Wednesday
## 3 2013-01-03 1      1305 Thursday 
## 4 2013-01-04 1      1452 Friday   
## 5 2013-01-05 1      1499 Saturday 
## 6 2013-01-06 1      1613 Sunday

Baseline Prophet Model

Now, we make the baseline prophet model for our data. The baseline prophet model, by default will include the daily, weekly and yearly seasonality to our data.

model_prophet <- prophet() %>% 
  fit.prophet(data_prop)

Then, we make the future dataframe for 365 days ahead to predict using our previous model.

# make a future date to predict
future <-  make_future_dataframe(model_prophet, periods = 365)

# predict the 365 days ahead
forecast <- predict(model_prophet, future)

Inspecting Model Components

One thing that we can check is model components. The model components in the simple prophet model consists of three, they are trend, weekly, and yearly components.

prophet_plot_components(model_prophet, forecast)

As seen above, the trend tends to increase as the year goes by. Besides, the sales drops in Monday, and increase in the other day. Meanwhile, there is a spike in July, means that maybe there is a holiday effect in July. It might be the Independence Day, that is happen in July 4th. The holiday effect will us check later.

Visualizing the actual and forecast data

In this LBB, we have a test dataset that saved under data_input/dive-deeper folder. The test dataset contains actual data for our 1 year ahead data (sales that occur on 2017). Here we should read the dataset first.

test <- read.csv("data_input/dive-deeper/test.csv")

After loading the test data, we should do several data aggregation to make the data frame has the similar shape with the data_prop above.

test_data <- test %>% 
  filter(store == 1) %>% 
  mutate(date = ymd(date)) %>%
  group_by(date) %>%
  summarise( 
    demand = sum(sales) 
  ) 
test_data

## # A tibble: 365 x 2
##    date       demand
##    <date>      <int>
##  1 2017-01-01   2155
##  2 2017-01-02   1441
##  3 2017-01-03   1635
##  4 2017-01-04   1713
##  5 2017-01-05   1858
##  6 2017-01-06   1890
##  7 2017-01-07   2043
##  8 2017-01-08   2187
##  9 2017-01-09   1444
## 10 2017-01-10   1775
## # ... with 355 more rows

Since we use the prophet technique, we have to rename the date and demand column to be ds and y.

test_data <- test_data %>%
  rename( 
    "ds" = date, 
    "y" = demand 
  )

One of improtant thing to check whether our model can capture the pattern of the data or not is by visualizing our actual and forecast data from the data_prop and test_data. Below is the visualization of our sales data using our previous model_prophet.

plot(model_prophet, forecast) +
  geom_point(data = test_data %>% mutate(ds = as.POSIXct(ds)), aes(x=ds, y=y), color="tomato3")

The predicted sales based on model_prophet has been captured the actual data quite well. Now, let us check the model performance using mean absolute percentage error (mape).

mape <- function(y, yhat) {
  return(mean(abs(y - yhat)/ y))
}

eval <- test_data %>% 
  mutate( 
    ds = as.POSIXct(ds) 
  ) %>% 
  left_join(forecast) %>% 
  select(ds, y, yhat, yhat_upper, yhat_lower) 
eval

## # A tibble: 365 x 5
##    ds                      y  yhat yhat_upper yhat_lower
##    <dttm>              <int> <dbl>      <dbl>      <dbl>
##  1 2017-01-01 00:00:00  2155 2437.      2563.      2282.
##  2 2017-01-02 00:00:00  1441 1522.      1661.      1377.
##  3 2017-01-03 00:00:00  1635 1820.      1966.      1678.
##  4 2017-01-04 00:00:00  1713 1832.      1963.      1692.
##  5 2017-01-05 00:00:00  1858 1967.      2100.      1833.
##  6 2017-01-06 00:00:00  1890 2118.      2257.      1974.
##  7 2017-01-07 00:00:00  2043 2272.      2421.      2129.
##  8 2017-01-08 00:00:00  2187 2426.      2574.      2283.
##  9 2017-01-09 00:00:00  1444 1509.      1652.      1375.
## 10 2017-01-10 00:00:00  1775 1806.      1943.      1661.
## # ... with 355 more rows

mape(eval$y, eval$yhat)

## [1] 0.06171747

Result: Based on model_prophet, MAPE is 6.17% for the test data. Hence, we can say that our model_propet is quite good for for forecasting the future data.

Now, let us create second model with adding the holiday effect.

Adding holiday effect in Prophet

In our second model, we want to add the nye, christmas, and independence day. This consideration is taken by observing the specific date on plot_sales above.

# create new years and christmas eve data frame
nye_christ <- data_frame(
  holiday = 'nye_christ',
  ds = as.Date(c('2013-01-01', '2013-12-25',
                 '2014-01-01', '2014-12-25',
                 '2015-01-01', '2015-12-25',
                 '2016-01-01', '2016-12-25',
                 '2017-01-01', '2017-12-25')),
  lower_window = -1, 
  upper_window = 1)

# create independece day data frame
independence <- data_frame(
  holiday = 'independence',
  ds = as.Date(c('2013-07-04', 
                 '2014-07-04', 
                 '2015-07-04', 
                 '2016-07-04',
                 '2017-07-04')),
  lower_window = 0,
  upper_window = 5
)

# bind two data frames
holidays <- bind_rows(nye_christ, independence)

After determine the holiday that might affect our supermarket sales, let us build the model with include the holiday effect named model_prophet2. First, we should prepare our model first by determining the holidays, weekly.seasonality, changepoint.prior.scale, etc.

# preparing the model
model_prophet2 <- prophet(holidays = holidays,
                          holidays.prior.scale = 0.5, 
                          weekly.seasonality = 10,
                          interval.width = 0.95,
                          changepoint.prior.scale = 0.005)

# fitting the model
model_prophet2 <- fit.prophet(model_prophet2, data_prop)

# predict the 90 days ahead
forecast2 <- predict(model_prophet2, future)

Next, let us investigate the model components using visualization below.

prophet_plot_components(model_prophet2, forecast2)

The plot components above tells us that there is any holiday effect in our data. Now, let see the visualization of comparation of our actual and predicted data below.

plot(model_prophet2, forecast2) +
  add_changepoints_to_plot(model_prophet2) +
  geom_point(test_data, mapping = aes(as.POSIXct(ds), y), col = "red")

As seen above, the red dash line indicating the change points. We have 4 change points in 2014 data, means that there is changing trend in those points. For overall trend and seasonality, it captures well. To make sure the model performs well, let us check the error for the train and test data.

mape <- function(y, yhat) {
  return(mean(abs(y - yhat)/ y))
}

eval <- test_data %>% 
  mutate( 
    ds = as.POSIXct(ds) 
  ) %>% 
  left_join(forecast2) %>% 
  select(ds, y, yhat, yhat_upper, yhat_lower) 
eval

## # A tibble: 365 x 5
##    ds                      y  yhat yhat_upper yhat_lower
##    <dttm>              <int> <dbl>      <dbl>      <dbl>
##  1 2017-01-01 00:00:00  2155 2446.      2665.      2192.
##  2 2017-01-02 00:00:00  1441 1537.      1753.      1300.
##  3 2017-01-03 00:00:00  1635 1826.      2081.      1601.
##  4 2017-01-04 00:00:00  1713 1840.      2063.      1586.
##  5 2017-01-05 00:00:00  1858 1977.      2197.      1746.
##  6 2017-01-06 00:00:00  1890 2128.      2361.      1890.
##  7 2017-01-07 00:00:00  2043 2284.      2499.      2055.
##  8 2017-01-08 00:00:00  2187 2440.      2687.      2206.
##  9 2017-01-09 00:00:00  1444 1524.      1762.      1285.
## 10 2017-01-10 00:00:00  1775 1823.      2049.      1588.
## # ... with 355 more rows

mape(eval$y, eval$yhat)

## [1] 0.05423256

Result: From the result above, the MAPE is 5.42% for our test data. The model_prophet2 has better result than model_prophet.

Conclusion and Business Recommendation

Prophet is a good tool for forecasting when dealing with multiple seasonality and the holiday effect. In the supermarket sales dataset, we know that Sunday has highest sales, while the lowest sales occur in Monday. Besides, in July the sales tends to increase, while in end of year, the sales drop compare to other months.

When we gain insight from the seasonality effect above, as an analyst, we can take an action plan based on the highest sales, we should provide more item in our store, more employee to handle customers, and perhaps longer hours to open the store.

Also, based on our forecasting, it can be seen that model_prophet2 (mape = 5.42%) is better than model_prophet (mape = 6.17%). Based on our sales forecast, we know that we can planning the budget for next quarter. Hence, the budget will be more efficient.

Forecasting Sales using Prophet

Yaumil Sitta

4/29/2020