I am working in a company that selling Agriculture Equipment, we have a product that marked as the highest sales since the beginning (about 9 years ago), i want to predict the sales for the next year in order to prepare how many stock that we need in order to fulfil the demand.

First, lets prepare the libraries.

library(tidyverse)
library(dplyr)
library(lubridate)
library(forecast)
library(MLmetrics)
library(tseries)
library(ggplot2)
library(plotly)
library(padr)
library(zoo)

Data Preparation

Next, we read the data. (I have remove product name in the data so it’s only shows Date and Quantity :) )

sales <- read.csv("salesh.csv")
tail(sales)

Data Preprocess & Exploratory

To make a timeseries data, first we need to have a sequential date (tanggal) in order and fill the empty QTY with average value, We are going to use pad and na.fill command.

sales <- sales %>% 
  mutate(TANGGAL = as.POSIXct(TANGGAL)) %>% 
  pad() %>% #extend the date range
  mutate(QTY=na.fill(QTY, fill = "extend")) %>%  #fill the NA column with average between notNA 
  select(TANGGAL, QTY)
head(sales, 5)

Timeseries

Now we have a proper timeseries data, lets explore the data with HoltWinters method.

Creating timeseries the data shown are daily sales so i am using frequency 7*4 to get a better trend grasp.

sales_ts <- ts(data = sales$QTY, start = c(2013, 09,17), end = c(2019, 31,12), frequency = 7*4)

visualize

sales_ts %>% autoplot()

It shows slightly sales increasement on late 2019, lets see a better detail with decompose() with multiplicative type

sales_ts %>%  decompose(type = "multiplicative") %>% autoplot()

After we get the information, lets separate the data into train and test

test <- tail(sales_ts, 3) #using 3 days as test data

train <- head(sales_ts, length(sales_ts)-length(test))

Model Fitting with HoltWinters method

model_sales <- HoltWinters(train, seasonal = "multiplicative")

Forecast Model

model_forecast <- forecast(object = model_sales, h = 24) #forecast for the next 24 days
sum(model_forecast$mean)

## [1] 325.7463

I did compared this output with actual sales number, its about the same amount which is good, bet let see the visualisation, can the model predict it nicely?

Visualize the model

train %>% autoplot()+
  autolayer(test, series = "test data")+
  autolayer(model_forecast$mean, series = "forecast")

accuracy test

accuracy(model_forecast$mean, test)

##                ME     RMSE      MAE      MPE     MAPE ACF1 Theil's U
## Test set 3.484933 4.112942 3.484933 18.80892 18.80892    0 0.8861136

After the calculation steps above, it shows that the result for the next 24 days would be:

sum(model_forecast$mean)

## [1] 325.7463

with MAPE 18,80%, I am quite confident with the result despite i have done cross-validation with the actual sales

Multi Seasonal Time Series

One of my Mentor in my school tells that we can grasp a better pattern with Multi Seasonal Time Series (MSTS), its a method that can learn the-not-quite-tidy-time-series-data, we can add multiple frequency on reading the data, so let’s see the comparison.

For this i am going to use frequency of a week, a month and a year.

I am going to use only the data from 2011 to 2016 because it shows some unusual sales activity during 2017 to 2018 that can disrupt my model pattern.

sales.msts <- msts(sales$QTY, seasonal.periods = c(30, 30*3, 30*6), start = c(2011, 01,05), end = c(2016, 31,12))
sales.msts %>% mstl() %>% autoplot()

From the visual above, it shows a better trend pattern so it means that we can grasp the data-pattern quite good compared to standard timeseries.

Cross-Validation

cv_train <- function(x){
  # 30*12*3 is simply means daily time times 30 days times 12 months and 3 years
  train <- head(x, length(x) - 30*6)
}

cv_test <- function(x){
  test <- tail(x,30*6)
}

train_1 <- cv_train(sales.msts)
test_1 <- cv_test(sales.msts)

Modeling 1

mod_1 <- train_1 %>% stlm(lambda = 0,method = "arima")

Forecast& Evaluation

#forecast
forecast_1 <- forecast(object = mod_1,h = length(test_1))
#evaluation
eval_1 <- accuracy(as.vector(forecast_1$mean),test_1)
data.frame(eval_1)

Visual 1

forecast_plot <- ggplot() +
  geom_line(aes(x = (sales$TANGGAL %>% tail(30*6)),y = (sales$QTY %>% tail(30*6)),color = "actual")) +
  geom_line(aes(x = (sales$TANGGAL %>% tail(30*6)),y = forecast_1$mean, color = "forecasted")) +
  theme_minimal() +
  labs(title = "Forecast using MSTS ARIMA",
       x = "Date time", y = "Count")

ggplotly(forecast_plot)

Modeling 2

Lets try with log model

# Arima Modeling
mod_1_adt <- train_1 %>% log() %>%
  stlm(method = "arima", s.window = "periodic")

Forecast & Evaluation

# forecast
forecast_1_adt <- forecast(object = mod_1_adt,h = length(test_1))

# evaluation
eval_1_adt <- data.frame(accuracy(as.vector(exp(forecast_1_adt$mean)),test_1)) %>%
  mutate(Model = "ARIMA_2_adt")

eval_1_adt

Visual 2

forecast_plot2 <- ggplot() +
  geom_line(aes(x = (sales$TANGGAL %>% tail(30*12*5)),y = (sales$QTY %>% tail(30*12*5)),color = "actual")) +
  geom_line(aes(x = (sales$TANGGAL %>% tail(30*6)),y = exp(forecast_1_adt$mean), color = "forecasted")) +
  theme_minimal() +
  labs(title = "Forecast using MSTS ARIMA",
       x = "Date time", y = "Count")

ggplotly(forecast_plot2)

sum(forecast_1_adt$mean)

## [1] 330.3879

Conclusion

By Using Multi-Seasonal Time Series, we’re able to grasp the data-pattern pretty well (by looking at the decomposed trend visual). Despite of the MAPE value are still quite high (45ish %), it’s still acceptable based on the business wise. I also checked with the forecast value that shows 330.38, it can precisely forecast with current sales condition.

Sales Analysis with TimeSeries

Ronny Goei

09/03/2020