The Time Series Data Library (TSDL) was created by Rob Hyndman, Professor of Statistics at Monash University, Australia. TSDL covers many subjects that can be analyzed with time series methods. In this project, we will use monthly sales data and build the best model we can to forecast future sales.
Install the development version from GitHub:
# install.packages("devtools")
devtools::install_github("FinYang/tsdl")
Import the libraries:
library(tsdl)
library(tidyverse)
library(lubridate)
library(forecast)
library(TTR)
library(fpp)
library(tseries)
library(TSstudio)
library(MLmetrics)
library(ggplot2)
Checking the tsdl library:
tsdl
## Time Series Data Library: 648 time series
##
## Frequency
## Subject 0.1 0.25 1 4 5 6 12 13 52 365 Total
## Agriculture 0 0 37 0 0 0 3 0 0 0 40
## Chemistry 0 0 8 0 0 0 0 0 0 0 8
## Computing 0 0 6 0 0 0 0 0 0 0 6
## Crime 0 0 1 0 0 0 2 1 0 0 4
## Demography 1 0 9 2 0 0 3 0 0 2 17
## Ecology 0 0 23 0 0 0 0 0 0 0 23
## Finance 0 0 23 5 0 0 20 0 2 1 51
## Health 0 0 8 0 0 0 6 0 1 0 15
## Hydrology 0 0 42 0 0 0 78 1 0 6 127
## Industry 0 0 9 0 0 0 2 0 1 0 12
## Labour market 0 0 3 4 0 0 17 0 0 0 24
## Macroeconomic 0 0 18 33 0 0 5 0 0 0 56
## Meteorology 0 0 18 0 0 0 17 0 0 12 47
## Microeconomic 0 0 27 1 0 0 7 0 1 0 36
## Miscellaneous 0 0 4 0 1 1 3 0 1 0 10
## Physics 0 0 12 0 0 0 4 0 0 0 16
## Production 0 0 4 14 0 0 28 1 1 0 48
## Sales 0 0 10 3 0 0 24 0 9 0 46
## Sport 0 1 1 0 0 0 0 0 0 0 2
## Transport and tourism 0 0 1 1 0 0 12 0 0 0 14
## Tree-rings 0 0 34 0 0 0 1 0 0 0 35
## Utilities 0 0 2 1 0 0 8 0 0 0 11
## Total 1 1 300 64 1 1 240 3 16 21 648
A time series is a sequence of observations that a system collects at regular intervals (daily, monthly, or yearly). Time series forecasting is a supervised learning problem: future values are predicted from patterns in past values.
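For illustration only (a minimal sketch, not part of this project's data pipeline), base R's ts() constructor encodes such a sequence together with its start date and frequency:
# a toy monthly series starting January 1960 (values are made up)
toy <- ts(c(100, 120, 130, 125, 140, 150), start = c(1960, 1), frequency = 12)
toy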
Load the data and store it in a sales object.
sales <- subset(tsdl, 12, "Sales")
sales
## Time Series Data Library: 24 Sales time series with frequency 12
##
## Frequency
## Subject 12
## Sales 24
sales_ts <- sales[[1]]
sales_ts
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct
## 1960 87695 86890 96442 98133 113615 123924 128924 134775 117357 114626
## 1961 92188 88591 98683 99207 125485 124677 132543 140735 124008 121194
## 1962 101007 94228 104255 106922 130621 125251 140318 146174 122318 128770
## 1963 108497 100482 106140 118581 132371 132042 151938 150997 130931 137018
## 1964 109894 106061 112539 125745 136251 140892 158390 148314 144148 140138
## 1965 109895 109044 122499 124264 142296 150693 163331 165837 151731 142491
## 1966 116963 118049 137869 127392 154166 160227 165869 173522 155828 153771
## 1967 124046 121260 138870 129782 162312 167211 172897 189689 166496 160754
## 1968 139625 137361 138963 155301 172026 165004 185861 190270 163903 174270
## 1969 146182 137728 148932 156751 177998 174559 198079 189073 175702 180097
## 1970 154277 144998 159644 168646 166273 190176 205541 193657 182617 189614
## 1971 158167 156261 176353 175720 193939 201269 218960 209861 198688 190474
## 1972 166286 170699 181468 174241 210802 212262 218099 229001 203200 212557
## 1973 188992 175347 196265 203526 227443 233038 234119 255133 216478 232868
## 1974 194784 189756 193522 212870 248565 221532 252642 255007 206826 233231
## 1975 199024 191813 195997 208684 244113 243108 255918 244642 237579 237579
## Nov Dec
## 1960 107677 108087
## 1961 111634 111565
## 1962 117518 115492
## 1963 121271 123548
## 1964 124075 136485
## 1965 140229 140463
## 1966 143963 143898
## 1967 155582 145936
## 1968 160272 165614
## 1969 155202 174508
## 1970 174176 184416
## 1971 194502 190755
## 1972 197095 193693
## 1973 221616 209893
## 1974 212678 217173
## 1975 217775 227621
The sales data is already in time series format, covering January 1960 to December 1975.
Check for missing values.
anyNA(sales_ts)
## [1] FALSE
No missing values found.
sales_ts %>% autoplot()
There are three components in a time series: trend, seasonality, and residual (random noise).
From the time series plot above, this data shows an additive trend and seasonality.
Time series decomposition involves thinking of a series as a combination of level, trend, seasonality, and noise components. Decomposition provides a useful abstract model for thinking about time series generally and for better understanding problems during time series analysis and forecasting.
# decompose
sales_dc <- decompose(sales_ts, type = "additive")
# visualize decompose result
autoplot(sales_dc)
Based on the decomposition plot above, the data has an increasing trend and clear seasonality.
Before building any model, we need to validate it on unseen data, so we split the series into train and test sets.
sales_test <- tail(sales_ts, 24)
sales_train <- head(sales_ts, length(sales_ts) - length(sales_test))
autoplot(sales_train) + geom_line(data = sales_test, col = "red")
Because our time series data has both trend and seasonal components, the models will be built using Holt-Winters and SARIMA.
The Holt-Winters method uses exponential smoothing to encode many past values and use them to predict present and future values. The three aspects of time series behavior (level, trend, and seasonality) are each expressed as a type of exponential smoothing, so Holt-Winters is also called triple exponential smoothing. The model predicts a current or future value by combining the effects of these three influences.
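For reference, these are the standard additive Holt-Winters recursions in textbook form ($\alpha$, $\beta$, $\gamma$ are the smoothing parameters and $m = 12$ is the seasonal period here):

$$\begin{aligned}
\ell_t &= \alpha\,(y_t - s_{t-m}) + (1-\alpha)\,(\ell_{t-1} + b_{t-1}) && \text{(level)} \\
b_t &= \beta\,(\ell_t - \ell_{t-1}) + (1-\beta)\,b_{t-1} && \text{(trend)} \\
s_t &= \gamma\,(y_t - \ell_{t-1} - b_{t-1}) + (1-\gamma)\,s_{t-m} && \text{(seasonal)} \\
\hat{y}_{t+h} &= \ell_t + h\,b_t + s_{t+h-m(\lfloor (h-1)/m \rfloor + 1)} && \text{(forecast)}
\end{aligned}$$

The level, trend, and seasonal coefficients reported by HoltWinters() below correspond to $\ell_t$, $b_t$, and $s_t$ at the end of the training data.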
model_hw <- HoltWinters(sales_train, seasonal = "additive")
model_hw
## Holt-Winters exponential smoothing with trend and additive seasonal component.
##
## Call:
## HoltWinters(x = sales_train, seasonal = "additive")
##
## Smoothing parameters:
## alpha: 0.03998687
## beta : 0.2948723
## gamma: 0.09662213
##
## Coefficients:
## [,1]
## a 220821.762
## b 1495.224
## s1 -20174.366
## s2 -25018.888
## s3 -13113.638
## s4 -11439.715
## s5 10183.605
## s6 12476.648
## s7 23048.920
## s8 26336.362
## s9 6273.779
## s10 7350.147
## s11 -3549.882
## s12 -3633.538
Create the forecast:
forecast_hw <- forecast(object = model_hw, h = 24)
Evaluate the model based on MAPE error.
MAPE(y_pred = forecast_hw$mean, y_true = sales_test)*100
## [1] 8.488444
The Holt-Winters model yields a MAPE of about 8.49% on the test set.
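MLmetrics::MAPE is simply the mean of |actual - forecast| / |actual|, so the same number can be reproduced in base R as a sanity check:
# MAPE by hand on the test window (x100 to express it as a percentage)
mean(abs((sales_test - forecast_hw$mean) / sales_test)) * 100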
The visualization of actual, fitted, and forecasted value using Holt Winters model.
test_forecast(actual = sales_ts,
forecast.obj = forecast_hw,
test = sales_test)
The Autoregressive Integrated Moving Average (ARIMA) model combines two methods: Autoregressive (AR) and Moving Average (MA). An ARIMA model can be understood by outlining each of its components, as listed below.
An ARIMA model has 3 parameters:
p: the number of lag observations in the model; also known as the lag order.
d: the number of times that the raw observations are differenced; also known as the degree of differencing.
q: the size of the moving average window; also known as the order of the moving average.
If the time series data has a seasonal pattern, we can add seasonal terms to ARIMA, which makes it a Seasonal Autoregressive Integrated Moving Average (SARIMA) model. ARIMA requires the data to be stationary: its values must fluctuate around a constant mean. We can apply differencing to make the data stationary.
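As a quick cross-check on the differencing orders (not in the original analysis), the forecast package provides nsdiffs() and ndiffs(), which estimate how many seasonal and regular differences a series needs:
# estimated seasonal differencing order (D) for the training series
nsdiffs(sales_train)
# estimated regular differencing order (d) after seasonal differencing
ndiffs(diff(sales_train, lag = 12))
We will confirm stationarity formally with the Augmented Dickey-Fuller (ADF) test below.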
adf.test(diff(sales_train, lag = 12))
##
## Augmented Dickey-Fuller Test
##
## data: diff(sales_train, lag = 12)
## Dickey-Fuller = -3.1454, Lag order = 5, p-value = 0.09954
## alternative hypothesis: stationary
adf.test(diff(sales_train, lag = 12) %>%
diff())
##
## Augmented Dickey-Fuller Test
##
## data: diff(sales_train, lag = 12) %>% diff()
## Dickey-Fuller = -8.0316, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary
With one seasonal difference alone, the ADF p-value (0.09954) is above 0.05, so we cannot conclude stationarity. After adding one regular difference, the p-value drops to 0.01, so the data is stationary with one regular difference and one seasonal difference (d = 1, D = 1).
We can use auto.arima() to build a model without defining the parameters manually.
model_sarima_auto <- auto.arima(sales_train, seasonal = T)
summary(model_sarima_auto)
## Series: sales_train
## ARIMA(2,1,0)(0,1,1)[12]
##
## Coefficients:
## ar1 ar2 sma1
## -1.0407 -0.7023 -0.8117
## s.e. 0.0573 0.0572 0.0852
##
## sigma^2 = 23451105: log likelihood = -1541.02
## AIC=3090.03 AICc=3090.3 BIC=3102.21
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 500.6924 4606.264 3489.947 0.202112 2.173995 0.406734 -0.09215646
From auto.arima, we get the model ARIMA(2,1,0)(0,1,1)[12].
Note: the notation is ARIMA(p,d,q)(P,D,Q)[n], where (P,D,Q) are the seasonal counterparts of (p,d,q) and n is the seasonal period.
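In backshift notation (standard textbook form; $B$ is the backshift operator, $m$ the seasonal period, $\varepsilon_t$ white noise), a SARIMA$(p,d,q)(P,D,Q)[m]$ model is:

$$\Phi_P(B^m)\,\phi_p(B)\,(1-B)^d\,(1-B^m)^D\,y_t = \Theta_Q(B^m)\,\theta_q(B)\,\varepsilon_t$$

where $\phi_p$ and $\theta_q$ are the non-seasonal AR and MA polynomials and $\Phi_P$ and $\Theta_Q$ are their seasonal counterparts.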
Besides using auto.arima, we can build a model manually by defining the parameters p,d,q and P,D,Q from the ACF and PACF plots:
p & P: read from the PACF plot; for P, look at the seasonal multiples of the lag.
d & D: the number of regular and seasonal differences.
q & Q: read from the ACF plot; for Q, look at the seasonal multiples of the lag.
# plot ACF & PACF
tsdisplay(diff(sales_train, lag = 12) %>% diff())
For the non-seasonal part (p,d,q), look at lags 1, 2, 3, and so on. The ACF plot cuts off while the PACF plot tails off, which suggests an MA model. Then note which lags are significant, i.e. cross the blue line.
p: 0, d: 1, q: 1 or 3
For the seasonal part (P,D,Q), look at lags 12, 24, 36, and so on. Both the ACF and PACF plots tail off, suggesting a mixed AR-MA pattern. Again, note which seasonal lags are significant, i.e. cross the blue line.
P: 0, D: 1, Q: 2 or 3
From the results above, the candidate models to build are:
ARIMA(0,1,1)(0,1,2)[12]
ARIMA(0,1,3)(0,1,2)[12]
ARIMA(0,1,1)(0,1,3)[12]
ARIMA(0,1,3)(0,1,3)[12]
Next, build the models:
model_sarima_1 <- arima(x = sales_train, order = c(0,1,1),
seasonal = list(order = c(0,1,2), period = 12))
model_sarima_2 <- arima(x = sales_train, order = c(0,1,3),
seasonal = list(order = c(0,1,2), period = 12))
model_sarima_3 <- arima(x = sales_train, order = c(0,1,1),
seasonal = list(order = c(0,1,3), period = 12))
model_sarima_4 <- arima(x = sales_train, order = c(0,1,3),
seasonal = list(order = c(0,1,3), period = 12))
Summarize the errors from all SARIMA models:
summary(model_sarima_1)
##
## Call:
## arima(x = sales_train, order = c(0, 1, 1), seasonal = list(order = c(0, 1, 2),
## period = 12))
##
## Coefficients:
## ma1 sma1 sma2
## -0.8183 -0.6322 -0.3678
## s.e. 0.0387 0.2392 0.1128
##
## sigma^2 estimated as 26995578: log likelihood = -1559.79, aic = 3127.59
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 824.2185 4990.742 3780.052 0.3594269 2.341505 0.3443899 -0.2881937
summary(model_sarima_2)
##
## Call:
## arima(x = sales_train, order = c(0, 1, 3), seasonal = list(order = c(0, 1, 2),
## period = 12))
##
## Coefficients:
## ma1 ma2 ma3 sma1 sma2
## -1.2282 0.7176 -0.2614 -0.6434 -0.2114
## s.e. 0.0924 0.1649 0.1148 0.1095 0.0940
##
## sigma^2 estimated as 24936262: log likelihood = -1547.65, aic = 3107.3
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 756.6216 4796.617 3660.751 0.3344795 2.29415 0.3335207 0.002309153
summary(model_sarima_3)
##
## Call:
## arima(x = sales_train, order = c(0, 1, 1), seasonal = list(order = c(0, 1, 3),
## period = 12))
##
## Coefficients:
## ma1 sma1 sma2 sma3
## -0.8245 -0.5843 -0.3319 -0.0836
## s.e. 0.0397 0.1855 0.1192 0.1339
##
## sigma^2 estimated as 27219890: log likelihood = -1559.61, aic = 3129.22
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 816.1498 5011.432 3787.171 0.3582036 2.348482 0.3450384 -0.2919905
summary(model_sarima_4)
##
## Call:
## arima(x = sales_train, order = c(0, 1, 3), seasonal = list(order = c(0, 1, 3),
## period = 12))
##
## Coefficients:
## ma1 ma2 ma3 sma1 sma2 sma3
## -1.2635 0.7413 -0.2538 -0.6589 -0.1049 -0.2349
## s.e. 0.0937 0.1723 0.1208 0.3188 0.1709 0.1594
##
## sigma^2 estimated as 23063446: log likelihood = -1546.54, aic = 3107.09
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 719.2249 4612.986 3487.146 0.3189461 2.185316 0.317704 0.001397342
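For a quick side-by-side look at in-sample fit (a convenience sketch; AIC() works directly on arima fits, and the list names are just labels chosen here):
# compare the AIC of all candidate models in one pass
sapply(list(auto = model_sarima_auto,
            sarima_1 = model_sarima_1,
            sarima_2 = model_sarima_2,
            sarima_3 = model_sarima_3,
            sarima_4 = model_sarima_4),
       AIC)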
Create the forecasts:
forecast_model_auto <- forecast(model_sarima_auto, h = 24)
forecast_model1 <- forecast(model_sarima_1, h = 24)
forecast_model2 <- forecast(model_sarima_2, h = 24)
forecast_model3 <- forecast(model_sarima_3, h = 24)
forecast_model4 <- forecast(model_sarima_4, h = 24)
Evaluate the models based on MAPE error.
MAPE(y_pred = forecast_model_auto$mean, y_true = sales_test)*100
## [1] 5.693121
MAPE(y_pred = forecast_model1$mean, y_true = sales_test)*100
## [1] 5.183373
MAPE(y_pred = forecast_model2$mean, y_true = sales_test)*100
## [1] 5.950886
MAPE(y_pred = forecast_model3$mean, y_true = sales_test)*100
## [1] 5.478983
MAPE(y_pred = forecast_model4$mean, y_true = sales_test)*100
## [1] 6.202979
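The same comparison can be written more compactly (a sketch; forecasts is a helper list introduced here, not an object defined earlier):
# test-set MAPE (%) for every candidate in one pass
forecasts <- list(auto = forecast_model_auto,
                  sarima_1 = forecast_model1,
                  sarima_2 = forecast_model2,
                  sarima_3 = forecast_model3,
                  sarima_4 = forecast_model4)
sapply(forecasts, function(f) MAPE(y_pred = f$mean, y_true = sales_test) * 100)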
Based on MAPE error, the best model is model_sarima_1 = ARIMA(0,1,1)(0,1,2)[12], with a test-set MAPE of 5.18%.
The visualization of actual, fitted, and forecasted values using the best SARIMA model:
test_forecast(actual = sales_ts,
forecast.obj = forecast_model1,
test = sales_test)
Next, check the residual assumptions, starting with a residual plot and a Shapiro-Wilk normality test:
plot(model_sarima_1$residuals)
shapiro.test(model_sarima_1$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_sarima_1$residuals
## W = 0.98792, p-value = 0.1588
Since the p-value (0.1588) > alpha (0.05), we fail to reject the null hypothesis: the residuals are normally distributed.
# Box-Pierce test (the default of Box.test; use type = "Ljung-Box" for the Ljung-Box variant)
Box.test(model_sarima_1$residuals)
##
## Box-Pierce test
##
## data: model_sarima_1$residuals
## X-squared = 13.953, df = 1, p-value = 0.0001874
Here the p-value (0.0001874) < alpha (0.05), so we reject the null hypothesis: some autocorrelation remains in the residuals.
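The forecast package bundles these diagnostics: checkresiduals() draws the residual time plot, ACF, and histogram, and runs a Ljung-Box test in one call (a sketch, equivalent in spirit to the checks above):
# one-stop residual diagnostics for the chosen model
checkresiduals(model_sarima_1)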
From the evaluation of all the models, it can be concluded that model_sarima_1 = ARIMA(0,1,1)(0,1,2)[12] is the best model, with the lowest MAPE error of 5.18%. Based on the assumption checking, the residuals are normally distributed, but the Box-Pierce test indicates some remaining autocorrelation in the residuals, so the model could still be refined further.
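As a natural next step (not part of the original write-up), the winning specification can be refit on the full series to forecast beyond December 1975; a sketch:
# refit ARIMA(0,1,1)(0,1,2)[12] on all of the data and forecast one year ahead
final_model <- arima(sales_ts, order = c(0, 1, 1),
                     seasonal = list(order = c(0, 1, 2), period = 12))
autoplot(forecast(final_model, h = 12))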