The Time Series Data Library (TSDL) was created by Rob Hyndman, Professor of Statistics at Monash University, Australia. TSDL covers many subjects that can be analyzed with time series methods. In this project, we will use monthly sales data and build the best model we can to forecast future sales.
Install the development version from GitHub:
# install.packages("devtools")
devtools::install_github("FinYang/tsdl")
Import the libraries:
library(tsdl)
library(tidyverse)
library(lubridate)
library(forecast)
library(TTR)
library(fpp)
library(tseries)
library(TSstudio)
library(MLmetrics)
library(ggplot2)
Checking the tsdl library:
tsdl
## Time Series Data Library: 648 time series
##
## Frequency
## Subject 0.1 0.25 1 4 5 6 12 13 52 365 Total
## Agriculture 0 0 37 0 0 0 3 0 0 0 40
## Chemistry 0 0 8 0 0 0 0 0 0 0 8
## Computing 0 0 6 0 0 0 0 0 0 0 6
## Crime 0 0 1 0 0 0 2 1 0 0 4
## Demography 1 0 9 2 0 0 3 0 0 2 17
## Ecology 0 0 23 0 0 0 0 0 0 0 23
## Finance 0 0 23 5 0 0 20 0 2 1 51
## Health 0 0 8 0 0 0 6 0 1 0 15
## Hydrology 0 0 42 0 0 0 78 1 0 6 127
## Industry 0 0 9 0 0 0 2 0 1 0 12
## Labour market 0 0 3 4 0 0 17 0 0 0 24
## Macroeconomic 0 0 18 33 0 0 5 0 0 0 56
## Meteorology 0 0 18 0 0 0 17 0 0 12 47
## Microeconomic 0 0 27 1 0 0 7 0 1 0 36
## Miscellaneous 0 0 4 0 1 1 3 0 1 0 10
## Physics 0 0 12 0 0 0 4 0 0 0 16
## Production 0 0 4 14 0 0 28 1 1 0 48
## Sales 0 0 10 3 0 0 24 0 9 0 46
## Sport 0 1 1 0 0 0 0 0 0 0 2
## Transport and tourism 0 0 1 1 0 0 12 0 0 0 14
## Tree-rings 0 0 34 0 0 0 1 0 0 0 35
## Utilities 0 0 2 1 0 0 8 0 0 0 11
## Total 1 1 300 64 1 1 240 3 16 21 648
A time series is a sequence of observations that a system collects at regular intervals (daily, monthly, or yearly). Time series forecasting is a supervised learning problem: future values are predicted from patterns in past values.
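For illustration only (a minimal sketch, not part of this project's data pipeline), base R's ts() constructor encodes such a sequence together with its start date and frequency:
# a toy monthly series starting January 1960 (values are made up)
toy <- ts(c(100, 120, 130, 125, 140, 150), start = c(1960, 1), frequency = 12)
toy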
Load the data and store it in a sales object.
sales <- subset(tsdl, 12, "Sales")
sales
## Time Series Data Library: 24 Sales time series with frequency 12
##
## Frequency
## Subject 12
## Sales 24
sales_ts <- sales[[1]]
sales_ts
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct
## 1960 87695 86890 96442 98133 113615 123924 128924 134775 117357 114626
## 1961 92188 88591 98683 99207 125485 124677 132543 140735 124008 121194
## 1962 101007 94228 104255 106922 130621 125251 140318 146174 122318 128770
## 1963 108497 100482 106140 118581 132371 132042 151938 150997 130931 137018
## 1964 109894 106061 112539 125745 136251 140892 158390 148314 144148 140138
## 1965 109895 109044 122499 124264 142296 150693 163331 165837 151731 142491
## 1966 116963 118049 137869 127392 154166 160227 165869 173522 155828 153771
## 1967 124046 121260 138870 129782 162312 167211 172897 189689 166496 160754
## 1968 139625 137361 138963 155301 172026 165004 185861 190270 163903 174270
## 1969 146182 137728 148932 156751 177998 174559 198079 189073 175702 180097
## 1970 154277 144998 159644 168646 166273 190176 205541 193657 182617 189614
## 1971 158167 156261 176353 175720 193939 201269 218960 209861 198688 190474
## 1972 166286 170699 181468 174241 210802 212262 218099 229001 203200 212557
## 1973 188992 175347 196265 203526 227443 233038 234119 255133 216478 232868
## 1974 194784 189756 193522 212870 248565 221532 252642 255007 206826 233231
## 1975 199024 191813 195997 208684 244113 243108 255918 244642 237579 237579
## Nov Dec
## 1960 107677 108087
## 1961 111634 111565
## 1962 117518 115492
## 1963 121271 123548
## 1964 124075 136485
## 1965 140229 140463
## 1966 143963 143898
## 1967 155582 145936
## 1968 160272 165614
## 1969 155202 174508
## 1970 174176 184416
## 1971 194502 190755
## 1972 197095 193693
## 1973 221616 209893
## 1974 212678 217173
## 1975 217775 227621
The sales data is already in time series format, covering January 1960 to December 1975.
Check for missing values.
anyNA(sales_ts)
## [1] FALSE
No missing values found.
sales_ts %>% autoplot()
There are three components in a time series: trend, seasonality, and residual (random noise).
From the time series plot above, this data shows an additive trend and seasonality.
Time series decomposition involves thinking of a series as a combination of level, trend, seasonality, and noise components. Decomposition provides a useful abstract model for thinking about time series generally and for better understanding problems during time series analysis and forecasting.
# decompose
sales_dc <- decompose(sales_ts, type = "additive")
# visualize decompose result
autoplot(sales_dc)
Based on the decomposition plot above, the data has an increasing trend and clear seasonality.
Before building any model, we need to validate it on unseen data, so we split the series into train and test sets.
sales_test <- tail(sales_ts, 24)
sales_train <- head(sales_ts, length(sales_ts) - length(sales_test))
autoplot(sales_train) + geom_line(data = sales_test, col = "red")
Because our time series data has both trend and seasonal components, the models will be built using Holt-Winters and SARIMA.
The Holt-Winters method uses exponential smoothing to encode many past values and use them to predict present and future values. The three aspects of time series behavior (level, trend, and seasonality) are each expressed as a type of exponential smoothing, so Holt-Winters is also called triple exponential smoothing. The model predicts a current or future value by combining the effects of these three influences.
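For reference, these are the standard additive Holt-Winters recursions in textbook form ($\alpha$, $\beta$, $\gamma$ are the smoothing parameters and $m = 12$ is the seasonal period here):

$$\begin{aligned}
\ell_t &= \alpha\,(y_t - s_{t-m}) + (1-\alpha)\,(\ell_{t-1} + b_{t-1}) && \text{(level)} \\
b_t &= \beta\,(\ell_t - \ell_{t-1}) + (1-\beta)\,b_{t-1} && \text{(trend)} \\
s_t &= \gamma\,(y_t - \ell_{t-1} - b_{t-1}) + (1-\gamma)\,s_{t-m} && \text{(seasonal)} \\
\hat{y}_{t+h} &= \ell_t + h\,b_t + s_{t+h-m(\lfloor (h-1)/m \rfloor + 1)} && \text{(forecast)}
\end{aligned}$$

The level, trend, and seasonal coefficients reported by HoltWinters() below correspond to $\ell_t$, $b_t$, and $s_t$ at the end of the training data.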
model_hw <- HoltWinters(sales_train, seasonal = "additive")
model_hw
## Holt-Winters exponential smoothing with trend and additive seasonal component.
##
## Call:
## HoltWinters(x = sales_train, seasonal = "additive")
##
## Smoothing parameters:
## alpha: 0.03998687
## beta : 0.2948723
## gamma: 0.09662213
##
## Coefficients:
## [,1]
## a 220821.762
## b 1495.224
## s1 -20174.366
## s2 -25018.888
## s3 -13113.638
## s4 -11439.715
## s5 10183.605
## s6 12476.648
## s7 23048.920
## s8 26336.362
## s9 6273.779
## s10 7350.147
## s11 -3549.882
## s12 -3633.538
Create the forecast:
forecast_hw <- forecast(object = model_hw, h = 24)
Evaluate the model based on MAPE error.
MAPE(y_pred = forecast_hw$mean, y_true = sales_test)*100
## [1] 8.488444
The Holt-Winters model yields a MAPE of about 8.49% on the test set.
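MLmetrics::MAPE is simply the mean of |actual - forecast| / |actual|, so the same number can be reproduced in base R as a sanity check:
# MAPE by hand on the test window (x100 to express it as a percentage)
mean(abs((sales_test - forecast_hw$mean) / sales_test)) * 100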
The visualization of actual, fitted, and forecasted value using Holt Winters model.
test_forecast(actual = sales_ts,
forecast.obj = forecast_hw,
test = sales_test)
The Autoregressive Integrated Moving Average (ARIMA) model combines two methods: Autoregressive (AR) and Moving Average (MA). An ARIMA model can be understood by outlining each of its components, as listed below.
An ARIMA model has 3 parameters:
p: the number of lag observations in the model; also known as the lag order.
d: the number of times that the raw observations are differenced; also known as the degree of differencing.
q: the size of the moving average window; also known as the order of the moving average.
If the time series data has a seasonal pattern, we can add seasonal terms to ARIMA, which makes it a Seasonal Autoregressive Integrated Moving Average (SARIMA) model. ARIMA requires the data to be stationary: its values must fluctuate around a constant mean. We can apply differencing to make the data stationary.
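As a quick cross-check on the differencing orders (not in the original analysis), the forecast package provides nsdiffs() and ndiffs(), which estimate how many seasonal and regular differences a series needs:
# estimated seasonal differencing order (D) for the training series
nsdiffs(sales_train)
# estimated regular differencing order (d) after seasonal differencing
ndiffs(diff(sales_train, lag = 12))
We will confirm stationarity formally with the Augmented Dickey-Fuller (ADF) test below.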
adf.test(diff(sales_train, lag = 12))
##
## Augmented Dickey-Fuller Test
##
## data: diff(sales_train, lag = 12)
## Dickey-Fuller = -3.1454, Lag order = 5, p-value = 0.09954
## alternative hypothesis: stationary
adf.test(diff(sales_train, lag = 12) %>%
diff())
##
## Augmented Dickey-Fuller Test
##
## data: diff(sales_train, lag = 12) %>% diff()
## Dickey-Fuller = -8.0316, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary
With one seasonal difference alone, the ADF p-value (0.09954) is above 0.05, so we cannot conclude stationarity. After adding one regular difference, the p-value drops to 0.01, so the data is stationary with one regular difference and one seasonal difference (d = 1, D = 1).
We can use auto.arima() to build a model without defining the parameters manually.
model_sarima_auto <- auto.arima(sales_train, seasonal = T)
summary(model_sarima_auto)
## Series: sales_train
## ARIMA(2,1,0)(0,1,1)[12]
##
## Coefficients:
## ar1 ar2 sma1
## -1.0407 -0.7023 -0.8117
## s.e. 0.0573 0.0572 0.0852
##
## sigma^2 = 23451105: log likelihood = -1541.02
## AIC=3090.03 AICc=3090.3 BIC=3102.21
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 500.6924 4606.264 3489.947 0.202112 2.173995 0.406734 -0.09215646
From auto.arima, we get the model ARIMA(2,1,0)(0,1,1)[12].
Note: the notation is ARIMA(p,d,q)(P,D,Q)[n], where (P,D,Q) are the seasonal counterparts of (p,d,q) and n is the seasonal period.
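In backshift notation (standard textbook form; $B$ is the backshift operator, $m$ the seasonal period, $\varepsilon_t$ white noise), a SARIMA$(p,d,q)(P,D,Q)[m]$ model is:

$$\Phi_P(B^m)\,\phi_p(B)\,(1-B)^d\,(1-B^m)^D\,y_t = \Theta_Q(B^m)\,\theta_q(B)\,\varepsilon_t$$

where $\phi_p$ and $\theta_q$ are the non-seasonal AR and MA polynomials and $\Phi_P$ and $\Theta_Q$ are their seasonal counterparts.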
Besides using auto.arima, we can build a model manually by defining the parameters p,d,q and P,D,Q from the ACF and PACF plots:
p & P: read from the PACF plot; for P, look at the seasonal multiples of the lag.
d & D: the number of regular and seasonal differences.
q & Q: read from the ACF plot; for Q, look at the seasonal multiples of the lag.
# plot ACF & PACF
tsdisplay(diff(sales_train, lag = 12) %>% diff())
For the non-seasonal part (p,d,q), look at lags 1, 2, 3, and so on. The ACF plot cuts off while the PACF plot tails off, which suggests an MA model. Then note which lags are significant, i.e. cross the blue line.
p: 0, d: 1, q: 1 or 3
For the seasonal part (P,D,Q), look at lags 12, 24, 36, and so on. Both the ACF and PACF plots tail off, suggesting a mixed AR-MA pattern. Again, note which seasonal lags are significant, i.e. cross the blue line.
P: 0, D: 1, Q: 2 or 3
From the results above, the candidate models to build are:
ARIMA(0,1,1)(0,1,2)[12]
ARIMA(0,1,3)(0,1,2)[12]
ARIMA(0,1,1)(0,1,3)[12]
ARIMA(0,1,3)(0,1,3)[12]
Next, build the models:
model_sarima_1 <- arima(x = sales_train, order = c(0,1,1),
seasonal = list(order = c(0,1,2), period = 12))
model_sarima_2 <- arima(x = sales_train, order = c(0,1,3),
seasonal = list(order = c(0,1,2), period = 12))
model_sarima_3 <- arima(x = sales_train, order = c(0,1,1),
seasonal = list(order = c(0,1,3), period = 12))
model_sarima_4 <- arima(x = sales_train, order = c(0,1,3),
seasonal = list(order = c(0,1,3), period = 12))
Summarize the errors from all SARIMA models:
summary(model_sarima_1)
##
## Call:
## arima(x = sales_train, order = c(0, 1, 1), seasonal = list(order = c(0, 1, 2),
## period = 12))
##
## Coefficients:
## ma1 sma1 sma2
## -0.8183 -0.6322 -0.3678
## s.e. 0.0387 0.2392 0.1128
##
## sigma^2 estimated as 26995578: log likelihood = -1559.79, aic = 3127.59
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 824.2185 4990.742 3780.052 0.3594269 2.341505 0.3443899 -0.2881937
summary(model_sarima_2)
##
## Call:
## arima(x = sales_train, order = c(0, 1, 3), seasonal = list(order = c(0, 1, 2),
## period = 12))
##
## Coefficients:
## ma1 ma2 ma3 sma1 sma2
## -1.2282 0.7176 -0.2614 -0.6434 -0.2114
## s.e. 0.0924 0.1649 0.1148 0.1095 0.0940
##
## sigma^2 estimated as 24936262: log likelihood = -1547.65, aic = 3107.3
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 756.6216 4796.617 3660.751 0.3344795 2.29415 0.3335207 0.002309153
summary(model_sarima_3)
##
## Call:
## arima(x = sales_train, order = c(0, 1, 1), seasonal = list(order = c(0, 1, 3),
## period = 12))
##
## Coefficients:
## ma1 sma1 sma2 sma3
## -0.8245 -0.5843 -0.3319 -0.0836
## s.e. 0.0397 0.1855 0.1192 0.1339
##
## sigma^2 estimated as 27219890: log likelihood = -1559.61, aic = 3129.22
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 816.1498 5011.432 3787.171 0.3582036 2.348482 0.3450384 -0.2919905
summary(model_sarima_4)
##
## Call:
## arima(x = sales_train, order = c(0, 1, 3), seasonal = list(order = c(0, 1, 3),
## period = 12))
##
## Coefficients:
## ma1 ma2 ma3 sma1 sma2 sma3
## -1.2635 0.7413 -0.2538 -0.6589 -0.1049 -0.2349
## s.e. 0.0937 0.1723 0.1208 0.3188 0.1709 0.1594
##
## sigma^2 estimated as 23063446: log likelihood = -1546.54, aic = 3107.09
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 719.2249 4612.986 3487.146 0.3189461 2.185316 0.317704 0.001397342
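For a quick side-by-side look at in-sample fit (a convenience sketch; AIC() works directly on arima fits, and the list names are just labels chosen here):
# compare the AIC of all candidate models in one pass
sapply(list(auto = model_sarima_auto,
            sarima_1 = model_sarima_1,
            sarima_2 = model_sarima_2,
            sarima_3 = model_sarima_3,
            sarima_4 = model_sarima_4),
       AIC)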
Create the forecasts:
forecast_model_auto <- forecast(model_sarima_auto, h = 24)
forecast_model1 <- forecast(model_sarima_1, h = 24)
forecast_model2 <- forecast(model_sarima_2, h = 24)
forecast_model3 <- forecast(model_sarima_3, h = 24)
forecast_model4 <- forecast(model_sarima_4, h = 24)
Evaluate the models based on MAPE error.
MAPE(y_pred = forecast_model_auto$mean, y_true = sales_test)*100
## [1] 5.693121
MAPE(y_pred = forecast_model1$mean, y_true = sales_test)*100
## [1] 5.183373
MAPE(y_pred = forecast_model2$mean, y_true = sales_test)*100
## [1] 5.950886
MAPE(y_pred = forecast_model3$mean, y_true = sales_test)*100
## [1] 5.478983
MAPE(y_pred = forecast_model4$mean, y_true = sales_test)*100
## [1] 6.202979
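The same comparison can be written more compactly (a sketch; forecasts is a helper list introduced here, not an object defined earlier):
# test-set MAPE (%) for every candidate in one pass
forecasts <- list(auto = forecast_model_auto,
                  sarima_1 = forecast_model1,
                  sarima_2 = forecast_model2,
                  sarima_3 = forecast_model3,
                  sarima_4 = forecast_model4)
sapply(forecasts, function(f) MAPE(y_pred = f$mean, y_true = sales_test) * 100)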
Based on MAPE error, the best model is model_sarima_1 = ARIMA(0,1,1)(0,1,2)[12], with a test-set MAPE of 5.18%.
The visualization of actual, fitted, and forecasted values using the best SARIMA model:
test_forecast(actual = sales_ts,
forecast.obj = forecast_model1,
test = sales_test)
Next, check the residual assumptions, starting with a residual plot and a Shapiro-Wilk normality test:
plot(model_sarima_1$residuals)
shapiro.test(model_sarima_1$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_sarima_1$residuals
## W = 0.98792, p-value = 0.1588
Since the p-value (0.1588) > alpha (0.05), we fail to reject the null hypothesis: the residuals are normally distributed.
# Box-Pierce test (the default of Box.test; use type = "Ljung-Box" for the Ljung-Box variant)
Box.test(model_sarima_1$residuals)
##
## Box-Pierce test
##
## data: model_sarima_1$residuals
## X-squared = 13.953, df = 1, p-value = 0.0001874
Here the p-value (0.0001874) < alpha (0.05), so we reject the null hypothesis: some autocorrelation remains in the residuals.
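The forecast package bundles these diagnostics: checkresiduals() draws the residual time plot, ACF, and histogram, and runs a Ljung-Box test in one call (a sketch, equivalent in spirit to the checks above):
# one-stop residual diagnostics for the chosen model
checkresiduals(model_sarima_1)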
From the evaluation of all the models, it can be concluded that model_sarima_1 = ARIMA(0,1,1)(0,1,2)[12] is the best model, with the lowest MAPE error of 5.18%. Based on the assumption checking, the residuals are normally distributed, but the Box-Pierce test indicates some remaining autocorrelation in the residuals, so the model could still be refined further.
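As a natural next step (not part of the original write-up), the winning specification can be refit on the full series to forecast beyond December 1975; a sketch:
# refit ARIMA(0,1,1)(0,1,2)[12] on all of the data and forecast one year ahead
final_model <- arima(sales_ts, order = c(0, 1, 1),
                     seasonal = list(order = c(0, 1, 2), period = 12))
autoplot(forecast(final_model, h = 12))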