A time series is simply a series of data points ordered in time. In a time series, time is often the independent variable and the goal is usually to make a forecast for the future.
Time series data often arise when monitoring industrial processes or tracking corporate business metrics. The essential difference between modeling data via time series methods and via ordinary process monitoring methods is that time series analysis accounts for the internal structure (such as autocorrelation, trend, or seasonal variation) that data points taken over time may have.
Time series analysis comprises methods for analyzing time series data in order to extract meaningful statistics and other characteristics of the data. Time series forecasting is the use of a model to predict future values based on previously observed values.
While regression analysis is often employed to test whether the current values of one or more independent time series affect the current value of another time series, this type of analysis is not called "time series analysis", which focuses on comparing values of a single time series, or of multiple dependent time series, at different points in time. Interrupted time series analysis is the analysis of interventions on a single time series.
Time series analysis is used for many applications, such as sales forecasting, economic and financial analysis, meteorology, and demography, and many, many more…
In this case we will fit three time series models (simple moving average, Holt-Winters exponential smoothing, and ARIMA) and make predictions with each.
In the end we will compare the Mean Absolute Percentage Error (MAPE) of the forecasts from each model.
# Data processing
library(tidyverse)
# Forecasting functions
library(forecast)
# Date handling
library(lubridate)
# Simple Moving Average (SMA)
library(TTR)
# MAPE metric
library(MLmetrics)
# yearmon and other time series utilities
library(zoo)
# Interactive data visualisation
library(plotly)
# Extensible time series objects
library(xts)
# Time series plotting helpers (plot_forecast, test_forecast)
library(TSstudio)
# Stationarity tests (adf.test)
library(tseries)
In this case we will use tsdl, the Time Series Data Library package. This package contains time series data on many subjects, and each series is already stored as a ts object.
library(tsdl)
Printing the object shows the subjects available in the library:
tsdl
## Time Series Data Library: 648 time series
##
## Frequency
## Subject 0.1 0.25 1 4 5 6 12 13 52 365 Total
## Agriculture 0 0 37 0 0 0 3 0 0 0 40
## Chemistry 0 0 8 0 0 0 0 0 0 0 8
## Computing 0 0 6 0 0 0 0 0 0 0 6
## Crime 0 0 1 0 0 0 2 1 0 0 4
## Demography 1 0 9 2 0 0 3 0 0 2 17
## Ecology 0 0 23 0 0 0 0 0 0 0 23
## Finance 0 0 23 5 0 0 20 0 2 1 51
## Health 0 0 8 0 0 0 6 0 1 0 15
## Hydrology 0 0 42 0 0 0 78 1 0 6 127
## Industry 0 0 9 0 0 0 2 0 1 0 12
## Labour market 0 0 3 4 0 0 17 0 0 0 24
## Macroeconomic 0 0 18 33 0 0 5 0 0 0 56
## Meteorology 0 0 18 0 0 0 17 0 0 12 47
## Microeconomic 0 0 27 1 0 0 7 0 1 0 36
## Miscellaneous 0 0 4 0 1 1 3 0 1 0 10
## Physics 0 0 12 0 0 0 4 0 0 0 16
## Production 0 0 4 14 0 0 28 1 1 0 48
## Sales 0 0 10 3 0 0 24 0 9 0 46
## Sport 0 1 1 0 0 0 0 0 0 0 2
## Transport and tourism 0 0 1 1 0 0 12 0 0 0 14
## Tree-rings 0 0 34 0 0 0 1 0 0 0 35
## Utilities 0 0 2 1 0 0 8 0 0 0 11
## Total 1 1 300 64 1 1 240 3 16 21 648
In this case I will use Sales data with frequency 12 (monthly). We subset the library and pick one series:
# keep only monthly (frequency 12) Sales series, then take the 24th one
sales <- subset(tsdl, 'sales', 12)
sales <- sales[[24]]
head(sales)
## Time Series:
## Start = 1966.83
## End = 1967.24666666667
## Frequency = 12
## [1] 61 48 53 78 75 58
## attr(,"source")
## [1] Roberts (1992)
## attr(,"description")
## [1] Monthly unit sales, Winnebago Industries, Nov. 1966 - Feb. 1972
## attr(,"subject")
## [1] Sales
Before going further with the series we got from tsdl, we check its length and whether it contains any NA values.
# Length of data
length(sales)
## [1] 64
# NA check
table(is.na(sales))
##
## FALSE
## 64
The series has 64 observations and is.na() returns FALSE for all 64, so the data contains no NAs. Besides missing values, we should also look at how the series is distributed, which we can do with summary():
summary(sales)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 48.0 194.5 371.0 466.8 628.0 1753.0
The minimum of our time series is 48.0, so the series contains no zero values; this matters later because both the multiplicative decomposition and MAPE involve dividing by the observed values.
From tsdl package data we get it already an TS object. These are vectors or matrices with class of “ts” (and additional attributes) which represent data which has been sampled at equispaced points in time. In the matrix case, each column of the matrix data is assumed to contain a single (univariate) time series. Time series must have at least one observation, and although they need not be numeric there is very limited support for non-numeric series.
I will also convert it to a data frame, which will help us visualise the data later.
# convert the ts to a data frame with an explicit Date column
sales_df <- data.frame(Y = as.matrix(sales), date = as_date(yearmon(time(sales))))
In this part, I will explore the time series object to see whether we can find a trend, seasonality, or other insights through visualisation.
There are basically two ways to model the seasonality of a time series: additive and multiplicative. In both, the series is described by three main components that are estimated for forecasting: trend, seasonality, and noise (error).
In the additive model, the effects of the individual components are separated and then added together. It can be represented by:
\(y(t) = Trend + Seasonality + Noise\)
In the additive model, the behavior is linear: changes over time are consistently of the same magnitude, like a linear trend, and the seasonality keeps the same amplitude and frequency.
In the multiplicative model, the trend, seasonal, and error components are multiplied together. The behavior is not linear; it can be exponential or quadratic, and is represented by a curved line:
\(y(t) = Trend * Seasonality * Noise\)
Unlike the additive model, the multiplicative model has an amplitude and/or frequency that increases or decreases over time.
These charts summarise the additive and multiplicative models.
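As a quick illustration (simulated, made-up values, not our sales data), the following sketch generates one series of each type: the additive series keeps a constant seasonal amplitude, while the multiplicative one widens with the trend.
# Simulated illustration: additive vs multiplicative seasonality
set.seed(1)
t_idx <- 1:120
trend <- 100 + 2 * t_idx                    # linear upward trend
seas  <- sin(2 * pi * t_idx / 12)           # 12-period seasonal cycle
noise <- rnorm(120, sd = 5)
additive       <- ts(trend + 20 * seas + noise, frequency = 12)        # constant amplitude
multiplicative <- ts(trend * (1 + 0.3 * seas) + noise, frequency = 12) # amplitude grows with trend
plot(cbind(additive, multiplicative), main = "Additive vs multiplicative seasonality")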
Next we visualise our own time series to see whether it is multiplicative or additive.
ggplotly(
sales %>% as.xts() %>% autoplot() + theme_bw()
)
Based on the chart, we can say our data is multiplicative: the seasonal swings widen over time, behaving like an increasing funnel.
In this part we decompose the data, i.e. separate it into its three components: trend, seasonality, and error.
sales_deco <- decompose(sales, type = "multiplicative")
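As a sanity check on the multiplicative structure, the three components returned by decompose() multiply back to the observed series:
# observed = trend * seasonal * random, so the ratio below is exactly 1
# (NAs appear at the ends, where the centered moving-average trend is undefined)
reconstructed <- sales_deco$trend * sales_deco$seasonal * sales_deco$random
summary(as.numeric(sales / reconstructed))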
Visualise trend data
sales_deco$trend %>% autoplot()
The trend is already smooth and no leftover seasonality shows in the line, which means the seasonality and frequency we set in the ts object fit well; we can call this a single-seasonal series.
Next, visualising the seasonal component:
sales_deco$seasonal %>% autoplot()
The decomposition captures the seasonal pattern well: it shows a repeating pattern with a fixed period. We can interpret the seasonality further by processing the data:
sales_df %>%
  # attach the seasonal component and reduce dates to month names
  mutate(
    seasonal = sales_deco$seasonal,
    date = month(date, label = TRUE, abbr = FALSE)
  ) %>%
  # keep one row per month
  distinct(date, seasonal) %>%
  ggplot(mapping = aes(x = date, y = seasonal)) +
  geom_col() +
  theme_bw()
Based on the plot, sales tend to increase from January until mid-year, then decrease from mid-year until December.
Cross-validation for time series should not be sampled randomly but split sequentially. In this case we take the last 12 observations as test data and keep the rest as training data.
# keep everything except the last 12 observations for training
sales_train <- head(sales, length(sales) - 12)
# hold out the last 12 observations (one seasonal cycle) as test data
sales_test <- tail(sales, 12)
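A quick check that the split is sequential and non-overlapping:
# 52 training points followed by 12 test points, test period strictly after training
length(sales_train)                                      # 52
length(sales_test)                                       # 12
tail(time(sales_train), 1) < head(time(sales_test), 1)   # TRUE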
The moving average model is probably the most naive approach to time series modelling: the next observation is estimated as the mean of the last n observations (here a 6-month window).
sales_sma <- SMA(x = sales_train, n = 6)
sales_sma
## Time Series:
## Start = 1966.83
## End = 1971.08
## Frequency = 12
## [1] NA NA NA NA NA 62.16667 76.33333
## [8] 100.50000 112.33333 119.33333 129.16667 136.00000 133.33333 128.83333
## [15] 136.16667 160.66667 190.66667 246.16667 283.66667 320.00000 330.66667
## [22] 325.33333 321.83333 290.16667 272.33333 247.66667 243.83333 264.33333
## [29] 274.16667 301.50000 353.83333 379.33333 413.83333 436.33333 445.83333
## [36] 432.83333 376.16667 343.83333 300.50000 291.50000 335.16667 426.16667
## [43] 482.16667 553.16667 631.83333 667.33333 638.00000 549.66667 524.16667
## [50] 518.33333 480.83333 488.83333
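To make explicit what SMA(x = sales_train, n = 6) computed above, a minimal check with stats::filter, using equal weights over the current and previous five observations, should reproduce it:
# sides = 1 averages each observation with the five before it, matching SMA(n = 6)
manual_sma <- stats::filter(sales_train, filter = rep(1/6, 6), sides = 1)
head(cbind(TTR = sales_sma, manual = manual_sma), 8)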
autoplot(sales_train) +
autolayer(sales_sma) +
theme_minimal()
In the plot above, we applied the moving average with a 6-month window; the red line is the smoothed series. With the SMA in hand, we forecast by calling forecast() with a horizon equal to the length of sales_test (note that forecast() applied directly to a ts object fits its own model to it, ETS by default in the forecast package).
sales_sma_forecast <- forecast(sales_sma, h = length(sales_test))
We plot it; the model smooths the historical data and then extends that smoothed series as the forecast:
plot_forecast(sales_sma_forecast)
The plot above, however, starts from the model's smoothed series; we can also overlay the forecast on the real training data:
sales_train %>% autoplot() +
autolayer(sales_sma_forecast) +
theme_bw()
Finally, we compare the forecast against the test data and evaluate it with MAPE; the error is around 31.2%.
result <- MAPE(y_pred = sales_sma_forecast$mean, y_true = sales_test)*100
result
## [1] 31.21606
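MLmetrics::MAPE computes the mean absolute percentage error,
\(MAPE = \frac{1}{n} \sum_{t=1}^{n} \left| \frac{y_t - \hat{y}_t}{y_t} \right| \times 100\%\)
which we can recompute by hand as a check:
# mean absolute percentage error, computed directly
mean(abs((sales_test - sales_sma_forecast$mean) / sales_test)) * 100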
result_mape <- data.frame(
"Model" = "Simple Moving Average",
"MAPE" = round(result)
)
Exponential smoothing uses a similar logic to the moving average, but this time a different, decreasing weight is assigned to each observation: less importance is given to observations further in the past.
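For simple exponential smoothing this weighting is explicit in the standard textbook recursion, with smoothing parameter \(\alpha\):
\(\hat{y}_{t+1} = \alpha y_t + (1-\alpha)\hat{y}_t\)
which unrolls to a weight of \(\alpha(1-\alpha)^k\) on the observation \(k\) steps in the past, so the weights decay exponentially.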
Exponential smoothing methods are divided into three types: simple (level only), double (level and trend, Holt's method), and triple (level, trend, and seasonality, the Holt-Winters method).
Based on our exploratory data analysis, the data has both trend and seasonality, so the method we use here is Holt-Winters, also known as triple exponential smoothing.
This method extends double exponential smoothing by adding a seasonal smoothing factor. We know our seasonal pattern is multiplicative, but I will fit both seasonal types to compare the results; the additive recursions are sketched below.
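For reference, the additive Holt-Winters recursions in their standard textbook form, with level \(\ell_t\), trend \(b_t\), seasonal component \(s_t\), seasonal period \(m\), and smoothing parameters \(\alpha, \beta, \gamma\):
\(\ell_t = \alpha (y_t - s_{t-m}) + (1-\alpha)(\ell_{t-1} + b_{t-1})\)
\(b_t = \beta (\ell_t - \ell_{t-1}) + (1-\beta) b_{t-1}\)
\(s_t = \gamma (y_t - \ell_{t-1} - b_{t-1}) + (1-\gamma) s_{t-m}\)
\(\hat{y}_{t+h} = \ell_t + h b_t + s_{t+h-m}\) (for \(h \le m\))
The multiplicative version divides by and multiplies with the seasonal index instead of subtracting and adding it.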
# using HoltWinters function to make model and using "additive" for seasonal setup
sales_triple_ex_add <- HoltWinters(x = sales_train, seasonal = "additive")
sales_triple_ex_add
## Holt-Winters exponential smoothing with trend and additive seasonal component.
##
## Call:
## HoltWinters(x = sales_train, seasonal = "additive")
##
## Smoothing parameters:
## alpha: 0.5265989
## beta : 0
## gamma: 0
##
## Coefficients:
## [,1]
## a 618.748592
## b 15.946241
## s1 55.288194
## s2 160.704861
## s3 34.954861
## s4 74.163194
## s5 -4.545139
## s6 -21.211806
## s7 -25.045139
## s8 -85.586806
## s9 -78.878472
## s10 -59.545139
## s11 -70.003472
## s12 19.704861
# using HoltWinters function to make model and using "multiplicative" for seasonal setup
sales_triple_ex_multi <- HoltWinters(x = sales_train, seasonal = "multiplicative")
sales_triple_ex_multi
## Holt-Winters exponential smoothing with trend and multiplicative seasonal component.
##
## Call:
## HoltWinters(x = sales_train, seasonal = "multiplicative")
##
## Smoothing parameters:
## alpha: 0.2641399
## beta : 0
## gamma: 0.2420101
##
## Coefficients:
## [,1]
## a 698.2807712
## b 15.9462413
## s1 1.2246027
## s2 1.5330102
## s3 1.3123347
## s4 1.4233987
## s5 1.0185449
## s6 0.9477492
## s7 0.8519777
## s8 0.5762602
## s9 0.6458062
## s10 0.7199728
## s11 0.6129171
## s12 1.0530772
We can interpret the smoothing parameters in the output above: alpha smooths the level, beta the trend, and gamma the seasonal component. A value close to 1 means the component reacts strongly to the most recent observations, while a value close to 0 means it relies almost entirely on past estimates; here beta = 0 in both fits, so the estimated trend slope stays essentially constant.
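By default HoltWinters() optimises these parameters itself, but they can also be fixed manually; a sketch with purely illustrative values (not the optimised ones above):
# fixing the smoothing parameters by hand; these values are illustrative only
HoltWinters(x = sales_train, seasonal = "multiplicative",
            alpha = 0.5, beta = 0.1, gamma = 0.3)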
After creating the models, we use them to forecast over a horizon equal to the length of the test set:
forecast_triple_ex_multi <- forecast(object = sales_triple_ex_multi, h = length(sales_test))
forecast_triple_ex_add <- forecast(object = sales_triple_ex_add, h = length(sales_test))
forecast_triple_ex_multi
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 1971.163 874.6443 815.2322 934.0564 783.7813 965.5073
## 1971.247 1119.3630 1045.1790 1193.5471 1005.9084 1232.8177
## 1971.330 979.1585 902.3545 1055.9625 861.6970 1096.6201
## 1971.413 1084.7234 996.4155 1173.0313 949.6681 1219.7786
## 1971.497 792.4402 712.6889 872.1915 670.4711 914.4093
## 1971.580 752.4732 668.3852 836.5613 623.8717 881.0748
## 1971.663 690.0205 603.9452 776.0958 558.3796 821.6614
## 1971.747 475.9049 399.2199 552.5900 358.6253 593.1846
## 1971.830 543.6376 453.8387 633.4366 406.3020 680.9732
## 1971.913 617.5517 513.9650 721.1385 459.1295 775.9740
## 1971.997 535.4992 437.2101 633.7883 385.1790 685.8194
## 1972.080 936.8551 789.1004 1084.6097 710.8838 1162.8263
With the forecasts in hand, we can visualise them and compare the results against sales_test.
test_forecast(actual = sales, # full series
forecast.obj = forecast_triple_ex_multi, # forecast object
train = sales_train, # training data
test = sales_test) # test data
test_forecast(actual = sales, # full series
forecast.obj = forecast_triple_ex_add, # forecast object
train = sales_train, # training data
test = sales_test) # test data
Finally, we compare the forecasts against the test data using MAPE; the error is around 26% for the multiplicative model and 32% for the additive one.
# Calculate MAPE for each holts winters
result_multi <- MAPE(y_pred = forecast_triple_ex_multi$mean, y_true = sales_test)*100
result_add <- MAPE(y_pred = forecast_triple_ex_add$mean, y_true = sales_test)*100
# Result of MAPE
result_multi
## [1] 26.15384
result_add
## [1] 31.91561
# Binding result to compare with other modelling
result <- data.frame(
"Model" = c("Holts Winter Additive", "Holts Winter Multiplicative"),
"MAPE" = c(round(result_add),round(result_multi))
)
result_mape <- rbind(result_mape, result)
While exponential smoothing methods make no assumptions about correlations between successive values of the time series, in some cases you can build a better predictive model by taking those correlations into account. Autoregressive Integrated Moving Average (ARIMA) models include an explicit statistical model for the irregular component of a time series, one which allows for non-zero autocorrelations in that component.
ARIMA models are defined for stationary time series. Therefore, if we start off with a non-stationary time series, we will need to 'difference' it until we obtain a stationary one.
We check whether our time series is stationary using the Augmented Dickey-Fuller (ADF) test.
H0: the data is not stationary
H1: the data is stationary
We reject H0 if the p-value < 0.05.
adf.test(x = sales)
##
## Augmented Dickey-Fuller Test
##
## data: sales
## Dickey-Fuller = -3.1705, Lag order = 3, p-value = 0.1007
## alternative hypothesis: stationary
The p-value is 0.1007 > 0.05, so we fail to reject H0: the data is not stationary. We try differencing once with the diff() function.
# attempt 1 times diff
adf.test(diff(sales))
## Warning in adf.test(diff(sales)): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: diff(sales)
## Dickey-Fuller = -6.0575, Lag order = 3, p-value = 0.01
## alternative hypothesis: stationary
So a single round of differencing is enough to make the data stationary, which suggests the optimal ARIMA model should use d = 1.
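The forecast package can also estimate the required number of differences directly with ndiffs(); it should agree with the single difference we found via the ADF tests above:
# estimate the differencing order via unit-root testing (KPSS by default)
ndiffs(sales_train)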
We build the ARIMA model with auto.arima(), which automatically searches for the optimal model order.
sales_arima_auto <- auto.arima(y = sales_train)
summary(sales_arima_auto)
## Series: sales_train
## ARIMA(0,1,0)
##
## sigma^2 estimated as 13345: log likelihood=-314.59
## AIC=631.18 AICc=631.26 BIC=633.11
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 12.28963 114.4049 80.86656 -0.2405419 24.24669 0.5348317
## ACF1
## Training set -0.1110519
The selected model is ARIMA(0,1,0): as we predicted, the differencing order in the ARIMA(p,d,q) notation is d = 1, with no AR (p = 0) or MA (q = 0) terms.
We forecast with this model and inspect the result:
forecast_arima_auto <- forecast(sales_arima_auto, h = length(sales_test))
plot_forecast(forecast_arima_auto)
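Since ARIMA(0,1,0) without a drift term (none appears in the summary above) is a random walk, the point forecasts stay flat at the last training value, which we can verify:
# a random-walk forecast repeats the last observed value at every horizon
tail(sales_train, 1)           # last training observation
head(forecast_arima_auto$mean) # point forecasts: the same level, repeated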
We compare the forecast against the test data using MAPE; the error is around 32.9%.
result_arima <- MAPE(y_pred = forecast_arima_auto$mean, y_true = sales_test)*100
# Result of MAPE
result_arima
## [1] 32.85827
# Binding result to compare with other modelling
result <- data.frame(
"Model" = c("ARIMA"),
"MAPE" = c(round(result_arima))
)
result_mape <- rbind(result_mape, result)
After getting results from each model (moving average, exponential smoothing, and ARIMA), we check the residual assumptions.
For normality we use shapiro.test(); here we expect to fail to reject H0, i.e. we want a p-value > 0.05.
H0 : residuals are normally distributed
H1 : residuals are not normally distributed
# SMA Model
shapiro.test(sales_sma_forecast$residuals)
##
## Shapiro-Wilk normality test
##
## data: sales_sma_forecast$residuals
## W = 0.95148, p-value = 0.04952
# Triple Exponential Smooting with Multiplicative Seasonal
shapiro.test(forecast_triple_ex_multi$residuals)
##
## Shapiro-Wilk normality test
##
## data: forecast_triple_ex_multi$residuals
## W = 0.97929, p-value = 0.6632
# Triple Exponential Smooting with Additive Seasonal
shapiro.test(forecast_triple_ex_add$residuals)
##
## Shapiro-Wilk normality test
##
## data: forecast_triple_ex_add$residuals
## W = 0.96391, p-value = 0.2273
# ARIMA Model
shapiro.test(forecast_arima_auto$residuals)
##
## Shapiro-Wilk normality test
##
## data: forecast_arima_auto$residuals
## W = 0.96444, p-value = 0.122
Of the four models we check, only the simple moving average has a p-value < 0.05, meaning its residuals are not normally distributed.
To check for the absence of autocorrelation we use the Box.test() function (Box-Pierce test).
H0: no autocorrelation in the forecast errors
H1: there is autocorrelation in the forecast errors
We expect each model's p-value to be > 0.05 so that we fail to reject H0.
# SMA Model
Box.test(x = sales_sma_forecast$residuals)
##
## Box-Pierce test
##
## data: sales_sma_forecast$residuals
## X-squared = 2.1206, df = 1, p-value = 0.1453
# Triple Exponential Smooting with Multiplicative Seasonal
Box.test(x = forecast_triple_ex_multi$residuals)
##
## Box-Pierce test
##
## data: forecast_triple_ex_multi$residuals
## X-squared = 1.182, df = 1, p-value = 0.277
# Triple Exponential Smooting with Additive Seasonal
Box.test(x = forecast_triple_ex_add$residuals)
##
## Box-Pierce test
##
## data: forecast_triple_ex_add$residuals
## X-squared = 0.49319, df = 1, p-value = 0.4825
# ARIMA Model
Box.test(x = forecast_arima_auto$residuals)
##
## Box-Pierce test
##
## data: forecast_arima_auto$residuals
## X-squared = 0.64129, df = 1, p-value = 0.4232
Based on the results, all models have p-values > 0.05, meaning none of them show autocorrelation in the forecast errors.
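As an alternative to running these tests one by one, forecast::checkresiduals() bundles a residual time plot, ACF plot, histogram, and a Ljung-Box test into a single call; for example, for the ARIMA model:
# one-call residual diagnostics (time plot, ACF, histogram, Ljung-Box test)
checkresiduals(sales_arima_auto)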
Let's look at the MAPE results from all the models we built. The Holt-Winters multiplicative model comes out best, which is consistent with the multiplicative pattern we identified during exploration:
result_mape
## Model MAPE
## 1 Simple Moving Average 31
## 2 Holts Winter Additive 32
## 3 Holts Winter Multiplicative 26
## 4 ARIMA 33