A time series is simply a series of data points ordered in time. In a time series, time is often the independent variable and the goal is usually to make a forecast for the future.
Time series data often arise when monitoring industrial processes or tracking corporate business metrics. The essential difference between modeling data via time series methods and via ordinary process monitoring methods is that time series analysis accounts for the internal structure (such as autocorrelation, trend, or seasonal variation) that data points taken over time may have.
Time series analysis comprises methods for analyzing time series data in order to extract meaningful statistics and other characteristics of the data. Time series forecasting is the use of a model to predict future values based on previously observed values.
While regression analysis is often employed to test whether the current values of one or more independent time series affect the current value of another time series, this type of analysis is not called "time series analysis", which focuses on comparing values of a single time series, or of multiple dependent time series, at different points in time. Interrupted time series analysis is the analysis of interventions on a single time series.
Time series analysis is used for many applications, such as sales forecasting, economic and financial analysis, meteorology, and demography, and many, many more…
In this case we will fit three time series models (simple moving average, Holt-Winters exponential smoothing, and ARIMA) and make predictions with each.
In the end we will compare the Mean Absolute Percentage Error (MAPE) of the forecasts from each model.
# Data processing
library(tidyverse)
# Forecasting functions
library(forecast)
# Date handling
library(lubridate)
# Simple Moving Average (SMA)
library(TTR)
# MAPE metric
library(MLmetrics)
# yearmon and other time series utilities
library(zoo)
# Interactive data visualisation
library(plotly)
# Extensible time series objects
library(xts)
# Time series plotting helpers (plot_forecast, test_forecast)
library(TSstudio)
# Stationarity tests (adf.test)
library(tseries)
In this case we will use tsdl, the Time Series Data Library package. This package contains time series data on many subjects, and each series is already stored as a ts object.
library(tsdl)
Printing the object shows the subjects available in the library:
tsdl
## Time Series Data Library: 648 time series
##
## Frequency
## Subject 0.1 0.25 1 4 5 6 12 13 52 365 Total
## Agriculture 0 0 37 0 0 0 3 0 0 0 40
## Chemistry 0 0 8 0 0 0 0 0 0 0 8
## Computing 0 0 6 0 0 0 0 0 0 0 6
## Crime 0 0 1 0 0 0 2 1 0 0 4
## Demography 1 0 9 2 0 0 3 0 0 2 17
## Ecology 0 0 23 0 0 0 0 0 0 0 23
## Finance 0 0 23 5 0 0 20 0 2 1 51
## Health 0 0 8 0 0 0 6 0 1 0 15
## Hydrology 0 0 42 0 0 0 78 1 0 6 127
## Industry 0 0 9 0 0 0 2 0 1 0 12
## Labour market 0 0 3 4 0 0 17 0 0 0 24
## Macroeconomic 0 0 18 33 0 0 5 0 0 0 56
## Meteorology 0 0 18 0 0 0 17 0 0 12 47
## Microeconomic 0 0 27 1 0 0 7 0 1 0 36
## Miscellaneous 0 0 4 0 1 1 3 0 1 0 10
## Physics 0 0 12 0 0 0 4 0 0 0 16
## Production 0 0 4 14 0 0 28 1 1 0 48
## Sales 0 0 10 3 0 0 24 0 9 0 46
## Sport 0 1 1 0 0 0 0 0 0 0 2
## Transport and tourism 0 0 1 1 0 0 12 0 0 0 14
## Tree-rings 0 0 34 0 0 0 1 0 0 0 35
## Utilities 0 0 2 1 0 0 8 0 0 0 11
## Total 1 1 300 64 1 1 240 3 16 21 648
In this case I will use Sales data with frequency 12 (monthly). We subset the library and pick one series:
# keep only monthly (frequency 12) Sales series, then take the 24th one
sales <- subset(tsdl, 'sales', 12)
sales <- sales[[24]]
head(sales)
## Time Series:
## Start = 1966.83
## End = 1967.24666666667
## Frequency = 12
## [1] 61 48 53 78 75 58
## attr(,"source")
## [1] Roberts (1992)
## attr(,"description")
## [1] Monthly unit sales, Winnebago Industries, Nov. 1966 - Feb. 1972
## attr(,"subject")
## [1] Sales
Before going further with the series we got from tsdl, we check its length and whether it contains any NA values.
# Length of data
length(sales)
## [1] 64
# NA check
table(is.na(sales))
##
## FALSE
## 64
The series has 64 observations and is.na() returns FALSE for all 64, so the data contains no NAs. Besides missing values, we should also look at how the series is distributed, which we can do with summary():
summary(sales)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 48.0 194.5 371.0 466.8 628.0 1753.0
The minimum of our time series is 48.0, so the series contains no zero values; this matters later because both the multiplicative decomposition and MAPE involve dividing by the observed values.
From tsdl package data we get it already an TS object. These are vectors or matrices with class of “ts” (and additional attributes) which represent data which has been sampled at equispaced points in time. In the matrix case, each column of the matrix data is assumed to contain a single (univariate) time series. Time series must have at least one observation, and although they need not be numeric there is very limited support for non-numeric series.
I will also convert it to a data frame, which will help us visualise the data later.
# convert the ts to a data frame with an explicit Date column
sales_df <- data.frame(Y = as.matrix(sales), date = as_date(yearmon(time(sales))))
In this part, I will explore the time series object to see whether we can find a trend, seasonality, or other insights through visualisation.
There are basically two ways to model the seasonality of a time series: additive and multiplicative. In both, the series is described by three main components that are estimated for forecasting: trend, seasonality, and noise (error).
In the additive model, the effects of the individual components are separated and then added together. It can be represented by:
\(y(t) = Trend + Seasonality + Noise\)
In the additive model, the behavior is linear: changes over time are consistently of the same magnitude, like a linear trend, and the seasonality keeps the same amplitude and frequency.
In the multiplicative model, the trend, seasonal, and error components are multiplied together. The behavior is not linear; it can be exponential or quadratic, and is represented by a curved line:
\(y(t) = Trend * Seasonality * Noise\)
Unlike the additive model, the multiplicative model has an amplitude and/or frequency that increases or decreases over time.
These charts summarise the additive and multiplicative models.
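As a quick illustration (simulated, made-up values, not our sales data), the following sketch generates one series of each type: the additive series keeps a constant seasonal amplitude, while the multiplicative one widens with the trend.
# Simulated illustration: additive vs multiplicative seasonality
set.seed(1)
t_idx <- 1:120
trend <- 100 + 2 * t_idx                    # linear upward trend
seas  <- sin(2 * pi * t_idx / 12)           # 12-period seasonal cycle
noise <- rnorm(120, sd = 5)
additive       <- ts(trend + 20 * seas + noise, frequency = 12)        # constant amplitude
multiplicative <- ts(trend * (1 + 0.3 * seas) + noise, frequency = 12) # amplitude grows with trend
plot(cbind(additive, multiplicative), main = "Additive vs multiplicative seasonality")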
Next we visualise our own time series to see whether it is multiplicative or additive.
ggplotly(
sales %>% as.xts() %>% autoplot() + theme_bw()
)
Based on the chart, we can say our data is multiplicative: the seasonal swings widen over time, behaving like an increasing funnel.
In this part we decompose the data, i.e. separate it into its three components: trend, seasonality, and error.
sales_deco <- decompose(sales, type = "multiplicative")
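As a sanity check on the multiplicative structure, the three components returned by decompose() multiply back to the observed series:
# observed = trend * seasonal * random, so the ratio below is exactly 1
# (NAs appear at the ends, where the centered moving-average trend is undefined)
reconstructed <- sales_deco$trend * sales_deco$seasonal * sales_deco$random
summary(as.numeric(sales / reconstructed))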
Visualise trend data
sales_deco$trend %>% autoplot()
The trend is already smooth and no leftover seasonality shows in the line, which means the seasonality and frequency we set in the ts object fit well; we can call this a single-seasonal series.
Next, visualising the seasonal component:
sales_deco$seasonal %>% autoplot()
The decomposition captures the seasonal pattern well: it shows a repeating pattern with a fixed period. We can interpret the seasonality further by processing the data:
sales_df %>%
  # attach the seasonal component and reduce dates to month names
  mutate(
    seasonal = sales_deco$seasonal,
    date = month(date, label = TRUE, abbr = FALSE)
  ) %>%
  # keep one row per month
  distinct(date, seasonal) %>%
  ggplot(mapping = aes(x = date, y = seasonal)) +
  geom_col() +
  theme_bw()
Based on the plot, sales tend to increase from January until mid-year, then decrease from mid-year until December.
Cross-validation for time series should not be sampled randomly but split sequentially. In this case we take the last 12 observations as test data and keep the rest as training data.
# keep everything except the last 12 observations for training
sales_train <- head(sales, length(sales) - 12)
# hold out the last 12 observations (one seasonal cycle) as test data
sales_test <- tail(sales, 12)
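A quick check that the split is sequential and non-overlapping:
# 52 training points followed by 12 test points, test period strictly after training
length(sales_train)                                      # 52
length(sales_test)                                       # 12
tail(time(sales_train), 1) < head(time(sales_test), 1)   # TRUE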
The moving average model is probably the most naive approach to time series modelling: the next observation is estimated as the mean of the last n observations (here a 6-month window).
sales_sma <- SMA(x = sales_train, n = 6)
sales_sma
## Time Series:
## Start = 1966.83
## End = 1971.08
## Frequency = 12
## [1] NA NA NA NA NA 62.16667 76.33333
## [8] 100.50000 112.33333 119.33333 129.16667 136.00000 133.33333 128.83333
## [15] 136.16667 160.66667 190.66667 246.16667 283.66667 320.00000 330.66667
## [22] 325.33333 321.83333 290.16667 272.33333 247.66667 243.83333 264.33333
## [29] 274.16667 301.50000 353.83333 379.33333 413.83333 436.33333 445.83333
## [36] 432.83333 376.16667 343.83333 300.50000 291.50000 335.16667 426.16667
## [43] 482.16667 553.16667 631.83333 667.33333 638.00000 549.66667 524.16667
## [50] 518.33333 480.83333 488.83333
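To make explicit what SMA(x = sales_train, n = 6) computed above, a minimal check with stats::filter, using equal weights over the current and previous five observations, should reproduce it:
# sides = 1 averages each observation with the five before it, matching SMA(n = 6)
manual_sma <- stats::filter(sales_train, filter = rep(1/6, 6), sides = 1)
head(cbind(TTR = sales_sma, manual = manual_sma), 8)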
autoplot(sales_train) +
autolayer(sales_sma) +
theme_minimal()
In the plot above, we applied the moving average with a 6-month window; the red line is the smoothed series. With the SMA in hand, we forecast by calling forecast() with a horizon equal to the length of sales_test (note that forecast() applied directly to a ts object fits its own model to it, ETS by default in the forecast package).
sales_sma_forecast <- forecast(sales_sma, h = length(sales_test))
We plot it; the model smooths the historical data and then extends that smoothed series as the forecast:
plot_forecast(sales_sma_forecast)
The plot above, however, starts from the model's smoothed series; we can also overlay the forecast on the real training data:
sales_train %>% autoplot() +
autolayer(sales_sma_forecast) +
theme_bw()
Finally, we compare the forecast against the test data and evaluate it with MAPE; the error is around 31.2%.
result <- MAPE(y_pred = sales_sma_forecast$mean, y_true = sales_test)*100
result
## [1] 31.21606
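MLmetrics::MAPE computes the mean absolute percentage error,
\(MAPE = \frac{1}{n} \sum_{t=1}^{n} \left| \frac{y_t - \hat{y}_t}{y_t} \right| \times 100\%\)
which we can recompute by hand as a check:
# mean absolute percentage error, computed directly
mean(abs((sales_test - sales_sma_forecast$mean) / sales_test)) * 100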
result_mape <- data.frame(
"Model" = "Simple Moving Average",
"MAPE" = round(result)
)
Exponential smoothing uses a similar logic to the moving average, but this time a different, decreasing weight is assigned to each observation: less importance is given to observations further in the past.
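For simple exponential smoothing this weighting is explicit in the standard textbook recursion, with smoothing parameter \(\alpha\):
\(\hat{y}_{t+1} = \alpha y_t + (1-\alpha)\hat{y}_t\)
which unrolls to a weight of \(\alpha(1-\alpha)^k\) on the observation \(k\) steps in the past, so the weights decay exponentially.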
Exponential smoothing methods are divided into three types: simple (level only), double (level and trend, Holt's method), and triple (level, trend, and seasonality, the Holt-Winters method).
Based on our exploratory data analysis, the data has both trend and seasonality, so the method we use here is Holt-Winters, also known as triple exponential smoothing.
This method extends double exponential smoothing by adding a seasonal smoothing factor. We know our seasonal pattern is multiplicative, but I will fit both seasonal types to compare the results; the additive recursions are sketched below.
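For reference, the additive Holt-Winters recursions in their standard textbook form, with level \(\ell_t\), trend \(b_t\), seasonal component \(s_t\), seasonal period \(m\), and smoothing parameters \(\alpha, \beta, \gamma\):
\(\ell_t = \alpha (y_t - s_{t-m}) + (1-\alpha)(\ell_{t-1} + b_{t-1})\)
\(b_t = \beta (\ell_t - \ell_{t-1}) + (1-\beta) b_{t-1}\)
\(s_t = \gamma (y_t - \ell_{t-1} - b_{t-1}) + (1-\gamma) s_{t-m}\)
\(\hat{y}_{t+h} = \ell_t + h b_t + s_{t+h-m}\) (for \(h \le m\))
The multiplicative version divides by and multiplies with the seasonal index instead of subtracting and adding it.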
# using HoltWinters function to make model and using "additive" for seasonal setup
sales_triple_ex_add <- HoltWinters(x = sales_train, seasonal = "additive")
sales_triple_ex_add
## Holt-Winters exponential smoothing with trend and additive seasonal component.
##
## Call:
## HoltWinters(x = sales_train, seasonal = "additive")
##
## Smoothing parameters:
## alpha: 0.5265989
## beta : 0
## gamma: 0
##
## Coefficients:
## [,1]
## a 618.748592
## b 15.946241
## s1 55.288194
## s2 160.704861
## s3 34.954861
## s4 74.163194
## s5 -4.545139
## s6 -21.211806
## s7 -25.045139
## s8 -85.586806
## s9 -78.878472
## s10 -59.545139
## s11 -70.003472
## s12 19.704861
# using HoltWinters function to make model and using "multiplicative" for seasonal setup
sales_triple_ex_multi <- HoltWinters(x = sales_train, seasonal = "multiplicative")
sales_triple_ex_multi
## Holt-Winters exponential smoothing with trend and multiplicative seasonal component.
##
## Call:
## HoltWinters(x = sales_train, seasonal = "multiplicative")
##
## Smoothing parameters:
## alpha: 0.2641399
## beta : 0
## gamma: 0.2420101
##
## Coefficients:
## [,1]
## a 698.2807712
## b 15.9462413
## s1 1.2246027
## s2 1.5330102
## s3 1.3123347
## s4 1.4233987
## s5 1.0185449
## s6 0.9477492
## s7 0.8519777
## s8 0.5762602
## s9 0.6458062
## s10 0.7199728
## s11 0.6129171
## s12 1.0530772
We can interpret the smoothing parameters in the output above: alpha smooths the level, beta the trend, and gamma the seasonal component. A value close to 1 means the component reacts strongly to the most recent observations, while a value close to 0 means it relies almost entirely on past estimates; here beta = 0 in both fits, so the estimated trend slope stays essentially constant.
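By default HoltWinters() optimises these parameters itself, but they can also be fixed manually; a sketch with purely illustrative values (not the optimised ones above):
# fixing the smoothing parameters by hand; these values are illustrative only
HoltWinters(x = sales_train, seasonal = "multiplicative",
            alpha = 0.5, beta = 0.1, gamma = 0.3)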
After creating the models, we use them to forecast over a horizon equal to the length of the test set:
forecast_triple_ex_multi <- forecast(object = sales_triple_ex_multi, h = length(sales_test))
forecast_triple_ex_add <- forecast(object = sales_triple_ex_add, h = length(sales_test))
forecast_triple_ex_multi
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 1971.163 874.6443 815.2322 934.0564 783.7813 965.5073
## 1971.247 1119.3630 1045.1790 1193.5471 1005.9084 1232.8177
## 1971.330 979.1585 902.3545 1055.9625 861.6970 1096.6201
## 1971.413 1084.7234 996.4155 1173.0313 949.6681 1219.7786
## 1971.497 792.4402 712.6889 872.1915 670.4711 914.4093
## 1971.580 752.4732 668.3852 836.5613 623.8717 881.0748
## 1971.663 690.0205 603.9452 776.0958 558.3796 821.6614
## 1971.747 475.9049 399.2199 552.5900 358.6253 593.1846
## 1971.830 543.6376 453.8387 633.4366 406.3020 680.9732
## 1971.913 617.5517 513.9650 721.1385 459.1295 775.9740
## 1971.997 535.4992 437.2101 633.7883 385.1790 685.8194
## 1972.080 936.8551 789.1004 1084.6097 710.8838 1162.8263
With the forecasts in hand, we can visualise them and compare the results against sales_test.
test_forecast(actual = sales, # full series
forecast.obj = forecast_triple_ex_multi, # forecast object
train = sales_train, # training data
test = sales_test) # test data
test_forecast(actual = sales, # full series
forecast.obj = forecast_triple_ex_add, # forecast object
train = sales_train, # training data
test = sales_test) # test data
Finally, we compare the forecasts against the test data using MAPE; the error is around 26% for the multiplicative model and 32% for the additive one.
# Calculate MAPE for each holts winters
result_multi <- MAPE(y_pred = forecast_triple_ex_multi$mean, y_true = sales_test)*100
result_add <- MAPE(y_pred = forecast_triple_ex_add$mean, y_true = sales_test)*100
# Result of MAPE
result_multi
## [1] 26.15384
result_add
## [1] 31.91561
# Binding result to compare with other modelling
result <- data.frame(
"Model" = c("Holts Winter Additive", "Holts Winter Multiplicative"),
"MAPE" = c(round(result_add),round(result_multi))
)
result_mape <- rbind(result_mape, result)
While exponential smoothing methods make no assumptions about correlations between successive values of the time series, in some cases you can build a better predictive model by taking those correlations into account. Autoregressive Integrated Moving Average (ARIMA) models include an explicit statistical model for the irregular component of a time series, one which allows for non-zero autocorrelations in that component.
ARIMA models are defined for stationary time series. Therefore, if we start off with a non-stationary time series, we will need to 'difference' it until we obtain a stationary one.
We check whether our time series is stationary using the Augmented Dickey-Fuller (ADF) test.
H0: the data is not stationary
H1: the data is stationary
We reject H0 if the p-value < 0.05.
adf.test(x = sales)
##
## Augmented Dickey-Fuller Test
##
## data: sales
## Dickey-Fuller = -3.1705, Lag order = 3, p-value = 0.1007
## alternative hypothesis: stationary
The p-value is 0.1007 > 0.05, so we fail to reject H0: the data is not stationary. We try differencing once with the diff() function.
# attempt 1 times diff
adf.test(diff(sales))
## Warning in adf.test(diff(sales)): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: diff(sales)
## Dickey-Fuller = -6.0575, Lag order = 3, p-value = 0.01
## alternative hypothesis: stationary
So a single round of differencing is enough to make the data stationary, which suggests the optimal ARIMA model should use d = 1.
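The forecast package can also estimate the required number of differences directly with ndiffs(); it should agree with the single difference we found via the ADF tests above:
# estimate the differencing order via unit-root testing (KPSS by default)
ndiffs(sales_train)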
We build the ARIMA model with auto.arima(), which automatically searches for the optimal model order.
sales_arima_auto <- auto.arima(y = sales_train)
summary(sales_arima_auto)
## Series: sales_train
## ARIMA(0,1,0)
##
## sigma^2 estimated as 13345: log likelihood=-314.59
## AIC=631.18 AICc=631.26 BIC=633.11
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 12.28963 114.4049 80.86656 -0.2405419 24.24669 0.5348317
## ACF1
## Training set -0.1110519
The selected model is ARIMA(0,1,0): as we predicted, the differencing order in the ARIMA(p,d,q) notation is d = 1, with no AR (p = 0) or MA (q = 0) terms.
We forecast with this model and inspect the result:
forecast_arima_auto <- forecast(sales_arima_auto, h = length(sales_test))
plot_forecast(forecast_arima_auto)
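Since ARIMA(0,1,0) without a drift term (none appears in the summary above) is a random walk, the point forecasts stay flat at the last training value, which we can verify:
# a random-walk forecast repeats the last observed value at every horizon
tail(sales_train, 1)           # last training observation
head(forecast_arima_auto$mean) # point forecasts: the same level, repeated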
We compare the forecast against the test data using MAPE; the error is around 32.9%.
result_arima <- MAPE(y_pred = forecast_arima_auto$mean, y_true = sales_test)*100
# Result of MAPE
result_arima
## [1] 32.85827
# Binding result to compare with other modelling
result <- data.frame(
"Model" = c("ARIMA"),
"MAPE" = c(round(result_arima))
)
result_mape <- rbind(result_mape, result)
After getting results from each model (moving average, exponential smoothing, and ARIMA), we check the residual assumptions.
For normality we use shapiro.test(); here we expect to fail to reject H0, i.e. we want a p-value > 0.05.
H0 : residuals are normally distributed
H1 : residuals are not normally distributed
# SMA Model
shapiro.test(sales_sma_forecast$residuals)
##
## Shapiro-Wilk normality test
##
## data: sales_sma_forecast$residuals
## W = 0.95148, p-value = 0.04952
# Triple Exponential Smooting with Multiplicative Seasonal
shapiro.test(forecast_triple_ex_multi$residuals)
##
## Shapiro-Wilk normality test
##
## data: forecast_triple_ex_multi$residuals
## W = 0.97929, p-value = 0.6632
# Triple Exponential Smooting with Additive Seasonal
shapiro.test(forecast_triple_ex_add$residuals)
##
## Shapiro-Wilk normality test
##
## data: forecast_triple_ex_add$residuals
## W = 0.96391, p-value = 0.2273
# ARIMA Model
shapiro.test(forecast_arima_auto$residuals)
##
## Shapiro-Wilk normality test
##
## data: forecast_arima_auto$residuals
## W = 0.96444, p-value = 0.122
Of the four models we check, only the simple moving average has a p-value < 0.05, meaning its residuals are not normally distributed.
To check for the absence of autocorrelation we use the Box.test() function (Box-Pierce test).
H0: no autocorrelation in the forecast errors
H1: there is autocorrelation in the forecast errors
We expect each model's p-value to be > 0.05 so that we fail to reject H0.
# SMA Model
Box.test(x = sales_sma_forecast$residuals)
##
## Box-Pierce test
##
## data: sales_sma_forecast$residuals
## X-squared = 2.1206, df = 1, p-value = 0.1453
# Triple Exponential Smooting with Multiplicative Seasonal
Box.test(x = forecast_triple_ex_multi$residuals)
##
## Box-Pierce test
##
## data: forecast_triple_ex_multi$residuals
## X-squared = 1.182, df = 1, p-value = 0.277
# Triple Exponential Smooting with Additive Seasonal
Box.test(x = forecast_triple_ex_add$residuals)
##
## Box-Pierce test
##
## data: forecast_triple_ex_add$residuals
## X-squared = 0.49319, df = 1, p-value = 0.4825
# ARIMA Model
Box.test(x = forecast_arima_auto$residuals)
##
## Box-Pierce test
##
## data: forecast_arima_auto$residuals
## X-squared = 0.64129, df = 1, p-value = 0.4232
Based on the results, all models have p-values > 0.05, meaning none of them show autocorrelation in the forecast errors.
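As an alternative to running these tests one by one, forecast::checkresiduals() bundles a residual time plot, ACF plot, histogram, and a Ljung-Box test into a single call; for example, for the ARIMA model:
# one-call residual diagnostics (time plot, ACF, histogram, Ljung-Box test)
checkresiduals(sales_arima_auto)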
Let's look at the MAPE results from all the models we built. The Holt-Winters multiplicative model comes out best, which is consistent with the multiplicative pattern we identified during exploration:
result_mape
## Model MAPE
## 1 Simple Moving Average 31
## 2 Holts Winter Additive 32
## 3 Holts Winter Multiplicative 26
## 4 ARIMA 33