knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE
)

Introduction

This dataset provides a comprehensive foundation for conducting in-depth analyses and developing predictive models in the realm of time series forecasting. With its rich array of temporal data, it is particularly well-suited for examining patterns, trends, and seasonal variations over time. Researchers and analysts can leverage this dataset to enhance their understanding of time-dependent processes, improve forecasting accuracy, and explore various time series forecasting techniques and methodologies. The dataset collected from kaggel (https://www.kaggle.com/datasets/erogluegemen/airline-passengers/data)

Study Case:

Analysis of time series & forecasting for total airline passengers.

Data Preparation

Import Library

library(pageviews)
library(dplyr)
library(forecast)
library(plotly)
library(TTR)
library(MLmetrics)
library(tseries)
library(lubridate)

Load Dataset

df <- read.csv("data_input/airline-passengers.csv")
head(df)
##     month total_passengers
## 1 1949-01              112
## 2 1949-02              118
## 3 1949-03              132
## 4 1949-04              129
## 5 1949-05              121
## 6 1949-06              135

Description Columns :

  • Month : The specific month and year of the data point.
  • Total Passengers : The total number of passengers carried by the airline in that month.

Data Preprocessing

# Check missing value
colSums(is.na(df))
##            month total_passengers 
##                0                0

In the dataset above, has no missing value data in any columns.

# Check data range
range(df$month)
## [1] "1949-01" "1960-12"
# Check data summary
summary(df$total_passengers)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   104.0   180.0   265.5   280.3   360.5   622.0
# Check std
sd(df$total_passengers)
## [1] 119.9663

Data Time Series

Data Type Time Series

To process and model time series data in R, we need to convert the data into a time series object using the ts() function.

Data = the observed values of y, which in this case are the views or number of viewers in the Wikipedia data.

Start = the starting time of the time series, which is 1949.

Frequency = the number of observations within one seasonal pattern. Since the seasonal pattern is annual, the frequency is 12.

df_ts <- ts(df$total_passengers, start = c(1949,1), frequency = 12)
autoplot(df_ts, series = "Actual")

Decompose time series data

The main idea of decomposition is to decompose the three components of a time series object (trend, seasonal, residual).

Trend -> represents the overall mean movement.

Seasonal -> refers to the pattern of data per frequency.

Error -> represents the values not accounted for by the Trend and Seasonal components.

To decompose a time series object into these three components, we can use the decompose() function.

df_decomp <- decompose(df_ts)
autoplot(df_decomp)

From the plot results above, it can be said that the total passengers data is Seasonal pattern or recurring pattern. For the trend pattern of this data its showed increasing trend.

Cross Validation

The concept of cross-validation (CV) in time series is the proportion of training data to build the model will be larger than the test data. There are condition that must be met:

  • The training data will use the earliest data.

  • The test data will use the most recent data.

df_train <- df_ts %>% 
  head(8*12)

df_test <- df_ts %>% 
  tail(-8*12)
#This will plot the time series
ts.plot(df_train, xlab="month", ylab="total_passengers", main="Total Passangers 1949-1960")

# This will fit in a line
abline(reg=lm(df_train~time(df_train)))

Build Model

Holt’s Winters Exponential (Triple Exponential Smoothing)

Holt’s Winters Exponential (Triple Exponential Smoothing) method, also known as a suitable forecasting method for data show trend and seasonal effects.

# Buld model Holt's Winter Exponential
model_holt <- HoltWinters(df_train)
# Evaluation Model
model_holt_forecast <- forecast(object = model_holt, h = 48)

df_ts %>% autoplot(series = "df_train")+
  autolayer(df_test, series = "df_test") +
  autolayer(model_holt_forecast$mean, series = "forecast")

accuracy(model_holt_forecast$mean, x = df_test)
##                 ME     RMSE      MAE       MPE    MAPE      ACF1 Theil's U
## Test set -1.251244 26.04453 20.45658 -1.133701 4.85445 0.6835618 0.5402626

According to Lewis (1982), MAPE values can be interpreted into 4 categories as follows:

  • <10% = very accurate.

  • 10-20% = good.

  • 20-50% = reasonable.

Model Interpretation:

  • Based on the plot above, it can be seen that the forecast results follow the pattern of the test data. Its mean this model is good to predicting total passengers for the future.

  • The accuracy results from MAPE and RMSE values indicate “very accurate”. MAPE (Mean Absolute Percentage Error) is 4.8%, and RMSE (Root Mean Squared Error) is 26.04453.

Seasonal ARIMA (SARIMA)

Before conducting modeling using seasonal ARIMA, it is necessary to test for data stationarity. The hypothesis for the kpss.test is as follows:

Testing for data stationarity

Testing with kpss.test using the hypotheses:

  • H0: data is stationary -> p-value > 0.05
  • H1: data is not stationary
df_train %>% kpss.test()
## 
##  KPSS Test for Level Stationarity
## 
## data:  .
## KPSS Level = 2.1737, Truncation lag parameter = 3, p-value = 0.01

From the test results above, it states that the data is stationary with a p-value < 0.05, indicating that differencing is necessary for the data.

# differencing 
df_train %>% diff() %>% kpss.test()
## 
##  KPSS Test for Level Stationarity
## 
## data:  .
## KPSS Level = 0.017976, Truncation lag parameter = 3, p-value = 0.1

After differencing once, the time series data is now stationary (p-value > 0.05) according to the kpss.test.

The number of differencing operations required to make the data stationary will become the value for d in the ARIMA(p,d,q) model.

d = 1 -> from the KPSS test result.

df_train %>% 
  diff(leg=12) %>% 
  diff() %>% 
  tsdisplay(lag.max = 36)

The tsdisplay function is used to determine the possible ARIMA orders that can be formed, where:

  • “tails off”: the line at lag 1 shows a slow decrease.
  • “cuts off lag”: the line at lag 1 shows a sharp decrease.

However, determining the ARIMA order can also be done with auto.arima. The required parameters include the variable used for modeling (y) and the type of modeling, with seasonal = TRUE for seasonal ARIMA.

Model Arima

# Build model of ARIMA with Auto Arima
model_arima_auto <- auto.arima(y = df_train, seasonal = TRUE)
model_arima_auto
## Series: df_train 
## ARIMA(1,1,0)(1,1,0)[12] 
## 
## Coefficients:
##           ar1     sar1
##       -0.2250  -0.2274
## s.e.   0.1076   0.1081
## 
## sigma^2 = 92.5:  log likelihood = -304.98
## AIC=615.97   AICc=616.27   BIC=623.22
# Evaluation Model
model_arima_auto_forecast<- forecast(object = model_arima_auto, h = 48)
plot(model_arima_auto_forecast)

df_ts %>% autoplot(series = "Data_Train")+
  autolayer(df_test, series = "Data_Test") +
  autolayer(model_arima_auto_forecast$mean, series = "Forecast")

accuracy(model_arima_auto_forecast$mean, x = df_test)
##               ME     RMSE      MAE       MPE     MAPE      ACF1 Theil's U
## Test set 6.84904 26.47164 19.50124 0.8396151 4.405293 0.6752726 0.5053222

Model Interpretation:

  • The suitable ARIMA model identified by auto.arima is ARIMA(1,1,0).

  • The forecast results follow the pattern of the test data. It can be said that the ARIMA model is effective in forecasting total passengers.

  • The accuracy results from MAPE and RMSE values indicate “very accurate”. MAPE (Mean Absolute Percentage Error) is 4.4%, and RMSE (Root Mean Squared Error) is 26.47164.

Seasonal Trend with Loess Model (SLTM)

Seasonal Trend with Loess Model (SLTM) is a forecasting technique that combines the strengths of seasonal decomposition and local regression (Loess) methods. Here’s a breakdown of its components:

Seasonal Decomposition: SLTM first decomposes the time series data into its seasonal, trend, and remainder components. This decomposition helps in understanding and modeling the seasonal patterns and long-term trends present in the data.

Local Regression (Loess): After decomposing the time series, SLTM applies local regression, often using Loess, to model the seasonal and trend components separately. Loess (Locally Weighted Scatterplot Smoothing) is a non-parametric regression method that fits multiple regression models in local subsets of the data to capture non-linear relationships.

Forecasting: Once the seasonal and trend components are modeled separately using Loess, SLTM combines these components to generate forecasts for future time periods. By incorporating both seasonal effects and trend changes captured by Loess, SLTM aims to provide accurate forecasts that account for both short-term fluctuations and long-term trends in the data.

In summary, SLTM is a hybrid forecasting approach that leverages seasonal decomposition and local regression techniques to effectively capture and predict the seasonal trends and overall trajectory of time series data.

# Build Model STLM
model_stlm <- stlm(y = df_train, s.window = 48, method = "ets")
#Evaluation Model
model_stlm_forecast <- forecast(model_stlm, h = 48)

df_ts %>% autoplot(series = "Data Training")+
  autolayer(df_test, series = "Data Testing") +
  autolayer(model_stlm_forecast$mean, series = "forecast")

accuracy(model_stlm_forecast$mean, x = df_test)
##                ME     RMSE      MAE      MPE    MAPE      ACF1 Theil's U
## Test set 86.06176 104.4415 86.25841 19.07897 19.1443 0.8496419  1.994802

Model Interpretation:

  • Forecast results follow the pattern of the test data. It can be said that the STLM model is capable of forecasting seasonal patterns but is less effective in forecasting the trend from the data.

  • The accuracy results from MAPE and RMSE result indicate “good” category with the resulting values: MAPE (Mean Absolute Percentage Error) is 19,14%, and RMSE (Root Mean Squared Error) is 104.4415.

Conclusion

  • The selection of the best model is based on accuracy. The accuracy measures used are MAPE (Mean Absolute Percentage Error) and RMSE (Root Mean Squared Error).

  • The model that provides the smallest RMSE and MAPE values is the Auto Arima model “model_arima_auto_forecast” with MAPE (4.4%) and RMSE (26.47164).

  • Below is the difference between the actual volume and the forecasted volume:

result <- data.frame(
   Forecast  = model_arima_auto_forecast$mean,
   total_passengers_Real = df_test
)

Percentage <- with(result, (abs(Forecast - total_passengers_Real)/total_passengers_Real)*100)
Fore_per_Actual <- with(result, (Forecast/total_passengers_Real))
Fore_per_Actual <- round(Fore_per_Actual, 2)
Percentage <- round(Percentage, 2)
result_Forecast <- data.frame(result,Fore_per_Actual, Percentage)
result_Forecast
##    Forecast total_passengers_Real Fore_per_Actual Percentage
## 1  314.1742                   315            1.00       0.26
## 2  306.5884                   301            1.02       1.86
## 3  345.2537                   356            0.97       3.02
## 4  342.6112                   348            0.98       1.55
## 5  346.7032                   355            0.98       2.34
## 6  400.2019                   422            0.95       5.17
## 7  441.4756                   465            0.95       5.06
## 8  431.4293                   467            0.92       7.62
## 9  384.8398                   404            0.95       4.74
## 10 338.3408                   347            0.98       2.50
## 11 302.8861                   305            0.99       0.69
## 12 339.2503                   336            1.01       0.97
## 13 346.9301                   340            1.02       2.04
## 14 339.4775                   318            1.07       6.75
## 15 378.4463                   362            1.05       4.54
## 16 375.4951                   348            1.08       7.90
## 17 379.7936                   363            1.05       4.63
## 18 433.8610                   435            1.00       0.26
## 19 474.6177                   491            0.97       3.34
## 20 465.0367                   505            0.92       7.91
## 21 417.6718                   404            1.03       3.38
## 22 370.6041                   359            1.03       3.23
## 23 335.2528                   310            1.08       8.15
## 24 371.3068                   337            1.10      10.18
## 25 379.0990                   360            1.05       5.31
## 26 371.6161                   342            1.09       8.66
## 27 410.5159                   406            1.01       1.11
## 28 407.6350                   396            1.03       2.94
## 29 411.8865                   420            0.98       1.93
## 30 465.8246                   472            0.99       1.31
## 31 506.6989                   548            0.92       7.54
## 32 497.0120                   559            0.89      11.09
## 33 449.8234                   463            0.97       2.85
## 34 402.8851                   407            0.99       1.01
## 35 367.5102                   362            1.02       1.52
## 36 403.6348                   405            1.00       0.34
## 37 411.4014                   417            0.99       1.34
## 38 403.9254                   391            1.03       3.31
## 39 442.8409                   419            1.06       5.69
## 40 439.9440                   461            0.95       4.57
## 41 444.2062                   472            0.94       5.89
## 42 498.1737                   535            0.93       6.88
## 43 539.0212                   622            0.87      13.34
## 44 529.3585                   606            0.87      12.65
## 45 482.1297                   508            0.95       5.09
## 46 435.1620                   461            0.94       5.60
## 47 399.7925                   390            1.03       2.51
## 48 435.9010                   432            1.01       0.90