knitr::opts_chunk$set(
message = FALSE,
warning = FALSE
)
This dataset provides a comprehensive foundation for conducting in-depth analyses and developing predictive models in the realm of time series forecasting. With its rich array of temporal data, it is particularly well-suited for examining patterns, trends, and seasonal variations over time. Researchers and analysts can leverage this dataset to enhance their understanding of time-dependent processes, improve forecasting accuracy, and explore various time series forecasting techniques and methodologies. The dataset collected from kaggel (https://www.kaggle.com/datasets/erogluegemen/airline-passengers/data)
Analysis of time series & forecasting for total airline passengers.
library(pageviews)
library(dplyr)
library(forecast)
library(plotly)
library(TTR)
library(MLmetrics)
library(tseries)
library(lubridate)
df <- read.csv("data_input/airline-passengers.csv")
head(df)
## month total_passengers
## 1 1949-01 112
## 2 1949-02 118
## 3 1949-03 132
## 4 1949-04 129
## 5 1949-05 121
## 6 1949-06 135
Description Columns :
# Check missing value
colSums(is.na(df))
## month total_passengers
## 0 0
In the dataset above, has no missing value data in any columns.
# Check data range
range(df$month)
## [1] "1949-01" "1960-12"
# Check data summary
summary(df$total_passengers)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 104.0 180.0 265.5 280.3 360.5 622.0
# Check std
sd(df$total_passengers)
## [1] 119.9663
To process and model time series data in R, we need to convert the data into a time series object using the ts() function.
Data = the observed values of y, which in this case are the views or number of viewers in the Wikipedia data.
Start = the starting time of the time series, which is 1949.
Frequency = the number of observations within one seasonal pattern. Since the seasonal pattern is annual, the frequency is 12.
df_ts <- ts(df$total_passengers, start = c(1949,1), frequency = 12)
autoplot(df_ts, series = "Actual")
The main idea of decomposition is to decompose the three components of a time series object (trend, seasonal, residual).
Trend -> represents the overall mean movement.
Seasonal -> refers to the pattern of data per frequency.
Error -> represents the values not accounted for by the Trend and Seasonal components.
To decompose a time series object into these three components, we can use the decompose() function.
df_decomp <- decompose(df_ts)
autoplot(df_decomp)
From the plot results above, it can be said that the total passengers data is Seasonal pattern or recurring pattern. For the trend pattern of this data its showed increasing trend.
The concept of cross-validation (CV) in time series is the proportion of training data to build the model will be larger than the test data. There are condition that must be met:
The training data will use the earliest data.
The test data will use the most recent data.
df_train <- df_ts %>%
head(8*12)
df_test <- df_ts %>%
tail(-8*12)
#This will plot the time series
ts.plot(df_train, xlab="month", ylab="total_passengers", main="Total Passangers 1949-1960")
# This will fit in a line
abline(reg=lm(df_train~time(df_train)))
Holt’s Winters Exponential (Triple Exponential Smoothing) method, also known as a suitable forecasting method for data show trend and seasonal effects.
# Buld model Holt's Winter Exponential
model_holt <- HoltWinters(df_train)
# Evaluation Model
model_holt_forecast <- forecast(object = model_holt, h = 48)
df_ts %>% autoplot(series = "df_train")+
autolayer(df_test, series = "df_test") +
autolayer(model_holt_forecast$mean, series = "forecast")
accuracy(model_holt_forecast$mean, x = df_test)
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set -1.251244 26.04453 20.45658 -1.133701 4.85445 0.6835618 0.5402626
According to Lewis (1982), MAPE values can be interpreted into 4 categories as follows:
<10% = very accurate.
10-20% = good.
20-50% = reasonable.
Model Interpretation:
Based on the plot above, it can be seen that the forecast results follow the pattern of the test data. Its mean this model is good to predicting total passengers for the future.
The accuracy results from MAPE and RMSE values indicate “very accurate”. MAPE (Mean Absolute Percentage Error) is 4.8%, and RMSE (Root Mean Squared Error) is 26.04453.
Before conducting modeling using seasonal ARIMA, it is necessary to
test for data stationarity. The hypothesis for the
kpss.test is as follows:
Testing for data stationarity
Testing with kpss.test using the hypotheses:
df_train %>% kpss.test()
##
## KPSS Test for Level Stationarity
##
## data: .
## KPSS Level = 2.1737, Truncation lag parameter = 3, p-value = 0.01
From the test results above, it states that the data is stationary with a p-value < 0.05, indicating that differencing is necessary for the data.
# differencing
df_train %>% diff() %>% kpss.test()
##
## KPSS Test for Level Stationarity
##
## data: .
## KPSS Level = 0.017976, Truncation lag parameter = 3, p-value = 0.1
After differencing once, the time series data is now stationary (p-value > 0.05) according to the kpss.test.
The number of differencing operations required to make the data stationary will become the value for d in the ARIMA(p,d,q) model.
d = 1 -> from the KPSS test result.
df_train %>%
diff(leg=12) %>%
diff() %>%
tsdisplay(lag.max = 36)
The tsdisplay function is used to determine the possible
ARIMA orders that can be formed, where:
However, determining the ARIMA order can also be done with
auto.arima. The required parameters include the variable
used for modeling (y) and the type of modeling, with
seasonal = TRUE for seasonal ARIMA.
# Build model of ARIMA with Auto Arima
model_arima_auto <- auto.arima(y = df_train, seasonal = TRUE)
model_arima_auto
## Series: df_train
## ARIMA(1,1,0)(1,1,0)[12]
##
## Coefficients:
## ar1 sar1
## -0.2250 -0.2274
## s.e. 0.1076 0.1081
##
## sigma^2 = 92.5: log likelihood = -304.98
## AIC=615.97 AICc=616.27 BIC=623.22
# Evaluation Model
model_arima_auto_forecast<- forecast(object = model_arima_auto, h = 48)
plot(model_arima_auto_forecast)
df_ts %>% autoplot(series = "Data_Train")+
autolayer(df_test, series = "Data_Test") +
autolayer(model_arima_auto_forecast$mean, series = "Forecast")
accuracy(model_arima_auto_forecast$mean, x = df_test)
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set 6.84904 26.47164 19.50124 0.8396151 4.405293 0.6752726 0.5053222
Model Interpretation:
The suitable ARIMA model identified by auto.arima is
ARIMA(1,1,0).
The forecast results follow the pattern of the test data. It can be said that the ARIMA model is effective in forecasting total passengers.
The accuracy results from MAPE and RMSE values indicate “very accurate”. MAPE (Mean Absolute Percentage Error) is 4.4%, and RMSE (Root Mean Squared Error) is 26.47164.
Seasonal Trend with Loess Model (SLTM) is a forecasting technique that combines the strengths of seasonal decomposition and local regression (Loess) methods. Here’s a breakdown of its components:
Seasonal Decomposition: SLTM first decomposes the time series data into its seasonal, trend, and remainder components. This decomposition helps in understanding and modeling the seasonal patterns and long-term trends present in the data.
Local Regression (Loess): After decomposing the time series, SLTM applies local regression, often using Loess, to model the seasonal and trend components separately. Loess (Locally Weighted Scatterplot Smoothing) is a non-parametric regression method that fits multiple regression models in local subsets of the data to capture non-linear relationships.
Forecasting: Once the seasonal and trend components are modeled separately using Loess, SLTM combines these components to generate forecasts for future time periods. By incorporating both seasonal effects and trend changes captured by Loess, SLTM aims to provide accurate forecasts that account for both short-term fluctuations and long-term trends in the data.
In summary, SLTM is a hybrid forecasting approach that leverages seasonal decomposition and local regression techniques to effectively capture and predict the seasonal trends and overall trajectory of time series data.
# Build Model STLM
model_stlm <- stlm(y = df_train, s.window = 48, method = "ets")
#Evaluation Model
model_stlm_forecast <- forecast(model_stlm, h = 48)
df_ts %>% autoplot(series = "Data Training")+
autolayer(df_test, series = "Data Testing") +
autolayer(model_stlm_forecast$mean, series = "forecast")
accuracy(model_stlm_forecast$mean, x = df_test)
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set 86.06176 104.4415 86.25841 19.07897 19.1443 0.8496419 1.994802
Model Interpretation:
Forecast results follow the pattern of the test data. It can be said that the STLM model is capable of forecasting seasonal patterns but is less effective in forecasting the trend from the data.
The accuracy results from MAPE and RMSE result indicate “good” category with the resulting values: MAPE (Mean Absolute Percentage Error) is 19,14%, and RMSE (Root Mean Squared Error) is 104.4415.
The selection of the best model is based on accuracy. The accuracy measures used are MAPE (Mean Absolute Percentage Error) and RMSE (Root Mean Squared Error).
The model that provides the smallest RMSE and MAPE values is the Auto Arima model “model_arima_auto_forecast” with MAPE (4.4%) and RMSE (26.47164).
Below is the difference between the actual volume and the forecasted volume:
result <- data.frame(
Forecast = model_arima_auto_forecast$mean,
total_passengers_Real = df_test
)
Percentage <- with(result, (abs(Forecast - total_passengers_Real)/total_passengers_Real)*100)
Fore_per_Actual <- with(result, (Forecast/total_passengers_Real))
Fore_per_Actual <- round(Fore_per_Actual, 2)
Percentage <- round(Percentage, 2)
result_Forecast <- data.frame(result,Fore_per_Actual, Percentage)
result_Forecast
## Forecast total_passengers_Real Fore_per_Actual Percentage
## 1 314.1742 315 1.00 0.26
## 2 306.5884 301 1.02 1.86
## 3 345.2537 356 0.97 3.02
## 4 342.6112 348 0.98 1.55
## 5 346.7032 355 0.98 2.34
## 6 400.2019 422 0.95 5.17
## 7 441.4756 465 0.95 5.06
## 8 431.4293 467 0.92 7.62
## 9 384.8398 404 0.95 4.74
## 10 338.3408 347 0.98 2.50
## 11 302.8861 305 0.99 0.69
## 12 339.2503 336 1.01 0.97
## 13 346.9301 340 1.02 2.04
## 14 339.4775 318 1.07 6.75
## 15 378.4463 362 1.05 4.54
## 16 375.4951 348 1.08 7.90
## 17 379.7936 363 1.05 4.63
## 18 433.8610 435 1.00 0.26
## 19 474.6177 491 0.97 3.34
## 20 465.0367 505 0.92 7.91
## 21 417.6718 404 1.03 3.38
## 22 370.6041 359 1.03 3.23
## 23 335.2528 310 1.08 8.15
## 24 371.3068 337 1.10 10.18
## 25 379.0990 360 1.05 5.31
## 26 371.6161 342 1.09 8.66
## 27 410.5159 406 1.01 1.11
## 28 407.6350 396 1.03 2.94
## 29 411.8865 420 0.98 1.93
## 30 465.8246 472 0.99 1.31
## 31 506.6989 548 0.92 7.54
## 32 497.0120 559 0.89 11.09
## 33 449.8234 463 0.97 2.85
## 34 402.8851 407 0.99 1.01
## 35 367.5102 362 1.02 1.52
## 36 403.6348 405 1.00 0.34
## 37 411.4014 417 0.99 1.34
## 38 403.9254 391 1.03 3.31
## 39 442.8409 419 1.06 5.69
## 40 439.9440 461 0.95 4.57
## 41 444.2062 472 0.94 5.89
## 42 498.1737 535 0.93 6.88
## 43 539.0212 622 0.87 13.34
## 44 529.3585 606 0.87 12.65
## 45 482.1297 508 0.95 5.09
## 46 435.1620 461 0.94 5.60
## 47 399.7925 390 1.03 2.51
## 48 435.9010 432 1.01 0.90