Introduction and Objective
This RMarkdown was created as a learning exercise in building a good time series model to forecast the number of tourist arrivals to Indonesia, using data up to the year 2017. We use data from Kaggle, which provides a dataset consisting of 216 rows and 27 columns.
Hopefully, by learning the time pattern of tourist arrivals, the model can be used to forecast several years ahead (beyond 2017). Nevertheless, considering what 2020 and 2021 look like, the selected time series model should be retrained with the historical data from those years: because of the Covid-19 pandemic, cross-border travel is restricted and the tourism sector of every country is affected.
Libraries Used
Below are the libraries we used:
library(tidyverse)
library(lubridate)
library(forecast)
library(TTR)
library(tseries)
library(TSstudio)
library(padr)
library(MLmetrics)
Read Data and Exploratory Data Analysis
Let's read the dataset, save it into the tourism object, then peek at the contents using the glimpse() function.
tourism <- read.csv("data_input/Top Country by Tourism number of arrivals 1995-2017.csv")
glimpse(tourism)
#> Rows: 216
#> Columns: 27
#> $ Country.Name <chr> "Aruba", "Afghanistan", "Angola", "Albania", "Andorra"...
#> $ region <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,...
#> $ Image.URL <chr> "", "", "", "https://www.countryflags.io/al/flat/64.pn...
#> $ X1995 <chr> "619", "", "9", "304", "", "2,315,000", "2,289,000", "...
#> $ X1995.1 <chr> "619", "", "9", "304", "", "2,315,000", "2,289,000", "...
#> $ X1996 <chr> "641", "", "21", "287", "", "2,572,000", "2,614,000", ...
#> $ X1997 <chr> "650", "", "45", "119", "", "2,476,000", "2,764,000", ...
#> $ X1998 <chr> "647", "", "52", "184", "", "2,991,000", "3,012,000", ...
#> $ X1999 <chr> "683", "", "45", "371", "2,347,000", "3,393,000", "2,8...
#> $ X2000 <chr> "721", "", "51", "317", "2,949,000", "3,907,000", "2,9...
#> $ X2001 <chr> "691", "", "67", "354", "3,516,000", "4,134,000", "2,6...
#> $ X2002 <chr> "643", "", "91", "470", "3,387,000", "5,445,000", "2,8...
#> $ X2003 <chr> "642", "", "107", "557", "3,138,000", "5,871,000", "2,...
#> $ X2004 <chr> "728", "", "194", "645", "2,791,000", "6,195,000", "3,...
#> $ X2005 <chr> "733", "", "210", "748", "2,418,000", "7,126,000", "3,...
#> $ X2006 <chr> "694", "", "121", "937", "2,227,000", "", "4,173,000",...
#> $ X2007 <chr> "772", "", "195", "1,062,000", "2,189,000", "", "4,562...
#> $ X2008 <chr> "827", "", "294", "1,247,000", "2,059,000", "", "4,700...
#> $ X2009 <chr> "813", "", "366", "1,711,000", "1,830,000", "", "4,308...
#> $ X2010 <chr> "824", "", "425", "2,191,000", "1,808,000", "", "5,325...
#> $ X2011 <chr> "869", "", "481", "2,469,000", "2,242,000", "", "5,705...
#> $ X2012 <chr> "904", "", "528", "3,156,000", "2,238,000", "", "5,587...
#> $ X2013 <chr> "979", "", "650", "2,857,000", "2,328,000", "", "5,246...
#> $ X2014 <chr> "1,072,000", "", "595", "3,341,000", "2,363,000", "", ...
#> $ X2015 <chr> "1,225,000", "", "592", "3,784,000", "2,663,000", "", ...
#> $ X2016 <chr> "1,102,000", "", "397", "4,070,000", "2,831,000", "", ...
#> $ X2017 <chr> "1,070,500", "", "261", "4,643,000", "", "", "6,720,00...
Below is the description of each column in the dataset:
- Country.Name: country name
- region: region index
- Image.URL: URL of the country flag image
- X1995, …, X2017: number of tourism arrivals for the period 1995 - 2017
Data Pre-Processing
After taking a look at the dataset, we subset it to the country Indonesia only, deselect irrelevant columns such as the region and image URL, and pivot the year columns into a single column to get a simplified data frame.
tourism_long <- tourism %>%
  filter(Country.Name == "Indonesia") %>%
  select(-c(region, Image.URL, X1995.1, Country.Name)) %>%
  pivot_longer(cols = X1995:X2017, names_to = "year", values_to = "value")
summary(tourism_long)
#>      year              value
#>  Length:23          Length:23
#>  Class :character   Class :character
#>  Mode  :character   Mode  :character
After simplifying the dataset, we have 23 rows and 2 columns with the following details:
- year: the years 1995 - 2017, pivoted into one column
- value: number of tourism arrivals
Then, we need to remove the comma separators from the value column and strip the X prefix from the year column.
tourism_long$value <- as.numeric(gsub(",","",tourism_long$value))
tourism_long$year <- as.numeric(gsub("X","",tourism_long$year))
Let's see the result of these changes.
tourism_long
#> # A tibble: 23 x 2
#> year value
#> <dbl> <dbl>
#> 1 1995 4324000
#> 2 1996 5034000
#> 3 1997 5185000
#> 4 1998 4606000
#> 5 1999 4728000
#> 6 2000 5064000
#> 7 2001 5153000
#> 8 2002 5033000
#> 9 2003 4467000
#> 10 2004 5321000
#> # ... with 13 more rows
After these adjustments, we check whether the dataset has any NA values.
tourism_long %>%
  is.na() %>%
  colSums()
#>  year value
#>     0     0
Now, the dataset is ready to be used for forecasting since it already fulfills the following:
1. No NA or missing values.
2. Years are ordered by time.
3. Complete yearly interval from 1995 to 2017.
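A quick programmatic check of those three points (a minimal sketch using base R; the expected results are noted in the comments) could look like this:
# Verify the three readiness conditions programmatically
anyNA(tourism_long)               # expect FALSE: no missing values
all(diff(tourism_long$year) == 1) # expect TRUE: consecutive yearly interval
range(tourism_long$year)          # expect 1995 and 2017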
Time Series Object Creation
Let’s create the time series object from the tourism_long data frame then visualize it.
tourism_ts <- ts(data = tourism_long$value,
                 start = 1995,
                 end = 2017,
                 frequency = 1) # yearly data, so no intra-year seasonality
tourism_ts %>%
  autoplot()
From the plot, we can see that the time series object tourism_ts doesn't show any seasonality (with yearly data and frequency = 1 there is no seasonal component to capture), but it does show a trend.
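To back up the visual impression of a trend, one option (a sketch only; the 3-year window for forecast::ma() is an arbitrary choice, not part of the original analysis) is to overlay a centred moving average on the series:
# Overlay a 3-year centred moving average to highlight the trend
tourism_ma <- ma(tourism_ts, order = 3)
autoplot(tourism_ts) +
  autolayer(tourism_ma, series = "3-year moving average")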
Cross-Validation
Then, we split the dataset into train and test sets. The test set contains the last 5 years of data (2013 - 2017), and the rest is the train set.
tourism_test <- tail(tourism_ts, 5)
tourism_train <- head(tourism_ts, length(tourism_ts) - length(tourism_test))
Now let's create several models using Holt's exponential smoothing and ARIMA.
Holt’s Exponential Smoothing
We use Holt's Exponential Smoothing (Double Exponential Smoothing) as the forecasting model because the dataset has no seasonality (only error and trend components).
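For reference, Holt's method maintains a level and a trend component. The standard formulation is shown below, where $\ell_t$ is the level, $b_t$ is the trend, and $\alpha$, $\beta$ are the smoothing parameters that HoltWinters() estimates from the data:

$$
\begin{aligned}
\ell_t &= \alpha\, y_t + (1 - \alpha)(\ell_{t-1} + b_{t-1}) \\
b_t &= \beta\, (\ell_t - \ell_{t-1}) + (1 - \beta)\, b_{t-1} \\
\hat{y}_{t+h} &= \ell_t + h\, b_t
\end{aligned}
$$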
Model Fitting
We create the model by setting the parameter gamma = F, which turns off the seasonal component.
tourism_hw <- HoltWinters(x = tourism_train, gamma = F)
Forecast & Model Evaluation
Then we use the model created for forecasting and evaluate the model performance.
tourism_hw_f <- forecast(object = tourism_hw, h = length(tourism_test))
accuracy(tourism_hw_f, tourism_test)
#> ME RMSE MAE MPE MAPE MASE
#> Training set -38596.34 522086 421798.6 -1.399966 8.005484 1.002878
#> Test set 1308745.15 1769056 1308745.2 10.712206 10.712206 3.111702
#> ACF1 Theil's U
#> Training set 0.03716165 NA
#> Test set 0.23913136 1.289665
Let’s also create a visualization to compare the actual, forecasted and the fitted data.
test_forecast(actual = tourism_ts, forecast.obj = tourism_hw_f, train = tourism_train, test = tourism_test)
ARIMA
Next, we will create several models using the ARIMA algorithm. Before building the models, we have to check whether the data is stationary.
ADF test
We use the Augmented Dickey-Fuller (ADF) test via adf.test() with the following hypotheses:
- H0: the series has a unit root (data is not stationary)
- H1: the series does not have a unit root (data is stationary)
If the p-value < 0.05 (alpha), the data is stationary.
adf.test(tourism_train, alternative = "stationary")
#>
#> Augmented Dickey-Fuller Test
#>
#> data: tourism_train
#> Dickey-Fuller = 0.56756, Lag order = 2, p-value = 0.99
#> alternative hypothesis: stationary
From the test, the p-value = 0.99, which is > 0.05, so the data is not stationary. We try to make the data stationary by differencing it once.
# diff 1x
diff(tourism_train, differences = 1) %>% adf.test()
#>
#> Augmented Dickey-Fuller Test
#>
#> data: .
#> Dickey-Fuller = -2.6551, Lag order = 2, p-value = 0.3228
#> alternative hypothesis: stationary
Since the p-value is still above 0.05, let's try differencing the data twice.
# diff 2x
diff(tourism_train, differences = 2) %>% adf.test()
#>
#> Augmented Dickey-Fuller Test
#>
#> data: .
#> Dickey-Fuller = -4.9491, Lag order = 2, p-value = 0.01
#> alternative hypothesis: stationary
Now the p-value = 0.01, which is < 0.05, so the data is stationary and we can move on.
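As a cross-check (a sketch; ndiffs() comes from the already-loaded forecast package and may disagree slightly with the manual ADF sequence above because its default test is KPSS), we can let forecast estimate the required differencing order directly:
# Estimate how many differences are needed for stationarity
ndiffs(tourism_train)               # default: KPSS test
ndiffs(tourism_train, test = "adf") # using the ADF test instead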
Model Fitting
tsdisplay(diff(tourism_train, differences = 2))
From the first 5 lags, we can determine p and q from the PACF and ACF plots by counting how many lags cross the significance threshold (a programmatic version of this count is sketched after the list below):
- p = 0 (no lags in the PACF plot passed the threshold)
- d = 2 (we differenced twice to obtain stationary data)
- q = 0 (no lags in the ACF plot passed the threshold)
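As referenced above, the lag counting can also be done programmatically (a rough sketch; the ±1.96/√n bound is the usual large-sample approximation of the dashed threshold lines, and d2 is just a helper name introduced here):
# Count how many lags cross the approximate 95% significance bound
d2 <- diff(tourism_train, differences = 2)
threshold <- 1.96 / sqrt(length(d2))
sum(abs(Acf(d2, plot = FALSE)$acf[-1]) > threshold) # candidate q (ACF, excluding lag 0)
sum(abs(Pacf(d2, plot = FALSE)$acf) > threshold)    # candidate p (PACF)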
We'll try ARIMA modeling with different combinations of the (p, d, q) values. Some combinations we can try:
- ARIMA (0,1,0)
- ARIMA (0,2,0)
#modeling
tourism_arima1 <- Arima(y = tourism_train, order = c(0,1,0))
tourism_arima2 <- Arima(y = tourism_train, order = c(0,2,0))
After modeling, we'll check the AIC of each ARIMA model.
tourism_arima1$aic
#> [1] 496.0283
tourism_arima2$aic
#> [1] 475.1592
From the AIC values, we can see that model tourism_arima2 with (p,d,q) = (0,2,0) is the better ARIMA model by that criterion. We'll also check which ARIMA model is best by MAPE (error evaluation).
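For reference, MAPE (mean absolute percentage error) is defined as:

$$
\text{MAPE} = \frac{100\%}{n} \sum_{t=1}^{n} \left| \frac{y_t - \hat{y}_t}{y_t} \right|
$$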
accuracy(tourism_arima1)
#> ME RMSE MAE MPE MAPE MASE ACF1
#> Training set 206906.9 480569.7 397462.4 3.061196 7.138125 0.9450156 -0.04862854
accuracy(tourism_arima2)
#> ME RMSE MAE MPE MAPE MASE
#> Training set -17682.22 602231.1 459119.3 -0.5160297 8.678039 1.091612
#> ACF1
#> Training set -0.4002891
If we compare by MAPE, we see that tourism_arima1 is the better model because its error percentage is 7.14%. We can also use an auto-ARIMA model.
tourism_auto <- auto.arima(tourism_train, seasonal = F)
summary(tourism_auto)
#> Series: tourism_train
#> ARIMA(0,1,0) with drift
#>
#> Coefficients:
#> drift
#> 218823.5
#> s.e. 107551.4
#>
#> sigma^2 estimated as 208939082690: log likelihood=-245.16
#> AIC=494.32 AICc=495.18 BIC=495.99
#>
#> Training set error measures:
#> ME RMSE MAE MPE MAPE MASE
#> Training set 228.0652 430956.6 359384.9 -0.7385359 6.758032 0.8544816
#> ACF1
#> Training set -0.01646288
Then, we'll compare the tourism_arima1 and tourism_auto (auto-ARIMA) models.
# AIC arima 1
tourism_arima1$aic
#> [1] 496.0283
# AIC auto arima
tourism_auto$aic
#> [1] 494.3235
# MAPE arima 1
accuracy(tourism_arima1)
#> ME RMSE MAE MPE MAPE MASE ACF1
#> Training set 206906.9 480569.7 397462.4 3.061196 7.138125 0.9450156 -0.04862854
# MAPE auto arima
accuracy(tourism_auto)
#> ME RMSE MAE MPE MAPE MASE
#> Training set 228.0652 430956.6 359384.9 -0.7385359 6.758032 0.8544816
#> ACF1
#> Training set -0.01646288
- From the AIC comparison, we see that tourism_auto is better than tourism_arima1 because tourism_arima1 has a higher AIC value.
- From MAPE, we see that tourism_auto is also better than tourism_arima1 because its error percentage based on MAPE is only 6.76%.
Forecast & Model Evaluation
Then we use the tourism_auto model for forecasting and evaluate the model performance.
tourism_auto_f <- forecast(tourism_auto, h = length(tourism_test))
accuracy(tourism_auto_f, tourism_test)
#> ME RMSE MAE MPE MAPE MASE
#> Training set 228.0652 430956.6 359384.9 -0.7385359 6.758032 0.8544816
#> Test set 2140129.4118 2641921.9 2140129.4 18.0220970 18.022097 5.0884196
#> ACF1 Theil's U
#> Training set -0.01646288 NA
#> Test set 0.29042570 1.958076
Let’s also create a visualization to compare the actual, forecasted and the fitted data.
test_forecast(actual = tourism_ts, forecast.obj = tourism_auto_f, train = tourism_train, test = tourism_test)
Model Selection
Let's compare the error values of the tourism_hw and tourism_auto models.
data.frame(Holts_MAPE = MAPE(tourism_hw_f$mean, tourism_test)*100,
Auto_Arima_MAPE = MAPE(tourism_auto_f$mean, tourism_test)*100
)
#> Holts_MAPE Auto_Arima_MAPE
#> 1 10.71221 18.0221
There is some information we can get from the MAPE values:
- Model tourism_auto is better at forecasting the train dataset with an error of 6.76%, but when forecasting the test dataset it gets an error of 18.02%.
- Model tourism_hw is better at forecasting the test dataset with an error of 10.71%, while on the train dataset its error is 8.00%.
Therefore, based on the information above, the tourism_hw model is the best model to use: the gap between its train and test errors is not too large, and its forecast on the test dataset has a lower error percentage than that of the tourism_auto model.
Assumption Checking
Let’s check the autocorrelation and residual distribution for model tourism_hw.
No-autocorrelation Test Using Box.test() Function
Hypothesis:
- H0 : No autocorrelation in the forecast errors
- H1 : there is autocorrelation in the forecast errors
If the p-value > 0.05 (alpha), the forecast errors have no autocorrelation.
Box.test(tourism_hw_f$residuals, type = "Ljung-Box")
#>
#> Box-Ljung test
#>
#> data: tourism_hw_f$residuals
#> X-squared = 0.026515, df = 1, p-value = 0.8706
Based on the Box-Ljung test result, the tourism_hw model has a p-value of 0.87 (> 0.05); thus, there is no autocorrelation in the forecast errors.
Normality of The Residual Test Using shapiro.test() Function
Hypothesis:
- H0 : residuals are normally distributed
- H1 : residuals are not normally distributed
If the p-value > 0.05 (alpha), the residuals are normally distributed.
shapiro.test(x = tourism_hw_f$residuals)
#>
#> Shapiro-Wilk normality test
#>
#> data: tourism_hw_f$residuals
#> W = 0.98065, p-value = 0.9685
Based on the shapiro.test result, the tourism_hw model has a p-value of 0.97 (> 0.05); thus, the residuals are normally distributed.
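As an optional extra diagnostic (a sketch; checkresiduals() is part of the already-loaded forecast package), both checks can be complemented with a single call that plots the residuals, their ACF, and a histogram, and reports a Ljung-Box test:
# Combined residual diagnostics for the Holt's model
checkresiduals(tourism_hw_f)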
Conclusion
From the several models we created, the Holt's Exponential Smoothing model, tourism_hw, has the best MAPE on the test dataset at 10.71%. Furthermore, the model passes both the no-autocorrelation and normality tests.
Hopefully, by learning the time pattern of tourist arrivals, the tourism_hw model can be used to forecast the years after 2017. Nevertheless, considering what 2020 and 2021 look like, the tourism_hw model should be retrained with the historical data from those two years: because of the Covid-19 pandemic, cross-border travel is restricted and the tourism sector of every country is affected.
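As a closing illustration (a sketch only; the 3-year horizon is an arbitrary choice, and as noted above the pandemic years would require retraining with new data), the selected specification could be refit on the full 1995 - 2017 series before forecasting ahead:
# Refit Holt's model on the full series and forecast 3 years ahead (2018 - 2020)
tourism_hw_full <- HoltWinters(x = tourism_ts, gamma = F)
tourism_hw_full_f <- forecast(tourism_hw_full, h = 3)
autoplot(tourism_hw_full_f)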