Forecasting Indonesia Tourism Number of Arrivals Based on 1995-2017 Data

Margareth Devina

11 April 2021

Introduction and Objective

This RMarkdown is created to increase our knowledge and experience as we learn by building a good time series model to forecast number of tourist arrived to Indonesia before year 2018. We will use data from Kaggle. Here we already provided with a dataset consists 216 rows and 27 columns.

Hopefully, by learning the time pattern of the tourist arrival, the model can be used to forecast several years ahead (after 2018). Nevertheless, considering how year 2020 and 2021 looks like, we think the selected time series model should be tuned up again with the historical data from those years. The reason is due to Covid-19 pandemic, cross-border traveling is restricted thus, tourism sector from all countries are affected.

Libraries Used

Below are the libraries we used:

library(tidyverse)
library(lubridate)
library(forecast)
library(TTR)
library(tseries)
library(TSstudio)
library(padr)
library(MLmetrics)

Read Data and Exploratory Data Analysis

Let’s read the dataset and saved it into tourism object then peek the content using glimpse() function.

tourism <- read.csv("data_input/Top Country by Tourism number of arrivals 1995-2017.csv")

glimpse(tourism)
#> Rows: 216
#> Columns: 27
#> $ Country.Name <chr> "Aruba", "Afghanistan", "Angola", "Albania", "Andorra"...
#> $ region       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,...
#> $ Image.URL    <chr> "", "", "", "https://www.countryflags.io/al/flat/64.pn...
#> $ X1995        <chr> "619", "", "9", "304", "", "2,315,000", "2,289,000", "...
#> $ X1995.1      <chr> "619", "", "9", "304", "", "2,315,000", "2,289,000", "...
#> $ X1996        <chr> "641", "", "21", "287", "", "2,572,000", "2,614,000", ...
#> $ X1997        <chr> "650", "", "45", "119", "", "2,476,000", "2,764,000", ...
#> $ X1998        <chr> "647", "", "52", "184", "", "2,991,000", "3,012,000", ...
#> $ X1999        <chr> "683", "", "45", "371", "2,347,000", "3,393,000", "2,8...
#> $ X2000        <chr> "721", "", "51", "317", "2,949,000", "3,907,000", "2,9...
#> $ X2001        <chr> "691", "", "67", "354", "3,516,000", "4,134,000", "2,6...
#> $ X2002        <chr> "643", "", "91", "470", "3,387,000", "5,445,000", "2,8...
#> $ X2003        <chr> "642", "", "107", "557", "3,138,000", "5,871,000", "2,...
#> $ X2004        <chr> "728", "", "194", "645", "2,791,000", "6,195,000", "3,...
#> $ X2005        <chr> "733", "", "210", "748", "2,418,000", "7,126,000", "3,...
#> $ X2006        <chr> "694", "", "121", "937", "2,227,000", "", "4,173,000",...
#> $ X2007        <chr> "772", "", "195", "1,062,000", "2,189,000", "", "4,562...
#> $ X2008        <chr> "827", "", "294", "1,247,000", "2,059,000", "", "4,700...
#> $ X2009        <chr> "813", "", "366", "1,711,000", "1,830,000", "", "4,308...
#> $ X2010        <chr> "824", "", "425", "2,191,000", "1,808,000", "", "5,325...
#> $ X2011        <chr> "869", "", "481", "2,469,000", "2,242,000", "", "5,705...
#> $ X2012        <chr> "904", "", "528", "3,156,000", "2,238,000", "", "5,587...
#> $ X2013        <chr> "979", "", "650", "2,857,000", "2,328,000", "", "5,246...
#> $ X2014        <chr> "1,072,000", "", "595", "3,341,000", "2,363,000", "", ...
#> $ X2015        <chr> "1,225,000", "", "592", "3,784,000", "2,663,000", "", ...
#> $ X2016        <chr> "1,102,000", "", "397", "4,070,000", "2,831,000", "", ...
#> $ X2017        <chr> "1,070,500", "", "261", "4,643,000", "", "", "6,720,00...

Below are the description for each column in the dataset:

  • Country.Name : country name
  • region : region index
  • Image.URL : url of the country flag images
  • X1995, …, X2017 : tourism number of arrivals for period 1995 - 2017

Data Pre-Processing

After take a look at the dataset, we subset the dataset for Indonesia country only and we can pivot the years column so we can have a simplified data frame. Further we deselect the irrelevant columns such as region, image url, etc.

tourism_long <- tourism %>% 
  filter(Country.Name == "Indonesia") %>% 
  select(-c(region, Image.URL, X1995.1, Country.Name)) %>% 
  pivot_longer(cols = c(X1995,X1996,X1997,X1998,X1999,X2000,X2001,X2002,X2003,X2004,X2005,X2006,X2007,X2008,X2009,X2010,X2011,X2012,X2013,X2014,X2015,X2016,X2017), names_to = "year", values_to = "value")
summary(tourism_long)
#>      year              value          
#>  Length:23          Length:23         
#>  Class :character   Class :character  
#>  Mode  :character   Mode  :character

After simplifying the dataset, we will have 2 columns and 23 rows with following details:

  • year : the year 1995 - 2017 after being pivoted into 1 column
  • value : number of tourism arrivals

Then, we need to delete comma punctuation inside value column and delete the X wording inside year column.

tourism_long$value <- as.numeric(gsub(",","",tourism_long$value))
tourism_long$year <- as.numeric(gsub("X","",tourism_long$year))

Let’s see how the changes results.

tourism_long
#> # A tibble: 23 x 2
#>     year   value
#>    <dbl>   <dbl>
#>  1  1995 4324000
#>  2  1996 5034000
#>  3  1997 5185000
#>  4  1998 4606000
#>  5  1999 4728000
#>  6  2000 5064000
#>  7  2001 5153000
#>  8  2002 5033000
#>  9  2003 4467000
#> 10  2004 5321000
#> # ... with 13 more rows

After several adjustment before, we check whether the dataset has any NA values.

tourism_long %>%
  is.na() %>% 
  colSums()
#>  year value 
#>     0     0

Now, the dataset is ready to be use in forecasting since it already fulfilled: 1. No NA or missing values. 2. Years already ordered by time. 3. Complete interval since 1995 - 2017.

Time Series Object Creation

Let’s create the time series object from the tourism_long data frame then visualize it.

tourism_ts <- ts(data = tourism_long$value,
   start = "1995",
   end = "2017",
   frequency = 1) # yearly seasonality
tourism_ts %>% 
  autoplot()

From the plot, we can see that the time series object tourism_ts doesn’t has any seasonal but shows a trend.

Cross-Validation

Then, we split the dataset into train and test datasets. The test dataset is the data of last 5 years (2013 - 2017), and the rest as train dataset.

tourism_test <- tail(tourism_ts, 5)
tourism_train <- head(tourism_ts, length(tourism_ts) - length(tourism_test))

Now let’s create several models using Holt’s exponential smoothing and Arima.

Holt’s Exponential Smoothing

We use Holt’s Exponential Smoothing (Double Exponential Smoothing) for forecasting model because the dataset has no seasonal (only have error and trend).

Model Fitting

We create the model by adding a parameter gamma = F.

tourism_hw <- HoltWinters(x = tourism_train, gamma = F)

Forecast & Model Evaluation

Then we use the model created for forecasting and evaluate the model performance.

tourism_hw_f <- forecast(object = tourism_hw, h = length(tourism_test))
accuracy(tourism_hw_f, tourism_test)
#>                      ME    RMSE       MAE       MPE      MAPE     MASE
#> Training set  -38596.34  522086  421798.6 -1.399966  8.005484 1.002878
#> Test set     1308745.15 1769056 1308745.2 10.712206 10.712206 3.111702
#>                    ACF1 Theil's U
#> Training set 0.03716165        NA
#> Test set     0.23913136  1.289665

Let’s also create a visualization to compare the actual, forecasted and the fitted data.

test_forecast(actual = tourism_ts, forecast.obj = tourism_hw_f, train = tourism_train, test = tourism_test)

Arima

Then we will create several models using Arima algorithm. Before preparing the models we have to check the data stationary first.

ADF test

We use Augmented Dickey-Fuller (ADF) test or adf.test() with following hypothesis:

  • H0: have a root unit (data is not stationary)
  • H1: doesn’t have a root unit (data is already stationary)

p-value < 0.05 (alpha), data already stationary

adf.test(tourism_train, alternative = "stationary")
#> 
#>  Augmented Dickey-Fuller Test
#> 
#> data:  tourism_train
#> Dickey-Fuller = 0.56756, Lag order = 2, p-value = 0.99
#> alternative hypothesis: stationary

From the test, p-value = 0.99, which means p-value > 0.05 and data is not stationary. We try to make the data into staionary position by differencing the data once.

# diff 1x
diff(tourism_train, differences = 1) %>% adf.test()
#> 
#>  Augmented Dickey-Fuller Test
#> 
#> data:  .
#> Dickey-Fuller = -2.6551, Lag order = 2, p-value = 0.3228
#> alternative hypothesis: stationary

Since the p-value still above 0.05, let’s try differencing the data twice now.

# diff 2x
diff(tourism_train, differences = 2) %>% adf.test()
#> 
#>  Augmented Dickey-Fuller Test
#> 
#> data:  .
#> Dickey-Fuller = -4.9491, Lag order = 2, p-value = 0.01
#> alternative hypothesis: stationary

Now the p-value = 0.01 means it is < 0.05, so the data is already stationary and we can go on.

Model Fitting

tsdisplay(diff(tourism_train, differences = 2))

From the first 5 lags, we can see the p and q from PACF and ACF by count how many line which passed the threshold.

  • p = 0 (no lags passed the threshold)
  • d = 2 (we only do differencing twice to get stationary data)
  • q = 0 (no lags on ACF plot passed the threshold)

We’ll try to do ARIMA modeling by combining the (p,d,and q) value. Some combination we can try :

  • ARIMA (0,1,0)
  • ARIMA (0,2,0)
#modeling
tourism_arima1 <- Arima(y = tourism_train, order = c(0,1,0))
tourism_arima2 <- Arima(y = tourism_train, order = c(0,2,0))

After modeling, we’ll try to check the AIC of each arima model

tourism_arima1$aic
#> [1] 496.0283
tourism_arima2$aic
#> [1] 475.1592

From the AIC value, we can see that model tourism_arima2 with p,d,q = (0,2,0) is the best in arima model for this case. We’ll try to check which is the best model for arima method by MAPE (error evaluation).

accuracy(tourism_arima1)
#>                    ME     RMSE      MAE      MPE     MAPE      MASE        ACF1
#> Training set 206906.9 480569.7 397462.4 3.061196 7.138125 0.9450156 -0.04862854
accuracy(tourism_arima2)
#>                     ME     RMSE      MAE        MPE     MAPE     MASE
#> Training set -17682.22 602231.1 459119.3 -0.5160297 8.678039 1.091612
#>                    ACF1
#> Training set -0.4002891

If we check from MAPE method, we see that tourism_arima1 is the best because the error percentage is 7.13% by MAPE. We can also using auto arima model.

tourism_auto <- auto.arima(tourism_train, seasonal = F)
summary(tourism_auto)
#> Series: tourism_train 
#> ARIMA(0,1,0) with drift 
#> 
#> Coefficients:
#>          drift
#>       218823.5
#> s.e.  107551.4
#> 
#> sigma^2 estimated as 208939082690:  log likelihood=-245.16
#> AIC=494.32   AICc=495.18   BIC=495.99
#> 
#> Training set error measures:
#>                    ME     RMSE      MAE        MPE     MAPE      MASE
#> Training set 228.0652 430956.6 359384.9 -0.7385359 6.758032 0.8544816
#>                     ACF1
#> Training set -0.01646288

Then, we’ll try to compare model arima 1 and auto arima.

# AIC arima 1
tourism_arima1$aic
#> [1] 496.0283
# AIC auto arima
tourism_auto$aic
#> [1] 494.3235
# MAPE arima 1
accuracy(tourism_arima1)
#>                    ME     RMSE      MAE      MPE     MAPE      MASE        ACF1
#> Training set 206906.9 480569.7 397462.4 3.061196 7.138125 0.9450156 -0.04862854
# MAPE auto arima
accuracy(tourism_auto)
#>                    ME     RMSE      MAE        MPE     MAPE      MASE
#> Training set 228.0652 430956.6 359384.9 -0.7385359 6.758032 0.8544816
#>                     ACF1
#> Training set -0.01646288
  • From AIC method, we see that tourism_auto is better than tourism_arima1 because tourism_arima1 has AIC value higher than tourism_auto.

  • From MAPE, we see that tourism_auto is better than tourism_arima1 because the error percentage based on MAPE is 6.76%.

Forecast & Model Evaluation

Then we use the tourism_auto model for forecasting and evaluate the model performance.

tourism_auto_f <- forecast(tourism_auto, h = length(tourism_test)) 
accuracy(tourism_auto_f, tourism_test)
#>                        ME      RMSE       MAE        MPE      MAPE      MASE
#> Training set     228.0652  430956.6  359384.9 -0.7385359  6.758032 0.8544816
#> Test set     2140129.4118 2641921.9 2140129.4 18.0220970 18.022097 5.0884196
#>                     ACF1 Theil's U
#> Training set -0.01646288        NA
#> Test set      0.29042570  1.958076

Let’s also create a visualization to compare the actual, forecasted and the fitted data.

test_forecast(actual = tourism_ts, forecast.obj = tourism_auto_f, train = tourism_train, test = tourism_test)

Model Selection

Let’s compare the error valoue for both tourism_hw and tourism_auto models.

data.frame(Holts_MAPE = MAPE(tourism_hw_f$mean, tourism_test)*100,
           Auto_Arima_MAPE = MAPE(tourism_auto_f$mean, tourism_test)*100
           )
#>   Holts_MAPE Auto_Arima_MAPE
#> 1   10.71221         18.0221

There is some information we can get about the MAPE:

  • Model tourism_auto is better at forecasting in train dataset with 6.76% of error, but when forecasting in test dataset, model tourism_auto got error of 18.02%.
  • Model tourism_hw is better at forecasting in test dataset with 10.71% of error, but when forecasting in train dataset, model tourism_hw got error of 8.00%.

Therefore, based on above information, model tourism_hw will be the best model to be used since the overfit is not too large and its forecasting on the test dataset also has a lower error percentage rather than the tourism_auto model.

Assumption Checking

Let’s check the autocorrelation and residual distribution for model tourism_hw.

No-autocorrelation Test Using Box.test() Function

Hypothesis:

  • H0 : No autocorrelation in the forecast errors
  • H1 : there is an autocorrelation in the forecast errors

p-value > 0.05 (alpha), error has no-autocorrelation

Box.test(tourism_hw_f$residuals, type = "Ljung-Box")
#> 
#>  Box-Ljung test
#> 
#> data:  tourism_hw_f$residuals
#> X-squared = 0.026515, df = 1, p-value = 0.8706

Based on Box-Ljung test result, tourism_hw model has p-value = 0.87 (p-value > 0.05) thus, no-autocorrelation in forecast errors.

Normality of The Residual Test Using shapiro.test() Function

Hypothesis:

  • H0 : residuals are normally distributed
  • H1 : residuals are not normally distributed

p-value > 0.05 (alpha), residuals are normally distributed

shapiro.test(x = tourism_hw_f$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  tourism_hw_f$residuals
#> W = 0.98065, p-value = 0.9685

Based on shapiro.test result, tourism_hw model has p-value = 0.97 (p-value > 0.05) thus, residual distributed normally.

Conclusion

From several models we have created before, the Holt’s Exponential Smoothing model, model tourism_hw, has the best MAPE percentage amounting to 10.71% when forecasting the test dataset. Further, the model already passed the normality and auto-correlation test.

Hopefully, by learning the time pattern of the tourist arrival, the tourism_hw model can be used to forecast years after 2017. Nevertheless, considering how year 2020 and 2021 looks like, we think tourism_hw model should be tuned up again with the historical data from those 2 years because during the Covid-19 pandemic, cross-border traveling is restricted and tourism sector from all countries are affected.