FORECASTING TOURISTS’ ARRIVAL IN KENYA USING ARIMA

ABSTRACT

Tourism is among the leading sources of revenue to the Kenyan Government. The tourism industry’s resilience relies on effective predictive models in capturing the dynamic nature of travel patterns. In this study, data of quarterly tourists’ arrivals in Kenya from 2000 to 2023 was obtained from the Kenya National Bureau of Statistics website. R statistical software version 4.3.2 was used to build the ARIMA models based on the Box-Jenkins method to model the tourists’ arrivals in Kenya. ARIMA(1,0,0) was established to be the best model. This follows the assumption of normality and stationarity of the time series data. The model was selected based on the low AIC values as the model selection criteria. The diagnostic checks shows that the model fits the data well and was good for forecasting the future number of tourists’ arrivals in Kenya. The forecasted values and the 95% confidence interval showed that the predicted values were within the range. This shows that the forecasting of this model relatively adequate and efficient in modelling the quarterly number of tourists’ arrivals in Kenya for the next four years (2024-2027).

IMPORTING THE NECESSARY LIBRARIES

we first loaded the necessary libraries that were to be used in the analysis

library(readxl)
library(psych)
library(e1071) 
library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

library(TTR)
library(tseries)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

library(forecast)
library(ggfortify)

## Registered S3 methods overwritten by 'ggfortify':
##   method                 from    
##   autoplot.Arima         forecast
##   autoplot.acf           forecast
##   autoplot.ar            forecast
##   autoplot.bats          forecast
##   autoplot.decomposed.ts forecast
##   autoplot.ets           forecast
##   autoplot.forecast      forecast
##   autoplot.stl           forecast
##   autoplot.ts            forecast
##   fitted.ar              forecast
##   fortify.ts             forecast
##   residuals.ar           forecast

importing the data and preprocessing

here we set the working directory and imported the data for preprocessing. we checked the the class of the data,missing valuse, the distribution and the presence of outliers.

#we set seed to make our work reproducible
set.seed(1243)
# setting the working directory to desktop
setwd("C:/Users/Anderson/Desktop")
# importing the data and viewing the first few columns
anderson_data <- read_excel("anderson data.xlsx")
head(anderson_data)

## # A tibble: 6 × 2
##    Year Arrival
##   <dbl>   <dbl>
## 1  2000  265300
## 2  2000  207000
## 3  2000  247004
## 4  2000  283000
## 5  2001  270652
## 6  2001  211146

arrival<-ts(anderson_data[,2], start = c(2000,1), end = c(2023,4),
            frequency = 4)
class(arrival)

## [1] "ts"

head(arrival, n =20)

##        Qtr1   Qtr2   Qtr3   Qtr4
## 2000 265300 207000 247004 283000
## 2001 270652 211146 250795 261045
## 2002 254825 222045 256288 268122
## 2003 301000 224300 311000 309900
## 2004 317500 277000 369800 396400

sum(is.na(arrival))

## [1] 0

hist(arrival, col = 3)

#determining the skewness and kurtosis of the data
skewness(arrival)

## [1] -0.469157

kurtosis(arrival)

## [1] 0.7794268

#checking for the prencess of outliers
boxplot(arrival~cycle(arrival),main= "Tourists Box PLot 1")

#detecting and removing outliers
det_outliers <- function(data) {
  q1 <- quantile(data, 0.25)
  q3 <- quantile(data, 0.75)
  iqr <- q3 - q1
  lower_bound <- q1 - (1.5 * iqr)
  upper_bound <- q3 + (1.5 * iqr)
  return(data < lower_bound | data > upper_bound)
}

outlier_indices <- det_outliers(arrival)

arrival_imputed <- replace(arrival, outlier_indices, median(arrival))
# histogram after removing the outliers values
hist(arrival_imputed, main = "Histogram of tourists arrivals", col = 4)

#boxplot after removing the outlier values
boxplot(arrival_imputed~cycle(arrival_imputed), main = "Tourists' Box PLot 2")

##DESCRIPTIVE STATISTICS here we get to descriptive the tourists arrivals using descriptive statistics.

head(arrival_imputed, n = 20)

##        Qtr1   Qtr2   Qtr3   Qtr4
## 2000 265300 207000 247004 283000
## 2001 270652 211146 250795 261045
## 2002 254825 222045 256288 268122
## 2003 301000 224300 311000 309900
## 2004 317500 277000 369800 396400

summary(arrival_imputed)

##     Arrival      
##  Min.   :140690  
##  1st Qu.:294377  
##  Median :364225  
##  Mean   :360567  
##  3rd Qu.:429156  
##  Max.   :599689

describe(arrival_imputed)

##    vars  n     mean       sd median  trimmed      mad    min    max  range skew
## X1    1 96 360566.9 93427.16 364225 360493.8 100712.3 140690 599689 458999 0.03
##    kurtosis      se
## X1    -0.27 9535.37

skewness(arrival_imputed)

## [1] 0.02937688

kurtosis(arrival_imputed)

## [1] -0.2654101

#plotting the time series data
autoplot(arrival_imputed)+
  labs( title = "TOURISTS ARRIVAL TIME PLOT", 
        x = "TIME", y = "NO OF ARRIVALS")

DECOMPOSITION OF THE TIME SERIES DATA

The graph below a decomposition of the tourists’ data to its 3 component. The first graph from the top is the observed time plot of the tourist data. The second from the top is the random fluctuations of the data. The third graph is the seasonal component of the time series data and we can observe that the data has seasonal variation. Picks every summer and troughs every winter. The bottom graph shows trend component that give the long run of the data and we can observe that has downward and upward trends

#decomposing time series
d_arrival<-decompose(arrival_imputed)
autoplot(d_arrival)+
  labs(title = "TOURIST ARRIVAL DECOMPOSED")

## Warning: Removed 8 rows containing missing values or values outside the scale range
## (`geom_line()`).

CHECKING FOR STATIONARITY

One of the assumptions of time series analysis is that the data must be stationary that is the mean, variance and autocovariance should not change over time. We test for stationarity using the “adf.test ()” function which is the augmented Dickey – Fuller Test.

Ho: The Tourists Arrival data is not stationary H1: The Tourists Arrival data is stationary

Conclusion Since the p – value of 0.01572 is less than our level of significance of 0.05 we reject the null hypothesis and conclude that the data of tourists’ arrival is stationary.

adf.test(arrival_imputed)

## 
##  Augmented Dickey-Fuller Test
## 
## data:  arrival_imputed
## Dickey-Fuller = -3.9312, Lag order = 4, p-value = 0.01572
## alternative hypothesis: stationary

ANALYSIS OF TREND OF TOURISTS’ ARRIVALS IN KENYA

The trend of the tourist arrivals was analyzed by use of visual representation. The graph below shows the trend of the tourists’ arrivals from 2000 to 2023. From the graph below the general trend of the tourists’ arrivals is slightly an upward trend. We can observe from the Year 2000upto 2008 the trend was moving upward suggesting that the number of tourist arrivals increased this was maybe due to favourable government policy that favors tourist to travel to Kenya. Then from end of 2007 towards 2008 there was a sharp downward trend that was experienced. This could be possible due to the post violence election that was experienced in the country. Then from 2008 the trend started to move upwards again up-to around 2012. Then there was a downward trend of the tourists towards 2015 then there was an upward trends up-to 2019 around December then the number of tourists decreased sharply from 2020 to 2022 due to the outbreak of the COVID-19 pandemic that resulted to lockdowns and there was little movement of tourist into the country. From the end of 2022 the trend started to increase due to change in policy that favors tourist to travel and the discovery of the COVID-19 vaccine. Holding other factors constant, the number of tourist arrivals is expected to increase with time.

autoplot(d_arrival$trend)+
  labs(title = "TREND OF TOURISTS ARRIVALS", x = "TIME", 
       y = "NO OF ARRIVALS")

## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).

MODEL BUILDING

To build an ARIMA model (p,d,q) we need to examine the order of the Auto-Regressive (AR) , the number of times differenced or the order of integration (d) and the order of the Moving Average(MA). We used the ACF and PACF plots to examine correlation of the tourists’ series data to give us the possible ARIMA candidate.

The graph below is an ACF plot of the tourists’ arrival data. The graph plots the tourists’ data against its lagged values. From the graph the first lag is equal to 1 which is significant since it does not have the previous value. The second up-to the fifth one exceed the significance bounce. After the fifth it is tailing to zero. So, this suggests an AR of order(4). Hence the candidate ARMA is (4,0)

autoplot(acf(arrival_imputed)) + 
  labs(title = "ACF PLOT")

the figure below is a PACF plot of the tourists’ arrival data. This plot gives the correlation of the series after elimination of the intermediate lags. From the graph we can observe the first lag exceeds the significant bounce. Hence the possible order of Moving Averages is 1 hence MA(1). The possible candidate ARMA is (0,1)

autoplot(pacf(arrival_imputed)) +
  labs(title = "PACF PLOT")

model_fit<-auto.arima(arrival_imputed,trace = TRUE, ic = "aic")

## 
##  ARIMA(2,0,2)(1,1,1)[4] with drift         : 2267.591
##  ARIMA(0,0,0)(0,1,0)[4] with drift         : 2368.291
##  ARIMA(1,0,0)(1,1,0)[4] with drift         : 2283.355
##  ARIMA(0,0,1)(0,1,1)[4] with drift         : 2312.12
##  ARIMA(0,0,0)(0,1,0)[4]                    : 2366.582
##  ARIMA(2,0,2)(0,1,1)[4] with drift         : 2269.828
##  ARIMA(2,0,2)(1,1,0)[4] with drift         : Inf
##  ARIMA(2,0,2)(2,1,1)[4] with drift         : 2269.162
##  ARIMA(2,0,2)(1,1,2)[4] with drift         : 2269.633
##  ARIMA(2,0,2)(0,1,0)[4] with drift         : 2283.737
##  ARIMA(2,0,2)(0,1,2)[4] with drift         : 2268.15
##  ARIMA(2,0,2)(2,1,0)[4] with drift         : 2272.904
##  ARIMA(2,0,2)(2,1,2)[4] with drift         : 2271.396
##  ARIMA(1,0,2)(1,1,1)[4] with drift         : 2267.485
##  ARIMA(1,0,2)(0,1,1)[4] with drift         : 2267.996
##  ARIMA(1,0,2)(1,1,0)[4] with drift         : 2282.075
##  ARIMA(1,0,2)(2,1,1)[4] with drift         : 2269.484
##  ARIMA(1,0,2)(1,1,2)[4] with drift         : 2269.484
##  ARIMA(1,0,2)(0,1,0)[4] with drift         : 2287.689
##  ARIMA(1,0,2)(0,1,2)[4] with drift         : 2267.644
##  ARIMA(1,0,2)(2,1,0)[4] with drift         : 2278.252
##  ARIMA(1,0,2)(2,1,2)[4] with drift         : Inf
##  ARIMA(0,0,2)(1,1,1)[4] with drift         : Inf
##  ARIMA(1,0,1)(1,1,1)[4] with drift         : 2266.952
##  ARIMA(1,0,1)(0,1,1)[4] with drift         : 2267.088
##  ARIMA(1,0,1)(1,1,0)[4] with drift         : 2285.193
##  ARIMA(1,0,1)(2,1,1)[4] with drift         : 2268.941
##  ARIMA(1,0,1)(1,1,2)[4] with drift         : 2268.945
##  ARIMA(1,0,1)(0,1,0)[4] with drift         : 2297.404
##  ARIMA(1,0,1)(0,1,2)[4] with drift         : 2267.015
##  ARIMA(1,0,1)(2,1,0)[4] with drift         : 2280.566
##  ARIMA(1,0,1)(2,1,2)[4] with drift         : Inf
##  ARIMA(0,0,1)(1,1,1)[4] with drift         : Inf
##  ARIMA(1,0,0)(1,1,1)[4] with drift         : 2265.016
##  ARIMA(1,0,0)(0,1,1)[4] with drift         : 2265.088
##  ARIMA(1,0,0)(2,1,1)[4] with drift         : 2267.007
##  ARIMA(1,0,0)(1,1,2)[4] with drift         : 2267.01
##  ARIMA(1,0,0)(0,1,0)[4] with drift         : 2296.138
##  ARIMA(1,0,0)(0,1,2)[4] with drift         : 2265.074
##  ARIMA(1,0,0)(2,1,0)[4] with drift         : 2278.642
##  ARIMA(1,0,0)(2,1,2)[4] with drift         : Inf
##  ARIMA(0,0,0)(1,1,1)[4] with drift         : Inf
##  ARIMA(2,0,0)(1,1,1)[4] with drift         : 2266.931
##  ARIMA(2,0,1)(1,1,1)[4] with drift         : 2266.109
##  ARIMA(1,0,0)(1,1,1)[4]                    : 2263.866
##  ARIMA(1,0,0)(0,1,1)[4]                    : 2263.733
##  ARIMA(1,0,0)(0,1,0)[4]                    : 2294.169
##  ARIMA(1,0,0)(0,1,2)[4]                    : 2263.895
##  ARIMA(1,0,0)(1,1,0)[4]                    : 2281.44
##  ARIMA(1,0,0)(1,1,2)[4]                    : 2265.853
##  ARIMA(0,0,0)(0,1,1)[4]                    : 2364.854
##  ARIMA(2,0,0)(0,1,1)[4]                    : 2265.732
##  ARIMA(1,0,1)(0,1,1)[4]                    : 2265.732
##  ARIMA(0,0,1)(0,1,1)[4]                    : 2311.572
##  ARIMA(2,0,1)(0,1,1)[4]                    : 2267.282
## 
##  Best model: ARIMA(1,0,0)(0,1,1)[4]

arrival_model<-arima(arrival_imputed, order = c(1,0,0))
summary(arrival_model)

## 
## Call:
## arima(x = arrival_imputed, order = c(1, 0, 0))
## 
## Coefficients:
##          ar1  intercept
##       0.7048   359822.6
## s.e.  0.0716    22197.1
## 
## sigma^2 estimated as 4.326e+09:  log likelihood = -1201.58,  aic = 2409.16
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE    MAPE      MASE        ACF1
## Training set 980.7115 65770.67 49593.42 -3.901437 15.4343 0.8902417 -0.06828334

DIAGONISTICS CHECKS

hist(arrival_model$residuals)

qqnorm((arrival_model$residuals))

autoplot(acf(residuals(arrival_model)))

autoplot(pacf(arrival_model$residuals))

autoplot(arrival_model$residuals)+
  labs(title = "RESIDUAL PLOT", x = "Time", y = "Residuls")

b_test<-Box.test(arrival_model$residuals,lag = 10,type = "Ljung-Box")
b_test

## 
##  Box-Ljung test
## 
## data:  arrival_model$residuals
## X-squared = 65.653, df = 10, p-value = 3.04e-10

FORECASTING USING THE BEST MODEL

The graph below shows the plot of the forecasted number of tourists arrivals in Kenya. The blue line shows the forecasted plots and the dark region is the 95% confidence interval of the tourists’ arrivals from 2024 to 2027. The trend is expected to be slightly decreasing for the next four years.

p_arrival<-forecast(arrival_model, h = 16)
head(p_arrival)

## $method
## [1] "ARIMA(1,0,0) with non-zero mean"
## 
## $model
## 
## Call:
## arima(x = arrival_imputed, order = c(1, 0, 0))
## 
## Coefficients:
##          ar1  intercept
##       0.7048   359822.6
## s.e.  0.0716    22197.1
## 
## sigma^2 estimated as 4.326e+09:  log likelihood = -1201.58,  aic = 2409.16
## 
## $level
## [1] 80 95
## 
## $mean
##          Qtr1     Qtr2     Qtr3     Qtr4
## 2024 405409.1 391953.3 382469.3 375784.7
## 2025 371073.1 367752.3 365411.7 363762.0
## 2026 362599.2 361779.6 361202.0 360794.8
## 2027 360507.8 360305.6 360163.0 360062.5
## 
## $lower
##              80%      95%
## 2024 Q1 321120.6 276501.0
## 2024 Q2 288832.1 234243.1
## 2024 Q3 271170.8 212252.9
## 2024 Q4 260639.5 199685.4
## 2025 Q1 254064.1 192123.2
## 2025 Q2 249828.2 187402.9
## 2025 Q3 247035.6 184371.1
## 2025 Q4 245162.0 182379.0
## 2026 Q1 243888.2 181046.4
## 2026 Q2 243013.5 180142.5
## 2026 Q3 242408.4 179523.0
## 2026 Q4 241987.7 179095.0
## 2027 Q1 241694.0 178797.7
## 2027 Q2 241488.4 178590.3
## 2027 Q3 241344.1 178445.2
## 2027 Q4 241242.8 178343.5
## 
## $upper
##              80%      95%
## 2024 Q1 489697.6 534317.3
## 2024 Q2 495074.6 549663.6
## 2024 Q3 493767.8 552685.7
## 2024 Q4 490929.8 551884.0
## 2025 Q1 488082.2 550023.1
## 2025 Q2 485676.4 548101.7
## 2025 Q3 483787.7 546452.2
## 2025 Q4 482361.9 545144.9
## 2026 Q1 481310.2 544152.0
## 2026 Q2 480545.7 543416.7
## 2026 Q3 479995.5 542880.9
## 2026 Q4 479601.9 542494.6
## 2027 Q1 479321.7 542218.0
## 2027 Q2 479122.8 542020.8
## 2027 Q3 478981.9 541880.8
## 2027 Q4 478882.2 541781.6

autoplot(p_arrival)+
  labs(title = "FORECASTS FOR 4 YEARS", x = "TIME", y = "ARRIVALS")