This report can also be found at: http://rpubs.com/borjaureta/529795
Team Globetrotters is composed of:
We partnered with DFS Group to forecast travel and purchase intent of Chinese consumers traveling to Paris. DFS Group is the world’s leading luxury travel retailer; its network consists of duty-free stores located in 11 major global airports and 20 downtown T Galleria locations.
DFS has re-oriented its strategy towards the Chinese market, aiming to be as close as possible to where Chinese travelers are. As part of this strategy, DFS will be opening a new store in Paris, France, with a deep focus on Chinese tourists.
Our initial goal is to predict flows of Chinese tourists into Paris. For this, we will leverage actual and forward-looking reservation data owned by DFS in a proprietary database.
# Creating a vector of packages used within.
packages <- c('adabag', 'astsa',
'base',
'caret', 'caTools','chron', 'Cubist', 'cowplot',
'DMwR2','doParallel','dplyr', 'data.table',
'e1071','earth',
'forecast',
'ggplot2','gridExtra', 'gbm', 'grt', 'ggfortify',
'here', 'h2o', 'httr',
'ipred',
'janitor',
'knitr',
'lme4','lubridate',
'MASS',
'neuralnet', 'nnet',
'plyr','psych', 'performanceEstimation', 'prophet',
'randomForest','readr', 'readxl','rlang','rpart','rpart.plot',
'stats', 'scales',
'tidyverse', 'timeSeries', 'tsbox',
'urca',
'varSelRF',
'xgboost', 'xts',
'zoo')
# Checking for package installations on the system and installing if not found.
if (length(setdiff(packages, rownames(installed.packages()))) > 0) {
install.packages(setdiff(packages, rownames(installed.packages())))
}
# Including the packages for use.
for(package in packages){
library(package, character.only = TRUE)
}
#Ensure wd is set to current location by using here()
setwd(here::here())
#h2oInstance <- h2o.init(ip = 'localhost')
The main data source for this exercise is ForwardKeys (“FK”). FK is a company that analyses millions of daily flights and provides clients with insights for their tactical decisions; it crunches and analyzes over 17 million booking transactions a day. Additionally, it pulls a wide range of other relevant data, such as flight searches and aviation capacity.
DFS provided us with a data pull from FK. This dataset contains actual information on Chinese travelers to Paris between January 2, 2014 and August 31, 2019. It also provides additional valuable information such as length of stay in Paris, total trip duration, number of people per booking, lead time between booking and flight, cabin, distribution channel, type of stay, one-way/return, travel day of the week and traveler profile (i.e. business, leisure and group).
We will leverage the FK data with two goals:
The dataset was included in a spreadsheet that contained 3 tabs with a few tables. We imported the data and ran a few data cleaning and structuring code lines to enable the subsequent data analysis.
#Import dataset
#Import table with 15 to 19 data
pdset15_19 <- as.tibble(read_xlsx('forwardkeys-Arrivals-History-Paris(FR)_PAR.xlsx', col_names = FALSE, na = c("", "NA"), sheet = 'China-CN',
skip = 121, n_max = 1703))
#Import 2014 from second table
pdset14 <- as.tibble(read_xlsx('forwardkeys-Arrivals-History-Paris(FR)_PAR.xlsx', col_names = FALSE, na = c("", "NA"), sheet = 'China-CN',
skip = 1828, n_max = 364))
#Join both tables by row
pdset <- rbind(pdset14, pdset15_19)
#Change column names
colnames(pdset) <- c('Date', 'Total_trav', 'Total_trav_China/CN', 'Returnhome', 'ShortTransfer', 'DwellingTransfer', 'LongTransfer', 'Stopover', 'DayTrip', 'LoS_1night', 'LoS_2nights', 'LoS_3nights', 'LoS_4to5nights', 'LoS_6to8nights', 'LoS_9to13nights', 'LoS_14to21nights', 'LoS_22nightsormore', 'Endoftrip', 'TrDur_0night', 'TrDur_1night', 'TrDur_2nights', 'TrDur_3nights', 'TrDur_4to5nights', 'TrDur_6to8nights', 'TrDur_9to13nights', 'TrDur_14to21nights', 'TrDur_22nightsormore', 'ppb_1pax', 'ppb_2pax', 'ppb_3pax', 'ppb_4pax', 'ppb_5pax', 'ppb_6to9pax', 'ppb_10pl_pax', 'LeadT_0to4days', 'LeadT_5to14days', 'LeadT_15to29days', 'LeadT_30to44days', 'LeadT_45to59days', 'LeadT_60to89days', 'LeadT_90to119days', 'LeadT_120to364days', 'Cabin_Economy', 'Cabin_EconomyPremium', 'Cabin_Business', 'Cabin_First', 'DistCh_OnlineTA', 'DistCh_CorporateTA', 'DistCh_RetailTA', 'DistCh_OtherTA', 'TypeSt_Weekendstay', 'TypeSt_Workweekstay', 'TypeSt_Combinedstay', 'TypeSt_Nostay', 'OneRet_Oneway', 'OneRet_Return', 'OneRet_MultiCity', 'TrDay_Sunday', 'TrDay_Monday', 'TrDay_Tuesday', 'TrDay_Wednesday', 'TrDay_Thursday', 'TrDay_Friday', 'TrDay_Saturday', 'Profile_Business', 'Profile_Leisure', 'Profile_Group', 'Profile_VFRndExpats', 'Sign_InitialBookings', 'Sign_Partialadditions_modif', 'Sign_Partialcancellations', 'Sign_Fulltripcancellations', 'Avg.LOS', 'Avg.LOT', 'Avg.LT', 'Avg.PPB')
#Get rid of columns that have no data and are not useful for our analysis
pdset <- dplyr::select(pdset, -c('Total_trav_China/CN', 'Returnhome', 'ShortTransfer', 'DwellingTransfer', 'LongTransfer', 'Stopover', 'DayTrip', 'Endoftrip', 'TrDur_0night', 'TypeSt_Nostay', 'TrDay_Sunday', 'TrDay_Monday', 'TrDay_Tuesday', 'TrDay_Wednesday', 'TrDay_Thursday', 'TrDay_Friday', 'TrDay_Saturday', 'Profile_VFRndExpats', 'OneRet_Oneway'))
#Add day of the week, week, month and year.
pdset <- dplyr::mutate(pdset,
WeekDay=lubridate::wday(pdset$Date, label = TRUE),
Month=lubridate::month(pdset$Date, label = TRUE),
Week=lubridate::week(pdset$Date),
Year=lubridate::year(pdset$Date))
#Import dataset
#Import table with 15 to 19 data
vdset15_19 <- as.tibble(read_xlsx('forwardkeys-Arrivals-History-Venice(IT)_VCE.xlsx', col_names = FALSE, na = c("", "NA"), sheet = 'China-CN',
skip = 121, n_max = 1703))
#Import 2014 from second table
vdset14 <- as.tibble(read_xlsx('forwardkeys-Arrivals-History-Venice(IT)_VCE.xlsx', col_names = FALSE, na = c("", "NA"), sheet = 'China-CN',
skip = 1828, n_max = 364))
#Join both tables by row
vdset <- rbind(vdset14, vdset15_19)
#Change column names
colnames(vdset) <- c('Date', 'Total_trav', 'Total_trav_China/CN', 'Returnhome', 'ShortTransfer', 'DwellingTransfer', 'LongTransfer', 'Stopover', 'DayTrip', 'LoS_1night', 'LoS_2nights', 'LoS_3nights', 'LoS_4to5nights', 'LoS_6to8nights', 'LoS_9to13nights', 'LoS_14to21nights', 'LoS_22nightsormore', 'Endoftrip', 'TrDur_0night', 'TrDur_1night', 'TrDur_2nights', 'TrDur_3nights', 'TrDur_4to5nights', 'TrDur_6to8nights', 'TrDur_9to13nights', 'TrDur_14to21nights', 'TrDur_22nightsormore', 'ppb_1pax', 'ppb_2pax', 'ppb_3pax', 'ppb_4pax', 'ppb_5pax', 'ppb_6to9pax', 'ppb_10pl_pax', 'LeadT_0to4days', 'LeadT_5to14days', 'LeadT_15to29days', 'LeadT_30to44days', 'LeadT_45to59days', 'LeadT_60to89days', 'LeadT_90to119days', 'LeadT_120to364days', 'Cabin_Economy', 'Cabin_EconomyPremium', 'Cabin_Business', 'Cabin_First', 'DistCh_OnlineTA', 'DistCh_CorporateTA', 'DistCh_RetailTA', 'DistCh_OtherTA', 'TypeSt_Weekendstay', 'TypeSt_Workweekstay', 'TypeSt_Combinedstay', 'TypeSt_Nostay', 'OneRet_Oneway', 'OneRet_Return', 'OneRet_MultiCity', 'TrDay_Sunday', 'TrDay_Monday', 'TrDay_Tuesday', 'TrDay_Wednesday', 'TrDay_Thursday', 'TrDay_Friday', 'TrDay_Saturday', 'Profile_Business', 'Profile_Leisure', 'Profile_Group', 'Profile_VFRndExpats', 'Sign_InitialBookings', 'Sign_Partialadditions_modif', 'Sign_Partialcancellations', 'Sign_Fulltripcancellations', 'Avg.LOS', 'Avg.LOT', 'Avg.LT', 'Avg.PPB')
#Get rid of columns that have no data and are not useful for our analysis
vdset <- dplyr::select(vdset, -c('Total_trav_China/CN', 'Returnhome', 'ShortTransfer', 'DwellingTransfer', 'LongTransfer', 'Stopover', 'DayTrip', 'Endoftrip', 'TrDur_0night', 'TypeSt_Nostay', 'Avg.LOS', 'Avg.LOT', 'Avg.LT', 'Avg.PPB', 'TrDay_Sunday', 'TrDay_Monday', 'TrDay_Tuesday', 'TrDay_Wednesday', 'TrDay_Thursday', 'TrDay_Friday', 'TrDay_Saturday', 'Profile_VFRndExpats', 'OneRet_Oneway'))
#Add day of the week, week, month and year.
vdset <- dplyr::mutate(vdset,
WeekDay=lubridate::wday(vdset$Date, label = TRUE),
Month=lubridate::month(vdset$Date, label = TRUE),
Week=lubridate::week(vdset$Date),
Year=lubridate::year(vdset$Date))
We have included below the summary and structure of the dataset.
str(pdset)
## Classes 'tbl_df', 'tbl' and 'data.frame': 2067 obs. of 61 variables:
## $ Date : POSIXct, format: "2014-01-02" "2014-01-03" ...
## $ Total_trav : num 147 154 202 172 180 141 290 150 267 314 ...
## $ LoS_1night : num 0 0 3 5 13 6 8 11 15 15 ...
## $ LoS_2nights : num 10 12 1 15 18 8 19 42 7 25 ...
## $ LoS_3nights : num 46 20 16 27 17 13 63 14 37 35 ...
## $ LoS_4to5nights : num 80 91 67 72 73 39 86 27 129 87 ...
## $ LoS_6to8nights : num 28 16 65 20 35 27 65 29 33 79 ...
## $ LoS_9to13nights : num 12 6 29 18 14 25 20 11 16 44 ...
## $ LoS_14to21nights : num -35 6 16 13 4 21 27 15 29 19 ...
## $ LoS_22nightsormore : num 6 3 5 2 6 2 2 1 1 10 ...
## $ TrDur_1night : num 0 0 0 3 1 2 1 1 3 0 ...
## $ TrDur_2nights : num 1 0 0 0 4 1 5 6 1 1 ...
## $ TrDur_3nights : num 1 0 1 0 11 1 12 4 6 5 ...
## $ TrDur_4to5nights : num 11 11 6 18 34 12 30 13 23 26 ...
## $ TrDur_6to8nights : num 53 70 78 80 33 42 116 66 130 115 ...
## $ TrDur_9to13nights : num 82 41 71 50 75 55 73 36 51 124 ...
## $ TrDur_14to21nights : num 15 24 33 12 15 19 38 17 44 31 ...
## $ TrDur_22nightsormore : num -16 8 13 9 7 9 15 7 9 12 ...
## $ ppb_1pax : num 46 40 87 75 91 67 106 79 89 114 ...
## $ ppb_2pax : num 32 20 34 26 24 46 70 56 46 62 ...
## $ ppb_3pax : num 0 18 12 6 36 3 27 3 27 27 ...
## $ ppb_4pax : num 4 4 20 8 8 12 28 12 12 32 ...
## $ ppb_5pax : num 5 0 20 5 10 0 5 0 15 20 ...
## $ ppb_6to9pax : num 55 21 0 0 0 13 38 0 38 19 ...
## $ ppb_10pl_pax : num 5 51 29 52 11 0 16 0 40 40 ...
## $ LeadT_0to4days : num -12 -82 -27 -19 -11 15 -26 1 -6 -9 ...
## $ LeadT_5to14days : num 7 -14 14 19 -21 -32 35 -7 13 -1 ...
## $ LeadT_15to29days : num 26 40 68 9 80 86 128 33 23 115 ...
## $ LeadT_30to44days : num -32 25 13 49 49 44 82 22 78 99 ...
## $ LeadT_45to59days : num 54 76 70 40 23 16 9 77 33 28 ...
## $ LeadT_60to89days : num 60 26 12 73 17 4 48 17 28 71 ...
## $ LeadT_90to119days : num -72 0 44 -40 42 -33 8 1 7 -7 ...
## $ LeadT_120to364days : num 116 83 8 41 1 41 6 6 91 18 ...
## $ Cabin_Economy : num 130 134 153 140 133 112 247 116 236 258 ...
## $ Cabin_EconomyPremium : num 3 0 16 7 9 9 5 15 5 18 ...
## $ Cabin_Business : num 13 14 32 25 36 18 37 18 22 27 ...
## $ Cabin_First : num 1 6 1 0 2 2 1 1 4 11 ...
## $ DistCh_OnlineTA : num 15 7 39 14 32 30 32 19 31 69 ...
## $ DistCh_CorporateTA : num 17 42 58 41 25 29 54 17 64 46 ...
## $ DistCh_RetailTA : num 115 104 86 112 118 74 196 102 160 183 ...
## $ DistCh_OtherTA : num 0 1 19 5 5 8 8 12 12 16 ...
## $ TypeSt_Weekendstay : num 0 12 3 0 0 0 0 0 22 15 ...
## $ TypeSt_Workweekstay : num 0 0 0 119 93 27 27 11 0 0 ...
## $ TypeSt_Combinedstay : num 147 142 199 53 87 114 263 139 245 299 ...
## $ OneRet_Return : num 29 9 96 43 64 73 122 48 92 126 ...
## $ OneRet_MultiCity : num 118 145 106 129 116 68 168 102 175 188 ...
## $ Profile_Business : num 43 65 109 97 106 74 146 83 133 119 ...
## $ Profile_Leisure : num 44 30 93 23 63 54 106 74 96 136 ...
## $ Profile_Group : num 60 59 0 52 11 13 38 -7 38 59 ...
## $ Sign_InitialBookings : num 704 963 902 1032 848 ...
## $ Sign_Partialadditions_modif: num 326 348 161 211 324 182 361 172 382 278 ...
## $ Sign_Partialcancellations : num -577 -869 -593 -630 -794 ...
## $ Sign_Fulltripcancellations : num -306 -288 -268 -441 -198 -209 -358 -195 -325 -273 ...
## $ Avg.LOS : num 5.62 7.86 9.18 6.81 7.22 8.99 6.79 6.61 6.17 6.97 ...
## $ Avg.LOT : num 11.5 16.1 12.9 10.6 10.5 ...
## $ Avg.LT : num 66.2 137.1 65.8 74.4 57.8 ...
## $ Avg.PPB : num 4.16 6.9 3.99 5.71 2.74 2.18 3.34 1.65 4.69 4.08 ...
## $ WeekDay : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tue"<..: 5 6 7 1 2 3 4 5 6 7 ...
## $ Month : Ord.factor w/ 12 levels "Jan"<"Feb"<"Mar"<..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Week : num 1 1 1 1 1 1 2 2 2 2 ...
## $ Year : num 2014 2014 2014 2014 2014 ...
summary(pdset)
## Date Total_trav LoS_1night
## Min. :2014-01-02 00:00:00 Min. : 71.0 Min. :-4.00
## 1st Qu.:2015-06-03 12:00:00 1st Qu.: 404.0 1st Qu.: 6.00
## Median :2016-11-01 00:00:00 Median : 567.0 Median :11.00
## Mean :2016-10-31 19:46:24 Mean : 630.8 Mean :12.54
## 3rd Qu.:2018-04-01 12:00:00 3rd Qu.: 798.0 3rd Qu.:17.00
## Max. :2019-08-31 00:00:00 Max. :2129.0 Max. :79.00
##
## LoS_2nights LoS_3nights LoS_4to5nights LoS_6to8nights
## Min. : -6.00 Min. : -3.0 Min. : -16.0 Min. :-37.0
## 1st Qu.: 14.00 1st Qu.: 34.0 1st Qu.: 128.0 1st Qu.: 76.0
## Median : 22.00 Median : 55.0 Median : 198.0 Median :125.0
## Mean : 25.37 Mean : 62.2 Mean : 236.4 Mean :146.9
## 3rd Qu.: 33.00 3rd Qu.: 81.0 3rd Qu.: 298.0 3rd Qu.:191.0
## Max. :167.00 Max. :473.0 Max. :1078.0 Max. :618.0
##
## LoS_9to13nights LoS_14to21nights LoS_22nightsormore TrDur_1night
## Min. : 6.00 Min. :-35.00 Min. :-7.000 Min. :-5.000
## 1st Qu.: 46.00 1st Qu.: 23.00 1st Qu.: 2.000 1st Qu.: 0.000
## Median : 68.00 Median : 41.00 Median : 5.000 Median : 1.000
## Mean : 85.56 Mean : 54.97 Mean : 6.868 Mean : 1.917
## 3rd Qu.:112.00 3rd Qu.: 72.00 3rd Qu.: 9.000 3rd Qu.: 3.000
## Max. :593.00 Max. :412.00 Max. :49.000 Max. :39.000
##
## TrDur_2nights TrDur_3nights TrDur_4to5nights TrDur_6to8nights
## Min. :-7.000 Min. : -3.00 Min. :-28.00 Min. : 13.0
## 1st Qu.: 1.000 1st Qu.: 5.00 1st Qu.: 24.00 1st Qu.:107.0
## Median : 4.000 Median : 11.00 Median : 40.00 Median :157.0
## Mean : 4.717 Mean : 14.95 Mean : 50.63 Mean :173.9
## 3rd Qu.: 7.000 3rd Qu.: 20.00 3rd Qu.: 64.00 3rd Qu.:225.0
## Max. :39.000 Max. :187.00 Max. :555.00 Max. :657.0
##
## TrDur_9to13nights TrDur_14to21nights TrDur_22nightsormore ppb_1pax
## Min. : 8.0 Min. : 9.00 Min. :-16.00 Min. : 38
## 1st Qu.: 138.0 1st Qu.: 44.00 1st Qu.: 11.00 1st Qu.:136
## Median : 220.0 Median : 74.00 Median : 17.00 Median :179
## Mean : 264.1 Mean : 97.92 Mean : 22.66 Mean :195
## 3rd Qu.: 343.0 3rd Qu.:127.00 3rd Qu.: 28.00 3rd Qu.:239
## Max. :1236.0 Max. :525.00 Max. :164.00 Max. :526
##
## ppb_2pax ppb_3pax ppb_4pax ppb_5pax
## Min. : 12.0 Min. : -3.00 Min. : -4.00 Min. : -5.00
## 1st Qu.: 82.0 1st Qu.: 30.00 1st Qu.: 24.00 1st Qu.: 10.00
## Median :118.0 Median : 48.00 Median : 36.00 Median : 20.00
## Mean :132.8 Mean : 61.82 Mean : 44.15 Mean : 27.16
## 3rd Qu.:170.0 3rd Qu.: 81.00 3rd Qu.: 60.00 3rd Qu.: 35.00
## Max. :570.0 Max. :297.00 Max. :168.00 Max. :310.00
##
## ppb_6to9pax ppb_10pl_pax LeadT_0to4days LeadT_5to14days
## Min. : -8.0 Min. :-82.0 Min. :-349.00 Min. :-128.00
## 1st Qu.: 25.0 1st Qu.: 31.0 1st Qu.: -25.00 1st Qu.: 36.00
## Median : 46.0 Median : 71.0 Median : -8.00 Median : 75.00
## Mean : 55.4 Mean :114.5 Mean : -13.84 Mean : 83.66
## 3rd Qu.: 76.0 3rd Qu.:142.5 3rd Qu.: 5.00 3rd Qu.: 120.00
## Max. :296.0 Max. :965.0 Max. : 84.00 Max. : 470.00
##
## LeadT_15to29days LeadT_30to44days LeadT_45to59days LeadT_60to89days
## Min. :-45.0 Min. :-75.0 Min. :-88.00 Min. :-70.00
## 1st Qu.: 87.0 1st Qu.: 53.5 1st Qu.: 32.00 1st Qu.: 31.00
## Median :130.0 Median : 93.0 Median : 63.00 Median : 69.00
## Mean :137.6 Mean :102.6 Mean : 74.24 Mean : 87.79
## 3rd Qu.:178.0 3rd Qu.:140.0 3rd Qu.:105.00 3rd Qu.:126.50
## Max. :459.0 Max. :386.0 Max. :356.00 Max. :502.00
##
## LeadT_90to119days LeadT_120to364days Cabin_Economy
## Min. :-139.0 Min. :-45.0 Min. : 55.0
## 1st Qu.: 10.0 1st Qu.: 37.0 1st Qu.: 329.0
## Median : 31.0 Median : 79.0 Median : 472.0
## Mean : 44.3 Mean :114.4 Mean : 534.3
## 3rd Qu.: 66.5 3rd Qu.:152.5 3rd Qu.: 672.5
## Max. : 412.0 Max. :849.0 Max. :1953.0
##
## Cabin_EconomyPremium Cabin_Business Cabin_First DistCh_OnlineTA
## Min. : -2.0 Min. : 4.00 Min. :-4.000 Min. : 7.00
## 1st Qu.: 9.0 1st Qu.: 41.00 1st Qu.: 4.000 1st Qu.: 44.00
## Median : 16.0 Median : 60.00 Median : 7.000 Median : 66.00
## Mean : 20.8 Mean : 67.45 Mean : 8.277 Mean : 80.91
## 3rd Qu.: 26.0 3rd Qu.: 85.00 3rd Qu.:11.000 3rd Qu.:100.00
## Max. :118.0 Max. :275.00 Max. :56.000 Max. :480.00
##
## DistCh_CorporateTA DistCh_RetailTA DistCh_OtherTA TypeSt_Weekendstay
## Min. : -4.00 Min. : 51.0 Min. :-27.00 Min. : 0.000
## 1st Qu.: 52.00 1st Qu.: 269.0 1st Qu.: 4.00 1st Qu.: 0.000
## Median : 77.00 Median : 396.0 Median : 11.00 Median : 0.000
## Mean : 84.86 Mean : 441.6 Mean : 23.37 Mean : 7.872
## 3rd Qu.:109.00 3rd Qu.: 574.5 3rd Qu.: 29.00 3rd Qu.: 9.000
## Max. :282.00 Max. :1453.0 Max. :313.00 Max. :125.000
##
## TypeSt_Workweekstay TypeSt_Combinedstay OneRet_Return OneRet_MultiCity
## Min. : 0.0 Min. : 53.0 Min. : 9.0 Min. : 23.0
## 1st Qu.: 0.0 1st Qu.: 293.0 1st Qu.:143.0 1st Qu.: 226.0
## Median : 35.0 Median : 450.0 Median :208.0 Median : 340.0
## Mean : 104.7 Mean : 518.3 Mean :236.5 Mean : 394.3
## 3rd Qu.: 154.0 3rd Qu.: 672.5 3rd Qu.:298.0 3rd Qu.: 499.0
## Max. :1275.0 Max. :2119.0 Max. :813.0 Max. :1732.0
##
## Profile_Business Profile_Leisure Profile_Group Sign_InitialBookings
## Min. : 31.0 Min. : 22.0 Min. : -50.0 Min. : 282
## 1st Qu.:162.0 1st Qu.:146.0 1st Qu.: 55.0 1st Qu.:1162
## Median :215.0 Median :212.0 Median : 107.0 Median :1672
## Mean :230.8 Mean :249.6 Mean : 150.4 Mean :1950
## 3rd Qu.:284.5 3rd Qu.:330.0 3rd Qu.: 194.5 3rd Qu.:2410
## Max. :694.0 Max. :910.0 Max. :1043.0 Max. :7307
##
## Sign_Partialadditions_modif Sign_Partialcancellations
## Min. : 32.0 Min. :-5556
## 1st Qu.: 234.0 1st Qu.:-1706
## Median : 340.0 Median :-1161
## Mean : 396.6 Mean :-1360
## 3rd Qu.: 497.5 3rd Qu.: -807
## Max. :1999.0 Max. : -195
##
## Sign_Fulltripcancellations Avg.LOS Avg.LOT
## Min. :-1843.0 Min. : 4.330 Min. : 6.93
## 1st Qu.: -433.0 1st Qu.: 6.300 1st Qu.:10.16
## Median : -265.0 Median : 6.970 Median :11.11
## Mean : -355.6 Mean : 7.163 Mean :11.34
## 3rd Qu.: -168.0 3rd Qu.: 7.800 3rd Qu.:12.24
## Max. : -16.0 Max. :27.600 Max. :30.05
##
## Avg.LT Avg.PPB WeekDay Month Week
## Min. : 10.63 Min. : 0.930 Sun:295 Mar :186 Min. : 1.00
## 1st Qu.: 51.30 1st Qu.: 3.240 Mon:295 May :186 1st Qu.:13.00
## Median : 64.24 Median : 4.180 Tue:295 Jul :186 Median :25.00
## Mean : 68.03 Mean : 4.493 Wed:295 Aug :186 Mean :25.59
## 3rd Qu.: 80.11 3rd Qu.: 5.470 Thu:295 Jan :184 3rd Qu.:38.00
## Max. :297.71 Max. :11.100 Fri:296 Apr :180 Max. :53.00
## Sat:296 (Other):959
## Year
## Min. :2014
## 1st Qu.:2015
## Median :2016
## Mean :2016
## 3rd Qu.:2018
## Max. :2019
##
Below we start plotting some of the main variables.
Given that 2019 is a partial year with only eight months of data, it has been excluded from this initial piece of exploratory analysis.
# Create Paris subset to exclude partial 2019 data
pdset_14_18 <- subset(pdset, Year <= 2018)
vdset_14_18 <- subset(vdset, Year <= 2018)
# Define colours
barfill = "gold1"
barlines = "goldenrod2"
# Inbound passengers by Year Ex 2019
p_hist_tt_y <- ggplot(data = pdset_14_18, aes(x = Year, y = Total_trav)) +
geom_bar(stat="identity", fill = barfill) +
theme_light() +
labs(x = "Year", y = "Total Inbound Passengers") +
scale_y_continuous(labels = comma) +
ggtitle("Total Chinese Passengers Arrivals Direct to Paris by Year 2014 - 2018")
# Inbound passengers by Month Ex 2019
p_hist_tt_m <- ggplot(data = pdset_14_18, aes(x = Month, y = Total_trav)) +
geom_bar(stat="identity", fill = barfill) +
theme_light() +
labs(x = "Month", y = "Total Inbound Passengers") +
scale_y_continuous(labels = comma) +
ggtitle("Total Chinese Passengers Arrivals Direct to Paris by Month 2014 - 2018")
# Inbound passengers by Day of Week Ex 2019
p_hist_tt_d <- ggplot(data = pdset_14_18, aes(x = WeekDay, y = Total_trav)) +
geom_bar(stat="identity", fill = barfill) +
theme_light() +
labs(x = "Week Day", y = "Total Inbound Passengers") +
scale_y_continuous(labels = comma) +
ggtitle("Total Chinese Passengers Arrivals Direct to Paris by Weekday 2014 - 2018")
# Inbound passengers by Week Ex 2019
p_hist_tt_wk <- ggplot(data = pdset_14_18, aes(x = Week, y = Total_trav)) +
geom_bar(stat="identity", fill = barfill) +
theme_light() +
labs(x = "Week", y = "Total Inbound Passengers") +
scale_y_continuous(labels = comma) +
ggtitle("Total Chinese Passengers Arrivals Direct to Paris by Week 2014 - 2018")
# Plot grid layout using plot_grid function
plot_grid(p_hist_tt_y, p_hist_tt_m, p_hist_tt_d, p_hist_tt_wk, nrow = 2)
One can clearly see the drop-off in inbound passengers in 2016, which can be attributed to the Paris terrorist attacks of November 2015. Whilst 2016 received the lowest number of inbound direct visitors in our dataset, there has been an uptick in subsequent years, reflecting increased tourism confidence in Paris.
From a month perspective, we can see how the strongest period is June through September, with July being the strongest month.
Saturday is the strongest day of the week, followed by Sunday, with Tuesday being the lowest.
The pattern is very easy to observe when we plot the weeks: a strong peak during July, a decline in August and a pick-up in September, then a steady decline to the end of the year.
In conclusion, we can observe a strong pattern that repeats each year, with a peak during the summer months and a slight dip in August that picks up in September. Weekends are the strongest days of the week. In terms of years, 2015 showed the largest flows of Chinese tourists, with 2017 and 2018 showing improved performance over 2016.
# Subset dataset for arrivals, month and year
pdset_14_18_m_y <- pdset_14_18%>%dplyr::select(Month, Year, Total_trav)%>%dplyr::group_by(Month, Year)%>%dplyr::summarise(Total_trav = sum(Total_trav))
# Create line chart
line_chart_paris <- ggplot(data = pdset_14_18_m_y, aes(x = Month, y = Total_trav, group = Year, colour = as.factor(Year))) +
geom_line() +
geom_point() +
labs(x="Month", y="Passenger Arrivals", colour="Year") +
scale_y_continuous(labels = comma) +
ggtitle("Chinese Passenger Arrival to Paris by Year by Month") +
theme_light() +
theme(legend.position="top")
# Plot line chart
line_chart_paris
Not surprisingly, the arrival of Chinese passengers in Paris peaks during the summer months, with July being the most popular month. While we do not show the equivalent chart for Venice here, when we looked at the Venice data we noticed that the prominence of the summer months is more pronounced for Paris, which means DFS will potentially need to carry a somewhat larger level of summer inventory targeted at Chinese clients in Paris relative to Venice.
In addition, since 2015 the flow of Chinese passengers to Paris seems to have become a bit more even. While the summer months are still the most popular, we now observe relatively higher levels of visitors in other months of the year as well, particularly fall and spring, while the winter months (November to March) represent the lowest levels of Chinese tourism activity.
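For reference, the equivalent monthly chart for Venice can be produced with the same approach, re-using the Venice subset created above (a minimal sketch; line_chart_venice and vdset_14_18_m_y are our own object names):
# Subset Venice dataset for arrivals, month and year
vdset_14_18_m_y <- vdset_14_18%>%dplyr::select(Month, Year, Total_trav)%>%dplyr::group_by(Month, Year)%>%dplyr::summarise(Total_trav = sum(Total_trav))
# Create line chart for Venice
line_chart_venice <- ggplot(data = vdset_14_18_m_y, aes(x = Month, y = Total_trav, group = Year, colour = as.factor(Year))) +
geom_line() +
geom_point() +
labs(x="Month", y="Passenger Arrivals", colour="Year") +
scale_y_continuous(labels = comma) +
ggtitle("Chinese Passenger Arrival to Venice by Year by Month") +
theme_light() +
theme(legend.position="top")
# Plot line chart
line_chart_venice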
The ForwardKeys dataset allows us to assess arrivals by length of stay.
# Inbound passengers by LoS Ex 2019
# Select required rows from data frame
hist_tt_y_los <- subset(pdset_14_18, select = c("LoS_1night", "LoS_2nights", "LoS_3nights", "LoS_4to5nights","LoS_6to8nights", "LoS_9to13nights", "LoS_14to21nights", "LoS_22nightsormore", "Year"))
names(hist_tt_y_los) <- c("Los 1 Night", "Los 2 Nights", "LoS 3 Nights", "Los 4 to 5 Nights", "Los 6 to 8 Nights", "LoS 9 to 13 Nights", "Los 14 to 21 Nights", "Los 21 Nights or More", "Year")
hist_tt_y_los <- aggregate(hist_tt_y_los, by = list(hist_tt_y_los$Year), FUN = sum)
drops <- c("Year")
# Melt into desired format
hist_tt_y_los <- hist_tt_y_los[ , !names(hist_tt_y_los) %in% drops]
names(hist_tt_y_los) <- c("Year", "Los 1 Night", "Los 2 Nights", "LoS 3 Nights", "Los 4 to 5 Nights", "Los 6 to 8 Nights", "LoS 9 to 13 Nights", "Los 14 to 21 Nights", "Los 21 Nights or More")
hist_tt_y_los_melt <- melt(hist_tt_y_los, id="Year")
# Transpose the data set
hist_tt_y_los_transpose <- data.frame(t(hist_tt_y_los))
# Make first row the column names and delete first row
names(hist_tt_y_los_transpose) <- as.matrix(hist_tt_y_los_transpose[1, ])
hist_tt_y_los_transpose <- hist_tt_y_los_transpose[-1, ]
hist_tt_y_los_transpose[] <- lapply(hist_tt_y_los_transpose, function(x) type.convert(as.character(x)))
# Create new target column
hist_tt_y_los_transpose <- as.data.frame(hist_tt_y_los_transpose)
hist_tt_y_los_transpose$LoS <- rownames(hist_tt_y_los_transpose)
# Create first ID row
row.names(hist_tt_y_los_transpose) <- NULL
# Add Total column
hist_tt_y_los_transpose$Total <- hist_tt_y_los_transpose$`2014` +
hist_tt_y_los_transpose$`2015` +
hist_tt_y_los_transpose$`2016` +
hist_tt_y_los_transpose$`2017` +
hist_tt_y_los_transpose$`2018`
# Add levels to maintain order
hist_tt_y_los_transpose$LoS <- factor(hist_tt_y_los_transpose$LoS, levels = hist_tt_y_los_transpose$LoS)
p_hist_tt_los_barchart <- ggplot(data = hist_tt_y_los_transpose, aes(x = LoS, y = Total)) +
geom_bar(stat="identity", fill = barfill) +
theme_light() +
labs(x = "Length of Stay", y = "Total Inbound Passengers") +
scale_y_continuous(labels = comma) +
ggtitle("Total Chinese Arrivals to Paris by Length of Stay 2014 - 2018")
hist_tt_y_los <- subset(pdset_14_18, select = c("LoS_1night", "LoS_2nights", "LoS_3nights", "LoS_4to5nights","LoS_6to8nights", "LoS_9to13nights", "LoS_14to21nights", "LoS_22nightsormore", "Date"))
names(hist_tt_y_los) <- c("Los 1 Night", "Los 2 Nights", "LoS 3 Nights", "Los 4 to 5 Nights", "Los 6 to 8 Nights", "LoS 9 to 13 Nights", "Los 14 to 21 Nights", "Los 21 Nights or More", "Date")
hist_tt_y_los_melt <- melt(hist_tt_y_los, id = "Date")
p_hist_tt_y_los_boxplot <- ggplot(hist_tt_y_los_melt, aes(x = variable, y = value))+
geom_boxplot(alpha = 0) +
geom_jitter(alpha = 0.1, color = "tomato") +
theme_light() +
labs(x = "Length of Stay", y = "Daily Arrivals Volume") +
ggtitle("Length of Chinese Arrivals to Paris by Daily Arrival Volume")
# Plot charts in grid
plot_grid(p_hist_tt_los_barchart + coord_flip(), p_hist_tt_y_los_boxplot + coord_flip(), nrow = 1)
According to the ForwardKeys data, a total of 1.152mn Chinese travelers landed in Paris between 2014 and 2018.
Chinese tourists traveling into Paris generally stay several nights in the city. Short term stay categories (1 night, 2 nights and 3 nights) together only represent 16% of the total bookings, while 4 to 5 nights represent the most popular category with 38% of the bookings, and collectively stays above 3 nights represent 84% of the total bookings.
After 4 to 5 nights, the most popular category is 6 to 8 nights, which represents 22.8% of travelers, while people staying only one night before proceeding to another destination represented only around 2% of the total travelers.
This positions Paris as an attractive city for luxury retailers. Given the average length of stay of Chinese passengers, DFS can have multiple touch points with potential clients (from airport display ads to the boutique in the city) throughout their stay. Also, the same client could potentially drop by the same boutique more than once (unlike Chinese tourists on bus tours who spend a day or less in each city they visit).
However, this also creates a particular challenge: arguably, the longer length of stay gives the potential client more time to compare prices and products across different outlets when making a purchase decision, meaning that, for customers with the same price elasticity, DFS prices should probably not be too far from market prices for comparable products.
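These shares can be checked directly from the 2014 - 2018 subset; a minimal sketch (los_cols is our own helper vector):
# Share of total arrivals by length-of-stay bracket, 2014 - 2018 (sketch)
los_cols <- c("LoS_1night", "LoS_2nights", "LoS_3nights", "LoS_4to5nights", "LoS_6to8nights", "LoS_9to13nights", "LoS_14to21nights", "LoS_22nightsormore")
los_totals <- colSums(pdset_14_18[, los_cols], na.rm = TRUE)
round(100 * los_totals / sum(los_totals), 1)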
# Create melted data set for arrivals by length of stay by year
hist_tt_y_los <- subset(pdset_14_18, select = c("LoS_1night", "LoS_2nights", "LoS_3nights", "LoS_4to5nights","LoS_6to8nights", "LoS_9to13nights", "LoS_14to21nights", "LoS_22nightsormore", "Year"))
hist_tt_y_los <- aggregate(hist_tt_y_los, by = list(hist_tt_y_los$Year), FUN = sum)
drops <- c("Year")
hist_tt_y_los <- hist_tt_y_los[ , !names(hist_tt_y_los) %in% drops]
names(hist_tt_y_los) <- c("Year", "Los 1 Night", "Los 2 Nights", "LoS 3 Nights", "Los 4 to 5 Nights", "Los 6 to 8 Nights", "LoS 9 to 13 Nights", "Los 14 to 21 Nights", "Los 21 Nights or More")
hist_tt_y_los_melt <- melt(hist_tt_y_los, id="Year")
names(hist_tt_y_los_melt) <- c("Year", "Length_of_Stay", "Value")
hist_tt_y_los_melt$Year <- as.factor(hist_tt_y_los_melt$Year)
p_hist_tt_y_los_melt_heatmap <- ggplot(hist_tt_y_los_melt, aes(x = hist_tt_y_los_melt$Year, y = hist_tt_y_los_melt$Length_of_Stay)) +
geom_tile(aes(fill = Value)) +
scale_x_discrete("Year", expand = c(0, 0)) +
scale_y_discrete("Length of Stay", expand = c(0, 0)) +
scale_fill_gradient("Number of Arrivals", low = "white", high = "steelblue" ) +
theme_bw() +
ggtitle("Length of Chinese Arrival to Paris by Year") +
theme(panel.grid.major = element_line(colour = NA), panel.grid.minor = element_line(colour = NA))
# Create melted data set for arrivals by length of stay by month
hist_tt_m_los <- subset(pdset_14_18, select = c("LoS_1night", "LoS_2nights", "LoS_3nights", "LoS_4to5nights","LoS_6to8nights", "LoS_9to13nights", "LoS_14to21nights", "LoS_22nightsormore", "Month"))
hist_tt_m_los$Month <- as.integer(hist_tt_m_los$Month)
hist_tt_m_los <- aggregate(hist_tt_m_los, by = list(hist_tt_m_los$Month), FUN = sum)
drops <- c("Month")
hist_tt_m_los <- hist_tt_m_los[ , !names(hist_tt_m_los) %in% drops]
names(hist_tt_m_los) <- c("Month", "Los 1 Night", "Los 2 Nights", "LoS 3 Nights", "Los 4 to 5 Nights", "Los 6 to 8 Nights", "LoS 9 to 13 Nights", "Los 14 to 21 Nights", "Los 21 Nights or More")
hist_tt_m_los_melt <- melt(hist_tt_m_los, id="Month")
names(hist_tt_m_los_melt) <- c("Month", "Length_of_Stay", "Value")
hist_tt_m_los_melt$Month <- as.factor(hist_tt_m_los_melt$Month)
p_hist_tt_m_los_heatmap <- ggplot(hist_tt_m_los_melt, aes(x = hist_tt_m_los_melt$Month, y = hist_tt_m_los_melt$Length_of_Stay)) +
geom_tile(aes(fill = Value)) +
scale_x_discrete("Month", expand = c(0, 0)) +
scale_y_discrete("Length of Stay", expand = c(0, 0)) +
scale_fill_gradient("Number of Arrivals", low = "white", high = "steelblue" ) +
theme_bw() +
ggtitle("Length of Chinese Arrival to Paris by Month") +
theme(panel.grid.major = element_line(colour = NA), panel.grid.minor = element_line(colour = NA))
# Plot heatmaps
plot_grid(p_hist_tt_y_los_melt_heatmap, p_hist_tt_m_los_heatmap, nrow = 2)
We also looked at the evolution of the length of stay from 2014 to 2018. While Paris has the advantage of being a city where incoming Chinese passengers spend several nights, which creates a major commercial opportunity for DFS, we also observe that volumes in the most popular 4 to 5 nights category declined visibly between 2014 and 2018, while the other categories remained relatively unchanged. However, as we will discuss in the next section, the underlying story here is slightly different.
Looking at the arrival data on a monthly basis, and analyzing each stay-length category, we see that the overall length of stay generally seems to increase in the summer months for the most popular categories (4 to 5 nights and 6 to 8 nights), while in the shorter-stay categories the seasonal changes are muted.
Therefore we believe the summer months will be of utmost importance to DFS, as it should try to leverage the increased number of stays by Chinese tourists. To capitalize on the opportunity, DFS should increase its marketing and advertising budget for the summer to establish multiple contact points with Chinese clients (starting with display ads at the airport) and run in-store promotions to trigger return visits (such as 5% discount coupons with each purchase, valid for 24-48 hours towards a second purchase) to increase store traffic. This also means DFS will potentially need to carry a larger inventory as well as a larger store sales staff during the peak summer months.
# Select required rows from data frame
hist_tt_y_los <- subset(pdset_14_18, select = c("LoS_1night", "LoS_2nights", "LoS_3nights", "LoS_4to5nights","LoS_6to8nights", "LoS_9to13nights", "LoS_14to21nights", "LoS_22nightsormore", "Year"))
names(hist_tt_y_los) <- c("Los 1 Night", "Los 2 Nights", "LoS 3 Nights", "Los 4 to 5 Nights", "Los 6 to 8 Nights", "LoS 9 to 13 Nights", "Los 14 to 21 Nights", "Los 21 Nights or More", "Year")
hist_tt_y_los <- aggregate(hist_tt_y_los, by = list(hist_tt_y_los$Year), FUN = sum)
drops <- c("Year")
# Melt into desired format
hist_tt_y_los <- hist_tt_y_los[ , !names(hist_tt_y_los) %in% drops]
names(hist_tt_y_los) <- c("Year", "Los 1 Night", "Los 2 Nights", "LoS 3 Nights", "Los 4 to 5 Nights", "Los 6 to 8 Nights", "LoS 9 to 13 Nights", "Los 14 to 21 Nights", "Los 21 Nights or More")
hist_tt_y_los_melt <- melt(hist_tt_y_los, id="Year")
# Transpose the data set
hist_tt_y_los_transpose <- data.frame(t(hist_tt_y_los))
# Make first row the column names and delete first row
names(hist_tt_y_los_transpose) <- as.matrix(hist_tt_y_los_transpose[1, ])
hist_tt_y_los_transpose <- hist_tt_y_los_transpose[-1, ]
hist_tt_y_los_transpose[] <- lapply(hist_tt_y_los_transpose, function(x) type.convert(as.character(x)))
# Create new target column
hist_tt_y_los_transpose <- as.data.frame(hist_tt_y_los_transpose)
hist_tt_y_los_transpose$LoS <- rownames(hist_tt_y_los_transpose)
# Create first ID row
row.names(hist_tt_y_los_transpose) <- NULL
# Add delta column
hist_tt_y_los_transpose$delta <- hist_tt_y_los_transpose$`2018` - hist_tt_y_los_transpose$`2014`
hist_tt_y_los_transpose$delta2 <- hist_tt_y_los_transpose$`2018` - hist_tt_y_los_transpose$`2016`
# Add levels to maintain order
hist_tt_y_los_transpose$LoS <- factor(hist_tt_y_los_transpose$LoS, levels = hist_tt_y_los_transpose$LoS)
# Create Delta chart
p_hist_tt_y_los_delta <- ggplot(data = hist_tt_y_los_transpose, aes(x = LoS, y = delta, label = delta)) +
geom_point(stat="identity", fill = "black", size = 12) +
theme_light() +
geom_text(color = "white", size = 4) +
geom_segment(aes(y = 0, x = `LoS`, yend = delta, xend = `LoS`), color = "black") +
labs(x = "Passenger Length of Stay", y = "Delta Between 2018 and 2014") +
ggtitle("Delta in Chinese Arrivals to Paris by Length of Stay 2014 to 2018") +
scale_y_continuous(labels = comma)
p_hist_tt_y_los_delta2 <- ggplot(data = hist_tt_y_los_transpose, aes(x = LoS, y = delta2, label = delta2)) +
geom_point(stat="identity", fill = "black", size = 12) +
theme_light() +
geom_text(color = "white", size = 4) +
geom_segment(aes(y = 0, x = `LoS`, yend = delta2, xend = `LoS`), color = "black") +
labs(x = "Passenger Length of Stay", y = "Delta Between 2018 and 2016") +
ggtitle("Delta in Chinese Arrivals to Paris by Length of Stay 2016 to 2018") +
scale_y_continuous(labels = comma)
plot_grid(p_hist_tt_y_los_delta + coord_flip(), p_hist_tt_y_los_delta2 + coord_flip(), nrow = 1)
Length of Stay Conclusions
As discussed earlier, while a large majority of Chinese passengers traveling to Paris stay 4 nights or longer, we also note that over the past 4 years (2014 to 2018) there has been a visible decline (about 40,000 people) in the bracket of tourists who stay 4 to 5 nights, alongside an increase in the 6 to 8 nights and 9 to 13 nights categories, while the change in other categories was relatively muted. The delta analysis clearly shows these trends.
To understand the drivers of the decline, we did some Google searches on Parisian tourism statistics, and the results immediately pointed to the devastating terrorist attacks in the city in 2015 and their negative impact on tourism flows. To look at the trend, we split our dataset, took out 2015 (the year of the attacks) and then looked at the trends in length of stay from 2016 to 2018 (the recovery period); the data clearly show a recovery. While the negative delta for 4 to 5 nights is about 40,000 between 2014 and 2018 (which includes the events of 2015), the delta from 2016 to 2018 is positive at about 4,100 bookings. Similarly, 6 to 8 nights stays increased by almost 10,000 in the same period and 9 to 13 nights by about 5,000.
Therefore, isolating the aftermath of the 2015 attacks, we don’t necessarily see the average length of stays in Paris declining visibly over the next few years.
The ForwardKeys dataset allows us to assess arrivals by number of passengers per booking.
Paris seems to be a very popular tourism destination for one-person bookings from China. While a good portion of these can be singles traveling, we note that (as we will see in the next sections) Paris is an important business travel destination for the Chinese as well. One-person bookings represent around 30% of total passengers for the 2014-2018 period, followed by two-person bookings (which we believe are largely couples, representing 21% of bookings); the third-largest group is 10+ people (20%), who are group travelers, another important source of tourists for Paris.
This is important demographic information for DFS. The company should potentially do some market research on the type of products bought by single leisure and business travelers, as well as couples, in addition to research on group purchases, as these three categories represent over 70% of the Chinese travelers flying into Paris.
Given the important role of client demographics for DFS, we wanted to look at the passengers-per-booking data in a bit more depth. We first looked at the evolution of the categories on an annual basis between 2014 and 2018. As the heatmap shows, while most categories seem relatively stable, it is very easy to see a sharp decline in the 10+ passengers per booking category. While this group currently represents about 20%, it had a visibly larger share of total tourists in 2014; the category started a sharp decline in 2015 and has not really recovered since.
We also looked at the passengers-per-booking data on a monthly basis. While most categories are relatively stable (the 2-, 3-, 4- and 5-passenger categories), seasonality is much more pronounced for the most popular categories: 1-passenger, 2-passenger and 10+ passenger bookings. For single-person and two-person bookings the months of June, July and September are the most popular, while July and August are the most popular months for 10+ person bookings.
This is crucial data for DFS, as it can adjust its inventory and product mix by taking into account the differing preferences of each client category. For example, in August DFS can prioritize the most popular items for group shoppers in its store, while in June and July it can try to cater more to single and couple clients.
The delta analysis shows the acute decline in group bookings more clearly. Between 2014 and 2018 the total decline in this group (10+ persons per booking) was about 5,600, while the decline between 2016 and 2018 was even stronger at about 7,500. Clearly, this category continues to decline while single and two-person bookings continue to gain popularity. Single bookings increased by 4,850 between 2014 and 2018, but by 8,700 between 2016 and 2018, showing that this category is really gaining momentum. Similarly, two-person bookings are up by about 7,100 in the same period and three-passenger bookings are up by 4,200.
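The deltas quoted above follow the same approach as the length-of-stay delta analysis; a minimal sketch of one way to reproduce them (ppb_by_year and the delta columns are our own names):
# Yearly totals per passengers-per-booking bracket, then 2014-2018 and 2016-2018 deltas (sketch)
ppb_cols <- c("ppb_1pax", "ppb_2pax", "ppb_3pax", "ppb_4pax", "ppb_5pax", "ppb_6to9pax", "ppb_10pl_pax")
ppb_by_year <- aggregate(pdset_14_18[, ppb_cols], by = list(Year = pdset_14_18$Year), FUN = sum)
d14 <- unlist(ppb_by_year[ppb_by_year$Year == 2014, ppb_cols])
d16 <- unlist(ppb_by_year[ppb_by_year$Year == 2016, ppb_cols])
d18 <- unlist(ppb_by_year[ppb_by_year$Year == 2018, ppb_cols])
data.frame(ppb = ppb_cols, delta_14_18 = d18 - d14, delta_16_18 = d18 - d16)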
Passengers per Booking Conclusions
Looking at the data, we believe the popularity of Paris as a destination for single and two-person travelers is here to stay for the next few years, while 10+ person bookings continue to decline.
The ForwardKeys dataset allows us to assess arrivals by booking lead time.
The data show that about 18% of travelers purchase their tickets between 120 and 364 days in advance, while the largest group consists of relatively last-minute travelers who booked their tickets 15 to 29 days in advance. Adding the 5 to 14 days category to this group, the total number of late travel planners reaches about 35% of all Chinese travelers to Paris.
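The roughly 35% late-planner figure can be verified with a quick share calculation on the lead-time columns (a minimal sketch; late_cols is our own grouping of the 5 to 14 and 15 to 29 days brackets):
# Share of late planners among all lead-time brackets, 2014 - 2018 (sketch)
late_cols <- c("LeadT_5to14days", "LeadT_15to29days")
all_leadt <- grep("^LeadT_", names(pdset_14_18), value = TRUE)
round(100 * sum(colSums(pdset_14_18[, late_cols], na.rm = TRUE)) / sum(colSums(pdset_14_18[, all_leadt], na.rm = TRUE)), 1)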
Looking at the evolution of booking lead time over the past 4 years, we can see that early planners (120 to 364 days) declined visibly, while there were relatively limited movements in the other categories. Looking at the monthly data, we can see some seasonality in relatively late bookings, with some pickup during June, July and September. Similarly, July and August seem to be the most popular months for early planners (120 to 364 days) as well.
Delta analysis further confirms the decline in early planners (120 to 364 days), while the trends for other categories in recent years (2016 - 2018) seem to be visibly positive. One other category that stands out is 0 to 4 days, where a decline of around 1,400 was registered in 2016 - 2018; however, this is a very small portion of the total passenger data.
Booking Lead Time Conclusions
Booking lead time shows that about 35% of the population purchases their tickets relatively late (within one month of travel). While the remaining categories are relatively balanced, the second-largest group is the early purchasers, as discussed earlier.
The ForwardKeys dataset allows us to assess arrivals by cabin type.
Economy class passengers represented about 85% of the total, while business class passengers account for about 10% of Chinese passengers traveling to Paris. This is broadly in line with many international destinations.
The annual statistics show that the cabin preferences of Chinese passengers to Paris are (as expected) quite stable; we don’t see any emerging trends there. We also looked at the data on a monthly basis: there seems to be some increase in economy class activity in the summer months, particularly July, which we believe is in line with the increased level of arrivals during the summer.
Delta analysis shows a relatively balanced level of increase among different cabin types between 2016 - 2018.
Cabin Type Conclusions
We do not see any particularly valuable insights in the cabin data. As with most destinations, about 85% of tickets are economy, and there is very limited (if any) movement across years and limited monthly variation.
The ForwardKeys dataset allows us to assess arrivals by distribution channel.
Bookings via retail represent close to 70% of the bookings, while corporate bookings and online bookings each represent about 13% of the total bookings.
Heatmaps (annual evolution and monthly variation) provide very limited additional insight. Retail bookings increase during the summer months in line with overall tourism demand, reaching their peak in July.
Delta analysis shows that retail bookings have increased visibly at the expense of other categories over the past few years. It is particularly interesting to see the decline in online bookings. Based on our business understanding, we know that many Chinese tourists face a considerable language barrier and thus require booking through a retail agency to facilitate travel arrangements and on-the-ground guidance.
Distribution Type Conclusions
Retail bookings are a key source of bookings and are gaining further popularity.
The ForwardKeys dataset allows us to assess arrivals by return or multicity ticket type.
63% of the passengers traveling into Paris are multicity travelers, going to another destination after Paris. This is not surprising as Paris is an aviation hub for Continental Europe. Return flights represent the remaining 37%.
The data show that after 2015 there was some decline in multicity flights, while the figures have been fairly stable over the past few years. The summer months of July and August show an increased level of multicity flights, which we believe is in line with the general increase in touristic flights.
Delta analysis shows that between 2016 and 2018 multicity bookings were quite stable, while there was a continued increase in return flights.
Multicity Ticket Type Conclusions
Looking at 2016 - 2018, there was an increase of about 8,500 in return flights, which is not a very large number. Therefore we don’t see particular trends in this data.
The ForwardKeys dataset allows us to assess arrivals by passenger profile.
Paris is one of the most popular tourism destinations in Europe, and 39% of the passenger travel is leisure related. As an important hub hosting several pan-European corporate headquarters, around 36% of flights to the city are business flights, while the remainder are group travelers.
The heatmap further confirms the decline in group travelers over the past few years (as discussed in earlier sections). Despite some year-over-year movements, the leisure and business travel trends remain stable. In 2018 we saw a visible increase in leisure travel. Not surprisingly, the monthly analysis shows an increased level of leisure flights in the popular summer months, while business travel does not show significant seasonality.
Delta analysis also confirms the decline in group travelers, while we continue to see a relatively higher increase in leisure travelers in the 2016 - 2018 period.
Passenger Profile Type Conclusions
Given both the leisure and business travelers are relatively balanced (39% vs 36% of total travelers), DFS will need to cater to both audiences in its Parisian shop.
The ForwardKeys dataset allows us to assess arrivals by type of stay (weekend, workweek or combined).
When we look at the type of days (weekend/weekday) Chinese passengers spend in Paris, not surprisingly we see that combined stays represent close to 90% of stays (as the average length of stay is above 5 days for the sample, most passengers, regardless of their arrival date, end up spending both weekend days and weekdays in the city).
The combined stay category increases in the summer months, peaking in July, as the average length of stay increases during the summer (meaning more people who arrive during the week also spend weekend days in the city, and vice versa).
The combined (weekday + weekend) stay has been increasing to a certain extent, in line with the increasing length of stay in the city.
Conclusion
Given that the average length of stay in the city is above 5 days, most Chinese tourists (about 90%) get to see the city both during the week and during the weekend, which means DFS should cater to the different dynamics of both weekend and weekday shopping. The weekday shopping experience for the non-resident tourist will be different from the weekend one.
Whilst the current focus is on Paris and predicting passenger flows into that location, given that DFS has a store in Venice, in the future we will look to understand passenger arrivals into that location and their relationship with sales at the Venice store.
However, at this stage we wanted to do some preliminary analysis to compare the two locations.
Whilst the volumes at an annual level appear quite different, perhaps most impacted by the terrorist attacks in Paris in November 2015, the profiles of the other charts appear quite similar.
While the Parisian passenger traffic has been recovering over the past few years from the 2015 terrorist attacks and is on a steady growth trajectory, Venice does not seem to have a growth momentum. After a contraction in 2016, the passenger volumes grew in 2017 only to decline again in 2018. Therefore Paris seems to have better business prospects.
Peak month for travel is July for both cities, with December receiving the lowest number of arrivals.
Saturday is the most common day of arrival, with Friday the least popular. This perhaps reflects the distance required to travel: passengers leaving China on a Friday following a week of work will arrive in Europe on the Saturday. Comparing the two cities, the most popular arrival days in Paris are clearly the weekend days (Saturday followed by Sunday), while for Venice, Saturday and Wednesday are the most popular arrival days.
Arrivals by week show a very similar pattern. There is a small peak around weeks 6-8 in both cases, before a lull and then a steady rise to the peak summer months, followed by a large drop-off from week 35. The Golden Week annual holiday at the beginning of October drives a large increase over that period, before a continued steady decline towards the winter period.
While the weekly distribution data clearly shows that both cities are subject to visible seasonality throughout the year, the seasonality in Venice seems to be somewhat less pronounced.
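A minimal sketch of how the two weekly profiles can be placed side by side, re-using the bar style and the 2014 - 2018 subsets defined above (the chart object names are our own):
# Weekly arrivals, Paris vs Venice, 2014 - 2018 (sketch)
p_week_paris <- ggplot(data = pdset_14_18, aes(x = Week, y = Total_trav)) +
geom_bar(stat="identity", fill = barfill) +
theme_light() +
labs(x = "Week", y = "Total Inbound Passengers") +
scale_y_continuous(labels = comma) +
ggtitle("Paris: Chinese Arrivals by Week 2014 - 2018")
p_week_venice <- ggplot(data = vdset_14_18, aes(x = Week, y = Total_trav)) +
geom_bar(stat="identity", fill = barfill) +
theme_light() +
labs(x = "Week", y = "Total Inbound Passengers") +
scale_y_continuous(labels = comma) +
ggtitle("Venice: Chinese Arrivals by Week 2014 - 2018")
plot_grid(p_week_paris, p_week_venice, nrow = 2)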
Comparing passengers per booking data for both cities, the large groups of 10+ passengers per booking have declined across both Paris and Venice. For Paris, this was accompanied by a visible increase in the 1 and 2 passengers per booking category, while this trend was visibly less pronounced for Venice.
Arrivals by Cabin type appear quite different, likely as a result of the attacks in Paris in November 2015. Economy passengers appear to have been most impacted by the events in Paris but are showing signs of recovery from 2016. However, post 2016, Venice has seen a slight decline in economy arrivals, though an increase across the overall 5 year period.
Recent trends from 2016 would indicate a shift towards return tickets, aligning with the notion of less large group travelers who would arrive on multicity tickets.
Delta analysis shows that the group profile passengers are on the decline for both cities. While this is clearly accompanied by a rise in the leisure and business travelers for Paris, the effect looks less pronounced for Venice.
Conclusion
Both cities receive similar levels of leisure and business travelers, which together constitute over 70% of the passenger statistics for both cities. DFS potentially already has a marketing and inventory management strategy to cater to these two distinct groups at its Venice store, and given the similarity of the two cities, it can potentially leverage aspects of its existing Venice strategy in Paris with a few tweaks.
The chart above compares the passenger flows to Paris versus Venice.
We analyzed the correlation between the total travelers during the whole period to both cities.
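The output below comes from a Pearson correlation test on the two aligned daily series:
# Correlation between daily Chinese arrivals to Paris and Venice
cor.test(pdset$Total_trav, vdset$Total_trav)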
##
## Pearson's product-moment correlation
##
## data: pdset$Total_trav and vdset$Total_trav
## t = 24.05, df = 2065, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4333990 0.5007876
## sample estimates:
## cor
## 0.4677729
General Conclusion From Our Exploratory Analysis
We can talk about many factors here, but five key insights for us are as follows:
While Chinese traveler volumes to Paris still remain well below the figures achieved in 2015, the market continues to recover at a healthy pace. Therefore there is a growth opportunity in Paris, while Venice does not seem to offer the same growth.
A very important quality of the Chinese arriving in Paris is the number of nights they stay, which creates an important opportunity:
Short-term stay categories (1, 2 and 3 nights) together represent only 16% of total bookings, while 4 to 5 nights is the most popular category with 38% of bookings; collectively, stays above 3 nights represent 84% of total bookings. Given the average length of stay of Chinese passengers, DFS can have multiple touch points with potential clients (from airport display ads to the boutique in the city) throughout their stay. Also, the same client could potentially drop by the same boutique more than once (unlike Chinese tourists on bus tours who spend a day or less in each city they visit). However, this also creates a particular challenge: arguably, the longer length of stay gives the potential client more time to compare prices and products across different outlets when making a purchase decision, meaning that, for customers with the same price elasticity, DFS prices should probably not be too far from market prices for comparable products.
Summer months are key for sales: Chinese passenger arrivals to the city display pronounced seasonality. Arrival volumes between June and September are markedly higher than in the remaining months of the year, which means DFS should have an aggressive marketing and promotion strategy during the summer months to capitalize on the significant increase in volumes. It also means it will have to carry a substantially higher level of inventory and employ a higher number of part-time sales staff during the summer months. While over the past few years the seasonality seems to have become slightly less pronounced, the dominance of the summer months is potentially here to stay for the foreseeable future. In addition, looking at the arrival data on a monthly basis and analyzing each stay-length category, we see that the overall length of stay generally increases in the summer months for the most popular categories (4 to 5 nights and 6 to 8 nights), while in the shorter-stay categories the seasonal changes are muted. Therefore we believe the summer months will be of utmost importance to DFS, as the company should try to leverage the increased number of stays by Chinese tourists. To capitalize on the opportunity, DFS should increase its marketing and advertising budget for the summer to establish multiple contact points with Chinese clients (starting with display ads at the airport) and run in-store promotions to trigger return visits (such as 5% discount coupons with each purchase, valid for 24-48 hours towards a second purchase) to increase store traffic.
Paris seems to be a very popular tourism destination for one-person bookings from China. While a good portion of these can be singles traveling, we note that (as discussed in earlier sections) Paris is an important business travel destination for the Chinese as well. One-person bookings represent around 30% of total passengers for the 2014-2018 period, followed by two-person bookings (which we believe are largely couples, representing 21% of bookings); the third-largest group is 10+ people (20%), who are group travelers, another important source of tourists for Paris. In addition, our delta analysis shows that one-person and two-person bookings are growing visibly faster than other categories. Therefore the dominance of these two categories seems likely to persist for some time.
This is important demographic information for DFS. The company should potentially do some market research on the type of products bought by single leisure and business travelers, as well as couples (two-person bookings), in addition to research on group purchases, as these three categories represent over 70% of the Chinese travelers flying into Paris.
The goal of this exercise is the following:
In order to build a predictive model, we first need to prepare the dataset by applying a 30-day lag to all the numeric predictors (all variables except the target variable, the date and the date-related fields).
#Select only those columns that we need to lag
pdsetlag30 <- dplyr::select(pdset, -c('Date', 'Total_trav', 'WeekDay', 'Month', 'Week', 'Year'))
#Lag the columns 30 days using mapply and convert to DF
pdsetlag30 <-data.frame(mapply(FUN = dplyr::lag, n = 30, pdsetlag30, SIMPLIFY = TRUE, USE.NAMES = TRUE))
#Rebuild dataframe with all contents and lagged variables
pdsetlag30 <- cbind(dplyr::select(pdset, c('Date', 'Total_trav')), pdsetlag30, dplyr::select(pdset, c('WeekDay', 'Month', 'Week', 'Year')))
#Rename columns
colnames(pdsetlag30) <- colnames(pdset)
#Add lag30 days of total travelers
pdsetlag30 <- dplyr::mutate(pdsetlag30,
lag30_totaltravelers=lag(pdset$Total_trav,n = 30, default = NA))
We have identified a number of predictors that can potentially help in our prediction of the total travelers.
In order to make a determination on what predictors to keep, we first calculated the correlations of all the potential predictors with total travelers.
model.cor <- as.data.frame(t(cor(as.matrix(pdsetlag30 %>% dplyr::filter(Date > '2014-01-31') %>% dplyr::select('Total_trav')),as.matrix(pdsetlag30 %>% dplyr::filter(Date > '2014-01-31') %>% dplyr::select(-c('Date', 'WeekDay', 'Month', 'Week', 'Year'))))))
kable(model.cor[order(-model.cor$Total_trav), , drop = FALSE])
 | Total_trav |
---|---|
Total_trav | 1.0000000 |
lag30_totaltravelers | 1.0000000 |
TrDur_22nightsormore | 0.3840050 |
LoS_22nightsormore | 0.3526264 |
Sign_InitialBookings | 0.3272416 |
Sign_Partialadditions_modif | 0.3118424 |
TypeSt_Combinedstay | 0.3086107 |
OneRet_MultiCity | 0.3060757 |
Cabin_Economy | 0.3023342 |
TrDur_14to21nights | 0.3013660 |
LoS_14to21nights | 0.2920704 |
TrDur_9to13nights | 0.2733796 |
DistCh_RetailTA | 0.2702961 |
DistCh_OnlineTA | 0.2698661 |
LoS_4to5nights | 0.2521981 |
DistCh_CorporateTA | 0.2521667 |
Profile_Leisure | 0.2495882 |
Profile_Business | 0.2480910 |
Profile_Group | 0.2477527 |
ppb_2pax | 0.2472777 |
DistCh_OtherTA | 0.2447026 |
ppb_1pax | 0.2431852 |
Cabin_Business | 0.2414299 |
LeadT_30to44days | 0.2397771 |
LeadT_45to59days | 0.2389684 |
ppb_10pl_pax | 0.2317211 |
LeadT_60to89days | 0.2295702 |
LoS_9to13nights | 0.2231875 |
ppb_3pax | 0.2230266 |
ppb_4pax | 0.2229783 |
LeadT_120to364days | 0.2119751 |
LoS_2nights | 0.1962357 |
Avg.LT | 0.1852328 |
LoS_6to8nights | 0.1848752 |
ppb_6to9pax | 0.1817943 |
LoS_1night | 0.1791345 |
Avg.LOT | 0.1742244 |
Avg.PPB | 0.1731577 |
TrDur_6to8nights | 0.1680194 |
OneRet_Return | 0.1672108 |
LeadT_90to119days | 0.1577172 |
LeadT_15to29days | 0.1277789 |
LoS_3nights | 0.1196493 |
Avg.LOS | 0.1188098 |
TypeSt_Weekendstay | 0.1185411 |
ppb_5pax | 0.1150126 |
LeadT_5to14days | 0.1080870 |
Cabin_First | 0.0729665 |
TrDur_1night | 0.0465214 |
TrDur_4to5nights | 0.0357405 |
Cabin_EconomyPremium | 0.0287407 |
TrDur_3nights | 0.0189710 |
TrDur_2nights | 0.0152281 |
TypeSt_Workweekstay | -0.0026945 |
LeadT_0to4days | -0.0845452 |
Sign_Fulltripcancellations | -0.2908092 |
Sign_Partialcancellations | -0.3291667 |
Next, we used a Random Forest model to estimate variable importance and set a threshold below which a variable is discarded, i.e. based on the percentage increase in error (%IncMSE) that would result from excluding that variable. We plotted the importance of the different predictors in the fitted model and also show the %IncMSE contributed by each predictor in the table below.
#Perform RF skipping NAs from Jan 14
set.seed(1234)
rf <- randomForest(formula = Total_trav ~ ., data = dplyr::filter(pdsetlag30, Date > '2014-01-31'), ntree = 1000, importance = TRUE)
#Get table sorted of importance of predictors
imp <- varImp(rf)
kable(imp[order(-imp$Overall), , drop = FALSE])
Variable | Overall |
---|---|
lag30_totaltravelers | 135.280323 |
Week | 26.266690 |
Month | 19.638352 |
Date | 18.597290 |
Sign_Fulltripcancellations | 11.253787 |
OneRet_MultiCity | 10.379293 |
TypeSt_Workweekstay | 10.101510 |
Sign_Partialcancellations | 9.684716 |
TrDur_22nightsormore | 9.475380 |
Year | 9.352064 |
DistCh_OnlineTA | 9.252995 |
TrDur_3nights | 8.138174 |
TrDur_14to21nights | 7.645574 |
Sign_Partialadditions_modif | 7.617813 |
LoS_3nights | 7.154588 |
Avg.PPB | 7.146600 |
Sign_InitialBookings | 7.007508 |
LoS_6to8nights | 6.996177 |
LeadT_45to59days | 6.924812 |
TypeSt_Combinedstay | 6.887909 |
Cabin_EconomyPremium | 6.871663 |
TrDur_9to13nights | 6.868522 |
LeadT_60to89days | 6.864499 |
Cabin_Business | 6.639644 |
WeekDay | 6.493526 |
DistCh_CorporateTA | 6.444416 |
Avg.LOS | 6.354290 |
Profile_Business | 6.305578 |
DistCh_OtherTA | 6.246042 |
OneRet_Return | 6.188918 |
ppb_4pax | 6.059882 |
ppb_1pax | 6.054468 |
Cabin_Economy | 6.021336 |
Profile_Group | 5.960073 |
DistCh_RetailTA | 5.901489 |
ppb_3pax | 5.763776 |
Avg.LT | 5.757369 |
LoS_14to21nights | 5.681071 |
Avg.LOT | 5.652449 |
TrDur_6to8nights | 5.600380 |
LoS_4to5nights | 5.579301 |
ppb_10pl_pax | 5.564779 |
LoS_22nightsormore | 5.492581 |
LoS_9to13nights | 5.442431 |
LeadT_120to364days | 5.104104 |
LoS_2nights | 5.043263 |
LeadT_30to44days | 4.848114 |
TrDur_4to5nights | 4.706765 |
ppb_2pax | 4.457614 |
LoS_1night | 4.332915 |
LeadT_5to14days | 4.218485 |
Cabin_First | 3.848661 |
ppb_6to9pax | 3.837902 |
Profile_Leisure | 3.729659 |
LeadT_0to4days | 3.498446 |
LeadT_15to29days | 3.398809 |
LeadT_90to119days | 2.722089 |
TrDur_2nights | 2.579971 |
TypeSt_Weekendstay | 2.528184 |
TrDur_1night | 2.214097 |
ppb_5pax | 1.832548 |
#Plot importance
varImpPlot(rf, type = 1)
Based on this, we created two data frames: one with all predictors whose importance exceeds 10% and another with those above 15%.
#Check variables with imp over 10 and create df
rownames(imp)[which(imp>10)]
## [1] "Date" "TypeSt_Workweekstay"
## [3] "OneRet_MultiCity" "Sign_Fulltripcancellations"
## [5] "Month" "Week"
## [7] "lag30_totaltravelers"
pdsetlag30_s10 <- pdsetlag30 %>% dplyr::select('Date', 'Total_trav', 'LoS_2nights', 'LoS_3nights', 'LoS_4to5nights', 'LoS_6to8nights',
'LoS_9to13nights', 'LoS_14to21nights', 'LoS_22nightsormore', 'TrDur_3nights', 'TrDur_4to5nights', 'TrDur_6to8nights', 'TrDur_9to13nights', 'TrDur_14to21nights', 'TrDur_22nightsormore', 'ppb_1pax', 'ppb_2pax', 'ppb_3pax', 'ppb_4pax', 'ppb_10pl_pax', 'LeadT_5to14days', 'LeadT_30to44days', 'LeadT_45to59days', 'LeadT_60to89days', 'LeadT_90to119days', 'Cabin_EconomyPremium', 'Cabin_Business', 'DistCh_OnlineTA', 'DistCh_CorporateTA', 'DistCh_RetailTA', 'DistCh_OtherTA', 'TypeSt_Workweekstay', 'OneRet_Return', 'OneRet_MultiCity', 'Profile_Business', 'Profile_Leisure', 'Profile_Group', 'Sign_InitialBookings', 'Sign_Partialadditions_modif', 'Sign_Partialcancellations', 'Sign_Fulltripcancellations', 'Avg.LOS', 'Avg.LOT', 'Avg.LT', 'Avg.PPB', 'WeekDay', 'Month', 'Week', 'Year') %>% dplyr::filter(Date >'2014-01-31')
#Check variables with imp over 15 and create df
rownames(imp)[which(imp>15)]
## [1] "Date" "Month" "Week"
## [4] "lag30_totaltravelers"
pdsetlag30_s15 <- pdsetlag30 %>% dplyr::select('Date', 'Total_trav', 'TrDur_3nights', 'TrDur_22nightsormore', 'TypeSt_Workweekstay', 'OneRet_MultiCity', 'Profile_Business', 'Profile_Group', 'Sign_Partialadditions_modif', 'Sign_Partialcancellations', 'Sign_Fulltripcancellations', 'Avg.LOS', 'Avg.PPB', 'WeekDay', 'Month', 'Week', 'Year') %>% dplyr::filter(Date >'2014-01-31')
We ran a number of models with the following set-up:
+ Models: SVM (standard and sliding window), lm, rpart, rpartXse, bagging, nnet and randomForest.
+ Evaluation metric: NMSE, as it is normalized and easy to interpret (see the helper sketch below).
+ Estimation method: Monte Carlo with 5 repetitions, to obtain some certainty on the performance of each model.
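NMSE compares the model's squared errors against a naive baseline that always predicts the mean, so values below 1 (or below 100 when expressed as a percentage, as later in this report) mean the model beats that baseline. The helper below is our own minimal sketch of the formula used throughout this report; the function name nmse is hypothetical.
#Sketch: normalized mean squared error relative to a mean-only baseline
nmse <- function(actual, predicted) {
  sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}
#Example: nmse(c(10, 12, 9, 14), c(11, 11, 10, 13)) returns roughly 0.27, well below 1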
Tform_Total_trav <- as.formula('Total_trav ~.')
exp <- performanceEstimation(
PredTask(Tform_Total_trav, pdsetlag30_s15, 'Total_trav_30d'),
c(Workflow('standardWF', wfID="standSVM",
learner='svm',learner.pars=list(cost=10,gamma=0.01)),
Workflow('timeseriesWF', wfID="slideSVM",
type="slide", relearn.step=90,
learner='svm',learner.pars=list(cost=10,gamma=0.01)),
Workflow(learner="randomForest",learner.pars=list(ntree=1000)),
Workflow(learner="rpart",.fullOutput=TRUE),
Workflow(learner= 'lm'),
Workflow(learner="bagging",learner.pars=list(nbagg=500)),
Workflow(learner="rpartXse"),
Workflow(learner='nnet', learner.pars=list(linout=TRUE, trace=FALSE, maxit=1000, size=15, decay=0.01))
),
EstimationTask(metrics="nmse",
method=MonteCarlo(nReps=5,szTrain=0.5,szTest=0.25)))
##
##
## ##### PERFORMANCE ESTIMATION USING MONTE CARLO #####
##
## ** PREDICTIVE TASK :: Total_trav_30d
##
## ++ MODEL/WORKFLOW :: standSVM
## Task for estimating nmse using
## 5 repetitions Monte Carlo Simulation using:
## seed = 1234
## train size = 0.5 x NROW(DataSet)
## test size = 0.25 x NROW(DataSet)
## Repetition 1
## start test = 1077 ; test size = 509
## Repetition 2
## start test = 1329 ; test size = 509
## Repetition 3
## start test = 1335 ; test size = 509
## Repetition 4
## start test = 1336 ; test size = 509
## Repetition 5
## start test = 1455 ; test size = 509
##
##
##
## ++ MODEL/WORKFLOW :: slideSVM
## Task for estimating nmse using
## 5 repetitions Monte Carlo Simulation using:
## seed = 1234
## train size = 0.5 x NROW(DataSet)
## test size = 0.25 x NROW(DataSet)
## Repetition 1
## start test = 1077 ; test size = 509
## Repetition 2
## start test = 1329 ; test size = 509
## Repetition 3
## start test = 1335 ; test size = 509
## Repetition 4
## start test = 1336 ; test size = 509
## Repetition 5
## start test = 1455 ; test size = 509
##
##
##
## ++ MODEL/WORKFLOW :: randomForest
## Task for estimating nmse using
## 5 repetitions Monte Carlo Simulation using:
## seed = 1234
## train size = 0.5 x NROW(DataSet)
## test size = 0.25 x NROW(DataSet)
## Repetition 1
## start test = 1077 ; test size = 509
## Repetition 2
## start test = 1329 ; test size = 509
## Repetition 3
## start test = 1335 ; test size = 509
## Repetition 4
## start test = 1336 ; test size = 509
## Repetition 5
## start test = 1455 ; test size = 509
##
##
##
## ++ MODEL/WORKFLOW :: rpart
## Task for estimating nmse using
## 5 repetitions Monte Carlo Simulation using:
## seed = 1234
## train size = 0.5 x NROW(DataSet)
## test size = 0.25 x NROW(DataSet)
## Repetition 1
## start test = 1077 ; test size = 509
## Repetition 2
## start test = 1329 ; test size = 509
## Repetition 3
## start test = 1335 ; test size = 509
## Repetition 4
## start test = 1336 ; test size = 509
## Repetition 5
## start test = 1455 ; test size = 509
##
##
##
## ++ MODEL/WORKFLOW :: lm
## Task for estimating nmse using
## 5 repetitions Monte Carlo Simulation using:
## seed = 1234
## train size = 0.5 x NROW(DataSet)
## test size = 0.25 x NROW(DataSet)
## Repetition 1
## start test = 1077 ; test size = 509
## Repetition 2
## start test = 1329 ; test size = 509
## Repetition 3
## start test = 1335 ; test size = 509
## Repetition 4
## start test = 1336 ; test size = 509
## Repetition 5
## start test = 1455 ; test size = 509
##
##
##
## ++ MODEL/WORKFLOW :: bagging
## Task for estimating nmse using
## 5 repetitions Monte Carlo Simulation using:
## seed = 1234
## train size = 0.5 x NROW(DataSet)
## test size = 0.25 x NROW(DataSet)
## Repetition 1
## start test = 1077 ; test size = 509
## Repetition 2
## start test = 1329 ; test size = 509
## Repetition 3
## start test = 1335 ; test size = 509
## Repetition 4
## start test = 1336 ; test size = 509
## Repetition 5
## start test = 1455 ; test size = 509
##
##
##
## ++ MODEL/WORKFLOW :: rpartXse
## Task for estimating nmse using
## 5 repetitions Monte Carlo Simulation using:
## seed = 1234
## train size = 0.5 x NROW(DataSet)
## test size = 0.25 x NROW(DataSet)
## Repetition 1
## start test = 1077 ; test size = 509
## Repetition 2
## start test = 1329 ; test size = 509
## Repetition 3
## start test = 1335 ; test size = 509
## Repetition 4
## start test = 1336 ; test size = 509
## Repetition 5
## start test = 1455 ; test size = 509
##
##
##
## ++ MODEL/WORKFLOW :: nnet
## Task for estimating nmse using
## 5 repetitions Monte Carlo Simulation using:
## seed = 1234
## train size = 0.5 x NROW(DataSet)
## test size = 0.25 x NROW(DataSet)
## Repetition 1
## start test = 1077 ; test size = 509
## Repetition 2
## start test = 1329 ; test size = 509
## Repetition 3
## start test = 1335 ; test size = 509
## Repetition 4
## start test = 1336 ; test size = 509
## Repetition 5
## start test = 1455 ; test size = 509
summary(exp)
##
## == Summary of a Monte Carlo Performance Estimation Experiment ==
##
## Task for estimating nmse using
## 5 repetitions Monte Carlo Simulation using:
## seed = 1234
## train size = 0.5 x NROW(DataSet)
## test size = 0.25 x NROW(DataSet)
##
## * Predictive Tasks :: Total_trav_30d
## * Workflows :: standSVM, slideSVM, randomForest, rpart, lm, bagging, rpartXse, nnet
##
## -> Task: Total_trav_30d
## *Workflow: standSVM
## nmse
## avg 0.86156225
## std 0.27330726
## med 0.72994660
## iqr 0.04659865
## min 0.72270560
## max 1.34897089
## invalid 0.00000000
##
## *Workflow: slideSVM
## nmse
## avg 0.542201437
## std 0.023170741
## med 0.536390663
## iqr 0.008913675
## min 0.524265703
## max 0.582411262
## invalid 0.000000000
##
## *Workflow: randomForest
## nmse
## avg 0.47327259
## std 0.01998234
## med 0.47161439
## iqr 0.03110022
## min 0.44991771
## max 0.49703349
## invalid 0.00000000
##
## *Workflow: rpart
## nmse
## avg 0.62033355
## std 0.05549068
## med 0.63659207
## iqr 0.03109266
## min 0.52417316
## max 0.65834659
## invalid 0.00000000
##
## *Workflow: lm
## nmse
## avg 1.4404186
## std 0.2268344
## med 1.5788348
## iqr 0.2865787
## min 1.1043898
## max 1.6207747
## invalid 0.0000000
##
## *Workflow: bagging
## nmse
## avg 0.48905493
## std 0.03604078
## med 0.47544499
## iqr 0.02432157
## min 0.46306398
## max 0.55067340
## invalid 0.00000000
##
## *Workflow: rpartXse
## nmse
## avg 0.60375738
## std 0.03218525
## med 0.59711170
## iqr 0.04759748
## min 0.56524924
## max 0.64196229
## invalid 0.00000000
##
## *Workflow: nnet
## nmse
## avg 9.999988e-01
## std 9.154329e-07
## med 9.999992e-01
## iqr 2.195862e-07
## min 9.999972e-01
## max 9.999993e-01
## invalid 0.000000e+00
plot(exp)
Next, we tried the same with the broader data frame (importance over 10 instead of 15), with the best performing models only. The goal is to see if increasing the number of predictors would materially improve the metrics.
exp2 <- performanceEstimation(
PredTask(Tform_Total_trav, pdsetlag30_s10, 'Total_trav_30d'),
c(Workflow(learner="randomForest",learner.pars=list(ntree=1000)),
Workflow(learner="rpart",.fullOutput=TRUE),
Workflow(learner="rpartXse"),
Workflow(learner='nnet', learner.pars=list(linout=TRUE, trace=FALSE, maxit=1000, size=15, decay=0.01))
),
EstimationTask(metrics="nmse",
method=MonteCarlo(nReps=5,szTrain=0.5,szTest=0.25)))
##
##
## ##### PERFORMANCE ESTIMATION USING MONTE CARLO #####
##
## ** PREDICTIVE TASK :: Total_trav_30d
##
## ++ MODEL/WORKFLOW :: randomForest
## Task for estimating nmse using
## 5 repetitions Monte Carlo Simulation using:
## seed = 1234
## train size = 0.5 x NROW(DataSet)
## test size = 0.25 x NROW(DataSet)
## Repetition 1
## start test = 1077 ; test size = 509
## Repetition 2
## start test = 1329 ; test size = 509
## Repetition 3
## start test = 1335 ; test size = 509
## Repetition 4
## start test = 1336 ; test size = 509
## Repetition 5
## start test = 1455 ; test size = 509
##
##
##
## ++ MODEL/WORKFLOW :: rpart
## Task for estimating nmse using
## 5 repetitions Monte Carlo Simulation using:
## seed = 1234
## train size = 0.5 x NROW(DataSet)
## test size = 0.25 x NROW(DataSet)
## Repetition 1
## start test = 1077 ; test size = 509
## Repetition 2
## start test = 1329 ; test size = 509
## Repetition 3
## start test = 1335 ; test size = 509
## Repetition 4
## start test = 1336 ; test size = 509
## Repetition 5
## start test = 1455 ; test size = 509
##
##
##
## ++ MODEL/WORKFLOW :: rpartXse
## Task for estimating nmse using
## 5 repetitions Monte Carlo Simulation using:
## seed = 1234
## train size = 0.5 x NROW(DataSet)
## test size = 0.25 x NROW(DataSet)
## Repetition 1
## start test = 1077 ; test size = 509
## Repetition 2
## start test = 1329 ; test size = 509
## Repetition 3
## start test = 1335 ; test size = 509
## Repetition 4
## start test = 1336 ; test size = 509
## Repetition 5
## start test = 1455 ; test size = 509
##
##
##
## ++ MODEL/WORKFLOW :: nnet
## Task for estimating nmse using
## 5 repetitions Monte Carlo Simulation using:
## seed = 1234
## train size = 0.5 x NROW(DataSet)
## test size = 0.25 x NROW(DataSet)
## Repetition 1
## start test = 1077 ; test size = 509
## Repetition 2
## start test = 1329 ; test size = 509
## Repetition 3
## start test = 1335 ; test size = 509
## Repetition 4
## start test = 1336 ; test size = 509
## Repetition 5
## start test = 1455 ; test size = 509
summary(exp2)
##
## == Summary of a Monte Carlo Performance Estimation Experiment ==
##
## Task for estimating nmse using
## 5 repetitions Monte Carlo Simulation using:
## seed = 1234
## train size = 0.5 x NROW(DataSet)
## test size = 0.25 x NROW(DataSet)
##
## * Predictive Tasks :: Total_trav_30d
## * Workflows :: randomForest, rpart, rpartXse, nnet
##
## -> Task: Total_trav_30d
## *Workflow: randomForest
## nmse
## avg 0.475506227
## std 0.015384097
## med 0.475996357
## iqr 0.006047883
## min 0.452595329
## max 0.495514575
## invalid 0.000000000
##
## *Workflow: rpart
## nmse
## avg 0.61612635
## std 0.04419247
## med 0.62573163
## iqr 0.08647399
## min 0.56937897
## max 0.65834659
## invalid 0.00000000
##
## *Workflow: rpartXse
## nmse
## avg 0.61216302
## std 0.04831333
## med 0.61982419
## iqr 0.08169229
## min 0.55068805
## max 0.65834659
## invalid 0.00000000
##
## *Workflow: nnet
## nmse
## avg 9.999989e-01
## std 5.312570e-07
## med 9.999992e-01
## iqr 2.326554e-07
## min 9.999980e-01
## max 9.999992e-01
## invalid 0.000000e+00
plot(exp2)
The metrics show no material improvement from using the broader predictor set (importance over 10). Therefore, we focused on the data frame with predictors above the 15% importance threshold. The Random Forest was also the best performing model in both scenarios.
Next, we tried more complex models on a standalone basis to check whether any of them could outperform the Random Forest. For this, we used the data through July 2019 as the training set and August 2019 as the test set.
RANDOM FOREST
#Build train and test datasets
Tdata.train_lag30_s15 <- pdsetlag30_s15 %>% dplyr::filter(Date < '2019-08-01')
Tdata.eval_lag30_s15 <- pdsetlag30_s15 %>% dplyr::filter(Date > '2019-07-31')
#Run model
RF_Total_trav_model <- randomForest(Tform_Total_trav, data = Tdata.train_lag30_s15[-1], ntree = 1000)
#prediction
RF_TT_prediction <- predict(RF_Total_trav_model, Tdata.eval_lag30_s15)
#Check
RF_TT_validate <- data.frame(Tdata.eval_lag30_s15$Total_trav, RF_TT_prediction)
#Evaluation
rmse.rf <- round(sqrt(mean((Tdata.eval_lag30_s15$Total_trav - RF_TT_prediction)^2)),2)
nmse.rf <- round(sum((Tdata.eval_lag30_s15$Total_trav - RF_TT_prediction)^2) / sum((Tdata.eval_lag30_s15$Total_trav-mean(Tdata.eval_lag30_s15$Total_trav))^2)*100,2)
rmse.rf
## [1] 164.85
nmse.rf
## [1] 47.7
NEURAL NETWORKS
#Normalize numeric data
pdsetlag30_s15_NN <- data.frame(pdsetlag30_s15[c('Date', 'WeekDay', 'Month')], lapply(pdsetlag30_s15[c('Total_trav', 'TrDur_3nights', 'TrDur_22nightsormore', 'TypeSt_Workweekstay', 'OneRet_MultiCity', 'Profile_Business', 'Profile_Group', 'Sign_Partialadditions_modif', 'Sign_Partialcancellations', 'Sign_Fulltripcancellations', 'Avg.LOS', 'Avg.PPB', 'Week', 'Year')],scale))
#Build train and test datasets
Tdata.train_lag30_s15_NN <- pdsetlag30_s15_NN %>% dplyr::filter(Date < '2019-08-01')
Tdata.eval_lag30_s15_NN <- pdsetlag30_s15_NN %>% dplyr::filter(Date > '2019-07-31')
#Unnormalize function
unnormalize <- function(x) {return((x*(sd(Tdata.eval_lag30_s15$Total_trav))+ base::mean(Tdata.eval_lag30_s15$Total_trav)))}
#Run model
NN_Total_trav_model <- nnet(Tform_Total_trav, data = Tdata.train_lag30_s15_NN[-1], size = 20, decay = 0.01, maxit = 10000, linout = TRUE, trace = FALSE, MaxNWts = 1000)
#prediction
NN_TT_prediction <- unnormalize(predict(NN_Total_trav_model, Tdata.eval_lag30_s15_NN[-1]))
#Check
NN_TT_validate <- data.frame(Tdata.eval_lag30_s15$Total_trav, NN_TT_prediction)
#Evaluation
rmse.nn <- round(sqrt(mean((Tdata.eval_lag30_s15$Total_trav - NN_TT_prediction)^2)),2)
nmse.nn <- round(sum((Tdata.eval_lag30_s15$Total_trav - NN_TT_prediction)^2) / sum((Tdata.eval_lag30_s15$Total_trav-mean(Tdata.eval_lag30_s15$Total_trav))^2)*100,2)
rmse.nn
## [1] 217.97
nmse.nn
## [1] 83.39
DEEP LEARNING
# trH <- as.h2o(Tdata.train_lag30_s15 %>% dplyr::mutate(nWeekday=wday(Date), nMonth=lubridate::month(Date)) %>% dplyr::select(-Date,-WeekDay, -Month), 'trH')
# tsH <- as.h2o(Tdata.eval_lag30_s15 %>% dplyr::mutate(nWeekday=wday(Date), nMonth=lubridate::month(Date)) %>% dplyr::select(-Date,-WeekDay, -Month), 'tsH')
# DL_Total_trav_model <- h2o.deeplearning(x = 2:16, y = 1, training_frame = trH, hidden = c(500, 500, 500, 500, 500), epochs = 1000)
# DL_TT_prediction <- as.vector(h2o.predict(DL_Total_trav_model, tsH))
# #Check
# DL_TT_validate <- data.frame(Tdata.eval_lag30_s15$Total_trav, DL_TT_prediction)
# #Evaluation
# rmse.dl <- round(sqrt(mean((Tdata.eval_lag30_s15$Total_trav - DL_TT_prediction)^2)),2)
# nmse.dl <- round(sum((Tdata.eval_lag30_s15$Total_trav - DL_TT_prediction)^2) / sum((Tdata.eval_lag30_s15$Total_trav-mean(Tdata.eval_lag30_s15$Total_trav))^2)*100,2)
BAGGING
#Run model
BG_Total_trav_model <- bagging(Tform_Total_trav, data = Tdata.train_lag30_s15[-1], mfinal = 1000)
#prediction
BG_TT_prediction <- predict(BG_Total_trav_model, Tdata.eval_lag30_s15[-1])
#Check
BG_TT_validate <- data.frame(Tdata.eval_lag30_s15$Total_trav, BG_TT_prediction)
#Evaluation
rmse.bg <- round(sqrt(mean((Tdata.eval_lag30_s15$Total_trav - BG_TT_prediction)^2)),2)
nmse.bg <- round(sum((Tdata.eval_lag30_s15$Total_trav - BG_TT_prediction)^2) / sum((Tdata.eval_lag30_s15$Total_trav-mean(Tdata.eval_lag30_s15$Total_trav))^2)*100,2)
rmse.bg
## [1] 182.62
nmse.bg
## [1] 58.54
XGBOOST
#Source: https://xgboost.readthedocs.io/en/latest/R-package/xgboostPresentation.html
#Run model
Tdata.train_lag30_s15_XGB <- Tdata.train_lag30_s15 %>% dplyr::mutate(nWeekday=wday(Date), nMonth=lubridate::month(Date)) %>% dplyr::select(-Date,-WeekDay, -Month)
Tdata.eval_lag30_s15_XGB <- Tdata.eval_lag30_s15 %>% dplyr::mutate(nWeekday=wday(Date), nMonth=lubridate::month(Date)) %>% dplyr::select(-Date,-WeekDay, -Month)
XGB_Total_trav_model <- xgboost(data = as.matrix(Tdata.train_lag30_s15_XGB[-1]), label = Tdata.train_lag30_s15_XGB$Total_trav, nrounds = 100, max_depth = 100, objective = 'reg:squarederror', eta = 0.2, verbose = 2, booster = 'gbtree')
## [1] train-rmse:576.872009
## [2] train-rmse:472.082550
## [3] train-rmse:388.590515
## [4] train-rmse:321.142120
## [5] train-rmse:266.992126
## [6] train-rmse:222.780151
## [7] train-rmse:186.722610
## [8] train-rmse:156.763199
## [9] train-rmse:132.223389
## [10] train-rmse:111.406235
## [11] train-rmse:94.252525
## [12] train-rmse:79.960075
## [13] train-rmse:67.991562
## [14] train-rmse:58.035343
## [15] train-rmse:49.627621
## [16] train-rmse:42.548389
## [17] train-rmse:36.574173
## [18] train-rmse:31.528339
## [19] train-rmse:27.167871
## [20] train-rmse:23.492336
## [21] train-rmse:20.321342
## [22] train-rmse:17.539219
## [23] train-rmse:15.172000
## [24] train-rmse:13.165424
## [25] train-rmse:11.457286
## [26] train-rmse:9.981150
## [27] train-rmse:8.724131
## [28] train-rmse:7.650667
## [29] train-rmse:6.708563
## [30] train-rmse:5.900376
## [31] train-rmse:5.194231
## [32] train-rmse:4.568176
## [33] train-rmse:4.035913
## [34] train-rmse:3.571569
## [35] train-rmse:3.170204
## [36] train-rmse:2.811042
## [37] train-rmse:2.489473
## [38] train-rmse:2.211863
## [39] train-rmse:1.955314
## [40] train-rmse:1.737705
## [41] train-rmse:1.546050
## [42] train-rmse:1.377857
## [43] train-rmse:1.225231
## [44] train-rmse:1.093255
## [45] train-rmse:0.971850
## [46] train-rmse:0.866639
## [47] train-rmse:0.773485
## [48] train-rmse:0.691379
## [49] train-rmse:0.618551
## [50] train-rmse:0.553726
## [51] train-rmse:0.495769
## [52] train-rmse:0.444477
## [53] train-rmse:0.398906
## [54] train-rmse:0.358065
## [55] train-rmse:0.321683
## [56] train-rmse:0.288930
## [57] train-rmse:0.259488
## [58] train-rmse:0.233131
## [59] train-rmse:0.209440
## [60] train-rmse:0.187343
## [61] train-rmse:0.167790
## [62] train-rmse:0.150358
## [63] train-rmse:0.134832
## [64] train-rmse:0.120943
## [65] train-rmse:0.108472
## [66] train-rmse:0.097359
## [67] train-rmse:0.087420
## [68] train-rmse:0.078518
## [69] train-rmse:0.070560
## [70] train-rmse:0.063420
## [71] train-rmse:0.056924
## [72] train-rmse:0.051171
## [73] train-rmse:0.045993
## [74] train-rmse:0.041341
## [75] train-rmse:0.037148
## [76] train-rmse:0.033401
## [77] train-rmse:0.030032
## [78] train-rmse:0.027012
## [79] train-rmse:0.024298
## [80] train-rmse:0.021860
## [81] train-rmse:0.019667
## [82] train-rmse:0.017695
## [83] train-rmse:0.015915
## [84] train-rmse:0.014318
## [85] train-rmse:0.012881
## [86] train-rmse:0.011589
## [87] train-rmse:0.010427
## [88] train-rmse:0.009383
## [89] train-rmse:0.008444
## [90] train-rmse:0.007599
## [91] train-rmse:0.006840
## [92] train-rmse:0.006157
## [93] train-rmse:0.005543
## [94] train-rmse:0.004990
## [95] train-rmse:0.004493
## [96] train-rmse:0.004047
## [97] train-rmse:0.003648
## [98] train-rmse:0.003289
## [99] train-rmse:0.002965
## [100] train-rmse:0.002675
#prediction
XGB_TT_prediction <- predict(XGB_Total_trav_model, newdata = as.matrix(Tdata.eval_lag30_s15_XGB[-1]))
#Check
XGB_TT_validate <- data.frame(Tdata.eval_lag30_s15$Total_trav, XGB_TT_prediction)
#Evaluation
rmse.xgb <- round(sqrt(mean((Tdata.eval_lag30_s15$Total_trav - XGB_TT_prediction)^2)),2)
nmse.xgb <- round(sum((Tdata.eval_lag30_s15$Total_trav - XGB_TT_prediction)^2) / sum((Tdata.eval_lag30_s15$Total_trav-mean(Tdata.eval_lag30_s15$Total_trav))^2)*100,2)
rmse.xgb
## [1] 158.2
nmse.xgb
## [1] 43.93
#Check importance
importance_matrix <- xgb.importance(model = XGB_Total_trav_model)
print(importance_matrix)
## Feature Gain Cover Frequency
## 1: Week 0.605492326 0.11760613 0.041773493
## 2: Year 0.081670034 0.02656203 0.017731857
## 3: Avg.PPB 0.056721359 0.06645040 0.049357508
## 4: TrDur_22nightsormore 0.035662860 0.10771803 0.113345717
## 5: nMonth 0.034226322 0.01399555 0.002962986
## 6: Sign_Fulltripcancellations 0.033804525 0.06657973 0.063742573
## 7: OneRet_MultiCity 0.023659269 0.05739302 0.085082211
## 8: nWeekday 0.022914228 0.03162365 0.021462456
## 9: TypeSt_Workweekstay 0.022125231 0.04047559 0.073092098
## 10: Sign_Partialadditions_modif 0.020532328 0.06000644 0.070789260
## 11: Sign_Partialcancellations 0.017033564 0.15662357 0.063128483
## 12: Profile_Business 0.014798899 0.04255655 0.070697146
## 13: TrDur_3nights 0.012724523 0.05013907 0.206764205
## 14: Avg.LOS 0.011385117 0.04858439 0.061777484
## 15: Profile_Group 0.007249416 0.11368585 0.058292522
xgb.plot.importance(importance_matrix = importance_matrix)
CUBIST
CBST_Total_trav_model <- cubist(Tdata.train_lag30_s15_XGB[-1], Tdata.train_lag30_s15_XGB$Total_trav, committees = 5)
#prediction
CBST_TT_prediction <- predict(CBST_Total_trav_model, Tdata.eval_lag30_s15_XGB[-1])
#Check
CBST_TT_validate <- data.frame(Tdata.eval_lag30_s15$Total_trav, CBST_TT_prediction)
#Evaluation
rmse.cbst <- round(sqrt(mean((Tdata.eval_lag30_s15$Total_trav - CBST_TT_prediction)^2)),2)
nmse.cbst <- round(sum((Tdata.eval_lag30_s15$Total_trav - CBST_TT_prediction)^2) / sum((Tdata.eval_lag30_s15$Total_trav-mean(Tdata.eval_lag30_s15$Total_trav))^2)*100,2)
rmse.cbst
## [1] 179.47
nmse.cbst
## [1] 56.53
RESULTS
The table below shows the performance metrics achieved with each of the standalone models. XGBoost proved to be the best performing model, followed by Random Forest.
output.table <- data.frame("Model" = c('XGBoost','Bagging','Random Forest', 'Neural Networks', 'Cubist'),
"RMSE" = c(rmse.xgb,rmse.bg,rmse.rf, rmse.nn, rmse.cbst),
"NMSE" = c(nmse.xgb,nmse.bg,nmse.rf, nmse.nn, nmse.cbst))
kable(output.table)
Model | RMSE | NMSE |
---|---|---|
XGBoost | 158.20 | 43.93 |
Bagging | 182.62 | 58.54 |
Random Forest | 164.85 | 47.70 |
Neural Networks | 217.97 | 83.39 |
Cubist | 179.47 | 56.53 |
XGBOOST
The conclusions from the different models assessed show that XGBoost is the best performing model. Therefore, we focused on refining this model and checking if a different combination of predictors or parameters could further improve the metrics.
pdsetlag30t <- data.frame(pdset[c('Date', 'Total_trav')],
lapply(dplyr::select(pdset, -c('Date', 'Total_trav', 'WeekDay', 'Month', 'Week', 'Year')),
FUN = dplyr::lag,
n=30))
pdsetlag30t <- pdsetlag30t %>% dplyr::mutate(WeekDay=lubridate::wday(pdsetlag30t$Date), Month=lubridate::month(pdsetlag30t$Date), Week=lubridate::week(pdsetlag30t$Date), Year=lubridate::year(pdsetlag30t$Date), lag30_totaltravelers=dplyr::lag(pdsetlag30$Total_trav,n = 30, default = NA)) %>% dplyr::filter(Date > '2014-01-31') %>% dplyr::select(c('Date', 'Total_trav', 'WeekDay', 'Month', 'Week', 'Year', 'TrDur_22nightsormore', 'TypeSt_Workweekstay', 'Avg.PPB', 'Avg.LOS', 'Avg.LT', 'Profile_Business'))
#Run model
Tdata.train_lag30_XGBt <- pdsetlag30t %>% dplyr::filter(Date < '2019-08-01') %>% dplyr::select(-Date)
Tdata.eval_lag30_XGBt <- pdsetlag30t %>% dplyr::filter(Date > '2019-07-31') %>% dplyr::select(-Date)
#Create matrices for XGB model
XGB_train_matrix <- xgb.DMatrix(data = as.matrix(Tdata.train_lag30_XGBt[-1]), label = Tdata.train_lag30_XGBt$Total_trav)
XGB_test_matrix <- xgb.DMatrix(data = as.matrix(Tdata.eval_lag30_XGBt[-1]), label = Tdata.eval_lag30_XGBt$Total_trav)
#Define parameters of model
params <- list(booster = "gbtree", objective = "reg:squarederror", eta=0.2, gamma=0, max_depth=25, min_child_weight=1, subsample=1, colsample_bytree=1)
#Check optimal nrounds with CV
#xgbcv <- xgb.cv( params = params, data = XGB_train_matrix, nrounds = 100, nfold = 5, showsd = T, stratified = T, print.every.n = 10, early.stop.round = 20, maximize = F)
XGB_Total_trav_modelt <- xgboost(data = XGB_train_matrix,
nrounds = 28,
eval_metric = 'rmse',
params = params)
## [1] train-rmse:577.234436
## [2] train-rmse:472.615173
## [3] train-rmse:388.825104
## [4] train-rmse:321.971680
## [5] train-rmse:268.173828
## [6] train-rmse:224.245224
## [7] train-rmse:189.055267
## [8] train-rmse:159.243744
## [9] train-rmse:134.827805
## [10] train-rmse:114.467705
## [11] train-rmse:97.253555
## [12] train-rmse:82.858589
## [13] train-rmse:70.844826
## [14] train-rmse:60.536037
## [15] train-rmse:51.970814
## [16] train-rmse:44.666649
## [17] train-rmse:38.455173
## [18] train-rmse:33.235947
## [19] train-rmse:28.641903
## [20] train-rmse:24.748526
## [21] train-rmse:21.432110
## [22] train-rmse:18.582981
## [23] train-rmse:16.175577
## [24] train-rmse:14.076773
## [25] train-rmse:12.294971
## [26] train-rmse:10.754439
## [27] train-rmse:9.445511
## [28] train-rmse:8.293904
#prediction
XGB_TT_predictiont <- predict(XGB_Total_trav_modelt, newdata = XGB_test_matrix)
#Check
XGB_TT_validatet <- data.frame(Tdata.eval_lag30_XGBt$Total_trav, round(XGB_TT_predictiont,0))
#Evaluation
rmse.xgbt <- round(sqrt(mean((Tdata.eval_lag30_XGBt$Total_trav - XGB_TT_predictiont)^2)),2)
nmse.xgbt <- round(sum((Tdata.eval_lag30_XGBt$Total_trav - XGB_TT_predictiont)^2) / sum((Tdata.eval_lag30_XGBt$Total_trav-mean(Tdata.eval_lag30_XGBt$Total_trav))^2)*100,2)
#Check importance
# importance_matrixt <- xgb.importance(model = XGB_Total_trav_modelt)
# print(importance_matrixt)
# xgb.plot.importance(importance_matrix = importance_matrixt)
rmse.xgbt
## [1] 112.82
nmse.xgbt
## [1] 22.34
After a few iterations with the XGBoost parameters we were able to narrow down the list of predictors to 10 and significantly improve the metrics.
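The tuning itself was iterative. The sketch below is our own illustration of one way to reproduce it, assuming the XGB_train_matrix and params objects from the chunk above: it uses xgb.cv (the call commented out in that chunk) to pick the number of boosting rounds and xgb.importance to flag weak predictors; the 0.01 gain cutoff is an illustrative threshold, not the one used in the report.
#Hypothetical tuning sketch: 5-fold CV to choose nrounds, then prune predictors by gain
set.seed(1234)
xgbcv <- xgb.cv(params = params, data = XGB_train_matrix, nrounds = 200,
                nfold = 5, early_stopping_rounds = 20, verbose = 0)
xgbcv$best_iteration #candidate value for nrounds in xgboost()
imp_t <- xgb.importance(model = XGB_Total_trav_modelt)
imp_t$Feature[imp_t$Gain > 0.01] #candidate predictors to keep for the next iteration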
We then replicated the Random Forest model with the same predictors as the XGBoost.
RANDOM FOREST
#Run model
RF_Total_trav_modelt <- randomForest(Tform_Total_trav, data = Tdata.train_lag30_XGBt, ntree = 1000)
#prediction
RF_TT_predictiont <- predict(RF_Total_trav_modelt, Tdata.eval_lag30_XGBt)
#Check
RF_TT_validate <- data.frame(Tdata.eval_lag30_XGBt$Total_trav, RF_TT_predictiont)
#Evaluation
rmse.rft <- round(sqrt(mean((Tdata.eval_lag30_XGBt$Total_trav - RF_TT_predictiont)^2)),2)
nmse.rft <- round(sum((Tdata.eval_lag30_XGBt$Total_trav - RF_TT_predictiont)^2) / sum((Tdata.eval_lag30_XGBt$Total_trav-mean(Tdata.eval_lag30_XGBt$Total_trav))^2)*100,2)
rmse.rft
## [1] 134.94
nmse.rft
## [1] 31.96
With this same combination of predictors we were able to improve the metrics significantly. Nevertheless, the XGBoost still outperforms the Random Forest.
The table below shows the summary metrics.
output.tablet <- data.frame("Model" = c('XGBoost Refined','Random Forest Refined'),
"RMSE" = c(rmse.xgbt,rmse.rft),
"NMSE" = c(nmse.xgbt,nmse.rft))
kable(output.tablet)
Model | RMSE | NMSE |
---|---|---|
XGBoost Refined | 112.82 | 22.34 |
Random Forest Refined | 134.94 | 31.96 |
Lastly, we built a data frame containing daily predictions for August from both models, and also calculated the average of the two predictions (a simple ensemble) together with its performance metrics.
MLpredOutTble <- data.frame(Date=seq.Date(from = as.Date('2019-08-01'), to = as.Date('2019-08-31'), by = 'days'),
Actuals=Tdata.eval_lag30_XGBt$Total_trav,
XGBoost=round(XGB_TT_predictiont,0),
RandomForest=round(RF_TT_predictiont,0))
MLpredOutTble <- dplyr::mutate(MLpredOutTble,
Prediction_Average = round((MLpredOutTble$XGBoost + MLpredOutTble$RandomForest)/2,0))
ensmb.rmse <- round(sqrt(mean((MLpredOutTble$Actuals - MLpredOutTble$Prediction_Average)^2)),2)
ensmb.nmse <- round(sum((MLpredOutTble$Actuals - MLpredOutTble$Prediction_Average)^2) / sum((MLpredOutTble$Actuals-mean(MLpredOutTble$Actuals))^2)*100,2)
kable(MLpredOutTble)
Date | Actuals | XGBoost | RandomForest | Prediction_Average |
---|---|---|---|---|
2019-08-01 | 999 | 986 | 917 | 952 |
2019-08-02 | 1102 | 885 | 879 | 882 |
2019-08-03 | 1269 | 1045 | 1001 | 1023 |
2019-08-04 | 920 | 909 | 896 | 902 |
2019-08-05 | 1068 | 870 | 872 | 871 |
2019-08-06 | 798 | 742 | 840 | 791 |
2019-08-07 | 713 | 822 | 835 | 828 |
2019-08-08 | 828 | 866 | 851 | 858 |
2019-08-09 | 832 | 806 | 862 | 834 |
2019-08-10 | 1186 | 1101 | 1007 | 1054 |
2019-08-11 | 869 | 879 | 853 | 866 |
2019-08-12 | 773 | 826 | 876 | 851 |
2019-08-13 | 851 | 758 | 769 | 764 |
2019-08-14 | 671 | 845 | 760 | 802 |
2019-08-15 | 584 | 657 | 750 | 704 |
2019-08-16 | 701 | 706 | 751 | 728 |
2019-08-17 | 672 | 756 | 834 | 795 |
2019-08-18 | 448 | 635 | 648 | 642 |
2019-08-19 | 498 | 617 | 742 | 680 |
2019-08-20 | 607 | 563 | 644 | 604 |
2019-08-21 | 583 | 500 | 687 | 594 |
2019-08-22 | 578 | 522 | 592 | 557 |
2019-08-23 | 468 | 468 | 596 | 532 |
2019-08-24 | 727 | 815 | 774 | 794 |
2019-08-25 | 498 | 538 | 552 | 545 |
2019-08-26 | 484 | 445 | 485 | 465 |
2019-08-27 | 383 | 405 | 568 | 486 |
2019-08-28 | 347 | 454 | 455 | 454 |
2019-08-29 | 392 | 455 | 477 | 466 |
2019-08-30 | 934 | 640 | 700 | 670 |
2019-08-31 | 589 | 661 | 744 | 702 |
ggplot(MLpredOutTble, aes(x = Date)) + geom_line(aes(y = Actuals, color='Actuals'), linetype=5, size=1.5) + geom_line(aes(y = XGBoost, color='XGBoost'))+ geom_line(aes(y = RandomForest, color='Random Forest')) + labs(color='Legend') + theme_classic()
The dataset consists of data points collected at constant time intervals. As a result, time series forecasting is another natural methodology for predicting the number of Chinese passengers traveling to Paris. The time series models considered include decomposition models, exponential smoothing, ARIMA models and Facebook's Prophet. The first step for this type of forecasting is to understand the inherent properties of the series, such as trend, seasonal patterns and cycles. Time series data can exhibit a variety of patterns, and it is often helpful to split a series into several components, each representing an underlying pattern category. In order to identify those components, the decompose function is applied to the data as per below:
max <- lubridate::yday(max(pdset$Date))
ts <- xts(x = pdset$Total_trav, order.by = pdset$Date)
ts1 <- as.ts(ts)
dfts <- ts(ts1, frequency=365)
dec <- decompose(dfts)
autoplot(dec)
As described in the exploratory analysis and shown in the graph above, the data follow weekly, monthly and yearly seasonal patterns, which makes sense because high-frequency (daily) data generally exhibit more complicated seasonality. ETS and conventional ARIMA models cannot handle series of this length with multiple seasonal patterns. Instead, we convert the time series to an msts object, which can handle multiple seasonality and allows us to specify all the frequencies that might be relevant. After that, we use more advanced modelling techniques, namely dynamic harmonic regression and TBATS. Lastly, we use Facebook's Prophet model, which can also handle high-frequency series with multiple seasonal patterns.
First, we convert the time series into an msts object to capture the weekly pattern as well as the longer annual pattern. The value 7 corresponds to the weekly pattern and 365.25 is the average length of a year, allowing for leap years.
### Converting Series in a msts object
date_range <- as.data.frame(seq(as.Date("2014-01-02"), as.Date("2019-08-31"), by = 'days'))
colnames(date_range) <- c("Date")
df1 <- pdset%>%dplyr::select(Date, Total_trav, Profile_Business, Avg.LOS, Avg.PPB, Profile_Group)
df2 <- df1%>%dplyr::mutate(Date = as.Date("2015-01-01"), Total_trav = 84, Profile_Business = 31, Avg.LOS = 8.19, Avg.PPB = 4.38, Profile_Group = 31)
df2 <- df2[1,]
df <- rbind(df1,df2)%>%dplyr::arrange(Date)
y <- msts(df$Total_trav, start = c(2014,2), end = c(2019,244), seasonal.periods=c(7,365.25))
train <- window(y, start = c(2014,2), end = c(2019,213))
test <- window(y, start= c(2019,214), end = c(2019,244))
date_range1 <- as.data.frame(seq(as.Date("2014-01-02"), as.Date("2019-07-31"), by = 'days'))
date_range2 <- as.data.frame(seq(as.Date("2019-08-01"), as.Date("2019-08-31"), by = 'days'))
colnames(date_range1) <- c("Date")
colnames(date_range2) <- c("Date")
a <- as.data.frame(train)
b <- as.data.frame(test)
colnames(a) <- c("value")
colnames(b) <- c("value")
traindf <- a%>%mutate(time = date_range1$Date)%>%dplyr::select(time,value)
testdf <- b%>%mutate(time = date_range2$Date)%>%dplyr::select(time,value)
After converting the time series into an msts object, we apply the time series models and test the forecasting accuracy of each.
Dynamic Harmonic Regression
The Arima() and auto.arima() functions allow a seasonal period of up to m=350, but in practice they usually run out of memory whenever the seasonal period exceeds about 200. In any case, high-order seasonal differencing does not make much sense for daily data: it amounts to comparing what happened today with what happened exactly a year ago, and there is no constraint that the seasonal pattern be smooth. For such series it is preferable to use a harmonic regression approach, where the seasonal pattern is modelled using Fourier terms and the short-term dynamics are handled by an ARMA error.
The fourier() function makes it easy to generate the required harmonics. Its first argument identifies the seasonal periods m and the length of the predictors to return. The second argument, K, specifies how many pairs of sine and cosine terms to include; the maximum allowed is K = m/2, where m is the seasonal period. The higher the order K, the more "wiggly" the seasonal pattern is allowed to be; with K=1 it is a pure sine curve.
We fit a dynamic harmonic regression model with an ARMA error structure. The number of Fourier terms for each seasonal period was chosen to minimize the AIC and reduce the RMSE and NMSE. As the data follow both a weekly and an annual pattern, two K values are included in the model; the selected pair is shown in the code below, and a sketch of how such a search could be run follows.
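As a rough illustration of how the annual Fourier order could be screened (our own sketch, not necessarily the exact search we ran), the loop below fixes the weekly order at 1 and compares the AICc of models with different annual orders, using the train object defined above. It is slow, since auto.arima is refitted for each candidate.
#Hypothetical screening of the annual Fourier order (weekly order fixed at 1); smaller AICc is better
aicc_by_K <- sapply(1:10, function(K2) {
  xreg <- fourier(train, K = c(1, K2))
  fit  <- auto.arima(train, xreg = xreg, seasonal = FALSE)
  fit$aicc
})
which.min(aicc_by_K) #annual order with the lowest AICc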
### Dynamic Harmonic Regression
z <- fourier(train, K=c(1,7))
zf <- fourier(train, K=c(1,7), h=31)
fit <- auto.arima(train, xreg=z, seasonal=FALSE)
fc <- forecast(fit, xreg=zf, h=31)
rmseDHR <- accuracy(fc,y)["Test set", "RMSE"]
rmseDHR
## [1] 150.6233
nmseDHR<- round(sum((testdf$value - fc$mean)^2) / sum((testdf$value-mean(testdf$value))^2)*100,2)
nmseDHR
## [1] 39.82
The RMSE on the test data is 150.62, which is higher than that of the XGBoost model. The graph below compares the predicted values with the test data. The predictions follow the trend but do not capture the weekly peaks in the data.
### Dynamic Harmonic Regression Forecast vs Actuals
forecast2 <- fc$mean
traindf2 <- traindf%>%dplyr::mutate(forecast = as.numeric(value), actuals = as.numeric(value))%>%dplyr::select(time, actuals, forecast)
results2 <- testdf%>%dplyr::mutate(forecast = fc$mean, actuals = as.numeric(testdf$value))%>%dplyr::select(time, actuals, forecast)
all2 <- rbind(traindf2, results2)%>%dplyr::select(time, actuals, forecast)%>%dplyr::filter(time > "2019-06-01")
p <- ggplot() +
geom_line(data = all2, aes(x = time, y = forecast, color = 'Forecast'))+
geom_line(data = all2, aes(x = time, y = actuals, color = 'Actuals')) +
xlab('Dates') +
ylab('Travellers') + scale_color_manual(values = c( Forecast = "red", Actuals = "blue"))+
labs(color="") + ggtitle("Dynamic Harmonic Regression Forecast") + theme_bw() +
theme(plot.title = element_text(size = 15, hjust = 0.5, face = "bold"))
p
TBATS
The TBATS model was introduced by De Livera, Hyndman & Snyder (2011, JASA). "TBATS" is an acronym denoting its salient features:
• T for trigonometric regressors to model multiple seasonalities
• B for Box-Cox transformations
• A for ARMA errors
• T for trend
• S for seasonality
A TBATS model differs from dynamic harmonic regression in that the seasonality is allowed to change slowly over time, whereas harmonic regression terms force the seasonal pattern to repeat periodically without changing. TBATS is one way to incorporate multiple seasonality into our model: it automates the choice of a Box-Cox transformation for the target variable, the parameters of the ARMA error model and the Fourier terms for the seasonal components.
### TBATS
fit <- tbats(train)
fc1 <- forecast(fit, h = 31)
rmsetbats <- accuracy(fc1,y)["Test set", "RMSE"]
rmsetbats
## [1] 183.0618
nmsetbats <- round(sum((testdf$value - fc1$mean)^2) / sum((testdf$value-mean(testdf$value))^2)*100,2)
nmsetbats
## [1] 58.82
The RMSE on the test data is 183, which is higher than both the dynamic harmonic regression and the XGBoost model. The graph below compares the predicted values with the test data. The predictions follow the trend but do not capture the weekly peaks and drops in the data.
### TBATS Forecast vs Actuals
traindf3 <- traindf%>%mutate(forecast = as.numeric(value), actuals = as.numeric(value))%>%dplyr::select(time, actuals, forecast)
results3 <- testdf%>%mutate(forecast = fc1$mean, actuals = as.numeric(testdf$value))%>%dplyr::select(time, actuals, forecast)
all <- rbind(traindf3, results3)%>%dplyr::select(time, actuals, forecast)%>%dplyr::filter(time > "2019-06-01")
p <- ggplot() +
geom_line(data = all, aes(x = time, y = forecast, color = 'Forecast'))+
geom_line(data = all, aes(x = time, y = actuals, color = 'Actuals')) +
xlab('Dates') +
ylab('Travellers') + scale_color_manual(values = c( Forecast = "red", Actuals = "blue"))+
labs(color= "")+
ggtitle("TBATS Forecast") + theme_bw() +
theme(plot.title = element_text(size = 15, hjust = 0.5, face = "bold"))
p
Prophet Model
Prophet is an algorithm from Facebook that fits an additive model including seasonality, autocorrelation, extra regressors, etc. One of the features of the prophet() function is that it automatically chooses change points in the time series, which makes the resulting model somewhat more robust than the other models considered.
Prophet uses a decomposable time series model with three main components: trend, seasonality and holidays, combined in the following equation: y(t) = g(t) + s(t) + h(t) + εt, where:
• g(t): piecewise linear or logistic growth curve modelling non-periodic changes in the series
• s(t): periodic changes (e.g. weekly/yearly seasonality)
• h(t): effects of holidays (user provided) with irregular schedules
• εt: error term accounting for any unusual changes not accommodated by the model
Using time as a regressor, Prophet fits several linear and non-linear functions of time as components. It frames forecasting as a curve-fitting exercise rather than looking explicitly at the time-based dependence of each observation within the series.
Seasonalities in Prophet are estimated using a partial Fourier sum. The number of terms in the partial sum (the order) is a parameter that determines how quickly the seasonality can change. For our time series, the Prophet parameters were selected to minimize the RMSE on the test data. For example, the yearly seasonality order was reduced from its default to fit lower-frequency changes, as the yearly pattern does not change much across years, while the default weekly seasonality setting was kept, as it fits the higher-frequency changes in the weekly pattern.
A holiday parameter was added to include the public holidays in China, as the data follow a different pattern on those days.
The code below displays the parametrization process for this algorithm.
### Prophet Model
trainr <- df%>%dplyr::select(Date, Total_trav, Profile_Business, Avg.LOS)%>%dplyr::filter(Date < "2019-08-01")
testr <- df%>%dplyr::select(Date, Total_trav, Profile_Business, Avg.LOS)%>%dplyr::filter(Date >= "2019-08-01")
## Creating data frame for prophet model
trainp <- traindf2%>%dplyr::select(time, actuals)%>%mutate(ds = time, y = actuals)%>%dplyr::select(ds, y)
testp <- testdf%>%mutate(ds = time, y = value)%>%dplyr::select(ds, y)
### Creating the outliers
outliers <- data_frame(
holiday = 'holiday',
ds = as.Date(c('2015-08-08', '2015-08-01', '2019-08-30')),
lower_window = 0,
upper_window = 1
)
holidays <- outliers
m <- prophet(weekly.seasonality= TRUE, yearly.seasonality = 7, daily.seasonality = TRUE, seasonality.mode = "additive", changepoint.prior.scale = 5, holidays.prior.scale = 1, holidays = holidays)
m <- add_country_holidays(m, country_name = 'China')
m <- fit.prophet(m, trainp)
## Forecasting Model
future <- make_future_dataframe(m, periods = 31, include_history = FALSE)
fcst1 <- predict(m, future)
rmseprophet <- accuracy(fcst1$yhat,testp$y)["Test set", "RMSE"]
rmseprophet
## [1] 110.4259
nmseprophet <- round(sum((testp$y - fcst1$yhat)^2) / sum((testp$y-mean(testp$y))^2)*100,2)
nmseprophet
## [1] 21.4
The RMSE on the test data is 110.43, the lowest among all the models so far. The graph below compares the predicted values with the test data. The predictions follow the trend and the yearly, weekly and daily seasonality, and the model also captures the change points in the data (peaks and drops).
### Prophet Model Forecast vs Actuals
results4 <- fcst1%>%mutate(time = as.Date(ds), forecast = round(yhat, 0), actuals = testdf$value)%>%dplyr::select(time, actuals, forecast)
trainf4 <- traindf2%>%dplyr::select(time, actuals, forecast)
all3 <- rbind(trainf4, results4)%>%dplyr::filter(time > "2019-06-01")
p <- ggplot() +
geom_line(data = all3, aes(x = time, y = forecast, color = 'Forecast'))+
geom_line(data = all3, aes(x = time, y = actuals, color = 'Actuals')) +
xlab('Dates') +
ylab('Travellers') + scale_color_manual(values = c( Forecast = "red", Actuals = "blue"))+
labs(color="") + ggtitle("Prophet Forecast") + theme_bw() +
theme(plot.title = element_text(size = 15, hjust = 0.5, face = "bold"))
p
Prophet Model with Avg LOS as Regressor
In Prophet, additional regressors can be added to the linear part of the model using the add_regressor function. A column with the regressor value must be present in both the fitting and prediction data frames. In our case, adding the Avg.LOS predictor as a regressor improved the RMSE from 110.43 to 109.80 (see code and graph below).
### Prophet Model with Avg.LOS as Regressor
## Adding a Regressor
### Avg.LOS Prophet Model
t4 <- as.data.frame(trainr%>%dplyr::select(Date, Avg.LOS)%>%mutate(ds = as.Date(Date), y = Avg.LOS)%>%dplyr::select(ds, y))
test4 <- as.data.frame(testr%>%dplyr::select(Date, Avg.LOS)%>%mutate(ds = Date, y = Avg.LOS)%>%dplyr::select(ds, y))
m4 <- prophet(weekly.seasonality= FALSE, yearly.seasonality = 6, daily.seasonality = 4, seasonality.mode = "additive", changepoint.prior.scale = 0.001, holidays = holidays)
m4 <- fit.prophet(m4, t4)
future4 <- make_future_dataframe(m4, periods = 31, include_history = FALSE)
fcst4 <- predict(m4, future4)
trainp1 <- traindf2%>%dplyr::select(time, actuals)%>%mutate(ds = time, y = actuals)%>%dplyr::select(ds, y)
trainp1 <- dplyr::left_join(trainp1, t4%>%dplyr::rename(Avg.LOS = y), by = c('ds' = 'ds'))
trainp1 <- trainp1%>%dplyr::mutate(Avg.LOS = ifelse(is.nan(Avg.LOS), 0, Avg.LOS))
testp1 <- testdf%>%mutate(ds = time, y = value)%>%dplyr::select(ds, y)
## Prophet Model with Avg.LOS as regressor
m <- prophet(weekly.seasonality= TRUE, yearly.seasonality = 7, daily.seasonality = TRUE, seasonality.mode = "additive", changepoint.prior.scale = 5, holidays.prior.scale = 1, holidays = holidays)
m <- add_country_holidays(m, country_name = 'China')
m <- add_regressor(m, 'Avg.LOS')
m <- fit.prophet(m, trainp1)
future <- make_future_dataframe(m, periods = 31, include_history = FALSE)
future.2 <- dplyr::left_join(future,fcst4 %>%
dplyr::select(ds, yhat), by = c('ds' = 'ds')) %>%
dplyr::rename(Avg.LOS= yhat)
future.2$Avg.LOS <- ifelse(is.na(future.2$Avg.LOS), mean(future.2$Avg.LOS, na.rm = TRUE), future.2$Avg.LOS)
fcst2 <- predict(m, future.2)
rmserprophetavglos <- accuracy(fcst2$yhat,testp1$y)["Test set", "RMSE"]
rmserprophetavglos
## [1] 109.7955
nmseprophetr <- round(sum((testp1$y - fcst2$yhat)^2) / sum((testp1$y-mean(testp1$y))^2)*100,2)
nmseprophetr
## [1] 21.16
### Prophet Model with Avg.LOS as Regressor Forecast vs Actuals
results5 <- fcst2%>%mutate(time = as.Date(ds), forecast = round(yhat, 0), actuals = testdf$value)%>%dplyr::select(time, actuals, forecast)
trainf5 <- traindf2%>%dplyr::select(time, actuals, forecast)
all4 <- rbind(trainf5, results5)%>%dplyr::filter(time > "2019-06-01")
p <- ggplot() +
geom_line(data = all4, aes(x = time, y = forecast, color = 'Forecast'))+
geom_line(data = all4, aes(x = time, y = actuals, color = 'Actuals')) +
xlab('Dates') +
ylab('Travellers') + scale_color_manual(values = c( Forecast = "red", Actuals = "blue"))+
labs(color="") + ggtitle("Prophet Forecast") + theme_bw() +
theme(plot.title = element_text(size = 15, hjust = 0.5, face = "bold"))
p
As part of future research, the Prophet model could be validated with more regressors in order to improve performance; one such extension is sketched below.
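As an illustration of that direction (our own sketch, not part of the current results), Profile_Business, which is already carried in the trainr and testr frames above, could be added alongside Avg.LOS following the same add_regressor pattern:
#Hypothetical extension: a second regressor on top of Avg.LOS
m5 <- prophet(weekly.seasonality = TRUE, yearly.seasonality = 7, daily.seasonality = TRUE,
              seasonality.mode = "additive", changepoint.prior.scale = 5,
              holidays.prior.scale = 1, holidays = holidays)
m5 <- add_country_holidays(m5, country_name = 'China')
m5 <- add_regressor(m5, 'Avg.LOS')
m5 <- add_regressor(m5, 'Profile_Business')
#fit.prophet() then needs ds, y, Avg.LOS and Profile_Business columns in the training frame,
#and the future frame needs values (or forecasts) for both regressors before calling predict()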
We have tried a number of ML and time series models and concluded that Random Forest and XGBoost are the best performing on the ML side, while Prophet is the best on the time series side.
In the coming weeks, we will try to refine these models. Some ideas along this front are:
Below we show the table with the predictions for August for each model compared to the actuals and the table with the performance metrics (i.e. RMSE and NMSE) for each model.
### Combined table of predictions from the ML, ensemble and time series models
append1 <- traindf2%>%dplyr::select(time, actuals)%>%mutate(Date = time, Actuals = actuals, XGBoost = actuals, RandomForest = actuals, Prediction_Average = actuals,
DHR = actuals, TBATS= actuals, Prophet = actuals, Prophet_Regressor = actuals)%>%
dplyr::select(Date, Actuals, XGBoost, RandomForest, Prediction_Average,DHR,TBATS, Prophet, Prophet_Regressor)
MLpredOutTble <- MLpredOutTble%>%mutate(DHR = results2$forecast, TBATS = results3$forecast, Prophet = results4$forecast, Prophet_Regressor = results5$forecast)
apptable <- rbind(append1, MLpredOutTble)%>%dplyr::filter(Date > '2019-06-30')
write.csv(apptable, "apptable.csv")
kable(MLpredOutTble )
Date | Actuals | XGBoost | RandomForest | Prediction_Average | DHR | TBATS | Prophet | Prophet_Regressor |
---|---|---|---|---|---|---|---|---|
2019-08-01 | 999 | 986 | 917 | 952 | 807.8437 | 767.5234 | 977 | 976 |
2019-08-02 | 1102 | 885 | 879 | 882 | 893.2569 | 706.9147 | 939 | 940 |
2019-08-03 | 1269 | 1045 | 1001 | 1023 | 938.2670 | 932.2406 | 1109 | 1115 |
2019-08-04 | 920 | 909 | 896 | 902 | 952.6667 | 865.8596 | 1028 | 1027 |
2019-08-05 | 1068 | 870 | 872 | 871 | 898.5229 | 789.5321 | 949 | 940 |
2019-08-06 | 798 | 742 | 840 | 791 | 841.7251 | 685.4971 | 878 | 867 |
2019-08-07 | 713 | 822 | 835 | 828 | 802.0072 | 714.6886 | 884 | 866 |
2019-08-08 | 828 | 866 | 851 | 858 | 821.0152 | 695.4295 | 861 | 856 |
2019-08-09 | 832 | 806 | 862 | 834 | 863.3970 | 647.7509 | 814 | 811 |
2019-08-10 | 1186 | 1101 | 1007 | 1054 | 899.3646 | 845.2376 | 974 | 977 |
2019-08-11 | 869 | 879 | 853 | 866 | 881.6828 | 781.9332 | 886 | 882 |
2019-08-12 | 773 | 826 | 876 | 851 | 818.9974 | 706.1233 | 800 | 788 |
2019-08-13 | 851 | 758 | 769 | 764 | 738.7777 | 605.0032 | 722 | 710 |
2019-08-14 | 671 | 845 | 760 | 802 | 692.1391 | 622.0901 | 723 | 704 |
2019-08-15 | 584 | 657 | 750 | 704 | 694.8388 | 597.1188 | 698 | 691 |
2019-08-16 | 701 | 706 | 751 | 728 | 732.6611 | 549.5090 | 647 | 645 |
2019-08-17 | 672 | 756 | 834 | 795 | 758.3649 | 711.6362 | 807 | 811 |
2019-08-18 | 448 | 635 | 648 | 642 | 738.9408 | 654.4061 | 719 | 716 |
2019-08-19 | 498 | 617 | 742 | 680 | 671.2565 | 589.4530 | 635 | 624 |
2019-08-20 | 607 | 563 | 644 | 604 | 592.3417 | 505.4215 | 561 | 550 |
2019-08-21 | 583 | 500 | 687 | 594 | 545.3482 | 522.3883 | 566 | 550 |
2019-08-22 | 578 | 522 | 592 | 557 | 552.4410 | 505.3893 | 547 | 543 |
2019-08-23 | 468 | 468 | 596 | 532 | 594.0012 | 469.9450 | 504 | 505 |
2019-08-24 | 727 | 815 | 774 | 794 | 627.0628 | 617.2653 | 672 | 680 |
2019-08-25 | 498 | 538 | 552 | 545 | 614.9296 | 575.8284 | 595 | 596 |
2019-08-26 | 484 | 445 | 485 | 465 | 557.2997 | 526.8783 | 521 | 516 |
2019-08-27 | 383 | 405 | 568 | 486 | 488.6655 | 459.3342 | 459 | 454 |
2019-08-28 | 347 | 454 | 455 | 454 | 454.0290 | 483.2979 | 478 | 467 |
2019-08-29 | 392 | 455 | 477 | 466 | 473.7872 | 476.0727 | 472 | 474 |
2019-08-30 | 934 | 640 | 700 | 670 | 529.5072 | 450.7729 | 1033 | 1044 |
2019-08-31 | 589 | 661 | 744 | 702 | 576.9698 | 603.0402 | 704 | 737 |
output.tablet <- data.frame("Model" = c('XGBoost Refined','Random Forest Refined', "Ensemble", "DHR", "TBATS", "Prophet", "Prophet with Regressor"),
"RMSE" = c(rmse.xgbt,rmse.rft, ensmb.rmse, rmseDHR, rmsetbats, rmseprophet, rmserprophetavglos ),
"NMSE" = c(nmse.xgbt,nmse.rft, ensmb.nmse, nmseDHR, nmsetbats, nmseprophet, nmseprophetr))
kable(output.tablet)
Model | RMSE | NMSE |
---|---|---|
XGBoost Refined | 112.8200 | 22.34 |
Random Forest Refined | 134.9400 | 31.96 |
Ensemble | 118.5200 | 24.65 |
DHR | 150.6233 | 39.82 |
TBATS | 183.0618 | 58.82 |
Prophet | 110.4259 | 21.40 |
Prophet with Regressor | 109.7955 | 21.16 |