This report can also be found at: http://rpubs.com/borjaureta/529795
Team Globetrotters is composed of:
We partnered with DFS Group to forecast travel and purchase intent of Chinese consumers traveling to Paris. DFS Group is the world’s leading luxury travel retailer; its network consists of duty-free stores located in 11 major global airports and 20 downtown T Galleria locations.
DFS has re-oriented its strategy towards the Chinese market, aiming to be as close as possible to where Chinese travelers are. As part of this strategy, DFS will be opening a new store in Paris, France, with a deep focus on Chinese tourists.
Our initial goal is to predict flows of Chinese tourists into Paris. For this, we will leverage actual and forward-looking reservation data owned by DFS in a proprietary database.
# Creating a vector of packages used within.
packages <- c('adabag', 'astsa',
'base',
'caret', 'caTools','chron', 'Cubist', 'cowplot',
'DMwR2','doParallel','dplyr', 'data.table',
'e1071','earth',
'forecast',
'ggplot2','gridExtra', 'gbm', 'grt', 'ggfortify',
'here', 'h2o', 'httr',
'ipred',
'janitor',
'knitr',
'lme4','lubridate',
'MASS',
'neuralnet', 'nnet',
'plyr','psych', 'performanceEstimation', 'prophet',
'randomForest','readr', 'readxl','rlang','rpart','rpart.plot',
'stats', 'scales',
'tidyverse', 'timeSeries', 'tsbox',
'urca',
'varSelRF',
'xgboost', 'xts',
'zoo')
# Checking for package installations on the system and installing if not found.
if (length(setdiff(packages, rownames(installed.packages()))) > 0) {
install.packages(setdiff(packages, rownames(installed.packages())))
}
# Including the packages for use.
for(package in packages){
library(package, character.only = TRUE)
}
#Ensure wd is set to current location by using here()
setwd(here::here())
#h2oInstance <- h2o.init(ip = 'localhost')
The main data source for this exercise is ForwardKeys (“FK”). FK is a company that analyses millions of daily flights and provides clients with insights for their tactical decisions; it crunches and analyzes over 17 million booking transactions a day. Additionally, it pulls a wide range of other relevant data, such as flight searches and aviation capacity.
DFS provided us with a data pull from FK. This dataset contains actual information on Chinese travelers to Paris between January 2, 2014 and August 31, 2019. It also provides additional valuable information such as length of stay in Paris, total trip duration, number of people per booking, lead time between booking and flight, cabin, distribution channel, type of stay, one-way/return, travel day of the week and traveler profile (i.e. business, leisure and group).
We will leverage the FK data with two goals:
The dataset was included in a spreadsheet that contained 3 tabs with a few tables. We imported the data and ran a few data cleaning and structuring code lines to enable the subsequent data analysis.
#Import dataset
#Import table with 15 to 19 data
pdset15_19 <- as.tibble(read_xlsx('forwardkeys-Arrivals-History-Paris(FR)_PAR.xlsx', col_names = FALSE, na = c("", "NA"), sheet = 'China-CN',
skip = 121, n_max = 1703))
#Import 2014 from second table
pdset14 <- as.tibble(read_xlsx('forwardkeys-Arrivals-History-Paris(FR)_PAR.xlsx', col_names = FALSE, na = c("", "NA"), sheet = 'China-CN',
skip = 1828, n_max = 364))
#Join both tables by row
pdset <- rbind(pdset14, pdset15_19)
#Change column names
colnames(pdset) <- c('Date', 'Total_trav', 'Total_trav_China/CN', 'Returnhome', 'ShortTransfer', 'DwellingTransfer', 'LongTransfer', 'Stopover', 'DayTrip', 'LoS_1night', 'LoS_2nights', 'LoS_3nights', 'LoS_4to5nights', 'LoS_6to8nights', 'LoS_9to13nights', 'LoS_14to21nights', 'LoS_22nightsormore', 'Endoftrip', 'TrDur_0night', 'TrDur_1night', 'TrDur_2nights', 'TrDur_3nights', 'TrDur_4to5nights', 'TrDur_6to8nights', 'TrDur_9to13nights', 'TrDur_14to21nights', 'TrDur_22nightsormore', 'ppb_1pax', 'ppb_2pax', 'ppb_3pax', 'ppb_4pax', 'ppb_5pax', 'ppb_6to9pax', 'ppb_10pl_pax', 'LeadT_0to4days', 'LeadT_5to14days', 'LeadT_15to29days', 'LeadT_30to44days', 'LeadT_45to59days', 'LeadT_60to89days', 'LeadT_90to119days', 'LeadT_120to364days', 'Cabin_Economy', 'Cabin_EconomyPremium', 'Cabin_Business', 'Cabin_First', 'DistCh_OnlineTA', 'DistCh_CorporateTA', 'DistCh_RetailTA', 'DistCh_OtherTA', 'TypeSt_Weekendstay', 'TypeSt_Workweekstay', 'TypeSt_Combinedstay', 'TypeSt_Nostay', 'OneRet_Oneway', 'OneRet_Return', 'OneRet_MultiCity', 'TrDay_Sunday', 'TrDay_Monday', 'TrDay_Tuesday', 'TrDay_Wednesday', 'TrDay_Thursday', 'TrDay_Friday', 'TrDay_Saturday', 'Profile_Business', 'Profile_Leisure', 'Profile_Group', 'Profile_VFRndExpats', 'Sign_InitialBookings', 'Sign_Partialadditions_modif', 'Sign_Partialcancellations', 'Sign_Fulltripcancellations', 'Avg.LOS', 'Avg.LOT', 'Avg.LT', 'Avg.PPB')
#Get rid of columns that have no data and are not useful for our analysis
pdset <- dplyr::select(pdset, -c('Total_trav_China/CN', 'Returnhome', 'ShortTransfer', 'DwellingTransfer', 'LongTransfer', 'Stopover', 'DayTrip', 'Endoftrip', 'TrDur_0night', 'TypeSt_Nostay', 'TrDay_Sunday', 'TrDay_Monday', 'TrDay_Tuesday', 'TrDay_Wednesday', 'TrDay_Thursday', 'TrDay_Friday', 'TrDay_Saturday', 'Profile_VFRndExpats', 'OneRet_Oneway'))
#Add day of the week, week, month and year.
pdset <- dplyr::mutate(pdset,
WeekDay=lubridate::wday(pdset$Date, label = TRUE),
Month=lubridate::month(pdset$Date, label = TRUE),
Week=lubridate::week(pdset$Date),
Year=lubridate::year(pdset$Date))
#Import dataset
#Import table with 15 to 19 data
vdset15_19 <- as.tibble(read_xlsx('forwardkeys-Arrivals-History-Venice(IT)_VCE.xlsx', col_names = FALSE, na = c("", "NA"), sheet = 'China-CN',
skip = 121, n_max = 1703))
#Import 2014 from second table
vdset14 <- as.tibble(read_xlsx('forwardkeys-Arrivals-History-Venice(IT)_VCE.xlsx', col_names = FALSE, na = c("", "NA"), sheet = 'China-CN',
skip = 1828, n_max = 364))
#Join both tables by row
vdset <- rbind(vdset14, vdset15_19)
#Change column names
colnames(vdset) <- c('Date', 'Total_trav', 'Total_trav_China/CN', 'Returnhome', 'ShortTransfer', 'DwellingTransfer', 'LongTransfer', 'Stopover', 'DayTrip', 'LoS_1night', 'LoS_2nights', 'LoS_3nights', 'LoS_4to5nights', 'LoS_6to8nights', 'LoS_9to13nights', 'LoS_14to21nights', 'LoS_22nightsormore', 'Endoftrip', 'TrDur_0night', 'TrDur_1night', 'TrDur_2nights', 'TrDur_3nights', 'TrDur_4to5nights', 'TrDur_6to8nights', 'TrDur_9to13nights', 'TrDur_14to21nights', 'TrDur_22nightsormore', 'ppb_1pax', 'ppb_2pax', 'ppb_3pax', 'ppb_4pax', 'ppb_5pax', 'ppb_6to9pax', 'ppb_10pl_pax', 'LeadT_0to4days', 'LeadT_5to14days', 'LeadT_15to29days', 'LeadT_30to44days', 'LeadT_45to59days', 'LeadT_60to89days', 'LeadT_90to119days', 'LeadT_120to364days', 'Cabin_Economy', 'Cabin_EconomyPremium', 'Cabin_Business', 'Cabin_First', 'DistCh_OnlineTA', 'DistCh_CorporateTA', 'DistCh_RetailTA', 'DistCh_OtherTA', 'TypeSt_Weekendstay', 'TypeSt_Workweekstay', 'TypeSt_Combinedstay', 'TypeSt_Nostay', 'OneRet_Oneway', 'OneRet_Return', 'OneRet_MultiCity', 'TrDay_Sunday', 'TrDay_Monday', 'TrDay_Tuesday', 'TrDay_Wednesday', 'TrDay_Thursday', 'TrDay_Friday', 'TrDay_Saturday', 'Profile_Business', 'Profile_Leisure', 'Profile_Group', 'Profile_VFRndExpats', 'Sign_InitialBookings', 'Sign_Partialadditions_modif', 'Sign_Partialcancellations', 'Sign_Fulltripcancellations', 'Avg.LOS', 'Avg.LOT', 'Avg.LT', 'Avg.PPB')
#Get rid of columns that have no data and are not useful for our analysis
vdset <- dplyr::select(vdset, -c('Total_trav_China/CN', 'Returnhome', 'ShortTransfer', 'DwellingTransfer', 'LongTransfer', 'Stopover', 'DayTrip', 'Endoftrip', 'TrDur_0night', 'TypeSt_Nostay', 'Avg.LOS', 'Avg.LOT', 'Avg.LT', 'Avg.PPB', 'TrDay_Sunday', 'TrDay_Monday', 'TrDay_Tuesday', 'TrDay_Wednesday', 'TrDay_Thursday', 'TrDay_Friday', 'TrDay_Saturday', 'Profile_VFRndExpats', 'OneRet_Oneway'))
#Add day of the week, week, month and year.
vdset <- dplyr::mutate(vdset,
WeekDay=lubridate::wday(vdset$Date, label = TRUE),
Month=lubridate::month(vdset$Date, label = TRUE),
Week=lubridate::week(vdset$Date),
Year=lubridate::year(vdset$Date))
We have included below the summary and structure of the dataset.
str(pdset)
## Classes 'tbl_df', 'tbl' and 'data.frame': 2067 obs. of 61 variables:
## $ Date : POSIXct, format: "2014-01-02" "2014-01-03" ...
## $ Total_trav : num 147 154 202 172 180 141 290 150 267 314 ...
## $ LoS_1night : num 0 0 3 5 13 6 8 11 15 15 ...
## $ LoS_2nights : num 10 12 1 15 18 8 19 42 7 25 ...
## $ LoS_3nights : num 46 20 16 27 17 13 63 14 37 35 ...
## $ LoS_4to5nights : num 80 91 67 72 73 39 86 27 129 87 ...
## $ LoS_6to8nights : num 28 16 65 20 35 27 65 29 33 79 ...
## $ LoS_9to13nights : num 12 6 29 18 14 25 20 11 16 44 ...
## $ LoS_14to21nights : num -35 6 16 13 4 21 27 15 29 19 ...
## $ LoS_22nightsormore : num 6 3 5 2 6 2 2 1 1 10 ...
## $ TrDur_1night : num 0 0 0 3 1 2 1 1 3 0 ...
## $ TrDur_2nights : num 1 0 0 0 4 1 5 6 1 1 ...
## $ TrDur_3nights : num 1 0 1 0 11 1 12 4 6 5 ...
## $ TrDur_4to5nights : num 11 11 6 18 34 12 30 13 23 26 ...
## $ TrDur_6to8nights : num 53 70 78 80 33 42 116 66 130 115 ...
## $ TrDur_9to13nights : num 82 41 71 50 75 55 73 36 51 124 ...
## $ TrDur_14to21nights : num 15 24 33 12 15 19 38 17 44 31 ...
## $ TrDur_22nightsormore : num -16 8 13 9 7 9 15 7 9 12 ...
## $ ppb_1pax : num 46 40 87 75 91 67 106 79 89 114 ...
## $ ppb_2pax : num 32 20 34 26 24 46 70 56 46 62 ...
## $ ppb_3pax : num 0 18 12 6 36 3 27 3 27 27 ...
## $ ppb_4pax : num 4 4 20 8 8 12 28 12 12 32 ...
## $ ppb_5pax : num 5 0 20 5 10 0 5 0 15 20 ...
## $ ppb_6to9pax : num 55 21 0 0 0 13 38 0 38 19 ...
## $ ppb_10pl_pax : num 5 51 29 52 11 0 16 0 40 40 ...
## $ LeadT_0to4days : num -12 -82 -27 -19 -11 15 -26 1 -6 -9 ...
## $ LeadT_5to14days : num 7 -14 14 19 -21 -32 35 -7 13 -1 ...
## $ LeadT_15to29days : num 26 40 68 9 80 86 128 33 23 115 ...
## $ LeadT_30to44days : num -32 25 13 49 49 44 82 22 78 99 ...
## $ LeadT_45to59days : num 54 76 70 40 23 16 9 77 33 28 ...
## $ LeadT_60to89days : num 60 26 12 73 17 4 48 17 28 71 ...
## $ LeadT_90to119days : num -72 0 44 -40 42 -33 8 1 7 -7 ...
## $ LeadT_120to364days : num 116 83 8 41 1 41 6 6 91 18 ...
## $ Cabin_Economy : num 130 134 153 140 133 112 247 116 236 258 ...
## $ Cabin_EconomyPremium : num 3 0 16 7 9 9 5 15 5 18 ...
## $ Cabin_Business : num 13 14 32 25 36 18 37 18 22 27 ...
## $ Cabin_First : num 1 6 1 0 2 2 1 1 4 11 ...
## $ DistCh_OnlineTA : num 15 7 39 14 32 30 32 19 31 69 ...
## $ DistCh_CorporateTA : num 17 42 58 41 25 29 54 17 64 46 ...
## $ DistCh_RetailTA : num 115 104 86 112 118 74 196 102 160 183 ...
## $ DistCh_OtherTA : num 0 1 19 5 5 8 8 12 12 16 ...
## $ TypeSt_Weekendstay : num 0 12 3 0 0 0 0 0 22 15 ...
## $ TypeSt_Workweekstay : num 0 0 0 119 93 27 27 11 0 0 ...
## $ TypeSt_Combinedstay : num 147 142 199 53 87 114 263 139 245 299 ...
## $ OneRet_Return : num 29 9 96 43 64 73 122 48 92 126 ...
## $ OneRet_MultiCity : num 118 145 106 129 116 68 168 102 175 188 ...
## $ Profile_Business : num 43 65 109 97 106 74 146 83 133 119 ...
## $ Profile_Leisure : num 44 30 93 23 63 54 106 74 96 136 ...
## $ Profile_Group : num 60 59 0 52 11 13 38 -7 38 59 ...
## $ Sign_InitialBookings : num 704 963 902 1032 848 ...
## $ Sign_Partialadditions_modif: num 326 348 161 211 324 182 361 172 382 278 ...
## $ Sign_Partialcancellations : num -577 -869 -593 -630 -794 ...
## $ Sign_Fulltripcancellations : num -306 -288 -268 -441 -198 -209 -358 -195 -325 -273 ...
## $ Avg.LOS : num 5.62 7.86 9.18 6.81 7.22 8.99 6.79 6.61 6.17 6.97 ...
## $ Avg.LOT : num 11.5 16.1 12.9 10.6 10.5 ...
## $ Avg.LT : num 66.2 137.1 65.8 74.4 57.8 ...
## $ Avg.PPB : num 4.16 6.9 3.99 5.71 2.74 2.18 3.34 1.65 4.69 4.08 ...
## $ WeekDay : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tue"<..: 5 6 7 1 2 3 4 5 6 7 ...
## $ Month : Ord.factor w/ 12 levels "Jan"<"Feb"<"Mar"<..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Week : num 1 1 1 1 1 1 2 2 2 2 ...
## $ Year : num 2014 2014 2014 2014 2014 ...
summary(pdset)
## Date Total_trav LoS_1night
## Min. :2014-01-02 00:00:00 Min. : 71.0 Min. :-4.00
## 1st Qu.:2015-06-03 12:00:00 1st Qu.: 404.0 1st Qu.: 6.00
## Median :2016-11-01 00:00:00 Median : 567.0 Median :11.00
## Mean :2016-10-31 19:46:24 Mean : 630.8 Mean :12.54
## 3rd Qu.:2018-04-01 12:00:00 3rd Qu.: 798.0 3rd Qu.:17.00
## Max. :2019-08-31 00:00:00 Max. :2129.0 Max. :79.00
##
## LoS_2nights LoS_3nights LoS_4to5nights LoS_6to8nights
## Min. : -6.00 Min. : -3.0 Min. : -16.0 Min. :-37.0
## 1st Qu.: 14.00 1st Qu.: 34.0 1st Qu.: 128.0 1st Qu.: 76.0
## Median : 22.00 Median : 55.0 Median : 198.0 Median :125.0
## Mean : 25.37 Mean : 62.2 Mean : 236.4 Mean :146.9
## 3rd Qu.: 33.00 3rd Qu.: 81.0 3rd Qu.: 298.0 3rd Qu.:191.0
## Max. :167.00 Max. :473.0 Max. :1078.0 Max. :618.0
##
## LoS_9to13nights LoS_14to21nights LoS_22nightsormore TrDur_1night
## Min. : 6.00 Min. :-35.00 Min. :-7.000 Min. :-5.000
## 1st Qu.: 46.00 1st Qu.: 23.00 1st Qu.: 2.000 1st Qu.: 0.000
## Median : 68.00 Median : 41.00 Median : 5.000 Median : 1.000
## Mean : 85.56 Mean : 54.97 Mean : 6.868 Mean : 1.917
## 3rd Qu.:112.00 3rd Qu.: 72.00 3rd Qu.: 9.000 3rd Qu.: 3.000
## Max. :593.00 Max. :412.00 Max. :49.000 Max. :39.000
##
## TrDur_2nights TrDur_3nights TrDur_4to5nights TrDur_6to8nights
## Min. :-7.000 Min. : -3.00 Min. :-28.00 Min. : 13.0
## 1st Qu.: 1.000 1st Qu.: 5.00 1st Qu.: 24.00 1st Qu.:107.0
## Median : 4.000 Median : 11.00 Median : 40.00 Median :157.0
## Mean : 4.717 Mean : 14.95 Mean : 50.63 Mean :173.9
## 3rd Qu.: 7.000 3rd Qu.: 20.00 3rd Qu.: 64.00 3rd Qu.:225.0
## Max. :39.000 Max. :187.00 Max. :555.00 Max. :657.0
##
## TrDur_9to13nights TrDur_14to21nights TrDur_22nightsormore ppb_1pax
## Min. : 8.0 Min. : 9.00 Min. :-16.00 Min. : 38
## 1st Qu.: 138.0 1st Qu.: 44.00 1st Qu.: 11.00 1st Qu.:136
## Median : 220.0 Median : 74.00 Median : 17.00 Median :179
## Mean : 264.1 Mean : 97.92 Mean : 22.66 Mean :195
## 3rd Qu.: 343.0 3rd Qu.:127.00 3rd Qu.: 28.00 3rd Qu.:239
## Max. :1236.0 Max. :525.00 Max. :164.00 Max. :526
##
## ppb_2pax ppb_3pax ppb_4pax ppb_5pax
## Min. : 12.0 Min. : -3.00 Min. : -4.00 Min. : -5.00
## 1st Qu.: 82.0 1st Qu.: 30.00 1st Qu.: 24.00 1st Qu.: 10.00
## Median :118.0 Median : 48.00 Median : 36.00 Median : 20.00
## Mean :132.8 Mean : 61.82 Mean : 44.15 Mean : 27.16
## 3rd Qu.:170.0 3rd Qu.: 81.00 3rd Qu.: 60.00 3rd Qu.: 35.00
## Max. :570.0 Max. :297.00 Max. :168.00 Max. :310.00
##
## ppb_6to9pax ppb_10pl_pax LeadT_0to4days LeadT_5to14days
## Min. : -8.0 Min. :-82.0 Min. :-349.00 Min. :-128.00
## 1st Qu.: 25.0 1st Qu.: 31.0 1st Qu.: -25.00 1st Qu.: 36.00
## Median : 46.0 Median : 71.0 Median : -8.00 Median : 75.00
## Mean : 55.4 Mean :114.5 Mean : -13.84 Mean : 83.66
## 3rd Qu.: 76.0 3rd Qu.:142.5 3rd Qu.: 5.00 3rd Qu.: 120.00
## Max. :296.0 Max. :965.0 Max. : 84.00 Max. : 470.00
##
## LeadT_15to29days LeadT_30to44days LeadT_45to59days LeadT_60to89days
## Min. :-45.0 Min. :-75.0 Min. :-88.00 Min. :-70.00
## 1st Qu.: 87.0 1st Qu.: 53.5 1st Qu.: 32.00 1st Qu.: 31.00
## Median :130.0 Median : 93.0 Median : 63.00 Median : 69.00
## Mean :137.6 Mean :102.6 Mean : 74.24 Mean : 87.79
## 3rd Qu.:178.0 3rd Qu.:140.0 3rd Qu.:105.00 3rd Qu.:126.50
## Max. :459.0 Max. :386.0 Max. :356.00 Max. :502.00
##
## LeadT_90to119days LeadT_120to364days Cabin_Economy
## Min. :-139.0 Min. :-45.0 Min. : 55.0
## 1st Qu.: 10.0 1st Qu.: 37.0 1st Qu.: 329.0
## Median : 31.0 Median : 79.0 Median : 472.0
## Mean : 44.3 Mean :114.4 Mean : 534.3
## 3rd Qu.: 66.5 3rd Qu.:152.5 3rd Qu.: 672.5
## Max. : 412.0 Max. :849.0 Max. :1953.0
##
## Cabin_EconomyPremium Cabin_Business Cabin_First DistCh_OnlineTA
## Min. : -2.0 Min. : 4.00 Min. :-4.000 Min. : 7.00
## 1st Qu.: 9.0 1st Qu.: 41.00 1st Qu.: 4.000 1st Qu.: 44.00
## Median : 16.0 Median : 60.00 Median : 7.000 Median : 66.00
## Mean : 20.8 Mean : 67.45 Mean : 8.277 Mean : 80.91
## 3rd Qu.: 26.0 3rd Qu.: 85.00 3rd Qu.:11.000 3rd Qu.:100.00
## Max. :118.0 Max. :275.00 Max. :56.000 Max. :480.00
##
## DistCh_CorporateTA DistCh_RetailTA DistCh_OtherTA TypeSt_Weekendstay
## Min. : -4.00 Min. : 51.0 Min. :-27.00 Min. : 0.000
## 1st Qu.: 52.00 1st Qu.: 269.0 1st Qu.: 4.00 1st Qu.: 0.000
## Median : 77.00 Median : 396.0 Median : 11.00 Median : 0.000
## Mean : 84.86 Mean : 441.6 Mean : 23.37 Mean : 7.872
## 3rd Qu.:109.00 3rd Qu.: 574.5 3rd Qu.: 29.00 3rd Qu.: 9.000
## Max. :282.00 Max. :1453.0 Max. :313.00 Max. :125.000
##
## TypeSt_Workweekstay TypeSt_Combinedstay OneRet_Return OneRet_MultiCity
## Min. : 0.0 Min. : 53.0 Min. : 9.0 Min. : 23.0
## 1st Qu.: 0.0 1st Qu.: 293.0 1st Qu.:143.0 1st Qu.: 226.0
## Median : 35.0 Median : 450.0 Median :208.0 Median : 340.0
## Mean : 104.7 Mean : 518.3 Mean :236.5 Mean : 394.3
## 3rd Qu.: 154.0 3rd Qu.: 672.5 3rd Qu.:298.0 3rd Qu.: 499.0
## Max. :1275.0 Max. :2119.0 Max. :813.0 Max. :1732.0
##
## Profile_Business Profile_Leisure Profile_Group Sign_InitialBookings
## Min. : 31.0 Min. : 22.0 Min. : -50.0 Min. : 282
## 1st Qu.:162.0 1st Qu.:146.0 1st Qu.: 55.0 1st Qu.:1162
## Median :215.0 Median :212.0 Median : 107.0 Median :1672
## Mean :230.8 Mean :249.6 Mean : 150.4 Mean :1950
## 3rd Qu.:284.5 3rd Qu.:330.0 3rd Qu.: 194.5 3rd Qu.:2410
## Max. :694.0 Max. :910.0 Max. :1043.0 Max. :7307
##
## Sign_Partialadditions_modif Sign_Partialcancellations
## Min. : 32.0 Min. :-5556
## 1st Qu.: 234.0 1st Qu.:-1706
## Median : 340.0 Median :-1161
## Mean : 396.6 Mean :-1360
## 3rd Qu.: 497.5 3rd Qu.: -807
## Max. :1999.0 Max. : -195
##
## Sign_Fulltripcancellations Avg.LOS Avg.LOT
## Min. :-1843.0 Min. : 4.330 Min. : 6.93
## 1st Qu.: -433.0 1st Qu.: 6.300 1st Qu.:10.16
## Median : -265.0 Median : 6.970 Median :11.11
## Mean : -355.6 Mean : 7.163 Mean :11.34
## 3rd Qu.: -168.0 3rd Qu.: 7.800 3rd Qu.:12.24
## Max. : -16.0 Max. :27.600 Max. :30.05
##
## Avg.LT Avg.PPB WeekDay Month Week
## Min. : 10.63 Min. : 0.930 Sun:295 Mar :186 Min. : 1.00
## 1st Qu.: 51.30 1st Qu.: 3.240 Mon:295 May :186 1st Qu.:13.00
## Median : 64.24 Median : 4.180 Tue:295 Jul :186 Median :25.00
## Mean : 68.03 Mean : 4.493 Wed:295 Aug :186 Mean :25.59
## 3rd Qu.: 80.11 3rd Qu.: 5.470 Thu:295 Jan :184 3rd Qu.:38.00
## Max. :297.71 Max. :11.100 Fri:296 Apr :180 Max. :53.00
## Sat:296 (Other):959
## Year
## Min. :2014
## 1st Qu.:2015
## Median :2016
## Mean :2016
## 3rd Qu.:2018
## Max. :2019
##
Below we start plotting some of the main variables.
Given that 2019 is a partial year with only eight months of data, it has been excluded from this initial piece of exploratory analysis.
# Create Paris subset to exclude partial 2019 data
pdset_14_18 <- subset(pdset, Year <= 2018)
vdset_14_18 <- subset(vdset, Year <= 2018)
# Define colours
barfill = "gold1"
barlines = "goldenrod2"
# Inbound passengers by Year Ex 2019
p_hist_tt_y <- ggplot(data = pdset_14_18, aes(x = Year, y = Total_trav)) +
geom_bar(stat="identity", fill = barfill) +
theme_light() +
labs(x = "Year", y = "Total Inbound Passengers") +
scale_y_continuous(labels = comma) +
ggtitle("Total Chinese Passengers Arrivals Direct to Paris by Year 2014 - 2018")
# Inbound passengers by Month Ex 2019
p_hist_tt_m <- ggplot(data = pdset_14_18, aes(x = Month, y = Total_trav)) +
geom_bar(stat="identity", fill = barfill) +
theme_light() +
labs(x = "Month", y = "Total Inbound Passengers") +
scale_y_continuous(labels = comma) +
ggtitle("Total Chinese Passengers Arrivals Direct to Paris by Month 2014 - 2018")
# Inbound passengers by Day of Week Ex 2019
p_hist_tt_d <- ggplot(data = pdset_14_18, aes(x = WeekDay, y = Total_trav)) +
geom_bar(stat="identity", fill = barfill) +
theme_light() +
labs(x = "Week Day", y = "Total Inbound Passengers") +
scale_y_continuous(labels = comma) +
ggtitle("Total Chinese Passengers Arrivals Direct to Paris by Weekday 2014 - 2018")
# Inbound passengers by Week Ex 2019
p_hist_tt_wk <- ggplot(data = pdset_14_18, aes(x = Week, y = Total_trav)) +
geom_bar(stat="identity", fill = barfill) +
theme_light() +
labs(x = "Week", y = "Total Inbound Passengers") +
scale_y_continuous(labels = comma) +
ggtitle("Total Chinese Passengers Arrivals Direct to Paris by Week 2014 - 2018")
# Plot grid layout using plot_grid function
plot_grid(p_hist_tt_y, p_hist_tt_m, p_hist_tt_d, p_hist_tt_wk, nrow = 2)
One can clearly see the drop-off in inbound passengers in 2016, which can be attributed to the Paris terrorist attacks of November 2015. Whilst 2016 received the lowest number of inbound direct visitors in our dataset, there has been an uptick in subsequent years, reflecting increased tourism confidence in Paris.
From a month perspective, we can see how the strongest period is June through September, with July being the strongest month.
Saturday is the strongest day of the week, followed by Sunday, with Tuesday being the lowest.
The pattern is very easy to observe when we plot the weeks: a strong peak during July, a decline in August and a pick-up in September, then a steady decline to the end of the year.
In conclusion, we can observe a strong pattern that repeats each year, with a peak during the summer months and a slight dip in August that picks up in September. Weekends are the strongest days of the week. In terms of years, 2015 showed the largest flows of Chinese tourists, with 2017 and 2018 showing improved performance over 2016.
# Subset dataset for arrivals, month and year
pdset_14_18_m_y <- pdset_14_18%>%dplyr::select(Month, Year, Total_trav)%>%dplyr::group_by(Month, Year)%>%dplyr::summarise(Total_trav = sum(Total_trav))
# Create line chart
line_chart_paris <- ggplot(data = pdset_14_18_m_y, aes(x = Month, y = Total_trav, group = Year, colour = as.factor(Year))) +
geom_line() +
geom_point() +
labs(x="Month", y="Passenger Arrivals", colour="Year") +
scale_y_continuous(labels = comma) +
ggtitle("Chinese Passenger Arrival to Paris by Year by Month") +
theme_light() +
theme(legend.position="top")
# Plot line chart
line_chart_paris
Not surprisingly, the arrival of Chinese passengers in Paris peaks during the summer months, with July being the most popular month. While we do not show the equivalent chart for Venice here, when we looked at the Venice data we noticed that the prominence of the summer months is more pronounced for Paris, which means DFS will potentially need to carry a somewhat larger level of summer inventory targeted at Chinese clients in Paris relative to Venice.
In addition, since 2015 the flow of Chinese passengers to Paris seems to have become a bit more even. While the summer months are still the most popular, we now observe relatively higher levels of visitors in other months of the year as well, particularly fall and spring, while the winter months (November to March) represent the lowest levels of Chinese tourism activity.
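For reference, the equivalent monthly chart for Venice can be produced with the same approach, re-using the Venice subset created above (a minimal sketch; line_chart_venice and vdset_14_18_m_y are our own object names):
# Subset Venice dataset for arrivals, month and year
vdset_14_18_m_y <- vdset_14_18%>%dplyr::select(Month, Year, Total_trav)%>%dplyr::group_by(Month, Year)%>%dplyr::summarise(Total_trav = sum(Total_trav))
# Create line chart for Venice
line_chart_venice <- ggplot(data = vdset_14_18_m_y, aes(x = Month, y = Total_trav, group = Year, colour = as.factor(Year))) +
geom_line() +
geom_point() +
labs(x="Month", y="Passenger Arrivals", colour="Year") +
scale_y_continuous(labels = comma) +
ggtitle("Chinese Passenger Arrival to Venice by Year by Month") +
theme_light() +
theme(legend.position="top")
# Plot line chart
line_chart_venice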
The ForwardKeys dataset allows us to assess arrivals by length of stay.
# Inbound passengers by LoS Ex 2019
# Select required rows from data frame
hist_tt_y_los <- subset(pdset_14_18, select = c("LoS_1night", "LoS_2nights", "LoS_3nights", "LoS_4to5nights","LoS_6to8nights", "LoS_9to13nights", "LoS_14to21nights", "LoS_22nightsormore", "Year"))
names(hist_tt_y_los) <- c("Los 1 Night", "Los 2 Nights", "LoS 3 Nights", "Los 4 to 5 Nights", "Los 6 to 8 Nights", "LoS 9 to 13 Nights", "Los 14 to 21 Nights", "Los 21 Nights or More", "Year")
hist_tt_y_los <- aggregate(hist_tt_y_los, by = list(hist_tt_y_los$Year), FUN = sum)
drops <- c("Year")
# Melt into desired format
hist_tt_y_los <- hist_tt_y_los[ , !names(hist_tt_y_los) %in% drops]
names(hist_tt_y_los) <- c("Year", "Los 1 Night", "Los 2 Nights", "LoS 3 Nights", "Los 4 to 5 Nights", "Los 6 to 8 Nights", "LoS 9 to 13 Nights", "Los 14 to 21 Nights", "Los 21 Nights or More")
hist_tt_y_los_melt <- melt(hist_tt_y_los, id="Year")
# Transpose the data set
hist_tt_y_los_transpose <- data.frame(t(hist_tt_y_los))
# Make first row the column names and delete first row
names(hist_tt_y_los_transpose) <- as.matrix(hist_tt_y_los_transpose[1, ])
hist_tt_y_los_transpose <- hist_tt_y_los_transpose[-1, ]
hist_tt_y_los_transpose[] <- lapply(hist_tt_y_los_transpose, function(x) type.convert(as.character(x)))
# Create new target column
hist_tt_y_los_transpose <- as.data.frame(hist_tt_y_los_transpose)
hist_tt_y_los_transpose$LoS <- rownames(hist_tt_y_los_transpose)
# Create first ID row
row.names(hist_tt_y_los_transpose) <- NULL
# Add Total column
hist_tt_y_los_transpose$Total <- hist_tt_y_los_transpose$`2014` +
hist_tt_y_los_transpose$`2015` +
hist_tt_y_los_transpose$`2016` +
hist_tt_y_los_transpose$`2017` +
hist_tt_y_los_transpose$`2018`
# Add levels to maintain order
hist_tt_y_los_transpose$LoS <- factor(hist_tt_y_los_transpose$LoS, levels = hist_tt_y_los_transpose$LoS)
p_hist_tt_los_barchart <- ggplot(data = hist_tt_y_los_transpose, aes(x = LoS, y = Total)) +
geom_bar(stat="identity", fill = barfill) +
theme_light() +
labs(x = "Length of Stay", y = "Total Inbound Passengers") +
scale_y_continuous(labels = comma) +
ggtitle("Total Chinese Arrivals to Paris by Length of Stay 2014 - 2018")
hist_tt_y_los <- subset(pdset_14_18, select = c("LoS_1night", "LoS_2nights", "LoS_3nights", "LoS_4to5nights","LoS_6to8nights", "LoS_9to13nights", "LoS_14to21nights", "LoS_22nightsormore", "Date"))
names(hist_tt_y_los) <- c("Los 1 Night", "Los 2 Nights", "LoS 3 Nights", "Los 4 to 5 Nights", "Los 6 to 8 Nights", "LoS 9 to 13 Nights", "Los 14 to 21 Nights", "Los 21 Nights or More", "Date")
hist_tt_y_los_melt <- melt(hist_tt_y_los, id = "Date")
p_hist_tt_y_los_boxplot <- ggplot(hist_tt_y_los_melt, aes(x = variable, y = value))+
geom_boxplot(alpha = 0) +
geom_jitter(alpha = 0.1, color = "tomato") +
theme_light() +
labs(x = "Length of Stay", y = "Daily Arrivals Volume") +
ggtitle("Length of Chinese Arrivals to Paris by Daily Arrival Volume")
# Plot charts in grid
plot_grid(p_hist_tt_los_barchart + coord_flip(), p_hist_tt_y_los_boxplot + coord_flip(), nrow = 1)
According to the ForwardKeys data, a total of 1.152mn Chinese travelers landed in Paris between 2014 and 2018.
Chinese tourists traveling into Paris generally stay several nights in the city. Short term stay categories (1 night, 2 nights and 3 nights) together only represent 16% of the total bookings, while 4 to 5 nights represent the most popular category with 38% of the bookings, and collectively stays above 3 nights represent 84% of the total bookings.
After 4 to 5 nights, the most popular category is 6 to 8 nights, which represents 22.8% of travelers, while people staying only one night before proceeding to another destination represented only around 2% of the total travelers.
This positions Paris as an attractive city for luxury retailers. Given the average length of stay of Chinese passengers, DFS can have multiple touch points with potential clients (from airport display ads to the boutique in the city) throughout their stay. Also, the same client could potentially drop by the same boutique more than once (unlike Chinese tourists on bus tours who spend a day or less in each city they visit).
However, this also creates a particular challenge: arguably, the longer length of stay gives the potential client more time to compare prices and products across different outlets when making a purchase decision, meaning that, for customers with the same price elasticity, DFS prices should probably not be too far from market prices for comparable products.
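These shares can be checked directly from the 2014 - 2018 subset; a minimal sketch (los_cols is our own helper vector):
# Share of total arrivals by length-of-stay bracket, 2014 - 2018 (sketch)
los_cols <- c("LoS_1night", "LoS_2nights", "LoS_3nights", "LoS_4to5nights", "LoS_6to8nights", "LoS_9to13nights", "LoS_14to21nights", "LoS_22nightsormore")
los_totals <- colSums(pdset_14_18[, los_cols], na.rm = TRUE)
round(100 * los_totals / sum(los_totals), 1)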
# Create melted data set for arrivals by length of stay by year
hist_tt_y_los <- subset(pdset_14_18, select = c("LoS_1night", "LoS_2nights", "LoS_3nights", "LoS_4to5nights","LoS_6to8nights", "LoS_9to13nights", "LoS_14to21nights", "LoS_22nightsormore", "Year"))
hist_tt_y_los <- aggregate(hist_tt_y_los, by = list(hist_tt_y_los$Year), FUN = sum)
drops <- c("Year")
hist_tt_y_los <- hist_tt_y_los[ , !names(hist_tt_y_los) %in% drops]
names(hist_tt_y_los) <- c("Year", "Los 1 Night", "Los 2 Nights", "LoS 3 Nights", "Los 4 to 5 Nights", "Los 6 to 8 Nights", "LoS 9 to 13 Nights", "Los 14 to 21 Nights", "Los 21 Nights or More")
hist_tt_y_los_melt <- melt(hist_tt_y_los, id="Year")
names(hist_tt_y_los_melt) <- c("Year", "Length_of_Stay", "Value")
hist_tt_y_los_melt$Year <- as.factor(hist_tt_y_los_melt$Year)
p_hist_tt_y_los_melt_heatmap <- ggplot(hist_tt_y_los_melt, aes(x = hist_tt_y_los_melt$Year, y = hist_tt_y_los_melt$Length_of_Stay)) +
geom_tile(aes(fill = Value)) +
scale_x_discrete("Year", expand = c(0, 0)) +
scale_y_discrete("Length of Stay", expand = c(0, 0)) +
scale_fill_gradient("Number of Arrivals", low = "white", high = "steelblue" ) +
theme_bw() +
ggtitle("Length of Chinese Arrival to Paris by Year") +
theme(panel.grid.major = element_line(colour = NA), panel.grid.minor = element_line(colour = NA))
# Create melted data set for arrivals by length of stay by month
hist_tt_m_los <- subset(pdset_14_18, select = c("LoS_1night", "LoS_2nights", "LoS_3nights", "LoS_4to5nights","LoS_6to8nights", "LoS_9to13nights", "LoS_14to21nights", "LoS_22nightsormore", "Month"))
hist_tt_m_los$Month <- as.integer(hist_tt_m_los$Month)
hist_tt_m_los <- aggregate(hist_tt_m_los, by = list(hist_tt_m_los$Month), FUN = sum)
drops <- c("Month")
hist_tt_m_los <- hist_tt_m_los[ , !names(hist_tt_m_los) %in% drops]
names(hist_tt_m_los) <- c("Month", "Los 1 Night", "Los 2 Nights", "LoS 3 Nights", "Los 4 to 5 Nights", "Los 6 to 8 Nights", "LoS 9 to 13 Nights", "Los 14 to 21 Nights", "Los 21 Nights or More")
hist_tt_m_los_melt <- melt(hist_tt_m_los, id="Month")
names(hist_tt_m_los_melt) <- c("Month", "Length_of_Stay", "Value")
hist_tt_m_los_melt$Month <- as.factor(hist_tt_m_los_melt$Month)
p_hist_tt_m_los_heatmap <- ggplot(hist_tt_m_los_melt, aes(x = hist_tt_m_los_melt$Month, y = hist_tt_m_los_melt$Length_of_Stay)) +
geom_tile(aes(fill = Value)) +
scale_x_discrete("Month", expand = c(0, 0)) +
scale_y_discrete("Length of Stay", expand = c(0, 0)) +
scale_fill_gradient("Number of Arrivals", low = "white", high = "steelblue" ) +
theme_bw() +
ggtitle("Length of Chinese Arrival to Paris by Month") +
theme(panel.grid.major = element_line(colour = NA), panel.grid.minor = element_line(colour = NA))
# Plot heatmaps
plot_grid(p_hist_tt_y_los_melt_heatmap, p_hist_tt_m_los_heatmap, nrow = 2)
We also looked at the evolution of the length of stay from 2014 to 2018. While Paris has the advantage of being a city where incoming Chinese passengers spend several nights, which creates a major commercial opportunity for DFS, we also observe that volumes in the most popular 4 to 5 nights category declined visibly between 2014 and 2018, while the other categories remained relatively unchanged. However, as we will discuss in the next section, the underlying story here is slightly different.
Looking at the arrival data on a monthly basis, and analyzing each stay-length category, we see that the overall length of stay generally seems to increase in the summer months for the most popular categories (4 to 5 nights and 6 to 8 nights), while in the shorter-stay categories the seasonal changes are muted.
Therefore we believe the summer months will be of utmost importance to DFS, as it should try to leverage the increased number of stays by Chinese tourists. To capitalize on the opportunity, DFS should increase its marketing and advertising budget for the summer to establish multiple contact points with Chinese clients (starting with display ads at the airport) and run in-store promotions to trigger return visits (such as 5% discount coupons with each purchase, valid for 24-48 hours towards a second purchase) to increase store traffic. This also means DFS will potentially need to carry a larger inventory as well as a larger store sales staff during the peak summer months.
# Select required rows from data frame
hist_tt_y_los <- subset(pdset_14_18, select = c("LoS_1night", "LoS_2nights", "LoS_3nights", "LoS_4to5nights","LoS_6to8nights", "LoS_9to13nights", "LoS_14to21nights", "LoS_22nightsormore", "Year"))
names(hist_tt_y_los) <- c("Los 1 Night", "Los 2 Nights", "LoS 3 Nights", "Los 4 to 5 Nights", "Los 6 to 8 Nights", "LoS 9 to 13 Nights", "Los 14 to 21 Nights", "Los 21 Nights or More", "Year")
hist_tt_y_los <- aggregate(hist_tt_y_los, by = list(hist_tt_y_los$Year), FUN = sum)
drops <- c("Year")
# Melt into desired format
hist_tt_y_los <- hist_tt_y_los[ , !names(hist_tt_y_los) %in% drops]
names(hist_tt_y_los) <- c("Year", "Los 1 Night", "Los 2 Nights", "LoS 3 Nights", "Los 4 to 5 Nights", "Los 6 to 8 Nights", "LoS 9 to 13 Nights", "Los 14 to 21 Nights", "Los 21 Nights or More")
hist_tt_y_los_melt <- melt(hist_tt_y_los, id="Year")
# Transpose the data set
hist_tt_y_los_transpose <- data.frame(t(hist_tt_y_los))
# Make first row the column names and delete first row
names(hist_tt_y_los_transpose) <- as.matrix(hist_tt_y_los_transpose[1, ])
hist_tt_y_los_transpose <- hist_tt_y_los_transpose[-1, ]
hist_tt_y_los_transpose[] <- lapply(hist_tt_y_los_transpose, function(x) type.convert(as.character(x)))
# Create new target column
hist_tt_y_los_transpose <- as.data.frame(hist_tt_y_los_transpose)
hist_tt_y_los_transpose$LoS <- rownames(hist_tt_y_los_transpose)
# Create first ID row
row.names(hist_tt_y_los_transpose) <- NULL
# Add delta column
hist_tt_y_los_transpose$delta <- hist_tt_y_los_transpose$`2018` - hist_tt_y_los_transpose$`2014`
hist_tt_y_los_transpose$delta2 <- hist_tt_y_los_transpose$`2018` - hist_tt_y_los_transpose$`2016`
# Add levels to maintain order
hist_tt_y_los_transpose$LoS <- factor(hist_tt_y_los_transpose$LoS, levels = hist_tt_y_los_transpose$LoS)
# Create Delta chart
p_hist_tt_y_los_delta <- ggplot(data = hist_tt_y_los_transpose, aes(x = LoS, y = delta, label = delta)) +
geom_point(stat="identity", fill = "black", size = 12) +
theme_light() +
geom_text(color = "white", size = 4) +
geom_segment(aes(y = 0, x = `LoS`, yend = delta, xend = `LoS`), color = "black") +
labs(x = "Passenger Length of Stay", y = "Delta Between 2018 and 2014") +
ggtitle("Delta in Chinese Arrivals to Paris by Length of Stay 2014 to 2018") +
scale_y_continuous(labels = comma)
p_hist_tt_y_los_delta2 <- ggplot(data = hist_tt_y_los_transpose, aes(x = LoS, y = delta2, label = delta2)) +
geom_point(stat="identity", fill = "black", size = 12) +
theme_light() +
geom_text(color = "white", size = 4) +
geom_segment(aes(y = 0, x = `LoS`, yend = delta2, xend = `LoS`), color = "black") +
labs(x = "Passenger Length of Stay", y = "Delta Between 2018 and 2016") +
ggtitle("Delta in Chinese Arrivals to Paris by Length of Stay 2016 to 2018") +
scale_y_continuous(labels = comma)
plot_grid(p_hist_tt_y_los_delta + coord_flip(), p_hist_tt_y_los_delta2 + coord_flip(), nrow = 1)
Length of Stay Conclusions
As discussed earlier, while a large majority of Chinese passengers traveling to Paris stay 4 nights or longer, we also note that over the past 4 years (2014 to 2018) there has been a visible decline (about 40,000 people) in the bracket of tourists who stay 4 to 5 nights, alongside an increase in the 6 to 8 nights and 9 to 13 nights categories, while the change in other categories was relatively muted. The delta analysis clearly shows these trends.
To understand the drivers of the decline, we did some Google searches on Parisian tourism statistics, and the results immediately pointed to the devastating terrorist attacks in the city in 2015 and their negative impact on tourism flows. To look at the trend, we split our dataset, took out 2015 (the year of the attacks) and then looked at the trends in length of stay from 2016 to 2018 (the recovery period); the data clearly show a recovery. While the negative delta for 4 to 5 nights is about 40,000 between 2014 and 2018 (which includes the events of 2015), the delta from 2016 to 2018 is positive at about 4,100 bookings. Similarly, 6 to 8 nights stays increased by almost 10,000 in the same period and 9 to 13 nights by about 5,000.
Therefore, isolating the aftermath of the 2015 attacks, we don’t necessarily see the average length of stays in Paris declining visibly over the next few years.
The ForwardKeys dataset allows us to assess arrivals by number of passengers per booking.
Paris seems to be a very popular tourism destination for one-person bookings from China. While a good portion of these can be singles traveling, we note that (as we will see in the next sections) Paris is an important business travel destination for the Chinese as well. One-person bookings represent around 30% of total passengers for the 2014-2018 period, followed by two-person bookings (which we believe are largely couples, representing 21% of bookings); the third-largest group is 10+ people (20%), who are group travelers, another important source of tourists for Paris.
This is important demographic information for DFS. The company should potentially do some market research on the type of products bought by single leisure and business travelers, as well as couples, in addition to research on group purchases, as these three categories represent over 70% of the Chinese travelers flying into Paris.
Given the important role of client demographics for DFS, we wanted to look at the passengers-per-booking data in a bit more depth. We first looked at the evolution of the categories on an annual basis between 2014 and 2018. As the heatmap shows, while most categories seem relatively stable, it is very easy to see a sharp decline in the 10+ passengers per booking category. While this group currently represents about 20%, it had a visibly larger share of total tourists in 2014; the category started a sharp decline in 2015 and has not really recovered since.
We also looked at the passengers-per-booking data on a monthly basis. While most categories are relatively stable (the 2-, 3-, 4- and 5-passenger categories), seasonality is much more pronounced for the most popular categories: 1-passenger, 2-passenger and 10+ passenger bookings. For single-person and two-person bookings the months of June, July and September are the most popular, while July and August are the most popular months for 10+ person bookings.
This is crucial data for DFS, as it can adjust its inventory and product mix by taking into account the differing preferences of each client category. For example, in August DFS can prioritize the most popular items for group shoppers in its store, while in June and July it can try to cater more to single and couple clients.
The delta analysis shows the acute decline in group bookings more clearly. Between 2014 and 2018 the total decline in this group (10+ persons per booking) was about 5,600, while the decline between 2016 and 2018 was even stronger at about 7,500. Clearly, this category continues to decline while single and two-person bookings continue to gain popularity. Single bookings increased by 4,850 between 2014 and 2018, but by 8,700 between 2016 and 2018, showing that this category is really gaining momentum. Similarly, two-person bookings are up by about 7,100 in the same period and three-passenger bookings are up by 4,200.
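The deltas quoted above follow the same approach as the length-of-stay delta analysis; a minimal sketch of one way to reproduce them (ppb_by_year and the delta columns are our own names):
# Yearly totals per passengers-per-booking bracket, then 2014-2018 and 2016-2018 deltas (sketch)
ppb_cols <- c("ppb_1pax", "ppb_2pax", "ppb_3pax", "ppb_4pax", "ppb_5pax", "ppb_6to9pax", "ppb_10pl_pax")
ppb_by_year <- aggregate(pdset_14_18[, ppb_cols], by = list(Year = pdset_14_18$Year), FUN = sum)
d14 <- unlist(ppb_by_year[ppb_by_year$Year == 2014, ppb_cols])
d16 <- unlist(ppb_by_year[ppb_by_year$Year == 2016, ppb_cols])
d18 <- unlist(ppb_by_year[ppb_by_year$Year == 2018, ppb_cols])
data.frame(ppb = ppb_cols, delta_14_18 = d18 - d14, delta_16_18 = d18 - d16)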
Passengers per Booking Conclusions
Looking at the data, we believe the popularity of Paris as a destination for single and two-person travelers is here to stay for the next few years, while 10+ person bookings continue to decline.
The ForwardKeys dataset allows us to assess arrivals by booking lead time.
The data show that about 18% of travelers purchase their tickets between 120 and 364 days in advance, while the largest group consists of relatively last-minute travelers who booked their tickets 15 to 29 days in advance. Adding the 5 to 14 days category to this group, the total number of late travel planners reaches about 35% of all Chinese travelers to Paris.
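The roughly 35% late-planner figure can be verified with a quick share calculation on the lead-time columns (a minimal sketch; late_cols is our own grouping of the 5 to 14 and 15 to 29 days brackets):
# Share of late planners among all lead-time brackets, 2014 - 2018 (sketch)
late_cols <- c("LeadT_5to14days", "LeadT_15to29days")
all_leadt <- grep("^LeadT_", names(pdset_14_18), value = TRUE)
round(100 * sum(colSums(pdset_14_18[, late_cols], na.rm = TRUE)) / sum(colSums(pdset_14_18[, all_leadt], na.rm = TRUE)), 1)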
Looking at the evolution of booking lead time over the past 4 years, we can see that early planners (120 to 364 days) declined visibly, while there were relatively limited movements in the other categories. Looking at the monthly data, we can see some seasonality in relatively late bookings, with some pickup during June, July and September. Similarly, July and August seem to be the most popular months for early planners (120 to 364 days) as well.
Delta analysis further confirms the decline in early planners (120 to 364 days), while the trends for other categories in recent years (2016 - 2018) seem to be visibly positive. One other category that stands out is 0 to 4 days, where a decline of around 1,400 was registered in 2016 - 2018; however, this is a very small portion of the total passenger data.
Booking Lead Time Conclusions
Booking lead time shows that about 35% of the population purchases their tickets relatively late (within one month of travel). While the remaining categories are relatively balanced, the second-largest group is the early purchasers, as discussed earlier.
The ForwardKeys dataset allows us to assess arrivals by cabin type.
Economy class passengers represented about 85% of the total, while business class passengers account for about 10% of Chinese passengers traveling to Paris. This is broadly in line with many international destinations.
The annual statistics show that the cabin preferences of Chinese passengers to Paris are (as expected) quite stable; we don’t see any emerging trends there. We also looked at the data on a monthly basis: there seems to be some increase in economy class activity in the summer months, particularly July, which we believe is in line with the increased level of arrivals during the summer.
Delta analysis shows a relatively balanced level of increase among different cabin types between 2016 - 2018.
Cabin Type Conclusions
We do not see any particularly valuable insights in the cabin data. As with most destinations, about 85% of tickets are economy, and there is very limited (if any) movement across years and limited monthly variation.
The ForwardKeys dataset allows us to assess arrivals by distribution channel.
Bookings via retail represent close to 70% of the bookings, while corporate bookings and online bookings each represent about 13% of the total bookings.
Heatmaps (annual evolution and monthly variation) provide very limited additional insight. Retail bookings increase during the summer months in line with overall tourism demand, reaching their peak in July.
Delta analysis shows that retail bookings have increased visibly at the expense of other categories over the past few years. It is particularly interesting to see the decline in online bookings. Based on our business understanding, we know that many Chinese tourists face a considerable language barrier and thus require booking through a retail agency to facilitate travel arrangements and on-the-ground guidance.
Distribution Type Conclusions
Retail bookings are a key source of bookings and are gaining further popularity.
The ForwardKeys dataset allows us to assess arrivals by return or multicity ticket type.
63% of the passengers traveling into Paris are multicity travelers, going to another destination after Paris. This is not surprising as Paris is an aviation hub for Continental Europe. Return flights represent the remaining 37%.
The data show that after 2015 there was some decline in multicity flights, while the figures have been fairly stable over the past few years. The summer months of July and August show an increased level of multicity flights, which we believe is in line with the general increase in touristic flights.
Delta analysis shows that between 2016 and 2018 multicity bookings were quite stable, while there was a continued increase in return flights.
Multicity Ticket Type Conclusions
Looking at 2016 - 2018, there was an increase of about 8,500 in return flights, which is not a very large number. Therefore we don’t see particular trends in this data.
The ForwardKeys dataset allows us to assess arrivals by passenger profile.
Paris is one of the most popular tourism destinations in Europe, and 39% of the passenger travel is leisure related. As an important hub hosting several pan-European corporate headquarters, around 36% of flights to the city are business flights, while the remainder are group travelers.
The heatmap further confirms the decline in group travelers over the past few years (as discussed in earlier sections). Despite some year-over-year movements, the leisure and business travel trends remain stable. In 2018 we saw a visible increase in leisure travel. Not surprisingly, the monthly analysis shows an increased level of leisure flights in the popular summer months, while business travel does not show significant seasonality.
Delta analysis also confirms the decline in group travelers, while we continue to see a relatively higher increase in leisure travelers in the 2016 - 2018 period.
Passenger Profile Type Conclusions
Given both the leisure and business travelers are relatively balanced (39% vs 36% of total travelers), DFS will need to cater to both audiences in its Parisian shop.
The ForwardKeys dataset allows us to assess arrivals by type of stay (weekend, workweek or combined).
When we look at the type of days (weekend/weekday) Chinese passengers spend in Paris, not surprisingly we see that combined stays represent close to 90% of stays (as the average length of stay is above 5 days for the sample, most passengers, regardless of their arrival date, end up spending both weekend days and weekdays in the city).
The combined stay category increases in the summer months, peaking in July, as the average length of stay increases during the summer (meaning more people who arrive during the week also spend weekend days in the city, and vice versa).
The combined (weekday + weekend) stay has been increasing to a certain extent, in line with the increasing length of stay in the city.
Conclusion
Given that the average length of stay in the city is above 5 days, most Chinese tourists (about 90%) get to see the city both during the week and during the weekend, which means DFS should cater to the different dynamics of both weekend and weekday shopping. The weekday shopping experience for the non-resident tourist will be different from the weekend one.
Whilst the current focus is on Paris and predicting passenger flows into that location, given that DFS has a store in Venice, in the future we will look to understand passenger arrivals into that location and their relationship with sales at the Venice store.
However, at this stage we wanted to do some preliminary analysis to compare the two locations.
Whilst the volumes at an annual level appear quite different, perhaps most impacted by the terrorist attacks in Paris in November 2015, the profiles of the other charts appear quite similar.
While the Parisian passenger traffic has been recovering over the past few years from the 2015 terrorist attacks and is on a steady growth trajectory, Venice does not seem to have a growth momentum. After a contraction in 2016, the passenger volumes grew in 2017 only to decline again in 2018. Therefore Paris seems to have better business prospects.
Peak month for travel is July for both cities, with December receiving the lowest number of arrivals.
Saturday is the most common day of arrival, with Friday the least popular. This perhaps reflects the distance required to travel: passengers leaving China on a Friday following a week of work will arrive in Europe on the Saturday. Comparing the two cities, the most popular arrival days in Paris are clearly the weekend days (Saturday followed by Sunday), while for Venice, Saturday and Wednesday are the most popular arrival days.
Arrivals by week show a very similar pattern. There is a small peak around weeks 6-8 in both cases, before a lull and then a steady rise to the peak summer months, followed by a large drop-off from week 35. The Golden Week annual holiday at the beginning of October drives a large increase over that period, before a continued steady decline towards the winter period.
While the weekly distribution data clearly shows that both cities are subject to visible seasonality throughout the year, the seasonality in Venice seems to be somewhat less pronounced.
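A minimal sketch of how the two weekly profiles can be placed side by side, re-using the bar style and the 2014 - 2018 subsets defined above (the chart object names are our own):
# Weekly arrivals, Paris vs Venice, 2014 - 2018 (sketch)
p_week_paris <- ggplot(data = pdset_14_18, aes(x = Week, y = Total_trav)) +
geom_bar(stat="identity", fill = barfill) +
theme_light() +
labs(x = "Week", y = "Total Inbound Passengers") +
scale_y_continuous(labels = comma) +
ggtitle("Paris: Chinese Arrivals by Week 2014 - 2018")
p_week_venice <- ggplot(data = vdset_14_18, aes(x = Week, y = Total_trav)) +
geom_bar(stat="identity", fill = barfill) +
theme_light() +
labs(x = "Week", y = "Total Inbound Passengers") +
scale_y_continuous(labels = comma) +
ggtitle("Venice: Chinese Arrivals by Week 2014 - 2018")
plot_grid(p_week_paris, p_week_venice, nrow = 2)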
Comparing passengers per booking data for both cities, the large groups of 10+ passengers per booking have declined across both Paris and Venice. For Paris, this was accompanied by a visible increase in the 1 and 2 passengers per booking category, while this trend was visibly less pronounced for Venice.
Arrivals by Cabin type appear quite different, likely as a result of the attacks in Paris in November 2015. Economy passengers appear to have been most impacted by the events in Paris but are showing signs of recovery from 2016. However, post 2016, Venice has seen a slight decline in economy arrivals, though an increase across the overall 5 year period.
Recent trends from 2016 would indicate a shift towards return tickets, aligning with the notion of less large group travelers who would arrive on multicity tickets.
Delta analysis shows that the group profile passengers are on the decline for both cities. While this is clearly accompanied by a rise in the leisure and business travelers for Paris, the effect looks less pronounced for Venice.
Conclusion
Both cities receive similar levels of leisure and business travelers, which together constitute over 70% of the passenger statistics for both cities. DFS potentially already has a marketing and inventory management strategy to cater to these two distinct groups at its Venice store, and given the similarity of the two cities, it can potentially leverage aspects of its existing Venice strategy in Paris with a few tweaks.
The chart above compares the passenger flows to Paris versus Venice.
We analyzed the correlation between the total travelers during the whole period to both cities.
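The output below comes from a Pearson correlation test on the two aligned daily series:
# Correlation between daily Chinese arrivals to Paris and Venice
cor.test(pdset$Total_trav, vdset$Total_trav)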
##
## Pearson's product-moment correlation
##
## data: pdset$Total_trav and vdset$Total_trav
## t = 24.05, df = 2065, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4333990 0.5007876
## sample estimates:
## cor
## 0.4677729
General Conclusion From Our Exploratory Analysis
We can talk about many factors here, but five key insights for us are as follows:
While Chinese traveler volumes to Paris still remain well below the figures achieved in 2015, the market continues to recover at a healthy pace. Therefore there is a growth opportunity in Paris, while Venice does not seem to offer the same growth.
A very important quality of the Chinese arriving in Paris is the number of nights they stay, which creates an important opportunity:
Short-term stay categories (1, 2 and 3 nights) together represent only 16% of total bookings, while 4 to 5 nights is the most popular category with 38% of bookings; collectively, stays above 3 nights represent 84% of total bookings. Given the average length of stay of Chinese passengers, DFS can have multiple touch points with potential clients (from airport display ads to the boutique in the city) throughout their stay. Also, the same client could potentially drop by the same boutique more than once (unlike Chinese tourists on bus tours who spend a day or less in each city they visit). However, this also creates a particular challenge: arguably, the longer length of stay gives the potential client more time to compare prices and products across different outlets when making a purchase decision, meaning that, for customers with the same price elasticity, DFS prices should probably not be too far from market prices for comparable products.
Summer months are key for sales: Chinese passenger arrivals to the city display pronounced seasonality. Arrival volumes between June and September are markedly higher than in the remaining months of the year, which means DFS should have an aggressive marketing and promotion strategy during the summer months to capitalize on the significant increase in volumes. It also means it will have to carry a substantially higher level of inventory and employ a higher number of part-time sales staff during the summer months. While over the past few years the seasonality seems to have become slightly less pronounced, the dominance of the summer months is potentially here to stay for the foreseeable future. In addition, looking at the arrival data on a monthly basis and analyzing each stay-length category, we see that the overall length of stay generally increases in the summer months for the most popular categories (4 to 5 nights and 6 to 8 nights), while in the shorter-stay categories the seasonal changes are muted. Therefore we believe the summer months will be of utmost importance to DFS, as the company should try to leverage the increased number of stays by Chinese tourists. To capitalize on the opportunity, DFS should increase its marketing and advertising budget for the summer to establish multiple contact points with Chinese clients (starting with display ads at the airport) and run in-store promotions to trigger return visits (such as 5% discount coupons with each purchase, valid for 24-48 hours towards a second purchase) to increase store traffic.
Paris seems to be a very popular tourism destination for one-person bookings from China. While a good portion of these can be singles traveling, we note that (as discussed in earlier sections) Paris is an important business travel destination for the Chinese as well. One-person bookings represent around 30% of total passengers for the 2014-2018 period, followed by two-person bookings (which we believe are largely couples, representing 21% of bookings); the third-largest group is 10+ people (20%), who are group travelers, another important source of tourists for Paris. In addition, our delta analysis shows that one-person and two-person bookings are growing visibly faster than other categories. Therefore the dominance of these two categories seems likely to persist for some time.
This is important demographic information for DFS. The company should potentially do some market research on the type of products bought by single leisure and business travelers, as well as couples (two-person bookings), in addition to research on group purchases, as these three categories represent over 70% of the Chinese travelers flying into Paris.
The goal of this exercise is the following:
In order to build a predictive model, we first need to prepare the dataset by applying a 30-day lag to all the numeric predictors (all variables except the target variable, the date and the date-related fields).
#Select only those columns that we need to lag
pdsetlag30 <- dplyr::select(pdset, -c('Date', 'Total_trav', 'WeekDay', 'Month', 'Week', 'Year'))
#Lag the columns 30 days using mapply and convert to DF
pdsetlag30 <-data.frame(mapply(FUN = dplyr::lag, n = 30, pdsetlag30, SIMPLIFY = TRUE, USE.NAMES = TRUE))
#Rebuild dataframe with all contents and lagged variables
pdsetlag30 <- cbind(dplyr::select(pdset, c('Date', 'Total_trav')), pdsetlag30, dplyr::select(pdset, c('WeekDay', 'Month', 'Week', 'Year')))
#Rename columns
colnames(pdsetlag30) <- colnames(pdset)
#Add lag30 days of total travelers
pdsetlag30 <- dplyr::mutate(pdsetlag30,
lag30_totaltravelers=lag(pdset$Total_trav,n = 30, default = NA))
We have identified a number of predictors that can potentially help in our prediction of the total travelers.
In order to make a determination on what predictors to keep, we first calculated the correlations of all the potential predictors with total travelers.
model.cor <- as.data.frame(t(cor(as.matrix(pdsetlag30 %>% dplyr::filter(Date > '2014-01-31') %>% dplyr::select('Total_trav')),as.matrix(pdsetlag30 %>% dplyr::filter(Date > '2014-01-31') %>% dplyr::select(-c('Date', 'WeekDay', 'Month', 'Week', 'Year'))))))
kable(model.cor[order(-model.cor$Total_trav), , drop = FALSE])
 | Total_trav |
---|---|
Total_trav | 1.0000000 |
lag30_totaltravelers | 1.0000000 |
TrDur_22nightsormore | 0.3840050 |
LoS_22nightsormore | 0.3526264 |
Sign_InitialBookings | 0.3272416 |
Sign_Partialadditions_modif | 0.3118424 |
TypeSt_Combinedstay | 0.3086107 |
OneRet_MultiCity | 0.3060757 |
Cabin_Economy | 0.3023342 |
TrDur_14to21nights | 0.3013660 |
LoS_14to21nights | 0.2920704 |
TrDur_9to13nights | 0.2733796 |
DistCh_RetailTA | 0.2702961 |
DistCh_OnlineTA | 0.2698661 |
LoS_4to5nights | 0.2521981 |
DistCh_CorporateTA | 0.2521667 |
Profile_Leisure | 0.2495882 |
Profile_Business | 0.2480910 |
Profile_Group | 0.2477527 |
ppb_2pax | 0.2472777 |
DistCh_OtherTA | 0.2447026 |
ppb_1pax | 0.2431852 |
Cabin_Business | 0.2414299 |
LeadT_30to44days | 0.2397771 |
LeadT_45to59days | 0.2389684 |
ppb_10pl_pax | 0.2317211 |
LeadT_60to89days | 0.2295702 |
LoS_9to13nights | 0.2231875 |
ppb_3pax | 0.2230266 |
ppb_4pax | 0.2229783 |
LeadT_120to364days | 0.2119751 |
LoS_2nights | 0.1962357 |
Avg.LT | 0.1852328 |
LoS_6to8nights | 0.1848752 |
ppb_6to9pax | 0.1817943 |
LoS_1night | 0.1791345 |
Avg.LOT | 0.1742244 |
Avg.PPB | 0.1731577 |
TrDur_6to8nights | 0.1680194 |
OneRet_Return | 0.1672108 |
LeadT_90to119days | 0.1577172 |
LeadT_15to29days | 0.1277789 |
LoS_3nights | 0.1196493 |
Avg.LOS | 0.1188098 |
TypeSt_Weekendstay | 0.1185411 |
ppb_5pax | 0.1150126 |
LeadT_5to14days | 0.1080870 |
Cabin_First | 0.0729665 |
TrDur_1night | 0.0465214 |
TrDur_4to5nights | 0.0357405 |
Cabin_EconomyPremium | 0.0287407 |
TrDur_3nights | 0.0189710 |
TrDur_2nights | 0.0152281 |
TypeSt_Workweekstay | -0.0026945 |
LeadT_0to4days | -0.0845452 |
Sign_Fulltripcancellations | -0.2908092 |
Sign_Partialcancellations | -0.3291667 |
Next, we used a Random Forest model to estimate variable importance and set a threshold below which a variable is discarded, i.e. based on the percentage increase in error (%IncMSE) that would result from excluding that variable. We plotted the importance of the different predictors in the fitted model and also show the %IncMSE contributed by each predictor in the table below.
#Perform RF skipping NAs from Jan 14
set.seed(1234)
rf <- randomForest(formula = Total_trav ~ ., data = dplyr::filter(pdsetlag30, Date > '2014-01-31'), ntree = 1000, importance = TRUE)
#Get table sorted of importance of predictors
imp <- varImp(rf)
kable(imp[order(-imp$Overall), , drop = FALSE])
Variable | Overall |
---|---|
lag30_totaltravelers | 135.280323 |
Week | 26.266690 |
Month | 19.638352 |
Date | 18.597290 |
Sign_Fulltripcancellations | 11.253787 |
OneRet_MultiCity | 10.379293 |
TypeSt_Workweekstay | 10.101510 |
Sign_Partialcancellations | 9.684716 |
TrDur_22nightsormore | 9.475380 |
Year | 9.352064 |
DistCh_OnlineTA | 9.252995 |
TrDur_3nights | 8.138174 |
TrDur_14to21nights | 7.645574 |
Sign_Partialadditions_modif | 7.617813 |
LoS_3nights | 7.154588 |
Avg.PPB | 7.146600 |
Sign_InitialBookings | 7.007508 |
LoS_6to8nights | 6.996177 |
LeadT_45to59days | 6.924812 |
TypeSt_Combinedstay | 6.887909 |
Cabin_EconomyPremium | 6.871663 |
TrDur_9to13nights | 6.868522 |
LeadT_60to89days | 6.864499 |
Cabin_Business | 6.639644 |
WeekDay | 6.493526 |
DistCh_CorporateTA | 6.444416 |
Avg.LOS | 6.354290 |
Profile_Business | 6.305578 |
DistCh_OtherTA | 6.246042 |
OneRet_Return | 6.188918 |
ppb_4pax | 6.059882 |
ppb_1pax | 6.054468 |
Cabin_Economy | 6.021336 |
Profile_Group | 5.960073 |
DistCh_RetailTA | 5.901489 |
ppb_3pax | 5.763776 |
Avg.LT | 5.757369 |
LoS_14to21nights | 5.681071 |
Avg.LOT | 5.652449 |
TrDur_6to8nights | 5.600380 |
LoS_4to5nights | 5.579301 |
ppb_10pl_pax | 5.564779 |
LoS_22nightsormore | 5.492581 |
LoS_9to13nights | 5.442431 |
LeadT_120to364days | 5.104104 |
LoS_2nights | 5.043263 |
LeadT_30to44days | 4.848114 |
TrDur_4to5nights | 4.706765 |
ppb_2pax | 4.457614 |
LoS_1night | 4.332915 |
LeadT_5to14days | 4.218485 |
Cabin_First | 3.848661 |
ppb_6to9pax | 3.837902 |
Profile_Leisure | 3.729659 |
LeadT_0to4days | 3.498446 |
LeadT_15to29days | 3.398809 |
LeadT_90to119days | 2.722089 |
TrDur_2nights | 2.579971 |
TypeSt_Weekendstay | 2.528184 |
TrDur_1night | 2.214097 |
ppb_5pax | 1.832548 |
#Plot importance
varImpPlot(rf, type = 1)
Based on this, we created two data frames: one with all predictors whose importance exceeds 10% and another with those above 15%.
#Check variables with imp over 10 and create df
rownames(imp)[which(imp>10)]
## [1] "Date" "TypeSt_Workweekstay"
## [3] "OneRet_MultiCity" "Sign_Fulltripcancellations"
## [5] "Month" "Week"
## [7] "lag30_totaltravelers"
pdsetlag30_s10 <- pdsetlag30 %>% dplyr::select('Date', 'Total_trav', 'LoS_2nights', 'LoS_3nights', 'LoS_4to5nights', 'LoS_6to8nights',
'LoS_9to13nights', 'LoS_14to21nights', 'LoS_22nightsormore', 'TrDur_3nights', 'TrDur_4to5nights', 'TrDur_6to8nights', 'TrDur_9to13nights', 'TrDur_14to21nights', 'TrDur_22nightsormore', 'ppb_1pax', 'ppb_2pax', 'ppb_3pax', 'ppb_4pax', 'ppb_10pl_pax', 'LeadT_5to14days', 'LeadT_30to44days', 'LeadT_45to59days', 'LeadT_60to89days', 'LeadT_90to119days', 'Cabin_EconomyPremium', 'Cabin_Business', 'DistCh_OnlineTA', 'DistCh_CorporateTA', 'DistCh_RetailTA', 'DistCh_OtherTA', 'TypeSt_Workweekstay', 'OneRet_Return', 'OneRet_MultiCity', 'Profile_Business', 'Profile_Leisure', 'Profile_Group', 'Sign_InitialBookings', 'Sign_Partialadditions_modif', 'Sign_Partialcancellations', 'Sign_Fulltripcancellations', 'Avg.LOS', 'Avg.LOT', 'Avg.LT', 'Avg.PPB', 'WeekDay', 'Month', 'Week', 'Year') %>% dplyr::filter(Date >'2014-01-31')
#Check variables with imp over 15 and create df
rownames(imp)[which(imp>15)]
## [1] "Date" "Month" "Week"
## [4] "lag30_totaltravelers"
pdsetlag30_s15 <- pdsetlag30 %>% dplyr::select('Date', 'Total_trav', 'TrDur_3nights', 'TrDur_22nightsormore', 'TypeSt_Workweekstay', 'OneRet_MultiCity', 'Profile_Business', 'Profile_Group', 'Sign_Partialadditions_modif', 'Sign_Partialcancellations', 'Sign_Fulltripcancellations', 'Avg.LOS', 'Avg.PPB', 'WeekDay', 'Month', 'Week', 'Year') %>% dplyr::filter(Date >'2014-01-31')
We ran a number of models with the following set-up:
+ Models: SVM (standard and sliding window), lm, rpart, rpartXse, bagging, nnet and randomForest.
+ Evaluation metric: NMSE, as it is normalized and easy to interpret (see the helper sketch below).
+ Estimation method: Monte Carlo with 5 repetitions, to obtain some certainty on the performance of each model.
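NMSE compares the model's squared errors against a naive baseline that always predicts the mean, so values below 1 (or below 100 when expressed as a percentage, as later in this report) mean the model beats that baseline. The helper below is our own minimal sketch of the formula used throughout this report; the function name nmse is hypothetical.
#Sketch: normalized mean squared error relative to a mean-only baseline
nmse <- function(actual, predicted) {
  sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}
#Example: nmse(c(10, 12, 9, 14), c(11, 11, 10, 13)) returns roughly 0.27, well below 1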
Tform_Total_trav <- as.formula('Total_trav ~.')
exp <- performanceEstimation(
PredTask(Tform_Total_trav, pdsetlag30_s15, 'Total_trav_30d'),
c(Workflow('standardWF', wfID="standSVM",
learner='svm',learner.pars=list(cost=10,gamma=0.01)),
Workflow('timeseriesWF', wfID="slideSVM",
type="slide", relearn.step=90,
learner='svm',learner.pars=list(cost=10,gamma=0.01)),
Workflow(learner="randomForest",learner.pars=list(ntree=1000)),
Workflow(learner="rpart",.fullOutput=TRUE),
Workflow(learner= 'lm'),
Workflow(learner="bagging",learner.pars=list(nbagg=500)),
Workflow(learner="rpartXse"),
Workflow(learner='nnet', learner.pars=list(linout=TRUE, trace=FALSE, maxit=1000, size=15, decay=0.01))
),
EstimationTask(metrics="nmse",
method=MonteCarlo(nReps=5,szTrain=0.5,szTest=0.25)))
##
##
## ##### PERFORMANCE ESTIMATION USING MONTE CARLO #####
##
## ** PREDICTIVE TASK :: Total_trav_30d
##
## ++ MODEL/WORKFLOW :: standSVM
## Task for estimating nmse using
## 5 repetitions Monte Carlo Simulation using:
## seed = 1234
## train size = 0.5 x NROW(DataSet)
## test size = 0.25 x NROW(DataSet)
## Repetition 1
## start test = 1077 ; test size = 509
## Repetition 2
## start test = 1329 ; test size = 509
## Repetition 3
## start test = 1335 ; test size = 509
## Repetition 4
## start test = 1336 ; test size = 509
## Repetition 5
## start test = 1455 ; test size = 509
##
##
##
## ++ MODEL/WORKFLOW :: slideSVM
## Task for estimating nmse using
## 5 repetitions Monte Carlo Simulation using:
## seed = 1234
## train size = 0.5 x NROW(DataSet)
## test size = 0.25 x NROW(DataSet)
## Repetition 1
## start test = 1077 ; test size = 509
## Repetition 2
## start test = 1329 ; test size = 509
## Repetition 3
## start test = 1335 ; test size = 509
## Repetition 4
## start test = 1336 ; test size = 509
## Repetition 5
## start test = 1455 ; test size = 509
##
##
##
## ++ MODEL/WORKFLOW :: randomForest
## Task for estimating nmse using
## 5 repetitions Monte Carlo Simulation using:
## seed = 1234
## train size = 0.5 x NROW(DataSet)
## test size = 0.25 x NROW(DataSet)
## Repetition 1
## start test = 1077 ; test size = 509
## Repetition 2
## start test = 1329 ; test size = 509
## Repetition 3
## start test = 1335 ; test size = 509
## Repetition 4
## start test = 1336 ; test size = 509
## Repetition 5
## start test = 1455 ; test size = 509
##
##
##
## ++ MODEL/WORKFLOW :: rpart
## Task for estimating nmse using
## 5 repetitions Monte Carlo Simulation using:
## seed = 1234
## train size = 0.5 x NROW(DataSet)
## test size = 0.25 x NROW(DataSet)
## Repetition 1
## start test = 1077 ; test size = 509
## Repetition 2
## start test = 1329 ; test size = 509
## Repetition 3
## start test = 1335 ; test size = 509
## Repetition 4
## start test = 1336 ; test size = 509
## Repetition 5
## start test = 1455 ; test size = 509
##
##
##
## ++ MODEL/WORKFLOW :: lm
## Task for estimating nmse using
## 5 repetitions Monte Carlo Simulation using:
## seed = 1234
## train size = 0.5 x NROW(DataSet)
## test size = 0.25 x NROW(DataSet)
## Repetition 1
## start test = 1077 ; test size = 509
## Repetition 2
## start test = 1329 ; test size = 509
## Repetition 3
## start test = 1335 ; test size = 509
## Repetition 4
## start test = 1336 ; test size = 509
## Repetition 5
## start test = 1455 ; test size = 509
##
##
##
## ++ MODEL/WORKFLOW :: bagging
## Task for estimating nmse using
## 5 repetitions Monte Carlo Simulation using:
## seed = 1234
## train size = 0.5 x NROW(DataSet)
## test size = 0.25 x NROW(DataSet)
## Repetition 1
## start test = 1077 ; test size = 509
## Repetition 2
## start test = 1329 ; test size = 509
## Repetition 3
## start test = 1335 ; test size = 509
## Repetition 4
## start test = 1336 ; test size = 509
## Repetition 5
## start test = 1455 ; test size = 509
##
##
##
## ++ MODEL/WORKFLOW :: rpartXse
## Task for estimating nmse using
## 5 repetitions Monte Carlo Simulation using:
## seed = 1234
## train size = 0.5 x NROW(DataSet)
## test size = 0.25 x NROW(DataSet)
## Repetition 1
## start test = 1077 ; test size = 509
## Repetition 2
## start test = 1329 ; test size = 509
## Repetition 3
## start test = 1335 ; test size = 509
## Repetition 4
## start test = 1336 ; test size = 509
## Repetition 5
## start test = 1455 ; test size = 509
##
##
##
## ++ MODEL/WORKFLOW :: nnet
## Task for estimating nmse using
## 5 repetitions Monte Carlo Simulation using:
## seed = 1234
## train size = 0.5 x NROW(DataSet)
## test size = 0.25 x NROW(DataSet)
## Repetition 1
## start test = 1077 ; test size = 509
## Repetition 2
## start test = 1329 ; test size = 509
## Repetition 3
## start test = 1335 ; test size = 509
## Repetition 4
## start test = 1336 ; test size = 509
## Repetition 5
## start test = 1455 ; test size = 509
summary(exp)
##
## == Summary of a Monte Carlo Performance Estimation Experiment ==
##
## Task for estimating nmse using
## 5 repetitions Monte Carlo Simulation using:
## seed = 1234
## train size = 0.5 x NROW(DataSet)
## test size = 0.25 x NROW(DataSet)
##
## * Predictive Tasks :: Total_trav_30d
## * Workflows :: standSVM, slideSVM, randomForest, rpart, lm, bagging, rpartXse, nnet
##
## -> Task: Total_trav_30d
## *Workflow: standSVM
## nmse
## avg 0.86156225
## std 0.27330726
## med 0.72994660
## iqr 0.04659865
## min 0.72270560
## max 1.34897089
## invalid 0.00000000
##
## *Workflow: slideSVM
## nmse
## avg 0.542201437
## std 0.023170741
## med 0.536390663
## iqr 0.008913675
## min 0.524265703
## max 0.582411262
## invalid 0.000000000
##
## *Workflow: randomForest
## nmse
## avg 0.47327259
## std 0.01998234
## med 0.47161439
## iqr 0.03110022
## min 0.44991771
## max 0.49703349
## invalid 0.00000000
##
## *Workflow: rpart
## nmse
## avg 0.62033355
## std 0.05549068
## med 0.63659207
## iqr 0.03109266
## min 0.52417316
## max 0.65834659
## invalid 0.00000000
##
## *Workflow: lm
## nmse
## avg 1.4404186
## std 0.2268344
## med 1.5788348
## iqr 0.2865787
## min 1.1043898
## max 1.6207747
## invalid 0.0000000
##
## *Workflow: bagging
## nmse
## avg 0.48905493
## std 0.03604078
## med 0.47544499
## iqr 0.02432157
## min 0.46306398
## max 0.55067340
## invalid 0.00000000
##
## *Workflow: rpartXse
## nmse
## avg 0.60375738
## std 0.03218525
## med 0.59711170
## iqr 0.04759748
## min 0.56524924
## max 0.64196229
## invalid 0.00000000
##
## *Workflow: nnet
## nmse
## avg 9.999988e-01
## std 9.154329e-07
## med 9.999992e-01
## iqr 2.195862e-07
## min 9.999972e-01
## max 9.999993e-01
## invalid 0.000000e+00
plot(exp)
Next, we tried the same with the broader data frame (importance over 10 instead of 15), with the best performing models only. The goal is to see if increasing the number of predictors would materially improve the metrics.
exp2 <- performanceEstimation(
PredTask(Tform_Total_trav, pdsetlag30_s10, 'Total_trav_30d'),
c(Workflow(learner="randomForest",learner.pars=list(ntree=1000)),
Workflow(learner="rpart",.fullOutput=TRUE),
Workflow(learner="rpartXse"),
Workflow(learner='nnet', learner.pars=list(linout=TRUE, trace=FALSE, maxit=1000, size=15, decay=0.01))
),
EstimationTask(metrics="nmse",
method=MonteCarlo(nReps=5,szTrain=0.5,szTest=0.25)))
##
##
## ##### PERFORMANCE ESTIMATION USING MONTE CARLO #####
##
## ** PREDICTIVE TASK :: Total_trav_30d
##
## ++ MODEL/WORKFLOW :: randomForest
## Task for estimating nmse using
## 5 repetitions Monte Carlo Simulation using:
## seed = 1234
## train size = 0.5 x NROW(DataSet)
## test size = 0.25 x NROW(DataSet)
## Repetition 1
## start test = 1077 ; test size = 509
## Repetition 2
## start test = 1329 ; test size = 509
## Repetition 3
## start test = 1335 ; test size = 509
## Repetition 4
## start test = 1336 ; test size = 509
## Repetition 5
## start test = 1455 ; test size = 509
##
##
##
## ++ MODEL/WORKFLOW :: rpart
## Task for estimating nmse using
## 5 repetitions Monte Carlo Simulation using:
## seed = 1234
## train size = 0.5 x NROW(DataSet)
## test size = 0.25 x NROW(DataSet)
## Repetition 1
## start test = 1077 ; test size = 509
## Repetition 2
## start test = 1329 ; test size = 509
## Repetition 3
## start test = 1335 ; test size = 509
## Repetition 4
## start test = 1336 ; test size = 509
## Repetition 5
## start test = 1455 ; test size = 509
##
##
##
## ++ MODEL/WORKFLOW :: rpartXse
## Task for estimating nmse using
## 5 repetitions Monte Carlo Simulation using:
## seed = 1234
## train size = 0.5 x NROW(DataSet)
## test size = 0.25 x NROW(DataSet)
## Repetition 1
## start test = 1077 ; test size = 509
## Repetition 2
## start test = 1329 ; test size = 509
## Repetition 3
## start test = 1335 ; test size = 509
## Repetition 4
## start test = 1336 ; test size = 509
## Repetition 5
## start test = 1455 ; test size = 509
##
##
##
## ++ MODEL/WORKFLOW :: nnet
## Task for estimating nmse using
## 5 repetitions Monte Carlo Simulation using:
## seed = 1234
## train size = 0.5 x NROW(DataSet)
## test size = 0.25 x NROW(DataSet)
## Repetition 1
## start test = 1077 ; test size = 509
## Repetition 2
## start test = 1329 ; test size = 509
## Repetition 3
## start test = 1335 ; test size = 509
## Repetition 4
## start test = 1336 ; test size = 509
## Repetition 5
## start test = 1455 ; test size = 509
summary(exp2)
##
## == Summary of a Monte Carlo Performance Estimation Experiment ==
##
## Task for estimating nmse using
## 5 repetitions Monte Carlo Simulation using:
## seed = 1234
## train size = 0.5 x NROW(DataSet)
## test size = 0.25 x NROW(DataSet)
##
## * Predictive Tasks :: Total_trav_30d
## * Workflows :: randomForest, rpart, rpartXse, nnet
##
## -> Task: Total_trav_30d
## *Workflow: randomForest
## nmse
## avg 0.475506227
## std 0.015384097
## med 0.475996357
## iqr 0.006047883
## min 0.452595329
## max 0.495514575
## invalid 0.000000000
##
## *Workflow: rpart
## nmse
## avg 0.61612635
## std 0.04419247
## med 0.62573163
## iqr 0.08647399
## min 0.56937897
## max 0.65834659
## invalid 0.00000000
##
## *Workflow: rpartXse
## nmse
## avg 0.61216302
## std 0.04831333
## med 0.61982419
## iqr 0.08169229
## min 0.55068805
## max 0.65834659
## invalid 0.00000000
##
## *Workflow: nnet
## nmse
## avg 9.999989e-01
## std 5.312570e-07
## med 9.999992e-01
## iqr 2.326554e-07
## min 9.999980e-01
## max 9.999992e-01
## invalid 0.000000e+00
plot(exp2)
The metrics show no material improvement from using the broader predictor set (importance over 10). Therefore, we focused on the data frame with predictors above the 15% importance threshold. The Random Forest was also the best performing model in both scenarios.
Next, we tried more complex models on a standalone basis to check whether any of them could outperform the Random Forest. For this, we used the data through July 2019 as the training set and August 2019 as the test set.
RANDOM FOREST
#Build train and test datasets
Tdata.train_lag30_s15 <- pdsetlag30_s15 %>% dplyr::filter(Date < '2019-08-01')
Tdata.eval_lag30_s15 <- pdsetlag30_s15 %>% dplyr::filter(Date > '2019-07-31')
#Run model
RF_Total_trav_model <- randomForest(Tform_Total_trav, data = Tdata.train_lag30_s15[-1], ntree = 1000)
#prediction
RF_TT_prediction <- predict(RF_Total_trav_model, Tdata.eval_lag30_s15)
#Check
RF_TT_validate <- data.frame(Tdata.eval_lag30_s15$Total_trav, RF_TT_prediction)
#Evaluation
rmse.rf <- round(sqrt(mean((Tdata.eval_lag30_s15$Total_trav - RF_TT_prediction)^2)),2)
nmse.rf <- round(sum((Tdata.eval_lag30_s15$Total_trav - RF_TT_prediction)^2) / sum((Tdata.eval_lag30_s15$Total_trav-mean(Tdata.eval_lag30_s15$Total_trav))^2)*100,2)
rmse.rf
## [1] 164.85
nmse.rf
## [1] 47.7
NEURAL NETWORKS
#Normalize numeric data
pdsetlag30_s15_NN <- data.frame(pdsetlag30_s15[c('Date', 'WeekDay', 'Month')], lapply(pdsetlag30_s15[c('Total_trav', 'TrDur_3nights', 'TrDur_22nightsormore', 'TypeSt_Workweekstay', 'OneRet_MultiCity', 'Profile_Business', 'Profile_Group', 'Sign_Partialadditions_modif', 'Sign_Partialcancellations', 'Sign_Fulltripcancellations', 'Avg.LOS', 'Avg.PPB', 'Week', 'Year')],scale))
#Build train and test datasets
Tdata.train_lag30_s15_NN <- pdsetlag30_s15_NN %>% dplyr::filter(Date < '2019-08-01')
Tdata.eval_lag30_s15_NN <- pdsetlag30_s15_NN %>% dplyr::filter(Date > '2019-07-31')
#Unnormalize function
unnormalize <- function(x) {return((x*(sd(Tdata.eval_lag30_s15$Total_trav))+ base::mean(Tdata.eval_lag30_s15$Total_trav)))}
#Run model
NN_Total_trav_model <- nnet(Tform_Total_trav, data = Tdata.train_lag30_s15_NN[-1], size = 20, decay = 0.01, maxit = 10000, linout = TRUE, trace = FALSE, MaxNWts = 1000)
#prediction
NN_TT_prediction <- unnormalize(predict(NN_Total_trav_model, Tdata.eval_lag30_s15_NN[-1]))
#Check
NN_TT_validate <- data.frame(Tdata.eval_lag30_s15$Total_trav, NN_TT_prediction)
#Evaluation
rmse.nn <- round(sqrt(mean((Tdata.eval_lag30_s15$Total_trav - NN_TT_prediction)^2)),2)
nmse.nn <- round(sum((Tdata.eval_lag30_s15$Total_trav - NN_TT_prediction)^2) / sum((Tdata.eval_lag30_s15$Total_trav-mean(Tdata.eval_lag30_s15$Total_trav))^2)*100,2)
rmse.nn
## [1] 217.97
nmse.nn
## [1] 83.39
DEEP LEARNING
# trH <- as.h2o(Tdata.train_lag30_s15 %>% dplyr::mutate(nWeekday=wday(Date), nMonth=lubridate::month(Date)) %>% dplyr::select(-Date,-WeekDay, -Month), 'trH')
# tsH <- as.h2o(Tdata.eval_lag30_s15 %>% dplyr::mutate(nWeekday=wday(Date), nMonth=lubridate::month(Date)) %>% dplyr::select(-Date,-WeekDay, -Month), 'tsH')
# DL_Total_trav_model <- h2o.deeplearning(x = 2:16, y = 1, training_frame = trH, hidden = c(500, 500, 500, 500, 500), epochs = 1000)
# DL_TT_prediction <- as.vector(h2o.predict(DL_Total_trav_model, tsH))
# #Check
# DL_TT_validate <- data.frame(Tdata.eval_lag30_s15$Total_trav, DL_TT_prediction)
# #Evaluation
# rmse.dl <- round(sqrt(mean((Tdata.eval_lag30_s15$Total_trav - DL_TT_prediction)^2)),2)
# nmse.dl <- round(sum((Tdata.eval_lag30_s15$Total_trav - DL_TT_prediction)^2) / sum((Tdata.eval_lag30_s15$Total_trav-mean(Tdata.eval_lag30_s15$Total_trav))^2)*100,2)
BAGGING
#Run model
BG_Total_trav_model <- bagging(Tform_Total_trav, data = Tdata.train_lag30_s15[-1], mfinal = 1000)
#prediction
BG_TT_prediction <- predict(BG_Total_trav_model, Tdata.eval_lag30_s15[-1])
#Check
BG_TT_validate <- data.frame(Tdata.eval_lag30_s15$Total_trav, BG_TT_prediction)
#Evaluation
rmse.bg <- round(sqrt(mean((Tdata.eval_lag30_s15$Total_trav - BG_TT_prediction)^2)),2)
nmse.bg <- round(sum((Tdata.eval_lag30_s15$Total_trav - BG_TT_prediction)^2) / sum((Tdata.eval_lag30_s15$Total_trav-mean(Tdata.eval_lag30_s15$Total_trav))^2)*100,2)
rmse.bg
## [1] 182.62
nmse.bg
## [1] 58.54
XGBOOST
#Source: https://xgboost.readthedocs.io/en/latest/R-package/xgboostPresentation.html
#Run model
Tdata.train_lag30_s15_XGB <- Tdata.train_lag30_s15 %>% dplyr::mutate(nWeekday=wday(Date), nMonth=lubridate::month(Date)) %>% dplyr::select(-Date,-WeekDay, -Month)
Tdata.eval_lag30_s15_XGB <- Tdata.eval_lag30_s15 %>% dplyr::mutate(nWeekday=wday(Date), nMonth=lubridate::month(Date)) %>% dplyr::select(-Date,-WeekDay, -Month)
XGB_Total_trav_model <- xgboost(data = as.matrix(Tdata.train_lag30_s15_XGB[-1]), label = Tdata.train_lag30_s15_XGB$Total_trav, nrounds = 100, max_depth = 100, objective = 'reg:squarederror', eta = 0.2, verbose = 2, booster = 'gbtree')
## [1] train-rmse:576.872009
## [2] train-rmse:472.082550
## [3] train-rmse:388.590515
## [4] train-rmse:321.142120
## [5] train-rmse:266.992126
## [6] train-rmse:222.780151
## [7] train-rmse:186.722610
## [8] train-rmse:156.763199
## [9] train-rmse:132.223389
## [10] train-rmse:111.406235
## [11] train-rmse:94.252525
## [12] train-rmse:79.960075
## [13] train-rmse:67.991562
## [14] train-rmse:58.035343
## [15] train-rmse:49.627621
## [16] train-rmse:42.548389
## [17] train-rmse:36.574173
## [18] train-rmse:31.528339
## [19] train-rmse:27.167871
## [20] train-rmse:23.492336
## [21] train-rmse:20.321342
## [22] train-rmse:17.539219
## [23] train-rmse:15.172000
## [24] train-rmse:13.165424
## [25] train-rmse:11.457286
## [26] train-rmse:9.981150
## [27] train-rmse:8.724131
## [28] train-rmse:7.650667
## [29] train-rmse:6.708563
## [30] train-rmse:5.900376
## [31] train-rmse:5.194231
## [32] train-rmse:4.568176
## [33] train-rmse:4.035913
## [34] train-rmse:3.571569
## [35] train-rmse:3.170204
## [36] train-rmse:2.811042
## [37] train-rmse:2.489473
## [38] train-rmse:2.211863
## [39] train-rmse:1.955314
## [40] train-rmse:1.737705
## [41] train-rmse:1.546050
## [42] train-rmse:1.377857
## [43] train-rmse:1.225231
## [44] train-rmse:1.093255
## [45] train-rmse:0.971850
## [46] train-rmse:0.866639
## [47] train-rmse:0.773485
## [48] train-rmse:0.691379
## [49] train-rmse:0.618551
## [50] train-rmse:0.553726
## [51] train-rmse:0.495769
## [52] train-rmse:0.444477
## [53] train-rmse:0.398906
## [54] train-rmse:0.358065
## [55] train-rmse:0.321683
## [56] train-rmse:0.288930
## [57] train-rmse:0.259488
## [58] train-rmse:0.233131
## [59] train-rmse:0.209440
## [60] train-rmse:0.187343
## [61] train-rmse:0.167790
## [62] train-rmse:0.150358
## [63] train-rmse:0.134832
## [64] train-rmse:0.120943
## [65] train-rmse:0.108472
## [66] train-rmse:0.097359
## [67] train-rmse:0.087420
## [68] train-rmse:0.078518
## [69] train-rmse:0.070560
## [70] train-rmse:0.063420
## [71] train-rmse:0.056924
## [72] train-rmse:0.051171
## [73] train-rmse:0.045993
## [74] train-rmse:0.041341
## [75] train-rmse:0.037148
## [76] train-rmse:0.033401
## [77] train-rmse:0.030032
## [78] train-rmse:0.027012
## [79] train-rmse:0.024298
## [80] train-rmse:0.021860
## [81] train-rmse:0.019667
## [82] train-rmse:0.017695
## [83] train-rmse:0.015915
## [84] train-rmse:0.014318
## [85] train-rmse:0.012881
## [86] train-rmse:0.011589
## [87] train-rmse:0.010427
## [88] train-rmse:0.009383
## [89] train-rmse:0.008444
## [90] train-rmse:0.007599
## [91] train-rmse:0.006840
## [92] train-rmse:0.006157
## [93] train-rmse:0.005543
## [94] train-rmse:0.004990
## [95] train-rmse:0.004493
## [96] train-rmse:0.004047
## [97] train-rmse:0.003648
## [98] train-rmse:0.003289
## [99] train-rmse:0.002965
## [100] train-rmse:0.002675
#prediction
XGB_TT_prediction <- predict(XGB_Total_trav_model, newdata = as.matrix(Tdata.eval_lag30_s15_XGB[-1]))
#Check
XGB_TT_validate <- data.frame(Tdata.eval_lag30_s15$Total_trav, XGB_TT_prediction)
#Evaluation
rmse.xgb <- round(sqrt(mean((Tdata.eval_lag30_s15$Total_trav - XGB_TT_prediction)^2)),2)
nmse.xgb <- round(sum((Tdata.eval_lag30_s15$Total_trav - XGB_TT_prediction)^2) / sum((Tdata.eval_lag30_s15$Total_trav-mean(Tdata.eval_lag30_s15$Total_trav))^2)*100,2)
rmse.xgb
## [1] 158.2
nmse.xgb
## [1] 43.93
#Check importance
importance_matrix <- xgb.importance(model = XGB_Total_trav_model)
print(importance_matrix)
## Feature Gain Cover Frequency
## 1: Week 0.605492326 0.11760613 0.041773493
## 2: Year 0.081670034 0.02656203 0.017731857
## 3: Avg.PPB 0.056721359 0.06645040 0.049357508
## 4: TrDur_22nightsormore 0.035662860 0.10771803 0.113345717
## 5: nMonth 0.034226322 0.01399555 0.002962986
## 6: Sign_Fulltripcancellations 0.033804525 0.06657973 0.063742573
## 7: OneRet_MultiCity 0.023659269 0.05739302 0.085082211
## 8: nWeekday 0.022914228 0.03162365 0.021462456
## 9: TypeSt_Workweekstay 0.022125231 0.04047559 0.073092098
## 10: Sign_Partialadditions_modif 0.020532328 0.06000644 0.070789260
## 11: Sign_Partialcancellations 0.017033564 0.15662357 0.063128483
## 12: Profile_Business 0.014798899 0.04255655 0.070697146
## 13: TrDur_3nights 0.012724523 0.05013907 0.206764205
## 14: Avg.LOS 0.011385117 0.04858439 0.061777484
## 15: Profile_Group 0.007249416 0.11368585 0.058292522
xgb.plot.importance(importance_matrix = importance_matrix)
CUBIST
CBST_Total_trav_model <- cubist(Tdata.train_lag30_s15_XGB[-1], Tdata.train_lag30_s15_XGB$Total_trav, committees = 5)
#prediction
CBST_TT_prediction <- predict(CBST_Total_trav_model, Tdata.eval_lag30_s15_XGB[-1])
#Check
CBST_TT_validate <- data.frame(Tdata.eval_lag30_s15$Total_trav, CBST_TT_prediction)
#Evaluation
rmse.cbst <- round(sqrt(mean((Tdata.eval_lag30_s15$Total_trav - CBST_TT_prediction)^2)),2)
nmse.cbst <- round(sum((Tdata.eval_lag30_s15$Total_trav - CBST_TT_prediction)^2) / sum((Tdata.eval_lag30_s15$Total_trav-mean(Tdata.eval_lag30_s15$Total_trav))^2)*100,2)
rmse.cbst
## [1] 179.47
nmse.cbst
## [1] 56.53
RESULTS
The table below shows the performance metrics achieved with each of the standalone models. XGBoost proved to be the best performing model, followed by Random Forest.
output.table <- data.frame("Model" = c('XGBoost','Bagging','Random Forest', 'Neural Networks', 'Cubist'),
"RMSE" = c(rmse.xgb,rmse.bg,rmse.rf, rmse.nn, rmse.cbst),
"NMSE" = c(nmse.xgb,nmse.bg,nmse.rf, nmse.nn, nmse.cbst))
kable(output.table)
Model | RMSE | NMSE |
---|---|---|
XGBoost | 158.20 | 43.93 |
Bagging | 182.62 | 58.54 |
Random Forest | 164.85 | 47.70 |
Neural Networks | 217.97 | 83.39 |
Cubist | 179.47 | 56.53 |
XGBOOST
The conclusions from the different models assessed show that XGBoost is the best performing model. Therefore, we focused on refining this model and checking if a different combination of predictors or parameters could further improve the metrics.
pdsetlag30t <- data.frame(pdset[c('Date', 'Total_trav')],
lapply(dplyr::select(pdset, -c('Date', 'Total_trav', 'WeekDay', 'Month', 'Week', 'Year')),
FUN = dplyr::lag,
n=30))
pdsetlag30t <- pdsetlag30t %>% dplyr::mutate(WeekDay=lubridate::wday(pdsetlag30t$Date), Month=lubridate::month(pdsetlag30t$Date), Week=lubridate::week(pdsetlag30t$Date), Year=lubridate::year(pdsetlag30t$Date), lag30_totaltravelers=dplyr::lag(pdsetlag30$Total_trav,n = 30, default = NA)) %>% dplyr::filter(Date > '2014-01-31') %>% dplyr::select(c('Date', 'Total_trav', 'WeekDay', 'Month', 'Week', 'Year', 'TrDur_22nightsormore', 'TypeSt_Workweekstay', 'Avg.PPB', 'Avg.LOS', 'Avg.LT', 'Profile_Business'))
#Run model
Tdata.train_lag30_XGBt <- pdsetlag30t %>% dplyr::filter(Date < '2019-08-01') %>% dplyr::select(-Date)
Tdata.eval_lag30_XGBt <- pdsetlag30t %>% dplyr::filter(Date > '2019-07-31') %>% dplyr::select(-Date)
#Create matrices for XGB model
XGB_train_matrix <- xgb.DMatrix(data = as.matrix(Tdata.train_lag30_XGBt[-1]), label = Tdata.train_lag30_XGBt$Total_trav)
XGB_test_matrix <- xgb.DMatrix(data = as.matrix(Tdata.eval_lag30_XGBt[-1]), label = Tdata.eval_lag30_XGBt$Total_trav)
#Define parameters of model
params <- list(booster = "gbtree", objective = "reg:squarederror", eta=0.2, gamma=0, max_depth=25, min_child_weight=1, subsample=1, colsample_bytree=1)
#Check optimal nrounds with CV
#xgbcv <- xgb.cv( params = params, data = XGB_train_matrix, nrounds = 100, nfold = 5, showsd = T, stratified = T, print.every.n = 10, early.stop.round = 20, maximize = F)
XGB_Total_trav_modelt <- xgboost(data = XGB_train_matrix,
nrounds = 28,
eval_metric = 'rmse',
params = params)
## [1] train-rmse:577.234436
## [2] train-rmse:472.615173
## [3] train-rmse:388.825104
## [4] train-rmse:321.971680
## [5] train-rmse:268.173828
## [6] train-rmse:224.245224
## [7] train-rmse:189.055267
## [8] train-rmse:159.243744
## [9] train-rmse:134.827805
## [10] train-rmse:114.467705
## [11] train-rmse:97.253555
## [12] train-rmse:82.858589
## [13] train-rmse:70.844826
## [14] train-rmse:60.536037
## [15] train-rmse:51.970814
## [16] train-rmse:44.666649
## [17] train-rmse:38.455173
## [18] train-rmse:33.235947
## [19] train-rmse:28.641903
## [20] train-rmse:24.748526
## [21] train-rmse:21.432110
## [22] train-rmse:18.582981
## [23] train-rmse:16.175577
## [24] train-rmse:14.076773
## [25] train-rmse:12.294971
## [26] train-rmse:10.754439
## [27] train-rmse:9.445511
## [28] train-rmse:8.293904
#prediction
XGB_TT_predictiont <- predict(XGB_Total_trav_modelt, newdata = XGB_test_matrix)
#Check
XGB_TT_validatet <- data.frame(Tdata.eval_lag30_XGBt$Total_trav, round(XGB_TT_predictiont,0))
#Evaluation
rmse.xgbt <- round(sqrt(mean((Tdata.eval_lag30_XGBt$Total_trav - XGB_TT_predictiont)^2)),2)
nmse.xgbt <- round(sum((Tdata.eval_lag30_XGBt$Total_trav - XGB_TT_predictiont)^2) / sum((Tdata.eval_lag30_XGBt$Total_trav-mean(Tdata.eval_lag30_XGBt$Total_trav))^2)*100,2)
#Check importance
# importance_matrixt <- xgb.importance(model = XGB_Total_trav_modelt)
# print(importance_matrixt)
# xgb.plot.importance(importance_matrix = importance_matrixt)
rmse.xgbt
## [1] 112.82
nmse.xgbt
## [1] 22.34
After a few iterations with the XGBoost parameters we were able to narrow down the list of predictors to 10 and significantly improve the metrics.
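The tuning itself was iterative. The sketch below is our own illustration of one way to reproduce it, assuming the XGB_train_matrix and params objects from the chunk above: it uses xgb.cv (the call commented out in that chunk) to pick the number of boosting rounds and xgb.importance to flag weak predictors; the 0.01 gain cutoff is an illustrative threshold, not the one used in the report.
#Hypothetical tuning sketch: 5-fold CV to choose nrounds, then prune predictors by gain
set.seed(1234)
xgbcv <- xgb.cv(params = params, data = XGB_train_matrix, nrounds = 200,
                nfold = 5, early_stopping_rounds = 20, verbose = 0)
xgbcv$best_iteration #candidate value for nrounds in xgboost()
imp_t <- xgb.importance(model = XGB_Total_trav_modelt)
imp_t$Feature[imp_t$Gain > 0.01] #candidate predictors to keep for the next iteration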
We then replicated the Random Forest model with the same predictors as the XGBoost.
RANDOM FOREST
#Run model
RF_Total_trav_modelt <- randomForest(Tform_Total_trav, data = Tdata.train_lag30_XGBt, ntree = 1000)
#prediction
RF_TT_predictiont <- predict(RF_Total_trav_modelt, Tdata.eval_lag30_XGBt)
#Check
RF_TT_validate <- data.frame(Tdata.eval_lag30_XGBt$Total_trav, RF_TT_predictiont)
#Evaluation
rmse.rft <- round(sqrt(mean((Tdata.eval_lag30_XGBt$Total_trav - RF_TT_predictiont)^2)),2)
nmse.rft <- round(sum((Tdata.eval_lag30_XGBt$Total_trav - RF_TT_predictiont)^2) / sum((Tdata.eval_lag30_XGBt$Total_trav-mean(Tdata.eval_lag30_XGBt$Total_trav))^2)*100,2)
rmse.rft
## [1] 134.94
nmse.rft
## [1] 31.96
With this same combination of predictors we were able to improve the metrics significantly. Nevertheless, the XGBoost still outperforms the Random Forest.
The table below shows the summary metrics.
output.tablet <- data.frame("Model" = c('XGBoost Refined','Random Forest Refined'),
"RMSE" = c(rmse.xgbt,rmse.rft),
"NMSE" = c(nmse.xgbt,nmse.rft))
kable(output.tablet)
Model | RMSE | NMSE |
---|---|---|
XGBoost Refined | 112.82 | 22.34 |
Random Forest Refined | 134.94 | 31.96 |
Lastly, we built a data frame containing daily predictions for August from both models, and also calculated the average of the two predictions (a simple ensemble) together with its performance metrics.
MLpredOutTble <- data.frame(Date=seq.Date(from = as.Date('2019-08-01'), to = as.Date('2019-08-31'), by = 'days'),
Actuals=Tdata.eval_lag30_XGBt$Total_trav,
XGBoost=round(XGB_TT_predictiont,0),
RandomForest=round(RF_TT_predictiont,0))
MLpredOutTble <- dplyr::mutate(MLpredOutTble,
Prediction_Average = round((MLpredOutTble$XGBoost + MLpredOutTble$RandomForest)/2,0))
ensmb.rmse <- round(sqrt(mean((MLpredOutTble$Actuals - MLpredOutTble$Prediction_Average)^2)),2)
ensmb.nmse <- round(sum((MLpredOutTble$Actuals - MLpredOutTble$Prediction_Average)^2) / sum((MLpredOutTble$Actuals-mean(MLpredOutTble$Actuals))^2)*100,2)
kable(MLpredOutTble)
Date | Actuals | XGBoost | RandomForest | Prediction_Average |
---|---|---|---|---|
2019-08-01 | 999 | 986 | 917 | 952 |
2019-08-02 | 1102 | 885 | 879 | 882 |
2019-08-03 | 1269 | 1045 | 1001 | 1023 |
2019-08-04 | 920 | 909 | 896 | 902 |
2019-08-05 | 1068 | 870 | 872 | 871 |
2019-08-06 | 798 | 742 | 840 | 791 |
2019-08-07 | 713 | 822 | 835 | 828 |
2019-08-08 | 828 | 866 | 851 | 858 |
2019-08-09 | 832 | 806 | 862 | 834 |
2019-08-10 | 1186 | 1101 | 1007 | 1054 |
2019-08-11 | 869 | 879 | 853 | 866 |
2019-08-12 | 773 | 826 | 876 | 851 |
2019-08-13 | 851 | 758 | 769 | 764 |
2019-08-14 | 671 | 845 | 760 | 802 |
2019-08-15 | 584 | 657 | 750 | 704 |
2019-08-16 | 701 | 706 | 751 | 728 |
2019-08-17 | 672 | 756 | 834 | 795 |
2019-08-18 | 448 | 635 | 648 | 642 |
2019-08-19 | 498 | 617 | 742 | 680 |
2019-08-20 | 607 | 563 | 644 | 604 |
2019-08-21 | 583 | 500 | 687 | 594 |
2019-08-22 | 578 | 522 | 592 | 557 |
2019-08-23 | 468 | 468 | 596 | 532 |
2019-08-24 | 727 | 815 | 774 | 794 |
2019-08-25 | 498 | 538 | 552 | 545 |
2019-08-26 | 484 | 445 | 485 | 465 |
2019-08-27 | 383 | 405 | 568 | 486 |
2019-08-28 | 347 | 454 | 455 | 454 |
2019-08-29 | 392 | 455 | 477 | 466 |
2019-08-30 | 934 | 640 | 700 | 670 |
2019-08-31 | 589 | 661 | 744 | 702 |
ggplot(MLpredOutTble, aes(x = Date)) + geom_line(aes(y = Actuals, color='Actuals'), linetype=5, size=1.5) + geom_line(aes(y = XGBoost, color='XGBoost'))+ geom_line(aes(y = RandomForest, color='Random Forest')) + labs(color='Legend') + theme_classic()
The dataset consists of data points collected at constant time intervals. As a result, time series forecasting is another natural methodology for predicting the number of Chinese passengers traveling to Paris. The time series models considered include decomposition models, exponential smoothing, ARIMA models and Facebook's Prophet. The first step for this type of forecasting is to understand the inherent properties of the series, such as trend, seasonal patterns and cycles. Time series data can exhibit a variety of patterns, and it is often helpful to split a series into several components, each representing an underlying pattern category. In order to identify those components, the decompose function is applied to the data as per below:
max <- lubridate::yday(max(pdset$Date))
ts <- xts(x = pdset$Total_trav, order.by = pdset$Date)
ts1 <- as.ts(ts)
dfts <- ts(ts1, frequency=365)
dec <- decompose(dfts)
autoplot(dec)
As described in the exploratory analysis and shown in the graph above, the data follow weekly, monthly and yearly seasonal patterns, which makes sense because high-frequency (daily) data generally exhibit more complicated seasonality. ETS and conventional ARIMA models cannot handle series of this length with multiple seasonal patterns. Instead, we convert the time series to an msts object, which can handle multiple seasonality and allows us to specify all the frequencies that might be relevant. After that, we use more advanced modelling techniques, namely dynamic harmonic regression and TBATS. Lastly, we use Facebook's Prophet model, which can also handle high-frequency series with multiple seasonal patterns.
First, we convert the time series into an msts object to capture the weekly pattern as well as the longer annual pattern. The value 7 corresponds to the weekly pattern and 365.25 is the average length of a year, allowing for leap years.
### Converting Series in a msts object
date_range <- as.data.frame(seq(as.Date("2014-01-02"), as.Date("2019-08-31"), by = 'days'))
colnames(date_range) <- c("Date")
df1 <- pdset%>%dplyr::select(Date, Total_trav, Profile_Business, Avg.LOS, Avg.PPB, Profile_Group)
df2 <- df1%>%dplyr::mutate(Date = as.Date("2015-01-01"), Total_trav = 84, Profile_Business = 31, Avg.LOS = 8.19, Avg.PPB = 4.38, Profile_Group = 31)
df2 <- df2[1,]
df <- rbind(df1,df2)%>%dplyr::arrange(Date)
y <- msts(df$Total_trav, start = c(2014,2), end = c(2019,244), seasonal.periods=c(7,365.25))
train <- window(y, start = c(2014,2), end = c(2019,213))
test <- window(y, start= c(2019,214), end = c(2019,244))
date_range1 <- as.data.frame(seq(as.Date("2014-01-02"), as.Date("2019-07-31"), by = 'days'))
date_range2 <- as.data.frame(seq(as.Date("2019-08-01"), as.Date("2019-08-31"), by = 'days'))
colnames(date_range1) <- c("Date")
colnames(date_range2) <- c("Date")
a <- as.data.frame(train)
b <- as.data.frame(test)
colnames(a) <- c("value")
colnames(b) <- c("value")
traindf <- a%>%mutate(time = date_range1$Date)%>%dplyr::select(time,value)
testdf <- b%>%mutate(time = date_range2$Date)%>%dplyr::select(time,value)
After converting the time series into an msts object, we apply the time series models and test the forecasting accuracy of each.
Dynamic Harmonic Regression
The Arima() and auto.arima() functions allow a seasonal period of up to m=350, but in practice they usually run out of memory whenever the seasonal period exceeds about 200. In any case, high-order seasonal differencing does not make much sense for daily data: it amounts to comparing what happened today with what happened exactly a year ago, and there is no constraint that the seasonal pattern be smooth. For such series it is preferable to use a harmonic regression approach, where the seasonal pattern is modelled using Fourier terms and the short-term dynamics are handled by an ARMA error.
The fourier() function makes it easy to generate the required harmonics. Its first argument identifies the seasonal periods m and the length of the predictors to return. The second argument, K, specifies how many pairs of sine and cosine terms to include; the maximum allowed is K = m/2, where m is the seasonal period. The higher the order K, the more "wiggly" the seasonal pattern is allowed to be; with K=1 it is a pure sine curve.
We fit a dynamic harmonic regression model with an ARMA error structure. The number of Fourier terms for each seasonal period was chosen to minimize the AIC and reduce the RMSE and NMSE. As the data follow both a weekly and an annual pattern, two K values are included in the model; the selected pair is shown in the code below, and a sketch of how such a search could be run follows.
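As a rough illustration of how the annual Fourier order could be screened (our own sketch, not necessarily the exact search we ran), the loop below fixes the weekly order at 1 and compares the AICc of models with different annual orders, using the train object defined above. It is slow, since auto.arima is refitted for each candidate.
#Hypothetical screening of the annual Fourier order (weekly order fixed at 1); smaller AICc is better
aicc_by_K <- sapply(1:10, function(K2) {
  xreg <- fourier(train, K = c(1, K2))
  fit  <- auto.arima(train, xreg = xreg, seasonal = FALSE)
  fit$aicc
})
which.min(aicc_by_K) #annual order with the lowest AICc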
### Dynamic Harmonic Regression
z <- fourier(train, K=c(1,7))
zf <- fourier(train, K=c(1,7), h=31)
fit <- auto.arima(train, xreg=z, seasonal=FALSE)
fc <- forecast(fit, xreg=zf, h=31)
rmseDHR <- accuracy(fc,y)["Test set", "RMSE"]
rmseDHR
## [1] 150.6233
nmseDHR<- round(sum((testdf$value - fc$mean)^2) / sum((testdf$value-mean(testdf$value))^2)*100,2)
nmseDHR
## [1] 39.82
The RMSE on the test data is 150.62, which is higher than that of the XGBoost model. The graph below compares the predicted values with the test data. The predictions follow the trend but do not capture the weekly peaks in the data.
### Dynamic Harmonic Regression Forecast vs Actuals
forecast2 <- fc$mean
traindf2 <- traindf%>%dplyr::mutate(forecast = as.numeric(value), actuals = as.numeric(value))%>%dplyr::select(time, actuals, forecast)
results2 <- testdf%>%dplyr::mutate(forecast = fc$mean, actuals = as.numeric(testdf$value))%>%dplyr::select(time, actuals, forecast)
all2 <- rbind(traindf2, results2)%>%dplyr::select(time, actuals, forecast)%>%dplyr::filter(time > "2019-06-01")
p <- ggplot() +
geom_line(data = all2, aes(x = time, y = forecast, color = 'Forecast'))+
geom_line(data = all2, aes(x = time, y = actuals, color = 'Actuals')) +
xlab('Dates') +
ylab('Travellers') + scale_color_manual(values = c( Forecast = "red", Actuals = "blue"))+
labs(color="") + ggtitle("Dynamic Harmonic Regression Forecast") + theme_bw() +
theme(plot.title = element_text(size = 15, hjust = 0.5, face = "bold"))
p
TBATS
The TBATS model was introduced by De Livera, Hyndman & Snyder (2011, JASA). "TBATS" is an acronym denoting its salient features:
• T for trigonometric regressors to model multiple seasonalities
• B for Box-Cox transformations
• A for ARMA errors
• T for trend
• S for seasonality
A TBATS model differs from dynamic harmonic regression in that the seasonality is allowed to change slowly over time, whereas harmonic regression terms force the seasonal pattern to repeat periodically without changing. TBATS is one way to incorporate multiple seasonality into our model: it automates the choice of a Box-Cox transformation for the target variable, the parameters of the ARMA error model and the Fourier terms for the seasonal components.
### TBATS
fit <- tbats(train)
fc1 <- forecast(fit, h = 31)
rmsetbats <- accuracy(fc1,y)["Test set", "RMSE"]
rmsetbats
## [1] 183.0618
nmsetbats <- round(sum((testdf$value - fc1$mean)^2) / sum((testdf$value-mean(testdf$value))^2)*100,2)
nmsetbats
## [1] 58.82
The RMSE on the test data is 183, which is higher than both the dynamic harmonic regression and the XGBoost model. The graph below compares the predicted values with the test data. The predictions follow the trend but do not capture the weekly peaks and drops in the data.
### TBATS Forecast vs Actuals
traindf3 <- traindf%>%mutate(forecast = as.numeric(value), actuals = as.numeric(value))%>%dplyr::select(time, actuals, forecast)
results3 <- testdf%>%mutate(forecast = fc1$mean, actuals = as.numeric(testdf$value))%>%dplyr::select(time, actuals, forecast)
all <- rbind(traindf3, results3)%>%dplyr::select(time, actuals, forecast)%>%dplyr::filter(time > "2019-06-01")
p <- ggplot() +
geom_line(data = all, aes(x = time, y = forecast, color = 'Forecast'))+
geom_line(data = all, aes(x = time, y = actuals, color = 'Actuals')) +
xlab('Dates') +
ylab('Travellers') + scale_color_manual(values = c( Forecast = "red", Actuals = "blue"))+
labs(color= "")+
ggtitle("TBATS Forecast") + theme_bw() +
theme(plot.title = element_text(size = 15, hjust = 0.5, face = "bold"))
p
Prophet Model
Prophet is an algorithm from Facebook that fits an additive model including seasonality, autocorrelation, extra regressors, etc. One of the features of the prophet() function is that it automatically chooses change points in the time series, which makes the resulting model somewhat more robust than the other models considered.
Prophet uses a decomposable time series model with three main components: trend, seasonality and holidays, combined in the following equation: y(t) = g(t) + s(t) + h(t) + εt, where:
• g(t): piecewise linear or logistic growth curve modelling non-periodic changes in the series
• s(t): periodic changes (e.g. weekly/yearly seasonality)
• h(t): effects of holidays (user provided) with irregular schedules
• εt: error term accounting for any unusual changes not accommodated by the model
Using time as a regressor, Prophet fits several linear and non-linear functions of time as components. It frames forecasting as a curve-fitting exercise rather than looking explicitly at the time-based dependence of each observation within the series.
Seasonalities in Prophet are estimated using a partial Fourier sum. The number of terms in the partial sum (the order) is a parameter that determines how quickly the seasonality can change. For our time series, the Prophet parameters were selected to minimize the RMSE on the test data. For example, the yearly seasonality order was reduced from its default to fit lower-frequency changes, as the yearly pattern does not change much across years, while the default weekly seasonality setting was kept, as it fits the higher-frequency changes in the weekly pattern.
A holiday parameter was added to include the public holidays in China, as the data follow a different pattern on those days.
The code below displays the parametrization process for this algorithm.
### Prophet Model
trainr <- df%>%dplyr::select(Date, Total_trav, Profile_Business, Avg.LOS)%>%dplyr::filter(Date < "2019-08-01")
testr <- df%>%dplyr::select(Date, Total_trav, Profile_Business, Avg.LOS)%>%dplyr::filter(Date >= "2019-08-01")
## Creating data frame for prophet model
trainp <- traindf2%>%dplyr::select(time, actuals)%>%mutate(ds = time, y = actuals)%>%dplyr::select(ds, y)
testp <- testdf%>%mutate(ds = time, y = value)%>%dplyr::select(ds, y)
### Creating the outliers
outliers <- data_frame(
holiday = 'holiday',
ds = as.Date(c('2015-08-08', '2015-08-01', '2019-08-30')),
lower_window = 0,
upper_window = 1
)
holidays <- outliers
m <- prophet(weekly.seasonality= TRUE, yearly.seasonality = 7, daily.seasonality = TRUE, seasonality.mode = "additive", changepoint.prior.scale = 5, holidays.prior.scale = 1, holidays = holidays)
m <- add_country_holidays(m, country_name = 'China')
m <- fit.prophet(m, trainp)
## Forecasting Model
future <- make_future_dataframe(m, periods = 31, include_history = FALSE)
fcst1 <- predict(m, future)
rmseprophet <- accuracy(fcst1$yhat,testp$y)["Test set", "RMSE"]
rmseprophet
## [1] 110.4259
nmseprophet <- round(sum((testp$y - fcst1$yhat)^2) / sum((testp$y-mean(testp$y))^2)*100,2)
nmseprophet
## [1] 21.4
The RMSE on the test data is 110.43, the lowest among all the models so far. The graph below compares the predicted values with the test data. The predictions follow the trend and the yearly, weekly and daily seasonality, and the model also captures the change points in the data (peaks and drops).
### Prophet Model Forecast vs Actuals
results4 <- fcst1%>%mutate(time = as.Date(ds), forecast = round(yhat, 0), actuals = testdf$value)%>%dplyr::select(time, actuals, forecast)
trainf4 <- traindf2%>%dplyr::select(time, actuals, forecast)
all3 <- rbind(trainf4, results4)%>%dplyr::filter(time > "2019-06-01")
p <- ggplot() +
geom_line(data = all3, aes(x = time, y = forecast, color = 'Forecast'))+
geom_line(data = all3, aes(x = time, y = actuals, color = 'Actuals')) +
xlab('Dates') +
ylab('Travellers') + scale_color_manual(values = c( Forecast = "red", Actuals = "blue"))+
labs(color="") + ggtitle("Prophet Forecast") + theme_bw() +
theme(plot.title = element_text(size = 15, hjust = 0.5, face = "bold"))
p
Prophet Model with Avg LOS as Regressor
In Prophet, additional regressors can be added to the linear part of the model using the add_regressor function. A column with the regressor value must be present in both the fitting and prediction data frames. In our case, adding the Avg.LOS predictor as a regressor improved the RMSE from 110.43 to 109.80 (see code and graph below).
### Prophet Model with Avg.LOS as Regressor
## Adding a Regressor
### Avg.LOS Prophet Model
t4 <- as.data.frame(trainr%>%dplyr::select(Date, Avg.LOS)%>%mutate(ds = as.Date(Date), y = Avg.LOS)%>%dplyr::select(ds, y))
test4 <- as.data.frame(testr%>%dplyr::select(Date, Avg.LOS)%>%mutate(ds = Date, y = Avg.LOS)%>%dplyr::select(ds, y))
m4 <- prophet(weekly.seasonality= FALSE, yearly.seasonality = 6, daily.seasonality = 4, seasonality.mode = "additive", changepoint.prior.scale = 0.001, holidays = holidays)
m4 <- fit.prophet(m4, t4)
future4 <- make_future_dataframe(m4, periods = 31, include_history = FALSE)
fcst4 <- predict(m4, future4)
trainp1 <- traindf2%>%dplyr::select(time, actuals)%>%mutate(ds = time, y = actuals)%>%dplyr::select(ds, y)
trainp1 <- dplyr::left_join(trainp1, t4%>%dplyr::rename(Avg.LOS = y), by = c('ds' = 'ds'))
trainp1 <- trainp1%>%dplyr::mutate(Avg.LOS = ifelse(is.nan(Avg.LOS), 0, Avg.LOS))
testp1 <- testdf%>%mutate(ds = time, y = value)%>%dplyr::select(ds, y)
## Prophet Model with Avg.LOS as regressor
m <- prophet(weekly.seasonality= TRUE, yearly.seasonality = 7, daily.seasonality = TRUE, seasonality.mode = "additive", changepoint.prior.scale = 5, holidays.prior.scale = 1, holidays = holidays)
m <- add_country_holidays(m, country_name = 'China')
m <- add_regressor(m, 'Avg.LOS')
m <- fit.prophet(m, trainp1)
future <- make_future_dataframe(m, periods = 31, include_history = FALSE)
future.2 <- dplyr::left_join(future,fcst4 %>%
dplyr::select(ds, yhat), by = c('ds' = 'ds')) %>%
dplyr::rename(Avg.LOS= yhat)
future.2$Avg.LOS <- ifelse(is.na(future.2$Avg.LOS), mean(future.2$Avg.LOS, na.rm = TRUE), future.2$Avg.LOS)
fcst2 <- predict(m, future.2)
rmserprophetavglos <- accuracy(fcst2$yhat,testp1$y)["Test set", "RMSE"]
rmserprophetavglos
## [1] 109.7955
nmseprophetr <- round(sum((testp1$y - fcst2$yhat)^2) / sum((testp1$y-mean(testp1$y))^2)*100,2)
nmseprophetr
## [1] 21.16
### Prophet Model with Avg.LOS as Regressor Forecast vs Actuals
results5 <- fcst2%>%mutate(time = as.Date(ds), forecast = round(yhat, 0), actuals = testdf$value)%>%dplyr::select(time, actuals, forecast)
trainf5 <- traindf2%>%dplyr::select(time, actuals, forecast)
all4 <- rbind(trainf5, results5)%>%dplyr::filter(time > "2019-06-01")
p <- ggplot() +
geom_line(data = all4, aes(x = time, y = forecast, color = 'Forecast'))+
geom_line(data = all4, aes(x = time, y = actuals, color = 'Actuals')) +
xlab('Dates') +
ylab('Travellers') + scale_color_manual(values = c( Forecast = "red", Actuals = "blue"))+
labs(color="") + ggtitle("Prophet Forecast") + theme_bw() +
theme(plot.title = element_text(size = 15, hjust = 0.5, face = "bold"))
p
As part of future research, the Prophet model could be validated with more regressors in order to improve performance; one such extension is sketched below.
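As an illustration of that direction (our own sketch, not part of the current results), Profile_Business, which is already carried in the trainr and testr frames above, could be added alongside Avg.LOS following the same add_regressor pattern:
#Hypothetical extension: a second regressor on top of Avg.LOS
m5 <- prophet(weekly.seasonality = TRUE, yearly.seasonality = 7, daily.seasonality = TRUE,
              seasonality.mode = "additive", changepoint.prior.scale = 5,
              holidays.prior.scale = 1, holidays = holidays)
m5 <- add_country_holidays(m5, country_name = 'China')
m5 <- add_regressor(m5, 'Avg.LOS')
m5 <- add_regressor(m5, 'Profile_Business')
#fit.prophet() then needs ds, y, Avg.LOS and Profile_Business columns in the training frame,
#and the future frame needs values (or forecasts) for both regressors before calling predict()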
We have tried a number of ML and time series models and concluded that Random Forest and XGBoost are the best performing on the ML side, while Prophet is the best on the time series side.
In the coming weeks, we will try to refine these models. Some ideas along this front are:
Below we show the table with the predictions for August for each model compared to the actuals and the table with the performance metrics (i.e. RMSE and NMSE) for each model.
### Combined table of predictions from the ML, ensemble and time series models
append1 <- traindf2%>%dplyr::select(time, actuals)%>%mutate(Date = time, Actuals = actuals, XGBoost = actuals, RandomForest = actuals, Prediction_Average = actuals,
DHR = actuals, TBATS= actuals, Prophet = actuals, Prophet_Regressor = actuals)%>%
dplyr::select(Date, Actuals, XGBoost, RandomForest, Prediction_Average,DHR,TBATS, Prophet, Prophet_Regressor)
MLpredOutTble <- MLpredOutTble%>%mutate(DHR = results2$forecast, TBATS = results3$forecast, Prophet = results4$forecast, Prophet_Regressor = results5$forecast)
apptable <- rbind(append1, MLpredOutTble)%>%dplyr::filter(Date > '2019-06-30')
write.csv(apptable, "apptable.csv")
kable(MLpredOutTble )
Date | Actuals | XGBoost | RandomForest | Prediction_Average | DHR | TBATS | Prophet | Prophet_Regressor |
---|---|---|---|---|---|---|---|---|
2019-08-01 | 999 | 986 | 917 | 952 | 807.8437 | 767.5234 | 977 | 976 |
2019-08-02 | 1102 | 885 | 879 | 882 | 893.2569 | 706.9147 | 939 | 940 |
2019-08-03 | 1269 | 1045 | 1001 | 1023 | 938.2670 | 932.2406 | 1109 | 1115 |
2019-08-04 | 920 | 909 | 896 | 902 | 952.6667 | 865.8596 | 1028 | 1027 |
2019-08-05 | 1068 | 870 | 872 | 871 | 898.5229 | 789.5321 | 949 | 940 |
2019-08-06 | 798 | 742 | 840 | 791 | 841.7251 | 685.4971 | 878 | 867 |
2019-08-07 | 713 | 822 | 835 | 828 | 802.0072 | 714.6886 | 884 | 866 |
2019-08-08 | 828 | 866 | 851 | 858 | 821.0152 | 695.4295 | 861 | 856 |
2019-08-09 | 832 | 806 | 862 | 834 | 863.3970 | 647.7509 | 814 | 811 |
2019-08-10 | 1186 | 1101 | 1007 | 1054 | 899.3646 | 845.2376 | 974 | 977 |
2019-08-11 | 869 | 879 | 853 | 866 | 881.6828 | 781.9332 | 886 | 882 |
2019-08-12 | 773 | 826 | 876 | 851 | 818.9974 | 706.1233 | 800 | 788 |
2019-08-13 | 851 | 758 | 769 | 764 | 738.7777 | 605.0032 | 722 | 710 |
2019-08-14 | 671 | 845 | 760 | 802 | 692.1391 | 622.0901 | 723 | 704 |
2019-08-15 | 584 | 657 | 750 | 704 | 694.8388 | 597.1188 | 698 | 691 |
2019-08-16 | 701 | 706 | 751 | 728 | 732.6611 | 549.5090 | 647 | 645 |
2019-08-17 | 672 | 756 | 834 | 795 | 758.3649 | 711.6362 | 807 | 811 |
2019-08-18 | 448 | 635 | 648 | 642 | 738.9408 | 654.4061 | 719 | 716 |
2019-08-19 | 498 | 617 | 742 | 680 | 671.2565 | 589.4530 | 635 | 624 |
2019-08-20 | 607 | 563 | 644 | 604 | 592.3417 | 505.4215 | 561 | 550 |
2019-08-21 | 583 | 500 | 687 | 594 | 545.3482 | 522.3883 | 566 | 550 |
2019-08-22 | 578 | 522 | 592 | 557 | 552.4410 | 505.3893 | 547 | 543 |
2019-08-23 | 468 | 468 | 596 | 532 | 594.0012 | 469.9450 | 504 | 505 |
2019-08-24 | 727 | 815 | 774 | 794 | 627.0628 | 617.2653 | 672 | 680 |
2019-08-25 | 498 | 538 | 552 | 545 | 614.9296 | 575.8284 | 595 | 596 |
2019-08-26 | 484 | 445 | 485 | 465 | 557.2997 | 526.8783 | 521 | 516 |
2019-08-27 | 383 | 405 | 568 | 486 | 488.6655 | 459.3342 | 459 | 454 |
2019-08-28 | 347 | 454 | 455 | 454 | 454.0290 | 483.2979 | 478 | 467 |
2019-08-29 | 392 | 455 | 477 | 466 | 473.7872 | 476.0727 | 472 | 474 |
2019-08-30 | 934 | 640 | 700 | 670 | 529.5072 | 450.7729 | 1033 | 1044 |
2019-08-31 | 589 | 661 | 744 | 702 | 576.9698 | 603.0402 | 704 | 737 |
output.tablet <- data.frame("Model" = c('XGBoost Refined','Random Forest Refined', "Ensemble", "DHR", "TBATS", "Prophet", "Prophet with Regressor"),
"RMSE" = c(rmse.xgbt,rmse.rft, ensmb.rmse, rmseDHR, rmsetbats, rmseprophet, rmserprophetavglos ),
"NMSE" = c(nmse.xgbt,nmse.rft, ensmb.nmse, nmseDHR, nmsetbats, nmseprophet, nmseprophetr))
kable(output.tablet)
Model | RMSE | NMSE |
---|---|---|
XGBoost Refined | 112.8200 | 22.34 |
Random Forest Refined | 134.9400 | 31.96 |
Ensemble | 118.5200 | 24.65 |
DHR | 150.6233 | 39.82 |
TBATS | 183.0618 | 58.82 |
Prophet | 110.4259 | 21.40 |
Prophet with Regressor | 109.7955 | 21.16 |