The second part of the project is based on a dataset concerning hotel bookings. The data was retrieved from Kaggle and is available here. It consists of almost 120 thousand observations and 32 variables. A wider description of the variables, as well as details about the data collection, can be found in the article here. The aim of this analysis is to obtain good predictions of whether a reservation was cancelled or not. After some initial preparation of the variables, including their selection and transformation, multiple models are trained to achieve this goal.

Libraries used

library(Rmisc)
library(dplyr)
library(ggplot2)
library(bestNormalize)
library(caret)
library(corrplot)
library(stargazer)
library(rpart)
library(rattle)
library(pROC)
library(neuralnet)
library(here)

First glimpse into data

hotel <- read.csv2('hotel_bookings.csv', sep =',', dec = '.', 
                   encoding =  'UTF-8', header = T, stringsAsFactors = T)

glimpse(hotel)
## Rows: 119,390
## Columns: 32
## $ hotel                          <fct> Resort Hotel, Resort Hotel, Resort Hote~
## $ is_canceled                    <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, ~
## $ lead_time                      <int> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75, ~
## $ arrival_date_year              <int> 2015, 2015, 2015, 2015, 2015, 2015, 201~
## $ arrival_date_month             <fct> July, July, July, July, July, July, Jul~
## $ arrival_date_week_number       <int> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,~
## $ arrival_date_day_of_month      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ stays_in_weekend_nights        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ stays_in_week_nights           <int> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, ~
## $ adults                         <int> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, ~
## $ children                       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ babies                         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ meal                           <fct> BB, BB, BB, BB, BB, BB, BB, FB, BB, HB,~
## $ country                        <fct> PRT, PRT, GBR, GBR, GBR, GBR, PRT, PRT,~
## $ market_segment                 <fct> Direct, Direct, Direct, Corporate, Onli~
## $ distribution_channel           <fct> Direct, Direct, Direct, Corporate, TA/T~
## $ is_repeated_guest              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ previous_cancellations         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ previous_bookings_not_canceled <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ reserved_room_type             <fct> C, C, A, A, A, A, C, C, A, D, E, D, D, ~
## $ assigned_room_type             <fct> C, C, C, A, A, A, C, C, A, D, E, D, E, ~
## $ booking_changes                <int> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ deposit_type                   <fct> No Deposit, No Deposit, No Deposit, No ~
## $ agent                          <fct> NULL, NULL, NULL, 304, 240, 240, NULL, ~
## $ company                        <fct> NULL, NULL, NULL, NULL, NULL, NULL, NUL~
## $ days_in_waiting_list           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ customer_type                  <fct> Transient, Transient, Transient, Transi~
## $ adr                            <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00,~
## $ required_car_parking_spaces    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ total_of_special_requests      <int> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3, ~
## $ reservation_status             <fct> Check-Out, Check-Out, Check-Out, Check-~
## $ reservation_status_date        <fct> 2015-07-01, 2015-07-01, 2015-07-02, 201~

Our target variable is is_canceled. Besides that, there are many variables with details about the reservation, the customers and the distribution of the specific offer. There is also a reservation_status variable which essentially duplicates our target, so we won’t use it (or reservation_status_date) at all.

Split into train and test sample

To be able to check the correctness of our predictions, we need to start by splitting the dataset into two parts. The training sample, consisting of 70% of all observations, will be used to check whether any data preparation is needed and to train our models, while the remaining 30% of observations will serve as a test sample used only to assess the quality of our predictions on new, unseen data. Since createDataPartition samples within groups of the outcome, the cancellation rate should be preserved in both samples. To make the whole analysis replicable, a fixed seed will be set where necessary.

set.seed(123456789)
training_obs <- createDataPartition(hotel$is_canceled, 
                                    p = 0.7, 
                                    list = FALSE)
hotel_train <- hotel[training_obs,]
hotel_test  <- hotel[-training_obs,]

nrow(hotel_train)
## [1] 83573
nrow(hotel_test)
## [1] 35817

The training data contains over 31 thousand cancelled reservations and over 52 thousand bookings that weren’t cancelled.

table(hotel_train$is_canceled)
## 
##     0     1 
## 52527 31046
hotel_train$is_canceled <- as.factor(hotel_train$is_canceled)
hotel_train$is_canceled <- recode(hotel_train$is_canceled, "0" = "No", "1" = "Yes")

ggplot(hotel_train,
       aes(x = is_canceled)) +
  geom_bar(fill = "orange",
                 color = 'black') +
  theme_minimal()
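
The class shares can also be expressed as proportions (a quick check):

prop.table(table(hotel_train$is_canceled))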

Variables transformation

Box-Cox transformation

summary(hotel_train$adr)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -6.38   69.30   94.95  101.90  126.00 5400.00
p1 <- ggplot(hotel_train,
       aes(x = lead_time)) +
  geom_histogram(fill = "orange",
                 bins = 100) +
  theme_bw()

p2 <- ggplot(hotel_train,
       aes(x = adr)) +
  geom_histogram(fill = "orange",
                 bins = 300) +
  theme_bw()

multiplot(p1, p2)

We can see that the two continuous variables lead_time and adr are skewed. We will use a Box-Cox transformation to change their distributions. adr has some negative values, for which this transformation cannot be applied. However, adr describes the average daily rate, calculated by dividing the sum of all lodging transactions by the total number of nights stayed, so by definition it should be positive. We will therefore treat the values that are negative or equal to zero as errors and change them to NAs first. For lead_time we will add 1 to all values before the Box-Cox transformation to avoid the problem with zeros.
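
For reference, the Box-Cox transformation maps y to (y^λ − 1)/λ for λ ≠ 0 and to log(y) for λ = 0; it is defined only for positive y, and λ is estimated so that the transformed variable is as close to normal as possible. The boxcox() function from bestNormalize additionally standardizes the result, as the output below shows.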

# lead_time:
box <- boxcox(hotel_train$lead_time + 1)

box
## Standardized Box Cox Transformation with 83573 nonmissing obs.:
##  Estimated statistics:
##  - lambda = 0.2776842 
##  - mean (before standardization) = 7.822438 
##  - sd (before standardization) = 4.257422
hotel_train$lead_time_box <- box$x.t

p3 <- ggplot(hotel_train,
       aes(x = lead_time_box)) +
  geom_histogram(fill = "orange",
                 bins = 100) +
  theme_bw()

# adr:
hotel_train[hotel_train$adr <=0,]$adr <- NA

box2 <- boxcox(hotel_train$adr)

box2
## Standardized Box Cox Transformation with 82189 nonmissing obs.:
##  Estimated statistics:
##  - lambda = 0.2807637 
##  - mean (before standardization) = 9.285888 
##  - sd (before standardization) = 1.627004
hotel_train$adr_box <- box2$x.t


p4 <- ggplot(hotel_train,
       aes(x = adr_box)) +
  geom_histogram(fill = "orange",
                 bins = 300) +
  theme_bw()

In both cases the lambda parameter is similar, close to 0.28. The distributions of the transformed variables in the training sample are presented below. For lead_time the result is perhaps not perfect, but we will see later how it affects the relation with our target variable.

multiplot(p3, p4)

Now that we have the Box-Cox transformation objects fitted on the training data, let’s apply those transformations to the whole dataset, once again remembering about the non-positive values of adr.

# lead_time with lambda value obtained on the training sample
hotel$lead_time_box <- predict(box, newdata = hotel$lead_time + 1)

# adr with lambda value obtained on the training sample
hotel[hotel$adr <=0,]$adr <- NA

hotel$adr_box <- predict(box2, newdata = hotel$adr)

Grouping and other preparations

Many variables in this dataset are count data, where the vast majority of observations take the first few possible values and only a handful take larger ones. Because of that, we will group those higher values into one common level.

The variables stays_in_week_nights and stays_in_weekend_nights, as well as children and babies, will be merged into days and kids respectively, so that we have fewer predictors without losing much of their predictive power. kids will also have its levels grouped in the same way as explained before.

Finally, we will make sure that all qualitative variables are coded as factors and we will order the ordinal ones. To avoid potential problems with models, we will also rename the ordered factor variables so that their level labels don’t start with a number.

# is_canceled
hotel$is_canceled <- as.factor(hotel$is_canceled)
hotel$is_canceled <- recode(hotel$is_canceled, "0" = "No", "1" = "Yes")
hotel$is_canceled <- factor(hotel$is_canceled, levels = c("Yes", "No"))

# arrival date
hotel$arrival_date_year <- factor(hotel$arrival_date_year, order = T, levels = c("2015", "2016", "2017"))

# days
hotel$days <- hotel$stays_in_week_nights + hotel$stays_in_weekend_nights
hotel[hotel$days == 0,]$days <- NA

# adults
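# note: inside ifelse() the factor is reduced to its underlying level codes
# (value 0 -> "1", 1 -> "2", 2 -> "3"), hence the recoding below maps
# "1"/"2"/"3" to A0/A1/A2; the same trick is reused for requests further down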
hotel$adults <- ifelse(hotel$adults > 2, "A3+", as.factor(hotel$adults))
hotel$adults <- recode_factor(hotel$adults, "1" = "A0", "2" = "A1", "3" = "A2")
hotel$adults <- factor(hotel$adults, order = T, levels = c("A0", "A1", "A2", "A3+"))

# kids
sum(hotel_train$children>0 | hotel_train$babies>0, na.rm = T)
## [1] 6675
hotel$kids <- ifelse(hotel$children>0 | hotel$babies>0, "yes", 'no')
hotel$kids <- as.factor(hotel$kids)

# meal
hotel[hotel$meal == "Undefined",]$meal <- NA 
hotel$meal <- recode_factor(hotel$meal, "FB" = "FB/HB", "HB" = "FB/HB")
hotel$meal <- factor(hotel$meal, order = T, levels = c("SC", "BB", "FB/HB"))

# distribution_channel
hotel[hotel$distribution_channel == "GDS",]$distribution_channel <- "TA/TO" 
hotel[hotel$distribution_channel == "Undefined",]$distribution_channel <- NA 
hotel$distribution_channel <- droplevels(hotel$distribution_channel)

# is_repeated_guest
hotel$is_repeated_guest <- as.factor(hotel$is_repeated_guest)

# previous cancellations
hotel$canc <- ifelse(hotel$previous_cancellations > 0, 'yes', 'no')
hotel$canc <- as.factor(hotel$canc)


# parking space
hotel$parking <- ifelse(hotel$required_car_parking_spaces > 0, 'yes', 'no')
hotel$parking <- as.factor(hotel$parking)

# special requests
hotel$requests <- ifelse(hotel$total_of_special_requests > 2, "A3+", as.factor(hotel$total_of_special_requests))
hotel$requests <- recode_factor(hotel$requests, "1" = "A0", "2" = "A1", "3" = "A2")

Now that everything has been transformed on the whole dataset, we can easily apply those changes to the train and test samples by once again assigning the same observations to each sample.

hotel_train <- hotel[training_obs,]
hotel_test  <- hotel[-training_obs,]

hotel_train %>%
  is.na() %>%
  colSums() %>%
  sort()
##                          hotel                    is_canceled 
##                              0                              0 
##                      lead_time              arrival_date_year 
##                              0                              0 
##             arrival_date_month       arrival_date_week_number 
##                              0                              0 
##      arrival_date_day_of_month        stays_in_weekend_nights 
##                              0                              0 
##           stays_in_week_nights                         adults 
##                              0                              0 
##                         babies                        country 
##                              0                              0 
##                 market_segment              is_repeated_guest 
##                              0                              0 
##         previous_cancellations previous_bookings_not_canceled 
##                              0                              0 
##             reserved_room_type             assigned_room_type 
##                              0                              0 
##                booking_changes                   deposit_type 
##                              0                              0 
##                          agent                        company 
##                              0                              0 
##           days_in_waiting_list                  customer_type 
##                              0                              0 
##    required_car_parking_spaces      total_of_special_requests 
##                              0                              0 
##             reservation_status        reservation_status_date 
##                              0                              0 
##                  lead_time_box                           canc 
##                              0                              0 
##                        parking                       requests 
##                              0                              0 
##                       children                           kids 
##                              3                              3 
##           distribution_channel                           days 
##                              4                            500 
##                           meal                            adr 
##                            824                           1384 
##                        adr_box 
##                           1384
hotel_test %>%
  is.na() %>%
  colSums() %>%
  sort()
##                          hotel                    is_canceled 
##                              0                              0 
##                      lead_time              arrival_date_year 
##                              0                              0 
##             arrival_date_month       arrival_date_week_number 
##                              0                              0 
##      arrival_date_day_of_month        stays_in_weekend_nights 
##                              0                              0 
##           stays_in_week_nights                         adults 
##                              0                              0 
##                         babies                        country 
##                              0                              0 
##                 market_segment              is_repeated_guest 
##                              0                              0 
##         previous_cancellations previous_bookings_not_canceled 
##                              0                              0 
##             reserved_room_type             assigned_room_type 
##                              0                              0 
##                booking_changes                   deposit_type 
##                              0                              0 
##                          agent                        company 
##                              0                              0 
##           days_in_waiting_list                  customer_type 
##                              0                              0 
##    required_car_parking_spaces      total_of_special_requests 
##                              0                              0 
##             reservation_status        reservation_status_date 
##                              0                              0 
##                  lead_time_box                           canc 
##                              0                              0 
##                        parking                       requests 
##                              0                              0 
##                       children           distribution_channel 
##                              1                              1 
##                           kids                           days 
##                              1                            215 
##                           meal                            adr 
##                            345                            576 
##                        adr_box 
##                            576

Now we can also deal with the missing data. There aren’t many missing values compared to the size of the dataset, so we will simply remove those observations from the analysis, especially as they come mostly from the erroneous values removed earlier.

hotel_train <- na.omit(hotel_train)

hotel_test <- na.omit(hotel_test)
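
A quick sanity check (a minimal sketch) confirms that no missing values remain in either sample:

anyNA(hotel_train)
anyNA(hotel_test)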

Variable selection

After the transformations, the descriptive statistics of the numerical variables are as follows.

numeric_vars <- sapply(hotel_train, is.numeric) %>%  which() %>%  names()

stargazer(
  hotel_train[, numeric_vars], type = "text", 
  summary.stat = c("min", "p25", "median", "p75", "max", "mean", "sd")
)
## 
## ======================================================================================
## Statistic                       Min   Pctl(25) Median Pctl(75)  Max    Mean   St. Dev.
## --------------------------------------------------------------------------------------
## lead_time                        0       19      71     162     709   105.477 107.127 
## arrival_date_week_number         1       16      28      38      53   27.219   13.496 
## arrival_date_day_of_month        1       8       16      23      31   15.757   8.788  
## stays_in_weekend_nights          0       0       1       2       19    0.938   0.998  
## stays_in_week_nights             0       1       2       3       50    2.518   1.899  
## children                         0       0       0       0       10    0.107   0.405  
## babies                           0       0       0       0       10    0.008   0.102  
## previous_cancellations           0       0       0       0       26    0.086   0.842  
## previous_bookings_not_canceled   0       0       0       0       72    0.126   1.449  
## booking_changes                  0       0       0       0       18    0.215   0.626  
## days_in_waiting_list             0       0       0       0      391    2.251   17.504 
## adr                              0       71      95     126    5,400  103.694  50.199 
## required_car_parking_spaces      0       0       0       0       8     0.063   0.245  
## total_of_special_requests        0       0       0       1       5     0.577   0.795  
## lead_time_box                  -1.837  -0.740  0.090   0.797   2.553   0.016   0.995  
## adr_box                        -6.397  -0.652  -0.034  0.614   16.549  0.002   0.999  
## days                             1       2       3       4       69    3.456   2.543  
## --------------------------------------------------------------------------------------

Correlations

In the process of variable selection we will first look at the correlations between potential predictors among the continuous variables.

numeric_vars <- sapply(hotel_train, is.numeric) %>%  which() %>%  names()

correlations <-   cor(hotel_train[,numeric_vars],
                              use = "pairwise.complete.obs")

corrplot(correlations)

There is almost no correlation between the majority of variables, which is generally a good thing. The only exceptions are variables we created from existing ones, like adr and adr_box, or the lengths of stay on week and weekend nights and days. Those two original variables concerning the length of stay are also correlated with each other, but we have in a way already solved this problem by using them to create a new predictor.
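
Highly correlated pairs can also be flagged programmatically with caret’s findCorrelation; a minimal sketch (the 0.8 cutoff is an arbitrary choice for illustration):

findCorrelation(correlations, cutoff = 0.8, names = TRUE)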

ANOVA

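# helper: returns the F statistic (column 4 of the ANOVA summary table)
# from a one-way ANOVA of a numeric variable against the binary target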
anova <- function(numeric_var) {
  anova_ <- aov(hotel_train[[numeric_var]] ~ 
                  hotel_train$is_canceled) 
  
  return(summary(anova_)[[1]][1, 4])
}

sapply(numeric_vars,
       anova) %>% 
  sort(decreasing = TRUE) -> anova_all

anova_all
##                  lead_time_box                      lead_time 
##                    9506.962099                    7636.312969 
##      total_of_special_requests    required_car_parking_spaces 
##                    5029.214138                    3363.168359 
##                booking_changes         previous_cancellations 
##                    1757.466498                     995.675062 
## previous_bookings_not_canceled           days_in_waiting_list 
##                     261.302358                     250.447236 
##                        adr_box                         babies 
##                     167.290462                      93.903786 
##                            adr           stays_in_week_nights 
##                      88.013989                      37.886082 
##                           days       arrival_date_week_number 
##                      17.429446                      10.182337 
##      arrival_date_day_of_month                       children 
##                       3.731515                       1.277450 
##        stays_in_weekend_nights 
##                       1.146784

The majority of variables have pretty high F statistics, which means they should have a significant effect on the dependent variable. It is a good sign that the Box-Cox transformations increased the F statistics in both cases, so we will use the transformed versions of those predictors. Some predictors, like the day of the month or the week number, may not have a strong effect on booking cancellations, so we may reduce the number of variables, especially those which don’t make much sense.

Cramer’s V

DescTools::CramerV(hotel_train$is_canceled,
                   hotel_train$hotel)
## [1] 0.1374975
DescTools::CramerV(hotel_train$is_canceled,
                   hotel_train$arrival_date_month)
## [1] 0.07273162
DescTools::CramerV(hotel_train$is_canceled,
                   hotel_train$arrival_date_year)
## [1] 0.02471176
DescTools::CramerV(hotel_train$is_canceled,
                   hotel_train$adults)
## [1] 0.08177464
DescTools::CramerV(hotel_train$is_canceled,
                   hotel_train$meal)
## [1] 0.009971023
DescTools::CramerV(hotel_train$is_canceled,
                   hotel_train$distribution_channel)
## [1] 0.1725203
DescTools::CramerV(hotel_train$is_canceled,
                   hotel_train$is_repeated_guest)
## [1] 0.0763397
DescTools::CramerV(hotel_train$is_canceled,
                   hotel_train$deposit_type)
## [1] 0.4792627
DescTools::CramerV(hotel_train$is_canceled,
                   hotel_train$customer_type)
## [1] 0.1344559
DescTools::CramerV(hotel_train$is_canceled,
                   hotel_train$kids)
## [1] 0.01462105
DescTools::CramerV(hotel_train$is_canceled,
                   hotel_train$canc)
## [1] 0.2708739
DescTools::CramerV(hotel_train$is_canceled,
                   hotel_train$parking)
## [1] 0.2007072
DescTools::CramerV(hotel_train$is_canceled,
                   hotel_train$requests)
## [1] 0.273077

Looking at the Cramer’s V coefficients between the target variable and the qualitative predictors computed above, some of them are not impressive, including the arrival date details or the type of meal. On the other hand, some results are consistent with expectations, e.g. the high coefficient for the deposit type.
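
The calls above could also be condensed into a single vectorised step; a minimal sketch, assuming all the listed predictors exist in hotel_train:

qual_vars <- c("hotel", "arrival_date_month", "arrival_date_year", "adults",
               "meal", "distribution_channel", "is_repeated_guest",
               "deposit_type", "customer_type", "kids", "canc",
               "parking", "requests")

# Cramer's V of each qualitative predictor against the target, sorted
sapply(qual_vars,
       function(v) DescTools::CramerV(hotel_train$is_canceled,
                                      hotel_train[[v]])) %>% 
  sort(decreasing = TRUE)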

Variance

Finally we will analyse the variance of the variables.

nearZeroVar(hotel_train, saveMetrics = TRUE)
##                                 freqRatio percentUnique zeroVar   nzv
## hotel                            2.034299   0.002457335   FALSE FALSE
## is_canceled                      1.652231   0.002457335   FALSE FALSE
## lead_time                        1.797619   0.581159616   FALSE FALSE
## arrival_date_year                1.392310   0.003686002   FALSE FALSE
## arrival_date_month               1.099726   0.014744007   FALSE FALSE
## arrival_date_week_number         1.150000   0.065119365   FALSE FALSE
## arrival_date_day_of_month        1.033642   0.038088685   FALSE FALSE
## stays_in_weekend_nights          1.523727   0.018430009   FALSE FALSE
## stays_in_week_nights             1.109905   0.038088685   FALSE FALSE
## adults                           4.003644   0.004914669   FALSE FALSE
## children                        22.109155   0.006143336   FALSE  TRUE
## babies                         131.747145   0.006143336   FALSE  TRUE
## meal                             6.024735   0.003686002   FALSE FALSE
## country                          3.887398   0.213788104   FALSE FALSE
## market_segment                   2.388117   0.008600671   FALSE FALSE
## distribution_channel             7.083482   0.003686002   FALSE FALSE
## is_repeated_guest               34.006022   0.002457335   FALSE  TRUE
## previous_cancellations          18.325631   0.017201342   FALSE FALSE
## previous_bookings_not_canceled  76.793003   0.074948703   FALSE  TRUE
## reserved_room_type               4.426225   0.011058005   FALSE FALSE
## assigned_room_type               2.901870   0.013515340   FALSE FALSE
## booking_changes                  8.196429   0.019658676   FALSE FALSE
## deposit_type                     7.043547   0.003686002   FALSE FALSE
## agent                            2.112131   0.383344187   FALSE FALSE
## company                        122.651274   0.379658185   FALSE  TRUE
## days_in_waiting_list           484.147239   0.144982737   FALSE  TRUE
## customer_type                    3.675681   0.004914669   FALSE FALSE
## adr                              1.376190   9.136369780   FALSE FALSE
## required_car_parking_spaces     15.089974   0.006143336   FALSE FALSE
## total_of_special_requests        2.077581   0.007372004   FALSE FALSE
## reservation_status               1.696911   0.003686002   FALSE FALSE
## reservation_status_date          1.753982   1.121773213   FALSE FALSE
## lead_time_box                    1.797619   0.581159616   FALSE FALSE
## adr_box                          1.376190   9.136369780   FALSE FALSE
## days                             1.022599   0.046689356   FALSE FALSE
## kids                            11.429597   0.002457335   FALSE FALSE
## canc                            17.256842   0.002457335   FALSE FALSE
## parking                         15.024611   0.002457335   FALSE FALSE
## requests                         2.077581   0.004914669   FALSE FALSE

There are some variables with variance close to zero. For some of them (is_repeated_guest, previous_bookings_not_canceled, days_in_waiting_list…) we didn’t even try to merge levels because a single level dominates, and we won’t use them, but in other cases we successfully eliminated this problem, for example by creating the kids predictor from children and babies.
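
The flagged variables can be extracted directly from the saved metrics; a minimal sketch:

nzv <- nearZeroVar(hotel_train, saveMetrics = TRUE)
rownames(nzv)[nzv$nzv]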

Selection summary

Taking into consideration all the information obtained about the variables, we decided to keep 15 of them. We will use them as predictors to train the models and will define a common formula to be used later.

model_formula <- is_canceled ~ hotel + arrival_date_month + adults + meal + 
  distribution_channel + booking_changes + deposit_type + customer_type + 
  lead_time_box + kids + canc + adr_box + parking + requests + days

Modelling

Classification trees

We will start with classification trees. They are very similar to regression trees; for classification, the most frequent class in a terminal node is used as the prediction of the target variable. We will begin the modelling with a default tree from the rpart function.

tree1 <- 
  rpart(model_formula,
        data = hotel_train,
        method = "class")

fancyRpartPlot(tree1, cex = 0.6, palettes = 'OrRd')

tree1_pred  <- predict(tree1,  hotel_train)

tree1_roc  <- roc(as.numeric(hotel_train$is_canceled == "Yes"), 
                        tree1_pred[, 1])

cat("Gini = ", 2 * tree1_roc$auc - 1,  "\n", sep = "")
## Gini = 0.6294704

We end up with 9 terminal nodes. Some of them contain only a small fraction of all observations from the training sample. The Gini coefficient reported above, equal to 62.9%, is simply a rescaled AUC (Gini = 2·AUC − 1).

barplot(rev(tree1$variable.importance),
        col = "#E15D44", 
        main = "Imporatnce of variables for tree1",
        horiz = T,  
        las = 1,   
        cex.names = 0.6)

The main advantage of classification trees is that not only can we see which variables contributed to the splits, but we can also easily compute variable importance. One such measure, based on the quality of the splits, is kept in the model object, so it’s quickly accessible. The differences in variable importance are huge for this tree: deposit_type is far ahead of any other predictor, while adults and meal seem to have hardly any importance at all.
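
If numbers are preferred over the chart, the stored measure can also be printed directly:

round(tree1$variable.importance, 1)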

We can try to limit the splits with additional parameters so that the nodes become bigger (minsplit sets the minimum node size required to attempt a split and maxdepth the maximum depth of the tree). The tree may then become more universal and robust to changes in the data.

tree2 <- 
  rpart(model_formula,
        data = hotel_train,
        method = "class",
        minsplit = 10000,
        maxdepth = 20)


fancyRpartPlot(tree2, cex = 0.6, palettes = 'OrRd')

tree2_pred  <- predict(tree2,  hotel_train)

tree2_roc  <- roc(as.numeric(hotel_train$is_canceled == "Yes"), 
                  tree2_pred[, 1])

cat("Gini = ", 2 * tree2_roc$auc - 1,  "\n", sep = "")
## Gini = 0.5869887

We reduced the number of terminal nodes to seven, obtaining more numerous groups in some cases. Unfortunately, the Gini coefficient is significantly lower than for the previous tree.

barplot(rev(tree2$variable.importance),
        col = "#E15D44", 
        main = "Imporatnce of variables for tree2",
        horiz = T,  
        las = 1,   
        cex.names = 0.6)

As a result of the lower number of splits, some variables lost their importance compared to the previous tree, and some changes in the order of importance are visible.

For the next tree we will try to find the optimal complexity parameter. To do this we will use 10-fold cross-validation.

# setting control
cv_10 <- trainControl(method = "cv",
                   number = 10, 
                   classProbs = TRUE,
                   summaryFunction = twoClassSummary)

cp.grid <- expand.grid(cp = seq(0, 0.01, 0.0005))

# training
set.seed(123456789)
tree3 <- 
  train(model_formula,
        data = hotel_train, 
        method = "rpart", 
        metric = "ROC",
        trControl = cv_10,
        tuneGrid  = cp.grid)

plot(tree3)

Without doubt, 0 is the best value of the complexity parameter; for any positive value, the AUC obtained during cross-validation is much lower.

tree3_pred  <- predict(tree3,  hotel_train, type = 'prob')

tree3_roc  <- roc(as.numeric(hotel_train$is_canceled == "Yes"), 
                  tree3_pred[, 1])

cat("Gini = ", 2 * tree3_roc$auc - 1,  "\n", sep = "")
## Gini = 0.8657437

This time the Gini coefficient is much higher than for the previous trees. It seems that we managed to improve our model by changing the complexity parameter.

#train
list(
  tree1_roc  = tree1_roc,
  tree2_roc = tree2_roc,
  tree3_roc = tree3_roc
) %>%
  pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) + 
  geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), 
               color = "grey", 
               linetype = "dashed") +
  labs(subtitle = paste0("Gini TRAIN: ",
                         "tree1 = ", 
                         round(100*(2 * auc(tree1_roc) - 1), 1), "%, ",
                         "tree2 = ", 
                         round(100*(2 * auc(tree2_roc) - 1), 1), "% ",
                         "tree3 = ", 
                         round(100*(2 * auc(tree3_roc) - 1), 1), "% ")) +
  theme_bw() + coord_fixed() +
  scale_color_manual(values = RColorBrewer::brewer.pal(n = 4, 
                                                       name = "Dark2")[c(1:3)])

The graphical representation of the ROC curves confirms our conclusions. The difference between the first two models and the third one is surprisingly big. It is a perfect example of how even a single parameter change can result in completely different predictions.

#test
tree1_test_pred  <- predict(tree1,  hotel_test)

tree1_test_roc  <- roc(as.numeric(hotel_test$is_canceled == "Yes"), 
                       tree1_test_pred[, 1])

tree2_test_pred  <- predict(tree2,  hotel_test)

tree2_test_roc  <- roc(as.numeric(hotel_test$is_canceled == "Yes"), 
                       tree2_test_pred[, 1])

tree3_test_pred  <- predict(tree3,  hotel_test, type = 'prob')

tree3_test_roc  <- roc(as.numeric(hotel_test$is_canceled == "Yes"), 
                       tree3_test_pred[, 1])


list(
  tree1_test_roc  = tree1_test_roc,
  tree2_test_roc = tree2_test_roc,
  tree3_test_roc = tree3_test_roc
) %>%
  pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) + 
  geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), 
               color = "grey", 
               linetype = "dashed") +
  labs(subtitle = paste0("Gini TEST: ",
                         "tree1 = ", 
                         round(100*(2 * auc(tree1_test_roc) - 1), 1), "%, ",
                         "tree2 = ", 
                         round(100*(2 * auc(tree2_test_roc) - 1), 1), "% ",
                         "tree3 = ", 
                         round(100*(2 * auc(tree3_test_roc) - 1), 1), "% ")) +
  theme_bw() + coord_fixed() +
  scale_color_manual(values = RColorBrewer::brewer.pal(n = 4, 
                                                       name = "Dark2")[c(1:3)])

The results on the test sample are ordered similarly, although the Gini coefficient of the third tree is much lower this time. It still makes the best predictions, but some overfitting has clearly appeared. On the other hand, the first and second trees have almost the same quality of predictions as on the training sample.

Subagging of trees

The next model will be created by subagging classification trees. It is a variant of bagging in which a number of independent trees are trained on subsamples drawn without replacement from the training data; the predictions are then based on the most frequent class voted by those trees. We will grow 17 trees, each with the complexity parameter equal to zero, so every tree will use the settings we found optimal in the previous modelling steps.

if(0) {results_sub <- list()

n <- nrow(hotel_train)

for (sample in 1:17) {
  message(sample)
  set.seed(1234 + sample)
  data_sample <- 
    hotel_train[sample(x = 1:n, 
                         size = n/2,
                         replace = FALSE),]
  results_sub[[sample]] <- rpart(model_formula,
                                 data = data_sample, # train on the subsample
                                 method = "class",
                                 cp = 0)
  rm(data_sample)
  
}

saveRDS(object = results_sub,
            file   = here("output", "results_sub.rds"))
}

results_sub <- readRDS(here("output", "results_sub.rds"))


sub_pred <- 
  sapply(results_sub,
         function(x) 
           predict(object = x,
                   newdata = hotel_train,
                   type = 'prob')[,1])


hist(rowSums(sub_pred < 0.5), 
     breaks = 0:17,
     main = "Frequency of votes for single observations")

Based on the frequency of votes we can see that the results are rather decisive; for most observations nearly all trees agree.
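
One way to quantify this is the share of observations on which all 17 trees vote the same way (a quick sketch):

mean(rowSums(sub_pred < 0.5) %in% c(0, 17))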

sub_pred_res <-
  ifelse(rowSums(sub_pred < 0.5) > 17/2,
         "No", "Yes") %>% 
  factor(., levels = c("Yes", "No"))


confusionMatrix(data = sub_pred_res,
                reference = hotel_train$is_canceled,
                positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   Yes    No
##        Yes 24624  3686
##        No   6063 47016
##                                          
##                Accuracy : 0.8802         
##                  95% CI : (0.878, 0.8824)
##     No Information Rate : 0.623          
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.7411         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.8024         
##             Specificity : 0.9273         
##          Pos Pred Value : 0.8698         
##          Neg Pred Value : 0.8858         
##              Prevalence : 0.3770         
##          Detection Rate : 0.3025         
##    Detection Prevalence : 0.3478         
##       Balanced Accuracy : 0.8649         
##                                          
##        'Positive' Class : Yes            
## 
sub_pred2 <- rowMeans(sub_pred)

sub_pred_roc <-
  roc(as.numeric(hotel_train$is_canceled == "Yes"), 
      sub_pred2)

cat("Gini = ", 2 * sub_pred_roc$auc - 1,  "\n", sep = "")
## Gini = 0.8740883

The Gini coefficient is a little higher than for the previous tree. The additional measures reported above are also high and confirm that, at least on the training sample, this model produces very good results.

#test
sub_test_pred <- 
  sapply(results_sub,
         function(x) 
           predict(object = x,
                   newdata = hotel_test,
                   type = 'prob')[,1])

sub_test_pred_res <-
  ifelse(rowSums(sub_test_pred < 0.5) > 17/2,
         "No", "Yes") %>% 
  factor(., levels = c("Yes", "No"))

confusionMatrix(data = sub_test_pred_res,
                reference = hotel_test$is_canceled,
                positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   Yes    No
##        Yes  9412  2773
##        No   3624 19096
##                                           
##                Accuracy : 0.8167          
##                  95% CI : (0.8126, 0.8208)
##     No Information Rate : 0.6265          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6032          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7220          
##             Specificity : 0.8732          
##          Pos Pred Value : 0.7724          
##          Neg Pred Value : 0.8405          
##              Prevalence : 0.3735          
##          Detection Rate : 0.2696          
##    Detection Prevalence : 0.3491          
##       Balanced Accuracy : 0.7976          
##                                           
##        'Positive' Class : Yes             
## 
sub_test_pred2 <- rowMeans(sub_test_pred)

sub_test_pred_roc <-
  roc(as.numeric(hotel_test$is_canceled == "Yes"), 
      sub_test_pred2)

cat("Gini = ", 2 * sub_test_pred_roc$auc - 1,  "\n", sep = "")
## Gini = 0.7748227

On the test sample the accuracy and sensitivity are lower than expected. As with the cross-validated tree, the Gini coefficient indicates overfitting, which the subagging approach did not reduce.

XGBoost

In the next step we will apply the XGBoost algorithm. It is a popular boosting technique with multiple tunable parameters: eta is the learning rate, max_depth limits the depth of each tree, min_child_weight sets the minimum sum of instance weights in a leaf, colsample_bytree and subsample are the fractions of columns and rows sampled for each tree, and gamma is the minimum loss reduction required to make a split. We will use 3-fold cross-validation here. Starting from some default values, we’ll optimise the parameters step by step, keeping most of the intermediate models so that we can also compare the changes on the test sample. We will start with max_depth equal to 10 because our data is pretty large, colsample_bytree = 0.26, min_child_weight = 800, subsample = 0.8 and gamma = 1, and begin by searching for the optimal number of trees and learning rate.

cv3 <- trainControl(method = "cv", 
                         number = 3,
                         classProbs = TRUE,
                         summaryFunction = twoClassSummary)

if(0) {
parameters_xgb <- expand.grid(nrounds = seq(20, 120, 10),
                              max_depth = c(10),
                              eta = c(0.2, 0.3, 0.4, 0.5), 
                              gamma = 1,
                              colsample_bytree = c(0.26),
                              min_child_weight = c(800),
                              subsample = 0.8)



set.seed(123456789)
xgb <- train(model_formula,
                     data = hotel_train,
                     method = "xgbTree",
                     trControl = cv3,
                     tuneGrid  = parameters_xgb)

saveRDS(object = xgb,
            file   = here("output", "xgb.rds"))
}

xgb <- readRDS(here("output", "xgb.rds"))

xgb
## eXtreme Gradient Boosting 
## 
## 81389 samples
##    15 predictor
##     2 classes: 'Yes', 'No' 
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold) 
## Summary of sample sizes: 54259, 54259, 54260 
## Resampling results across tuning parameters:
## 
##   eta  nrounds  ROC        Sens       Spec     
##   0.2   20      0.8090920  0.4765536  0.9455840
##   0.2   30      0.8163986  0.5088800  0.9366296
##   0.2   40      0.8199860  0.5355362  0.9228827
##   0.2   50      0.8210700  0.5501678  0.9117982
##   0.2   60      0.8226554  0.5582168  0.9083861
##   0.2   70      0.8235115  0.5637892  0.9058813
##   0.2   80      0.8241339  0.5679604  0.9032582
##   0.2   90      0.8247259  0.5692313  0.9022918
##   0.2  100      0.8256364  0.5707629  0.9018974
##   0.2  110      0.8261797  0.5709258  0.9017594
##   0.2  120      0.8266851  0.5724574  0.9015819
##   0.3   20      0.8193347  0.5110307  0.9368663
##   0.3   30      0.8209637  0.5428357  0.9203580
##   0.3   40      0.8245537  0.5577280  0.9120350
##   0.3   50      0.8252689  0.5617688  0.9093724
##   0.3   60      0.8265035  0.5675042  0.9054870
##   0.3   70      0.8272350  0.5686773  0.9051517
##   0.3   80      0.8280981  0.5710887  0.9049545
##   0.3   90      0.8287677  0.5717079  0.9061379
##   0.3  100      0.8293300  0.5744452  0.9045403
##   0.3  110      0.8296615  0.5731743  0.9042839
##   0.3  120      0.8300857  0.5743800  0.9040078
##   0.4   20      0.8186528  0.5330270  0.9217194
##   0.4   30      0.8229878  0.5592922  0.9082483
##   0.4   40      0.8248488  0.5646039  0.9065127
##   0.4   50      0.8263786  0.5690357  0.9048954
##   0.4   60      0.8276312  0.5708606  0.9035739
##   0.4   70      0.8285538  0.5703718  0.9048559
##   0.4   80      0.8295042  0.5743800  0.9031992
##   0.4   90      0.8299923  0.5742497  0.9036528
##   0.4  100      0.8305197  0.5744126  0.9037711
##   0.4  110      0.8308545  0.5745104  0.9035739
##   0.4  120      0.8314847  0.5764656  0.9026272
##   0.5   20      0.8196316  0.5345586  0.9197863
##   0.5   30      0.8227172  0.5580865  0.9075974
##   0.5   40      0.8245893  0.5636589  0.9052108
##   0.5   50      0.8258704  0.5652882  0.9045995
##   0.5   60      0.8273242  0.5681885  0.9039091
##   0.5   70      0.8288709  0.5693942  0.9041261
##   0.5   80      0.8298985  0.5694920  0.9050531
##   0.5   90      0.8304029  0.5712517  0.9053095
##   0.5  100      0.8313275  0.5703718  0.9056053
##   0.5  110      0.8319734  0.5739238  0.9052108
##   0.5  120      0.8323342  0.5719686  0.9057631
## 
## Tuning parameter 'max_depth' was held constant at a value of 10
## 
## Tuning parameter 'min_child_weight' was held constant at a value of 800
## 
## Tuning parameter 'subsample' was held constant at a value of 0.8
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 120, max_depth = 10, eta
##  = 0.5, gamma = 1, colsample_bytree = 0.26, min_child_weight = 800
##  and subsample = 0.8.
plot(xgb)

The ROC as a function of the number of trees keeps growing for every learning rate, so the more trees we use, the better the results may be. However, 120 trees, the highest number tested, is already a lot, and increasing it further would noticeably slow down the later computations, so let’s stay with 120 trees and a learning rate of 0.5. Next, we’ll tune the maximum depth of a tree and the minimum child weight.

if(0) {parameters_xgb2 <- expand.grid(nrounds = 120,
                              max_depth = seq(6, 18, 2),
                              eta = c(0.5), 
                              gamma = 1,
                              colsample_bytree = c(0.26),
                              min_child_weight = seq(100, 1000, 100),
                              subsample = 0.8)



set.seed(123456789)
xgb2 <- train(model_formula,
             data = hotel_train,
             method = "xgbTree",
             trControl = cv3,
             tuneGrid  = parameters_xgb2)


saveRDS(object = xgb2,
            file   = here("output", "xgb2.rds"))
}

xgb2 <- readRDS(here("output", "xgb2.rds"))

xgb2
## eXtreme Gradient Boosting 
## 
## 81389 samples
##    15 predictor
##     2 classes: 'Yes', 'No' 
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold) 
## Summary of sample sizes: 54259, 54259, 54260 
## Resampling results across tuning parameters:
## 
##   max_depth  min_child_weight  ROC        Sens       Spec     
##    6          100              0.8805259  0.6461368  0.9177744
##    6          200              0.8738655  0.6246619  0.9205160
##    6          300              0.8632513  0.6140711  0.9140074
##    6          400              0.8535837  0.5971584  0.9139285
##    6          500              0.8506938  0.5968325  0.9105163
##    6          600              0.8422932  0.5825920  0.9119759
##    6          700              0.8342798  0.5832437  0.9018382
##    6          800              0.8314463  0.5774758  0.9027258
##    6          900              0.8274910  0.5795614  0.8952900
##    6         1000              0.8235973  0.5657770  0.9004181
##    8          100              0.8823585  0.6537622  0.9160782
##    8          200              0.8739957  0.6273666  0.9174392
##    8          300              0.8652073  0.6148532  0.9159797
##    8          400              0.8567313  0.6063806  0.9152499
##    8          500              0.8517939  0.5925310  0.9148949
##    8          600              0.8432196  0.5851338  0.9120351
##    8          700              0.8345534  0.5768241  0.9074790
##    8          800              0.8323048  0.5698178  0.9078931
##    8          900              0.8277140  0.5700459  0.9001025
##    8         1000              0.8242032  0.5703718  0.8968286
##   10          100              0.8827999  0.6518395  0.9170052
##   10          200              0.8754674  0.6344054  0.9167095
##   10          300              0.8607888  0.6117900  0.9133762
##   10          400              0.8555159  0.6023723  0.9153682
##   10          500              0.8455640  0.5865024  0.9137312
##   10          600              0.8367524  0.5841887  0.9032189
##   10          700              0.8346780  0.5762049  0.9096487
##   10          800              0.8327760  0.5764330  0.9049544
##   10          900              0.8308141  0.5703718  0.9083865
##   10         1000              0.8236702  0.5715124  0.8941856
##   12          100              0.8848718  0.6631798  0.9129225
##   12          200              0.8739660  0.6300388  0.9165122
##   12          300              0.8652702  0.6114316  0.9152499
##   12          400              0.8584400  0.6033500  0.9149540
##   12          500              0.8487258  0.5950077  0.9121928
##   12          600              0.8418636  0.5894027  0.9068479
##   12          700              0.8342570  0.5740867  0.9079522
##   12          800              0.8312461  0.5776387  0.9013450
##   12          900              0.8298294  0.5678300  0.9081691
##   12         1000              0.8229520  0.5723270  0.8962171
##   14          100              0.8841096  0.6623326  0.9148948
##   14          200              0.8743949  0.6338515  0.9179913
##   14          300              0.8588259  0.6054029  0.9142835
##   14          400              0.8584950  0.6042624  0.9166306
##   14          500              0.8434333  0.5854271  0.9099641
##   14          600              0.8419050  0.5906084  0.9041260
##   14          700              0.8347057  0.5806042  0.9048558
##   14          800              0.8305610  0.5757813  0.9020947
##   14          900              0.8291469  0.5691009  0.9055657
##   14         1000              0.8253908  0.5494835  0.9129818
##   16          100              0.8845089  0.6605403  0.9135142
##   16          200              0.8767912  0.6357415  0.9208710
##   16          300              0.8644948  0.6142340  0.9169066
##   16          400              0.8574375  0.6063154  0.9167095
##   16          500              0.8489491  0.5893049  0.9143033
##   16          600              0.8402894  0.5850360  0.9075580
##   16          700              0.8353135  0.5778343  0.9081495
##   16          800              0.8316911  0.5753251  0.9024298
##   16          900              0.8267627  0.5737935  0.8992940
##   16         1000              0.8253833  0.5668524  0.9021931
##   18          100              0.8836314  0.6579333  0.9142637
##   18          200              0.8746455  0.6325480  0.9182873
##   18          300              0.8663147  0.6163196  0.9164136
##   18          400              0.8562308  0.5993092  0.9153091
##   18          500              0.8511864  0.5966044  0.9115419
##   18          600              0.8425807  0.5891094  0.9064140
##   18          700              0.8347018  0.5783557  0.9054476
##   18          800              0.8325570  0.5746407  0.9063943
##   18          900              0.8280188  0.5733047  0.9033173
##   18         1000              0.8251337  0.5624532  0.9079526
## 
## Tuning parameter 'nrounds' was held constant at a value of 120
## Tuning
##  'colsample_bytree' was held constant at a value of 0.26
## Tuning
##  parameter 'subsample' was held constant at a value of 0.8
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 120, max_depth = 12, eta
##  = 0.5, gamma = 1, colsample_bytree = 0.26, min_child_weight = 100
##  and subsample = 0.8.
plot(xgb2)

We obtained max_depth = 12 and min_child_weight = 100. Because the second parameter lies on the lower bound of the values we checked, let’s optimise it more precisely.

if(0) {parameters_xgb3 <- expand.grid(nrounds = 120,
                               max_depth = 12,
                               eta = c(0.5), 
                               gamma = 1,
                               colsample_bytree = c(0.26),
                               min_child_weight = seq(50, 150, 10),
                               subsample = 0.8)



set.seed(123456789)
xgb3 <- train(model_formula,
              data = hotel_train,
              method = "xgbTree",
              trControl = cv3,
              tuneGrid  = parameters_xgb3)


saveRDS(object = xgb3,
            file   = here("output", "xgb3.rds"))
}

xgb3 <- readRDS(here("output", "xgb3.rds"))

xgb3
## eXtreme Gradient Boosting 
## 
## 81389 samples
##    15 predictor
##     2 classes: 'Yes', 'No' 
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold) 
## Summary of sample sizes: 54259, 54259, 54260 
## Resampling results across tuning parameters:
## 
##   min_child_weight  ROC        Sens       Spec     
##    50               0.8895442  0.6761821  0.9095893
##    60               0.8884130  0.6746831  0.9110686
##    70               0.8870180  0.6709030  0.9104177
##    80               0.8850066  0.6617460  0.9134749
##    90               0.8849588  0.6620067  0.9146582
##   100               0.8844831  0.6611269  0.9119364
##   110               0.8836660  0.6590087  0.9142440
##   120               0.8822343  0.6543813  0.9160586
##   130               0.8814016  0.6492000  0.9174589
##   140               0.8792692  0.6452569  0.9180900
##   150               0.8806423  0.6508619  0.9175772
## 
## Tuning parameter 'nrounds' was held constant at a value of 120
## Tuning
##  'colsample_bytree' was held constant at a value of 0.26
## Tuning
##  parameter 'subsample' was held constant at a value of 0.8
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 120, max_depth = 12, eta
##  = 0.5, gamma = 1, colsample_bytree = 0.26, min_child_weight = 50 and
##  subsample = 0.8.
plot(xgb3)

Again we get the smallest value from our grid, 50. We will leave it at this level because it is already very low. Now it’s time to check the optimal value of colsample_bytree.

if(0) {parameters_xgb4 <- expand.grid(nrounds = 120,
                               max_depth = 12,
                               eta = c(0.5), 
                               gamma = 1,
                               colsample_bytree = seq(0.2, 0.8, 0.1),
                               min_child_weight = 50,
                               subsample = 0.8)



set.seed(123456789)
xgb4 <- train(model_formula,
              data = hotel_train,
              method = "xgbTree",
              trControl = cv3,
              tuneGrid  = parameters_xgb4)


saveRDS(object = xgb4,
            file   = here("output", "xgb4.rds"))
}

xgb4 <- readRDS(here("output", "xgb4.rds"))

xgb4
## eXtreme Gradient Boosting 
## 
## 81389 samples
##    15 predictor
##     2 classes: 'Yes', 'No' 
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold) 
## Summary of sample sizes: 54259, 54259, 54260 
## Resampling results across tuning parameters:
## 
##   colsample_bytree  ROC        Sens       Spec     
##   0.2               0.8857069  0.6654935  0.9104769
##   0.3               0.8911128  0.6813960  0.9083862
##   0.4               0.8919072  0.6894125  0.9067690
##   0.5               0.8936424  0.6934207  0.9059012
##   0.6               0.8936666  0.6972985  0.9025482
##   0.7               0.8952262  0.6986672  0.9042641
##   0.8               0.8951719  0.6983087  0.9037513
## 
## Tuning parameter 'nrounds' was held constant at a value of 120
## Tuning
##  'min_child_weight' was held constant at a value of 50
## Tuning
##  parameter 'subsample' was held constant at a value of 0.8
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 120, max_depth = 12, eta
##  = 0.5, gamma = 1, colsample_bytree = 0.7, min_child_weight = 50 and
##  subsample = 0.8.
plot(xgb4)

With the best result at 0.7, the last parameter we’ll check is subsample.

if(0) {parameters_xgb5 <- expand.grid(nrounds = 120,
                               max_depth = 12,
                               eta = c(0.5), 
                               gamma = 1,
                               colsample_bytree = 0.7,
                               min_child_weight = 50,
                               subsample = seq(0.5, 0.9, 0.1))



set.seed(123456789)
xgb5 <- train(model_formula,
              data = hotel_train,
              method = "xgbTree",
              trControl = cv3,
              tuneGrid  = parameters_xgb5)


saveRDS(object = xgb5,
            file   = here("output", "xgb5.rds"))
}

xgb5 <- readRDS(here("output", "xgb5.rds"))

xgb5
## eXtreme Gradient Boosting 
## 
## 81389 samples
##    15 predictor
##     2 classes: 'Yes', 'No' 
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold) 
## Summary of sample sizes: 54259, 54259, 54260 
## Resampling results across tuning parameters:
## 
##   subsample  ROC        Sens       Spec     
##   0.5        0.8885252  0.6907159  0.8996489
##   0.6        0.8914834  0.6984065  0.9002998
##   0.7        0.8928361  0.6946590  0.9027652
##   0.8        0.8951797  0.6976570  0.9037119
##   0.9        0.8962220  0.6994493  0.9056053
## 
## Tuning parameter 'nrounds' was held constant at a value of 120
## Tuning
##  'colsample_bytree' was held constant at a value of 0.7
## Tuning
##  parameter 'min_child_weight' was held constant at a value of 50
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 120, max_depth = 12, eta
##  = 0.5, gamma = 1, colsample_bytree = 0.7, min_child_weight = 50 and
##  subsample = 0.9.
plot(xgb5)

Finally, the subsample fraction will be set to 0.9. With all those parameters tuned, we can also simultaneously increase the number of trees and lower the learning rate; this way we should obtain a more stable final model. Models 6 to 8 are the effect of gradual changes in these two parameters.

if(0) {parameters_xgb6 <- expand.grid(nrounds = 250,
                               max_depth = 12,
                               eta = 0.2, 
                               gamma = 1,
                               colsample_bytree = 0.7,
                               min_child_weight = 50,
                               subsample = 0.9)



set.seed(123456789)
xgb6 <- train(model_formula,
              data = hotel_train,
              method = "xgbTree",
              trControl = cv3,
              tuneGrid  = parameters_xgb6)


saveRDS(object = xgb6,
            file   = here("output", "xgb6.rds"))
}

xgb6 <- readRDS(here("output", "xgb6.rds"))

xgb6
## eXtreme Gradient Boosting 
## 
## 81389 samples
##    15 predictor
##     2 classes: 'Yes', 'No' 
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold) 
## Summary of sample sizes: 54259, 54259, 54260 
## Resampling results:
## 
##   ROC        Sens       Spec     
##   0.8985711  0.6929319  0.9128634
## 
## Tuning parameter 'nrounds' was held constant at a value of 250
## Tuning
##  parameter 'min_child_weight' was held constant at a value of 50
## 
## Tuning parameter 'subsample' was held constant at a value of 0.9
if(0) {parameters_xgb7 <- expand.grid(nrounds = 500,
                               max_depth = 12,
                               eta = 0.1, 
                               gamma = 1,
                               colsample_bytree = 0.7,
                               min_child_weight = 50,
                               subsample = 0.9)



set.seed(123456789)
xgb7 <- train(model_formula,
              data = hotel_train,
              method = "xgbTree",
              trControl = cv3,
              tuneGrid  = parameters_xgb7)


saveRDS(object = xgb7,
            file   = here("output", "xgb7.rds"))
}

xgb7 <- readRDS(here("output", "xgb7.rds"))

xgb7
## eXtreme Gradient Boosting 
## 
## 81389 samples
##    15 predictor
##     2 classes: 'Yes', 'No' 
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold) 
## Summary of sample sizes: 54259, 54259, 54260 
## Resampling results:
## 
##   ROC        Sens      Spec     
##   0.8995244  0.692997  0.9129423
## 
## Tuning parameter 'nrounds' was held constant at a value of 500
## Tuning
##  parameter 'min_child_weight' was held constant at a value of 50
## 
## Tuning parameter 'subsample' was held constant at a value of 0.9
if(0) {parameters_xgb8 <- expand.grid(nrounds = 1000,
                               max_depth = 16,
                               eta = 0.05, 
                               gamma = 1,
                               colsample_bytree = 0.8,
                               min_child_weight = 50,
                               subsample = 0.9)



set.seed(123456789)
xgb8 <- train(model_formula,
              data = hotel_train,
              method = "xgbTree",
              trControl = cv3,
              tuneGrid  = parameters_xgb8)


saveRDS(object = xgb8,
            file   = here("output", "xgb8.rds"))
}

xgb8 <- readRDS(here("output", "xgb8.rds"))

xgb8
## eXtreme Gradient Boosting 
## 
## 81389 samples
##    15 predictor
##     2 classes: 'Yes', 'No' 
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold) 
## Summary of sample sizes: 54259, 54259, 54260 
## Resampling results:
## 
##   ROC        Sens       Spec     
##   0.9016308  0.7002965  0.9125478
## 
## Tuning parameter 'nrounds' was held constant at a value of 1000
## 
## Tuning parameter 'min_child_weight' was held constant at a value of 50
## 
## Tuning parameter 'subsample' was held constant at a value of 0.9

The results from this last step are very similar, so we will keep only models 1-5 and the last one. Now we will compare all their results on the training sample.

xgb_roc <- roc(hotel_train$is_canceled, 
                       predict(xgb,
                               hotel_train, type = "prob")[, "Yes"])

xgb_roc2 <- roc(hotel_train$is_canceled, 
               predict(xgb2,
                       hotel_train, type = "prob")[, "Yes"])

xgb_roc3 <- roc(hotel_train$is_canceled, 
               predict(xgb3,
                       hotel_train, type = "prob")[, "Yes"])

xgb_roc4 <- roc(hotel_train$is_canceled, 
               predict(xgb4,
                       hotel_train, type = "prob")[, "Yes"])

xgb_roc5 <- roc(hotel_train$is_canceled, 
               predict(xgb5,
                       hotel_train, type = "prob")[, "Yes"])

xgb_roc8 <- roc(hotel_train$is_canceled, 
               predict(xgb8,
                       hotel_train, type = "prob")[, "Yes"])

# plot
list(
  xgb_roc   = xgb_roc,
  xgb_roc2    = xgb_roc2,
  xgb_roc3  = xgb_roc3,
  xgb_roc4   = xgb_roc4,
  xgb_roc5  = xgb_roc5,
  xgb_roc8 = xgb_roc8
) %>%
  pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) + 
  geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), 
               color = "grey", 
               linetype = "dashed") +
  labs(subtitle = paste("Gini TRAIN: ",
                         "xgb = ", 
                         round(100 * (2 * auc(xgb_roc) - 1), 1), "%, ",
                         "xgb2 = ", 
                         round(100 * (2 * auc(xgb_roc2) - 1), 1), "%, ",
                         "xgb3 = ", 
                         round(100 * (2 * auc(xgb_roc3) - 1), 1), "%, \n",
                         "xgb4 = ", 
                         round(100 * (2 * auc(xgb_roc4) - 1), 1), "%, ",
                         "xgb5 = ", 
                         round(100 * (2 * auc(xgb_roc5) - 1), 1), "%, ",
                         "xgb8 = ", 
                         round(100 * (2 * auc(xgb_roc8) - 1), 1), "% ")) +
  theme_bw() + coord_fixed() +
  scale_color_brewer(palette = "Dark2")

We were able to steadily improve the quality of predictions on the training sample by tuning the parameters. Starting from a rather weak, unsatisfactory first model, we obtained a much better Gini coefficient at the end of the process. Also, the changes in the Gini coefficient became smaller and smaller as we got closer to the final parameters.
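Since the Gini coefficient is just a linear transformation of the AUC (Gini = 2 * AUC - 1), the repeated expression in the plot subtitles can be wrapped in a small helper. This is only a sketch for illustration; the gini_pct function below is not part of the original code.

gini_pct <- function(roc_obj) {
  # Gini = 2 * AUC - 1, reported as a percentage with one decimal place
  round(100 * (2 * pROC::auc(roc_obj) - 1), 1)
}

gini_pct(xgb_roc8)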

Now the same on the testing sample.

xgb_test_roc <- roc(hotel_test$is_canceled, 
               predict(xgb,
                       hotel_test, type = "prob")[, "Yes"])

xgb_test_roc2 <- roc(hotel_test$is_canceled, 
                predict(xgb2,
                        hotel_test, type = "prob")[, "Yes"])

xgb_test_roc3 <- roc(hotel_test$is_canceled, 
                predict(xgb3,
                        hotel_test, type = "prob")[, "Yes"])

xgb_test_roc4 <- roc(hotel_test$is_canceled, 
                predict(xgb4,
                        hotel_test, type = "prob")[, "Yes"])

xgb_test_roc5 <- roc(hotel_test$is_canceled, 
                predict(xgb5,
                        hotel_test, type = "prob")[, "Yes"])

xgb_test_roc8 <- roc(hotel_test$is_canceled, 
                predict(xgb8,
                        hotel_test, type = "prob")[, "Yes"])


# plot
list(
  xgb_test_roc   = xgb_test_roc,
  xgb_test_roc2    = xgb_test_roc2,
  xgb_test_roc3  = xgb_test_roc3,
  xgb_test_roc4   = xgb_test_roc4,
  xgb_test_roc5  = xgb_test_roc5,
  xgb_test_roc8 = xgb_test_roc8
) %>%
  pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) + 
  geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), 
               color = "grey", 
               linetype = "dashed") +
  labs(subtitle = c(paste("Gini TEST: ",
                         "xgb = ", 
                         round(100 * (2 * auc(xgb_test_roc) - 1), 1), "%, ",
                         "xgb2 = ", 
                         round(100 * (2 * auc(xgb_test_roc2) - 1), 1), "%, ",
                         "xgb3 = ", 
                         round(100 * (2 * auc(xgb_test_roc3) - 1), 1), "%, \n",
                         "xgb4 = ", 
                         round(100 * (2 * auc(xgb_test_roc4) - 1), 1), "%, ",
                         "xgb5 = ", 
                         round(100 * (2 * auc(xgb_test_roc5) - 1), 1), "%, ",
                         "xgb8 = ", 
                         round(100 * (2 * auc(xgb_test_roc8) - 1), 1), "% "))) +
  theme_bw() + coord_fixed() +
  scale_color_brewer(palette = "Dark2")

On the testing data the results keep the same ordering, although some overfitting again appears for the best model variants, so the improvement is not as large as on the training sample.

Neural Networks

Finally, we also created three rather simple neural networks. They are well suited to capturing non-linear relations with the target variable, but they can also be very time-consuming to train.

Before modelling we need to apply some changes to our data. We will train our models with the neuralnet function, which requires that continuous variables be rescaled (here to the [0, 1] range with min-max scaling) and, moreover, that all categorical variables be recoded into dummies.

# min-max scale continuous vars to the [0, 1] range

train.max <- apply(hotel_train[,numeric_vars], 2, max)
train.min <- apply(hotel_train[,numeric_vars], 2, min)

hotel_train_s <- hotel_train

hotel_train_s[,numeric_vars] <- scale(hotel_train[,numeric_vars], 
                                      center = train.min, 
                                      scale  = train.max - train.min)

We also need to rescale the testing data for the future predictions, using the min and max values from the training data so that no information leaks from the test set.

hotel_test_s <- hotel_test

hotel_test_s[,numeric_vars] <- scale(hotel_test[,numeric_vars], 
                                      center = train.min, 
                                      scale  = train.max - train.min)

We’ll keep the recoded data as a matrix and clean up the automatically generated names of the new binary variables. We also need a new formula based on those new labels.

# recoding into dummies

hotel_train_mtx <- 
  model.matrix(object = model_formula, 
               data   = hotel_train_s)

dim(hotel_train_mtx)
## [1] 81389    35
colnames(hotel_train_mtx)
##  [1] "(Intercept)"                  "hotelResort Hotel"           
##  [3] "arrival_date_monthAugust"     "arrival_date_monthDecember"  
##  [5] "arrival_date_monthFebruary"   "arrival_date_monthJanuary"   
##  [7] "arrival_date_monthJuly"       "arrival_date_monthJune"      
##  [9] "arrival_date_monthMarch"      "arrival_date_monthMay"       
## [11] "arrival_date_monthNovember"   "arrival_date_monthOctober"   
## [13] "arrival_date_monthSeptember"  "adults.L"                    
## [15] "adults.Q"                     "adults.C"                    
## [17] "meal.L"                       "meal.Q"                      
## [19] "distribution_channelDirect"   "distribution_channelTA/TO"   
## [21] "booking_changes"              "deposit_typeNon Refund"      
## [23] "deposit_typeRefundable"       "customer_typeGroup"          
## [25] "customer_typeTransient"       "customer_typeTransient-Party"
## [27] "lead_time_box"                "kidsyes"                     
## [29] "cancyes"                      "adr_box"                     
## [31] "parkingyes"                   "requestsA1"                  
## [33] "requestsA2"                   "requestsA3+"                 
## [35] "days"
colnames(hotel_train_mtx) <- gsub("/", "",   colnames(hotel_train_mtx))
colnames(hotel_train_mtx) <- gsub("\\+", "_",   colnames(hotel_train_mtx))
colnames(hotel_train_mtx) <- gsub(" ", "",   colnames(hotel_train_mtx))

# the same for test data

hotel_test_mtx <- 
  model.matrix(object = model_formula, 
               data   = hotel_test_s)

dim(hotel_test_mtx)
## [1] 34905    35
colnames(hotel_test_mtx)
##  [1] "(Intercept)"                  "hotelResort Hotel"           
##  [3] "arrival_date_monthAugust"     "arrival_date_monthDecember"  
##  [5] "arrival_date_monthFebruary"   "arrival_date_monthJanuary"   
##  [7] "arrival_date_monthJuly"       "arrival_date_monthJune"      
##  [9] "arrival_date_monthMarch"      "arrival_date_monthMay"       
## [11] "arrival_date_monthNovember"   "arrival_date_monthOctober"   
## [13] "arrival_date_monthSeptember"  "adults.L"                    
## [15] "adults.Q"                     "adults.C"                    
## [17] "meal.L"                       "meal.Q"                      
## [19] "distribution_channelDirect"   "distribution_channelTA/TO"   
## [21] "booking_changes"              "deposit_typeNon Refund"      
## [23] "deposit_typeRefundable"       "customer_typeGroup"          
## [25] "customer_typeTransient"       "customer_typeTransient-Party"
## [27] "lead_time_box"                "kidsyes"                     
## [29] "cancyes"                      "adr_box"                     
## [31] "parkingyes"                   "requestsA1"                  
## [33] "requestsA2"                   "requestsA3+"                 
## [35] "days"
colnames(hotel_test_mtx) <- gsub("/", "",   colnames(hotel_test_mtx))
colnames(hotel_test_mtx) <- gsub("\\+", "_",   colnames(hotel_test_mtx))
colnames(hotel_test_mtx) <- gsub(" ", "",   colnames(hotel_test_mtx))
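As a side note, base R’s make.names() performs a similar cleanup in a single call. This is only a sketch, not used in the analysis, since its output differs slightly from the gsub() chain above (for example "+" becomes "." rather than "_"):

# hypothetical alternative: one-call cleanup of the raw column names
head(make.names(colnames(model.matrix(model_formula, data = hotel_test_s))))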

# and new formula 

col_list <- paste(c(colnames(hotel_train_mtx[, -1])), collapse = "+")
col_list <- paste(c("is_canceled ~ ", col_list), collapse = "")
model_formula2 <- formula(col_list)
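The same formula can also be built with base R’s reformulate(), avoiding the manual paste(). A sketch equivalent to the three lines above:

# equivalent construction of the formula (sketch)
model_formula2 <- reformulate(termlabels = colnames(hotel_train_mtx)[-1],
                              response   = "is_canceled")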

The first net will be a very simple model without any hidden layers; with a logistic activation and no hidden neurons it is essentially a logistic regression fitted with the rprop+ algorithm.

if(0){
set.seed(123456789)
nn0 <- 
  data.frame(hotel_train_mtx,
             is_canceled = as.numeric(hotel_train$is_canceled == "Yes")) %>%
  neuralnet(model_formula2,
            data = .,
            hidden = 0, 
            linear.output = FALSE,
            algorithm = "rprop+",
            threshold = 0.01)


saveRDS(object = nn0,
            file   = here("output", "nn0.rds"))
}

nn0 <- readRDS(here("output", "nn0.rds"))

nn0$result.matrix
##                                                     [,1]
## error                                       5.751797e+03
## reached.threshold                           9.748996e-03
## steps                                       6.894000e+03
## Intercept.to.is_canceled                   -7.490977e+00
## hotelResortHotel.to.is_canceled             4.093791e-01
## arrival_date_monthAugust.to.is_canceled    -5.040890e-01
## arrival_date_monthDecember.to.is_canceled   3.031218e-01
## arrival_date_monthFebruary.to.is_canceled   3.910708e-01
## arrival_date_monthJanuary.to.is_canceled    4.784395e-01
## arrival_date_monthJuly.to.is_canceled      -5.588438e-01
## arrival_date_monthJune.to.is_canceled      -3.972144e-01
## arrival_date_monthMarch.to.is_canceled      4.783688e-02
## arrival_date_monthMay.to.is_canceled       -2.700806e-01
## arrival_date_monthNovember.to.is_canceled   1.554840e-01
## arrival_date_monthOctober.to.is_canceled   -1.267552e-01
## arrival_date_monthSeptember.to.is_canceled -4.810588e-01
## adults.L.to.is_canceled                    -3.573290e-01
## adults.Q.to.is_canceled                     1.612853e-01
## adults.C.to.is_canceled                    -1.507029e-01
## meal.L.to.is_canceled                      -7.581968e-01
## meal.Q.to.is_canceled                       3.121499e-02
## distribution_channelDirect.to.is_canceled   2.056507e-01
## distribution_channelTATO.to.is_canceled     1.198020e+00
## booking_changes.to.is_canceled             -7.766732e+00
## deposit_typeNonRefund.to.is_canceled        4.166921e+00
## deposit_typeRefundable.to.is_canceled       5.431070e-01
## customer_typeGroup.to.is_canceled          -4.959328e-01
## customer_typeTransient.to.is_canceled       1.052287e+00
## lead_time_box.to.is_canceled                3.097382e+00
## kidsyes.to.is_canceled                      2.038891e-02
## cancyes.to.is_canceled                      3.374638e+00
## adr_box.to.is_canceled                      1.436497e+01
## parkingyes.to.is_canceled                  -3.748869e+02
## requestsA1.to.is_canceled                  -1.017234e+00
## requestsA2.to.is_canceled                  -1.236498e+00
## requestsA3_.to.is_canceled                 -1.739464e+00
## days.to.is_canceled                         5.285801e-01
plot(nn0, rep = 'best')

nn0_pred <- compute(nn0, hotel_train_mtx[, -1])


confusionMatrix(as.numeric((nn0_pred$net.result > 0.5)) %>% as.factor(),
                    as.factor(ifelse(hotel_train$is_canceled == "Yes", 1, 0)))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 46942 12541
##          1  3760 18146
##                                           
##                Accuracy : 0.7997          
##                  95% CI : (0.7969, 0.8025)
##     No Information Rate : 0.623           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5481          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9258          
##             Specificity : 0.5913          
##          Pos Pred Value : 0.7892          
##          Neg Pred Value : 0.8284          
##              Prevalence : 0.6230          
##          Detection Rate : 0.5768          
##    Detection Prevalence : 0.7308          
##       Balanced Accuracy : 0.7586          
##                                           
##        'Positive' Class : 0               
## 
nn0_roc <- roc(as.numeric(hotel_train$is_canceled == "Yes"), 
            compute(nn0, hotel_train_mtx[, -1])$net.result[, 1])

In the confusion matrix we can see that the specificity in particular is low for the default 0.5 cutoff.
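The cutoff itself can be treated as a tunable quantity. As a minimal sketch, not part of the original analysis, pROC can propose the threshold maximizing Youden’s J statistic (sensitivity + specificity - 1); note that depending on the pROC version, coords() returns a named vector or a one-row data frame.

# threshold maximizing sensitivity + specificity - 1 on the training ROC
best_cut <- pROC::coords(nn0_roc, x = "best", best.method = "youden",
                         ret = c("threshold", "sensitivity", "specificity"))
best_cut

# reclassify the training predictions with the tuned cutoff
table(pred   = as.numeric(nn0_pred$net.result > best_cut[["threshold"]]),
      actual = as.numeric(hotel_train$is_canceled == "Yes"))

With that aside, we can check at once the predictions and confusion matrix for the testing sample.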

nn0_test_pred <- compute(nn0, hotel_test_mtx[, -1])


confusionMatrix(as.numeric((nn0_test_pred$net.result > 0.5)) %>% as.factor(),
                as.factor(ifelse(hotel_test$is_canceled == "Yes", 1, 0)))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 20270  5320
##          1  1599  7716
##                                           
##                Accuracy : 0.8018          
##                  95% CI : (0.7976, 0.8059)
##     No Information Rate : 0.6265          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5505          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9269          
##             Specificity : 0.5919          
##          Pos Pred Value : 0.7921          
##          Neg Pred Value : 0.8283          
##              Prevalence : 0.6265          
##          Detection Rate : 0.5807          
##    Detection Prevalence : 0.7331          
##       Balanced Accuracy : 0.7594          
##                                           
##        'Positive' Class : 0               
## 
nn0_test_roc <- roc(as.numeric(hotel_test$is_canceled == "Yes"), 
               compute(nn0, hotel_test_mtx[, -1])$net.result[, 1])

The accuracy is even slightly higher than on the training sample, but the problem with specificity remains. Now we’ll build a second model with one hidden layer of 6 neurons. To speed up the computationally demanding training, we will use a subsample of only 2 thousand observations drawn from the training sample.

if(0){
set.seed(123456789)
nn1 <- 
  data.frame(hotel_train_mtx,
             is_canceled = as.numeric(hotel_train$is_canceled == "Yes")) %>%
  sample_n(2000) %>%
  neuralnet(model_formula2,
            data = .,
            hidden = c(6), 
            linear.output = FALSE,
            stepmax = 1e+06,
            algorithm = "rprop+",
            threshold = 0.01)


saveRDS(object = nn1,
            file   = here("output", "nn1.rds"))
}

nn1 <- readRDS(here("output", "nn1.rds"))

nn1$result.matrix
##                                                  [,1]
## error                                    7.397264e+01
## reached.threshold                        9.482453e-03
## steps                                    7.219800e+05
## Intercept.to.1layhid1                    5.374891e+00
## hotelResortHotel.to.1layhid1            -2.548735e+01
## arrival_date_monthAugust.to.1layhid1    -5.208985e+01
## arrival_date_monthDecember.to.1layhid1  -1.360527e+01
## arrival_date_monthFebruary.to.1layhid1   1.228098e+01
## arrival_date_monthJanuary.to.1layhid1   -1.777210e+00
## arrival_date_monthJuly.to.1layhid1      -5.219038e+01
## arrival_date_monthJune.to.1layhid1      -5.455389e+01
## arrival_date_monthMarch.to.1layhid1      3.389499e+01
## arrival_date_monthMay.to.1layhid1       -4.358121e+01
## arrival_date_monthNovember.to.1layhid1   4.705142e+01
## arrival_date_monthOctober.to.1layhid1   -4.369458e+01
## arrival_date_monthSeptember.to.1layhid1 -5.972898e+01
## adults.L.to.1layhid1                    -9.169481e+00
## adults.Q.to.1layhid1                    -1.442369e+01
## adults.C.to.1layhid1                    -7.352820e+00
## meal.L.to.1layhid1                      -8.735676e+01
## meal.Q.to.1layhid1                       1.171479e+01
## distribution_channelDirect.to.1layhid1   1.163392e+01
## distribution_channelTATO.to.1layhid1    -3.122873e-01
## booking_changes.to.1layhid1              3.910796e+01
## deposit_typeNonRefund.to.1layhid1       -7.194233e+02
## deposit_typeRefundable.to.1layhid1       2.200320e+01
## customer_typeGroup.to.1layhid1           6.854438e+01
## customer_typeTransient.to.1layhid1      -2.237636e+01
## lead_time_box.to.1layhid1               -4.220127e+01
## kidsyes.to.1layhid1                      3.653100e+01
## cancyes.to.1layhid1                      4.700659e+01
## adr_box.to.1layhid1                      3.906279e+01
## parkingyes.to.1layhid1                   1.590627e+02
## requestsA1.to.1layhid1                   2.463954e+01
## requestsA2.to.1layhid1                   3.388900e+01
## requestsA3_.to.1layhid1                  1.933467e+01
## days.to.1layhid1                         1.520092e+02
## Intercept.to.1layhid2                    3.178902e+00
## hotelResortHotel.to.1layhid2            -1.662296e+01
## arrival_date_monthAugust.to.1layhid2     1.786461e+01
## arrival_date_monthDecember.to.1layhid2   1.016848e+01
## arrival_date_monthFebruary.to.1layhid2   1.219847e+01
## arrival_date_monthJanuary.to.1layhid2    3.253156e+00
## arrival_date_monthJuly.to.1layhid2       1.638636e+01
## arrival_date_monthJune.to.1layhid2      -1.120094e+01
## arrival_date_monthMarch.to.1layhid2      1.246974e+01
## arrival_date_monthMay.to.1layhid2       -1.617144e+01
## arrival_date_monthNovember.to.1layhid2  -1.670364e+01
## arrival_date_monthOctober.to.1layhid2    1.337819e+01
## arrival_date_monthSeptember.to.1layhid2  2.841505e+01
## adults.L.to.1layhid2                     2.913257e+01
## adults.Q.to.1layhid2                     7.487303e+00
## adults.C.to.1layhid2                     9.853815e+00
## meal.L.to.1layhid2                      -2.385391e+01
## meal.Q.to.1layhid2                       6.079104e+00
## distribution_channelDirect.to.1layhid2  -6.598365e+00
## distribution_channelTATO.to.1layhid2    -1.026242e+01
## booking_changes.to.1layhid2             -6.470914e+01
## deposit_typeNonRefund.to.1layhid2        4.358609e+01
## deposit_typeRefundable.to.1layhid2      -6.912606e+02
## customer_typeGroup.to.1layhid2          -7.074416e+02
## customer_typeTransient.to.1layhid2       1.611277e+01
## lead_time_box.to.1layhid2                9.217366e+00
## kidsyes.to.1layhid2                      3.317836e+01
## cancyes.to.1layhid2                      7.793332e+01
## adr_box.to.1layhid2                     -2.975516e+01
## parkingyes.to.1layhid2                  -7.410783e+02
## requestsA1.to.1layhid2                   1.371463e+01
## requestsA2.to.1layhid2                   1.143314e+01
## requestsA3_.to.1layhid2                  2.196846e+01
## days.to.1layhid2                         2.346528e+01
## Intercept.to.1layhid3                   -4.511418e+00
## hotelResortHotel.to.1layhid3            -1.619105e+00
## arrival_date_monthAugust.to.1layhid3    -2.246299e+00
## arrival_date_monthDecember.to.1layhid3   5.327190e+00
## arrival_date_monthFebruary.to.1layhid3   9.387458e-01
## arrival_date_monthJanuary.to.1layhid3    1.484502e+00
## arrival_date_monthJuly.to.1layhid3      -2.335146e+00
## arrival_date_monthJune.to.1layhid3       4.417763e+00
## arrival_date_monthMarch.to.1layhid3      1.716618e+00
## arrival_date_monthMay.to.1layhid3        6.365860e-01
## arrival_date_monthNovember.to.1layhid3   3.883868e+00
## arrival_date_monthOctober.to.1layhid3   -1.954033e-01
## arrival_date_monthSeptember.to.1layhid3  1.204140e+00
## adults.L.to.1layhid3                    -5.984674e-01
## adults.Q.to.1layhid3                    -8.492180e-01
## adults.C.to.1layhid3                    -2.753736e-01
## meal.L.to.1layhid3                      -9.900164e-01
## meal.Q.to.1layhid3                       1.552201e-01
## distribution_channelDirect.to.1layhid3  -2.273597e+00
## distribution_channelTATO.to.1layhid3    -1.281813e+00
## booking_changes.to.1layhid3             -2.358846e+00
## deposit_typeNonRefund.to.1layhid3       -9.949138e-01
## deposit_typeRefundable.to.1layhid3       3.816987e+01
## customer_typeGroup.to.1layhid3          -6.546891e+02
## customer_typeTransient.to.1layhid3       2.391955e-01
## lead_time_box.to.1layhid3                1.515612e+00
## kidsyes.to.1layhid3                     -8.745374e-01
## cancyes.to.1layhid3                      6.344581e+00
## adr_box.to.1layhid3                      2.231523e+01
## parkingyes.to.1layhid3                  -6.846546e+02
## requestsA1.to.1layhid3                  -3.760618e+00
## requestsA2.to.1layhid3                  -6.347371e+00
## requestsA3_.to.1layhid3                 -1.805744e+00
## days.to.1layhid3                         8.250674e+00
## Intercept.to.1layhid4                    2.465855e+01
## hotelResortHotel.to.1layhid4             5.128860e+00
## arrival_date_monthAugust.to.1layhid4     1.746610e+01
## arrival_date_monthDecember.to.1layhid4   2.576604e+01
## arrival_date_monthFebruary.to.1layhid4  -3.385431e+00
## arrival_date_monthJanuary.to.1layhid4    5.816811e+00
## arrival_date_monthJuly.to.1layhid4       9.323096e+01
## arrival_date_monthJune.to.1layhid4      -2.827935e+01
## arrival_date_monthMarch.to.1layhid4      8.250569e+00
## arrival_date_monthMay.to.1layhid4       -1.611719e+01
## arrival_date_monthNovember.to.1layhid4  -4.486123e+01
## arrival_date_monthOctober.to.1layhid4    5.018538e+00
## arrival_date_monthSeptember.to.1layhid4 -2.753555e+01
## adults.L.to.1layhid4                     2.472143e+00
## adults.Q.to.1layhid4                    -3.791186e+01
## adults.C.to.1layhid4                    -1.326957e+01
## meal.L.to.1layhid4                       3.180176e+01
## meal.Q.to.1layhid4                       1.121728e+01
## distribution_channelDirect.to.1layhid4   7.456924e+02
## distribution_channelTATO.to.1layhid4     9.031202e+00
## booking_changes.to.1layhid4              1.648653e+02
## deposit_typeNonRefund.to.1layhid4       -5.460739e+01
## deposit_typeRefundable.to.1layhid4      -7.445491e+02
## customer_typeGroup.to.1layhid4           7.875166e+01
## customer_typeTransient.to.1layhid4      -5.510910e+00
## lead_time_box.to.1layhid4                5.549299e+00
## kidsyes.to.1layhid4                      9.232313e+00
## cancyes.to.1layhid4                     -5.000703e+00
## adr_box.to.1layhid4                     -9.050179e+01
## parkingyes.to.1layhid4                   6.147563e+02
## requestsA1.to.1layhid4                  -2.582677e+01
## requestsA2.to.1layhid4                  -2.499270e+01
## requestsA3_.to.1layhid4                 -3.849444e+01
## days.to.1layhid4                        -7.074467e+01
## Intercept.to.1layhid5                   -2.236770e+00
## hotelResortHotel.to.1layhid5            -8.504060e+00
## arrival_date_monthAugust.to.1layhid5     3.028680e+00
## arrival_date_monthDecember.to.1layhid5   5.390814e+00
## arrival_date_monthFebruary.to.1layhid5   4.560318e+00
## arrival_date_monthJanuary.to.1layhid5    9.778081e+00
## arrival_date_monthJuly.to.1layhid5       1.600907e+00
## arrival_date_monthJune.to.1layhid5      -5.326529e+00
## arrival_date_monthMarch.to.1layhid5      7.371411e+00
## arrival_date_monthMay.to.1layhid5        7.487681e+00
## arrival_date_monthNovember.to.1layhid5   5.521913e+01
## arrival_date_monthOctober.to.1layhid5    2.010108e+00
## arrival_date_monthSeptember.to.1layhid5  5.117728e+00
## adults.L.to.1layhid5                     3.186449e-01
## adults.Q.to.1layhid5                     4.055619e+00
## adults.C.to.1layhid5                    -2.996738e+00
## meal.L.to.1layhid5                       3.339225e+00
## meal.Q.to.1layhid5                      -3.391951e+00
## distribution_channelDirect.to.1layhid5  -2.551684e+00
## distribution_channelTATO.to.1layhid5    -5.894729e+00
## booking_changes.to.1layhid5              7.153328e+01
## deposit_typeNonRefund.to.1layhid5       -7.325118e+02
## deposit_typeRefundable.to.1layhid5       5.899934e+00
## customer_typeGroup.to.1layhid5           3.914812e+01
## customer_typeTransient.to.1layhid5       2.687965e-01
## lead_time_box.to.1layhid5               -7.994642e+00
## kidsyes.to.1layhid5                      1.024203e+00
## cancyes.to.1layhid5                     -5.517057e+00
## adr_box.to.1layhid5                      2.242047e+01
## parkingyes.to.1layhid5                   4.732492e+01
## requestsA1.to.1layhid5                   4.786153e+00
## requestsA2.to.1layhid5                  -1.813163e+00
## requestsA3_.to.1layhid5                  1.146103e+00
## days.to.1layhid5                         4.267567e+01
## Intercept.to.1layhid6                    9.303958e+00
## hotelResortHotel.to.1layhid6            -2.290631e+00
## arrival_date_monthAugust.to.1layhid6    -1.831472e+00
## arrival_date_monthDecember.to.1layhid6   2.069801e+00
## arrival_date_monthFebruary.to.1layhid6  -1.891841e+00
## arrival_date_monthJanuary.to.1layhid6   -4.920723e+00
## arrival_date_monthJuly.to.1layhid6      -8.215916e-01
## arrival_date_monthJune.to.1layhid6       9.661582e+00
## arrival_date_monthMarch.to.1layhid6     -2.871030e+00
## arrival_date_monthMay.to.1layhid6       -9.299367e-01
## arrival_date_monthNovember.to.1layhid6  -9.004345e+00
## arrival_date_monthOctober.to.1layhid6    1.193451e+00
## arrival_date_monthSeptember.to.1layhid6  8.141927e+00
## adults.L.to.1layhid6                     3.152838e+00
## adults.Q.to.1layhid6                    -3.189553e-01
## adults.C.to.1layhid6                     2.964809e+00
## meal.L.to.1layhid6                       2.134982e-01
## meal.Q.to.1layhid6                       1.236679e+00
## distribution_channelDirect.to.1layhid6  -5.743821e+00
## distribution_channelTATO.to.1layhid6    -1.581319e+00
## booking_changes.to.1layhid6             -3.581584e+01
## deposit_typeNonRefund.to.1layhid6       -2.332456e+00
## deposit_typeRefundable.to.1layhid6       5.984417e+00
## customer_typeGroup.to.1layhid6           4.359032e+01
## customer_typeTransient.to.1layhid6       1.925424e+00
## lead_time_box.to.1layhid6               -1.165244e-01
## kidsyes.to.1layhid6                     -3.113989e+00
## cancyes.to.1layhid6                     -5.322883e-01
## adr_box.to.1layhid6                     -2.723852e+01
## parkingyes.to.1layhid6                   1.301335e+02
## requestsA1.to.1layhid6                  -2.060458e+00
## requestsA2.to.1layhid6                  -1.967191e+00
## requestsA3_.to.1layhid6                  8.983357e+00
## days.to.1layhid6                        -7.680627e+00
## Intercept.to.is_canceled                 3.573575e+01
## 1layhid1.to.is_canceled                 -2.077520e+02
## 1layhid2.to.is_canceled                  2.133316e+02
## 1layhid3.to.is_canceled                  3.782999e+02
## 1layhid4.to.is_canceled                 -1.929522e+02
## 1layhid5.to.is_canceled                 -2.039991e+02
## 1layhid6.to.is_canceled                 -4.126416e+02
plot(nn1, rep = 'best')

nn1_pred <- compute(nn1, hotel_train_mtx[, -1])


confusionMatrix(as.numeric((nn1_pred$net.result > 0.5)) %>% as.factor(),
                    as.factor(ifelse(hotel_train$is_canceled == "Yes", 1, 0)))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 42629 11771
##          1  8073 18916
##                                           
##                Accuracy : 0.7562          
##                  95% CI : (0.7532, 0.7591)
##     No Information Rate : 0.623           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4683          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8408          
##             Specificity : 0.6164          
##          Pos Pred Value : 0.7836          
##          Neg Pred Value : 0.7009          
##              Prevalence : 0.6230          
##          Detection Rate : 0.5238          
##    Detection Prevalence : 0.6684          
##       Balanced Accuracy : 0.7286          
##                                           
##        'Positive' Class : 0               
## 
nn1_roc <- roc(as.numeric(hotel_train$is_canceled == "Yes"), 
            compute(nn1, hotel_train_mtx[, -1])$net.result[, 1])

Surprisingly, the result this time is worse and unsatisfactory. Let’s check the difference on the testing data.

nn1_test_pred <- compute(nn1, hotel_test_mtx[, -1])


confusionMatrix(as.numeric((nn1_test_pred$net.result > 0.5)) %>% as.factor(),
                as.factor(ifelse(hotel_test$is_canceled == "Yes", 1, 0)))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 18259  5021
##          1  3610  8015
##                                           
##                Accuracy : 0.7527          
##                  95% CI : (0.7482, 0.7572)
##     No Information Rate : 0.6265          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4598          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8349          
##             Specificity : 0.6148          
##          Pos Pred Value : 0.7843          
##          Neg Pred Value : 0.6895          
##              Prevalence : 0.6265          
##          Detection Rate : 0.5231          
##    Detection Prevalence : 0.6670          
##       Balanced Accuracy : 0.7249          
##                                           
##        'Positive' Class : 0               
## 
nn1_test_roc <- roc(as.numeric(hotel_test$is_canceled == "Yes"), 
                    compute(nn1, hotel_test_mtx[, -1])$net.result[, 1])

Similar conclusions: it seems that adding a hidden layer doesn’t help at all. Finally, we will add another layer of 4 neurons and see what happens. We will also reduce the subsample even further, to only one thousand observations.

if(0){
set.seed(123456789)
nn2 <- 
  data.frame(hotel_train_mtx,
             is_canceled = as.numeric(hotel_train$is_canceled == "Yes")) %>%
  sample_n(1000) %>%
  neuralnet(model_formula2,
            data = .,
            hidden = c(6, 4), 
            linear.output = FALSE,
            stepmax = 1e+06,
            algorithm = "rprop+",
            threshold = 0.001)


saveRDS(object = nn2,
            file   = here("output", "nn2.rds"))
}

nn2 <- readRDS(here("output", "nn2.rds"))

nn2$result.matrix
##                                                  [,1]
## error                                    2.250067e+01
## reached.threshold                        7.669559e-04
## steps                                    4.606400e+04
## Intercept.to.1layhid1                    2.445336e+00
## hotelResortHotel.to.1layhid1             3.990793e+01
## arrival_date_monthAugust.to.1layhid1     3.931190e+01
## arrival_date_monthDecember.to.1layhid1   3.978750e+01
## arrival_date_monthFebruary.to.1layhid1   3.743967e+01
## arrival_date_monthJanuary.to.1layhid1   -7.522021e+00
## arrival_date_monthJuly.to.1layhid1      -7.962709e+00
## arrival_date_monthJune.to.1layhid1       2.578327e+00
## arrival_date_monthMarch.to.1layhid1     -7.262644e+00
## arrival_date_monthMay.to.1layhid1        2.452163e+00
## arrival_date_monthNovember.to.1layhid1  -4.010579e+00
## arrival_date_monthOctober.to.1layhid1   -2.757441e+00
## arrival_date_monthSeptember.to.1layhid1  4.215977e+01
## adults.L.to.1layhid1                    -1.190886e+00
## adults.Q.to.1layhid1                     1.668544e+00
## adults.C.to.1layhid1                     5.909044e+00
## meal.L.to.1layhid1                       9.129906e+00
## meal.Q.to.1layhid1                       6.103435e+00
## distribution_channelDirect.to.1layhid1   4.991895e+01
## distribution_channelTATO.to.1layhid1     2.852426e+00
## booking_changes.to.1layhid1              7.995993e+01
## deposit_typeNonRefund.to.1layhid1       -1.322254e+01
## deposit_typeRefundable.to.1layhid1       1.188839e+01
## customer_typeGroup.to.1layhid1           2.975056e+02
## customer_typeTransient.to.1layhid1       8.746696e+00
## lead_time_box.to.1layhid1               -4.222491e+00
## kidsyes.to.1layhid1                     -8.875577e+00
## cancyes.to.1layhid1                     -3.041214e+02
## adr_box.to.1layhid1                      1.233006e+01
## parkingyes.to.1layhid1                   2.451804e+02
## requestsA1.to.1layhid1                   9.907438e+00
## requestsA2.to.1layhid1                  -1.348668e+00
## requestsA3_.to.1layhid1                  3.127534e+02
## days.to.1layhid1                         6.684135e+01
## Intercept.to.1layhid2                   -8.689737e-01
## hotelResortHotel.to.1layhid2             5.722455e+00
## arrival_date_monthAugust.to.1layhid2    -5.972431e+00
## arrival_date_monthDecember.to.1layhid2  -7.654679e+00
## arrival_date_monthFebruary.to.1layhid2   9.218852e-02
## arrival_date_monthJanuary.to.1layhid2   -1.621443e+00
## arrival_date_monthJuly.to.1layhid2       7.353724e-01
## arrival_date_monthJune.to.1layhid2       1.957547e+00
## arrival_date_monthMarch.to.1layhid2     -4.842043e+00
## arrival_date_monthMay.to.1layhid2        1.050897e+00
## arrival_date_monthNovember.to.1layhid2  -2.465472e+00
## arrival_date_monthOctober.to.1layhid2   -7.221829e-01
## arrival_date_monthSeptember.to.1layhid2  1.853052e+00
## adults.L.to.1layhid2                    -2.507790e+00
## adults.Q.to.1layhid2                     1.627501e+00
## adults.C.to.1layhid2                    -3.268886e-01
## meal.L.to.1layhid2                       1.618695e+00
## meal.Q.to.1layhid2                      -8.755422e-01
## distribution_channelDirect.to.1layhid2  -7.136336e+02
## distribution_channelTATO.to.1layhid2     1.946677e-01
## booking_changes.to.1layhid2             -1.121755e+01
## deposit_typeNonRefund.to.1layhid2        7.190768e+02
## deposit_typeRefundable.to.1layhid2      -2.894353e+02
## customer_typeGroup.to.1layhid2          -5.113508e+02
## customer_typeTransient.to.1layhid2       2.278237e+00
## lead_time_box.to.1layhid2               -7.082112e-01
## kidsyes.to.1layhid2                     -2.904900e-01
## cancyes.to.1layhid2                     -7.155928e+00
## adr_box.to.1layhid2                      2.203187e+00
## parkingyes.to.1layhid2                  -5.152680e+02
## requestsA1.to.1layhid2                  -1.090650e+00
## requestsA2.to.1layhid2                  -1.563522e+00
## requestsA3_.to.1layhid2                  3.146404e+00
## days.to.1layhid2                        -1.664009e+01
## Intercept.to.1layhid3                    4.994431e-01
## hotelResortHotel.to.1layhid3             3.355449e-01
## arrival_date_monthAugust.to.1layhid3     3.913846e+00
## arrival_date_monthDecember.to.1layhid3  -7.105421e+02
## arrival_date_monthFebruary.to.1layhid3  -2.896649e-01
## arrival_date_monthJanuary.to.1layhid3   -7.063192e-01
## arrival_date_monthJuly.to.1layhid3       3.436165e+00
## arrival_date_monthJune.to.1layhid3      -2.736949e+00
## arrival_date_monthMarch.to.1layhid3      2.246594e+00
## arrival_date_monthMay.to.1layhid3        3.003981e-01
## arrival_date_monthNovember.to.1layhid3   3.074521e-02
## arrival_date_monthOctober.to.1layhid3    1.782097e+00
## arrival_date_monthSeptember.to.1layhid3 -5.789644e-01
## adults.L.to.1layhid3                    -6.941859e-01
## adults.Q.to.1layhid3                    -1.162360e+00
## adults.C.to.1layhid3                    -6.361649e-01
## meal.L.to.1layhid3                      -2.690321e+00
## meal.Q.to.1layhid3                       2.306288e+00
## distribution_channelDirect.to.1layhid3  -1.041899e+00
## distribution_channelTATO.to.1layhid3     7.106703e-01
## booking_changes.to.1layhid3              2.353725e+01
## deposit_typeNonRefund.to.1layhid3        7.118383e+02
## deposit_typeRefundable.to.1layhid3      -2.882787e+02
## customer_typeGroup.to.1layhid3          -5.310858e+02
## customer_typeTransient.to.1layhid3      -2.476492e-01
## lead_time_box.to.1layhid3                1.476960e+00
## kidsyes.to.1layhid3                      4.232672e-01
## cancyes.to.1layhid3                      3.420836e+00
## adr_box.to.1layhid3                     -2.787733e+00
## parkingyes.to.1layhid3                  -5.281673e+02
## requestsA1.to.1layhid3                  -2.972454e+00
## requestsA2.to.1layhid3                  -6.706305e-01
## requestsA3_.to.1layhid3                  1.532688e+00
## days.to.1layhid3                        -2.602837e+01
## Intercept.to.1layhid4                   -1.812370e+00
## hotelResortHotel.to.1layhid4             2.832903e-01
## arrival_date_monthAugust.to.1layhid4     9.401485e-01
## arrival_date_monthDecember.to.1layhid4  -2.700352e+00
## arrival_date_monthFebruary.to.1layhid4   2.167408e+00
## arrival_date_monthJanuary.to.1layhid4    2.976250e+00
## arrival_date_monthJuly.to.1layhid4       1.447992e+00
## arrival_date_monthJune.to.1layhid4       1.809558e-01
## arrival_date_monthMarch.to.1layhid4      1.819831e+00
## arrival_date_monthMay.to.1layhid4       -3.399769e+00
## arrival_date_monthNovember.to.1layhid4   3.032964e+00
## arrival_date_monthOctober.to.1layhid4    1.070366e+00
## arrival_date_monthSeptember.to.1layhid4 -5.087037e+00
## adults.L.to.1layhid4                     1.516245e+00
## adults.Q.to.1layhid4                     1.303123e+00
## adults.C.to.1layhid4                    -3.516939e-01
## meal.L.to.1layhid4                       2.597440e-01
## meal.Q.to.1layhid4                      -3.265543e-02
## distribution_channelDirect.to.1layhid4  -2.516561e+00
## distribution_channelTATO.to.1layhid4    -1.087358e+00
## booking_changes.to.1layhid4             -2.325487e+01
## deposit_typeNonRefund.to.1layhid4        4.512724e+01
## deposit_typeRefundable.to.1layhid4      -2.866621e+02
## customer_typeGroup.to.1layhid4          -5.122173e+02
## customer_typeTransient.to.1layhid4      -8.633997e-01
## lead_time_box.to.1layhid4                2.339991e+00
## kidsyes.to.1layhid4                     -1.797922e+00
## cancyes.to.1layhid4                      3.812564e+01
## adr_box.to.1layhid4                      5.253205e+00
## parkingyes.to.1layhid4                  -5.142397e+02
## requestsA1.to.1layhid4                   6.666795e+00
## requestsA2.to.1layhid4                   3.666278e+00
## requestsA3_.to.1layhid4                 -7.113838e+02
## days.to.1layhid4                        -6.338953e+00
## Intercept.to.1layhid5                    6.145372e-01
## hotelResortHotel.to.1layhid5             5.688636e-01
## arrival_date_monthAugust.to.1layhid5     1.295317e+00
## arrival_date_monthDecember.to.1layhid5   1.664685e-02
## arrival_date_monthFebruary.to.1layhid5   1.725441e+00
## arrival_date_monthJanuary.to.1layhid5    5.068823e+00
## arrival_date_monthJuly.to.1layhid5       3.101162e+00
## arrival_date_monthJune.to.1layhid5      -2.154570e+00
## arrival_date_monthMarch.to.1layhid5      2.655991e+00
## arrival_date_monthMay.to.1layhid5       -2.664765e+00
## arrival_date_monthNovember.to.1layhid5   2.470168e+00
## arrival_date_monthOctober.to.1layhid5    3.309031e+00
## arrival_date_monthSeptember.to.1layhid5 -1.855909e+00
## adults.L.to.1layhid5                     8.273229e-01
## adults.Q.to.1layhid5                    -2.906310e+00
## adults.C.to.1layhid5                     1.805033e+00
## meal.L.to.1layhid5                       2.626600e+00
## meal.Q.to.1layhid5                       6.048585e-01
## distribution_channelDirect.to.1layhid5  -4.929594e+00
## distribution_channelTATO.to.1layhid5     4.274756e-01
## booking_changes.to.1layhid5             -2.044103e+01
## deposit_typeNonRefund.to.1layhid5        3.434228e+01
## deposit_typeRefundable.to.1layhid5       1.869963e+01
## customer_typeGroup.to.1layhid5           1.786562e+02
## customer_typeTransient.to.1layhid5      -4.634708e-01
## lead_time_box.to.1layhid5               -1.035333e+00
## kidsyes.to.1layhid5                     -2.201244e+00
## cancyes.to.1layhid5                     -1.839631e+02
## adr_box.to.1layhid5                     -4.147092e+00
## parkingyes.to.1layhid5                   1.854793e+02
## requestsA1.to.1layhid5                   5.033844e+00
## requestsA2.to.1layhid5                   3.886092e-01
## requestsA3_.to.1layhid5                  2.060406e+02
## days.to.1layhid5                        -8.606483e+00
## Intercept.to.1layhid6                   -3.628445e+00
## hotelResortHotel.to.1layhid6            -1.139338e+00
## arrival_date_monthAugust.to.1layhid6    -1.289767e-01
## arrival_date_monthDecember.to.1layhid6   4.377645e+00
## arrival_date_monthFebruary.to.1layhid6   5.862540e-01
## arrival_date_monthJanuary.to.1layhid6    6.668938e-01
## arrival_date_monthJuly.to.1layhid6      -1.150463e+00
## arrival_date_monthJune.to.1layhid6      -1.121599e+00
## arrival_date_monthMarch.to.1layhid6      7.064627e-01
## arrival_date_monthMay.to.1layhid6       -6.540200e-01
## arrival_date_monthNovember.to.1layhid6   1.037043e+00
## arrival_date_monthOctober.to.1layhid6   -6.691655e-02
## arrival_date_monthSeptember.to.1layhid6 -9.428782e-04
## adults.L.to.1layhid6                    -2.565231e+00
## adults.Q.to.1layhid6                     3.984700e-01
## adults.C.to.1layhid6                    -9.617120e-02
## meal.L.to.1layhid6                       8.155688e-01
## meal.Q.to.1layhid6                      -2.023907e-01
## distribution_channelDirect.to.1layhid6   2.920532e-01
## distribution_channelTATO.to.1layhid6     5.268996e-01
## booking_changes.to.1layhid6             -8.155330e+00
## deposit_typeNonRefund.to.1layhid6       -7.080431e+02
## deposit_typeRefundable.to.1layhid6      -2.869864e+02
## customer_typeGroup.to.1layhid6          -5.091078e+02
## customer_typeTransient.to.1layhid6       4.740407e-01
## lead_time_box.to.1layhid6                1.148329e+00
## kidsyes.to.1layhid6                     -3.405187e-01
## cancyes.to.1layhid6                     -1.209551e+01
## adr_box.to.1layhid6                      6.371121e+00
## parkingyes.to.1layhid6                  -5.121759e+02
## requestsA1.to.1layhid6                   3.553015e-01
## requestsA2.to.1layhid6                  -1.295809e+00
## requestsA3_.to.1layhid6                  5.912159e-01
## days.to.1layhid6                         1.096411e+01
## Intercept.to.2layhid1                   -6.505797e-01
## 1layhid1.to.2layhid1                     1.549556e+00
## 1layhid2.to.2layhid1                    -1.213997e+00
## 1layhid3.to.2layhid1                    -1.343260e+00
## 1layhid4.to.2layhid1                    -4.403113e-01
## 1layhid5.to.2layhid1                     5.095590e+00
## 1layhid6.to.2layhid1                    -1.168404e+01
## Intercept.to.2layhid2                   -1.401423e+00
## 1layhid1.to.2layhid2                    -5.702424e-01
## 1layhid2.to.2layhid2                     1.768272e+00
## 1layhid3.to.2layhid2                     1.171244e+00
## 1layhid4.to.2layhid2                     1.570198e+00
## 1layhid5.to.2layhid2                    -1.676968e+00
## 1layhid6.to.2layhid2                     2.178951e+00
## Intercept.to.2layhid3                    1.890084e+00
## 1layhid1.to.2layhid3                     1.687252e+00
## 1layhid2.to.2layhid3                    -2.458881e+00
## 1layhid3.to.2layhid3                    -2.244527e+00
## 1layhid4.to.2layhid3                    -1.699384e+00
## 1layhid5.to.2layhid3                     1.438998e+00
## 1layhid6.to.2layhid3                    -3.859045e+00
## Intercept.to.2layhid4                    7.931987e-01
## 1layhid1.to.2layhid4                     2.250721e-01
## 1layhid2.to.2layhid4                    -1.724975e-01
## 1layhid3.to.2layhid4                    -1.177952e+00
## 1layhid4.to.2layhid4                    -2.314229e+00
## 1layhid5.to.2layhid4                     2.453100e+00
## 1layhid6.to.2layhid4                    -2.551598e+00
## Intercept.to.is_canceled                 9.473638e-01
## 2layhid1.to.is_canceled                 -2.097638e+01
## 2layhid2.to.is_canceled                  3.418909e+02
## 2layhid3.to.is_canceled                 -3.463026e+02
## 2layhid4.to.is_canceled                 -9.302619e+01
plot(nn2, rep = 'best')

nn2_pred <- compute(nn2, hotel_train_mtx[, -1])


confusionMatrix(as.numeric((nn2_pred$net.result > 0.5)) %>% as.factor(),
                    as.factor(ifelse(hotel_train$is_canceled == "Yes", 1, 0)))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 42629 11771
##          1  8073 18916
##                                           
##                Accuracy : 0.7562          
##                  95% CI : (0.7532, 0.7591)
##     No Information Rate : 0.623           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4683          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8408          
##             Specificity : 0.6164          
##          Pos Pred Value : 0.7836          
##          Neg Pred Value : 0.7009          
##              Prevalence : 0.6230          
##          Detection Rate : 0.5238          
##    Detection Prevalence : 0.6684          
##       Balanced Accuracy : 0.7286          
##                                           
##        'Positive' Class : 0               
## 
nn2_roc <- roc(as.numeric(hotel_train$is_canceled == "Yes"), 
            compute(nn2, hotel_train_mtx[, -1])$net.result[, 1])

The second layer didn’t change much compared with the previous model. As always, we will check the predictions on the testing sample.

nn2_test_pred <- compute(nn2, hotel_test_mtx[, -1])


confusionMatrix(as.numeric((nn2_test_pred$net.result > 0.5)) %>% as.factor(),
                as.factor(ifelse(hotel_test$is_canceled == "Yes", 1, 0)))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 17950  5182
##          1  3919  7854
##                                           
##                Accuracy : 0.7393          
##                  95% CI : (0.7346, 0.7439)
##     No Information Rate : 0.6265          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4317          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8208          
##             Specificity : 0.6025          
##          Pos Pred Value : 0.7760          
##          Neg Pred Value : 0.6671          
##              Prevalence : 0.6265          
##          Detection Rate : 0.5143          
##    Detection Prevalence : 0.6627          
##       Balanced Accuracy : 0.7116          
##                                           
##        'Positive' Class : 0               
## 
nn2_test_roc <- roc(as.numeric(hotel_test$is_canceled == "Yes"), 
                    compute(nn2, hotel_test_mtx[, -1])$net.result[, 1])

Almost the same results, and the same conclusions. Of these models the first one was definitely the best, although, judging by the ROC curves and Gini coefficients presented on the graph below, its performance is still much worse than, for example, the XGBoost models from the previous part of the analysis.

list(
  nn0_roc   = nn0_roc,
  nn0_test_roc   = nn0_test_roc,
  nn1_roc   = nn1_roc,
  nn1_test_roc   = nn1_test_roc,
  nn2_roc   = nn2_roc,
  nn2_test_roc   = nn2_test_roc
) %>%
  pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) + 
  geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), 
               color = "grey", 
               linetype = "dashed") +
  labs(subtitle = c(paste("Gini: ",
                         "nn0 train = ", 
                         round(100 * (2 * auc(nn0_roc) - 1), 1), "%, ",
                         "nn0 test = ", 
                         round(100 * (2 * auc(nn0_test_roc) - 1), 1), "%, ",
                         "nn1 train = ", 
                         round(100 * (2 * auc(nn1_roc) - 1), 1), "%, ",
                         "nn1 test = ", 
                         round(100 * (2 * auc(nn1_test_roc) - 1), 1), "%, \n",
                         "nn2 train = ", 
                         round(100 * (2 * auc(nn2_roc) - 1), 1), "%, ",
                         "nn2 test = ", 
                         round(100 * (2 * auc(nn2_test_roc) - 1), 1), "% "))) +
  theme_bw() + coord_fixed() +
  scale_color_brewer(palette = "Paired")

Final conclusion

So which model from our analysis was the best one? Let’s combine the results and have a look at them once again.

Results_train <- rbind(
  2 * auc(tree1_roc) - 1, 2 * auc(tree2_roc) - 1, 2 * auc(tree3_roc) - 1,
  2 * auc(sub_pred_roc) - 1, 2 * auc(xgb_roc) - 1, 2 * auc(xgb_roc2) - 1,
  2 * auc(xgb_roc3) - 1, 2 * auc(xgb_roc4) - 1, 2 * auc(xgb_roc5) - 1,
  2 * auc(xgb_roc8) - 1, 2 * auc(nn0_roc) - 1, 2 * auc(nn1_roc) - 1,
  2 * auc(nn2_roc) - 1)

Results_test <- rbind(
  2 * auc(tree1_test_roc) - 1, 2 * auc(tree2_test_roc) - 1,
  2 * auc(tree3_test_roc) - 1, 2 * auc(sub_test_pred_roc) - 1,
  2 * auc(xgb_test_roc) - 1, 2 * auc(xgb_test_roc2) - 1,
  2 * auc(xgb_test_roc3) - 1, 2 * auc(xgb_test_roc4) - 1,
  2 * auc(xgb_test_roc5) - 1, 2 * auc(xgb_test_roc8) - 1,
  2 * auc(nn0_test_roc) - 1, 2 * auc(nn1_test_roc) - 1,
  2 * auc(nn2_test_roc) - 1)

Results <- cbind(Results_train, Results_test)

rownames(Results) <- c("Tree 1 (default)", "Tree 2 (big nodes)", "Tree 3 (cv)", "Subagging of trees",
                       "XGBoost 1", "XGBoost 2", "XGBoost 3", "XGBoost 4", "XGBoost 5", "XGBoost 8", "Neural Nets (0 layers)", "Neural Nets (1 layer)", "Neural Nets (2 layers)")

colnames(Results) <- c("Gini - Training sample", "Gini - Testing sample")

Results
##                        Gini - Training sample Gini - Testing sample
## Tree 1 (default)                    0.6294704             0.6269355
## Tree 2 (big nodes)                  0.5869887             0.5856268
## Tree 3 (cv)                         0.8657437             0.7734859
## Subagging of trees                  0.8740883             0.7748227
## XGBoost 1                           0.6843431             0.6807081
## XGBoost 2                           0.8095594             0.7847885
## XGBoost 3                           0.8352334             0.7930082
## XGBoost 4                           0.8675867             0.8053538
## XGBoost 5                           0.8711783             0.8073450
## XGBoost 8                           0.8872846             0.8183211
## Neural Nets (0 layers)              0.6987147             0.6958622
## Neural Nets (1 layer)               0.5660880             0.5569689
## Neural Nets (2 layers)              0.5592981             0.5584587

Not every modification resulted in a better model. The neural networks perform best in their simplest version, but are still not good enough. The second tree is overall the worst of all the models created. The tree with cross-validation and the subagging of trees have high Gini coefficients on the training data, but on the testing data they are lower by about 10 percentage points, although still quite high. The best predictions most likely come from the XGBoost models. The better an XGBoost model predicted on the training sample, the better it also predicted on the testing sample, but the gap between the two grew steadily, so the best-performing XGBoost models may be overfitting. If we want to be confident that a model will keep its test predictions close to the training level, we might prefer the second or third XGBoost model, because the differences in the testing Gini coefficient among all boosting models apart from the first one are very small.
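To quantify that gap directly, we can subtract the two columns of the results table; a quick sketch:

# train-test Gini gap per model; larger values suggest stronger overfitting
overfit_gap <- Results[, "Gini - Training sample"] - Results[, "Gini - Testing sample"]
round(sort(overfit_gap, decreasing = TRUE), 3)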

All in all, it is not always true that a more complicated model will perform better than a simple one, but it often is. We may then need to choose between explainability and predictive performance. It is up to the analyst to understand the specific case at hand and to decide whether explainability can be sacrificed in favor of a model that is better fitted to the data.