The second part of the project will be based on the dataset concerning hotel bookings. The data was retrieved from Kaggle and is available here. It consists of almost 120 thousand observations and 32 variables. Wider description of variables as well as details about the whole data collection can be found in the article here. The aim of this analysis is to obtain good predictions of whether the reservations were cancelled or not. After some initial preparation of variables, including their selection and transformations, multiple models are performed to achieve the goal.
library(Rmisc)
library(dplyr)
library(ggplot2)
library(bestNormalize)
library(caret)
library(corrplot)
library(stargazer)
library(rpart)
library(rattle)
library(pROC)
library(neuralnet)
library(here)
hotel <- read.csv2('hotel_bookings.csv', sep =',', dec = '.',
encoding = 'UTF-8', header = T, stringsAsFactors = T)
glimpse(hotel)
## Rows: 119,390
## Columns: 32
## $ hotel <fct> Resort Hotel, Resort Hotel, Resort Hote~
## $ is_canceled <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, ~
## $ lead_time <int> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75, ~
## $ arrival_date_year <int> 2015, 2015, 2015, 2015, 2015, 2015, 201~
## $ arrival_date_month <fct> July, July, July, July, July, July, Jul~
## $ arrival_date_week_number <int> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,~
## $ arrival_date_day_of_month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ stays_in_weekend_nights <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ stays_in_week_nights <int> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, ~
## $ adults <int> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, ~
## $ children <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ babies <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ meal <fct> BB, BB, BB, BB, BB, BB, BB, FB, BB, HB,~
## $ country <fct> PRT, PRT, GBR, GBR, GBR, GBR, PRT, PRT,~
## $ market_segment <fct> Direct, Direct, Direct, Corporate, Onli~
## $ distribution_channel <fct> Direct, Direct, Direct, Corporate, TA/T~
## $ is_repeated_guest <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ previous_cancellations <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ previous_bookings_not_canceled <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ reserved_room_type <fct> C, C, A, A, A, A, C, C, A, D, E, D, D, ~
## $ assigned_room_type <fct> C, C, C, A, A, A, C, C, A, D, E, D, E, ~
## $ booking_changes <int> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ deposit_type <fct> No Deposit, No Deposit, No Deposit, No ~
## $ agent <fct> NULL, NULL, NULL, 304, 240, 240, NULL, ~
## $ company <fct> NULL, NULL, NULL, NULL, NULL, NULL, NUL~
## $ days_in_waiting_list <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ customer_type <fct> Transient, Transient, Transient, Transi~
## $ adr <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00,~
## $ required_car_parking_spaces <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ total_of_special_requests <int> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3, ~
## $ reservation_status <fct> Check-Out, Check-Out, Check-Out, Check-~
## $ reservation_status_date <fct> 2015-07-01, 2015-07-01, 2015-07-02, 201~
Our target variable is is_canceled. Besides that there are many variables with details about the reservation, customers and distribution of the specific offer. There is also a variable reservation_status which is generally duplicating our target so we won’t use it or reservation_status_date at all.
To be able to check the correctness of our predictions we need to start with splitting the dataset into two parts. Train sample consisting of 70% of all observations will be used to check if any data preparation is needed and in the process of training our models, while the remaining 30% of obervations will be used as a test sample only to assess the efficiency of our predictions on the new, unused part of data. To be able to replicate the results of the whole analysis, a specified seed will be used where necessary.
set.seed(123456789)
training_obs <- createDataPartition(hotel$is_canceled,
p = 0.7,
list = FALSE)
hotel_train <- hotel[training_obs,]
hotel_test <- hotel[-training_obs,]
nrow(hotel_train)
## [1] 83573
nrow(hotel_test)
## [1] 35817
Train data consists of over 30 thousand observations of cancelled reservations and over 52 thousand observations of bookings that weren’t cancelled.
table(hotel_train$is_canceled)
##
## 0 1
## 52527 31046
hotel_train$is_canceled <- as.factor(hotel_train$is_canceled)
hotel_train$is_canceled <- recode(hotel_train$is_canceled, "0" = "No", "1" = "Yes")
ggplot(hotel_train,
aes(x = is_canceled)) +
geom_bar(fill = "orange",
color = 'black') +
theme_minimal()
summary(hotel_train$adr)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -6.38 69.30 94.95 101.90 126.00 5400.00
p1 <- ggplot(hotel_train,
aes(x = lead_time)) +
geom_histogram(fill = "orange",
bins = 100) +
theme_bw()
p2 <- ggplot(hotel_train,
aes(x = adr)) +
geom_histogram(fill = "orange",
bins = 300) +
theme_bw()
multiplot(p1, p2)
We can see that two continuous variables: lead_time and adr are skewed. We will use a Box-Cox transformation to change the distribution of those variables. adr has some negative values which means this transformation can’t be applied, however, adr is describing the average daily rate calculated by dividing the sum of all lodging transactions by the total number of staying nights and in this case should be a positive value by definition. We will treat those negative or equal to zero values as errors and change them to NAs first. In case of lead_time we will perform Box-Cox transformation previously adding 1 to all values to omit the problem with zeros.
# lead_time:
box <- boxcox(hotel_train$lead_time + 1)
box
## Standardized Box Cox Transformation with 83573 nonmissing obs.:
## Estimated statistics:
## - lambda = 0.2776842
## - mean (before standardization) = 7.822438
## - sd (before standardization) = 4.257422
hotel_train$lead_time_box <- box$x.t
p3 <- ggplot(hotel_train,
aes(x = lead_time_box)) +
geom_histogram(fill = "orange",
bins = 100) +
theme_bw()
# adr:
hotel_train[hotel_train$adr <=0,]$adr <- NA
box2 <- boxcox(hotel_train$adr)
box2
## Standardized Box Cox Transformation with 82189 nonmissing obs.:
## Estimated statistics:
## - lambda = 0.2807637
## - mean (before standardization) = 9.285888
## - sd (before standardization) = 1.627004
hotel_train$adr_box <- box2$x.t
p4 <- ggplot(hotel_train,
aes(x = adr_box)) +
geom_histogram(fill = "orange",
bins = 300) +
theme_bw()
In both cases the lambda parameter is similar, close to 0.28. The distribution of variables in the training sample after transformation looks as presented below. Maybe for lead_time it is not perfect, however we will see later how it affects the relation with our target variable.
multiplot(p3, p4)
Now that we have the objects with Box-Cox transformations prepared on the training data, let’s apply those transformations on the whole dataset, remembering once again about the non-positive values of adr.
# lead_time with lambda value obtained on the training sample
hotel$lead_time_box <- predict(box, newdata = hotel$lead_time + 1)
# adr with lambda value obtained on the training sample
hotel[hotel$adr <=0,]$adr <- NA
hotel$adr_box <- predict(box2, newdata = hotel$adr)
Many variables in this dataset are of a type of count data, with the majority of observations equal to first few possible values and not many observations for different, larger numbers. Because of that we will group those higher values to one, common level.
Variables stays_in_week_nights and stays_in_weekend_nights as well as children and babies will be merged respectively to days and kids so that we will have less predictors but we shouldn’t lose much of their predictive power. kids will also have grouped levels in the same way as explained before.
To avoid potential problems with models, we will also rename those ordered factor variables so that their level names won’t start with a number. Finally, we will make sure that all qualitative variables are coded as factors and we will order the ordinal ones. To avoid potential problems with models, we will also rename those ordered factor variables so that their level labels won’t start with a number.
# is_canceled
hotel$is_canceled <- as.factor(hotel$is_canceled)
hotel$is_canceled <- recode(hotel$is_canceled, "0" = "No", "1" = "Yes")
hotel$is_canceled <- factor(hotel$is_canceled, levels = c("Yes", "No"))
# arrival date
hotel$arrival_date_year <- factor(hotel$arrival_date_year, order = T, levels = c("2015", "2016", "2017"))
# days
hotel$days <- hotel$stays_in_week_nights + hotel$stays_in_weekend_nights
hotel[hotel$days == 0,]$days <- NA
# adults
hotel$adults <- ifelse(hotel$adults > 2, "A3+", as.factor(hotel$adults))
hotel$adults <- recode_factor(hotel$adults, "1" = "A0", "2" = "A1", "3" = "A2")
hotel$adults <- factor(hotel$adults, order = T, levels = c("A0", "A1", "A2", "A3+"))
# kids
sum(hotel_train$children>0 | hotel_train$babies>0, na.rm = T)
## [1] 6675
hotel$kids <- ifelse(hotel$children>0 | hotel$babies>0, "yes", 'no')
hotel$kids <- as.factor(hotel$kids)
# meal
hotel[hotel$meal == "Undefined",]$meal <- NA
hotel$meal <- recode_factor(hotel$meal, "FB" = "FB/HB", "HB" = "FB/HB")
hotel$meal <- factor(hotel$meal, order = T, levels = c("SC", "BB", "FB/HB"))
# distribution_chanel
hotel[hotel$distribution_channel == "GDS",]$distribution_channel <- "TA/TO"
hotel[hotel$distribution_channel == "Undefined",]$distribution_channel <- NA
hotel$distribution_channel <- droplevels(hotel$distribution_channel)
# is_repeate_guest
hotel$is_repeated_guest <- as.factor(hotel$is_repeated_guest)
# previous cancellations
hotel$canc <- ifelse(hotel$previous_cancellations > 0, 'yes', 'no')
hotel$canc <- as.factor(hotel$canc)
# parking space
hotel$parking <- ifelse(hotel$required_car_parking_spaces > 0, 'yes', 'no')
hotel$parking <- as.factor(hotel$parking)
# special requests
hotel$requests <- ifelse(hotel$total_of_special_requests > 2, "A3+", as.factor(hotel$total_of_special_requests))
hotel$requests <- recode_factor(hotel$requests, "1" = "A0", "2" = "A1", "3" = "A2")
Now when we have everything transformed on the whole data we can easily apply those changes to train and test sample by assigning once again the same observations to our samples.
hotel_train <- hotel[training_obs,]
hotel_test <- hotel[-training_obs,]
hotel_train %>%
is.na() %>%
colSums() %>%
sort()
## hotel is_canceled
## 0 0
## lead_time arrival_date_year
## 0 0
## arrival_date_month arrival_date_week_number
## 0 0
## arrival_date_day_of_month stays_in_weekend_nights
## 0 0
## stays_in_week_nights adults
## 0 0
## babies country
## 0 0
## market_segment is_repeated_guest
## 0 0
## previous_cancellations previous_bookings_not_canceled
## 0 0
## reserved_room_type assigned_room_type
## 0 0
## booking_changes deposit_type
## 0 0
## agent company
## 0 0
## days_in_waiting_list customer_type
## 0 0
## required_car_parking_spaces total_of_special_requests
## 0 0
## reservation_status reservation_status_date
## 0 0
## lead_time_box canc
## 0 0
## parking requests
## 0 0
## children kids
## 3 3
## distribution_channel days
## 4 500
## meal adr
## 824 1384
## adr_box
## 1384
hotel_test %>%
is.na() %>%
colSums() %>%
sort()
## hotel is_canceled
## 0 0
## lead_time arrival_date_year
## 0 0
## arrival_date_month arrival_date_week_number
## 0 0
## arrival_date_day_of_month stays_in_weekend_nights
## 0 0
## stays_in_week_nights adults
## 0 0
## babies country
## 0 0
## market_segment is_repeated_guest
## 0 0
## previous_cancellations previous_bookings_not_canceled
## 0 0
## reserved_room_type assigned_room_type
## 0 0
## booking_changes deposit_type
## 0 0
## agent company
## 0 0
## days_in_waiting_list customer_type
## 0 0
## required_car_parking_spaces total_of_special_requests
## 0 0
## reservation_status reservation_status_date
## 0 0
## lead_time_box canc
## 0 0
## parking requests
## 0 0
## children distribution_channel
## 1 1
## kids days
## 1 215
## meal adr
## 345 576
## adr_box
## 576
Now we can also deal with the missing data. There aren’t many missing values comparing to the size of the dataset and we will simply remove those observations from our analysis, especially that the come mostly from the erroneous values removed earlier.
hotel_train <- na.omit(hotel_train)
hotel_test <- na.omit(hotel_test)
After transformations, descriptive analysis of numerical variables presents as follows.
numeric_vars <- sapply(hotel_train, is.numeric) %>% which() %>% names()
stargazer(
hotel_train[, numeric_vars], type = "text",
summary.stat = c("min", "p25", "median", "p75", "max", "mean", "sd")
)
##
## ======================================================================================
## Statistic Min Pctl(25) Median Pctl(75) Max Mean St. Dev.
## --------------------------------------------------------------------------------------
## lead_time 0 19 71 162 709 105.477 107.127
## arrival_date_week_number 1 16 28 38 53 27.219 13.496
## arrival_date_day_of_month 1 8 16 23 31 15.757 8.788
## stays_in_weekend_nights 0 0 1 2 19 0.938 0.998
## stays_in_week_nights 0 1 2 3 50 2.518 1.899
## children 0 0 0 0 10 0.107 0.405
## babies 0 0 0 0 10 0.008 0.102
## previous_cancellations 0 0 0 0 26 0.086 0.842
## previous_bookings_not_canceled 0 0 0 0 72 0.126 1.449
## booking_changes 0 0 0 0 18 0.215 0.626
## days_in_waiting_list 0 0 0 0 391 2.251 17.504
## adr 0 71 95 126 5,400 103.694 50.199
## required_car_parking_spaces 0 0 0 0 8 0.063 0.245
## total_of_special_requests 0 0 0 1 5 0.577 0.795
## lead_time_box -1.837 -0.740 0.090 0.797 2.553 0.016 0.995
## adr_box -6.397 -0.652 -0.034 0.614 16.549 0.002 0.999
## days 1 2 3 4 69 3.456 2.543
## --------------------------------------------------------------------------------------
In the process of variable selection we will look at first at correlations between potential predictors among continuous variables.
numeric_vars <- sapply(hotel_train, is.numeric) %>% which() %>% names()
correlations <- cor(hotel_train[,numeric_vars],
use = "pairwise.complete.obs")
corrplot(correlations)
There is almost no correlation between the majority of variables, which is generally a good thing. The only exceptions are the variables that we created from existing ones like adr and adr_box or length of stay in week or weekend and days. Those two original variables concerning the legth of stay are also correlated with each other, but we have already solved this problem in a way by using them to create a new predictor.
anova <- function(numeric_var) {
anova_ <- aov(hotel_train[[numeric_var]] ~
hotel_train$is_canceled)
return(summary(anova_)[[1]][1, 4])
}
sapply(numeric_vars,
anova) %>%
sort(decreasing = TRUE) -> anova_all
anova_all
## lead_time_box lead_time
## 9506.962099 7636.312969
## total_of_special_requests required_car_parking_spaces
## 5029.214138 3363.168359
## booking_changes previous_cancellations
## 1757.466498 995.675062
## previous_bookings_not_canceled days_in_waiting_list
## 261.302358 250.447236
## adr_box babies
## 167.290462 93.903786
## adr stays_in_week_nights
## 88.013989 37.886082
## days arrival_date_week_number
## 17.429446 10.182337
## arrival_date_day_of_month children
## 3.731515 1.277450
## stays_in_weekend_nights
## 1.146784
The majority of variables have pretty high F statistics which means our variables should have the significant effect on the dependent variable. It is a good sign that Box-Cox transformations increased the F statistics in both case so we will use those transformed versions of predictors. Some predictors, like day of month or week, may not have the strongest effect on booking cancellations so we may reduce the number of variables, especially the ones which don’t make much sense.
DescTools::CramerV(hotel_train$is_canceled,
hotel_train$hotel)
## [1] 0.1374975
DescTools::CramerV(hotel_train$is_canceled,
hotel_train$arrival_date_month)
## [1] 0.07273162
DescTools::CramerV(hotel_train$is_canceled,
hotel_train$arrival_date_year)
## [1] 0.02471176
DescTools::CramerV(hotel_train$is_canceled,
hotel_train$adults)
## [1] 0.08177464
DescTools::CramerV(hotel_train$is_canceled,
hotel_train$meal)
## [1] 0.009971023
DescTools::CramerV(hotel_train$is_canceled,
hotel_train$distribution_channel)
## [1] 0.1725203
DescTools::CramerV(hotel_train$is_canceled,
hotel_train$is_repeated_guest)
## [1] 0.0763397
DescTools::CramerV(hotel_train$is_canceled,
hotel_train$deposit_type)
## [1] 0.4792627
DescTools::CramerV(hotel_train$is_canceled,
hotel_train$customer_type)
## [1] 0.1344559
DescTools::CramerV(hotel_train$is_canceled,
hotel_train$kids)
## [1] 0.01462105
DescTools::CramerV(hotel_train$is_canceled,
hotel_train$canc)
## [1] 0.2708739
DescTools::CramerV(hotel_train$is_canceled,
hotel_train$parking)
## [1] 0.2007072
DescTools::CramerV(hotel_train$is_canceled,
hotel_train$requests)
## [1] 0.273077
Let’s also check the Cramer’s coefficients between the target variable and qualitative predictors. Some of them are not impressive, including arrival date details or type of meals. On the other hand, there are some results consistent with expectations e.g. high coefficient for the deposit type.
Finally we will analyse the variance of the variables.
nearZeroVar(hotel_train, saveMetrics = TRUE)
## freqRatio percentUnique zeroVar nzv
## hotel 2.034299 0.002457335 FALSE FALSE
## is_canceled 1.652231 0.002457335 FALSE FALSE
## lead_time 1.797619 0.581159616 FALSE FALSE
## arrival_date_year 1.392310 0.003686002 FALSE FALSE
## arrival_date_month 1.099726 0.014744007 FALSE FALSE
## arrival_date_week_number 1.150000 0.065119365 FALSE FALSE
## arrival_date_day_of_month 1.033642 0.038088685 FALSE FALSE
## stays_in_weekend_nights 1.523727 0.018430009 FALSE FALSE
## stays_in_week_nights 1.109905 0.038088685 FALSE FALSE
## adults 4.003644 0.004914669 FALSE FALSE
## children 22.109155 0.006143336 FALSE TRUE
## babies 131.747145 0.006143336 FALSE TRUE
## meal 6.024735 0.003686002 FALSE FALSE
## country 3.887398 0.213788104 FALSE FALSE
## market_segment 2.388117 0.008600671 FALSE FALSE
## distribution_channel 7.083482 0.003686002 FALSE FALSE
## is_repeated_guest 34.006022 0.002457335 FALSE TRUE
## previous_cancellations 18.325631 0.017201342 FALSE FALSE
## previous_bookings_not_canceled 76.793003 0.074948703 FALSE TRUE
## reserved_room_type 4.426225 0.011058005 FALSE FALSE
## assigned_room_type 2.901870 0.013515340 FALSE FALSE
## booking_changes 8.196429 0.019658676 FALSE FALSE
## deposit_type 7.043547 0.003686002 FALSE FALSE
## agent 2.112131 0.383344187 FALSE FALSE
## company 122.651274 0.379658185 FALSE TRUE
## days_in_waiting_list 484.147239 0.144982737 FALSE TRUE
## customer_type 3.675681 0.004914669 FALSE FALSE
## adr 1.376190 9.136369780 FALSE FALSE
## required_car_parking_spaces 15.089974 0.006143336 FALSE FALSE
## total_of_special_requests 2.077581 0.007372004 FALSE FALSE
## reservation_status 1.696911 0.003686002 FALSE FALSE
## reservation_status_date 1.753982 1.121773213 FALSE FALSE
## lead_time_box 1.797619 0.581159616 FALSE FALSE
## adr_box 1.376190 9.136369780 FALSE FALSE
## days 1.022599 0.046689356 FALSE FALSE
## kids 11.429597 0.002457335 FALSE FALSE
## canc 17.256842 0.002457335 FALSE FALSE
## parking 15.024611 0.002457335 FALSE FALSE
## requests 2.077581 0.004914669 FALSE FALSE
There are some variables with the variance close to zero. We didn’t even try to merge levels of some of them because of one level dominance and we won’t use them (is_repeated_guest, previous_bookings_not_canceled, days_in_waiting_list…) but we successfully eliminated this problem creating for example kids predictor from children and babies.
Taking into consideration full information obtained about the variables we decide to leave 15 of them. We will use them as predictors to train models. We will define a common formula to be used later.
model_formula <- is_canceled ~ hotel + arrival_date_month + adults + meal +
distribution_channel + booking_changes + deposit_type + customer_type +
lead_time_box + kids + canc + adr_box + parking + requests + days
We will start with the classification trees. They are very similar to the regression trees. For classification, the most frequent value in a group is used to obtain prediction of the target variable. We will begin the modelling with the default tree from rpart function.
tree1 <-
rpart(model_formula,
data = hotel_train,
method = "class")
fancyRpartPlot(tree1, cex = 0.6, palettes = 'OrRd')
tree1_pred <- predict(tree1, hotel_train)
tree1_roc <- roc(as.numeric(hotel_train$is_canceled == "Yes"),
tree1_pred[, 1])
cat("Gini = ", 2 * tree1_roc$auc - 1, "\n", sep = "")
## Gini = 0.6294704
We end with 9 terminal nodes. Some of them contains only a little part of all observations from training sample.
barplot(rev(tree1$variable.importance),
col = "#E15D44",
main = "Imporatnce of variables for tree1",
horiz = T,
las = 1,
cex.names = 0.6)
Main advantage of classification trees is that not only we can see which variables contributed to the splits but also we can easily compute the variable importance. One of the measures, based on the quality of the splits, is kept in the model so it’s quickly accessible. Differences in the variable importance are huge for this tree. Deposit_type is far ahead of any other predictor. On the other hand adults and meal seem to be unimportant at all.
We can try to limit the splits with additional parameter so that the nodes will be bigger. Therefore our tree may become more universal and robust for data changes. Gini coefficient is equal to 62.9%.
tree2 <-
rpart(model_formula,
data = hotel_train,
method = "class",
minsplit = 10000,
maxdepth = 20)
fancyRpartPlot(tree2, cex = 0.6, palettes = 'OrRd')
tree2_pred <- predict(tree2, hotel_train)
tree2_roc <- roc(as.numeric(hotel_train$is_canceled == "Yes"),
tree2_pred[, 1])
cat("Gini = ", 2 * tree2_roc$auc - 1, "\n", sep = "")
## Gini = 0.5869887
We reduced the number of terminal nodes to seven, in some cases more numerous groups. Unfortunately, the Gini coefficient is significantly lower than for the previous tree.
barplot(rev(tree2$variable.importance),
col = "#E15D44",
main = "Imporatnce of variables for tree2",
horiz = T,
las = 1,
cex.names = 0.6)
As a result of lower number of splits, some variables lost their importance comparing to the previous tree and some changes in the order of importance is visible.
For the next tree we will try to find the optimal complexity parameter. In order to do this we will use a cross-validation with 10 folds.
# setting control
cv_10 <- trainControl(method = "cv",
number = 10,
classProbs = TRUE,
summaryFunction = twoClassSummary)
cp.grid <- expand.grid(cp = seq(0, 0.01, 0.0005))
# training
set.seed(123456789)
tree3 <-
train(model_formula,
data = hotel_train,
method = "rpart",
metric = "ROC",
trControl = cv_10,
tuneGrid = cp.grid)
plot(tree3)
Without doubt, 0 is the best value of complexity parameter. For any positive values, AUC obtained during cross-validation is much lower.
tree3_pred <- predict(tree3, hotel_train, type = 'prob')
tree3_roc <- roc(as.numeric(hotel_train$is_canceled == "Yes"),
tree3_pred[, 1])
cat("Gini = ", 2 * tree3_roc$auc - 1, "\n", sep = "")
## Gini = 0.8657437
This time the Gini coefficient is much, much higher than for the previous trees. It seems that we managed to improve our model with the complexity parameter change.
#train
list(
tree1_roc = tree1_roc,
tree2_roc = tree2_roc,
tree3_roc = tree3_roc
) %>%
pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) +
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1),
color = "grey",
linetype = "dashed") +
labs(subtitle = paste0("Gini TRAIN: ",
"tree1 = ",
round(100*(2 * auc(tree1_roc) - 1), 1), "%, ",
"tree2 = ",
round(100*(2 * auc(tree2_roc) - 1), 1), "% ",
"tree3 = ",
round(100*(2 * auc(tree3_roc) - 1), 1), "% ")) +
theme_bw() + coord_fixed() +
scale_color_manual(values = RColorBrewer::brewer.pal(n = 4,
name = "Dark2")[c(1:3)])
Graphical representation of ROC curves confirms our conclusions. The difference between two first and the third model are surprisingly big. It is a perfect example how even a single change may result in completely different predictions.
#test
tree1_test_pred <- predict(tree1, hotel_test)
tree1_test_roc <- roc(as.numeric(hotel_test$is_canceled == "Yes"),
tree1_test_pred[, 1])
tree2_test_pred <- predict(tree2, hotel_test)
tree2_test_roc <- roc(as.numeric(hotel_test$is_canceled == "Yes"),
tree2_test_pred[, 1])
tree3_test_pred <- predict(tree3, hotel_test, type = 'prob')
tree3_test_roc <- roc(as.numeric(hotel_test$is_canceled == "Yes"),
tree3_test_pred[, 1])
list(
tree1_test_roc = tree1_test_roc,
tree2_test_roc = tree2_test_roc,
tree3_test_roc = tree3_test_roc
) %>%
pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) +
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1),
color = "grey",
linetype = "dashed") +
labs(subtitle = paste0("Gini TEST: ",
"tree1 = ",
round(100*(2 * auc(tree1_test_roc) - 1), 1), "%, ",
"tree2 = ",
round(100*(2 * auc(tree2_test_roc) - 1), 1), "% ",
"tree3 = ",
round(100*(2 * auc(tree3_test_roc) - 1), 1), "% ")) +
theme_bw() + coord_fixed() +
scale_color_manual(values = RColorBrewer::brewer.pal(n = 4,
name = "Dark2")[c(1:3)])
Results on the test sample are very similar, although the Gini coefficient is much lower this time. It still makes the best predictions but it seems that an overfitting appeared. On the other hand, first and second tree have almost the same quality of predictions as on the training sample.
Next model will be created with the subagging of classification trees. It is a method of bagging in which a number of independent trees is trained on the simple samples from the training data. Then the predictions are made based on the most frequent value obtained from those trees. We will make 17 trees and each of them will be with complexity parameter equal to zero, so each of our trees will have parameters which we consider as optimal after our previous steps of modelling.
if(0) {results_sub <- list()
n <- nrow(hotel_train)
for (sample in 1:17) {
message(sample)
set.seed(1234 + sample)
data_sample <-
hotel_train[sample(x = 1:n,
size = n/2,
replace = FALSE),]
results_sub[[sample]] <- rpart(model_formula,
data = hotel_train,
method = "class",
cp = 0)
rm(data_sample)
}
saveRDS(object = results_sub,
file = here("output", "results_sub.rds"))
}
results_sub <- readRDS(here("output", "results_sub.rds"))
sub_pred <-
sapply(results_sub,
function(x)
predict(object = x,
newdata = hotel_train,
type = 'prob')[,1])
hist(rowSums(sub_pred < 0.5),
breaks = 0:17,
main = "Frequency of votes for single observations")
Based on frequency of votes we can see that the results are rather determined.
sub_pred_res <-
ifelse(rowSums(sub_pred < 0.5) > 17/2,
"No", "Yes") %>%
factor(., levels = c("Yes", "No"))
confusionMatrix(data = sub_pred_res,
reference = hotel_train$is_canceled,
positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Yes No
## Yes 24624 3686
## No 6063 47016
##
## Accuracy : 0.8802
## 95% CI : (0.878, 0.8824)
## No Information Rate : 0.623
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7411
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8024
## Specificity : 0.9273
## Pos Pred Value : 0.8698
## Neg Pred Value : 0.8858
## Prevalence : 0.3770
## Detection Rate : 0.3025
## Detection Prevalence : 0.3478
## Balanced Accuracy : 0.8649
##
## 'Positive' Class : Yes
##
sub_pred2 <- rowMeans(sub_pred)
sub_pred_roc <-
roc(as.numeric(hotel_train$is_canceled == "Yes"),
sub_pred2)
cat("Gini = ", 2 * sub_pred_roc$auc - 1, "\n", sep = "")
## Gini = 0.8740883
Gini coefficient is a little bit higher than for previous tree. We can check some additional measures which are also high and confirm that at least on the training sample this model produces very good results.
#test
sub_test_pred <-
sapply(results_sub,
function(x)
predict(object = x,
newdata = hotel_test,
type = 'prob')[,1])
sub_test_pred_res <-
ifelse(rowSums(sub_test_pred < 0.5) > 17/2,
"No", "Yes") %>%
factor(., levels = c("Yes", "No"))
confusionMatrix(data = sub_test_pred_res,
reference = hotel_test$is_canceled,
positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Yes No
## Yes 9412 2773
## No 3624 19096
##
## Accuracy : 0.8167
## 95% CI : (0.8126, 0.8208)
## No Information Rate : 0.6265
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6032
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.7220
## Specificity : 0.8732
## Pos Pred Value : 0.7724
## Neg Pred Value : 0.8405
## Prevalence : 0.3735
## Detection Rate : 0.2696
## Detection Prevalence : 0.3491
## Balanced Accuracy : 0.7976
##
## 'Positive' Class : Yes
##
sub_test_pred2 <- rowMeans(sub_test_pred)
sub_test_pred_roc <-
roc(as.numeric(hotel_test$is_canceled == "Yes"),
sub_test_pred2)
cat("Gini = ", 2 * sub_test_pred_roc$auc - 1, "\n", sep = "")
## Gini = 0.7748227
On the test sample the accuracy and sensitivity are lower than expected. Similarly as in a tree from cross-validation, Gini coefficient indicates an overfitting which wasn’t reduced with the application of subagging method.
In the next step we will perform an XGBoost algorithm. It is a popular boosting technique with multiple options of parameter tuning. We will use here a cross-validation with 3 folds. Starting with some default values, we’ll create a process of parameter optimization, but we’ll also keep the majority of those models to be able to compare changes also on the testing sample. We will start with max depth equal to 10 because our data is pretty large, colsample_bytree = 0.26, min_child_weight = 800, subsample = 0.8 and gamma = 1. We’ll begin with searching for an optimal number of trees and learning rate.
cv3 <- trainControl(method = "cv",
number = 3,
classProbs = TRUE,
summaryFunction = twoClassSummary)
if(0) {
parameters_xgb <- expand.grid(nrounds = seq(20, 120, 10),
max_depth = c(10),
eta = c(0.2, 0.3, 0.4, 0.5),
gamma = 1,
colsample_bytree = c(0.26),
min_child_weight = c(800),
subsample = 0.8)
set.seed(123456789)
xgb <- train(model_formula,
data = hotel_train,
method = "xgbTree",
trControl = cv3,
tuneGrid = parameters_xgb)
saveRDS(object = xgb,
file = here("output", "xgb.rds"))
}
xgb <- readRDS(here("output", "xgb.rds"))
xgb
## eXtreme Gradient Boosting
##
## 81389 samples
## 15 predictor
## 2 classes: 'Yes', 'No'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 54259, 54259, 54260
## Resampling results across tuning parameters:
##
## eta nrounds ROC Sens Spec
## 0.2 20 0.8090920 0.4765536 0.9455840
## 0.2 30 0.8163986 0.5088800 0.9366296
## 0.2 40 0.8199860 0.5355362 0.9228827
## 0.2 50 0.8210700 0.5501678 0.9117982
## 0.2 60 0.8226554 0.5582168 0.9083861
## 0.2 70 0.8235115 0.5637892 0.9058813
## 0.2 80 0.8241339 0.5679604 0.9032582
## 0.2 90 0.8247259 0.5692313 0.9022918
## 0.2 100 0.8256364 0.5707629 0.9018974
## 0.2 110 0.8261797 0.5709258 0.9017594
## 0.2 120 0.8266851 0.5724574 0.9015819
## 0.3 20 0.8193347 0.5110307 0.9368663
## 0.3 30 0.8209637 0.5428357 0.9203580
## 0.3 40 0.8245537 0.5577280 0.9120350
## 0.3 50 0.8252689 0.5617688 0.9093724
## 0.3 60 0.8265035 0.5675042 0.9054870
## 0.3 70 0.8272350 0.5686773 0.9051517
## 0.3 80 0.8280981 0.5710887 0.9049545
## 0.3 90 0.8287677 0.5717079 0.9061379
## 0.3 100 0.8293300 0.5744452 0.9045403
## 0.3 110 0.8296615 0.5731743 0.9042839
## 0.3 120 0.8300857 0.5743800 0.9040078
## 0.4 20 0.8186528 0.5330270 0.9217194
## 0.4 30 0.8229878 0.5592922 0.9082483
## 0.4 40 0.8248488 0.5646039 0.9065127
## 0.4 50 0.8263786 0.5690357 0.9048954
## 0.4 60 0.8276312 0.5708606 0.9035739
## 0.4 70 0.8285538 0.5703718 0.9048559
## 0.4 80 0.8295042 0.5743800 0.9031992
## 0.4 90 0.8299923 0.5742497 0.9036528
## 0.4 100 0.8305197 0.5744126 0.9037711
## 0.4 110 0.8308545 0.5745104 0.9035739
## 0.4 120 0.8314847 0.5764656 0.9026272
## 0.5 20 0.8196316 0.5345586 0.9197863
## 0.5 30 0.8227172 0.5580865 0.9075974
## 0.5 40 0.8245893 0.5636589 0.9052108
## 0.5 50 0.8258704 0.5652882 0.9045995
## 0.5 60 0.8273242 0.5681885 0.9039091
## 0.5 70 0.8288709 0.5693942 0.9041261
## 0.5 80 0.8298985 0.5694920 0.9050531
## 0.5 90 0.8304029 0.5712517 0.9053095
## 0.5 100 0.8313275 0.5703718 0.9056053
## 0.5 110 0.8319734 0.5739238 0.9052108
## 0.5 120 0.8323342 0.5719686 0.9057631
##
## Tuning parameter 'max_depth' was held constant at a value of 10
##
## Tuning parameter 'min_child_weight' was held constant at a value of 800
##
## Tuning parameter 'subsample' was held constant at a value of 0.8
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 120, max_depth = 10, eta
## = 0.5, gamma = 1, colsample_bytree = 0.26, min_child_weight = 800
## and subsample = 0.8.
plot(xgb)
The function of number of trees to the obtained ROC is constantly growing for every learnign rate, so the more trees we have, the better may be the results. However, 120 trees, which is the highest number tested, are already much and increasing it even more will have a negative effect on the time needed to finish later computations, so let’s stay with 120 trees and learning rate equal to 0.5. Next, we’ll tune the max depth of a tree and minimum child weight
if(0) {parameters_xgb2 <- expand.grid(nrounds = 120,
max_depth = seq(6, 18, 2),
eta = c(0.5),
gamma = 1,
colsample_bytree = c(0.26),
min_child_weight = seq(100, 1000, 100),
subsample = 0.8)
set.seed(123456789)
xgb2 <- train(model_formula,
data = hotel_train,
method = "xgbTree",
trControl = cv3,
tuneGrid = parameters_xgb2)
saveRDS(object = xgb2,
file = here("output", "xgb2.rds"))
}
xgb2 <- readRDS(here("output", "xgb2.rds"))
xgb2
## eXtreme Gradient Boosting
##
## 81389 samples
## 15 predictor
## 2 classes: 'Yes', 'No'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 54259, 54259, 54260
## Resampling results across tuning parameters:
##
## max_depth min_child_weight ROC Sens Spec
## 6 100 0.8805259 0.6461368 0.9177744
## 6 200 0.8738655 0.6246619 0.9205160
## 6 300 0.8632513 0.6140711 0.9140074
## 6 400 0.8535837 0.5971584 0.9139285
## 6 500 0.8506938 0.5968325 0.9105163
## 6 600 0.8422932 0.5825920 0.9119759
## 6 700 0.8342798 0.5832437 0.9018382
## 6 800 0.8314463 0.5774758 0.9027258
## 6 900 0.8274910 0.5795614 0.8952900
## 6 1000 0.8235973 0.5657770 0.9004181
## 8 100 0.8823585 0.6537622 0.9160782
## 8 200 0.8739957 0.6273666 0.9174392
## 8 300 0.8652073 0.6148532 0.9159797
## 8 400 0.8567313 0.6063806 0.9152499
## 8 500 0.8517939 0.5925310 0.9148949
## 8 600 0.8432196 0.5851338 0.9120351
## 8 700 0.8345534 0.5768241 0.9074790
## 8 800 0.8323048 0.5698178 0.9078931
## 8 900 0.8277140 0.5700459 0.9001025
## 8 1000 0.8242032 0.5703718 0.8968286
## 10 100 0.8827999 0.6518395 0.9170052
## 10 200 0.8754674 0.6344054 0.9167095
## 10 300 0.8607888 0.6117900 0.9133762
## 10 400 0.8555159 0.6023723 0.9153682
## 10 500 0.8455640 0.5865024 0.9137312
## 10 600 0.8367524 0.5841887 0.9032189
## 10 700 0.8346780 0.5762049 0.9096487
## 10 800 0.8327760 0.5764330 0.9049544
## 10 900 0.8308141 0.5703718 0.9083865
## 10 1000 0.8236702 0.5715124 0.8941856
## 12 100 0.8848718 0.6631798 0.9129225
## 12 200 0.8739660 0.6300388 0.9165122
## 12 300 0.8652702 0.6114316 0.9152499
## 12 400 0.8584400 0.6033500 0.9149540
## 12 500 0.8487258 0.5950077 0.9121928
## 12 600 0.8418636 0.5894027 0.9068479
## 12 700 0.8342570 0.5740867 0.9079522
## 12 800 0.8312461 0.5776387 0.9013450
## 12 900 0.8298294 0.5678300 0.9081691
## 12 1000 0.8229520 0.5723270 0.8962171
## 14 100 0.8841096 0.6623326 0.9148948
## 14 200 0.8743949 0.6338515 0.9179913
## 14 300 0.8588259 0.6054029 0.9142835
## 14 400 0.8584950 0.6042624 0.9166306
## 14 500 0.8434333 0.5854271 0.9099641
## 14 600 0.8419050 0.5906084 0.9041260
## 14 700 0.8347057 0.5806042 0.9048558
## 14 800 0.8305610 0.5757813 0.9020947
## 14 900 0.8291469 0.5691009 0.9055657
## 14 1000 0.8253908 0.5494835 0.9129818
## 16 100 0.8845089 0.6605403 0.9135142
## 16 200 0.8767912 0.6357415 0.9208710
## 16 300 0.8644948 0.6142340 0.9169066
## 16 400 0.8574375 0.6063154 0.9167095
## 16 500 0.8489491 0.5893049 0.9143033
## 16 600 0.8402894 0.5850360 0.9075580
## 16 700 0.8353135 0.5778343 0.9081495
## 16 800 0.8316911 0.5753251 0.9024298
## 16 900 0.8267627 0.5737935 0.8992940
## 16 1000 0.8253833 0.5668524 0.9021931
## 18 100 0.8836314 0.6579333 0.9142637
## 18 200 0.8746455 0.6325480 0.9182873
## 18 300 0.8663147 0.6163196 0.9164136
## 18 400 0.8562308 0.5993092 0.9153091
## 18 500 0.8511864 0.5966044 0.9115419
## 18 600 0.8425807 0.5891094 0.9064140
## 18 700 0.8347018 0.5783557 0.9054476
## 18 800 0.8325570 0.5746407 0.9063943
## 18 900 0.8280188 0.5733047 0.9033173
## 18 1000 0.8251337 0.5624532 0.9079526
##
## Tuning parameter 'nrounds' was held constant at a value of 120
## Tuning
## 'colsample_bytree' was held constant at a value of 0.26
## Tuning
## parameter 'subsample' was held constant at a value of 0.8
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 120, max_depth = 12, eta
## = 0.5, gamma = 1, colsample_bytree = 0.26, min_child_weight = 100
## and subsample = 0.8.
plot(xgb2)
We obtained max_depth = 12 and min_child_weight = 100. Because the second parameter is on the lower bound of the values we checked, let’s make a more precise optimising of this parameter.
if(0) {parameters_xgb3 <- expand.grid(nrounds = 120,
max_depth = 12,
eta = c(0.5),
gamma = 1,
colsample_bytree = c(0.26),
min_child_weight = seq(50, 150, 10),
subsample = 0.8)
set.seed(123456789)
xgb3 <- train(model_formula,
data = hotel_train,
method = "xgbTree",
trControl = cv3,
tuneGrid = parameters_xgb3)
saveRDS(object = xgb3,
file = here("output", "xgb3.rds"))
}
xgb3 <- readRDS(here("output", "xgb3.rds"))
xgb3
## eXtreme Gradient Boosting
##
## 81389 samples
## 15 predictor
## 2 classes: 'Yes', 'No'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 54259, 54259, 54260
## Resampling results across tuning parameters:
##
## min_child_weight ROC Sens Spec
## 50 0.8895442 0.6761821 0.9095893
## 60 0.8884130 0.6746831 0.9110686
## 70 0.8870180 0.6709030 0.9104177
## 80 0.8850066 0.6617460 0.9134749
## 90 0.8849588 0.6620067 0.9146582
## 100 0.8844831 0.6611269 0.9119364
## 110 0.8836660 0.6590087 0.9142440
## 120 0.8822343 0.6543813 0.9160586
## 130 0.8814016 0.6492000 0.9174589
## 140 0.8792692 0.6452569 0.9180900
## 150 0.8806423 0.6508619 0.9175772
##
## Tuning parameter 'nrounds' was held constant at a value of 120
## Tuning
## 'colsample_bytree' was held constant at a value of 0.26
## Tuning
## parameter 'subsample' was held constant at a value of 0.8
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 120, max_depth = 12, eta
## = 0.5, gamma = 1, colsample_bytree = 0.26, min_child_weight = 50 and
## subsample = 0.8.
plot(xgb3)
Again we get a minimal value equal to 50 from our set. We will leave it at this level because it is already very low. Now it’s time to check the optimal value of colsample_bytree.
if(0) {parameters_xgb4 <- expand.grid(nrounds = 120,
max_depth = 12,
eta = c(0.5),
gamma = 1,
colsample_bytree = seq(0.2, 0.8, 0.1),
min_child_weight = 50,
subsample = 0.8)
set.seed(123456789)
xgb4 <- train(model_formula,
data = hotel_train,
method = "xgbTree",
trControl = cv3,
tuneGrid = parameters_xgb4)
saveRDS(object = xgb4,
file = here("output", "xgb4.rds"))
}
xgb4 <- readRDS(here("output", "xgb4.rds"))
xgb4
## eXtreme Gradient Boosting
##
## 81389 samples
## 15 predictor
## 2 classes: 'Yes', 'No'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 54259, 54259, 54260
## Resampling results across tuning parameters:
##
## colsample_bytree ROC Sens Spec
## 0.2 0.8857069 0.6654935 0.9104769
## 0.3 0.8911128 0.6813960 0.9083862
## 0.4 0.8919072 0.6894125 0.9067690
## 0.5 0.8936424 0.6934207 0.9059012
## 0.6 0.8936666 0.6972985 0.9025482
## 0.7 0.8952262 0.6986672 0.9042641
## 0.8 0.8951719 0.6983087 0.9037513
##
## Tuning parameter 'nrounds' was held constant at a value of 120
## Tuning
## 'min_child_weight' was held constant at a value of 50
## Tuning
## parameter 'subsample' was held constant at a value of 0.8
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 120, max_depth = 12, eta
## = 0.5, gamma = 1, colsample_bytree = 0.7, min_child_weight = 50 and
## subsample = 0.8.
plot(xgb4)
Having the best result with parameter equal to 0.7, the last one we’ll check is subsample.
if(0) {parameters_xgb5 <- expand.grid(nrounds = 120,
max_depth = 12,
eta = c(0.5),
gamma = 1,
colsample_bytree = 0.7,
min_child_weight = 50,
subsample = seq(0.5, 0.9, 0.1))
set.seed(123456789)
xgb5 <- train(model_formula,
data = hotel_train,
method = "xgbTree",
trControl = cv3,
tuneGrid = parameters_xgb5)
saveRDS(object = xgb5,
file = here("output", "xgb5.rds"))
}
xgb5 <- readRDS(here("output", "xgb5.rds"))
xgb5
## eXtreme Gradient Boosting
##
## 81389 samples
## 15 predictor
## 2 classes: 'Yes', 'No'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 54259, 54259, 54260
## Resampling results across tuning parameters:
##
## subsample ROC Sens Spec
## 0.5 0.8885252 0.6907159 0.8996489
## 0.6 0.8914834 0.6984065 0.9002998
## 0.7 0.8928361 0.6946590 0.9027652
## 0.8 0.8951797 0.6976570 0.9037119
## 0.9 0.8962220 0.6994493 0.9056053
##
## Tuning parameter 'nrounds' was held constant at a value of 120
## Tuning
## 'colsample_bytree' was held constant at a value of 0.7
## Tuning
## parameter 'min_child_weight' was held constant at a value of 50
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 120, max_depth = 12, eta
## = 0.5, gamma = 1, colsample_bytree = 0.7, min_child_weight = 50 and
## subsample = 0.9.
plot(xgb5)
And finally the subsample used for splits will be equal to 0.9. Having all those parameters tuned, we can also simultaneously increase the number of trees and lower the learning rate. This way we should obtain a much more stable final model. Models 6 to 8 are the effect of gradual changes in these two parameters.
if(0) {parameters_xgb6 <- expand.grid(nrounds = 250,
max_depth = 12,
eta = 0.2,
gamma = 1,
colsample_bytree = 0.7,
min_child_weight = 50,
subsample = 0.9)
set.seed(123456789)
xgb6 <- train(model_formula,
data = hotel_train,
method = "xgbTree",
trControl = cv3,
tuneGrid = parameters_xgb6)
saveRDS(object = xgb6,
file = here("output", "xgb6.rds"))
}
xgb6 <- readRDS(here("output", "xgb6.rds"))
xgb6
## eXtreme Gradient Boosting
##
## 81389 samples
## 15 predictor
## 2 classes: 'Yes', 'No'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 54259, 54259, 54260
## Resampling results:
##
## ROC Sens Spec
## 0.8985711 0.6929319 0.9128634
##
## Tuning parameter 'nrounds' was held constant at a value of 250
## Tuning
## parameter 'min_child_weight' was held constant at a value of 50
##
## Tuning parameter 'subsample' was held constant at a value of 0.9
if(0) {parameters_xgb7 <- expand.grid(nrounds = 500,
max_depth = 12,
eta = 0.1,
gamma = 1,
colsample_bytree = 0.7,
min_child_weight = 50,
subsample = 0.9)
set.seed(123456789)
xgb7 <- train(model_formula,
data = hotel_train,
method = "xgbTree",
trControl = cv3,
tuneGrid = parameters_xgb7)
saveRDS(object = xgb7,
file = here("output", "xgb7.rds"))
}
xgb7 <- readRDS(here("output", "xgb7.rds"))
xgb7
## eXtreme Gradient Boosting
##
## 81389 samples
## 15 predictor
## 2 classes: 'Yes', 'No'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 54259, 54259, 54260
## Resampling results:
##
## ROC Sens Spec
## 0.8995244 0.692997 0.9129423
##
## Tuning parameter 'nrounds' was held constant at a value of 500
## Tuning
## parameter 'min_child_weight' was held constant at a value of 50
##
## Tuning parameter 'subsample' was held constant at a value of 0.9
if(0) {parameters_xgb8 <- expand.grid(nrounds = 1000,
max_depth = 16,
eta = 0.05,
gamma = 1,
colsample_bytree = 0.8,
min_child_weight = 50,
subsample = 0.9)
set.seed(123456789)
xgb8 <- train(model_formula,
data = hotel_train,
method = "xgbTree",
trControl = cv3,
tuneGrid = parameters_xgb8)
saveRDS(object = xgb8,
file = here("output", "xgb8.rds"))
}
xgb8 <- readRDS(here("output", "xgb8.rds"))
xgb8
## eXtreme Gradient Boosting
##
## 81389 samples
## 15 predictor
## 2 classes: 'Yes', 'No'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 54259, 54259, 54260
## Resampling results:
##
## ROC Sens Spec
## 0.9016308 0.7002965 0.9125478
##
## Tuning parameter 'nrounds' was held constant at a value of 1000
##
## Tuning parameter 'min_child_weight' was held constant at a value of 50
##
## Tuning parameter 'subsample' was held constant at a value of 0.9
The results from this last step are very similar so we will keep only on the models 1-5 and the last one. Now we will check together all results on our training sample.
xgb_roc <- roc(hotel_train$is_canceled,
predict(xgb,
hotel_train, type = "prob")[, "Yes"])
xgb_roc2 <- roc(hotel_train$is_canceled,
predict(xgb2,
hotel_train, type = "prob")[, "Yes"])
xgb_roc3 <- roc(hotel_train$is_canceled,
predict(xgb3,
hotel_train, type = "prob")[, "Yes"])
xgb_roc4 <- roc(hotel_train$is_canceled,
predict(xgb4,
hotel_train, type = "prob")[, "Yes"])
xgb_roc5 <- roc(hotel_train$is_canceled,
predict(xgb5,
hotel_train, type = "prob")[, "Yes"])
xgb_roc8 <- roc(hotel_train$is_canceled,
predict(xgb8,
hotel_train, type = "prob")[, "Yes"])
# plot
list(
xgb_roc = xgb_roc,
xgb_roc2 = xgb_roc2,
xgb_roc3 = xgb_roc3,
xgb_roc4 = xgb_roc4,
xgb_roc5 = xgb_roc5,
xgb_roc8 = xgb_roc8
) %>%
pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) +
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1),
color = "grey",
linetype = "dashed") +
labs(subtitle = paste("Gini TRAIN: ",
"xgb = ",
round(100 * (2 * auc(xgb_roc) - 1), 1), "%, ",
"xgb2 = ",
round(100 * (2 * auc(xgb_roc2) - 1), 1), "%, ",
"xgb3 = ",
round(100 * (2 * auc(xgb_roc3) - 1), 1), "%, \n",
"xgb4 = ",
round(100 * (2 * auc(xgb_roc4) - 1), 1), "%, ",
"xgb5 = ",
round(100 * (2 * auc(xgb_roc5) - 1), 1), "%, ",
"xgb8 = ",
round(100 * (2 * auc(xgb_roc8) - 1), 1), "% ")) +
theme_bw() + coord_fixed() +
scale_color_brewer(palette = "Dark2")
We were able to constantly improve the quality of predictions on the training sample by tuning the parameters. Starting with quite low, unsatisfactory first model we obtained a much better Gini coefficient at the end of the process. Also, the changes in Gini coefficient were smaller and smaller as we were getting closer to the final parameters.
Now the same on testing sample.
xgb_test_roc <- roc(hotel_test$is_canceled,
predict(xgb,
hotel_test, type = "prob")[, "Yes"])
xgb_test_roc2 <- roc(hotel_test$is_canceled,
predict(xgb2,
hotel_test, type = "prob")[, "Yes"])
xgb_test_roc3 <- roc(hotel_test$is_canceled,
predict(xgb3,
hotel_test, type = "prob")[, "Yes"])
xgb_test_roc4 <- roc(hotel_test$is_canceled,
predict(xgb4,
hotel_test, type = "prob")[, "Yes"])
xgb_test_roc5 <- roc(hotel_test$is_canceled,
predict(xgb5,
hotel_test, type = "prob")[, "Yes"])
xgb_test_roc8 <- roc(hotel_test$is_canceled,
predict(xgb8,
hotel_test, type = "prob")[, "Yes"])
# plot
list(
xgb_test_roc = xgb_test_roc,
xgb_test_roc2 = xgb_test_roc2,
xgb_test_roc3 = xgb_test_roc3,
xgb_test_roc4 = xgb_test_roc4,
xgb_test_roc5 = xgb_test_roc5,
xgb_test_roc8 = xgb_test_roc8
) %>%
pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) +
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1),
color = "grey",
linetype = "dashed") +
labs(subtitle = c(paste("Gini TEST: ",
"xgb = ",
round(100 * (2 * auc(xgb_test_roc) - 1), 1), "%, ",
"xgb2 = ",
round(100 * (2 * auc(xgb_test_roc2) - 1), 1), "%, ",
"xgb3 = ",
round(100 * (2 * auc(xgb_test_roc3) - 1), 1), "%, \n",
"xgb4 = ",
round(100 * (2 * auc(xgb_test_roc4) - 1), 1), "%, ",
"xgb5 = ",
round(100 * (2 * auc(xgb_test_roc5) - 1), 1), "%, ",
"xgb8 = ",
round(100 * (2 * auc(xgb_test_roc8) - 1), 1), "% "))) +
theme_bw() + coord_fixed() +
scale_color_brewer(palette = "Dark2")
On the testing data our results keep the same structure, although again some overfitting appears for our best variants of models, so the improvement is not as high as on the training sample.
Finally, we created also three rather simple neural networks. They are perfect in case of non-linear relations with the target variable, but they can also be very time-consuming.
before modelling we need to apply some changes to our data. We will train our models with neuralnet function which requires that continuous variables are standarized and what’s more, all categorical variables must be recoded into dummies.
# standarize continuous vars
train.max <- apply(hotel_train[,numeric_vars], 2, max)
train.min <- apply(hotel_train[,numeric_vars], 2, min)
hotel_train_s <- hotel_train
hotel_train_s[,numeric_vars] <- scale(hotel_train[,numeric_vars],
center = train.min,
scale = train.max - train.min)
We need to standarize also the testing data for the future predictions, based on the min and max values from the training data.
hotel_test_s <- hotel_test
hotel_test_s[,numeric_vars] <- scale(hotel_test[,numeric_vars],
center = train.min,
scale = train.max - train.min)
We’ll keep our recoded data as matrix and correct automatically obtained names of new binary variables. We need also a new formula based on those new labels.
# recoding into dummies
hotel_train_mtx <-
model.matrix(object = model_formula,
data = hotel_train_s)
dim(hotel_train_mtx)
## [1] 81389 35
colnames(hotel_train_mtx)
## [1] "(Intercept)" "hotelResort Hotel"
## [3] "arrival_date_monthAugust" "arrival_date_monthDecember"
## [5] "arrival_date_monthFebruary" "arrival_date_monthJanuary"
## [7] "arrival_date_monthJuly" "arrival_date_monthJune"
## [9] "arrival_date_monthMarch" "arrival_date_monthMay"
## [11] "arrival_date_monthNovember" "arrival_date_monthOctober"
## [13] "arrival_date_monthSeptember" "adults.L"
## [15] "adults.Q" "adults.C"
## [17] "meal.L" "meal.Q"
## [19] "distribution_channelDirect" "distribution_channelTA/TO"
## [21] "booking_changes" "deposit_typeNon Refund"
## [23] "deposit_typeRefundable" "customer_typeGroup"
## [25] "customer_typeTransient" "customer_typeTransient-Party"
## [27] "lead_time_box" "kidsyes"
## [29] "cancyes" "adr_box"
## [31] "parkingyes" "requestsA1"
## [33] "requestsA2" "requestsA3+"
## [35] "days"
colnames(hotel_train_mtx) <- gsub("/", "", colnames(hotel_train_mtx))
colnames(hotel_train_mtx) <- gsub("\\+", "_", colnames(hotel_train_mtx))
colnames(hotel_train_mtx) <- gsub(" ", "", colnames(hotel_train_mtx))
# the same for test data
hotel_test_mtx <-
model.matrix(object = model_formula,
data = hotel_test_s)
dim(hotel_test_mtx)
## [1] 34905 35
colnames(hotel_test_mtx)
## [1] "(Intercept)" "hotelResort Hotel"
## [3] "arrival_date_monthAugust" "arrival_date_monthDecember"
## [5] "arrival_date_monthFebruary" "arrival_date_monthJanuary"
## [7] "arrival_date_monthJuly" "arrival_date_monthJune"
## [9] "arrival_date_monthMarch" "arrival_date_monthMay"
## [11] "arrival_date_monthNovember" "arrival_date_monthOctober"
## [13] "arrival_date_monthSeptember" "adults.L"
## [15] "adults.Q" "adults.C"
## [17] "meal.L" "meal.Q"
## [19] "distribution_channelDirect" "distribution_channelTA/TO"
## [21] "booking_changes" "deposit_typeNon Refund"
## [23] "deposit_typeRefundable" "customer_typeGroup"
## [25] "customer_typeTransient" "customer_typeTransient-Party"
## [27] "lead_time_box" "kidsyes"
## [29] "cancyes" "adr_box"
## [31] "parkingyes" "requestsA1"
## [33] "requestsA2" "requestsA3+"
## [35] "days"
colnames(hotel_test_mtx) <- gsub("/", "", colnames(hotel_test_mtx))
colnames(hotel_test_mtx) <- gsub("\\+", "_", colnames(hotel_test_mtx))
colnames(hotel_test_mtx) <- gsub(" ", "", colnames(hotel_test_mtx))
# and new formula
col_list <- paste(c(colnames(hotel_train_mtx[, -1])), collapse = "+")
col_list <- paste(c("is_canceled ~ ", col_list), collapse = "")
model_formula2 <- formula(col_list)
The first net will be a very simple model without any hidden layers.
if(0){
set.seed(123456789)
nn0 <-
data.frame(hotel_train_mtx,
is_canceled = as.numeric(hotel_train$is_canceled == "Yes")) %>%
neuralnet(model_formula2,
data = .,
hidden = 0,
linear.output = FALSE,
algorithm = "rprop+",
threshold = 0.01)
saveRDS(object = nn0,
file = here("output", "nn0.rds"))
}
nn0 <- readRDS(here("output", "nn0.rds"))
nn0$result.matrix
## [,1]
## error 5.751797e+03
## reached.threshold 9.748996e-03
## steps 6.894000e+03
## Intercept.to.is_canceled -7.490977e+00
## hotelResortHotel.to.is_canceled 4.093791e-01
## arrival_date_monthAugust.to.is_canceled -5.040890e-01
## arrival_date_monthDecember.to.is_canceled 3.031218e-01
## arrival_date_monthFebruary.to.is_canceled 3.910708e-01
## arrival_date_monthJanuary.to.is_canceled 4.784395e-01
## arrival_date_monthJuly.to.is_canceled -5.588438e-01
## arrival_date_monthJune.to.is_canceled -3.972144e-01
## arrival_date_monthMarch.to.is_canceled 4.783688e-02
## arrival_date_monthMay.to.is_canceled -2.700806e-01
## arrival_date_monthNovember.to.is_canceled 1.554840e-01
## arrival_date_monthOctober.to.is_canceled -1.267552e-01
## arrival_date_monthSeptember.to.is_canceled -4.810588e-01
## adults.L.to.is_canceled -3.573290e-01
## adults.Q.to.is_canceled 1.612853e-01
## adults.C.to.is_canceled -1.507029e-01
## meal.L.to.is_canceled -7.581968e-01
## meal.Q.to.is_canceled 3.121499e-02
## distribution_channelDirect.to.is_canceled 2.056507e-01
## distribution_channelTATO.to.is_canceled 1.198020e+00
## booking_changes.to.is_canceled -7.766732e+00
## deposit_typeNonRefund.to.is_canceled 4.166921e+00
## deposit_typeRefundable.to.is_canceled 5.431070e-01
## customer_typeGroup.to.is_canceled -4.959328e-01
## customer_typeTransient.to.is_canceled 1.052287e+00
## lead_time_box.to.is_canceled 3.097382e+00
## kidsyes.to.is_canceled 2.038891e-02
## cancyes.to.is_canceled 3.374638e+00
## adr_box.to.is_canceled 1.436497e+01
## parkingyes.to.is_canceled -3.748869e+02
## requestsA1.to.is_canceled -1.017234e+00
## requestsA2.to.is_canceled -1.236498e+00
## requestsA3_.to.is_canceled -1.739464e+00
## days.to.is_canceled 5.285801e-01
plot(nn0, rep = 'best')
nn0_pred <- compute(nn0, hotel_train_mtx[, -1])
confusionMatrix(as.numeric((nn0_pred$net.result > 0.5)) %>% as.factor(),
as.factor(ifelse(hotel_train$is_canceled == "Yes", 1, 0)))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 46942 12541
## 1 3760 18146
##
## Accuracy : 0.7997
## 95% CI : (0.7969, 0.8025)
## No Information Rate : 0.623
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5481
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9258
## Specificity : 0.5913
## Pos Pred Value : 0.7892
## Neg Pred Value : 0.8284
## Prevalence : 0.6230
## Detection Rate : 0.5768
## Detection Prevalence : 0.7308
## Balanced Accuracy : 0.7586
##
## 'Positive' Class : 0
##
nn0_roc <- roc(as.numeric(hotel_train$is_canceled == "Yes"),
compute(nn0, hotel_train_mtx[, -1])$net.result[, 1])
In the confusion matrix we can see that especially the specificity is low for the default threshold. We can check at once the predictions and confusion matrix for the testing sample.
nn0_test_pred <- compute(nn0, hotel_test_mtx[, -1])
confusionMatrix(as.numeric((nn0_test_pred$net.result > 0.5)) %>% as.factor(),
as.factor(ifelse(hotel_test$is_canceled == "Yes", 1, 0)))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 20270 5320
## 1 1599 7716
##
## Accuracy : 0.8018
## 95% CI : (0.7976, 0.8059)
## No Information Rate : 0.6265
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5505
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9269
## Specificity : 0.5919
## Pos Pred Value : 0.7921
## Neg Pred Value : 0.8283
## Prevalence : 0.6265
## Detection Rate : 0.5807
## Detection Prevalence : 0.7331
## Balanced Accuracy : 0.7594
##
## 'Positive' Class : 0
##
nn0_test_roc <- roc(as.numeric(hotel_test$is_canceled == "Yes"),
compute(nn0, hotel_test_mtx[, -1])$net.result[, 1])
The accuracy is even slightly higher then on training sample but still there is a problem with specificity. Now we’ll make a second model with one hidden layer of 6 neurons. To speed up the computationally demanding calculations we will use here a subsample of only 2 thousand observations selected from training sample.
if(0){
set.seed(123456789)
nn1 <-
data.frame(hotel_train_mtx,
is_canceled = as.numeric(hotel_train$is_canceled == "Yes")) %>%
sample_n(2000) %>%
neuralnet(model_formula2,
data = .,
hidden = c(6),
linear.output = FALSE,
stepmax = 1e+06,
algorithm = "rprop+",
threshold = 0.01)
saveRDS(object = nn1,
file = here("output", "nn1.rds"))
}
nn1 <- readRDS(here("output", "nn1.rds"))
nn1$result.matrix
## [,1]
## error 7.397264e+01
## reached.threshold 9.482453e-03
## steps 7.219800e+05
## Intercept.to.1layhid1 5.374891e+00
## hotelResortHotel.to.1layhid1 -2.548735e+01
## arrival_date_monthAugust.to.1layhid1 -5.208985e+01
## arrival_date_monthDecember.to.1layhid1 -1.360527e+01
## arrival_date_monthFebruary.to.1layhid1 1.228098e+01
## arrival_date_monthJanuary.to.1layhid1 -1.777210e+00
## arrival_date_monthJuly.to.1layhid1 -5.219038e+01
## arrival_date_monthJune.to.1layhid1 -5.455389e+01
## arrival_date_monthMarch.to.1layhid1 3.389499e+01
## arrival_date_monthMay.to.1layhid1 -4.358121e+01
## arrival_date_monthNovember.to.1layhid1 4.705142e+01
## arrival_date_monthOctober.to.1layhid1 -4.369458e+01
## arrival_date_monthSeptember.to.1layhid1 -5.972898e+01
## adults.L.to.1layhid1 -9.169481e+00
## adults.Q.to.1layhid1 -1.442369e+01
## adults.C.to.1layhid1 -7.352820e+00
## meal.L.to.1layhid1 -8.735676e+01
## meal.Q.to.1layhid1 1.171479e+01
## distribution_channelDirect.to.1layhid1 1.163392e+01
## distribution_channelTATO.to.1layhid1 -3.122873e-01
## booking_changes.to.1layhid1 3.910796e+01
## deposit_typeNonRefund.to.1layhid1 -7.194233e+02
## deposit_typeRefundable.to.1layhid1 2.200320e+01
## customer_typeGroup.to.1layhid1 6.854438e+01
## customer_typeTransient.to.1layhid1 -2.237636e+01
## lead_time_box.to.1layhid1 -4.220127e+01
## kidsyes.to.1layhid1 3.653100e+01
## cancyes.to.1layhid1 4.700659e+01
## adr_box.to.1layhid1 3.906279e+01
## parkingyes.to.1layhid1 1.590627e+02
## requestsA1.to.1layhid1 2.463954e+01
## requestsA2.to.1layhid1 3.388900e+01
## requestsA3_.to.1layhid1 1.933467e+01
## days.to.1layhid1 1.520092e+02
## Intercept.to.1layhid2 3.178902e+00
## hotelResortHotel.to.1layhid2 -1.662296e+01
## arrival_date_monthAugust.to.1layhid2 1.786461e+01
## arrival_date_monthDecember.to.1layhid2 1.016848e+01
## arrival_date_monthFebruary.to.1layhid2 1.219847e+01
## arrival_date_monthJanuary.to.1layhid2 3.253156e+00
## arrival_date_monthJuly.to.1layhid2 1.638636e+01
## arrival_date_monthJune.to.1layhid2 -1.120094e+01
## arrival_date_monthMarch.to.1layhid2 1.246974e+01
## arrival_date_monthMay.to.1layhid2 -1.617144e+01
## arrival_date_monthNovember.to.1layhid2 -1.670364e+01
## arrival_date_monthOctober.to.1layhid2 1.337819e+01
## arrival_date_monthSeptember.to.1layhid2 2.841505e+01
## adults.L.to.1layhid2 2.913257e+01
## adults.Q.to.1layhid2 7.487303e+00
## adults.C.to.1layhid2 9.853815e+00
## meal.L.to.1layhid2 -2.385391e+01
## meal.Q.to.1layhid2 6.079104e+00
## distribution_channelDirect.to.1layhid2 -6.598365e+00
## distribution_channelTATO.to.1layhid2 -1.026242e+01
## booking_changes.to.1layhid2 -6.470914e+01
## deposit_typeNonRefund.to.1layhid2 4.358609e+01
## deposit_typeRefundable.to.1layhid2 -6.912606e+02
## customer_typeGroup.to.1layhid2 -7.074416e+02
## customer_typeTransient.to.1layhid2 1.611277e+01
## lead_time_box.to.1layhid2 9.217366e+00
## kidsyes.to.1layhid2 3.317836e+01
## cancyes.to.1layhid2 7.793332e+01
## adr_box.to.1layhid2 -2.975516e+01
## parkingyes.to.1layhid2 -7.410783e+02
## requestsA1.to.1layhid2 1.371463e+01
## requestsA2.to.1layhid2 1.143314e+01
## requestsA3_.to.1layhid2 2.196846e+01
## days.to.1layhid2 2.346528e+01
## Intercept.to.1layhid3 -4.511418e+00
## hotelResortHotel.to.1layhid3 -1.619105e+00
## arrival_date_monthAugust.to.1layhid3 -2.246299e+00
## arrival_date_monthDecember.to.1layhid3 5.327190e+00
## arrival_date_monthFebruary.to.1layhid3 9.387458e-01
## arrival_date_monthJanuary.to.1layhid3 1.484502e+00
## arrival_date_monthJuly.to.1layhid3 -2.335146e+00
## arrival_date_monthJune.to.1layhid3 4.417763e+00
## arrival_date_monthMarch.to.1layhid3 1.716618e+00
## arrival_date_monthMay.to.1layhid3 6.365860e-01
## arrival_date_monthNovember.to.1layhid3 3.883868e+00
## arrival_date_monthOctober.to.1layhid3 -1.954033e-01
## arrival_date_monthSeptember.to.1layhid3 1.204140e+00
## adults.L.to.1layhid3 -5.984674e-01
## adults.Q.to.1layhid3 -8.492180e-01
## adults.C.to.1layhid3 -2.753736e-01
## meal.L.to.1layhid3 -9.900164e-01
## meal.Q.to.1layhid3 1.552201e-01
## distribution_channelDirect.to.1layhid3 -2.273597e+00
## distribution_channelTATO.to.1layhid3 -1.281813e+00
## booking_changes.to.1layhid3 -2.358846e+00
## deposit_typeNonRefund.to.1layhid3 -9.949138e-01
## deposit_typeRefundable.to.1layhid3 3.816987e+01
## customer_typeGroup.to.1layhid3 -6.546891e+02
## customer_typeTransient.to.1layhid3 2.391955e-01
## lead_time_box.to.1layhid3 1.515612e+00
## kidsyes.to.1layhid3 -8.745374e-01
## cancyes.to.1layhid3 6.344581e+00
## adr_box.to.1layhid3 2.231523e+01
## parkingyes.to.1layhid3 -6.846546e+02
## requestsA1.to.1layhid3 -3.760618e+00
## requestsA2.to.1layhid3 -6.347371e+00
## requestsA3_.to.1layhid3 -1.805744e+00
## days.to.1layhid3 8.250674e+00
## Intercept.to.1layhid4 2.465855e+01
## hotelResortHotel.to.1layhid4 5.128860e+00
## arrival_date_monthAugust.to.1layhid4 1.746610e+01
## arrival_date_monthDecember.to.1layhid4 2.576604e+01
## arrival_date_monthFebruary.to.1layhid4 -3.385431e+00
## arrival_date_monthJanuary.to.1layhid4 5.816811e+00
## arrival_date_monthJuly.to.1layhid4 9.323096e+01
## arrival_date_monthJune.to.1layhid4 -2.827935e+01
## arrival_date_monthMarch.to.1layhid4 8.250569e+00
## arrival_date_monthMay.to.1layhid4 -1.611719e+01
## arrival_date_monthNovember.to.1layhid4 -4.486123e+01
## arrival_date_monthOctober.to.1layhid4 5.018538e+00
## arrival_date_monthSeptember.to.1layhid4 -2.753555e+01
## adults.L.to.1layhid4 2.472143e+00
## adults.Q.to.1layhid4 -3.791186e+01
## adults.C.to.1layhid4 -1.326957e+01
## meal.L.to.1layhid4 3.180176e+01
## meal.Q.to.1layhid4 1.121728e+01
## distribution_channelDirect.to.1layhid4 7.456924e+02
## distribution_channelTATO.to.1layhid4 9.031202e+00
## booking_changes.to.1layhid4 1.648653e+02
## deposit_typeNonRefund.to.1layhid4 -5.460739e+01
## deposit_typeRefundable.to.1layhid4 -7.445491e+02
## customer_typeGroup.to.1layhid4 7.875166e+01
## customer_typeTransient.to.1layhid4 -5.510910e+00
## lead_time_box.to.1layhid4 5.549299e+00
## kidsyes.to.1layhid4 9.232313e+00
## cancyes.to.1layhid4 -5.000703e+00
## adr_box.to.1layhid4 -9.050179e+01
## parkingyes.to.1layhid4 6.147563e+02
## requestsA1.to.1layhid4 -2.582677e+01
## requestsA2.to.1layhid4 -2.499270e+01
## requestsA3_.to.1layhid4 -3.849444e+01
## days.to.1layhid4 -7.074467e+01
## Intercept.to.1layhid5 -2.236770e+00
## hotelResortHotel.to.1layhid5 -8.504060e+00
## arrival_date_monthAugust.to.1layhid5 3.028680e+00
## arrival_date_monthDecember.to.1layhid5 5.390814e+00
## arrival_date_monthFebruary.to.1layhid5 4.560318e+00
## arrival_date_monthJanuary.to.1layhid5 9.778081e+00
## arrival_date_monthJuly.to.1layhid5 1.600907e+00
## arrival_date_monthJune.to.1layhid5 -5.326529e+00
## arrival_date_monthMarch.to.1layhid5 7.371411e+00
## arrival_date_monthMay.to.1layhid5 7.487681e+00
## arrival_date_monthNovember.to.1layhid5 5.521913e+01
## arrival_date_monthOctober.to.1layhid5 2.010108e+00
## arrival_date_monthSeptember.to.1layhid5 5.117728e+00
## adults.L.to.1layhid5 3.186449e-01
## adults.Q.to.1layhid5 4.055619e+00
## adults.C.to.1layhid5 -2.996738e+00
## meal.L.to.1layhid5 3.339225e+00
## meal.Q.to.1layhid5 -3.391951e+00
## distribution_channelDirect.to.1layhid5 -2.551684e+00
## distribution_channelTATO.to.1layhid5 -5.894729e+00
## booking_changes.to.1layhid5 7.153328e+01
## deposit_typeNonRefund.to.1layhid5 -7.325118e+02
## deposit_typeRefundable.to.1layhid5 5.899934e+00
## customer_typeGroup.to.1layhid5 3.914812e+01
## customer_typeTransient.to.1layhid5 2.687965e-01
## lead_time_box.to.1layhid5 -7.994642e+00
## kidsyes.to.1layhid5 1.024203e+00
## cancyes.to.1layhid5 -5.517057e+00
## adr_box.to.1layhid5 2.242047e+01
## parkingyes.to.1layhid5 4.732492e+01
## requestsA1.to.1layhid5 4.786153e+00
## requestsA2.to.1layhid5 -1.813163e+00
## requestsA3_.to.1layhid5 1.146103e+00
## days.to.1layhid5 4.267567e+01
## Intercept.to.1layhid6 9.303958e+00
## hotelResortHotel.to.1layhid6 -2.290631e+00
## arrival_date_monthAugust.to.1layhid6 -1.831472e+00
## arrival_date_monthDecember.to.1layhid6 2.069801e+00
## arrival_date_monthFebruary.to.1layhid6 -1.891841e+00
## arrival_date_monthJanuary.to.1layhid6 -4.920723e+00
## arrival_date_monthJuly.to.1layhid6 -8.215916e-01
## arrival_date_monthJune.to.1layhid6 9.661582e+00
## arrival_date_monthMarch.to.1layhid6 -2.871030e+00
## arrival_date_monthMay.to.1layhid6 -9.299367e-01
## arrival_date_monthNovember.to.1layhid6 -9.004345e+00
## arrival_date_monthOctober.to.1layhid6 1.193451e+00
## arrival_date_monthSeptember.to.1layhid6 8.141927e+00
## adults.L.to.1layhid6 3.152838e+00
## adults.Q.to.1layhid6 -3.189553e-01
## adults.C.to.1layhid6 2.964809e+00
## meal.L.to.1layhid6 2.134982e-01
## meal.Q.to.1layhid6 1.236679e+00
## distribution_channelDirect.to.1layhid6 -5.743821e+00
## distribution_channelTATO.to.1layhid6 -1.581319e+00
## booking_changes.to.1layhid6 -3.581584e+01
## deposit_typeNonRefund.to.1layhid6 -2.332456e+00
## deposit_typeRefundable.to.1layhid6 5.984417e+00
## customer_typeGroup.to.1layhid6 4.359032e+01
## customer_typeTransient.to.1layhid6 1.925424e+00
## lead_time_box.to.1layhid6 -1.165244e-01
## kidsyes.to.1layhid6 -3.113989e+00
## cancyes.to.1layhid6 -5.322883e-01
## adr_box.to.1layhid6 -2.723852e+01
## parkingyes.to.1layhid6 1.301335e+02
## requestsA1.to.1layhid6 -2.060458e+00
## requestsA2.to.1layhid6 -1.967191e+00
## requestsA3_.to.1layhid6 8.983357e+00
## days.to.1layhid6 -7.680627e+00
## Intercept.to.is_canceled 3.573575e+01
## 1layhid1.to.is_canceled -2.077520e+02
## 1layhid2.to.is_canceled 2.133316e+02
## 1layhid3.to.is_canceled 3.782999e+02
## 1layhid4.to.is_canceled -1.929522e+02
## 1layhid5.to.is_canceled -2.039991e+02
## 1layhid6.to.is_canceled -4.126416e+02
plot(nn1, rep = 'best')
nn1_pred <- compute(nn1, hotel_train_mtx[, -1])
confusionMatrix(as.numeric((nn1_pred$net.result > 0.5)) %>% as.factor(),
as.factor(ifelse(hotel_train$is_canceled == "Yes", 1, 0)))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 42629 11771
## 1 8073 18916
##
## Accuracy : 0.7562
## 95% CI : (0.7532, 0.7591)
## No Information Rate : 0.623
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4683
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8408
## Specificity : 0.6164
## Pos Pred Value : 0.7836
## Neg Pred Value : 0.7009
## Prevalence : 0.6230
## Detection Rate : 0.5238
## Detection Prevalence : 0.6684
## Balanced Accuracy : 0.7286
##
## 'Positive' Class : 0
##
nn1_roc <- roc(as.numeric(hotel_train$is_canceled == "Yes"),
compute(nn1, hotel_train_mtx[, -1])$net.result[, 1])
Surprisingly, the result this time is worse and unsatisfactory. Let’s check the diffence for the testing data.
nn1_test_pred <- compute(nn1, hotel_test_mtx[, -1])
confusionMatrix(as.numeric((nn1_test_pred$net.result > 0.5)) %>% as.factor(),
as.factor(ifelse(hotel_test$is_canceled == "Yes", 1, 0)))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 18259 5021
## 1 3610 8015
##
## Accuracy : 0.7527
## 95% CI : (0.7482, 0.7572)
## No Information Rate : 0.6265
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4598
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8349
## Specificity : 0.6148
## Pos Pred Value : 0.7843
## Neg Pred Value : 0.6895
## Prevalence : 0.6265
## Detection Rate : 0.5231
## Detection Prevalence : 0.6670
## Balanced Accuracy : 0.7249
##
## 'Positive' Class : 0
##
nn1_test_roc <- roc(as.numeric(hotel_test$is_canceled == "Yes"),
compute(nn1, hotel_test_mtx[, -1])$net.result[, 1])
Similar conclusions. It seems that adding a hidden layer doesn’t help at all. Finally we will add another layer of 4 neurons and see what happens. We will also reduce a subsample even more to only a one thousand observations.
if(0){
set.seed(123456789)
nn2 <-
data.frame(hotel_train_mtx,
is_canceled = as.numeric(hotel_train$is_canceled == "Yes")) %>%
sample_n(1000) %>%
neuralnet(model_formula2,
data = .,
hidden = c(6, 4),
linear.output = FALSE,
stepmax = 1e+06,
algorithm = "rprop+",
threshold = 0.001)
saveRDS(object = nn2,
file = here("output", "nn2.rds"))
}
nn2 <- readRDS(here("output", "nn2.rds"))
nn2$result.matrix
## [,1]
## error 2.250067e+01
## reached.threshold 7.669559e-04
## steps 4.606400e+04
## Intercept.to.1layhid1 2.445336e+00
## hotelResortHotel.to.1layhid1 3.990793e+01
## arrival_date_monthAugust.to.1layhid1 3.931190e+01
## arrival_date_monthDecember.to.1layhid1 3.978750e+01
## arrival_date_monthFebruary.to.1layhid1 3.743967e+01
## arrival_date_monthJanuary.to.1layhid1 -7.522021e+00
## arrival_date_monthJuly.to.1layhid1 -7.962709e+00
## arrival_date_monthJune.to.1layhid1 2.578327e+00
## arrival_date_monthMarch.to.1layhid1 -7.262644e+00
## arrival_date_monthMay.to.1layhid1 2.452163e+00
## arrival_date_monthNovember.to.1layhid1 -4.010579e+00
## arrival_date_monthOctober.to.1layhid1 -2.757441e+00
## arrival_date_monthSeptember.to.1layhid1 4.215977e+01
## adults.L.to.1layhid1 -1.190886e+00
## adults.Q.to.1layhid1 1.668544e+00
## adults.C.to.1layhid1 5.909044e+00
## meal.L.to.1layhid1 9.129906e+00
## meal.Q.to.1layhid1 6.103435e+00
## distribution_channelDirect.to.1layhid1 4.991895e+01
## distribution_channelTATO.to.1layhid1 2.852426e+00
## booking_changes.to.1layhid1 7.995993e+01
## deposit_typeNonRefund.to.1layhid1 -1.322254e+01
## deposit_typeRefundable.to.1layhid1 1.188839e+01
## customer_typeGroup.to.1layhid1 2.975056e+02
## customer_typeTransient.to.1layhid1 8.746696e+00
## lead_time_box.to.1layhid1 -4.222491e+00
## kidsyes.to.1layhid1 -8.875577e+00
## cancyes.to.1layhid1 -3.041214e+02
## adr_box.to.1layhid1 1.233006e+01
## parkingyes.to.1layhid1 2.451804e+02
## requestsA1.to.1layhid1 9.907438e+00
## requestsA2.to.1layhid1 -1.348668e+00
## requestsA3_.to.1layhid1 3.127534e+02
## days.to.1layhid1 6.684135e+01
## Intercept.to.1layhid2 -8.689737e-01
## hotelResortHotel.to.1layhid2 5.722455e+00
## arrival_date_monthAugust.to.1layhid2 -5.972431e+00
## arrival_date_monthDecember.to.1layhid2 -7.654679e+00
## arrival_date_monthFebruary.to.1layhid2 9.218852e-02
## arrival_date_monthJanuary.to.1layhid2 -1.621443e+00
## arrival_date_monthJuly.to.1layhid2 7.353724e-01
## arrival_date_monthJune.to.1layhid2 1.957547e+00
## arrival_date_monthMarch.to.1layhid2 -4.842043e+00
## arrival_date_monthMay.to.1layhid2 1.050897e+00
## arrival_date_monthNovember.to.1layhid2 -2.465472e+00
## arrival_date_monthOctober.to.1layhid2 -7.221829e-01
## arrival_date_monthSeptember.to.1layhid2 1.853052e+00
## adults.L.to.1layhid2 -2.507790e+00
## adults.Q.to.1layhid2 1.627501e+00
## adults.C.to.1layhid2 -3.268886e-01
## meal.L.to.1layhid2 1.618695e+00
## meal.Q.to.1layhid2 -8.755422e-01
## distribution_channelDirect.to.1layhid2 -7.136336e+02
## distribution_channelTATO.to.1layhid2 1.946677e-01
## booking_changes.to.1layhid2 -1.121755e+01
## deposit_typeNonRefund.to.1layhid2 7.190768e+02
## deposit_typeRefundable.to.1layhid2 -2.894353e+02
## customer_typeGroup.to.1layhid2 -5.113508e+02
## customer_typeTransient.to.1layhid2 2.278237e+00
## lead_time_box.to.1layhid2 -7.082112e-01
## kidsyes.to.1layhid2 -2.904900e-01
## cancyes.to.1layhid2 -7.155928e+00
## adr_box.to.1layhid2 2.203187e+00
## parkingyes.to.1layhid2 -5.152680e+02
## requestsA1.to.1layhid2 -1.090650e+00
## requestsA2.to.1layhid2 -1.563522e+00
## requestsA3_.to.1layhid2 3.146404e+00
## days.to.1layhid2 -1.664009e+01
## Intercept.to.1layhid3 4.994431e-01
## hotelResortHotel.to.1layhid3 3.355449e-01
## arrival_date_monthAugust.to.1layhid3 3.913846e+00
## arrival_date_monthDecember.to.1layhid3 -7.105421e+02
## arrival_date_monthFebruary.to.1layhid3 -2.896649e-01
## arrival_date_monthJanuary.to.1layhid3 -7.063192e-01
## arrival_date_monthJuly.to.1layhid3 3.436165e+00
## arrival_date_monthJune.to.1layhid3 -2.736949e+00
## arrival_date_monthMarch.to.1layhid3 2.246594e+00
## arrival_date_monthMay.to.1layhid3 3.003981e-01
## arrival_date_monthNovember.to.1layhid3 3.074521e-02
## arrival_date_monthOctober.to.1layhid3 1.782097e+00
## arrival_date_monthSeptember.to.1layhid3 -5.789644e-01
## adults.L.to.1layhid3 -6.941859e-01
## adults.Q.to.1layhid3 -1.162360e+00
## adults.C.to.1layhid3 -6.361649e-01
## meal.L.to.1layhid3 -2.690321e+00
## meal.Q.to.1layhid3 2.306288e+00
## distribution_channelDirect.to.1layhid3 -1.041899e+00
## distribution_channelTATO.to.1layhid3 7.106703e-01
## booking_changes.to.1layhid3 2.353725e+01
## deposit_typeNonRefund.to.1layhid3 7.118383e+02
## deposit_typeRefundable.to.1layhid3 -2.882787e+02
## customer_typeGroup.to.1layhid3 -5.310858e+02
## customer_typeTransient.to.1layhid3 -2.476492e-01
## lead_time_box.to.1layhid3 1.476960e+00
## kidsyes.to.1layhid3 4.232672e-01
## cancyes.to.1layhid3 3.420836e+00
## adr_box.to.1layhid3 -2.787733e+00
## parkingyes.to.1layhid3 -5.281673e+02
## requestsA1.to.1layhid3 -2.972454e+00
## requestsA2.to.1layhid3 -6.706305e-01
## requestsA3_.to.1layhid3 1.532688e+00
## days.to.1layhid3 -2.602837e+01
## Intercept.to.1layhid4 -1.812370e+00
## hotelResortHotel.to.1layhid4 2.832903e-01
## arrival_date_monthAugust.to.1layhid4 9.401485e-01
## arrival_date_monthDecember.to.1layhid4 -2.700352e+00
## arrival_date_monthFebruary.to.1layhid4 2.167408e+00
## arrival_date_monthJanuary.to.1layhid4 2.976250e+00
## arrival_date_monthJuly.to.1layhid4 1.447992e+00
## arrival_date_monthJune.to.1layhid4 1.809558e-01
## arrival_date_monthMarch.to.1layhid4 1.819831e+00
## arrival_date_monthMay.to.1layhid4 -3.399769e+00
## arrival_date_monthNovember.to.1layhid4 3.032964e+00
## arrival_date_monthOctober.to.1layhid4 1.070366e+00
## arrival_date_monthSeptember.to.1layhid4 -5.087037e+00
## adults.L.to.1layhid4 1.516245e+00
## adults.Q.to.1layhid4 1.303123e+00
## adults.C.to.1layhid4 -3.516939e-01
## meal.L.to.1layhid4 2.597440e-01
## meal.Q.to.1layhid4 -3.265543e-02
## distribution_channelDirect.to.1layhid4 -2.516561e+00
## distribution_channelTATO.to.1layhid4 -1.087358e+00
## booking_changes.to.1layhid4 -2.325487e+01
## deposit_typeNonRefund.to.1layhid4 4.512724e+01
## deposit_typeRefundable.to.1layhid4 -2.866621e+02
## customer_typeGroup.to.1layhid4 -5.122173e+02
## customer_typeTransient.to.1layhid4 -8.633997e-01
## lead_time_box.to.1layhid4 2.339991e+00
## kidsyes.to.1layhid4 -1.797922e+00
## cancyes.to.1layhid4 3.812564e+01
## adr_box.to.1layhid4 5.253205e+00
## parkingyes.to.1layhid4 -5.142397e+02
## requestsA1.to.1layhid4 6.666795e+00
## requestsA2.to.1layhid4 3.666278e+00
## requestsA3_.to.1layhid4 -7.113838e+02
## days.to.1layhid4 -6.338953e+00
## Intercept.to.1layhid5 6.145372e-01
## hotelResortHotel.to.1layhid5 5.688636e-01
## arrival_date_monthAugust.to.1layhid5 1.295317e+00
## arrival_date_monthDecember.to.1layhid5 1.664685e-02
## arrival_date_monthFebruary.to.1layhid5 1.725441e+00
## arrival_date_monthJanuary.to.1layhid5 5.068823e+00
## arrival_date_monthJuly.to.1layhid5 3.101162e+00
## arrival_date_monthJune.to.1layhid5 -2.154570e+00
## arrival_date_monthMarch.to.1layhid5 2.655991e+00
## arrival_date_monthMay.to.1layhid5 -2.664765e+00
## arrival_date_monthNovember.to.1layhid5 2.470168e+00
## arrival_date_monthOctober.to.1layhid5 3.309031e+00
## arrival_date_monthSeptember.to.1layhid5 -1.855909e+00
## adults.L.to.1layhid5 8.273229e-01
## adults.Q.to.1layhid5 -2.906310e+00
## adults.C.to.1layhid5 1.805033e+00
## meal.L.to.1layhid5 2.626600e+00
## meal.Q.to.1layhid5 6.048585e-01
## distribution_channelDirect.to.1layhid5 -4.929594e+00
## distribution_channelTATO.to.1layhid5 4.274756e-01
## booking_changes.to.1layhid5 -2.044103e+01
## deposit_typeNonRefund.to.1layhid5 3.434228e+01
## deposit_typeRefundable.to.1layhid5 1.869963e+01
## customer_typeGroup.to.1layhid5 1.786562e+02
## customer_typeTransient.to.1layhid5 -4.634708e-01
## lead_time_box.to.1layhid5 -1.035333e+00
## kidsyes.to.1layhid5 -2.201244e+00
## cancyes.to.1layhid5 -1.839631e+02
## adr_box.to.1layhid5 -4.147092e+00
## parkingyes.to.1layhid5 1.854793e+02
## requestsA1.to.1layhid5 5.033844e+00
## requestsA2.to.1layhid5 3.886092e-01
## requestsA3_.to.1layhid5 2.060406e+02
## days.to.1layhid5 -8.606483e+00
## Intercept.to.1layhid6 -3.628445e+00
## hotelResortHotel.to.1layhid6 -1.139338e+00
## arrival_date_monthAugust.to.1layhid6 -1.289767e-01
## arrival_date_monthDecember.to.1layhid6 4.377645e+00
## arrival_date_monthFebruary.to.1layhid6 5.862540e-01
## arrival_date_monthJanuary.to.1layhid6 6.668938e-01
## arrival_date_monthJuly.to.1layhid6 -1.150463e+00
## arrival_date_monthJune.to.1layhid6 -1.121599e+00
## arrival_date_monthMarch.to.1layhid6 7.064627e-01
## arrival_date_monthMay.to.1layhid6 -6.540200e-01
## arrival_date_monthNovember.to.1layhid6 1.037043e+00
## arrival_date_monthOctober.to.1layhid6 -6.691655e-02
## arrival_date_monthSeptember.to.1layhid6 -9.428782e-04
## adults.L.to.1layhid6 -2.565231e+00
## adults.Q.to.1layhid6 3.984700e-01
## adults.C.to.1layhid6 -9.617120e-02
## meal.L.to.1layhid6 8.155688e-01
## meal.Q.to.1layhid6 -2.023907e-01
## distribution_channelDirect.to.1layhid6 2.920532e-01
## distribution_channelTATO.to.1layhid6 5.268996e-01
## booking_changes.to.1layhid6 -8.155330e+00
## deposit_typeNonRefund.to.1layhid6 -7.080431e+02
## deposit_typeRefundable.to.1layhid6 -2.869864e+02
## customer_typeGroup.to.1layhid6 -5.091078e+02
## customer_typeTransient.to.1layhid6 4.740407e-01
## lead_time_box.to.1layhid6 1.148329e+00
## kidsyes.to.1layhid6 -3.405187e-01
## cancyes.to.1layhid6 -1.209551e+01
## adr_box.to.1layhid6 6.371121e+00
## parkingyes.to.1layhid6 -5.121759e+02
## requestsA1.to.1layhid6 3.553015e-01
## requestsA2.to.1layhid6 -1.295809e+00
## requestsA3_.to.1layhid6 5.912159e-01
## days.to.1layhid6 1.096411e+01
## Intercept.to.2layhid1 -6.505797e-01
## 1layhid1.to.2layhid1 1.549556e+00
## 1layhid2.to.2layhid1 -1.213997e+00
## 1layhid3.to.2layhid1 -1.343260e+00
## 1layhid4.to.2layhid1 -4.403113e-01
## 1layhid5.to.2layhid1 5.095590e+00
## 1layhid6.to.2layhid1 -1.168404e+01
## Intercept.to.2layhid2 -1.401423e+00
## 1layhid1.to.2layhid2 -5.702424e-01
## 1layhid2.to.2layhid2 1.768272e+00
## 1layhid3.to.2layhid2 1.171244e+00
## 1layhid4.to.2layhid2 1.570198e+00
## 1layhid5.to.2layhid2 -1.676968e+00
## 1layhid6.to.2layhid2 2.178951e+00
## Intercept.to.2layhid3 1.890084e+00
## 1layhid1.to.2layhid3 1.687252e+00
## 1layhid2.to.2layhid3 -2.458881e+00
## 1layhid3.to.2layhid3 -2.244527e+00
## 1layhid4.to.2layhid3 -1.699384e+00
## 1layhid5.to.2layhid3 1.438998e+00
## 1layhid6.to.2layhid3 -3.859045e+00
## Intercept.to.2layhid4 7.931987e-01
## 1layhid1.to.2layhid4 2.250721e-01
## 1layhid2.to.2layhid4 -1.724975e-01
## 1layhid3.to.2layhid4 -1.177952e+00
## 1layhid4.to.2layhid4 -2.314229e+00
## 1layhid5.to.2layhid4 2.453100e+00
## 1layhid6.to.2layhid4 -2.551598e+00
## Intercept.to.is_canceled 9.473638e-01
## 2layhid1.to.is_canceled -2.097638e+01
## 2layhid2.to.is_canceled 3.418909e+02
## 2layhid3.to.is_canceled -3.463026e+02
## 2layhid4.to.is_canceled -9.302619e+01
plot(nn2, rep = 'best')
nn2_pred <- compute(nn1, hotel_train_mtx[, -1])
confusionMatrix(as.numeric((nn2_pred$net.result > 0.5)) %>% as.factor(),
as.factor(ifelse(hotel_train$is_canceled == "Yes", 1, 0)))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 42629 11771
## 1 8073 18916
##
## Accuracy : 0.7562
## 95% CI : (0.7532, 0.7591)
## No Information Rate : 0.623
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4683
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8408
## Specificity : 0.6164
## Pos Pred Value : 0.7836
## Neg Pred Value : 0.7009
## Prevalence : 0.6230
## Detection Rate : 0.5238
## Detection Prevalence : 0.6684
## Balanced Accuracy : 0.7286
##
## 'Positive' Class : 0
##
nn2_roc <- roc(as.numeric(hotel_train$is_canceled == "Yes"),
compute(nn2, hotel_train_mtx[, -1])$net.result[, 1])
The second layer didn’t change much comparing to the previous model. As always we will check the predictions on the testing sample.
nn2_test_pred <- compute(nn2, hotel_test_mtx[, -1])
confusionMatrix(as.numeric((nn2_test_pred$net.result > 0.5)) %>% as.factor(),
as.factor(ifelse(hotel_test$is_canceled == "Yes", 1, 0)))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 17950 5182
## 1 3919 7854
##
## Accuracy : 0.7393
## 95% CI : (0.7346, 0.7439)
## No Information Rate : 0.6265
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4317
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8208
## Specificity : 0.6025
## Pos Pred Value : 0.7760
## Neg Pred Value : 0.6671
## Prevalence : 0.6265
## Detection Rate : 0.5143
## Detection Prevalence : 0.6627
## Balanced Accuracy : 0.7116
##
## 'Positive' Class : 0
##
nn2_test_roc <- roc(as.numeric(hotel_test$is_canceled == "Yes"),
compute(nn2, hotel_test_mtx[, -1])$net.result[, 1])
Almost the same results and same conclusions. Out of those models the first one was definitely the best, although its performance is still much worse than for example the XGBoost models from the previous part of analysis, based on the ROC curve and Gini coefficients for neural nets which are presented on the graph below.
list(
nn0_roc = nn0_roc,
nn0_test_roc = nn0_test_roc,
nn1_roc = nn1_roc,
nn1_test_roc = nn1_test_roc,
nn2_roc = nn2_roc,
nn2_test_roc = nn2_test_roc
) %>%
pROC::ggroc(alpha = 0.5, linetype = 1, size = 1) +
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1),
color = "grey",
linetype = "dashed") +
labs(subtitle = c(paste("Gini: ",
"nn0 train = ",
round(100 * (2 * auc(nn0_roc) - 1), 1), "%, ",
"nn0 test = ",
round(100 * (2 * auc(nn0_test_roc) - 1), 1), "%, ",
"nn1 train = ",
round(100 * (2 * auc(nn1_roc) - 1), 1), "%, ",
"nn1 test = ",
round(100 * (2 * auc(nn1_test_roc) - 1), 1), "%, \n",
"nn2 train = ",
round(100 * (2 * auc(nn2_roc) - 1), 1), "%, ",
"nn2 test = ",
round(100 * (2 * auc(nn2_test_roc) - 1), 1), "% "))) +
theme_bw() + coord_fixed() +
scale_color_brewer(palette = "Paired")
So which model from our analysis was the best one? Let’s combine the results and have a look at them once again.
Results_train <- rbind(2 * auc(tree1_roc) - 1, 2 * auc(tree2_roc) - 1, 2 * auc(tree3_roc) - 1, 2 * auc(sub_pred_roc) - 1, 2 * auc(xgb_roc) - 1, 2 * auc(xgb_roc2) - 1, 2 * auc(xgb_roc3) - 1, 2 * auc(xgb_roc4) - 1, 2 * auc(xgb_roc5) - 1, 2 * auc(xgb_roc8) - 1, 2 * auc(nn0_roc) - 1, 2 * auc(nn1_roc) - 1, 2 * auc(nn2_roc) - 1)
Results_test <- rbind(2 * auc(tree1_test_roc) - 1, 2 * auc(tree2_test_roc) - 1, 2 * auc(tree3_test_roc) - 1, 2 * auc(sub_test_pred_roc) - 1, 2 * auc(xgb_test_roc) - 1, 2 * auc(xgb_test_roc2) - 1, 2 * auc(xgb_test_roc3) - 1, 2 * auc(xgb_test_roc4) - 1, 2 * auc(xgb_test_roc5) - 1, 2 * auc(xgb_test_roc8) - 1, 2 * auc(nn0_test_roc) - 1, 2 * auc(nn1_test_roc) - 1, 2 * auc(nn2_test_roc) - 1)
Results <- cbind(Results_train, Results_test)
rownames(Results) <- c("Tree 1 (default)", "Tree 2 (big nodes)", "Tree 3 (cv)", "Subagging of trees",
"XGBoost 1", "XGBoost 2", "XGBoost 3", "XGBoost 4", "XGBoost 5", "XGBoost 8", "Neural Nets (0 layers)", "Neural Nets (1 layer)", "Neural Nets (2 layers)")
colnames(Results) <- c("Gini - Training sample", "Gini - Testing sample")
Results
## Gini - Training sample Gini - Testing sample
## Tree 1 (default) 0.6294704 0.6269355
## Tree 2 (big nodes) 0.5869887 0.5856268
## Tree 3 (cv) 0.8657437 0.7734859
## Subagging of trees 0.8740883 0.7748227
## XGBoost 1 0.6843431 0.6807081
## XGBoost 2 0.8095594 0.7847885
## XGBoost 3 0.8352334 0.7930082
## XGBoost 4 0.8675867 0.8053538
## XGBoost 5 0.8711783 0.8073450
## XGBoost 8 0.8872846 0.8183211
## Neural Nets (0 layers) 0.6987147 0.6958622
## Neural Nets (1 layer) 0.5660880 0.5569689
## Neural Nets (2 layers) 0.5592981 0.5584587
Not every modification resulted in a better model. Neural Networks are the best in their simplest version but still not good enough. Second tree is overall the worst model out of all created. We can see that Tree with cross-validation and Subagging of trees have high Gini coefficients for training data but for testing data it is lower by about 10 percentage points, although still quite high. Probably the best predictions are from XGBoost models. The better were predictions for XGBoost on the training sample, the better they were on the testing sample also, but the difference between increased steadily, so we may have an overfitting for the best performing XGBoost models. If we want to be sure that our model will keep the predictions close to the level of predictions from training sample, then we might prefer second or third XGBoost model, because the differences between Gini coefficient for all boosting models aside from the first one are very small.
All in all, it is not always true that a more complicated model will perform better than a simple one, but often that is the true. Then we may need to decide between explainability and better efficiency in our model. It is up to analyst to understand the specific case he’s dealing with and to know if he can resign of the explainability in favor of having the model which is better fitted to the data.