photo source

Business Case

Hello all! It’s Gissella Nadya again and today we are going to explore on yet another Classification Models, those are Naive Bayes, Decision Tree and Random Forest. The dataset can be accessed on Kaggle and it was originally obtained from an article on ScienceDirect. It was then made avaiable here on February 11th 2020. Let’s get started!

Getting Started

Installing the Libraries

library(tidyverse)
library(ggplot2)
library(ggmosaic)
library(plotly)
library(e1071) # naiveBayes()
library(caret) # cm
library(partykit) # ctree()
library(tictoc)
library(rpart)
library(randomForest)
library(ROCR)

Reading the Data

First of all, let’s get to know the data first.

bookings <-  read_csv("hotel_bookings.csv") %>%  janitor::clean_names()
head(bookings)

Data Wrangling

glimpse(bookings)
#> Rows: 119,390
#> Columns: 32
#> $ hotel                          <chr> "Resort Hotel", "Resort Hotel", "Resor…
#> $ is_canceled                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
#> $ lead_time                      <dbl> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75,…
#> $ arrival_date_year              <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 20…
#> $ arrival_date_month             <chr> "July", "July", "July", "July", "July"…
#> $ arrival_date_week_number       <dbl> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27…
#> $ arrival_date_day_of_month      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ stays_in_weekend_nights        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ stays_in_week_nights           <dbl> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4,…
#> $ adults                         <dbl> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
#> $ children                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ babies                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ meal                           <chr> "BB", "BB", "BB", "BB", "BB", "BB", "B…
#> $ country                        <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "GB…
#> $ market_segment                 <chr> "Direct", "Direct", "Direct", "Corpora…
#> $ distribution_channel           <chr> "Direct", "Direct", "Direct", "Corpora…
#> $ is_repeated_guest              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ previous_cancellations         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ reserved_room_type             <chr> "C", "C", "A", "A", "A", "A", "C", "C"…
#> $ assigned_room_type             <chr> "C", "C", "C", "A", "A", "A", "C", "C"…
#> $ booking_changes                <dbl> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ deposit_type                   <chr> "No Deposit", "No Deposit", "No Deposi…
#> $ agent                          <chr> "NULL", "NULL", "NULL", "304", "240", …
#> $ company                        <chr> "NULL", "NULL", "NULL", "NULL", "NULL"…
#> $ days_in_waiting_list           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ customer_type                  <chr> "Transient", "Transient", "Transient",…
#> $ adr                            <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00…
#> $ required_car_parking_spaces    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ total_of_special_requests      <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3,…
#> $ reservation_status             <chr> "Check-Out", "Check-Out", "Check-Out",…
#> $ reservation_status_date        <date> 2015-07-01, 2015-07-01, 2015-07-02, 2…

To understand more about our dataset, here are the explanations for each columns:

variable class description
hotel character Hotel (H1 = Resort Hotel or H2 = City Hotel)
is_canceled double Value indicating if the booking was canceled (1) or not (0)
lead_time double Number of days that elapsed between the entering date of the booking into the PMS and the arrival date
arrival_date_year double Year of arrival date
arrival_date_month character Month of arrival date
arrival_date_week_number double Week number of year for arrival date
arrival_date_day_of_month double Day of arrival date
stays_in_weekend_nights double Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
stays_in_week_nights double Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
adults double Number of adults
children double Number of children
babies double Number of babies
meal character Type of meal booked. Categories are presented in standard hospitality meal packages: Undefined/SC – no meal package;BB – Bed & Breakfast; HB – Half board (breakfast and one other meal – usually dinner); FB – Full board (breakfast, lunch and dinner)
country character Country of origin. Categories are represented in the ISO 3155–3:2013 format
market_segment character Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”
distribution_channel character Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”
is_repeated_guest double Value indicating if the booking name was from a repeated guest (1) or not (0)
previous_cancellations double Number of previous bookings that were cancelled by the customer prior to the current booking
previous_bookings_not_canceled double Number of previous bookings not cancelled by the customer prior to the current booking
reserved_room_type character Code of room type reserved. Code is presented instead of designation for anonymity reasons
assigned_room_type character Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons
booking_changes double Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation
deposit_type character Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories:No Deposit – no deposit was made;Non Refund – a deposit was made in the value of the total stay cost;Refundable – a deposit was made with a value under the total cost of stay.
agent character ID of the travel agency that made the booking
company character ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons
days_in_waiting_list double Number of days the booking was in the waiting list before it was confirmed to the customer
customer_type character Type of booking, assuming one of four categories:Contract - when the booking has an allotment or other type of contract associated to it;Group – when the booking is associated to a group;Transient – when the booking is not part of a group or contract, and is not associated to other transient booking;Transient-party – when the booking is transient, but is associated to at least other transient booking
adr double Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights
required_car_parking_spaces double Number of car parking spaces required by the customer
total_of_special_requests double Number of special requests made by the customer (e.g. twin bed or high floor)
reservation_status character Reservation last status, assuming one of three categories:Canceled – booking was canceled by the customer;Check-Out – customer has checked in but already departed;No-Show – customer did not check-in and did inform the hotel of the reason why
reservation_status_date double Date at which the last status was set. This variable can be used in conjunction with the ReservationStatus to understand when was the booking canceled or when did the customer checked-out of the hotel

target = is_cancelled

unique(sort(bookings$hotel))
#> [1] "City Hotel"   "Resort Hotel"

Here we can see that the dataset consists of data from a city hotel and resort hotel. Today we are going to focus only to our Resort Hotel dataset.

resort <- bookings %>% 
  filter(hotel == "Resort Hotel")

resort <-  resort %>% 
  select(-hotel) %>% 
  mutate_if(~is.character(.), ~as.factor(.)) %>% 
  mutate(is_canceled = as.factor(is_canceled),
         is_repeated_guest = as.factor(is_repeated_guest),
         total_of_special_requests = as.factor(total_of_special_requests))
resort$arrival_date_month <- factor(resort$arrival_date_month, levels = month.name)
colSums(is.na(resort))
#>                    is_canceled                      lead_time 
#>                              0                              0 
#>              arrival_date_year             arrival_date_month 
#>                              0                              0 
#>       arrival_date_week_number      arrival_date_day_of_month 
#>                              0                              0 
#>        stays_in_weekend_nights           stays_in_week_nights 
#>                              0                              0 
#>                         adults                       children 
#>                              0                              0 
#>                         babies                           meal 
#>                              0                              0 
#>                        country                 market_segment 
#>                              0                              0 
#>           distribution_channel              is_repeated_guest 
#>                              0                              0 
#>         previous_cancellations previous_bookings_not_canceled 
#>                              0                              0 
#>             reserved_room_type             assigned_room_type 
#>                              0                              0 
#>                booking_changes                   deposit_type 
#>                              0                              0 
#>                          agent                        company 
#>                              0                              0 
#>           days_in_waiting_list                  customer_type 
#>                              0                              0 
#>                            adr    required_car_parking_spaces 
#>                              0                              0 
#>      total_of_special_requests             reservation_status 
#>                              0                              0 
#>        reservation_status_date 
#>                              0

No missing values. Great!

We noticed that there are variables for stays_in_weekend_nights and stays_in_week_nights, in this feature engineering, we are going to make a new column for the total of this two. Next, we are also going to count the total of amount that the guest have to pay and named it total_stay_cost which would be counted by the Average Daily Rate * Total Number of Night Stays.

resort <- resort %>% 
  mutate(stays_in_total_nights = stays_in_weekend_nights + stays_in_week_nights,
         total_stay_cost = adr * stays_in_total_nights)
resort <- resort %>% 
  mutate(is_canceled = case_when(is_canceled == 0 ~ "No",
                                 is_canceled == 1 ~ "Yes"), 
         is_canceled = as.factor(is_canceled)) %>% 
  rename(canceled = is_canceled)

resort$canceled <- factor(resort$canceled, levels = c("No", "Yes"))

EDA

summary(resort)
#>  canceled      lead_time      arrival_date_year arrival_date_month
#>  No :28938   Min.   :  0.00   Min.   :2015      August : 4894     
#>  Yes:11122   1st Qu.: 10.00   1st Qu.:2016      July   : 4573     
#>              Median : 57.00   Median :2016      April  : 3609     
#>              Mean   : 92.68   Mean   :2016      May    : 3559     
#>              3rd Qu.:155.00   3rd Qu.:2017      October: 3555     
#>              Max.   :737.00   Max.   :2017      March  : 3336     
#>                                                 (Other):16534     
#>  arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
#>  Min.   : 1.00            Min.   : 1.00             Min.   : 0.00          
#>  1st Qu.:16.00            1st Qu.: 8.00             1st Qu.: 0.00          
#>  Median :28.00            Median :16.00             Median : 1.00          
#>  Mean   :27.14            Mean   :15.82             Mean   : 1.19          
#>  3rd Qu.:38.00            3rd Qu.:24.00             3rd Qu.: 2.00          
#>  Max.   :53.00            Max.   :31.00             Max.   :19.00          
#>                                                                            
#>  stays_in_week_nights     adults          children           babies      
#>  Min.   : 0.000       Min.   : 0.000   Min.   : 0.0000   Min.   :0.0000  
#>  1st Qu.: 1.000       1st Qu.: 2.000   1st Qu.: 0.0000   1st Qu.:0.0000  
#>  Median : 3.000       Median : 2.000   Median : 0.0000   Median :0.0000  
#>  Mean   : 3.129       Mean   : 1.867   Mean   : 0.1287   Mean   :0.0139  
#>  3rd Qu.: 5.000       3rd Qu.: 2.000   3rd Qu.: 0.0000   3rd Qu.:0.0000  
#>  Max.   :50.000       Max.   :55.000   Max.   :10.0000   Max.   :2.0000  
#>                                                                          
#>         meal          country            market_segment  distribution_channel
#>  BB       :30005   PRT    :17630   Complementary:  201   Corporate: 3269     
#>  FB       :  754   GBR    : 6814   Corporate    : 2309   Direct   : 7865     
#>  HB       : 8046   ESP    : 3957   Direct       : 6513   TA/TO    :28925     
#>  SC       :   86   IRL    : 2166   Groups       : 5836   Undefined:    1     
#>  Undefined: 1169   FRA    : 1611   Offline TA/TO: 7472                       
#>                    DEU    : 1203   Online TA    :17729                       
#>                    (Other): 6679                                             
#>  is_repeated_guest previous_cancellations previous_bookings_not_canceled
#>  0:38282           Min.   : 0.0000        Min.   : 0.0000               
#>  1: 1778           1st Qu.: 0.0000        1st Qu.: 0.0000               
#>                    Median : 0.0000        Median : 0.0000               
#>                    Mean   : 0.1017        Mean   : 0.1465               
#>                    3rd Qu.: 0.0000        3rd Qu.: 0.0000               
#>                    Max.   :26.0000        Max.   :30.0000               
#>                                                                         
#>  reserved_room_type assigned_room_type booking_changes      deposit_type  
#>  A      :23399      A      :17046      Min.   : 0.000   No Deposit:38199  
#>  D      : 7433      D      :10339      1st Qu.: 0.000   Non Refund: 1719  
#>  E      : 4982      E      : 5638      Median : 0.000   Refundable:  142  
#>  G      : 1610      C      : 2214      Mean   : 0.288                     
#>  F      : 1106      G      : 1853      3rd Qu.: 0.000                     
#>  C      :  918      F      : 1733      Max.   :17.000                     
#>  (Other):  612      (Other): 1237                                         
#>      agent          company      days_in_waiting_list         customer_type  
#>  240    :13905   NULL   :36952   Min.   :  0.0000     Contract       : 1776  
#>  NULL   : 8209   223    :  784   1st Qu.:  0.0000     Group          :  284  
#>  250    : 2869   281    :  138   Median :  0.0000     Transient      :30209  
#>  241    : 1721   154    :  133   Mean   :  0.5278     Transient-Party: 7791  
#>  40     : 1002   405    :  100   3rd Qu.:  0.0000                            
#>  314    :  927   94     :   87   Max.   :185.0000                            
#>  (Other):11427   (Other): 1866                                               
#>       adr         required_car_parking_spaces total_of_special_requests
#>  Min.   : -6.38   Min.   :0.0000              0:22361                  
#>  1st Qu.: 50.00   1st Qu.:0.0000              1:11806                  
#>  Median : 75.00   Median :0.0000              2: 4827                  
#>  Mean   : 94.95   Mean   :0.1381              3:  910                  
#>  3rd Qu.:125.00   3rd Qu.:0.0000              4:  142                  
#>  Max.   :508.00   Max.   :8.0000              5:   14                  
#>                                                                        
#>  reservation_status reservation_status_date stays_in_total_nights
#>  Canceled :10831    Min.   :2014-11-18      Min.   : 0.000       
#>  Check-Out:28938    1st Qu.:2016-01-26      1st Qu.: 2.000       
#>  No-Show  :  291    Median :2016-07-31      Median : 3.000       
#>                     Mean   :2016-07-28      Mean   : 4.319       
#>                     3rd Qu.:2017-02-11      3rd Qu.: 7.000       
#>                     Max.   :2017-09-14      Max.   :69.000       
#>                                                                  
#>  total_stay_cost 
#>  Min.   : -63.8  
#>  1st Qu.: 117.0  
#>  Median : 273.0  
#>  Mean   : 435.4  
#>  3rd Qu.: 593.0  
#>  Max.   :7590.0  
#> 

If we take a look nicely to the summary, we can see that the variable company, and agent has many NULL in it. Therefore, we are going to ignore these columns.

cleaned_hotel <-  resort %>% 
  select(-c(company, agent))
resort %>% 
  ggplot() + 
  geom_mosaic(aes(x = product(canceled),
                  fill = canceled,
                  conds = product(is_repeated_guest))) + 
  labs(title = "Does our Repeated Guests Make Many Canceled Bookings?") +
  theme(legend.position = "none", 
        title = element_text(face = "bold"),
        plot.title = element_text(hjust = 0.5, size = 14))

Lead times is the gap between the arrival date and the booking date. Through the boxplot, we are able to tell that the bookings that people who books later tends to lead to a canceled bookings, while those who books very early, tend to be a successful booking and leads to arrival to the hotel.

According to our dataset, the most booking happened was in the year 2016.

Cross-Validation

RNGkind(sample.kind = "Rounding")
set.seed(598)

index <- sample(nrow(cleaned_hotel), nrow(cleaned_hotel)*0.8)
hotel_train <- cleaned_hotel[index, ]
hotel_test <- cleaned_hotel[-index, ]
prop.table(table(hotel_train$canceled))
#> 
#>        No       Yes 
#> 0.7234461 0.2765539

72:27. The data seems to be imbalance, therefore we are going to perform a downsampling method.

hotel_train <- downSample(x = hotel_train %>%  select(-canceled),
                          y = hotel_train$canceled,
                          yname = "canceled")
prop.table(table(hotel_train$canceled))
#> 
#>  No Yes 
#> 0.5 0.5

Building the Models

Naive Bayes

Modeling

tic()
model_naive <- naiveBayes(x = hotel_train %>% select(-c(canceled, reservation_status)),
                          y = hotel_train$canceled,
                          laplace = 1) # angka untuk laplace smoothing
toc()
#> 0.13 sec elapsed

Model Interpretation

Here are interpretation samples.

model_naive$tables$is_repeated_guest
#>                     is_repeated_guest
#> hotel_train$canceled          0          1
#>                  No  0.94280880 0.05719120
#>                  Yes 0.98996052 0.01003948
  • the proportion of people who canceled the bookings (Yes) and not a repeated guests (0) are 0.98996052.
model_naive$tables$market_segment
#>                     market_segment
#> hotel_train$canceled Complementary   Corporate      Direct      Groups
#>                  No    0.005637614 0.066411095 0.196076221 0.118389897
#>                  Yes   0.003382568 0.033036419 0.079264855 0.223700530
#>                     market_segment
#> hotel_train$canceled Offline TA/TO   Online TA
#>                  No    0.208478972 0.405006201
#>                  Yes   0.103281091 0.557334536
  • the highest proportion of people who canceled there booking is people who booked online via travel agent with 0.557334536.

Prediction

tic()
naive_class <- predict(model_naive, hotel_test %>% select(-reservation_status), type = "class")
toc()
#> 8.458 sec elapsed
table(naive_class)
#> naive_class
#>   No  Yes 
#> 1974 6038
head(naive_class)
#> [1] No  Yes Yes Yes Yes Yes
#> Levels: No Yes

Decision Tree

Data Wrangling for DT

before we model, we need to make sure that there is no date class in our data frame

dt_hotel_train <- hotel_train %>% select(-c(reservation_status_date, reservation_status))
dt_hotel_test <- hotel_test %>% select(-c(reservation_status_date, reservation_status))

Modeling

model_dtree <- ctree(formula = canceled ~ ., 
                     data = dt_hotel_train)

Surprisingly, by using the partykit package and ctree() function for the decision tree, are not able to distinguish anything. Let’s try to use another Decission Tree Method using the rpart package.

tic()
set.seed(598)
model_dt_rpart <- rpart(canceled ~ ., data = dt_hotel_train, method = "class")
toc()
#> 1.663 sec elapsed
rattle::fancyRpartPlot(model_dt_rpart, sub = NULL, palettes = "YlOrBr")

It looks like the plot is self explanatory. It is interesting that the model divided it using the country as the first decision.

Prediction

dtree_rpart_pred <- select(dt_hotel_test, canceled) %>% 
  bind_cols(pred_class = predict(model_dt_rpart, 
                                 newdata = dt_hotel_test, type = "class"))
head(dtree_rpart_pred)

Yikes, it appears we have some false predictions here, but to make sure we are going to see later in our model evaluation.

Random Forest

Data Wrangling for RF

Random Forest model are known with their heavy and long computational load. To overcome this characteristic, we are able to use nearZeroVar() function from the package caret to eliminate the columns with variance close to 0 (uninformative columns).

dim(cleaned_hotel)
#> [1] 40060    31
nzvar <- nearZeroVar(cleaned_hotel)
rf_hotel <- cleaned_hotel[, -nzvar]

dim(rf_hotel)
#> [1] 40060    24

Using the nearZeroVar() function, we are able to eliminate from 31 columns to 25. Now we can proceed with another cross validation step.

RNGkind(sample.kind = "Rounding")
set.seed(598)

index <- sample(nrow(rf_hotel), nrow(rf_hotel) * 0.8)

rf_hotel_train <- rf_hotel[index, ]
rf_hotel_test <- rf_hotel[-index, ]
prop.table(table(rf_hotel_train$canceled))
#> 
#>        No       Yes 
#> 0.7234461 0.2765539

As before, we are going to do downsampling.

rf_hotel_train <- downSample(x = rf_hotel_train %>%  select(-canceled),
                             y = rf_hotel_train$canceled,
                             yname = "canceled")
prop.table(table(rf_hotel_train$canceled))
#> 
#>  No Yes 
#> 0.5 0.5

Modeling

tic()
set.seed(598)

ctrl <- trainControl(method = "repeatedcv", #k-fold
                     number = 5, # k-fold cross validation (k=5) 
                     repeats = 3) # 3x repeat

model_forest <- train(canceled ~ ., 
                      data = rf_hotel_train, 
                      method = "rf", 
                      trControl = ctrl)

# saveRDS(model_forest, "model_forest_resort_1.RDS") # save the model
toc()
#> [1] "1.14 hours elapsed to create the model"
hotel_forest_rds <- readRDS("model_forest_resort_1.RDS")
hotel_forest_rds
#> Random Forest 
#> 
#> 17726 samples
#>    23 predictor
#>     2 classes: 'No', 'Yes' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold, repeated 3 times) 
#> Summary of sample sizes: 14181, 14180, 14181, 14180, 14182, 14181, ... 
#> Resampling results across tuning parameters:
#> 
#>   mtry  Accuracy   Kappa    
#>     2   0.7811311  0.5622615
#>    96   1.0000000  1.0000000
#>   190   1.0000000  1.0000000
#> 
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 96.

mtry is the number of columns that was made as the predictors. For hotel_forest_rds model, there are 96 splitting nodes.

Out of Bag Error

hotel_forest_rds$finalModel
#> 
#> Call:
#>  randomForest(x = x, y = y, mtry = param$mtry) 
#>                Type of random forest: classification
#>                      Number of trees: 500
#> No. of variables tried at each split: 96
#> 
#>         OOB estimate of  error rate: 0%
#> Confusion matrix:
#>       No  Yes class.error
#> No  8863    0           0
#> Yes    0 8863           0

For the hotel_forest_rds model, the Out of Bag Error is 0%. It means that the accuracy fo the model is 100%! Which is a bit suspicious.

\[Accuracy = 100 - 0 = 100 \%\]

Interpretation

varImp(hotel_forest_rds)
#> rf variable importance
#> 
#>   only 20 most important variables shown (out of 190)
#> 
#>                               Overall
#> reservation_statusCheck-Out 100.00000
#> countryPRT                    3.39248
#> lead_time                     3.13235
#> required_car_parking_spaces   2.41052
#> market_segmentOnline TA       0.62521
#> reservation_status_date       0.41835
#> customer_typeTransient        0.28214
#> market_segmentOffline TA/TO   0.14293
#> countryGBR                    0.13807
#> stays_in_total_nights         0.13744
#> reservation_statusNo-Show     0.12462
#> total_stay_cost               0.06876
#> market_segmentDirect          0.05254
#> market_segmentGroups          0.04598
#> distribution_channelTA/TO     0.03999
#> booking_changes               0.03975
#> arrival_date_year             0.02933
#> adr                           0.01956
#> distribution_channelDirect    0.01687
#> arrival_date_week_number      0.01027

Here we see that the reservation_statusCheck-Out is the most important variable for seeing if a booking is cancelled or not. Surprisingly that is common logic. Therefore we are going to try a hypothesis in which we remove this variable.

Prediction

forest_class <- predict(hotel_forest_rds, rf_hotel_test, type = "raw")
head(forest_class)
#> [1] No  No  Yes No  No  No 
#> Levels: No Yes

RF 2

Data Wrangling for RF 2

cleaned_hotel2 <- cleaned_hotel %>%  select(-reservation_status)
dim(cleaned_hotel2)
#> [1] 40060    30
nzvar2 <- nearZeroVar(cleaned_hotel2)
rf_hotel2 <- cleaned_hotel2[, -nzvar2]

dim(rf_hotel2)
#> [1] 40060    23

Using the nearZeroVar() function, we are able to eliminate from 30 columns to 23. Now we can proceed with another cross validation step.

RNGkind(sample.kind = "Rounding")
set.seed(598)

index <- sample(nrow(rf_hotel2), nrow(rf_hotel2) * 0.8)

rf_hotel_train2 <- rf_hotel2[index, ]
rf_hotel_test2 <- rf_hotel2[-index, ]
prop.table(table(rf_hotel_train2$canceled))
#> 
#>        No       Yes 
#> 0.7234461 0.2765539
rf_hotel_train2 <- downSample(x = rf_hotel_train2 %>%  select(-canceled),
                             y = rf_hotel_train2$canceled,
                             yname = "canceled")
prop.table(table(rf_hotel_train2$canceled))
#> 
#>  No Yes 
#> 0.5 0.5

Modeling

tic()
set.seed(598)

ctrl2 <- trainControl(method = "repeatedcv", #k-fold
                      number = 5, # k-fold cross validation (k=5) 
                      repeats = 3) # 3x repeat

model_forest2 <- train(canceled ~ ., 
                       data = rf_hotel_train2, 
                       method = "rf", 
                       trControl = ctrl2)

# saveRDS(model_forest2, "model_forest_resort_2.RDS") # save the model
toc()
#> [1] "4.41 hours elapsed to create the model"
hotel_forest_rds2 <- readRDS("model_forest_resort_2.RDS")
hotel_forest_rds2
#> Random Forest 
#> 
#> 17726 samples
#>    22 predictor
#>     2 classes: 'No', 'Yes' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold, repeated 3 times) 
#> Summary of sample sizes: 14181, 14180, 14181, 14180, 14182, 14181, ... 
#> Resampling results across tuning parameters:
#> 
#>   mtry  Accuracy   Kappa    
#>     2   0.6905469  0.3810921
#>    95   0.9175416  0.8350833
#>   188   0.9197041  0.8394082
#> 
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 188.

mtry is the number of columns that was made as the predictors. For hotel_forest_rds model, there are 188 splitting nodes.

Out of Bag Error

hotel_forest_rds2$finalModel
#> 
#> Call:
#>  randomForest(x = x, y = y, mtry = param$mtry) 
#>                Type of random forest: classification
#>                      Number of trees: 500
#> No. of variables tried at each split: 188
#> 
#>         OOB estimate of  error rate: 7.62%
#> Confusion matrix:
#>       No  Yes class.error
#> No  8205  658  0.07424123
#> Yes  693 8170  0.07819023

For the hotel_forest_rds2 model, the Out of Bag Error is 7.62%. It means that the accuracy fo the model is 92.38%! Which means very good!

\[Accuracy = 100 - 7.62 = 92.38 \%\]

Interpretation

varImp(hotel_forest_rds2)
#> rf variable importance
#> 
#>   only 20 most important variables shown (out of 188)
#> 
#>                             Overall
#> reservation_status_date     100.000
#> lead_time                    87.854
#> countryPRT                   60.552
#> required_car_parking_spaces  49.957
#> arrival_date_week_number     40.886
#> market_segmentOnline TA      37.148
#> adr                          22.383
#> total_stay_cost              16.312
#> arrival_date_day_of_month    14.061
#> arrival_date_year             8.325
#> stays_in_total_nights         7.102
#> booking_changes               6.684
#> total_of_special_requests1    5.509
#> stays_in_week_nights          4.981
#> customer_typeTransient        4.583
#> distribution_channelDirect    4.299
#> total_of_special_requests2    3.729
#> mealUndefined                 3.552
#> stays_in_weekend_nights       3.198
#> adults                        2.362

For hotel_forest_rds2 model, the most important variable is reservation_status_date

Prediction

forest_class_2 <- predict(hotel_forest_rds2, rf_hotel_test2, type = "raw")
head(forest_class_2)
#> [1] Yes Yes Yes No  Yes No 
#> Levels: No Yes

Model Evaluation

Naive Bayes

cm_nb <- confusionMatrix(naive_class, 
                         hotel_test$canceled, 
                         positive = "Yes")
cm_nb
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   No  Yes
#>        No  1901   73
#>        Yes 3852 2186
#>                                              
#>                Accuracy : 0.5101             
#>                  95% CI : (0.4991, 0.5211)   
#>     No Information Rate : 0.718              
#>     P-Value [Acc > NIR] : 1                  
#>                                              
#>                   Kappa : 0.1977             
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 0.9677             
#>             Specificity : 0.3304             
#>          Pos Pred Value : 0.3620             
#>          Neg Pred Value : 0.9630             
#>              Prevalence : 0.2820             
#>          Detection Rate : 0.2728             
#>    Detection Prevalence : 0.7536             
#>       Balanced Accuracy : 0.6491             
#>                                              
#>        'Positive' Class : Yes                
#> 

Decision Tree

cm_dt_rpart <- confusionMatrix(dtree_rpart_pred$pred_class, 
                               dt_hotel_test$canceled, positive = "Yes")
cm_dt_rpart
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   No  Yes
#>        No  4685  307
#>        Yes 1068 1952
#>                                                
#>                Accuracy : 0.8284               
#>                  95% CI : (0.8199, 0.8366)     
#>     No Information Rate : 0.718                
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.6155               
#>                                                
#>  Mcnemar's Test P-Value : < 0.00000000000000022
#>                                                
#>             Sensitivity : 0.8641               
#>             Specificity : 0.8144               
#>          Pos Pred Value : 0.6464               
#>          Neg Pred Value : 0.9385               
#>              Prevalence : 0.2820               
#>          Detection Rate : 0.2436               
#>    Detection Prevalence : 0.3769               
#>       Balanced Accuracy : 0.8392               
#>                                                
#>        'Positive' Class : Yes                  
#> 

Random Forest

cm_rf <-  confusionMatrix(data = forest_class, 
                          reference = rf_hotel_test$canceled, 
                          positive = "Yes")
cm_rf
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   No  Yes
#>        No  5753    0
#>        Yes    0 2259
#>                                                
#>                Accuracy : 1                    
#>                  95% CI : (0.9995, 1)          
#>     No Information Rate : 0.718                
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 1                    
#>                                                
#>  Mcnemar's Test P-Value : NA                   
#>                                                
#>             Sensitivity : 1.000                
#>             Specificity : 1.000                
#>          Pos Pred Value : 1.000                
#>          Neg Pred Value : 1.000                
#>              Prevalence : 0.282                
#>          Detection Rate : 0.282                
#>    Detection Prevalence : 0.282                
#>       Balanced Accuracy : 1.000                
#>                                                
#>        'Positive' Class : Yes                  
#> 
cm_rf_2 <-  confusionMatrix(data = forest_class_2, 
                          reference = rf_hotel_test2$canceled, 
                          positive = "Yes")
cm_rf_2
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   No  Yes
#>        No  5305  143
#>        Yes  448 2116
#>                                                
#>                Accuracy : 0.9262               
#>                  95% CI : (0.9203, 0.9319)     
#>     No Information Rate : 0.718                
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.825                
#>                                                
#>  Mcnemar's Test P-Value : < 0.00000000000000022
#>                                                
#>             Sensitivity : 0.9367               
#>             Specificity : 0.9221               
#>          Pos Pred Value : 0.8253               
#>          Neg Pred Value : 0.9738               
#>              Prevalence : 0.2820               
#>          Detection Rate : 0.2641               
#>    Detection Prevalence : 0.3200               
#>       Balanced Accuracy : 0.9294               
#>                                                
#>        'Positive' Class : Yes                  
#> 

As we can see, the overall models that we created have its ups and downs. The Naive Bayes model really high on Recall / Sensitivity, but worst in all other metrics. The Decision Tree have overall above 0.80 values except the precision. As what we were expected, the First Random Forest model have 1 value in everything, which is very suspicious. And finally the second Random Forest have almost 0.90 in all the metrics, except precision. Because we would like to see the accuracy of the model, therefore we are going to use the Random Forest 2 model.

Best Model Evaluation

We are using our best model with the highest accuracy: rf_hotel_test2.

# probability results
rf_prob <- predict(hotel_forest_rds2, # best model
                   rf_hotel_test2, # data test 
                   type = "prob")
head(rf_prob)
# object predictions
forest_auc <- prediction(predictions = rf_prob[, 2], # prob kelas positif
                         labels = as.numeric(rf_hotel_test$canceled == "Yes")) # label kelas positif

# buat performance dari objek prediction
perf <- performance(prediction.obj = forest_auc,
                    measure = "tpr", # tpr = true positive rate
                    x.measure = "fpr") #fpr = false positive rate
                    
# buat plot
plot(perf)
abline(0,1, lty = 2, col = "red")

auc <- performance(prediction.obj = forest_auc, 
                   measure = "auc")

auc@y.values
#> [[1]]
#> [1] 0.9827171

The AUC value of rf_hotel_test2 model is almost 1, this means that our model could classify both the positive class and negative class very well.

Thank you very much for taking the time to read my report project. Hope you have a great day!