Introduction

This project utilizes the Hotel Bookings dataset provided by Jesse Mostipak on the Kaggle website. Link: https://www.kaggle.com/datasets/jessemostipak/hotel-booking-demand

This dataset records detailed information about more than 100,000 hotel bookings. I intend to employ two machine learning methods to determine which is more effective at predicting whether or not a guest will cancel their booking.

I access the hotel data using a relative file path.

data = read.csv(file = 'hotel_bookings.csv')

I will activate several libraries to enable functions for later use.

library(lattice)
library(ggplot2)
library(caret)
library(e1071)

I want my random results to be reproducible. The analysis in this project assumes that you have run the below cell to set the random seed to 428.

set.seed(428)

To ensure that later calculations are not confounded by missing data, I will immediately use na.omit to remove all rows containing missing values. To ensure that it worked, I will display the number of rows before and after purging the dataset of NA.

nrow(data)
## [1] 119390
data = na.omit(data)
nrow(data)
## [1] 119386

Apparently, there were only four rows with missing values in the entire dataset.

To cut down on processing times, I take a random sample of 5000 rows of the dataset for study, omitting the rest.

data = data[sample(nrow(data), size = 5000),]
nrow(data)
## [1] 5000

Before anything else is done with the data, let’s see the table.

head(data)
##               hotel is_canceled lead_time arrival_date_year arrival_date_month
## 80785    City Hotel           1        61              2015           November
## 99780    City Hotel           0         3              2016            October
## 109784   City Hotel           0        59              2017              April
## 105858   City Hotel           0       147              2017           February
## 95556    City Hotel           0        80              2016             August
## 39280  Resort Hotel           0       191              2017             August
##        arrival_date_week_number arrival_date_day_of_month
## 80785                        48                        23
## 99780                        43                        16
## 109784                       16                        16
## 105858                        7                        13
## 95556                        34                        18
## 39280                        32                         9
##        stays_in_weekend_nights stays_in_week_nights adults children babies meal
## 80785                        1                    3      2        0      0   BB
## 99780                        2                    0      2        0      0   BB
## 109784                       1                    0      2        0      0   SC
## 105858                       1                    3      2        0      0   BB
## 95556                        0                    3      2        0      0   BB
## 39280                        0                    4      2        0      0   BB
##        country market_segment distribution_channel is_repeated_guest
## 80785      PRT         Groups                TA/TO                 0
## 99780      ESP      Online TA                TA/TO                 0
## 109784     AUT      Online TA                TA/TO                 0
## 105858     FRA  Offline TA/TO                TA/TO                 0
## 95556      GBR         Direct               Direct                 0
## 39280       CN         Direct               Direct                 0
##        previous_cancellations previous_bookings_not_canceled reserved_room_type
## 80785                       0                              0                  G
## 99780                       0                              0                  A
## 109784                      0                              0                  A
## 105858                      0                              0                  A
## 95556                       0                              0                  E
## 39280                       0                              0                  A
##        assigned_room_type booking_changes deposit_type agent company
## 80785                   G               1   No Deposit    37    NULL
## 99780                   A               0   No Deposit    83    NULL
## 109784                  A               0   No Deposit     9    NULL
## 105858                  D               0   No Deposit    28    NULL
## 95556                   E               1   No Deposit    14    NULL
## 39280                   A               0   No Deposit   250    NULL
##        days_in_waiting_list   customer_type   adr required_car_parking_spaces
## 80785                     0 Transient-Party  10.0                           0
## 99780                     0       Transient 120.0                           0
## 109784                    0       Transient 108.0                           0
## 105858                    0       Transient  75.0                           0
## 95556                     0       Transient 168.3                           0
## 39280                     0       Transient 162.0                           0
##        total_of_special_requests reservation_status reservation_status_date
## 80785                          0            No-Show              2015-11-23
## 99780                          0          Check-Out              2016-10-18
## 109784                         0          Check-Out              2017-04-17
## 105858                         0          Check-Out              2017-02-17
## 95556                          1          Check-Out              2016-08-21
## 39280                          0          Check-Out              2017-08-13

There are 32 columns, sporting a mixture of numeric and categorical variables (along with one column containing dates). Notice the randomized row indices, indicative of my earlier shuffling.

Of particular interest is the is_canceled column, second from the left. It is to be treated as a Boolean variable, with 1 indicating a canceled reservation and 0 indicating a reservation that was fulfilled normally.

I wish to keep the bulk of the variables available for predictive modelling, but not all of these are necessary. To streamline things, I get rid of some of the less useful and more cumbersome ones. To do so, I replace my table with a subset of itself containing all but the variables I want to expunge.

data = subset(data, select = -c(is_repeated_guest, previous_cancellations, previous_bookings_not_canceled, agent, company, days_in_waiting_list, reservation_status_date, country, arrival_date_week_number, arrival_date_day_of_month))

Since they are still fresh from the csv file, my categorical variables are not internally labeled as factors, instead defaulting to the less useful character type. The next code chunk accomplishes this task.

Once these operations are performed, the table will be printed again.

data = transform(data, hotel = factor(hotel))
data = transform(data, is_canceled = factor(is_canceled))
data = transform(data, arrival_date_year = factor(arrival_date_year))
data = transform(data, arrival_date_month = factor(arrival_date_month))
data = transform(data, meal = factor(meal))
data = transform(data, market_segment = factor(market_segment))
data = transform(data, distribution_channel = factor(distribution_channel))
data = transform(data, reserved_room_type = factor(reserved_room_type))
data = transform(data, assigned_room_type = factor(assigned_room_type))
data = transform(data, deposit_type = factor(deposit_type))
data = transform(data, customer_type = factor(customer_type))
data = transform(data, reservation_status = factor(reservation_status))

head(data)
##               hotel is_canceled lead_time arrival_date_year arrival_date_month
## 80785    City Hotel           1        61              2015           November
## 99780    City Hotel           0         3              2016            October
## 109784   City Hotel           0        59              2017              April
## 105858   City Hotel           0       147              2017           February
## 95556    City Hotel           0        80              2016             August
## 39280  Resort Hotel           0       191              2017             August
##        stays_in_weekend_nights stays_in_week_nights adults children babies meal
## 80785                        1                    3      2        0      0   BB
## 99780                        2                    0      2        0      0   BB
## 109784                       1                    0      2        0      0   SC
## 105858                       1                    3      2        0      0   BB
## 95556                        0                    3      2        0      0   BB
## 39280                        0                    4      2        0      0   BB
##        market_segment distribution_channel reserved_room_type
## 80785          Groups                TA/TO                  G
## 99780       Online TA                TA/TO                  A
## 109784      Online TA                TA/TO                  A
## 105858  Offline TA/TO                TA/TO                  A
## 95556          Direct               Direct                  E
## 39280          Direct               Direct                  A
##        assigned_room_type booking_changes deposit_type   customer_type   adr
## 80785                   G               1   No Deposit Transient-Party  10.0
## 99780                   A               0   No Deposit       Transient 120.0
## 109784                  A               0   No Deposit       Transient 108.0
## 105858                  D               0   No Deposit       Transient  75.0
## 95556                   E               1   No Deposit       Transient 168.3
## 39280                   A               0   No Deposit       Transient 162.0
##        required_car_parking_spaces total_of_special_requests reservation_status
## 80785                            0                         0            No-Show
## 99780                            0                         0          Check-Out
## 109784                           0                         0          Check-Out
## 105858                           0                         0          Check-Out
## 95556                            0                         1          Check-Out
## 39280                            0                         0          Check-Out

Data Summary and Analysis

Simply seeing the data table doesn’t tell us enough about the data. To better understand the underlying trends, I will use the summary function to provide some quick facts on all 22 variables.

Numeric variables display simple summary statistics, whereas categorical variables return the count of each category.

summary(data)
##           hotel      is_canceled   lead_time     arrival_date_year
##  City Hotel  :3363   0:3155      Min.   :  0.0   2015: 913        
##  Resort Hotel:1637   1:1845      1st Qu.: 19.0   2016:2365        
##                                  Median : 69.0   2017:1722        
##                                  Mean   :105.4                    
##                                  3rd Qu.:162.0                    
##                                  Max.   :626.0                    
##                                                                   
##  arrival_date_month stays_in_weekend_nights stays_in_week_nights
##  August   : 606     Min.   : 0.0000         Min.   : 0.000      
##  July     : 538     1st Qu.: 0.0000         1st Qu.: 1.000      
##  May      : 483     Median : 1.0000         Median : 2.000      
##  June     : 475     Mean   : 0.9312         Mean   : 2.502      
##  September: 457     3rd Qu.: 2.0000         3rd Qu.: 3.000      
##  October  : 445     Max.   :16.0000         Max.   :40.000      
##  (Other)  :1996                                                 
##      adults         children         babies              meal     
##  Min.   :0.000   Min.   :0.000   Min.   :0.0000   BB       :3881  
##  1st Qu.:2.000   1st Qu.:0.000   1st Qu.:0.0000   FB       :  28  
##  Median :2.000   Median :0.000   Median :0.0000   HB       : 603  
##  Mean   :1.865   Mean   :0.111   Mean   :0.0056   SC       : 433  
##  3rd Qu.:2.000   3rd Qu.:0.000   3rd Qu.:0.0000   Undefined:  55  
##  Max.   :5.000   Max.   :3.000   Max.   :1.0000                   
##                                                                   
##        market_segment distribution_channel reserved_room_type
##  Aviation     :   9   Corporate: 279       A      :3590      
##  Complementary:  31   Direct   : 594       D      : 802      
##  Corporate    : 205   GDS      :   9       E      : 295      
##  Direct       : 515   TA/TO    :4118       F      : 109      
##  Groups       : 862                        G      :  86      
##  Offline TA/TO: 964                        B      :  52      
##  Online TA    :2414                        (Other):  66      
##  assigned_room_type booking_changes     deposit_type          customer_type 
##  A      :3131       Min.   :0.000   No Deposit:4402   Contract       : 176  
##  D      :1041       1st Qu.:0.000   Non Refund: 590   Group          :  19  
##  E      : 335       Median :0.000   Refundable:   8   Transient      :3723  
##  F      : 155       Mean   :0.209                     Transient-Party:1082  
##  C      : 101       3rd Qu.:0.000                                           
##  G      : 100       Max.   :8.000                                           
##  (Other): 137                                                               
##       adr         required_car_parking_spaces total_of_special_requests
##  Min.   :  0.00   Min.   :0.0000              Min.   :0.000            
##  1st Qu.: 70.00   1st Qu.:0.0000              1st Qu.:0.000            
##  Median : 95.27   Median :0.0000              Median :0.000            
##  Mean   :102.36   Mean   :0.0616              Mean   :0.585            
##  3rd Qu.:126.00   3rd Qu.:0.0000              3rd Qu.:1.000            
##  Max.   :451.50   Max.   :8.0000              Max.   :5.000            
##                                                                        
##  reservation_status
##  Canceled :1804    
##  Check-Out:3155    
##  No-Show  :  41    
##                    
##                    
##                    
## 

At a glance, we can see that about a third of the reservations are canceled. This is visually confirmed by the pie chart below, which shows that 36.9% of reservations are canceled - slightly more than a third.

cancelcount = table(data$is_canceled)
piecent = round(100 * cancelcount / sum(cancelcount), 1)
pie(cancelcount, labels = piecent, col = c('darkblue', 'darkred'))
legend('topleft', c('Not Canceled', 'Canceled'), fill = c('darkblue', 'darkred'))

There are too many predictor variables for it to be practical to individually analyze each one, but for the sake of example, I would like to create a bar graph to assess the connection between hotel type and cancellation rates.

hoteltype = table(data$is_canceled, data$hotel)
barplot(hoteltype, xlab = 'Hotel Type', ylab = 'Number of Reservations', 
main = 'Cancelation by Hotel Type', 
beside = T, col = c('darkblue', 'darkred', 'darkblue', 'darkred'))
legend('topright', c('Not Canceled', 'Canceled'), fill = c('darkblue', 'darkred'))

As it turns out, the rate of cancelation for city hotels is very noticably higher than rate for resort hotels. Though the data and the graph do not tell us why this discrepancy exists, it is not difficult to speculate why. Booking at a city hotel is an inconvenient necessity that most people would avoid if possible, whereas booking at a resort hotel allows people to stay on vacation for longer.

The Train/Test Split

The most essential preparation for creating predictive models is to separate the dataset into training and testing subsets. The training set will be used for model calibration, whereas the testing set is set aside to evaluate model performance.

To perform this split so, I use the createDataPartition function. This results in a training set containing 80% of the rows and a testing set containing the remaining 20%.

cdp = createDataPartition(y = data[, 2], p = 0.8, list = FALSE)
train = data[cdp,]
test = data[-cdp,]

To verify that this split was performed correctly, I will use the dim function of the train and test sets. The train set should contain 4000 rows, and the test set should contain 1000 rows. Both sets should have 22 columns.

dim(train)
## [1] 4000   22
dim(test)
## [1] 1000   22

Naive Bayes

Some machine learning techniques require more preparation than others. I start with Naive Bayes because all of the necessary prepping has already been done.

The model is created with is_canceled as the dependent variable and all other variables (denoted by .) as independent variables. The model is fitted using the train set.

nbmodel = naiveBayes(is_canceled~., data = train)

Next, the model is used to emulate what might be the values in the test set.

nbpredict = predict(nbmodel, test)

How accurate is the prediction? The confusionMatrix function will tell us.

nbtable = table(nbpredict, test$is_canceled)
nbcm = confusionMatrix(nbtable)
nbcm
## Confusion Matrix and Statistics
## 
##          
## nbpredict   0   1
##         0 554   1
##         1  77 368
##                                           
##                Accuracy : 0.922           
##                  95% CI : (0.9036, 0.9379)
##     No Information Rate : 0.631           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8394          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8780          
##             Specificity : 0.9973          
##          Pos Pred Value : 0.9982          
##          Neg Pred Value : 0.8270          
##              Prevalence : 0.6310          
##          Detection Rate : 0.5540          
##    Detection Prevalence : 0.5550          
##       Balanced Accuracy : 0.9376          
##                                           
##        'Positive' Class : 0               
## 

The most important statistic in this output is Accuracy, which is rated at 0.922 (92.2%). This is to say that 92% of the predictions made by the Naive Bayes model were correct.

92% is a respectable accuracy rate, but considering the number of features and observations provided, it could be better. Perhaps another method would produce even better results?

I visualize the results using the bar graph below.

barplot(nbcm$table, xlab = 'Prediction', ylab = 'Frequency', 
main = 'Naive Bayes Results', 
beside = T, col = c('darkblue', 'darkred', 'darkred', 'darkblue'))
legend('topright', c('Correct Prediction', 'Incorrect Prediction'), fill = c('darkblue', 'darkred'))

There are far more correct predictions (blue) than incorrect ones (red).

Interestingly, the false negative rate is far higher than the false positive rate. Future iteration upon this model might seek to curb the false negative rate, since the false positive rate is negligible by comparison.

One Hot Encoding

Some machine learning methods, such as the binomial logistic regression I conduct below, require that all of their input data be numeric. This does not mean that I need to get rid of my categorical variables, but they cannot remain in their current form. My solution is to employ One Hot Encoding to convert categorical variables into numeric “dummy” variables, which are essentially Boolean.

The number of dummy variables produced for each categorical variable by this process is equal to the number of categories in that variable minus one. This is to avoid the “dummy trap” of creating more features than is really necessary and overfitting as a result.

To perform One Hot Encoding, I use the model.matrix function on the whole dataset.

data = model.matrix( ~ . , data=data)[,-1]
dim(data)
## [1] 5000   61

The number of variables has drastically increased, implying that the process was successful. I will print a few rows of the table to ensure that the encoding was done in full and not overdone.

dataframe = as.data.frame(data)
head(dataframe)
##        hotelResort Hotel is_canceled1 lead_time arrival_date_year2016
## 80785                  0            1        61                     0
## 99780                  0            0         3                     1
## 109784                 0            0        59                     0
## 105858                 0            0       147                     0
## 95556                  0            0        80                     1
## 39280                  1            0       191                     0
##        arrival_date_year2017 arrival_date_monthAugust
## 80785                      0                        0
## 99780                      0                        0
## 109784                     1                        0
## 105858                     1                        0
## 95556                      0                        1
## 39280                      1                        1
##        arrival_date_monthDecember arrival_date_monthFebruary
## 80785                           0                          0
## 99780                           0                          0
## 109784                          0                          0
## 105858                          0                          1
## 95556                           0                          0
## 39280                           0                          0
##        arrival_date_monthJanuary arrival_date_monthJuly arrival_date_monthJune
## 80785                          0                      0                      0
## 99780                          0                      0                      0
## 109784                         0                      0                      0
## 105858                         0                      0                      0
## 95556                          0                      0                      0
## 39280                          0                      0                      0
##        arrival_date_monthMarch arrival_date_monthMay arrival_date_monthNovember
## 80785                        0                     0                          1
## 99780                        0                     0                          0
## 109784                       0                     0                          0
## 105858                       0                     0                          0
## 95556                        0                     0                          0
## 39280                        0                     0                          0
##        arrival_date_monthOctober arrival_date_monthSeptember
## 80785                          0                           0
## 99780                          1                           0
## 109784                         0                           0
## 105858                         0                           0
## 95556                          0                           0
## 39280                          0                           0
##        stays_in_weekend_nights stays_in_week_nights adults children babies
## 80785                        1                    3      2        0      0
## 99780                        2                    0      2        0      0
## 109784                       1                    0      2        0      0
## 105858                       1                    3      2        0      0
## 95556                        0                    3      2        0      0
## 39280                        0                    4      2        0      0
##        mealFB mealHB mealSC mealUndefined market_segmentComplementary
## 80785       0      0      0             0                           0
## 99780       0      0      0             0                           0
## 109784      0      0      1             0                           0
## 105858      0      0      0             0                           0
## 95556       0      0      0             0                           0
## 39280       0      0      0             0                           0
##        market_segmentCorporate market_segmentDirect market_segmentGroups
## 80785                        0                    0                    1
## 99780                        0                    0                    0
## 109784                       0                    0                    0
## 105858                       0                    0                    0
## 95556                        0                    1                    0
## 39280                        0                    1                    0
##        market_segmentOffline TA/TO market_segmentOnline TA
## 80785                            0                       0
## 99780                            0                       1
## 109784                           0                       1
## 105858                           1                       0
## 95556                            0                       0
## 39280                            0                       0
##        distribution_channelDirect distribution_channelGDS
## 80785                           0                       0
## 99780                           0                       0
## 109784                          0                       0
## 105858                          0                       0
## 95556                           1                       0
## 39280                           1                       0
##        distribution_channelTA/TO reserved_room_typeB reserved_room_typeC
## 80785                          1                   0                   0
## 99780                          1                   0                   0
## 109784                         1                   0                   0
## 105858                         1                   0                   0
## 95556                          0                   0                   0
## 39280                          0                   0                   0
##        reserved_room_typeD reserved_room_typeE reserved_room_typeF
## 80785                    0                   0                   0
## 99780                    0                   0                   0
## 109784                   0                   0                   0
## 105858                   0                   0                   0
## 95556                    0                   1                   0
## 39280                    0                   0                   0
##        reserved_room_typeG reserved_room_typeH assigned_room_typeB
## 80785                    1                   0                   0
## 99780                    0                   0                   0
## 109784                   0                   0                   0
## 105858                   0                   0                   0
## 95556                    0                   0                   0
## 39280                    0                   0                   0
##        assigned_room_typeC assigned_room_typeD assigned_room_typeE
## 80785                    0                   0                   0
## 99780                    0                   0                   0
## 109784                   0                   0                   0
## 105858                   0                   1                   0
## 95556                    0                   0                   1
## 39280                    0                   0                   0
##        assigned_room_typeF assigned_room_typeG assigned_room_typeH
## 80785                    0                   1                   0
## 99780                    0                   0                   0
## 109784                   0                   0                   0
## 105858                   0                   0                   0
## 95556                    0                   0                   0
## 39280                    0                   0                   0
##        assigned_room_typeI assigned_room_typeK booking_changes
## 80785                    0                   0               1
## 99780                    0                   0               0
## 109784                   0                   0               0
## 105858                   0                   0               0
## 95556                    0                   0               1
## 39280                    0                   0               0
##        deposit_typeNon Refund deposit_typeRefundable customer_typeGroup
## 80785                       0                      0                  0
## 99780                       0                      0                  0
## 109784                      0                      0                  0
## 105858                      0                      0                  0
## 95556                       0                      0                  0
## 39280                       0                      0                  0
##        customer_typeTransient customer_typeTransient-Party   adr
## 80785                       0                            1  10.0
## 99780                       1                            0 120.0
## 109784                      1                            0 108.0
## 105858                      1                            0  75.0
## 95556                       1                            0 168.3
## 39280                       1                            0 162.0
##        required_car_parking_spaces total_of_special_requests
## 80785                            0                         0
## 99780                            0                         0
## 109784                           0                         0
## 105858                           0                         0
## 95556                            0                         1
## 39280                            0                         0
##        reservation_statusCheck-Out reservation_statusNo-Show
## 80785                            0                         1
## 99780                            1                         0
## 109784                           1                         0
## 105858                           1                         0
## 95556                            1                         0
## 39280                            1                         0

It can easily be observed that all categorical variables have been replaced by the minimum necessary number of numeric-form Booleans, just as intended.

Now, I need to perform a train-test split on the encoded dataset.

cdp = createDataPartition(y = data[, 2], p = 0.8, list = FALSE)
train = data[cdp,]
test = data[-cdp,]

The dim function will verify that the train/test split was carried out correctly.

dim(train)
## [1] 4000   61
dim(test)
## [1] 1000   61

Binomial Logistic Regression

Binomial logistic regression is a machine learning method that uses numeric data to predict a categorical variable with two categories - that is, a Boolean variable like is_canceled.

I create and fit a logistic regression model using the glm function.

logmodel = glm(is_canceled1 ~., data = as.data.frame(train), family = binomial(link = 'logit'))

I create a predictive model in the next code chunk. This method produces predictions that do not perfectly align as 1s and 0s, so I use an inequality to regularize it into Boolean form.

logpredict = predict(logmodel, as.data.frame(test), type = 'response')
logpredict = ifelse(logpredict > 0.5, 1, 0)

I use the confusionMatrix function to display results.

logtable = table(logpredict, test[, 2])
logcm = confusionMatrix(logtable)
logcm
## Confusion Matrix and Statistics
## 
##           
## logpredict   0   1
##          0 632   0
##          1   0 368
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9963, 1)
##     No Information Rate : 0.632      
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.000      
##             Specificity : 1.000      
##          Pos Pred Value : 1.000      
##          Neg Pred Value : 1.000      
##              Prevalence : 0.632      
##          Detection Rate : 0.632      
##    Detection Prevalence : 0.632      
##       Balanced Accuracy : 1.000      
##                                      
##        'Positive' Class : 0          
## 

Somehow, this logistic regression model is powerful enough to produce an accuracy rate of 1. Every single prediction made is correct.

I will end this presentation with a bar graph similar to the one used to visualize the Naive Bayes results.

barplot(logcm$table, xlab = 'Prediction', ylab = 'Frequency', 
main = 'Logistic Regression Results', 
beside = T, col = c('darkblue', 'darkred', 'darkred', 'darkblue'))
legend('topright', c('Correct Prediction', 'Incorrect Prediction'), fill = c('darkblue', 'darkred'))