This project uses the Hotel Bookings dataset published by Jesse Mostipak on Kaggle. Link: https://www.kaggle.com/datasets/jessemostipak/hotel-booking-demand
The dataset records detailed information on more than 100,000 hotel bookings. I will apply two machine learning methods and compare how effectively each predicts whether a guest will cancel their booking.
I access the hotel data using a relative file path.
data = read.csv(file = 'hotel_bookings.csv')
I load several libraries to make their functions available for later use.
library(lattice)
library(ggplot2)
library(caret)
library(e1071)
I want my random results to be reproducible. The analysis in this project assumes that the chunk below has been run to set the random seed to 428.
set.seed(428)
To ensure that later calculations are not confounded by missing data, I immediately use na.omit to remove all rows containing missing values. To confirm that this worked, I display the number of rows before and after the removal.
nrow(data)
## [1] 119390
data = na.omit(data)
nrow(data)
## [1] 119386
Apparently, there were only four rows with missing values in the entire dataset.
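For anyone curious where those NAs lived, a column-wise count on a fresh copy of the csv would reveal them; in the Kaggle release of this dataset the gaps are in the children column, though that is worth verifying on your own copy:
raw = read.csv('hotel_bookings.csv')   # re-read the raw file for inspection
na_counts = colSums(is.na(raw))        # missing values per column
na_counts[na_counts > 0]               # show only columns that have any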
To cut down on processing times, I take a random sample of 5000 rows of the dataset for study, omitting the rest.
data = data[sample(nrow(data), size = 5000),]
nrow(data)
## [1] 5000
Before anything else is done with the data, let's look at the first few rows of the table.
head(data)
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## 80785 City Hotel 1 61 2015 November
## 99780 City Hotel 0 3 2016 October
## 109784 City Hotel 0 59 2017 April
## 105858 City Hotel 0 147 2017 February
## 95556 City Hotel 0 80 2016 August
## 39280 Resort Hotel 0 191 2017 August
## arrival_date_week_number arrival_date_day_of_month
## 80785 48 23
## 99780 43 16
## 109784 16 16
## 105858 7 13
## 95556 34 18
## 39280 32 9
## stays_in_weekend_nights stays_in_week_nights adults children babies meal
## 80785 1 3 2 0 0 BB
## 99780 2 0 2 0 0 BB
## 109784 1 0 2 0 0 SC
## 105858 1 3 2 0 0 BB
## 95556 0 3 2 0 0 BB
## 39280 0 4 2 0 0 BB
## country market_segment distribution_channel is_repeated_guest
## 80785 PRT Groups TA/TO 0
## 99780 ESP Online TA TA/TO 0
## 109784 AUT Online TA TA/TO 0
## 105858 FRA Offline TA/TO TA/TO 0
## 95556 GBR Direct Direct 0
## 39280 CN Direct Direct 0
## previous_cancellations previous_bookings_not_canceled reserved_room_type
## 80785 0 0 G
## 99780 0 0 A
## 109784 0 0 A
## 105858 0 0 A
## 95556 0 0 E
## 39280 0 0 A
## assigned_room_type booking_changes deposit_type agent company
## 80785 G 1 No Deposit 37 NULL
## 99780 A 0 No Deposit 83 NULL
## 109784 A 0 No Deposit 9 NULL
## 105858 D 0 No Deposit 28 NULL
## 95556 E 1 No Deposit 14 NULL
## 39280 A 0 No Deposit 250 NULL
## days_in_waiting_list customer_type adr required_car_parking_spaces
## 80785 0 Transient-Party 10.0 0
## 99780 0 Transient 120.0 0
## 109784 0 Transient 108.0 0
## 105858 0 Transient 75.0 0
## 95556 0 Transient 168.3 0
## 39280 0 Transient 162.0 0
## total_of_special_requests reservation_status reservation_status_date
## 80785 0 No-Show 2015-11-23
## 99780 0 Check-Out 2016-10-18
## 109784 0 Check-Out 2017-04-17
## 105858 0 Check-Out 2017-02-17
## 95556 1 Check-Out 2016-08-21
## 39280 0 Check-Out 2017-08-13
There are 32 columns, containing a mixture of numeric and categorical variables (along with one column of dates). Notice the randomized row indices, a side effect of the earlier random sampling.
Of particular interest is the is_canceled column, second from the left. It will be treated as a Boolean variable, with 1 indicating a canceled reservation and 0 indicating a reservation that was fulfilled normally.
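At this stage the column is still stored as a plain integer, since read.csv does not know it is categorical (a quick optional check):
class(data$is_canceled)   # is_canceled arrives as 0/1 integers; it is converted to a factor below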
I wish to keep the bulk of the variables available for predictive modelling, but not all of these are necessary. To streamline things, I get rid of some of the less useful and more cumbersome ones. To do so, I replace my table with a subset of itself containing all but the variables I want to expunge.
data = subset(data, select = -c(is_repeated_guest, previous_cancellations, previous_bookings_not_canceled, agent, company, days_in_waiting_list, reservation_status_date, country, arrival_date_week_number, arrival_date_day_of_month))
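A quick count of the remaining columns confirms the removal (optional; output not shown):
ncol(data)   # 32 original columns minus the 10 dropped above = 22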
Since they are still fresh from the csv file, my categorical variables are not internally labeled as factors, instead defaulting to the less useful character type. The next code chunk converts each of them to a factor.
Once these conversions are performed, the table will be printed again.
cols_to_factor = c('hotel', 'is_canceled', 'arrival_date_year', 'arrival_date_month',
                   'meal', 'market_segment', 'distribution_channel', 'reserved_room_type',
                   'assigned_room_type', 'deposit_type', 'customer_type', 'reservation_status')
data[cols_to_factor] = lapply(data[cols_to_factor], factor)
head(data)
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## 80785 City Hotel 1 61 2015 November
## 99780 City Hotel 0 3 2016 October
## 109784 City Hotel 0 59 2017 April
## 105858 City Hotel 0 147 2017 February
## 95556 City Hotel 0 80 2016 August
## 39280 Resort Hotel 0 191 2017 August
## stays_in_weekend_nights stays_in_week_nights adults children babies meal
## 80785 1 3 2 0 0 BB
## 99780 2 0 2 0 0 BB
## 109784 1 0 2 0 0 SC
## 105858 1 3 2 0 0 BB
## 95556 0 3 2 0 0 BB
## 39280 0 4 2 0 0 BB
## market_segment distribution_channel reserved_room_type
## 80785 Groups TA/TO G
## 99780 Online TA TA/TO A
## 109784 Online TA TA/TO A
## 105858 Offline TA/TO TA/TO A
## 95556 Direct Direct E
## 39280 Direct Direct A
## assigned_room_type booking_changes deposit_type customer_type adr
## 80785 G 1 No Deposit Transient-Party 10.0
## 99780 A 0 No Deposit Transient 120.0
## 109784 A 0 No Deposit Transient 108.0
## 105858 D 0 No Deposit Transient 75.0
## 95556 E 1 No Deposit Transient 168.3
## 39280 A 0 No Deposit Transient 162.0
## required_car_parking_spaces total_of_special_requests reservation_status
## 80785 0 0 No-Show
## 99780 0 0 Check-Out
## 109784 0 0 Check-Out
## 105858 0 0 Check-Out
## 95556 0 1 Check-Out
## 39280 0 0 Check-Out
Simply seeing the data table doesn’t tell us enough about the data. To better understand the underlying trends, I will use the summary function to provide some quick facts on all 22 variables.
Numeric variables display simple summary statistics, whereas categorical variables return the count of each category.
summary(data)
## hotel is_canceled lead_time arrival_date_year
## City Hotel :3363 0:3155 Min. : 0.0 2015: 913
## Resort Hotel:1637 1:1845 1st Qu.: 19.0 2016:2365
## Median : 69.0 2017:1722
## Mean :105.4
## 3rd Qu.:162.0
## Max. :626.0
##
## arrival_date_month stays_in_weekend_nights stays_in_week_nights
## August : 606 Min. : 0.0000 Min. : 0.000
## July : 538 1st Qu.: 0.0000 1st Qu.: 1.000
## May : 483 Median : 1.0000 Median : 2.000
## June : 475 Mean : 0.9312 Mean : 2.502
## September: 457 3rd Qu.: 2.0000 3rd Qu.: 3.000
## October : 445 Max. :16.0000 Max. :40.000
## (Other) :1996
## adults children babies meal
## Min. :0.000 Min. :0.000 Min. :0.0000 BB :3881
## 1st Qu.:2.000 1st Qu.:0.000 1st Qu.:0.0000 FB : 28
## Median :2.000 Median :0.000 Median :0.0000 HB : 603
## Mean :1.865 Mean :0.111 Mean :0.0056 SC : 433
## 3rd Qu.:2.000 3rd Qu.:0.000 3rd Qu.:0.0000 Undefined: 55
## Max. :5.000 Max. :3.000 Max. :1.0000
##
## market_segment distribution_channel reserved_room_type
## Aviation : 9 Corporate: 279 A :3590
## Complementary: 31 Direct : 594 D : 802
## Corporate : 205 GDS : 9 E : 295
## Direct : 515 TA/TO :4118 F : 109
## Groups : 862 G : 86
## Offline TA/TO: 964 B : 52
## Online TA :2414 (Other): 66
## assigned_room_type booking_changes deposit_type customer_type
## A :3131 Min. :0.000 No Deposit:4402 Contract : 176
## D :1041 1st Qu.:0.000 Non Refund: 590 Group : 19
## E : 335 Median :0.000 Refundable: 8 Transient :3723
## F : 155 Mean :0.209 Transient-Party:1082
## C : 101 3rd Qu.:0.000
## G : 100 Max. :8.000
## (Other): 137
## adr required_car_parking_spaces total_of_special_requests
## Min. : 0.00 Min. :0.0000 Min. :0.000
## 1st Qu.: 70.00 1st Qu.:0.0000 1st Qu.:0.000
## Median : 95.27 Median :0.0000 Median :0.000
## Mean :102.36 Mean :0.0616 Mean :0.585
## 3rd Qu.:126.00 3rd Qu.:0.0000 3rd Qu.:1.000
## Max. :451.50 Max. :8.0000 Max. :5.000
##
## reservation_status
## Canceled :1804
## Check-Out:3155
## No-Show : 41
##
##
##
##
At a glance, we can see that about a third of the reservations are canceled. This is visually confirmed by the pie chart below, which shows that 36.9% of reservations are canceled - slightly more than a third.
cancelcount = table(data$is_canceled)
piecent = round(100 * cancelcount / sum(cancelcount), 1)
pie(cancelcount, labels = paste0(piecent, '%'), col = c('darkblue', 'darkred'))
legend('topleft', c('Not Canceled', 'Canceled'), fill = c('darkblue', 'darkred'))
There are too many predictor variables to analyze each one individually, but for the sake of example, I create a bar graph to assess the connection between hotel type and cancellation rate.
hoteltype = table(data$is_canceled, data$hotel)
barplot(hoteltype, xlab = 'Hotel Type', ylab = 'Number of Reservations',
main = 'Cancellation by Hotel Type',
beside = T, col = c('darkblue', 'darkred', 'darkblue', 'darkred'))
legend('topright', c('Not Canceled', 'Canceled'), fill = c('darkblue', 'darkred'))
As it turns out, the cancellation rate for city hotels is noticeably higher than the rate for resort hotels. The data and the graph do not tell us why this discrepancy exists, but it is not difficult to speculate: a city hotel booking is often a necessity (a business trip, a stopover) that gets abandoned when plans change, whereas a resort booking is the vacation itself, which guests are more reluctant to give up.
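To quantify the gap, the same contingency table can be converted to within-hotel proportions (a quick optional check; output omitted here):
# Share of fulfilled (row '0') vs. canceled (row '1') within each hotel type
round(prop.table(hoteltype, margin = 2), 3)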
The most essential preparation for creating predictive models is to separate the dataset into training and testing subsets. The training set will be used for model calibration, whereas the testing set is set aside to evaluate model performance.
To perform this split, I use the createDataPartition function from the caret package, which samples in a way that preserves the class balance of the outcome. This results in a training set containing 80% of the rows and a testing set containing the remaining 20%.
cdp = createDataPartition(y = data[, 2], p = 0.8, list = FALSE)
train = data[cdp,]
test = data[-cdp,]
To verify that this split was performed correctly, I will use the dim function on the train and test sets. The train set should contain 4000 rows, and the test set should contain 1000 rows. Both sets should have 22 columns.
dim(train)
## [1] 4000 22
dim(test)
## [1] 1000 22
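Because createDataPartition stratifies on the outcome variable, the cancellation ratio should be nearly identical in both subsets. A quick sanity check (output not shown):
# Both should show roughly 63% not-canceled / 37% canceled
prop.table(table(train$is_canceled))
prop.table(table(test$is_canceled))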
Some machine learning techniques require more preparation than others. I start with Naive Bayes because all of the necessary preparation has already been done.
The model is created with is_canceled as the dependent variable and all other variables (denoted by .) as independent variables. The model is fitted using the train set.
nbmodel = naiveBayes(is_canceled~., data = train)
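The fitted object stores the class priors and the per-feature conditional distributions, which can be inspected directly (an optional peek at the fitted object):
# Class counts used for the priors, and the conditional mean/sd of
# lead_time within each class of is_canceled
nbmodel$apriori
nbmodel$tables$lead_time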
Next, the model is used to predict the is_canceled value for each row of the test set.
nbpredict = predict(nbmodel, test)
How accurate is the prediction? The confusionMatrix function will tell us.
nbtable = table(nbpredict, test$is_canceled)
nbcm = confusionMatrix(nbtable)
nbcm
## Confusion Matrix and Statistics
##
##
## nbpredict 0 1
## 0 554 1
## 1 77 368
##
## Accuracy : 0.922
## 95% CI : (0.9036, 0.9379)
## No Information Rate : 0.631
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8394
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8780
## Specificity : 0.9973
## Pos Pred Value : 0.9982
## Neg Pred Value : 0.8270
## Prevalence : 0.6310
## Detection Rate : 0.5540
## Detection Prevalence : 0.5550
## Balanced Accuracy : 0.9376
##
## 'Positive' Class : 0
##
The most important statistic in this output is Accuracy, which stands at 0.922: 92.2% of the predictions made by the Naive Bayes model were correct.
That is a respectable accuracy rate, but given the number of features and observations available, it could be better. Perhaps another method will produce stronger results?
I visualize the results using the bar graph below.
barplot(nbcm$table, xlab = 'Prediction', ylab = 'Frequency',
main = 'Naive Bayes Results',
beside = T, col = c('darkblue', 'darkred', 'darkred', 'darkblue'))
legend('topright', c('Correct Prediction', 'Incorrect Prediction'), fill = c('darkblue', 'darkred'))
There are far more correct predictions (blue) than incorrect ones (red).
Interestingly, with 0 (not canceled) designated as the positive class, the false negative count (77) dwarfs the false positive count (1). Future iteration on this model might focus on curbing the false negatives, since the false positives are negligible by comparison.
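Both rates can be computed directly from the stored confusion matrix rather than read off the printout (a small optional calculation):
# False negative rate: actual 0s predicted as 1, out of all actual 0s
nbcm$table[2, 1] / sum(nbcm$table[, 1])
# False positive rate: actual 1s predicted as 0, out of all actual 1s
nbcm$table[1, 2] / sum(nbcm$table[, 2])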
Some machine learning methods, such as the binomial logistic regression conducted below, require their inputs to be numeric at the mathematical level. This does not mean that I need to get rid of my categorical variables, but they cannot remain in their current form. My solution is to employ One Hot Encoding to convert each categorical variable into a set of numeric "dummy" variables, which are essentially Booleans. (R's glm would actually perform this encoding automatically behind the scenes, but doing it explicitly makes the design matrix visible.)
The number of dummy variables produced for each categorical variable equals its number of categories minus one. Dropping one reference level avoids the "dummy variable trap": a full set of dummies plus the intercept would be perfectly collinear, leaving the regression coefficients unidentifiable.
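As a minimal sketch of what this produces (a toy example on made-up data, separate from the hotel dataset):
# A three-level factor becomes two 0/1 columns; the dropped baseline
# level ('low') corresponds to both dummies being 0
toy = data.frame(size = factor(c('low', 'mid', 'high', 'mid'),
                               levels = c('low', 'mid', 'high')))
model.matrix(~ size, data = toy)[, -1]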
To perform One Hot Encoding, I use the model.matrix function on the whole dataset.
data = model.matrix( ~ . , data=data)[,-1]
dim(data)
## [1] 5000 61
The number of variables has jumped from 22 to 61, which suggests the encoding took effect. I will print a few rows of the table to confirm that every categorical variable was encoded, and none redundantly.
dataframe = as.data.frame(data)
head(dataframe)
## hotelResort Hotel is_canceled1 lead_time arrival_date_year2016
## 80785 0 1 61 0
## 99780 0 0 3 1
## 109784 0 0 59 0
## 105858 0 0 147 0
## 95556 0 0 80 1
## 39280 1 0 191 0
## arrival_date_year2017 arrival_date_monthAugust
## 80785 0 0
## 99780 0 0
## 109784 1 0
## 105858 1 0
## 95556 0 1
## 39280 1 1
## arrival_date_monthDecember arrival_date_monthFebruary
## 80785 0 0
## 99780 0 0
## 109784 0 0
## 105858 0 1
## 95556 0 0
## 39280 0 0
## arrival_date_monthJanuary arrival_date_monthJuly arrival_date_monthJune
## 80785 0 0 0
## 99780 0 0 0
## 109784 0 0 0
## 105858 0 0 0
## 95556 0 0 0
## 39280 0 0 0
## arrival_date_monthMarch arrival_date_monthMay arrival_date_monthNovember
## 80785 0 0 1
## 99780 0 0 0
## 109784 0 0 0
## 105858 0 0 0
## 95556 0 0 0
## 39280 0 0 0
## arrival_date_monthOctober arrival_date_monthSeptember
## 80785 0 0
## 99780 1 0
## 109784 0 0
## 105858 0 0
## 95556 0 0
## 39280 0 0
## stays_in_weekend_nights stays_in_week_nights adults children babies
## 80785 1 3 2 0 0
## 99780 2 0 2 0 0
## 109784 1 0 2 0 0
## 105858 1 3 2 0 0
## 95556 0 3 2 0 0
## 39280 0 4 2 0 0
## mealFB mealHB mealSC mealUndefined market_segmentComplementary
## 80785 0 0 0 0 0
## 99780 0 0 0 0 0
## 109784 0 0 1 0 0
## 105858 0 0 0 0 0
## 95556 0 0 0 0 0
## 39280 0 0 0 0 0
## market_segmentCorporate market_segmentDirect market_segmentGroups
## 80785 0 0 1
## 99780 0 0 0
## 109784 0 0 0
## 105858 0 0 0
## 95556 0 1 0
## 39280 0 1 0
## market_segmentOffline TA/TO market_segmentOnline TA
## 80785 0 0
## 99780 0 1
## 109784 0 1
## 105858 1 0
## 95556 0 0
## 39280 0 0
## distribution_channelDirect distribution_channelGDS
## 80785 0 0
## 99780 0 0
## 109784 0 0
## 105858 0 0
## 95556 1 0
## 39280 1 0
## distribution_channelTA/TO reserved_room_typeB reserved_room_typeC
## 80785 1 0 0
## 99780 1 0 0
## 109784 1 0 0
## 105858 1 0 0
## 95556 0 0 0
## 39280 0 0 0
## reserved_room_typeD reserved_room_typeE reserved_room_typeF
## 80785 0 0 0
## 99780 0 0 0
## 109784 0 0 0
## 105858 0 0 0
## 95556 0 1 0
## 39280 0 0 0
## reserved_room_typeG reserved_room_typeH assigned_room_typeB
## 80785 1 0 0
## 99780 0 0 0
## 109784 0 0 0
## 105858 0 0 0
## 95556 0 0 0
## 39280 0 0 0
## assigned_room_typeC assigned_room_typeD assigned_room_typeE
## 80785 0 0 0
## 99780 0 0 0
## 109784 0 0 0
## 105858 0 1 0
## 95556 0 0 1
## 39280 0 0 0
## assigned_room_typeF assigned_room_typeG assigned_room_typeH
## 80785 0 1 0
## 99780 0 0 0
## 109784 0 0 0
## 105858 0 0 0
## 95556 0 0 0
## 39280 0 0 0
## assigned_room_typeI assigned_room_typeK booking_changes
## 80785 0 0 1
## 99780 0 0 0
## 109784 0 0 0
## 105858 0 0 0
## 95556 0 0 1
## 39280 0 0 0
## deposit_typeNon Refund deposit_typeRefundable customer_typeGroup
## 80785 0 0 0
## 99780 0 0 0
## 109784 0 0 0
## 105858 0 0 0
## 95556 0 0 0
## 39280 0 0 0
## customer_typeTransient customer_typeTransient-Party adr
## 80785 0 1 10.0
## 99780 1 0 120.0
## 109784 1 0 108.0
## 105858 1 0 75.0
## 95556 1 0 168.3
## 39280 1 0 162.0
## required_car_parking_spaces total_of_special_requests
## 80785 0 0
## 99780 0 0
## 109784 0 0
## 105858 0 0
## 95556 0 1
## 39280 0 0
## reservation_statusCheck-Out reservation_statusNo-Show
## 80785 0 1
## 99780 1 0
## 109784 1 0
## 105858 1 0
## 95556 1 0
## 39280 1 0
It can easily be observed that every categorical variable has been replaced by the minimum necessary number of 0/1 dummy columns, just as intended.
Now, I need to perform a train-test split on the encoded dataset.
cdp = createDataPartition(y = data[, 2], p = 0.8, list = FALSE)
train = data[cdp,]
test = data[-cdp,]
The dim function will verify that the train/test split was carried out correctly.
dim(train)
## [1] 4000 61
dim(test)
## [1] 1000 61
Binomial logistic regression is a machine learning method that uses numeric data to predict a categorical variable with two categories - that is, a Boolean variable like is_canceled.
I create and fit a logistic regression model using the glm function.
logmodel = glm(is_canceled1 ~., data = as.data.frame(train), family = binomial(link = 'logit'))
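With the model fitted, its coefficients can be ranked by magnitude to see which features carry the most weight (an optional inspection; note that glm may warn here about fitted probabilities of 0 or 1, a hint of the issue discussed at the end):
# Largest coefficients by absolute value; aliased (NA) terms are dropped by sort
head(sort(abs(coef(logmodel)), decreasing = TRUE), 10)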
I generate predictions in the next code chunk. The raw outputs of this method are probabilities rather than clean 1s and 0s, so I threshold them at 0.5 to convert them into Boolean form.
logpredict = predict(logmodel, as.data.frame(test), type = 'response')
logpredict = ifelse(logpredict > 0.5, 1, 0)
I use the confusionMatrix function to display results.
logtable = table(logpredict, test[, 2])
logcm = confusionMatrix(logtable)
logcm
## Confusion Matrix and Statistics
##
##
## logpredict 0 1
## 0 632 0
## 1 0 368
##
## Accuracy : 1
## 95% CI : (0.9963, 1)
## No Information Rate : 0.632
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.000
## Specificity : 1.000
## Pos Pred Value : 1.000
## Neg Pred Value : 1.000
## Prevalence : 0.632
## Detection Rate : 0.632
## Detection Prevalence : 0.632
## Balanced Accuracy : 1.000
##
## 'Positive' Class : 0
##
This logistic regression model produces an accuracy of 1: every single prediction is correct. A perfect score should invite suspicion rather than celebration. The likely explanation is target leakage through the reservation_status dummies: in the earlier summary, Canceled (1804) plus No-Show (41) accounts for exactly the 1845 canceled reservations, so reservation_status is effectively a relabeling of is_canceled, and the model can read the answer directly off its inputs. Since this column would not be known at the time a booking is made, the comparison with Naive Bayes should be interpreted with that caveat in mind.
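As a hedged follow-up (not run in the original analysis), refitting without the leaking columns would give a more honest accuracy estimate:
# Locate the reservation_status dummy columns in the encoded matrix
leak = grep('^reservation_status', colnames(train))
# Refit and re-evaluate without them; accuracy should fall well below 1
logmodel2 = glm(is_canceled1 ~ ., data = as.data.frame(train[, -leak]),
                family = binomial(link = 'logit'))
logpredict2 = ifelse(predict(logmodel2, as.data.frame(test[, -leak]),
                             type = 'response') > 0.5, 1, 0)
confusionMatrix(table(logpredict2, test[, 2]))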
I will end this presentation with a bar graph similar to the one used to visualize the Naive Bayes results.
barplot(logcm$table, xlab = 'Prediction', ylab = 'Frequency',
main = 'Logistic Regression Results',
beside = T, col = c('darkblue', 'darkred', 'darkred', 'darkblue'))
legend('topright', c('Correct Prediction', 'Incorrect Prediction'), fill = c('darkblue', 'darkred'))