Hotel Cancellation Forecasting

Synopsis

As one of the most important components on our travelling agenda, securing a hotel room in a strange city or even country ahead of our travelling seem to be one of the first things to hit on our to-do list. Thanks to the technologies and various hotel booking platforms, travelers nowadays are able to book or cancel hotel reservations in just a blink of an eye. Along with the convinience of making hotel reservations, the cancellation of a room reservation has also become much easier. As a result, the potential loss due to hotel cancellation might post additional cost to hotel management in general. In other words, to have a good sense about how cancellation activities look like has become increasingly important for hotels for their daily management as well as long-term business decision making. For this final project, I used the data collected by hotel booking platforms regarding different attributes of the reservation to better understand hotel cancellation in general, then try to answer questions in related to hotel cancellation such as what factors are closely associated with hotel cancellation, and how can we better predict hotel cancellation. With these answers and information, the hotel management team would be better informed with the following aspects:

  • How does cancellation look like in general (e.g., cancellation rate, cancellation distribution, etc.)?
  • What factors are more closely related to cancellation (e.g., attributes of the customers, timing of the booking, room characteristics, booking activities, booking avenues, etc.)?
  • How to better predict cancellation based on other known factors? And
  • actions to minimize potential cancellations.

To answer these questions, I will

  1. clean the dataset (e.g., check for variable names, types, missing values, and duplicates; fix any probelms to make the dataset tidy)
  2. perform a diagnostic analysis on the dataset to visualize distributions, detect any invalid values and/or outliers
  3. perform an exploratory data analysis to have a better understanding about the relationships among variables (e.g., check data/variable distributions, check data by group, x-y plots, scatterplots, etc.)
  4. conduct regressional analyses to build a predicting model for cancellation

Package Required

The main R package used for this project is tidyverse. The core functions inbeded in tidyverse include ggplot2, tidyr, dplyr etc. Some non-core functions of the tidyverse package were also loaded.

library(tidyverse)
library(readxl)
library(lubridate)
library(magrittr)
library(glue)
library(tidyverse)

Data Set & Variables

The hotels dataset used for this project was downloaded from the BANA 7025 Course Website. A total number of 32 varibales are included in the hotel dataset, most of which are categorical variables. Detailed variable name, type, and descriptions are listed below.

  1. hotel character Hotel (H1 = Resort Hotel or H2 = City Hotel)

  2. is_canceled double Value indicating if the booking was canceled (1) or not (0)

  3. lead_time double Number of days that elapsed between the entering date of the booking into the PMS and the arrival date

  4. arrival_date_year double Year of arrival date

  5. arrival_date_month character Month of arrival date

  6. arrival_date_week_number double Week number of year for arrival date

  7. arrival_date_day_of_month double Day of arrival date

  8. stays_in_weekend_nights double Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel

  9. stays_in_week_nights double Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel

  10. adults double Number of adults

  11. children double Number of children

  12. babies double Number of babies

  13. meal character Type of meal booked. Categories are presented in standard hospitality meal packages:
  • Undefined/SC – no meal package;
  • BB – Bed & Breakfast;
  • HB – Half board (breakfast and one other meal – usually dinner);
  • FB – Full board (breakfast, lunch and dinner)
  1. country character Country of origin. Categories are represented in the ISO 3155–3:2013 format

  2. market_segment character Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”

  3. distribution_channel character Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”

  4. is_repeated_guest double Value indicating if the booking name was from a repeated guest (1) or not (0)

  5. previous_cancellations double Number of previous bookings that were cancelled by the customer prior to the current booking

  6. previous_bookings_not_canceled double Number of previous bookings not cancelled by the customer prior to the current booking

  7. reserved_room_type character Code of room type reserved. Code is presented instead of designation for anonymity reasons

  8. assigned_room_type character Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons

  9. booking_changes double Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation

  10. deposit_type character Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories:
  • No Deposit – no deposit was made;
  • Non Refund – a deposit was made in the value of the total stay cost;
  • Refundable – a deposit was made with a value under the total cost of stay.
  1. agent character ID of the travel agency that made the booking

  2. company character ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons

  3. days_in_waiting_list double Number of days the booking was in the waiting list before it was confirmed to the customer

  4. customer_type character Type of booking, assuming one of four categories:
  • Contract - when the booking has an allotment or other type of contract associated to it;
  • Group – when the booking is associated to a group;
  • Transient – when the booking is not part of a group or contract, and is not associated to other transient booking;
  • Transient-party – when the booking is transient, but is associated to at least other transient booking
  1. adr double Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights

  2. required_car_parking_spaces double Number of car parking spaces required by the customer

  3. total_of_special_requests double Number of special requests made by the customer (e.g. twin bed or high floor)

  4. reservation_status character Reservation last status, assuming one of three categories:
  • Canceled – booking was canceled by the customer;
  • Check-Out – customer has checked in but already departed;
  • No-Show – customer did not check-in and did inform the hotel of the reason why
  1. reservation_status_date double Date at which the last status was set. This variable can be used in conjunction with the ReservationStatus to understand when was the booking canceled or when did the customer checked-out of the hotel

As you can see, it is a rather long variable list, and it might be not easy to digest. To have a better idea about the variables, I grouped them into four general categories:

  1. Customer Attributes

adults, children, babies, country, is_repeated_guest, previous_cancellations, previous_bookings_not_cancelled, agent, company, customer_type, and total_of_special_requests

  1. Hotel Attributes

hotel, meal, reserved_room_type, assigned_room_type, adr, and required_car_parking_spaces

  1. Booking Attributes

lead_time, arrival_date_year, arrival_date_month, arrival_date_week_number, arrival_date_day_of_month, stays_in_week_nights, stays_in_weekend_nights, market_segment, distribution_channel, booking_changes, deposit_type, days_in_waiting_list, and reservation_status_date

  1. Outcome Variables

is_canceled, and reservation_status

Data Preparation

Glimpse of Data

Now let's take a closer look at the dataset.

#load dataset.
hotels <- read.csv("hotels.csv", header = TRUE)
str(hotels)
## 'data.frame':    119390 obs. of  32 variables:
##  $ hotel                         : chr  "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
##  $ is_canceled                   : int  0 0 0 0 0 0 0 0 1 1 ...
##  $ lead_time                     : int  342 737 7 13 14 14 0 9 85 75 ...
##  $ arrival_date_year             : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ arrival_date_month            : chr  "July" "July" "July" "July" ...
##  $ arrival_date_week_number      : int  27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : int  0 0 1 1 2 2 2 2 3 3 ...
##  $ adults                        : int  2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ meal                          : chr  "BB" "BB" "BB" "BB" ...
##  $ country                       : chr  "PRT" "PRT" "GBR" "GBR" ...
##  $ market_segment                : chr  "Direct" "Direct" "Direct" "Corporate" ...
##  $ distribution_channel          : chr  "Direct" "Direct" "Direct" "Corporate" ...
##  $ is_repeated_guest             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reserved_room_type            : chr  "C" "C" "A" "A" ...
##  $ assigned_room_type            : chr  "C" "C" "C" "A" ...
##  $ booking_changes               : int  3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : chr  "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
##  $ agent                         : chr  "NULL" "NULL" "NULL" "304" ...
##  $ company                       : chr  "NULL" "NULL" "NULL" "NULL" ...
##  $ days_in_waiting_list          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : chr  "Transient" "Transient" "Transient" "Transient" ...
##  $ adr                           : num  0 0 75 75 98 ...
##  $ required_car_parking_spaces   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : int  0 0 0 0 1 1 0 1 1 0 ...
##  $ reservation_status            : chr  "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
##  $ reservation_status_date       : chr  "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...
head(hotels, 5)
##          hotel is_canceled lead_time arrival_date_year arrival_date_month
## 1 Resort Hotel           0       342              2015               July
## 2 Resort Hotel           0       737              2015               July
## 3 Resort Hotel           0         7              2015               July
## 4 Resort Hotel           0        13              2015               July
## 5 Resort Hotel           0        14              2015               July
##   arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
## 1                       27                         1                       0
## 2                       27                         1                       0
## 3                       27                         1                       0
## 4                       27                         1                       0
## 5                       27                         1                       0
##   stays_in_week_nights adults children babies meal country market_segment
## 1                    0      2        0      0   BB     PRT         Direct
## 2                    0      2        0      0   BB     PRT         Direct
## 3                    1      1        0      0   BB     GBR         Direct
## 4                    1      1        0      0   BB     GBR      Corporate
## 5                    2      2        0      0   BB     GBR      Online TA
##   distribution_channel is_repeated_guest previous_cancellations
## 1               Direct                 0                      0
## 2               Direct                 0                      0
## 3               Direct                 0                      0
## 4            Corporate                 0                      0
## 5                TA/TO                 0                      0
##   previous_bookings_not_canceled reserved_room_type assigned_room_type
## 1                              0                  C                  C
## 2                              0                  C                  C
## 3                              0                  A                  C
## 4                              0                  A                  A
## 5                              0                  A                  A
##   booking_changes deposit_type agent company days_in_waiting_list customer_type
## 1               3   No Deposit  NULL    NULL                    0     Transient
## 2               4   No Deposit  NULL    NULL                    0     Transient
## 3               0   No Deposit  NULL    NULL                    0     Transient
## 4               0   No Deposit   304    NULL                    0     Transient
## 5               0   No Deposit   240    NULL                    0     Transient
##   adr required_car_parking_spaces total_of_special_requests reservation_status
## 1   0                           0                         0          Check-Out
## 2   0                           0                         0          Check-Out
## 3  75                           0                         0          Check-Out
## 4  75                           0                         0          Check-Out
## 5  98                           0                         1          Check-Out
##   reservation_status_date
## 1              2015-07-01
## 2              2015-07-01
## 3              2015-07-02
## 4              2015-07-02
## 5              2015-07-03

As the output above shows, there are a total of 119390 observations and 32 variables in the dataset.

Missing Data

To get an idea about missing data in the dataset, the is.na fuction in R programming language was used as shown below.

# missing data in general
sum(is.na(hotels))
## [1] 4
# missing data broken down by variable
colSums(is.na(hotels))
##                          hotel                    is_canceled 
##                              0                              0 
##                      lead_time              arrival_date_year 
##                              0                              0 
##             arrival_date_month       arrival_date_week_number 
##                              0                              0 
##      arrival_date_day_of_month        stays_in_weekend_nights 
##                              0                              0 
##           stays_in_week_nights                         adults 
##                              0                              0 
##                       children                         babies 
##                              4                              0 
##                           meal                        country 
##                              0                              0 
##                 market_segment           distribution_channel 
##                              0                              0 
##              is_repeated_guest         previous_cancellations 
##                              0                              0 
## previous_bookings_not_canceled             reserved_room_type 
##                              0                              0 
##             assigned_room_type                booking_changes 
##                              0                              0 
##                   deposit_type                          agent 
##                              0                              0 
##                        company           days_in_waiting_list 
##                              0                              0 
##                  customer_type                            adr 
##                              0                              0 
##    required_car_parking_spaces      total_of_special_requests 
##                              0                              0 
##             reservation_status        reservation_status_date 
##                              0                              0

The output above indicates that the hotels dataset is a fairly complete dataset, with only 4 cases of missing data for children. To further check for missing values that might be labeled by other expressions in the dataset (e.g., -99, 99, NULL, etc.), unique values for each variable were extracted using the R fuction table(data$variable). Due to the massive amount of values and observations, the output of this step will not be shown. Instead, the results will be synthesized in the form of narrative below.

table(hotels$is_canceled)
table(hotels$lead_time)
table(hotels$arrival_date_year)
table(hotels$arrival_date_month)
table(hotels$arrival_date_week_number)
table(hotels$arrival_date_day_of_month)
table(hotels$stays_in_weekend_nights)
table(hotels$stays_in_week_nights)
table(hotels$adults)
table(hotels$children)
table(hotels$babies)
table(hotels$meal)
table(hotels$country)
table(hotels$market_segment)
table(hotels$distribution_channel)
table(hotels$is_repeated_guest)
table(hotels$previous_cancellations)
table(hotels$previous_bookings_not_canceled)
table(hotels$booking_changes)
table(hotels$deposit_type)
table(hotels$agent)
table(hotels$company)
table(hotels$days_in_waiting_list)
table(hotels$customer_type)
table(hotels$adr)
table(hotels$required_car_parking_spaces)
table(hotels$total_of_special_requests)
table(hotels$reservation_status)
table(hotels$reservation_status_date)

Now let's take a closer look at two variavles agent and company which seem to have more NULL values than the others.

table(hotels$agent)
table(hotels$company)

The two lines of code above showed us that for variable agent and company, there were a large amount of cases being labeled as "NULL" (n=16340, 112593 respectively), which indicated missing data . (output now shown due to its large amount) To replace the "NULL" values with NA for variable agent and company, the codes below was used.

hotels$agent[hotels$agent %in% c("NULL")] <- "NA"
hotels$company[hotels$company %in% c("NULL")] <- "NA"
# QA
table(hotels$agent)
table(hotels$company)

Check Duplicates

Dupicate rows can be problematic and cause issues for data analysis and interpretation. Therefore in this step, I first identified duplicate rows (if any), and then removed them. The code and output below demonstrate this step.

# check the total number of duplicates in the dataset
sum(duplicated(hotels))
## [1] 31994
# remove duplicates
hotels_no_dup <- hotels[!duplicated(hotels), ]
# QA
sum(duplicated(hotels_no_dup))
## [1] 0

As the output above shows, there were 31994 duplicate rows in the hotel dataset. After removing the duplicate rows, the new dataset was saved as hotels_no_dup, and no more duplicate rows remianed in the new dataset.

Summary Statistics

Of the 32 variables, variable lead_timeand adr are more understandable as numeric, so summary statistics were computed as shown below. Other categorical variables were demonstrated in the form of barplots.

Numerical Variables

summary(hotels_no_dup$lead_time)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   11.00   49.00   79.89  125.00  737.00
summary(hotels_no_dup$adr)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -6.38   72.00   98.10  106.34  134.00 5400.00

The minimum value for variable adr seems to be invalid as the averaged hotel rate should not be negative in most situations. Examined the dataset again (with the table(dataset$variable function), there was one value of "-6.38" in the dataset, along with a number of cases with very low adr values such as "0", "0.26" and "0.5" etc. The existance of various low adr values suggests that the case of "-6.38" might not be a misentering or invalid data. Instead, there must be a systematic reason for those low adr values such as gift cards or credits being redeamed or hotel compensation. In a word, without conforming with the client first, all those low values in adr are still considered to be valid and thus no action is needed now.

Categorical Variables

Barplot was used to visualize frequencies for the categorical variables. The code and output are shown below.

barplot(table(hotels_no_dup$hotel), xlab = "Hotel Type")

barplot(table(hotels_no_dup$is_canceled), xlab = "Is the Reservation Cancelled")

barplot(table(hotels_no_dup$arrival_date_year), xlab = "Arrival Year")

barplot(table(hotels_no_dup$arrival_date_month), xlab = "Arrival Month")

barplot(table(hotels_no_dup$arrival_date_week_number), xlab = "Arrival Week") # not so useful

barplot(table(hotels_no_dup$arrival_date_day_of_month), xlab = "Arrival Day")

barplot(table(hotels_no_dup$stays_in_weekend_nights), xlab = "Weekend Night Stayed")

barplot(table(hotels_no_dup$stays_in_week_nights), xlab = "Week Night Stayed")

barplot(table(hotels_no_dup$adults), xlab = "Number of Adults")

barplot(table(hotels_no_dup$children), xlab = "Number of Children")

barplot(table(hotels_no_dup$babies), xlab = "Number of Babies")

barplot(table(hotels_no_dup$meal), xlab = "Type of Meal Booked")

barplot(table(hotels_no_dup$country), xlab = "Customer Country")

barplot(table(hotels_no_dup$market_segment), xlab = "Market Segment")

barplot(table(hotels_no_dup$distribution_channel), xlab = "Booking Distribution Channel")

barplot(table(hotels_no_dup$is_repeated_guest), xlab = "Is Repeated Guest")

barplot(table(hotels_no_dup$previous_cancellations), xlab = "Previous Bookings Cancelled")

barplot(table(hotels_no_dup$previous_bookings_not_canceled), xlab = "No. of Previous Bookings Not Cancelled")

barplot(table(hotels_no_dup$reserved_room_type), xlab = "Reserved Room Type")

barplot(table(hotels_no_dup$assigned_room_type), xlab = "Assigned Room Type")

barplot(table(hotels_no_dup$booking_changes), xlab = "Number of Changes Made")

barplot(table(hotels_no_dup$deposit_type), xlab = "Deposit Type")

barplot(table(hotels_no_dup$agent), xlab = "Travel Agent ID")

barplot(table(hotels_no_dup$company), xlab = "Company ID")

barplot(table(hotels_no_dup$days_in_waiting_list), xlab = "Days in Waitlist")

barplot(table(hotels_no_dup$customer_type), xlab = "Customer Type")

barplot(table(hotels_no_dup$required_car_parking_spaces), xlab = "Number of Required Parking Spaces")

barplot(table(hotels_no_dup$total_of_special_requests), xlab = "Number of Special Requests")

barplot(table(hotels_no_dup$reservation_status), xlab = "Reservation Last Status")

barplot(table(hotels_no_dup$reservation_status_date), xlab = "Last Status Date") # not so useful

The barplots above have given us a much better sense in terms of what the data tells us about the aspects being measured. Specifically, the list below summarizes some notable patterns by just eyeballing the dataset.

hotel hotel type

  • About 2/3 were city hotels, the rest were resort hotels

is_canceled Is the reservation cancelled?

  • Most reservations (about 2/3) were not cancelled, while about 1/3 were cancelled.

lead_time number of days between booking and arrival

  • Mean lead time was 79.89 (not sure in what unit)

arrival_date_year arrival year

  • Data was for 2015, 2016, 2017, and 2016 and 2017 seem to have most of the arrivals.

arrival_date_month arrival month

  • Arrivals seem to be evenly distributed throughout the year, with some peaks in July and August, some lows in January and November.

arrival_date_week_number arrival week

  • N/A

arrival_date_day_of_month arrival day

  • N/A

stays_in_weekend_nights number of weekend nights stayed

  • Most guest who stayed on weekends had spent 1 to two nights.

stays_in_week_nights number of week nights stayed

  • Most guest who stayed on week days had spent 1 to five nights. In addition, more guest stayed on week days than weekends.

adults number of adults

  • 2 adults for most of the cases

children number of children

  • 0 child for most of the cases

babies number of babies

  • 0 babies for most of the cases

meal type of meal booked

  • Most guests booked bed & breakfast meal type

country guest country of origin

  • PRT, FRA, and GBR seem to be the top 3 countries of origin

market_segment market segment

  • Reservations mostly booked through online travel agencies

distribution_channel distribution channel

  • Reservations mostly booked through travel agents or tour operators

is_repeated_guest is repeated guest

  • Most were not repeated guests

previous_cancellations number of previous cancellations

  • Most guests did not cancel reservations previously

previous_bookings_not_canceled Number of previous bookings not canceled

  • Most had value of 0

reserved_room_type room type reserved

  • Mostly type A and D

assigned_room_type room type assigned

  • Mostly type A and D

booking_changes number of changes made to the booking

  • Mostly no changes made to the reservation

deposit_type deposit type

  • Mostly no deposit

agent ID of the travel agency that made the booking

  • 9 and 240 account for most

company ID of the company/entity that made the booking or responsible for paying the booking

  • NA

days_in_waiting_list number of days in waitlist

  • Mostly 0

customer_type type of booking

  • Mostly transient guests

adr average daily rate

  • Mean is $106.34

required_car_parking_spaces number of parking spaces required

  • Mostly 0, with some 1

total_of_special_requests number of special requests made

  • Mostly 0, some had 1 to 3 special requests

reservation_status reservation last status

  • Mostly checked out

reservation_status_date reservation last status date

  • N/A

Now after replacing the missing value, removing duplicates, checking for data distribution and invalid values, the datset hotels_no_dup is considered clean, and it was renamed as hotels_final. The final dataset has 87396 rows and 32 variables, and the first 10 rows showed below.

# rename the final cleaned dataset
hotels_final <- hotels_no_dup
# final data snapshot
head(hotels_final, 5)
##          hotel is_canceled lead_time arrival_date_year arrival_date_month
## 1 Resort Hotel           0       342              2015               July
## 2 Resort Hotel           0       737              2015               July
## 3 Resort Hotel           0         7              2015               July
## 4 Resort Hotel           0        13              2015               July
## 5 Resort Hotel           0        14              2015               July
##   arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
## 1                       27                         1                       0
## 2                       27                         1                       0
## 3                       27                         1                       0
## 4                       27                         1                       0
## 5                       27                         1                       0
##   stays_in_week_nights adults children babies meal country market_segment
## 1                    0      2        0      0   BB     PRT         Direct
## 2                    0      2        0      0   BB     PRT         Direct
## 3                    1      1        0      0   BB     GBR         Direct
## 4                    1      1        0      0   BB     GBR      Corporate
## 5                    2      2        0      0   BB     GBR      Online TA
##   distribution_channel is_repeated_guest previous_cancellations
## 1               Direct                 0                      0
## 2               Direct                 0                      0
## 3               Direct                 0                      0
## 4            Corporate                 0                      0
## 5                TA/TO                 0                      0
##   previous_bookings_not_canceled reserved_room_type assigned_room_type
## 1                              0                  C                  C
## 2                              0                  C                  C
## 3                              0                  A                  C
## 4                              0                  A                  A
## 5                              0                  A                  A
##   booking_changes deposit_type agent company days_in_waiting_list customer_type
## 1               3   No Deposit  NULL    NULL                    0     Transient
## 2               4   No Deposit  NULL    NULL                    0     Transient
## 3               0   No Deposit  NULL    NULL                    0     Transient
## 4               0   No Deposit   304    NULL                    0     Transient
## 5               0   No Deposit   240    NULL                    0     Transient
##   adr required_car_parking_spaces total_of_special_requests reservation_status
## 1   0                           0                         0          Check-Out
## 2   0                           0                         0          Check-Out
## 3  75                           0                         0          Check-Out
## 4  75                           0                         0          Check-Out
## 5  98                           0                         1          Check-Out
##   reservation_status_date
## 1              2015-07-01
## 2              2015-07-01
## 3              2015-07-02
## 4              2015-07-02
## 5              2015-07-03
str(hotels_final)
## 'data.frame':    87396 obs. of  32 variables:
##  $ hotel                         : chr  "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
##  $ is_canceled                   : int  0 0 0 0 0 0 0 1 1 1 ...
##  $ lead_time                     : int  342 737 7 13 14 0 9 85 75 23 ...
##  $ arrival_date_year             : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ arrival_date_month            : chr  "July" "July" "July" "July" ...
##  $ arrival_date_week_number      : int  27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : int  0 0 1 1 2 2 2 3 3 4 ...
##  $ adults                        : int  2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ meal                          : chr  "BB" "BB" "BB" "BB" ...
##  $ country                       : chr  "PRT" "PRT" "GBR" "GBR" ...
##  $ market_segment                : chr  "Direct" "Direct" "Direct" "Corporate" ...
##  $ distribution_channel          : chr  "Direct" "Direct" "Direct" "Corporate" ...
##  $ is_repeated_guest             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reserved_room_type            : chr  "C" "C" "A" "A" ...
##  $ assigned_room_type            : chr  "C" "C" "C" "A" ...
##  $ booking_changes               : int  3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : chr  "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
##  $ agent                         : chr  "NULL" "NULL" "NULL" "304" ...
##  $ company                       : chr  "NULL" "NULL" "NULL" "NULL" ...
##  $ days_in_waiting_list          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : chr  "Transient" "Transient" "Transient" "Transient" ...
##  $ adr                           : num  0 0 75 75 98 ...
##  $ required_car_parking_spaces   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : int  0 0 0 0 1 0 1 1 0 0 ...
##  $ reservation_status            : chr  "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
##  $ reservation_status_date       : chr  "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...

EDA

In this section, exploratary data analysis was performed for the final dataset hotels_final. My plan of doing EDA is outlined below:

  1. subsetting the dataset based on interest. specifically, focuse on the reservations that had been canceled
  2. identify correlations between the outcome variables is_canceled and reservation_status with other variables using x-y plots
  3. select key variables to build a regression model for cancelation prediction

Due to the nature that this dataset is consist of categorical variables primarily, in order to achieve my analysis goals, I will need to learn more about data manipulation in terms of fitting my categorical variables into a regression model.