The purpose of the project is to analyze the factors that affect such as hotel booking and cancellation, as well as guest retention, that will ultimately be used to boost hotel revenue.
For these purpose, hotel booking dataset was used in this project. This data includes two types of hotels: resort hotel (H1) and city hotel (H2).
I will also try to obtain more information about the guests checking such as when they booked their reservation, and any insights that could be helpful in understanding the client base. In addition, I will also evaluate subgroup of these reservations such as guests who chose to cancel their stays to see if there are any underlying factors that may contribute to guest cancellations.
With these insights, hotels could make better data-driven decisions that could finally increase their revenue.
library(ggplot2) # For graphs in Explorartory Data Analysis
## Warning: package 'ggplot2' was built under R version 3.6.3
library(dplyr) # For Data Manipulation like select(),mutate(),groupby()
## Warning: package 'dplyr' was built under R version 3.6.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(scales) #Automatically determine breaks and labels for axes and legends.
## Warning: package 'scales' was built under R version 3.6.3
library(tidyverse) #Meta Package to load other necessary Packages like ggplot2,dplyr,readr
## Warning: package 'tidyverse' was built under R version 3.6.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble 3.1.1 v purrr 0.3.4
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## Warning: package 'tibble' was built under R version 3.6.3
## Warning: package 'tidyr' was built under R version 3.6.3
## Warning: package 'readr' was built under R version 3.6.3
## Warning: package 'purrr' was built under R version 3.6.3
## Warning: package 'stringr' was built under R version 3.6.2
## Warning: package 'forcats' was built under R version 3.6.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x readr::col_factor() masks scales::col_factor()
## x purrr::discard() masks scales::discard()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(readr) #For reading contents from a file
library(e1071) #For naive bayes,SVM models
## Warning: package 'e1071' was built under R version 3.6.3
library(caret) #Misc. functions for training and plotting classification and regression models
## Warning: package 'caret' was built under R version 3.6.3
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(rpart) #For building decision tree models
## Warning: package 'rpart' was built under R version 3.6.2
library(rpart.plot) #For Plotting decision Trees
## Warning: package 'rpart.plot' was built under R version 3.6.3
The data was cited from the original article Hotel Booking Demand Datasets, authored by Nuno Antonio, Ana Almeida, and Luis Nunes published in the journal Data in Brief, Volume 22, February 2019.
This data set contains booking information for resort hotel (H1) and city hotel (H2), and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things. Both data sets share the same structure, with 31 variables describing the 40,060 observations of H1 and 79,330 observations of H2. Each observation represents a hotel booking. Both datasets comprehend bookings due to arrive between the 1st of July of 2015 and the 31st of August 2017, including bookings that effectively arrived and bookings that were canceled. Since this is hotel real data, all data elements pertaining hotel or costumer identification were deleted.
hotel <- read_csv("C:/BANA/Data_analytics_program/Summer_2021/BANA_7025_DATA_WRANGLING/Middle_term_YGY/datasets/hotels/hotels.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_double(),
## hotel = col_character(),
## arrival_date_month = col_character(),
## meal = col_character(),
## country = col_character(),
## market_segment = col_character(),
## distribution_channel = col_character(),
## reserved_room_type = col_character(),
## assigned_room_type = col_character(),
## deposit_type = col_character(),
## agent = col_character(),
## company = col_character(),
## customer_type = col_character(),
## reservation_status = col_character(),
## reservation_status_date = col_date(format = "")
## )
## i Use `spec()` for the full column specifications.
dim(hotel)
## [1] 119390 32
summary(hotel)
## hotel is_canceled lead_time arrival_date_year
## Length:119390 Min. :0.0000 Min. : 0 Min. :2015
## Class :character 1st Qu.:0.0000 1st Qu.: 18 1st Qu.:2016
## Mode :character Median :0.0000 Median : 69 Median :2016
## Mean :0.3704 Mean :104 Mean :2016
## 3rd Qu.:1.0000 3rd Qu.:160 3rd Qu.:2017
## Max. :1.0000 Max. :737 Max. :2017
##
## arrival_date_month arrival_date_week_number arrival_date_day_of_month
## Length:119390 Min. : 1.00 Min. : 1.0
## Class :character 1st Qu.:16.00 1st Qu.: 8.0
## Mode :character Median :28.00 Median :16.0
## Mean :27.17 Mean :15.8
## 3rd Qu.:38.00 3rd Qu.:23.0
## Max. :53.00 Max. :31.0
##
## stays_in_weekend_nights stays_in_week_nights adults
## Min. : 0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 1.0 1st Qu.: 2.000
## Median : 1.0000 Median : 2.0 Median : 2.000
## Mean : 0.9276 Mean : 2.5 Mean : 1.856
## 3rd Qu.: 2.0000 3rd Qu.: 3.0 3rd Qu.: 2.000
## Max. :19.0000 Max. :50.0 Max. :55.000
##
## children babies meal country
## Min. : 0.0000 Min. : 0.000000 Length:119390 Length:119390
## 1st Qu.: 0.0000 1st Qu.: 0.000000 Class :character Class :character
## Median : 0.0000 Median : 0.000000 Mode :character Mode :character
## Mean : 0.1039 Mean : 0.007949
## 3rd Qu.: 0.0000 3rd Qu.: 0.000000
## Max. :10.0000 Max. :10.000000
## NA's :4
## market_segment distribution_channel is_repeated_guest
## Length:119390 Length:119390 Min. :0.00000
## Class :character Class :character 1st Qu.:0.00000
## Mode :character Mode :character Median :0.00000
## Mean :0.03191
## 3rd Qu.:0.00000
## Max. :1.00000
##
## previous_cancellations previous_bookings_not_canceled reserved_room_type
## Min. : 0.00000 Min. : 0.0000 Length:119390
## 1st Qu.: 0.00000 1st Qu.: 0.0000 Class :character
## Median : 0.00000 Median : 0.0000 Mode :character
## Mean : 0.08712 Mean : 0.1371
## 3rd Qu.: 0.00000 3rd Qu.: 0.0000
## Max. :26.00000 Max. :72.0000
##
## assigned_room_type booking_changes deposit_type agent
## Length:119390 Min. : 0.0000 Length:119390 Length:119390
## Class :character 1st Qu.: 0.0000 Class :character Class :character
## Mode :character Median : 0.0000 Mode :character Mode :character
## Mean : 0.2211
## 3rd Qu.: 0.0000
## Max. :21.0000
##
## company days_in_waiting_list customer_type adr
## Length:119390 Min. : 0.000 Length:119390 Min. : -6.38
## Class :character 1st Qu.: 0.000 Class :character 1st Qu.: 69.29
## Mode :character Median : 0.000 Mode :character Median : 94.58
## Mean : 2.321 Mean : 101.83
## 3rd Qu.: 0.000 3rd Qu.: 126.00
## Max. :391.000 Max. :5400.00
##
## required_car_parking_spaces total_of_special_requests reservation_status
## Min. :0.00000 Min. :0.0000 Length:119390
## 1st Qu.:0.00000 1st Qu.:0.0000 Class :character
## Median :0.00000 Median :0.0000 Mode :character
## Mean :0.06252 Mean :0.5714
## 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :8.00000 Max. :5.0000
##
## reservation_status_date
## Min. :2014-10-17
## 1st Qu.:2016-02-01
## Median :2016-08-07
## Mean :2016-07-30
## 3rd Qu.:2017-02-08
## Max. :2017-09-14
##
colnames(hotel)
## [1] "hotel" "is_canceled"
## [3] "lead_time" "arrival_date_year"
## [5] "arrival_date_month" "arrival_date_week_number"
## [7] "arrival_date_day_of_month" "stays_in_weekend_nights"
## [9] "stays_in_week_nights" "adults"
## [11] "children" "babies"
## [13] "meal" "country"
## [15] "market_segment" "distribution_channel"
## [17] "is_repeated_guest" "previous_cancellations"
## [19] "previous_bookings_not_canceled" "reserved_room_type"
## [21] "assigned_room_type" "booking_changes"
## [23] "deposit_type" "agent"
## [25] "company" "days_in_waiting_list"
## [27] "customer_type" "adr"
## [29] "required_car_parking_spaces" "total_of_special_requests"
## [31] "reservation_status" "reservation_status_date"
colSums(is.na(hotel))
## hotel is_canceled
## 0 0
## lead_time arrival_date_year
## 0 0
## arrival_date_month arrival_date_week_number
## 0 0
## arrival_date_day_of_month stays_in_weekend_nights
## 0 0
## stays_in_week_nights adults
## 0 0
## children babies
## 4 0
## meal country
## 0 0
## market_segment distribution_channel
## 0 0
## is_repeated_guest previous_cancellations
## 0 0
## previous_bookings_not_canceled reserved_room_type
## 0 0
## assigned_room_type booking_changes
## 0 0
## deposit_type agent
## 0 0
## company days_in_waiting_list
## 0 0
## customer_type adr
## 0 0
## required_car_parking_spaces total_of_special_requests
## 0 0
## reservation_status reservation_status_date
## 0 0
hotel$meal[hotel$meal=='Undefined'] <- 'SC'
hotel$children[is.na(hotel$children)] <- 0
hotel <- subset(hotel, market_segment!='Undefined')
hotel <- subset(hotel, distribution_channel!='Undefined')
table(hotel$hotel)
##
## City Hotel Resort Hotel
## 79326 40059
The above table clearly showed that more City Hotel booking was included than that of Resort Hotels in this dataset.
table(hotel$is_canceled, hotel$hotel)
##
## City Hotel Resort Hotel
## 0 46228 28937
## 1 33098 11122
It was really surprisingly that for both of the hotels, the percentage of cancellation was pretty high, especially for the City Hotel.
Next, I tried to check how long the guests stay. The dataset includes weekends and weekdays, so I combined the two columns into a vector and examined the total time.
weekend <- hotel$stays_in_weekend_nights
weekday <- hotel$stays_in_week_nights
total <- weekday + weekend
total[total == 0] <- NA
summary(total)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 2.000 3.000 3.449 4.000 69.000 715
total1 <-as.factor(total)
plot(total1)
From the figure above we can conclude that most of guests stayed for 1 - 4 days.
adult <- hotel$adults
child <- hotel$children
baby <- hotel$babies
total_guest <- adult + baby + child
total_guest[total_guest == 0] <- NA
summary(total_guest)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 2.000 2.000 1.971 2.000 55.000 180
From the above output, it was hard to say what’s the number of the most guest per booking, so we plot the number of guest per booking.
guest <-as.factor(total_guest)
plot(guest,ylab="Number of Booking",xlab="Number of Guests")
From the results above, it was clear shown that most of the booking have two guests.
I continued to check the distribution of hotel bookings on a monthly basis and plotted the bookings as following:
month <-as.factor(hotel$arrival_date_month)
month<-factor(month, levels=c("January","February","March","April","May","June","July","August","September","October","November","December"))
plot(month)
From the figure above, we can generally see that guests prefer to book the hotel during warm/hot seasons.
day <- as.factor(hotel$arrival_date_day_of_month)
plot(day, ylab = "Nunber of Guest Arrival", xlab = "Day of the Month")
From the above figure we can see that there is no clear trend that guests prefer to arrive at specific day(s) of a month. The reason that the 31st day of the month has significantly less arrivals than other days is because not every month has the 31th day.
Above are just some raw data pre cleaning and preliminary analysis of the dataset. Toward the final project, I am working on establishing models to analyze various factors(variables) that affect guest booking cancellation as well as guest retention.