Final Project Submission

Introduction

Hotels struggle to make sense of their bookings and understand the behavioral patterns of their customers when it comes to booking rooms at their hotels.Looking at the data, we have two types of hotels which cater to different needs based on a variety of factors like the time of the year, the occasion of the visit, types of customers that are visiting etc. We’ve identified a gap which if alleviated by leveraging patters within the data, could potentially help hotels curate better solutions and packages to attract more customers.
We plan on conducting a thorough EDA, trying to identify correlations and trends between various factors that can lead to cancellation of bookings or in another case to see in which time periods bookings are rampant as to help hotels analyse and make sense of their booking data.
Through our analysis, we hope to aid hotels with revenue management, find places where they could increase their profits, understand their customer behavior and employ strategies to maximize their profits. Our analysis can also help hotels optimize their costs in places which are not as profitable.

Packages

The packages we are using are tidyverse ,janitor and dplyr. Tidyverse is a collection of R packages designed for data science. All packages share a similar design principle and data structure. The core packages within Tidyverse are ggplot2, dplyr, readr, purr, tibble, stringr and forcats, which help transform, model and visualize data. Dplyr and tidyr are used to solve data manipulation and data tidying challeneges respectively. Dplyr is a package for making tabular data wrangling easier by using a limited set of functions that can be combined to extract and summarize insights from your data. It pairs nicely with tidyr which enables you to swiftly convert between different data formats (long vs. wide) for plotting and analysis. The janitor package has a wide range of functions that facilitate data cleaning and exploration. The package is designed to be compatible with the tidyverse, and can therefore be seamlessly integrated into most data prep workflows

Library	Description
‘tidyverse’	Used for data manipulation.
‘dplyr’	Used for data wrangling & manipulation.
‘ggplot2’	Used for creating data visualizations.
‘janitor’	Facilitates data cleaning and exploration.

# Install Packages
library("dplyr") 
library("magrittr")
library("tidyverse")
library("tidytuesdayR")
library("tidyr")
library("here")
library("janitor")
library("ggplot2")

Data Preparation

This is an open hotel booking demand dataset published on Science Direct website by Antonio, Almeida and Nunes in 2019 and we can obtain it from Github.

Source of the dataset is derived from two hotels (resort & city) located in Portugal. Most overlying purpose of this dataset was to study the revenue management by observing data for the time period 1 July,2015 - 31 Aug,2017. Data is extracted from hotel’s Property Management System SQL database using TSQL queries. There are a total of 32 variables and 119,390 records.

Load csv data

x <-read.csv('/Users/sarangsh/Documents/Subjects for Fall /6030 Data Management Prof. Chen/Week 6/hotel_bookings.csv')
hotels <- as.data.frame(x)

Dimensions of data set

dim(hotels)

## [1] 119390     32

Assure the name of variables are in acceptable format

hotels %>% 
  janitor::clean_names()

Data Cleaning

Find structure and data type of each variable

str(hotels)

Some of the variables are converted to factors for modelling purpose in case

columns <- c("hotel", "market_segment", "distribution_channel", 
             "reserved_room_type", "assigned_room_type", "deposit_type", "customer_type")

hotels[columns] <- lapply(hotels[columns], as.factor)

hotels$is_repeated_guest   <- as.factor(hotels$is_repeated_guest)

Label the levels of some variables

hotels$is_canceled  <- factor(hotels$is_canceled, levels=c(0, 1), 
                              labels=c("Not canceled", "Canceled"))

Creating a new variable with months as factors

hotels$arrival_date_month<- factor(hotels$arrival_date_month, levels = 
                                     c("January", "February", "March", "April", "May", "June", "July",
                                       "August", "September", "October", "November", "December"))

Add new column as total_stay = stays_in_week_nights+stays_in_weekend_nights

hotels <- hotels %>%
  mutate(total_stay =  stays_in_week_nights + stays_in_weekend_nights) %>%
  select(-c(stays_in_week_nights, stays_in_weekend_nights))

Add new column as total_guests

hotels <- hotels %>%
  mutate(total_guests = adults + children + babies) %>%
  select(-c(adults , children , babies))

Change data type of total_guests from double to int as total number of people can’t be a fraction

hotels$total_guests <- as.integer(hotels$total_guests)

Identify NA

lapply(hotels, function(x) {length(which(is.na(x)))}) # There are 4 NA in the data set

## $hotel
## [1] 0
## 
## $is_canceled
## [1] 0
## 
## $lead_time
## [1] 0
## 
## $arrival_date_year
## [1] 0
## 
## $arrival_date_month
## [1] 0
## 
## $arrival_date_week_number
## [1] 0
## 
## $arrival_date_day_of_month
## [1] 0
## 
## $meal
## [1] 0
## 
## $country
## [1] 0
## 
## $market_segment
## [1] 0
## 
## $distribution_channel
## [1] 0
## 
## $is_repeated_guest
## [1] 0
## 
## $previous_cancellations
## [1] 0
## 
## $previous_bookings_not_canceled
## [1] 0
## 
## $reserved_room_type
## [1] 0
## 
## $assigned_room_type
## [1] 0
## 
## $booking_changes
## [1] 0
## 
## $deposit_type
## [1] 0
## 
## $agent
## [1] 0
## 
## $company
## [1] 0
## 
## $days_in_waiting_list
## [1] 0
## 
## $customer_type
## [1] 0
## 
## $adr
## [1] 0
## 
## $required_car_parking_spaces
## [1] 0
## 
## $total_of_special_requests
## [1] 0
## 
## $reservation_status
## [1] 0
## 
## $reservation_status_date
## [1] 0
## 
## $total_stay
## [1] 0
## 
## $total_guests
## [1] 4

Identify null values

lapply(hotels, function(x) {length(which(is.null(x)))}) # There are no nulls in the data set.

## $hotel
## [1] 0
## 
## $is_canceled
## [1] 0
## 
## $lead_time
## [1] 0
## 
## $arrival_date_year
## [1] 0
## 
## $arrival_date_month
## [1] 0
## 
## $arrival_date_week_number
## [1] 0
## 
## $arrival_date_day_of_month
## [1] 0
## 
## $meal
## [1] 0
## 
## $country
## [1] 0
## 
## $market_segment
## [1] 0
## 
## $distribution_channel
## [1] 0
## 
## $is_repeated_guest
## [1] 0
## 
## $previous_cancellations
## [1] 0
## 
## $previous_bookings_not_canceled
## [1] 0
## 
## $reserved_room_type
## [1] 0
## 
## $assigned_room_type
## [1] 0
## 
## $booking_changes
## [1] 0
## 
## $deposit_type
## [1] 0
## 
## $agent
## [1] 0
## 
## $company
## [1] 0
## 
## $days_in_waiting_list
## [1] 0
## 
## $customer_type
## [1] 0
## 
## $adr
## [1] 0
## 
## $required_car_parking_spaces
## [1] 0
## 
## $total_of_special_requests
## [1] 0
## 
## $reservation_status
## [1] 0
## 
## $reservation_status_date
## [1] 0
## 
## $total_stay
## [1] 0
## 
## $total_guests
## [1] 0

Identifying unique values

unique(hotels$company)

##   [1] "NULL" "110"  "113"  "270"  "178"  "240"  "154"  "144"  "307"  "268" 
##  [11] "59"   "204"  "312"  "318"  "94"   "174"  "274"  "195"  "223"  "317" 
##  [21] "281"  "118"  "53"   "286"  "12"   "47"   "324"  "342"  "373"  "371" 
##  [31] "383"  "86"   "82"   "218"  "88"   "31"   "397"  "392"  "405"  "331" 
##  [41] "367"  "20"   "83"   "416"  "51"   "395"  "102"  "34"   "84"   "360" 
##  [51] "394"  "457"  "382"  "461"  "478"  "386"  "112"  "486"  "421"  "9"   
##  [61] "308"  "135"  "224"  "504"  "269"  "356"  "498"  "390"  "513"  "203" 
##  [71] "263"  "477"  "521"  "169"  "515"  "445"  "337"  "251"  "428"  "292" 
##  [81] "388"  "130"  "250"  "355"  "254"  "543"  "531"  "528"  "62"   "120" 
##  [91] "42"   "81"   "116"  "530"  "103"  "39"   "16"   "92"   "61"   "501" 
## [101] "165"  "291"  "290"  "43"   "325"  "192"  "108"  "200"  "465"  "287" 
## [111] "297"  "490"  "482"  "207"  "282"  "437"  "225"  "329"  "272"  "28"  
## [121] "77"   "338"  "72"   "246"  "319"  "146"  "159"  "380"  "323"  "511" 
## [131] "407"  "278"  "80"   "403"  "399"  "14"   "137"  "343"  "346"  "347" 
## [141] "349"  "289"  "351"  "353"  "54"   "99"   "358"  "361"  "362"  "366" 
## [151] "372"  "365"  "277"  "109"  "377"  "379"  "22"   "378"  "330"  "364" 
## [161] "401"  "232"  "255"  "384"  "167"  "212"  "514"  "391"  "400"  "376" 
## [171] "402"  "396"  "302"  "398"  "6"    "370"  "369"  "409"  "168"  "104" 
## [181] "408"  "413"  "148"  "10"   "333"  "419"  "415"  "424"  "425"  "423" 
## [191] "422"  "435"  "439"  "442"  "448"  "443"  "454"  "444"  "52"   "459" 
## [201] "458"  "456"  "460"  "447"  "470"  "466"  "484"  "184"  "485"  "32"  
## [211] "487"  "491"  "494"  "193"  "516"  "496"  "499"  "29"   "78"   "520" 
## [221] "507"  "506"  "512"  "126"  "64"   "242"  "518"  "523"  "539"  "534" 
## [231] "436"  "525"  "541"  "40"   "455"  "410"  "45"   "38"   "49"   "48"  
## [241] "67"   "68"   "65"   "91"   "37"   "8"    "179"  "209"  "219"  "221" 
## [251] "227"  "153"  "186"  "253"  "202"  "216"  "275"  "233"  "280"  "309" 
## [261] "321"  "93"   "316"  "85"   "107"  "350"  "279"  "334"  "348"  "150" 
## [271] "73"   "385"  "418"  "197"  "450"  "452"  "115"  "46"   "76"   "96"  
## [281] "100"  "105"  "101"  "122"  "11"   "139"  "142"  "127"  "143"  "140" 
## [291] "149"  "163"  "160"  "180"  "238"  "183"  "222"  "185"  "217"  "215" 
## [301] "213"  "237"  "230"  "234"  "35"   "245"  "158"  "258"  "259"  "260" 
## [311] "411"  "257"  "271"  "18"   "106"  "210"  "273"  "71"   "284"  "301" 
## [321] "305"  "293"  "264"  "311"  "304"  "313"  "288"  "320"  "314"  "332" 
## [331] "341"  "352"  "243"  "368"  "393"  "132"  "220"  "412"  "420"  "426" 
## [341] "417"  "429"  "433"  "446"  "357"  "479"  "483"  "489"  "229"  "481" 
## [351] "497"  "451"  "492"

"NULL" %in% hotels$company

## [1] TRUE

"NULL" %in% hotels$country

## [1] TRUE

"NULL" %in% hotels$agent

## [1] TRUE

Remove average daily rate < 0}

hotels <- filter(hotels, hotels$adr >= 0)
summary(hotels$adr)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   69.29   94.59  101.83  126.00 5400.00

quantile(hotels$adr, probs = c(0.25,0.50,0.75,0.95,0.975,1.00), na.rm = TRUE)

##      25%      50%      75%      95%    97.5%     100% 
##   69.290   94.590  126.000  193.500  221.383 5400.000

Exploratory Data Analysis

Our aim is to look at a couple of trends or relationships between variables that could help hotels strategize better to maximize their revenue.
We aim to look at how advanced booking is being, if there is any relation with a certain time frame during the year when we see peaks or dips of advanced bookings.
We also want to look at the different hotel types and if there is any relationship between the type of the hotel and the number of guests, or during a particular time of the year.
We also want to see if the number of frequent customers for every hotel is cyclical so as to better prepare the hotel for an influx of customers.
We would also like to delve in to the type of rooms with the time of booking to see if there exists a relation or not.
The distribution of bookings between direct bookings and travel agencies could lead to some interesting observations as well.
We also want to look at the wait list through out the year to see if we can find anything interesting.

% of repeated guests Only 3.19% of total customers are repeated guests

hotels %>% 
  group_by(is_repeated_guest) %>% 
  count(is_repeated_guest) %>% 
  summarise(percent = round(n/nrow(hotels)*100,2), total = n)

## # A tibble: 2 × 3
##   is_repeated_guest percent  total
##   <fct>               <dbl>  <int>
## 1 0                   96.8  115580
## 2 1                    3.19   3809

Investigating the percentage of guests that repeat Only 3.19% of the guests are repeated guests.

hotels %>% ggplot(aes(x = hotel, fill = is_repeated_guest)) +
  geom_bar(position = "dodge") +
  geom_text(stat = "Count", aes(label = scales::percent(..count../sum(..count..))), position = position_dodge(0.8), vjust = 1.5) +
  scale_fill_manual(values=c("#C2A5CF", "#80CDC1"),labels = c("Non repeated guest", "Repeated guest")) +   # labels the filter
  guides(fill = guide_legend(title = NULL))    +  # removes the guide title
  ggtitle("Frequent visitors in two types of hotel")

How many bookings were cancelled - 37% of bookings were cancelled

hotels %>%
  group_by(is_canceled) %>%
  count(is_canceled) %>%
  summarise(percent = round(n/nrow(hotels)*100,2), total = n)

## # A tibble: 2 × 3
##   is_canceled  percent total
##   <fct>          <dbl> <int>
## 1 Not canceled    63.0 75165
## 2 Canceled        37.0 44224

hotels %>% ggplot(aes(x = hotel, fill = is_canceled)) +
  geom_bar(position = "dodge") +
  geom_text(stat = "Count", aes(label = scales::percent(..count../sum(..count..))), position = position_dodge(0.8), vjust = 1.5) +
  scale_fill_manual(values=c("#C2A5CF", "#80CDC1"),labels = c("Not canceled", "Canceled")) +   # labels the filter
  guides(fill = guide_legend(title = NULL))    +  # removes the guide title
  ggtitle("Bookings cancellations in two types of hotel")

Booking cancellations in two types of hotels

hotels %>% ggplot(aes(x = is_repeated_guest, fill = is_canceled)) +
  geom_bar(position = "dodge") +
  geom_text(stat = "Count", aes(label = scales::percent(..count../sum(..count..))), position = position_dodge(0.8), vjust = 1.0) +
  scale_fill_manual(values=c("#C2A5CF", "#80CDC1"),labels = c("No previous cancellations", "Previous cancellations")) +   # labels the filter
  guides(fill = guide_legend(title = NULL))    +  # removes the guide title
  ggtitle("Booking cancellations in two types of hotel")

To find the most common customer (guest) type

ggplot(hotels, aes(customer_type)) +
  geom_bar( fill = "#C2A5CF")

Bookings wrt room type Room A was the most popular Room

ggplot(hotels, aes(reserved_room_type , fill = factor(hotel))) +
  geom_bar(fill= "#C2A5CF")

Hotel type preferred by repeated guests

ggplot(data = hotels, aes(x = hotel)) +
  geom_bar()          # customize ylim

Months with lowest / highest bookings August has highest bookings January, Dec, Non has seen 50% less bookings than August; might be because people prefer to spend holiday at their home or the prices are too high

ggplot(data = hotels, aes(x = arrival_date_month)) +
  geom_bar()

To dig deeper we tried to find co-relation between arrival_date_week_number and arrival_date_year.

Negavtive co relation

cor(hotels$arrival_date_week_number, hotels$arrival_date_year)

## [1] -0.5405556

Most common days of booking 8-24 are the days within the month that have most number of bookings coming in

quantile(hotels$arrival_date_day_of_month, probs = c(0.25,0.50,0.75))

## 25% 50% 75% 
##   8  16  23

ggplot(data = hotels, aes(x = hotel, y = arrival_date_day_of_month)) +
  geom_boxplot()

Meal choices Bread and breakfast is the most popular meal choice among the customers

ggplot(data = hotels, aes(x = meal)) +
  geom_bar()

Customer types

hotels %>% 
  group_by(hotels$customer_type) %>%
  count()

## # A tibble: 4 × 2
## # Groups:   hotels$customer_type [4]
##   `hotels$customer_type`     n
##   <fct>                  <int>
## 1 Contract                4076
## 2 Group                    577
## 3 Transient              89613
## 4 Transient-Party        25123

Summary

A very small segment of the customer base actually frequents these hotels more than once which points towards the kind of experience or service that they are receiving from the hotel management.
Majority of the bookings are cancelled which leads to the hotel rethinking their strategy with respect to attracting customers.
City hotels are favored over resorts and July to August is the peak time where the highest percentage of bookings occur throughout the year
Group bookings are less likely to cancel vs individual bookings. *City Hotels have more cancellations versus Resort Hotels but City Hotels have been booked more than Resort Hotels.
Transients are the most common type of customers that have bookings at the hotel.
Room A was the most popular choice of room among the hotels.
There exists a negative correlation between the arrival date and the year.
8-24 are the days in which majority of bookings take place.
Bread and breakfast is the most popular choice of breakfast among customers.

Insights

There exists a strong correlation between certain weeks and the number of bookings for those specific weeks. Hotels need to be mindful about how there are specific weeks where bookings are rampant.
As city hotels are popular, try a competitive analysis to understand why city hotels are doing better and implement your findings in resorts.
Push for lucrative offers during the peak season so that one can maximize their revenue.
Increase marketing efforts to raise awareness of the hotels in countries apart from the ones that have made bookings.
Try to understand why resort hotels are not doing well as tourist destinations because city hotels are being preferred by people from different countries as well.
As group cancellations are lower, the hotel management can target groups by pushing marketing towards these groups so that they get more bookings.