Introduction

  • Hotels struggle to make sense of their bookings and understand the behavioral patterns of their customers when it comes to booking rooms at their hotels.Looking at the data, we have two types of hotels which cater to different needs based on a variety of factors like the time of the year, the occasion of the visit, types of customers that are visiting etc. We’ve identified a gap which if alleviated by leveraging patters within the data, could potentially help hotels curate better solutions and packages to attract more customers.
  • We plan on conducting a thorough EDA, trying to identify correlations and trends between various factors that can lead to cancellation of bookings or in another case to see in which time periods bookings are rampant as to help hotels analyse and make sense of their booking data.
  • Through our analysis, we hope to aid hotels with revenue management, find places where they could increase their profits, understand their customer behavior and employ strategies to maximize their profits. Our analysis can also help hotels optimize their costs in places which are not as profitable.

Packages

The packages we are using are tidyverse ,janitor and dplyr. Tidyverse is a collection of R packages designed for data science. All packages share a similar design principle and data structure. The core packages within Tidyverse are ggplot2, dplyr, readr, purr, tibble, stringr and forcats, which help transform, model and visualize data. Dplyr and tidyr are used to solve data manipulation and data tidying challeneges respectively. Dplyr is a package for making tabular data wrangling easier by using a limited set of functions that can be combined to extract and summarize insights from your data. It pairs nicely with tidyr which enables you to swiftly convert between different data formats (long vs. wide) for plotting and analysis. The janitor package has a wide range of functions that facilitate data cleaning and exploration. The package is designed to be compatible with the tidyverse, and can therefore be seamlessly integrated into most data prep workflows

Library Description
‘tidyverse’ Used for data manipulation.
‘dplyr’ Used for data wrangling & manipulation.
‘ggplot2’ Used for creating data visualizations.
‘janitor’ Facilitates data cleaning and exploration.
# Install Packages
library(tidyverse)
library("feasts")
library("dplyr")
library("skimr")

Data Preparation

This is an open hotel booking demand dataset published on Science Direct website by Antonio, Almeida and Nunes in 2019 and we can obtain it from Github.

Source of the dataset is derived from two hotels (resort & city) located in Portugal. Most overlying purpose of this dataset was to study the revenue management by observing data for the time period 1 July,2015 - 31 Aug,2017. Data is extracted from hotel’s Property Management System SQL database using TSQL queries. There are a total of 32 variables and 119,390 records.

Load csv data

tuesdata <- tidytuesdayR::tt_load(2020, week = 7)
## Only 9 Github queries remaining until 2021-11-15 05:49:12 PM EST.
## Only 9 Github queries remaining until 2021-11-15 05:49:12 PM EST.
## Only 9 Github queries remaining until 2021-11-15 05:49:12 PM EST.
## Only 9 Github queries remaining until 2021-11-15 05:49:12 PM EST.
## Only 9 Github queries remaining until 2021-11-15 05:49:12 PM EST.
## --- Compiling #TidyTuesday Information for 2020-02-11 ----
## Only 8 Github queries remaining until 2021-11-15 05:49:11 PM EST.
## --- There are 3 files available ---
## Only 7 Github queries remaining until 2021-11-15 05:49:11 PM EST.
## --- Starting Download ---
## Only 7 Github queries remaining until 2021-11-15 05:49:11 PM EST.
## Only 6 Github queries remaining until 2021-11-15 05:49:11 PM EST.
## Only 5 Github queries remaining until 2021-11-15 05:49:11 PM EST.
## Only 4 Github queries remaining until 2021-11-15 05:49:11 PM EST.
## --- Download complete ---
hotels <- tuesdata$hotels
attach(hotels)

Dimensions of data set

dim(hotels)
## [1] 119390     32

Data Cleaning

  • We can use clean_names() to ensure that all the variables are in same format (snake-case).
hotels %>% 
  janitor::clean_names()
  • The data has three separate columns for months, year and day. It makes sense for us to combine the three columns into to make our analysis easier.
hotels_1 <- subset(hotels %>% 
  mutate(date = glue::glue("{arrival_date_year}-{arrival_date_month}-{arrival_date_day_of_month}"),
         date = parse_date(date, format = "%Y-%B-%d")) ,
  select = -c(arrival_date_year,arrival_date_day_of_month))
  • For the data, we checked for null values in every column which would give us a sense of what we would have to trim, use or not use from every column.
  • After checking each column, we could see that there are only 4 Na values under the ‘childrens’ column which would replace with the median.
lapply(hotels_1, function(x) {length(which(is.na(x)))}) # There are 4 NA in the data set
## $hotel
## [1] 0
## 
## $is_canceled
## [1] 0
## 
## $lead_time
## [1] 0
## 
## $arrival_date_month
## [1] 0
## 
## $arrival_date_week_number
## [1] 0
## 
## $stays_in_weekend_nights
## [1] 0
## 
## $stays_in_week_nights
## [1] 0
## 
## $adults
## [1] 0
## 
## $children
## [1] 4
## 
## $babies
## [1] 0
## 
## $meal
## [1] 0
## 
## $country
## [1] 0
## 
## $market_segment
## [1] 0
## 
## $distribution_channel
## [1] 0
## 
## $is_repeated_guest
## [1] 0
## 
## $previous_cancellations
## [1] 0
## 
## $previous_bookings_not_canceled
## [1] 0
## 
## $reserved_room_type
## [1] 0
## 
## $assigned_room_type
## [1] 0
## 
## $booking_changes
## [1] 0
## 
## $deposit_type
## [1] 0
## 
## $agent
## [1] 0
## 
## $company
## [1] 0
## 
## $days_in_waiting_list
## [1] 0
## 
## $customer_type
## [1] 0
## 
## $adr
## [1] 0
## 
## $required_car_parking_spaces
## [1] 0
## 
## $total_of_special_requests
## [1] 0
## 
## $reservation_status
## [1] 0
## 
## $reservation_status_date
## [1] 0
## 
## $date
## [1] 0
lapply(hotels, function(x) {length(which(is.null(x)))}) # There are no nulls in the data set.
summary(hotels_1$children)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0000  0.0000  0.1039  0.0000 10.0000       4
# Mean and median of children variable is same i.e. 0. Thus, we can replace NA's with 0

hotels_1$children[is.na(hotels_1$children)] <- 0
  • While exploring unique values we came across a variable which has “NULL” as a character. So we searched more variables and found there are 3 variables that have such scenario.
  • As mentioned by the author of this data set, customers tend to not give complete information for fields like company, country, agent etc. In such case we will keep the “NULL” value as is.
unique(hotels_1$company)
##   [1] "NULL" "110"  "113"  "270"  "178"  "240"  "154"  "144"  "307"  "268" 
##  [11] "59"   "204"  "312"  "318"  "94"   "174"  "274"  "195"  "223"  "317" 
##  [21] "281"  "118"  "53"   "286"  "12"   "47"   "324"  "342"  "373"  "371" 
##  [31] "383"  "86"   "82"   "218"  "88"   "31"   "397"  "392"  "405"  "331" 
##  [41] "367"  "20"   "83"   "416"  "51"   "395"  "102"  "34"   "84"   "360" 
##  [51] "394"  "457"  "382"  "461"  "478"  "386"  "112"  "486"  "421"  "9"   
##  [61] "308"  "135"  "224"  "504"  "269"  "356"  "498"  "390"  "513"  "203" 
##  [71] "263"  "477"  "521"  "169"  "515"  "445"  "337"  "251"  "428"  "292" 
##  [81] "388"  "130"  "250"  "355"  "254"  "543"  "531"  "528"  "62"   "120" 
##  [91] "42"   "81"   "116"  "530"  "103"  "39"   "16"   "92"   "61"   "501" 
## [101] "165"  "291"  "290"  "43"   "325"  "192"  "108"  "200"  "465"  "287" 
## [111] "297"  "490"  "482"  "207"  "282"  "437"  "225"  "329"  "272"  "28"  
## [121] "77"   "338"  "72"   "246"  "319"  "146"  "159"  "380"  "323"  "511" 
## [131] "407"  "278"  "80"   "403"  "399"  "14"   "137"  "343"  "346"  "347" 
## [141] "349"  "289"  "351"  "353"  "54"   "99"   "358"  "361"  "362"  "366" 
## [151] "372"  "365"  "277"  "109"  "377"  "379"  "22"   "378"  "330"  "364" 
## [161] "401"  "232"  "255"  "384"  "167"  "212"  "514"  "391"  "400"  "376" 
## [171] "402"  "396"  "302"  "398"  "6"    "370"  "369"  "409"  "168"  "104" 
## [181] "408"  "413"  "148"  "10"   "333"  "419"  "415"  "424"  "425"  "423" 
## [191] "422"  "435"  "439"  "442"  "448"  "443"  "454"  "444"  "52"   "459" 
## [201] "458"  "456"  "460"  "447"  "470"  "466"  "484"  "184"  "485"  "32"  
## [211] "487"  "491"  "494"  "193"  "516"  "496"  "499"  "29"   "78"   "520" 
## [221] "507"  "506"  "512"  "126"  "64"   "242"  "518"  "523"  "539"  "534" 
## [231] "436"  "525"  "541"  "40"   "455"  "410"  "45"   "38"   "49"   "48"  
## [241] "67"   "68"   "65"   "91"   "37"   "8"    "179"  "209"  "219"  "221" 
## [251] "227"  "153"  "186"  "253"  "202"  "216"  "275"  "233"  "280"  "309" 
## [261] "321"  "93"   "316"  "85"   "107"  "350"  "279"  "334"  "348"  "150" 
## [271] "73"   "385"  "418"  "197"  "450"  "452"  "115"  "46"   "76"   "96"  
## [281] "100"  "105"  "101"  "122"  "11"   "139"  "142"  "127"  "143"  "140" 
## [291] "149"  "163"  "160"  "180"  "238"  "183"  "222"  "185"  "217"  "215" 
## [301] "213"  "237"  "230"  "234"  "35"   "245"  "158"  "258"  "259"  "260" 
## [311] "411"  "257"  "271"  "18"   "106"  "210"  "273"  "71"   "284"  "301" 
## [321] "305"  "293"  "264"  "311"  "304"  "313"  "288"  "320"  "314"  "332" 
## [331] "341"  "352"  "243"  "368"  "393"  "132"  "220"  "412"  "420"  "426" 
## [341] "417"  "429"  "433"  "446"  "357"  "479"  "483"  "489"  "229"  "481" 
## [351] "497"  "451"  "492"
"NULL" %in% hotels_1$company
## [1] TRUE
"NULL" %in% hotels_1$country
## [1] TRUE
"NULL" %in% hotels_1$agent
## [1] TRUE
  • There are only two values in the hotel variable and we can convert them into numeric values which will help us modelling the data in future.
hotels_1$hotel[hotels_1$hotel %in% c('Resort Hotel')] <- 0
hotels_1$hotel[(hotels_1$hotel == 'City Hotel')] <- 1
  • We will remove duplicate rows and focus only on 8 variables to analyze this data set “lead_time” and “reserved_room_type” are kept in the data set in case if we find significant insights
hotel_data <- hotels_1[!duplicated(hotels_1),]
hotel_data <- hotel_data[c(1,2,3,4,15,16,17,18,21,26)]
dim(hotel_data)
## [1] 87396    10
  • Remove the adr values less than zero
hotel_data <- filter(hotel_data, hotel_data$adr >= 0)
summary(hotel_data$adr)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    72.0    98.1   106.3   134.0  5400.0
quantile(hotel_data$adr, probs = c(0.25,0.50,0.75,0.95,0.975,1.00), na.rm = TRUE)
##    25%    50%    75%    95%  97.5%   100% 
##   72.0   98.1  134.0  204.0  230.0 5400.0

Skim() gives a summary of data set and a sneak peak into their statistics

  • After cleaning the data, we get the following:
    • Three date columns have been merged into 1 giving us total 30 variables
    • Character variables - 12
    • Date variables - 2
    • Numeric variables - 16
    • NA’s are replaced by 0
skim(hotels_1)
Data summary
Name hotels_1
Number of rows 119390
Number of columns 31
_______________________
Column type frequency:
character 13
Date 2
numeric 16
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
hotel 0 1 1 1 0 2 0
arrival_date_month 0 1 3 9 0 12 0
meal 0 1 2 9 0 5 0
country 0 1 2 4 0 178 0
market_segment 0 1 6 13 0 8 0
distribution_channel 0 1 3 9 0 5 0
reserved_room_type 0 1 1 1 0 10 0
assigned_room_type 0 1 1 1 0 12 0
deposit_type 0 1 10 10 0 3 0
agent 0 1 1 4 0 334 0
company 0 1 1 4 0 353 0
customer_type 0 1 5 15 0 4 0
reservation_status 0 1 7 9 0 3 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
reservation_status_date 0 1 2014-10-17 2017-09-14 2016-08-07 926
date 0 1 2015-07-01 2017-08-31 2016-09-06 793

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
is_canceled 0 1 0.37 0.48 0.00 0.00 0.00 1 1 ▇▁▁▁▅
lead_time 0 1 104.01 106.86 0.00 18.00 69.00 160 737 ▇▂▁▁▁
arrival_date_week_number 0 1 27.17 13.61 1.00 16.00 28.00 38 53 ▅▇▇▇▅
stays_in_weekend_nights 0 1 0.93 1.00 0.00 0.00 1.00 2 19 ▇▁▁▁▁
stays_in_week_nights 0 1 2.50 1.91 0.00 1.00 2.00 3 50 ▇▁▁▁▁
adults 0 1 1.86 0.58 0.00 2.00 2.00 2 55 ▇▁▁▁▁
children 0 1 0.10 0.40 0.00 0.00 0.00 0 10 ▇▁▁▁▁
babies 0 1 0.01 0.10 0.00 0.00 0.00 0 10 ▇▁▁▁▁
is_repeated_guest 0 1 0.03 0.18 0.00 0.00 0.00 0 1 ▇▁▁▁▁
previous_cancellations 0 1 0.09 0.84 0.00 0.00 0.00 0 26 ▇▁▁▁▁
previous_bookings_not_canceled 0 1 0.14 1.50 0.00 0.00 0.00 0 72 ▇▁▁▁▁
booking_changes 0 1 0.22 0.65 0.00 0.00 0.00 0 21 ▇▁▁▁▁
days_in_waiting_list 0 1 2.32 17.59 0.00 0.00 0.00 0 391 ▇▁▁▁▁
adr 0 1 101.83 50.54 -6.38 69.29 94.58 126 5400 ▇▁▁▁▁
required_car_parking_spaces 0 1 0.06 0.25 0.00 0.00 0.00 0 8 ▇▁▁▁▁
total_of_special_requests 0 1 0.57 0.79 0.00 0.00 0.00 1 5 ▇▁▁▁▁

Proposed Exploratory Data Analysis

  • Our aim is to look at a couple of trends or relationships between variables that could help hotels strategize better to maximize their revenue.
  • We aim to look at how advanced booking is being, if there is any relation with a certain time frame during the year when we see peaks or dips of advanced bookings.
  • We also want to look at the different hotel types and if there is any relationship between the type of the hotel and the number of guests, or during a particular time of the year.
  • We also want to see if the number of frequent customers for every hotel is cyclical so as to better prepare the hotel for an influx of customers.
  • We would also like to delve in to the type of rooms with the time of booking to see if there exists a relation or not.
  • The distribution of bookings between direct bookings and travel agencies could lead to some interesting observations as well.
  • We also want to look at the wait list through out the year to see if we can find anything interesting.
  • We would use a variety of plots from scatter plot, bar charts to trend lines to uncover insights that we have referenced on the first point.
  • We do not know much about modelling in R so we will pick that up as our classes continue.

In-Class Coding exercises Module 4

Start transforming your data to gain new insights. Based on what you learned this week, some questions you may want to ask are:

  • What features could you filter on?
    • We can filter on is_repeated_guest, adr, arrival_date_month, assigned_room_type Repeated guests might prefer a particular room type, we can observe if the adr is significantly higher for that. Also, we can see the frequency of bookings at particular month using arrival_date_month.
  • How could arranging your data in different ways help?
    • We can divide data according to hotel type (resort, city) and identify the patterns between their lead_time, previous_cancellations, previous_bookings_not_canceled, reserved_room_type, deposit_type. We could also find from which country majority guests arrives; their cancellation rate, type of guests, special services, meals etc
  • Can you reduce your data by selecting only certain variables?
    • Yes, we need 8 variables out of 32 to analyze the data, “hotel”, “is_canceled”, “arrival_date_month” , “is_repeated_guest”, “previous_cancellations” , “previous_bookings_not_canceled”,“deposit_type” , “adr”, “lead_time”, “country” and “reserved_room_type” are kept in the data set in case if we find significant insights.
  • Could creating new variables add new insights?
    • Yes, a new variable, previous_cancellation_rate, can be created by previous_cancellations and previous_bookings_not_canceled. This new variable would provide better view of the probability a customer canceled reservations historically.
  • Could summary statistics at different categorical levels tell you more?
    • Yes, data is can be aggregated by hotel or country. This detail could potentially help companies that run hotels in resort and/ or city and/or in different countries.
  • How can you incorporate the pipe (%>%) operator to make your code more efficient?
    • Pipe operator can help to do group_by function, filter,select, mutate and then again feed the output from these functions to perform some other function.
  • Does your final project leverage more than one data set?
    • No, so far we are analyzing only one data set.