Mid term Project Submission

Introduction

Hotels struggle to make sense of their bookings and understand the behavioral patterns of their customers when it comes to booking rooms at their hotels.Looking at the data, we have two types of hotels which cater to different needs based on a variety of factors like the time of the year, the occasion of the visit, types of customers that are visiting etc. We’ve identified a gap which if alleviated by leveraging patters within the data, could potentially help hotels curate better solutions and packages to attract more customers.
We plan on conducting a thorough EDA, trying to identify correlations and trends between various factors that can lead to cancellation of bookings or in another case to see in which time periods bookings are rampant as to help hotels analyse and make sense of their booking data.
Through our analysis, we hope to aid hotels with revenue management, find places where they could increase their profits, understand their customer behavior and employ strategies to maximize their profits. Our analysis can also help hotels optimize their costs in places which are not as profitable.

Packages

The packages we are using are tidyverse ,janitor and dplyr. Tidyverse is a collection of R packages designed for data science. All packages share a similar design principle and data structure. The core packages within Tidyverse are ggplot2, dplyr, readr, purr, tibble, stringr and forcats, which help transform, model and visualize data. Dplyr and tidyr are used to solve data manipulation and data tidying challeneges respectively. Dplyr is a package for making tabular data wrangling easier by using a limited set of functions that can be combined to extract and summarize insights from your data. It pairs nicely with tidyr which enables you to swiftly convert between different data formats (long vs. wide) for plotting and analysis. The janitor package has a wide range of functions that facilitate data cleaning and exploration. The package is designed to be compatible with the tidyverse, and can therefore be seamlessly integrated into most data prep workflows

Library	Description
‘tidyverse’	Used for data manipulation.
‘dplyr’	Used for data wrangling & manipulation.
‘ggplot2’	Used for creating data visualizations.
‘janitor’	Facilitates data cleaning and exploration.

# Install Packages
library(tidyverse)
library("feasts")
library("dplyr")
library("skimr")

Data Preparation

This is an open hotel booking demand dataset published on Science Direct website by Antonio, Almeida and Nunes in 2019 and we can obtain it from Github.

Source of the dataset is derived from two hotels (resort & city) located in Portugal. Most overlying purpose of this dataset was to study the revenue management by observing data for the time period 1 July,2015 - 31 Aug,2017. Data is extracted from hotel’s Property Management System SQL database using TSQL queries. There are a total of 32 variables and 119,390 records.

Load csv data

tuesdata <- tidytuesdayR::tt_load(2020, week = 7)
hotels <- tuesdata$hotels
attach(hotels)

Dimensions of data set

dim(hotels)

## [1] 119390     32

Data Cleaning

We can use clean_names() to ensure that all the variables are in same format (snake-case).

hotels %>% 
  janitor::clean_names()

The data has three separate columns for months, year and day. It makes sense for us to combine the three columns into to make our analysis easier.

hotels_1 <- subset(hotels %>% 
  mutate(date = glue::glue("{arrival_date_year}-{arrival_date_month}-{arrival_date_day_of_month}"),
         date = parse_date(date, format = "%Y-%B-%d")) ,
  select = -c(arrival_date_year,arrival_date_month,arrival_date_day_of_month))

For the data, we checked for null values in every column which would give us a sense of what we would have to trim, use or not use from every column.
After checking each column, we could see that there are only 4 Na values under the ‘childrens’ column which would replace with the median.

lapply(hotels_1, function(x) {length(which(is.na(x)))}) # There are 4 NA in the data set

lapply(hotels, function(x) {length(which(is.null(x)))}) # There are no nulls in the data set.

summary(hotels_1$children)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0000  0.0000  0.1039  0.0000 10.0000       4

# Mean and median of children variable is same i.e. 0. Thus, we can replace NA's with 0

hotels_1$children[is.na(hotels_1$children)] <- 0

While exploring unique values we came across a variable which has “NULL” as a character. So we searched more variables and found there are 3 variables that have such scenario.
As mentioned by the author of this data set, customers tend to not give complete information for fields like company, country, agent etc. In such case we will keep the “NULL” value as is.

unique(hotels_1$company)
"NULL" %in% hotels_1$company
"NULL" %in% hotels_1$country
"NULL" %in% hotels_1$agent

There are only two values in the hotel variable and we can convert them into numeirc values which will help us modelling the data in future.

hotels_1$hotel[hotels_1$hotel %in% c('Resort Hotel')] <- 0
hotels_1$hotel[(hotels_1$hotel == 'City Hotel')] <- 1

Skim() gives a summary of data set and a sneak peak into their statistics

After cleaning the data, we get the following:
- Three date columns have been merged into 1 giving us total 30 variables
- Character variables - 12
- Date variables - 2
- Numeric variables - 16
- NA’s are replaced by 0

skim(hotels_1)

Data summary
Name	hotels_1
Number of rows	119390
Number of columns	30
_______________________
Column type frequency:
character	12
Date	2
numeric	16
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
hotel	1	1	1	2
meal	1	2	9	5
country	1	2	4	178
market_segment	1	6	13	8
distribution_channel	1	3	9	5
reserved_room_type	1	1	1	10
assigned_room_type	1	1	1	12
deposit_type	1	10	10	3
agent	1	1	4	334
company	1	1	4	353
customer_type	1	5	15	4
reservation_status	1	7	9	3

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
reservation_status_date	0	1	2014-10-17	2017-09-14	2016-08-07	926
date	0	1	2015-07-01	2017-08-31	2016-09-06	793

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
is_canceled	1	0.37	0.48	0.00	0.00	0.00	1	1	▇▁▁▁▅
lead_time	1	104.01	106.86	0.00	18.00	69.00	160	737	▇▂▁▁▁
arrival_date_week_number	1	27.17	13.61	1.00	16.00	28.00	38	53	▅▇▇▇▅
stays_in_weekend_nights	1	0.93	1.00	0.00	0.00	1.00	2	19	▇▁▁▁▁
stays_in_week_nights	1	2.50	1.91	0.00	1.00	2.00	3	50	▇▁▁▁▁
adults	1	1.86	0.58	0.00	2.00	2.00	2	55	▇▁▁▁▁
children	1	0.10	0.40	0.00	0.00	0.00	0	10	▇▁▁▁▁
babies	1	0.01	0.10	0.00	0.00	0.00	0	10	▇▁▁▁▁
is_repeated_guest	1	0.03	0.18	0.00	0.00	0.00	0	1	▇▁▁▁▁
previous_cancellations	1	0.09	0.84	0.00	0.00	0.00	0	26	▇▁▁▁▁
previous_bookings_not_canceled	1	0.14	1.50	0.00	0.00	0.00	0	72	▇▁▁▁▁
booking_changes	1	0.22	0.65	0.00	0.00	0.00	0	21	▇▁▁▁▁
days_in_waiting_list	1	2.32	17.59	0.00	0.00	0.00	0	391	▇▁▁▁▁
adr	1	101.83	50.54	-6.38	69.29	94.58	126	5400	▇▁▁▁▁
required_car_parking_spaces	1	0.06	0.25	0.00	0.00	0.00	0	8	▇▁▁▁▁
total_of_special_requests	1	0.57	0.79	0.00	0.00	0.00	1	5	▇▁▁▁▁

Proposed Exploratory Data Analysis

Our aim is to look at a couple of trends or relationships between variables that could help hotels strategize better to maximize their revenue.
We aim to look at how advanced booking is being, if there is any relation with a certain time frame during the year when we see peaks or dips of advanced bookings.
We also want to look at the different hotel types and if there is any relationship between the type of the hotel and the number of guests, or during a particular time of the year.
We also want to see if the number of frequent customers for every hotel is cyclical so as to better prepare the hotel for an influx of customers.
We would also like to delve in to the type of rooms with the time of booking to see if there exists a relation or not.
The distribution of bookings between direct bookings and travel agencies could lead to some interesting observations as well.
We also want to look at the wait list through out the year to see if we can find anything interesting.
We would use a variety of plots from scatter plot, bar charts to trend lines to uncover insights that we have referenced on the first point.
We do not know much about modelling in R so we will pick that up as our classes continue.