Demand of Hotels across the Globe


Introduction

Are you planning for a perfect vacation?

Stay tuned with me

We all like vacations, traveling, visiting new places, relaxing, and enjoying. To plan a vacation, one of the most important aspects is to look for the perfect accommodation. For that, we need to study the trend of hotel bookings over time in various countries, so that we know the seasonality of the places, availability of hotels and have a hassle-free vacation.
The hotel industry is an ever-blooming business. Travel and tourism keep on growing every year. I am interested in analyzing and understanding the trend of hotel bookings over time and studying the factors such as the number of cancellations, source of booking, type of hotels, etc. that impact the hotel bookings.

Proposed Plan

To study the trend of hotel bookings, I will be utilizing data from Antonio, Almeida and Nunes, 2019. This data was obtained directly from the hotels’ Property Management System (PMS) databases’ servers by executing a TSQL query on SQL Server Studio Manager, the integrated environment tool for managing Microsoft SQL databases. The data will be imported and studied. Then the data will be cleaned for any invalid/missing values or any other irrelevant columns/rows to fit in our analysis, after which we will perform the analysis on the data.

Analytical Approach

The proposed analytical approach is to first plot the trend of key variables over time to get an insight into how different factors are affecting the hotel bookings. Next, I will plot charts to understand the country-wise popularity in bookings by comparing bookings across various countries. Lastly, I will forecast the total bookings as a whole and total bookings in some countries.

Mission

This analysis of hotel bookings will help the consumers to get an idea of what countries to visit at what time of a year as per the estimated availability of hotels. It will let them know how the bookings have grown/shrunk in the past years and will allow them to form an idea on which countries are the most visited/popular for vacations so that they can plan too in that country. This analysis will also help the hotel industries to better understand the factors affecting the bookings and plan efficiently to provide the best services to the visitors. The prospective hotel industries can also use this analysis to look out for which countries and which type of hotels to build to maximize the profits.



Requirements

Packages Required

The following packages are required to run the code without errors.

Package Name Purpose
library(tidyverse) easy installation of packages
library(rmarkdown) to convert R Markdown documents into a variety of formats
library(countrycode) to convert country codes and country names
library(dplyr) for data manipulation
library(lubridate) to work with date-times and time-spans
library(magrittr) to decrease development time and to improve readability and maintainability of code



Data Import

Data Import

The dataset was extracted from hotels’ Property Management System (PMS) SQL databases and placed in the GitHub location by authors Antonio, Almeida and Nunes in 2019 to develop prediction models to classify a hotel booking’s likelihood to be canceled. Since this is the hotel’s real data, all data elements of the hotel or customer identification were deleted. The data is the consolidated version of two datasets of two different types of hotels. One of the hotels (H1) is a resort hotel and the other is a city hotel (h3). Both datasets share the same structure, with 31 variables describing the 40,060 observations of H1 and 79,330 observations of h3. Each observation represents a hotel booking. Both datasets comprehend bookings due to arrive between the 1st of July of 2015 and the 31st of August 2017, including bookings that effectively arrived and bookings that were canceled.

First, we will import the CSV file and understand the variables and their type.

hotels <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/hotels.csv')
dim(hotels)
## [1] 119390     32

The dataset has 119390 records and 32 variables including the hotel type.
Here is all the data from the hotels dataset.

paged_table(hotels)

There are many columns that are of no use for our data analysis. So we will create a new data frame with only those variables which are of interest for our analysis.

hotel_book <- hotels[,c("hotel",    "is_canceled",  "lead_time",    "arrival_date_year",    "arrival_date_month",   "arrival_date_week_number", "arrival_date_day_of_month",    "stays_in_weekend_nights",  "stays_in_week_nights", "adults",   "children", "babies",   "country"   )]

dim(hotel_book)
## [1] 119390     13

There are 119390 records and 13 variables for our analysis.


Missing Values


We will determine the number of missing values for each variable.

colSums(is.na(hotel_book))
##                     hotel               is_canceled                 lead_time 
##                         0                         0                         0 
##         arrival_date_year        arrival_date_month  arrival_date_week_number 
##                         0                         0                         0 
## arrival_date_day_of_month   stays_in_weekend_nights      stays_in_week_nights 
##                         0                         0                         0 
##                    adults                  children                    babies 
##                         0                         4                         0 
##                   country 
##                         0

We can see that there are only 4 records where there is missing value for variable children, therefore, other 119386 have complete data for all variables. The missing value for character variable is blank, whereas for the numerical variable it is NA. We will not remove these 4 records with missing values since the variable children is not of much importance for predicting the trend of booking in various parts of the world and removing these 4 records will result in loss of values for more important variables.

Data Cleaning

Data Cleaning

There are no duplicate data. Also, there are no invalid or negative values for numerical variables.
We notice that all the country codes in the country column are not of the same type, mostly are iso3c type except for CN which is of iso2c type. Also, the TMP code is not registered in iso3c type and needs explicit transformation. There are 488 NULL values for country, and therefore, cannot be used for analysis as we are not sure if this is a legitimate data.
So we will convert all the country codes to the proper country names and store them in new variable countries for ease of use. We will remove the records with NULL values in countries.

custom_match <- c(CN = 'China', TMP = 'East Timor')
countries <- countrycode(hotel_book$country, origin = 'iso3c', destination = 'country.name',custom_match = custom_match)
## Warning in countrycode(hotel_book$country, origin = "iso3c", destination = "country.name", : Some values were not matched unambiguously: NULL
hotel_book <- cbind(hotel_book[1:12],countries)

We see that all the country codes are converted to its proper country names except NULL values. We have bind this country names column to the dataset and removed the country codes column.

After removing the NULL values, there are 118902 rows left in the dataset.

hotel_book <- hotel_book[complete.cases(hotel_book[ , "countries"]),]
nrow(hotel_book)
## [1] 118902
print(paste(sum(complete.cases(hotel_book)),"Complete cases!"))
## [1] "118898 Complete cases!"

We can see that the complete cases in the dataset are just 4 less than that of the records in the dataset (these 4 are the missing values in children column seen earlier).

Merging Columns to form Arrival Date


We will join the date, month, and year of the arrival in the dataset to form a date variable date_of_arrival.

hotel_book <- hotel_book %>% 
  mutate(
    date_of_arrival = paste(arrival_date_year, arrival_date_month, arrival_date_day_of_month,sep = "-") %>% ymd() %>% as.Date()
  )

Here is the final dataset which we will be using for our analysis.

paged_table(hotel_book)

Data Split

We will divide the dataset into two subsets - one with “Resort Hotel”, other with “City Hotel” for use in forecasting.

resort_hotel <- hotel_book[which(hotel_book$hotel == "Resort Hotel"),]
paged_table(resort_hotel)
city_hotel <- hotel_book[which(hotel_book$hotel == "City Hotel"),]
paged_table(city_hotel)

Data Description

Variable Type Description
hotel character Hotel (H1 = Resort Hotel or h3 = City Hotel)
is_canceled double Value indicating if the booking was canceled (1) or not (0)
lead_time double Number of days that elapsed between the entering date of the booking into the PMS and the arrival date
arrival_date_year double Year of arrival date
arrival_date_month character Month of arrival date
arrival_date_week_number double Week number of year for arrival date
arrival_date_day_of_month double Day of arrival date
stays_in_weekend_nights double Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
stays_in_week_nights double Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
adults double Number of adults
children double Number of children
babies double Number of babies
countries character Country of origin
date_of_arrival date Date of arrival in the hotel

Proposed EDA

Proposed Plan


I will try to find out the total number of bookings against each variable over the years like total bookings per hotel type or total cancelations over the years or weekend vs weekday bookings. I also plan to visualize the most popular countries in terms of bookings to get an insight of which country to plan for vacations. Last, I plan to forecast the bookings.
To do this, I plan to use different ways to look at the data and use it for the best way to answer my questions. For that, I can count the number of records for each year based on filters or categories. I also plan to slice and dice the data by visualizing top 10 countries in terms of bookings. I plan to use dicing in maximum of my visualizations. I will try to incorporate the craetion of new variables and joining of dataframes. I will pick up the findings from each visualizations and consolidate it to answer my quesions.

Types of plots to use


I plan to use different types of plots such as bar, line, dot plots to show the trend of different variables on hotel bookings. I also plan to use map to show the bookings across the world. I will stick to line plot for forecasting since it is good for tracking changes over time.

Questions


  1. Is it mandatory to convert the demography (adults, child, babies) to rows using gather for tidy data, since I am planning not to as it will increase the number of rows by 3 times? Or is it suggestable only in small dataset?
  2. How can we show 3 different variable values in the same plot other than through bars/points?
  3. How can we incorporate a scatter plot in this project?
  4. There is warning for not using title in the markdown. Is it fine?

Is there need of Machine learning


Since I am incorporating forecasting (prediction of hotel bookings) using fable library, I do not plan to use linear regression or discriminant analysis or cluster analysis.


Conclusions