Demand of Hotels across the Globe
We all like vacations, traveling, visiting new places, relaxing, and enjoying. To plan a vacation, one of the most important aspects is to look for the perfect accommodation. For that, we need to study the trend of hotel bookings over time in various countries, so that we know the seasonality of the places, availability of hotels and have a hassle-free vacation.
The hotel industry is an ever-blooming business. Travel and tourism keep on growing every year. I am interested in analyzing and understanding the trend of hotel bookings over time and studying the factors such as the number of cancellations, source of booking, type of hotels, etc. that impact the hotel bookings.
To study the trend of hotel bookings, I will be utilizing data from Antonio, Almeida and Nunes, 2019. This data was obtained directly from the hotels’ Property Management System (PMS) databases’ servers by executing a TSQL query on SQL Server Studio Manager, the integrated environment tool for managing Microsoft SQL databases. The data will be imported and studied. Then the data will be cleaned for any invalid/missing values or any other irrelevant columns/rows to fit in our analysis, after which we will perform the analysis on the data.
The proposed analytical approach is to first plot the trend of key variables over time to get an insight into how different factors are affecting the hotel bookings. Next, I will plot charts to understand the country-wise popularity in bookings by comparing bookings across various countries. Lastly, I will forecast the total bookings as a whole and total bookings in some countries.
This analysis of hotel bookings will help the consumers to get an idea of what countries to visit at what time of a year as per the estimated availability of hotels. It will let them know how the bookings have grown/shrunk in the past years and will allow them to form an idea on which countries are the most visited/popular for vacations so that they can plan too in that country. This analysis will also help the hotel industries to better understand the factors affecting the bookings and plan efficiently to provide the best services to the visitors. The prospective hotel industries can also use this analysis to look out for which countries and which type of hotels to build to maximize the profits.
The following packages are required to run the code without errors.
| Package Name | Purpose |
|---|---|
| library(tidyverse) | easy installation of packages |
| library(rmarkdown) | to convert R Markdown documents into a variety of formats |
| library(countrycode) | to convert country codes and country names |
| library(dplyr) | for data manipulation |
| library(lubridate) | to work with date-times and time-spans |
| library(magrittr) | to decrease development time and to improve readability and maintainability of code |
The dataset was extracted from hotels’ Property Management System (PMS) SQL databases and placed in the GitHub location by authors Antonio, Almeida and Nunes in 2019 to develop prediction models to classify a hotel booking’s likelihood to be canceled. Since this is the hotel’s real data, all data elements of the hotel or customer identification were deleted. The data is the consolidated version of two datasets of two different types of hotels. One of the hotels (H1) is a resort hotel and the other is a city hotel (h3). Both datasets share the same structure, with 31 variables describing the 40,060 observations of H1 and 79,330 observations of h3. Each observation represents a hotel booking. Both datasets comprehend bookings due to arrive between the 1st of July of 2015 and the 31st of August 2017, including bookings that effectively arrived and bookings that were canceled.
First, we will import the CSV file and understand the variables and their type.
hotels <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/hotels.csv')
dim(hotels)
## [1] 119390 32
The dataset has 119390 records and 32 variables including the hotel type.
Here is all the data from the hotels dataset.
paged_table(hotels)
There are many columns that are of no use for our data analysis. So we will create a new data frame with only those variables which are of interest for our analysis.
hotel_book <- hotels[,c("hotel", "is_canceled", "lead_time", "arrival_date_year", "arrival_date_month", "arrival_date_week_number", "arrival_date_day_of_month", "stays_in_weekend_nights", "stays_in_week_nights", "adults", "children", "babies", "country" )]
dim(hotel_book)
## [1] 119390 13
There are 119390 records and 13 variables for our analysis.
We will determine the number of missing values for each variable.
colSums(is.na(hotel_book))
## hotel is_canceled lead_time
## 0 0 0
## arrival_date_year arrival_date_month arrival_date_week_number
## 0 0 0
## arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights
## 0 0 0
## adults children babies
## 0 4 0
## country
## 0
We can see that there are only 4 records where there is missing value for variable children, therefore, other 119386 have complete data for all variables. The missing value for character variable is blank, whereas for the numerical variable it is NA. We will not remove these 4 records with missing values since the variable children is not of much importance for predicting the trend of booking in various parts of the world and removing these 4 records will result in loss of values for more important variables.
There are no duplicate data. Also, there are no invalid or negative values for numerical variables.
We notice that all the country codes in the country column are not of the same type, mostly are iso3c type except for CN which is of iso2c type. Also, the TMP code is not registered in iso3c type and needs explicit transformation. There are 488 NULL values for country, and therefore, cannot be used for analysis as we are not sure if this is a legitimate data.
So we will convert all the country codes to the proper country names and store them in new variable countries for ease of use. We will remove the records with NULL values in countries.
custom_match <- c(CN = 'China', TMP = 'East Timor')
countries <- countrycode(hotel_book$country, origin = 'iso3c', destination = 'country.name',custom_match = custom_match)
## Warning in countrycode(hotel_book$country, origin = "iso3c", destination = "country.name", : Some values were not matched unambiguously: NULL
hotel_book <- cbind(hotel_book[1:12],countries)
We see that all the country codes are converted to its proper country names except NULL values. We have bind this country names column to the dataset and removed the country codes column.
After removing the NULL values, there are 118902 rows left in the dataset.
hotel_book <- hotel_book[complete.cases(hotel_book[ , "countries"]),]
nrow(hotel_book)
## [1] 118902
print(paste(sum(complete.cases(hotel_book)),"Complete cases!"))
## [1] "118898 Complete cases!"
We can see that the complete cases in the dataset are just 4 less than that of the records in the dataset (these 4 are the missing values in children column seen earlier).
We will join the date, month, and year of the arrival in the dataset to form a date variable date_of_arrival.
hotel_book <- hotel_book %>%
mutate(
date_of_arrival = paste(arrival_date_year, arrival_date_month, arrival_date_day_of_month,sep = "-") %>% ymd() %>% as.Date()
)
Here is the final dataset which we will be using for our analysis.
paged_table(hotel_book)
We will divide the dataset into two subsets - one with “Resort Hotel”, other with “City Hotel” for use in forecasting.
resort_hotel <- hotel_book[which(hotel_book$hotel == "Resort Hotel"),]
paged_table(resort_hotel)
city_hotel <- hotel_book[which(hotel_book$hotel == "City Hotel"),]
paged_table(city_hotel)
| Variable | Type | Description |
|---|---|---|
| hotel | character | Hotel (H1 = Resort Hotel or h3 = City Hotel) |
| is_canceled | double | Value indicating if the booking was canceled (1) or not (0) |
| lead_time | double | Number of days that elapsed between the entering date of the booking into the PMS and the arrival date |
| arrival_date_year | double | Year of arrival date |
| arrival_date_month | character | Month of arrival date |
| arrival_date_week_number | double | Week number of year for arrival date |
| arrival_date_day_of_month | double | Day of arrival date |
| stays_in_weekend_nights | double | Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel |
| stays_in_week_nights | double | Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel |
| adults | double | Number of adults |
| children | double | Number of children |
| babies | double | Number of babies |
| countries | character | Country of origin |
| date_of_arrival | date | Date of arrival in the hotel |
I will try to find out the total number of bookings against each variable over the years like total bookings per hotel type or total cancelations over the years or weekend vs weekday bookings. I also plan to visualize the most popular countries in terms of bookings to get an insight of which country to plan for vacations. Last, I plan to forecast the bookings.
To do this, I plan to use different ways to look at the data and use it for the best way to answer my questions. For that, I can count the number of records for each year based on filters or categories. I also plan to slice and dice the data by visualizing top 10 countries in terms of bookings. I plan to use dicing in maximum of my visualizations. I will try to incorporate the craetion of new variables and joining of dataframes. I will pick up the findings from each visualizations and consolidate it to answer my quesions.
I plan to use different types of plots such as bar, line, dot plots to show the trend of different variables on hotel bookings. I also plan to use map to show the bookings across the world. I will stick to line plot for forecasting since it is good for tracking changes over time.
Since I am incorporating forecasting (prediction of hotel bookings) using fable library, I do not plan to use linear regression or discriminant analysis or cluster analysis.