2020_BANA7025_agrawals_finalproject.utf8

Introduction

Are you planning for a perfect vacation?

Stay tuned with me

We all like vacations, traveling, visiting new places, relaxing, and enjoying. To plan a vacation, one of the most important aspects is to look for the perfect accommodation. For that, we need to study the trend of hotel bookings over time in various countries, so that we know the seasonality of the places, availability of hotels and have a hassle-free vacation.
The hotel industry is an ever-blooming business. Travel and tourism keep on growing every year. I am interested in analyzing and understanding the trend of hotel bookings over time and studying the factors such as the number of cancellations, source of booking, type of hotels, etc. that impact the hotel bookings.

Proposed Plan

To study the trend of hotel bookings, I will be utilizing data from Antonio, Almeida and Nunes, 2019. This data was obtained directly from the hotels’ Property Management System (PMS) databases’ servers by executing a TSQL query on SQL Server Studio Manager, the integrated environment tool for managing Microsoft SQL databases. The data will be imported and studied. Then the data will be cleaned for any invalid/missing values or any other irrelevant columns/rows to fit in our analysis, after which we will perform the analysis on the data.

Analytical Approach

The proposed analytical approach is to first plot the trend of key variables over time to get an insight into how different factors are affecting the hotel bookings. Next, I will plot charts to understand the country-wise popularity in bookings by comparing bookings across various countries. Lastly, I will forecast the total bookings as a whole and total bookings in some countries.

Mission

This analysis of hotel bookings will help the consumers to get an idea of what countries to visit at what time of a year as per the estimated availability of hotels. It will let them know how the bookings have grown/shrunk in the past years and will allow them to form an idea on which countries are the most visited/popular for vacations so that they can plan too in that country. This analysis will also help the hotel industries to better understand the factors affecting the bookings and plan efficiently to provide the best services to the visitors. The prospective hotel industries can also use this analysis to look out for which countries and which type of hotels to build to maximize the profits.

Requirements

Packages Required

The following packages are required to run the code without errors.

library(tidyverse)
library(rmarkdown)
library(knitr)
library(plotly)   
library(countrycode)
library(ggplot2)    
library(dplyr)
library(lubridate)
library(magrittr)
library(feasts)
library(fable)
library(tsibble)
library(tsibbledata)
library(ggrepel)
library(reshape2)

Package Name	Purpose
library(tidyverse)	easy installation of packages
library(rmarkdown)	to convert R Markdown documents into a variety of formats
library(knitr)	for dynamic report generation
library(plotly)	for dynamic plotting
library(countrycode)	to convert country codes and country names
library(ggplot2)	for declaratively creating graphics
library(dplyr)	for data manipulation
library(lubridate)	to work with date-times and time-spans
library(magrittr)	to decrease development time and to improve readability and maintainability of code
library(feasts)	provides a collection of tools for the analysis of time series data
library(fable)	used for time-series forecasting
library(tsibble)	to easily manipulate and analyse tidy temporal data
library(tsibbledata)	provides diverse datasets in the ‘tsibble’ data structure
library(ggrepel)	to repel overlapping text labels away from each other and away from the data points that they label
library(reshape2)	to transform data between wide and long formats

Data Import

The dataset was extracted from hotels’ Property Management System (PMS) SQL databases and placed in the GitHub location by authors Antonio, Almeida and Nunes in 2019 to develop prediction models to classify a hotel booking’s likelihood to be canceled. Since this is the hotel’s real data, all data elements of the hotel or customer identification were deleted. The data is the consolidated version of two datasets of two different types of hotels. One of the hotels (H1) is a resort hotel and the other is a city hotel (h3). Both datasets share the same structure, with 31 variables describing the 40,060 observations of H1 and 79,330 observations of h3. Each observation represents a hotel booking. Both datasets comprehend bookings due to arrive between the 1st of July of 2015 and the 31st of August 2017, including bookings that effectively arrived and bookings that were canceled.

First, we will import the CSV file and understand the variables and their type.

hotels <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/hotels.csv')
dim(hotels)

## [1] 119390     32

The dataset has 119390 records and 32 variables including the hotel type.
Here is all the data from the hotels dataset.

paged_table(hotels)

There are many columns that are of no use for our data analysis. So we will create a new data frame with only those variables which are of interest for our analysis.

hotel_book <- hotels[,c("hotel",    "is_canceled",  "lead_time",    "arrival_date_year",    "arrival_date_month",   "arrival_date_week_number", "arrival_date_day_of_month",    "stays_in_weekend_nights",  "stays_in_week_nights", "adults",   "children", "babies",   "country"   )]

dim(hotel_book)

## [1] 119390     13

There are 119390 records and 13 variables for our analysis.

Missing Values

We will determine the number of missing values for each variable.

colSums(is.na(hotel_book))

##                     hotel               is_canceled                 lead_time 
##                         0                         0                         0 
##         arrival_date_year        arrival_date_month  arrival_date_week_number 
##                         0                         0                         0 
## arrival_date_day_of_month   stays_in_weekend_nights      stays_in_week_nights 
##                         0                         0                         0 
##                    adults                  children                    babies 
##                         0                         4                         0 
##                   country 
##                         0

We can see that there are only 4 records where there is missing value for variable children, therefore, other 119386 have complete data for all variables. The missing value for character variable is blank, whereas for the numerical variable it is NA. We will not remove these 4 records with missing values since the variable children is not of much importance for predicting the trend of booking in various parts of the world and removing these 4 records will result in loss of values for more important variables.

Negative Values

We will check if there are any negative values in the data.

sum(rowMeans(hotel_book < 0), na.rm = TRUE)

## [1] 0

We see that the number of rows with any variable value less than 0, i.e., negative values is zero. Therefore, there is no negative values for any of the variables in the dataset.

Invalid Values

We will see if there are any invalid values in the data.

unique(hotel_book$hotel)

## [1] "Resort Hotel" "City Hotel"

unique(hotel_book$is_canceled)

## [1] 0 1

sort(unique(hotel_book$arrival_date_day_of_month))

##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31

sort(unique(hotel_book$arrival_date_month))

##  [1] "April"     "August"    "December"  "February"  "January"   "July"     
##  [7] "June"      "March"     "May"       "November"  "October"   "September"

sort(unique(hotel_book$arrival_date_week_number))

##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
## [51] 51 52 53

sort(unique(hotel_book$arrival_date_year))

## [1] 2015 2016 2017

We observe that there is no spurious or invalid values in the dataset as confirmed from the unique values for the varibles like Hotel type, Is_canceled, Day of month, Month, Week, and Year.

Data Cleaning

There are no duplicate data. Also, there are no invalid or negative values for numerical variables.
We notice that all the country codes in the country column are not of the same type, mostly are iso3c type except for CN which is of iso2c type. Also, the TMP code is not registered in iso3c type and needs explicit transformation. There are 488 NULL values for country, and therefore, cannot be used for analysis as we are not sure if this is a legitimate data.
So we will convert all the country codes to the proper country names and store them in new variable countries for ease of use. We will remove the records with NULL values in countries.

custom_match <- c(CN = 'China', TMP = 'East Timor')
countries <- countrycode(hotel_book$country, origin = 'iso3c', destination = 'country.name',custom_match = custom_match)

## Warning in countrycode(hotel_book$country, origin = "iso3c", destination = "country.name", : Some values were not matched unambiguously: NULL

hotel_book <- cbind(hotel_book[1:12],countries)

We see that all the country codes are converted to its proper country names except NULL values. We have bind this country names column to the dataset and removed the country codes column.

After removing the NULL values, there are 118902 rows left in the dataset.

hotel_book <- hotel_book[complete.cases(hotel_book[ , "countries"]),]
nrow(hotel_book)

## [1] 118902

print(paste(sum(complete.cases(hotel_book)),"Complete cases!"))

## [1] "118898 Complete cases!"

We can see that the complete cases in the dataset are just 4 less than that of the records in the dataset (these 4 are the missing values in children column seen earlier).

Merging Columns to form Arrival Date

We will join the date, month, and year of the arrival in the dataset to form a date variable date_of_arrival.

hotel_book <- hotel_book %>% 
  mutate(
    date_of_arrival = paste(arrival_date_year, arrival_date_month, arrival_date_day_of_month,sep = "-") %>% ymd() %>% as.Date()
  )

Here is the final dataset which we will be using for our analysis.

paged_table(hotel_book)

Data Split

We will divide the dataset into two subsets - one with “Resort Hotel”, other with “City Hotel” for use in forecasting.

resort_hotel <- hotel_book[which(hotel_book$hotel == "Resort Hotel"),]
paged_table(resort_hotel)

city_hotel <- hotel_book[which(hotel_book$hotel == "City Hotel"),]
paged_table(city_hotel)

Data Description

Variable	Type	Description
hotel	character	Hotel (H1 = Resort Hotel or h3 = City Hotel)
is_canceled	double	Value indicating if the booking was canceled (1) or not (0)
lead_time	double	Number of days that elapsed between the entering date of the booking into the PMS and the arrival date
arrival_date_year	double	Year of arrival date
arrival_date_month	character	Month of arrival date
arrival_date_week_number	double	Week number of year for arrival date
arrival_date_day_of_month	double	Day of arrival date
stays_in_weekend_nights	double	Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
stays_in_week_nights	double	Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
adults	double	Number of adults
children	double	Number of children
babies	double	Number of babies
countries	character	Country of origin
date_of_arrival	date	Date of arrival in the hotel

Visualizations

Trend in Hotel Bookings over time

We are plotting key variables against Arrival date (year or month or week or day) to understand the trend of hotel bookings over time. We will be using different colors to fill out the numerical data with a categorical aspect, such as hotel types, demographic range, etc. We will use facet feature of plots to represent various categories of division, such as division by year or by hotel type.
Also, note that the data for years 2015 and 2017 is not complete since the data does not contain earlier and later months for years 2015 and 2017, respectively. Therefore, the unexpected trends like the decrease in count from 2016 to 2017 for the plots can be explained by the inavailabilty of the data after 31st August 2017.

Total Bookings with cancelations

We are plotting the number of total bookings through the bars and the number of total cancelations of bookings through the dots on the bars for both types of hotels over the years.
From this chart, we see that the number of total bookings for City hotel is more than that of Resort hotel. Further, the number of cancelations is more for City hotel than Resort hotel. The trend of bookings and cancelations follow the same pattern for both types as it increases from 2015 to 2016 and then decreases in 2017.

hotel_type <- hotel_book %>% group_by(hotel,arrival_date_year) %>% tally()

canceled <- hotel_book %>% group_by(hotel,arrival_date_year, is_canceled) %>% tally()
canceled <- canceled[canceled$is_canceled == 1,] 
  
total <- merge(hotel_type,canceled,by = c("hotel","arrival_date_year"))
   
ggplot(canceled, aes(y = n, x = arrival_date_year)) + 
    geom_bar(data = hotel_type, aes(fill = arrival_date_year), position = "dodge", stat = "identity") +
    geom_point(color = "#b3ecff", size = 2) +
    geom_line(color = "#b3ecff") + 
    labs(x = "Year") + 
    theme_bw() + theme(legend.position = "none", panel.grid.minor = element_blank()) + facet_grid(.~hotel)

Total Bookings before cancelations (bars) and number of cancelations (dots)

The table shows the percentage of booking cancelations and we see that the percentage is higher for City hotel.

cancel <-  hotel_book %>% 
    group_by(hotel,arrival_date_year, is_canceled) %>% 
    summarise(count = n()) %>% 
    mutate(percent = (count/sum(count))*100)
drops <- c("is_canceled")
cancel <- cancel[cancel$is_canceled == 1,!(names(cancel) %in% drops)] 

kable(head(cancel))

hotel	arrival_date_year	count	percent
City Hotel	2015	5990	43.82820
City Hotel	2016	15403	40.39072
City Hotel	2017	11687	42.49200
Resort Hotel	2015	2120	25.86628
Resort Hotel	2016	4910	26.83060
Resort Hotel	2017	4047	30.89313

The plot shows the total bookings after cancelations and we see that the City hotel exceeds the Resort hotel in all years.

ggplot(total, aes(x = arrival_date_year, y = n.x - n.y, fill = hotel)) +  
    geom_bar(stat = "identity", position = "dodge", width = 0.7) + 
    labs(x = "Year", y = "n") + 
    theme_bw() + theme(legend.position = "top", panel.grid.minor = element_blank())

Total Bookings after cancelations

Demographic trend

We plot the demographic trend to get an idea of the popularity of bookings on various demographic division. We plot the number of bookings of hotels for adults, child, and babies across years.
As expected, we see that the bookings for adults outnumber the others by a huge difference.

people <- hotel_book %>% group_by(arrival_date_year) %>% summarise(adult = sum(adults), child = sum(children, na.rm = TRUE), baby = sum(babies))
ggplot(people, aes(arrival_date_year)) +
    geom_line(aes(y = adult, color = "adult")) +
    geom_line(aes(y = child, color = "child")) + 
    geom_line(aes(y = baby, color = "baby")) + 
    labs(x = "Year", y = "n") + 
    theme_bw() + 
    theme(legend.position = "top", legend.title = element_blank(), panel.grid.minor = element_blank(), panel.grid.major.x = element_blank(), panel.border = element_rect(color = "gray"))

Demographic bookings over years

Lead Time

We plot the boxplot to show the lead days, i.e., the number of days between the date of booking and the date of arrival in the hotel. The boxplot shows the range along with other parameters like mean, median, quantile ranges, etc. for the lead times in each month of all years for both the types of hotels. The different months are shown using the different colors. We use the plotly library to plot the dynamic chart where one can hover over the chart for more deep insights into the information.
We see that the lead time is less in the start and the end of the year indicating that bookings is not made much prior for arriving in these months than for the other months in the year. On the contrary, the middle months of the year has high lead time indicating that vacations to be planned for these months need much earlier hotel bookings. Also, we see that there are many outliers (black dots) for the start and the end of the year showing there are out-of-trend chances of hotel being booked much prior to the arrival on these months.
On comparing both the types of hotels, the City hotel approximately follows a smooth pattern where lead time intially increases in a year and then decreases towards the end of the year. Whereas, for Resort hotel, the pattern is somewhat not proper and the lead time increases and decreases at irregular pattern.

hotel_book$arrival_date_month <- as.factor(hotel_book$arrival_date_month)
hotel_book$arrival_date_month <- factor(hotel_book$arrival_date_month, levels = c("January","February","March", "April","May","June", "July","August","September","October","November","December"))

p <- ggplot(hotel_book, aes(x = arrival_date_month, y = lead_time, fill = arrival_date_month)) + 
    geom_boxplot() + 
    scale_x_discrete(limits = c("January","February","March", "April","May","June", "July","August","September","October","November","December")) + 
    labs(x = "Arrival Month", y = "Lead Days") + 
    theme_bw() + 
    theme(axis.text.x = element_blank(), axis.ticks.x = element_blank(), legend.title = element_blank()) + 
    facet_grid(hotel~arrival_date_year) 

ggplotly(p) %>% layout(legend = list(orientation = "h", y = -0.1))

Lead days summary over arrival months of the year

hotel_book$arrival_date_month <- as.character(hotel_book$arrival_date_month)

Date-wise in 2017

We plot the faceted bar chart for each month in the year 2017 to show the number of bookings and its variations for each day of the month.
The chart shows some huge bookings for few days in each month. Overall, August has quite a high number of bookings on each day of the month. There are rare instances where Resort hotel outnumbers the City hotel in the number of bookings such as on Jan 16, 2017 and some on Feb, 2017.

port_visit <- hotel_book %>% group_by(hotel,arrival_date_month,date_of_arrival) %>% filter(date_of_arrival > as.Date("2016-12-31")) %>% tally()
  
ggplot(transform(port_visit,
                   arrival_date_month = factor(arrival_date_month,levels = c("January","February","March", "April","May","June", "July","August","September","October","November","December")))) +  
    geom_bar(aes(x = date_of_arrival, y = n, fill = hotel),stat = "identity", width = 0.7, position = "dodge")  + 
    labs(x = "Date") + 
    theme_bw() + 
    theme(legend.position = "top", legend.title = element_blank(), axis.title.y = element_blank()) + 
    facet_wrap(vars(arrival_date_month), scales = "free")

Bookings on each day of month in 2017

Week-wise Bookings

We plot the flipped bar plot to show the number of bookings for each week of the year. We use different colors for different years.
We compare each week’s bookings in different years and in the same year.
We observe that there are large number of bookings in almost all weeks of the year 2017 as compared to the same weeks in 2015 or 2016.

week_bookings <- hotel_book %>% group_by(arrival_date_year, arrival_date_week_number) %>% 
    tally()  

ggplot(week_bookings, aes(x = arrival_date_week_number, y = n, fill = as.factor(arrival_date_year))) +  
    geom_bar(stat = "identity", width = 0.7, position = "dodge") + 
    labs(x = "Week Number", y = "n") + 
    theme_bw() + 
    theme(legend.position = "none") + 
    facet_wrap(vars(arrival_date_year)) + 
    coord_flip()

Bookings in each week of the year

Weekend Vs Weekday bookings

We plot the smoothed dot plot to show the comparison between the weekend and weekday bookings over the years.
We see that the weekday booking is much higher than the weekend bookings for all years indicating that more hotels are booked for stays in weekdays than weekends.

week_day <- hotel_book %>% group_by(arrival_date_year) %>% 
    summarise(weekends = sum(stays_in_weekend_nights), weekday = sum(stays_in_week_nights)) 

data <- melt(week_day, id.vars = "arrival_date_year")

ggplot(data, aes(arrival_date_year,value, col = variable)) + 
    geom_point() + 
    stat_smooth(method = 'loess') + 
    labs(x = "Year", y = "n") + 
    theme_bw() + 
    theme(legend.position = "top", legend.title = element_blank(), panel.grid.minor.x = element_blank())

Weekend Vs Weekday Bookings over the years

Trend in Hotel Bookings across the Globe

We will plot various charts to understand the countries with most booked hotels and its trend.

Top 10 countries

We plot the lollipop chart to show the top 10 countries with most booked hotels between July 2015 and August 2017.
We see that the Portugal has highest number of hotels being booked which is more than the double of second highest country, UK.The remaining countries does not have such a huge gap of bookings as with Portugal and differ by vey less number. The top 7 countries and the 10^th country are all from Europe including Portugal, UK, France, Germany, Spain, Ireland, Italy, and Belgium. The 8^th position is from Asia and the 9^th poisition is from South America.

top_countries <- hotel_book %>% group_by(countries) %>% filter(arrival_date_year == 2017) %>%
  tally() %>% rename(Bookings = n)
top_countries <- top_countries[order(-top_countries$Bookings),]
top_countries$countries <- factor(top_countries$countries, levels = top_countries$countries[order(-top_countries$Bookings)])
top_countries %>%
slice(1:10) %>%
ggplot(., aes(x = countries, y = Bookings)) +
  geom_segment( aes(x = countries, xend = countries, y = 0, yend = Bookings), color = "grey") +
  geom_point( color = "orange", size = 4) +
  geom_text(aes(x = countries,y = Bookings,label = Bookings),vjust = -0.8) + theme_classic() + theme(axis.line = element_blank(), axis.text.y = element_blank(), axis.ticks = element_blank(), axis.text.x = element_text(vjust = 8), axis.title = element_blank())

Top 10 countries with highest number of hotel bookings in 2017

Top 3 countries over time

We plot the faceted dot plot to show the top 3 countries with most hotels booked for each month in a year. The months are indicated using numbers meaning 1 indicates January, 7 being July, and 12 being December.
We see that Portugal is the top country in terms of hotel bookings for each month of the year. Also, the other 2 positions are mostly occupied by countries - UK, France, and Spain. There is only one instance where Italy makes it to the top 3 which is on August 2015 at 3^rd position. There are also few instances where Germany is able to make till the top 3.

month_country <- hotel_book %>% group_by(arrival_date_year, arrival_date_month, countries) %>% 
  tally() %>%  
  top_n(3, n) 
mymonths <- c("January","February","March", "April","May","June", "July","August","September","October","November","December")
  month_country$date <- match(month_country$arrival_date_month, mymonths)
month_country <- month_country %>%  arrange(date)
month_country$countries <- as.character(month_country$countries)
month_country$countries[month_country$countries == "United Kingdom"] <- "UK"
month_country$countries[month_country$countries == "United States"] <- "US"
month_country$countries <- as.factor(month_country$countries)

ggplot(month_country, aes(x = date, y = n)) + geom_point(color = "#ff8000", size = 3) + 
    geom_text_repel(label = month_country$countries, color = "#808080", size = 3) + 
    labs(x = "Month Number") + 
    scale_x_discrete(limits = seq(1,12,1)) + facet_grid(.~arrival_date_year) + theme_bw() + theme(panel.grid.minor = element_blank())

Top 3 countries with bookings in each month of the year

Portugal Vs Others

Since Portugal tops all other countries by huge number in terms of hotel bookings, we plot the line chart over time to show the difference in bookings between Portugal and others (consolidated bookings in all other countries).
We see from the chart that there are indeed a lot of instances where the bookings in Portugal is more than the total bookings in the whole world.

w_bookings <- hotel_book %>% group_by(countries, date_of_arrival) %>% 
  tally()
port <- w_bookings[w_bookings$countries == "Portugal",]
other <- w_bookings[w_bookings$countries != "Portugal",]
other <- other %>% group_by(date_of_arrival) %>%  
  summarise(n = sum(n))
other['countries'] = 'Others'
other <- other[c('countries','date_of_arrival','n')]
new <- rbind.data.frame(port,other)
portvsother <- as_tsibble(new, key = countries, index = date_of_arrival)
portvsother <- portvsother %>%
  fill_gaps(n = 0)
portvsother %>% autoplot(n, alpha = 0.8) + theme_bw() + theme(panel.grid.minor = element_blank(), legend.position = "top", legend.spacing.x = unit(0.3, 'cm'), legend.title = element_blank()) + labs(x = "Year")

Bookings over time - Portugal Vs Others

Countries in 2017 (except Portugal)

We create a heat map to show the bookings acroos the globe in the year 2017. We have eliminated the plotting of Portugal since it has huge count of bookings which makes the map in 1 tone for all the other countries and therefore, difficult to study.
We see from the map that the yellow color indicates highest number of bookings which is for France. Next is shown by green color which includes Spain and Germany. The light blue color shows that China and Brazil has good number of bookings more than most parts of the world.

world_bookings <- hotel_book %>% group_by(countries) %>% filter(arrival_date_year == 2017 & countries != 'Portugal') %>%
  tally()
all_states <- map_data("world")
colnames(all_states) <- c("long", "lat", "group", "order", "countries", "subregion")
all_states <- all_states %>% group_by(countries)

stateMap <- left_join(all_states, world_bookings, by = "countries")

g <- ggplot(stateMap, aes(x = long, y = lat, group = group, label = countries)) +
  geom_polygon(aes(group = group, fill = n)) +
  scale_fill_gradientn(colours = "topo.colors"(4),na.value = "gray50") +
  coord_map() + theme_classic() + theme(axis.line = element_blank(),axis.title = element_blank(),
                                        axis.text = element_blank(),
                                        axis.ticks = element_blank())
ggplotly(g)

Countries with total bookings except Portugal in 2017

Forecast

We will use fable library to create various models and predict the bookings in the future. We create three different models, namely, SNAIVE, ETS, and ARIMA, each shown with different colors. We have used resort hotel and city hotel datasets for the forecasting. The data used for prediction is from January 2017 to August 2017, and the prediction is only for the month of September, 2017.

Forecasting of total bookings for types of hotels

We will forecast the total number of bookings to be made for each hotel type in the month of September, 2017. A horizontal reference line passing through the mean of the bookings for each hotel type is drawn for easy readabilty of the plot.

Resort Hotel

The prediction for the Resort Hotel using all the three models shows that in most of the days in September, the bookings would be more than the average bookings of 2017 (till August).

resort_forecast <- resort_hotel %>% group_by(date_of_arrival) %>% 
  tally()

resort_tsbl <- as_tsibble(resort_forecast, key = NULL, index = date_of_arrival)
resort_tsbl <- resort_tsbl %>%
  fill_gaps(n = 0)

resort_tsbl <- resort_tsbl %>% 
  filter(
    year(date_of_arrival) > 2016
  ) 


resort_model <- resort_tsbl %>%
  model(
    snaive = SNAIVE(n),
    ets = ETS(n),
    arima = ARIMA(n)
  )

resort_fore <- resort_model %>%
  forecast(h = 15)

resort_fore %>%
  autoplot(resort_tsbl, level = NULL) +
  xlab("Year") +
  guides(colour = guide_legend(title = "Forecast"))  + 
  theme_bw() + 
  theme(legend.position = "top") +
  geom_hline(yintercept = mean(resort_tsbl$n),linetype = "dashed", color = '#ff8000')

Forecast of total bookings for Resort Hotel

City Hotel

The prediction for the City Hotel using all the three models shows that in most of the days in September, the bookings would be less than the average bookings of 2017 (till August).

city_forecast <- city_hotel %>% group_by(date_of_arrival) %>% 
  tally()

city_tsbl <- as_tsibble(city_forecast, key = NULL, index = date_of_arrival)
city_tsbl <- city_tsbl %>%
  fill_gaps(n = 0)

city_tsbl <- city_tsbl %>% 
  filter(
    year(date_of_arrival) > 2016
  ) 


city_model <- city_tsbl %>%
  model(
    snaive = SNAIVE(n),
    ets = ETS(n),
    arima = ARIMA(n)
  )

city_fore <- city_model %>%
  forecast(h = 15)

city_fore %>%
  autoplot(city_tsbl, level = NULL) +
  xlab("Year") +
  guides(colour = guide_legend(title = "Forecast"))  + 
  theme_bw() + 
  theme(legend.position = "top") + 
  geom_hline(yintercept = mean(city_tsbl$n),linetype = "dashed", color = '#ff8000')

Forecast of total bookings for City Hotel

Forecasting of total bookings for types of hotels

We will forecast the total number of bookings to be made in the month of September, 2017 for the top 10 countries with most bookings found earlier.

Resort Hotel

The prediction for the Resort Hotel using all the three models shows that the country of Belgium, Brazil, China, France, Germany, Ireland, and Italy will face less than the average bookings in September, whereas all the remaining three countries namely Portugal, Spain, and UK will have more number of bookings than the average.

resort_countries <- resort_hotel %>% group_by(countries, date_of_arrival) %>% 
  tally()
resort_countries_tsbl <- as_tsibble(resort_countries, key = countries, index = date_of_arrival)
resort_countries_tsbl <- resort_countries_tsbl %>%
  fill_gaps(n = 0)
vec <- top_countries[1:10,1]
resort_countries_tsbl <- resort_countries_tsbl %>% 
  filter(
    countries %in% vec$countries,
    year(date_of_arrival) > 2016
  ) 

top_resort <- resort_countries_tsbl %>%
  model(
    snaive = SNAIVE(n),
    ets = ETS(n),
    arima = ARIMA(n)
  )

fc_resort <- top_resort %>%
  forecast(h = 12)

fc_resort %>%
  autoplot(resort_countries_tsbl, level = NULL) + 
  xlab("Year") +
  guides(colour = guide_legend(title = "Forecast")) + 
  theme_bw() + 
  theme(legend.position = "top")  +
  geom_hline(yintercept = mean(resort_countries_tsbl$n),linetype = "dashed", color = '#ff8000')

Forecast of top 10 countries for Resort hotel

City Hotel

The prediction for the City Hotel using all the three models shows that the country of Belgium, Brazil, China, Ireland, Italy, and Spain will face less than the average bookings in September, whereas the countries namely Germany, Portugal, and UK will have more number of bookings than the average. Interestingly, for France, the ARIMA model predicts the bookings to be higher than the average whereas the other two models of ETS and SNAIVE predict the bookings to be lower than the average.

city_countries <- city_hotel %>% group_by(countries, date_of_arrival) %>% 
  tally()
city_countries_tsbl <- as_tsibble(city_countries, key = countries, index = date_of_arrival)
city_countries_tsbl <- city_countries_tsbl %>%
  fill_gaps(n = 0)

city_countries_tsbl <- city_countries_tsbl %>% 
  filter(
    countries %in% vec$countries,
    year(date_of_arrival) > 2016
  ) 

top_city <- city_countries_tsbl %>%
  model(
    snaive = SNAIVE(n),
    ets = ETS(n),
    arima = ARIMA(n)
  )

fc_city <- top_city %>%
  forecast(h = 12)

fc_city %>%
  autoplot(city_countries_tsbl, level = NULL) + 
  xlab("Year") +
  guides(colour = guide_legend(title = "Forecast")) + 
  theme_bw() + 
  theme(legend.position = "top")  +
  geom_hline(yintercept = mean(city_countries_tsbl$n),linetype = "dashed", color = '#ff8000')

Forecast of top 10 countries for City hotel

Conclusions

Problem Statement Addressed

This analysis is aimed to help the hotel management to manage their customers appropriately and handle the crowd smoothly as they would know in prior the approximate bookings on that day/month. It will also help the people looking out to travel to plan their vacations accordingly and know the best spots to visit at a certain time.

Methodology

To provide an overview of hotel demand over time across the globe, we utilized the hotel bookings data and performed data cleaning. Further, we plotted various charts to understand the various variables contributing and showing the detailed view of bookings over time. Then we plotted charts to find the top countries where most hotels were booked. Finally, we have forecasted the bookings for next 1 month to show the number of bookings that can be expected in the next month.

Insights

Although, the City hotel is more popular than the Resort hotel in terms of bookings yet there are less percentage of cancelations for Resort hotel than City hotel.
Generally, the bookings for the arrival in May to October is done in much prior than for the remaining months.
Surprisingly, people staying in hotel on weekdays is much higher than on the weekends.
There are many instances where the number of hotels booked in Portugal outnumbers the total bookings across the world.
8 of the top 10 countries with most hotel booked are from Europe.

Implications

One of the implications for this analysis would be that if you are looking for vacation in Portugal or anywhere in the mid of the year, you should book the hotels much before the day of traveling. Hotel management can recruit more of the employees in the mid of the year to handle more bookings and serve their customers in the best possible way.

Limitations

The data is not full rather contains only half of the months in years 2015 and 2017. Thus, its better to use full dataset so the comparison is more significant. Also, the data is old and there is no new/current data, therefore, forecast on earlier months is useless for the analysis. To overcome these limitations, we can merge this dataset with other datasets available online which contains more information and is latest so that forecasting makes sense.