Demand of Hotels across the Globe
We all like vacations, traveling, visiting new places, relaxing, and enjoying. To plan a vacation, one of the most important aspects is to look for the perfect accommodation. For that, we need to study the trend of hotel bookings over time in various countries, so that we know the seasonality of the places, availability of hotels and have a hassle-free vacation.
The hotel industry is an ever-blooming business. Travel and tourism keep on growing every year. I am interested in analyzing and understanding the trend of hotel bookings over time and studying the factors such as the number of cancellations, source of booking, type of hotels, etc. that impact the hotel bookings.
To study the trend of hotel bookings, I will be utilizing data from Antonio, Almeida and Nunes, 2019. This data was obtained directly from the hotels’ Property Management System (PMS) databases’ servers by executing a TSQL query on SQL Server Studio Manager, the integrated environment tool for managing Microsoft SQL databases. The data will be imported and studied. Then the data will be cleaned for any invalid/missing values or any other irrelevant columns/rows to fit in our analysis, after which we will perform the analysis on the data.
The proposed analytical approach is to first plot the trend of key variables over time to get an insight into how different factors are affecting the hotel bookings. Next, I will plot charts to understand the country-wise popularity in bookings by comparing bookings across various countries. Lastly, I will forecast the total bookings as a whole and total bookings in some countries.
This analysis of hotel bookings will help the consumers to get an idea of what countries to visit at what time of a year as per the estimated availability of hotels. It will let them know how the bookings have grown/shrunk in the past years and will allow them to form an idea on which countries are the most visited/popular for vacations so that they can plan too in that country. This analysis will also help the hotel industries to better understand the factors affecting the bookings and plan efficiently to provide the best services to the visitors. The prospective hotel industries can also use this analysis to look out for which countries and which type of hotels to build to maximize the profits.
The following packages are required to run the code without errors.
library(tidyverse)
library(rmarkdown)
library(knitr)
library(plotly)
library(countrycode)
library(ggplot2)
library(dplyr)
library(lubridate)
library(magrittr)
library(feasts)
library(fable)
library(tsibble)
library(tsibbledata)
library(ggrepel)
library(reshape2)
| Package Name | Purpose |
|---|---|
| library(tidyverse) | easy installation of packages |
| library(rmarkdown) | to convert R Markdown documents into a variety of formats |
| library(knitr) | for dynamic report generation |
| library(plotly) | for dynamic plotting |
| library(countrycode) | to convert country codes and country names |
| library(ggplot2) | for declaratively creating graphics |
| library(dplyr) | for data manipulation |
| library(lubridate) | to work with date-times and time-spans |
| library(magrittr) | to decrease development time and to improve readability and maintainability of code |
| library(feasts) | provides a collection of tools for the analysis of time series data |
| library(fable) | used for time-series forecasting |
| library(tsibble) | to easily manipulate and analyse tidy temporal data |
| library(tsibbledata) | provides diverse datasets in the ‘tsibble’ data structure |
| library(ggrepel) | to repel overlapping text labels away from each other and away from the data points that they label |
| library(reshape2) | to transform data between wide and long formats |
The dataset was extracted from hotels’ Property Management System (PMS) SQL databases and placed in the GitHub location by authors Antonio, Almeida and Nunes in 2019 to develop prediction models to classify a hotel booking’s likelihood to be canceled. Since this is the hotel’s real data, all data elements of the hotel or customer identification were deleted. The data is the consolidated version of two datasets of two different types of hotels. One of the hotels (H1) is a resort hotel and the other is a city hotel (h3). Both datasets share the same structure, with 31 variables describing the 40,060 observations of H1 and 79,330 observations of h3. Each observation represents a hotel booking. Both datasets comprehend bookings due to arrive between the 1st of July of 2015 and the 31st of August 2017, including bookings that effectively arrived and bookings that were canceled.
First, we will import the CSV file and understand the variables and their type.
hotels <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/hotels.csv')
dim(hotels)
## [1] 119390 32
The dataset has 119390 records and 32 variables including the hotel type.
Here is all the data from the hotels dataset.
paged_table(hotels)
There are many columns that are of no use for our data analysis. So we will create a new data frame with only those variables which are of interest for our analysis.
hotel_book <- hotels[,c("hotel", "is_canceled", "lead_time", "arrival_date_year", "arrival_date_month", "arrival_date_week_number", "arrival_date_day_of_month", "stays_in_weekend_nights", "stays_in_week_nights", "adults", "children", "babies", "country" )]
dim(hotel_book)
## [1] 119390 13
There are 119390 records and 13 variables for our analysis.
We will determine the number of missing values for each variable.
colSums(is.na(hotel_book))
## hotel is_canceled lead_time
## 0 0 0
## arrival_date_year arrival_date_month arrival_date_week_number
## 0 0 0
## arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights
## 0 0 0
## adults children babies
## 0 4 0
## country
## 0
We can see that there are only 4 records where there is missing value for variable children, therefore, other 119386 have complete data for all variables. The missing value for character variable is blank, whereas for the numerical variable it is NA. We will not remove these 4 records with missing values since the variable children is not of much importance for predicting the trend of booking in various parts of the world and removing these 4 records will result in loss of values for more important variables.
We will check if there are any negative values in the data.
sum(rowMeans(hotel_book < 0), na.rm = TRUE)
## [1] 0
We see that the number of rows with any variable value less than 0, i.e., negative values is zero. Therefore, there is no negative values for any of the variables in the dataset.
We will see if there are any invalid values in the data.
unique(hotel_book$hotel)
## [1] "Resort Hotel" "City Hotel"
unique(hotel_book$is_canceled)
## [1] 0 1
sort(unique(hotel_book$arrival_date_day_of_month))
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31
sort(unique(hotel_book$arrival_date_month))
## [1] "April" "August" "December" "February" "January" "July"
## [7] "June" "March" "May" "November" "October" "September"
sort(unique(hotel_book$arrival_date_week_number))
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
## [51] 51 52 53
sort(unique(hotel_book$arrival_date_year))
## [1] 2015 2016 2017
We observe that there is no spurious or invalid values in the dataset as confirmed from the unique values for the varibles like Hotel type, Is_canceled, Day of month, Month, Week, and Year.
There are no duplicate data. Also, there are no invalid or negative values for numerical variables.
We notice that all the country codes in the country column are not of the same type, mostly are iso3c type except for CN which is of iso2c type. Also, the TMP code is not registered in iso3c type and needs explicit transformation. There are 488 NULL values for country, and therefore, cannot be used for analysis as we are not sure if this is a legitimate data.
So we will convert all the country codes to the proper country names and store them in new variable countries for ease of use. We will remove the records with NULL values in countries.
custom_match <- c(CN = 'China', TMP = 'East Timor')
countries <- countrycode(hotel_book$country, origin = 'iso3c', destination = 'country.name',custom_match = custom_match)
## Warning in countrycode(hotel_book$country, origin = "iso3c", destination = "country.name", : Some values were not matched unambiguously: NULL
hotel_book <- cbind(hotel_book[1:12],countries)
We see that all the country codes are converted to its proper country names except NULL values. We have bind this country names column to the dataset and removed the country codes column.
After removing the NULL values, there are 118902 rows left in the dataset.
hotel_book <- hotel_book[complete.cases(hotel_book[ , "countries"]),]
nrow(hotel_book)
## [1] 118902
print(paste(sum(complete.cases(hotel_book)),"Complete cases!"))
## [1] "118898 Complete cases!"
We can see that the complete cases in the dataset are just 4 less than that of the records in the dataset (these 4 are the missing values in children column seen earlier).
We will join the date, month, and year of the arrival in the dataset to form a date variable date_of_arrival.
hotel_book <- hotel_book %>%
mutate(
date_of_arrival = paste(arrival_date_year, arrival_date_month, arrival_date_day_of_month,sep = "-") %>% ymd() %>% as.Date()
)
Here is the final dataset which we will be using for our analysis.
paged_table(hotel_book)
We will divide the dataset into two subsets - one with “Resort Hotel”, other with “City Hotel” for use in forecasting.
resort_hotel <- hotel_book[which(hotel_book$hotel == "Resort Hotel"),]
paged_table(resort_hotel)
city_hotel <- hotel_book[which(hotel_book$hotel == "City Hotel"),]
paged_table(city_hotel)
| Variable | Type | Description |
|---|---|---|
| hotel | character | Hotel (H1 = Resort Hotel or h3 = City Hotel) |
| is_canceled | double | Value indicating if the booking was canceled (1) or not (0) |
| lead_time | double | Number of days that elapsed between the entering date of the booking into the PMS and the arrival date |
| arrival_date_year | double | Year of arrival date |
| arrival_date_month | character | Month of arrival date |
| arrival_date_week_number | double | Week number of year for arrival date |
| arrival_date_day_of_month | double | Day of arrival date |
| stays_in_weekend_nights | double | Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel |
| stays_in_week_nights | double | Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel |
| adults | double | Number of adults |
| children | double | Number of children |
| babies | double | Number of babies |
| countries | character | Country of origin |
| date_of_arrival | date | Date of arrival in the hotel |
We are plotting key variables against Arrival date (year or month or week or day) to understand the trend of hotel bookings over time. We will be using different colors to fill out the numerical data with a categorical aspect, such as hotel types, demographic range, etc. We will use facet feature of plots to represent various categories of division, such as division by year or by hotel type.
Also, note that the data for years 2015 and 2017 is not complete since the data does not contain earlier and later months for years 2015 and 2017, respectively. Therefore, the unexpected trends like the decrease in count from 2016 to 2017 for the plots can be explained by the inavailabilty of the data after 31st August 2017.
We are plotting the number of total bookings through the bars and the number of total cancelations of bookings through the dots on the bars for both types of hotels over the years.
From this chart, we see that the number of total bookings for City hotel is more than that of Resort hotel. Further, the number of cancelations is more for City hotel than Resort hotel. The trend of bookings and cancelations follow the same pattern for both types as it increases from 2015 to 2016 and then decreases in 2017.
hotel_type <- hotel_book %>% group_by(hotel,arrival_date_year) %>% tally()
canceled <- hotel_book %>% group_by(hotel,arrival_date_year, is_canceled) %>% tally()
canceled <- canceled[canceled$is_canceled == 1,]
total <- merge(hotel_type,canceled,by = c("hotel","arrival_date_year"))
ggplot(canceled, aes(y = n, x = arrival_date_year)) +
geom_bar(data = hotel_type, aes(fill = arrival_date_year), position = "dodge", stat = "identity") +
geom_point(color = "#b3ecff", size = 2) +
geom_line(color = "#b3ecff") +
labs(x = "Year") +
theme_bw() + theme(legend.position = "none", panel.grid.minor = element_blank()) + facet_grid(.~hotel)
Total Bookings before cancelations (bars) and number of cancelations (dots)
The table shows the percentage of booking cancelations and we see that the percentage is higher for City hotel.
cancel <- hotel_book %>%
group_by(hotel,arrival_date_year, is_canceled) %>%
summarise(count = n()) %>%
mutate(percent = (count/sum(count))*100)
drops <- c("is_canceled")
cancel <- cancel[cancel$is_canceled == 1,!(names(cancel) %in% drops)]
kable(head(cancel))
| hotel | arrival_date_year | count | percent |
|---|---|---|---|
| City Hotel | 2015 | 5990 | 43.82820 |
| City Hotel | 2016 | 15403 | 40.39072 |
| City Hotel | 2017 | 11687 | 42.49200 |
| Resort Hotel | 2015 | 2120 | 25.86628 |
| Resort Hotel | 2016 | 4910 | 26.83060 |
| Resort Hotel | 2017 | 4047 | 30.89313 |
The plot shows the total bookings after cancelations and we see that the City hotel exceeds the Resort hotel in all years.
ggplot(total, aes(x = arrival_date_year, y = n.x - n.y, fill = hotel)) +
geom_bar(stat = "identity", position = "dodge", width = 0.7) +
labs(x = "Year", y = "n") +
theme_bw() + theme(legend.position = "top", panel.grid.minor = element_blank())
Total Bookings after cancelations
We plot the demographic trend to get an idea of the popularity of bookings on various demographic division. We plot the number of bookings of hotels for adults, child, and babies across years.
As expected, we see that the bookings for adults outnumber the others by a huge difference.
people <- hotel_book %>% group_by(arrival_date_year) %>% summarise(adult = sum(adults), child = sum(children, na.rm = TRUE), baby = sum(babies))
ggplot(people, aes(arrival_date_year)) +
geom_line(aes(y = adult, color = "adult")) +
geom_line(aes(y = child, color = "child")) +
geom_line(aes(y = baby, color = "baby")) +
labs(x = "Year", y = "n") +
theme_bw() +
theme(legend.position = "top", legend.title = element_blank(), panel.grid.minor = element_blank(), panel.grid.major.x = element_blank(), panel.border = element_rect(color = "gray"))
Demographic bookings over years
We plot the boxplot to show the lead days, i.e., the number of days between the date of booking and the date of arrival in the hotel. The boxplot shows the range along with other parameters like mean, median, quantile ranges, etc. for the lead times in each month of all years for both the types of hotels. The different months are shown using the different colors. We use the plotly library to plot the dynamic chart where one can hover over the chart for more deep insights into the information.
We see that the lead time is less in the start and the end of the year indicating that bookings is not made much prior for arriving in these months than for the other months in the year. On the contrary, the middle months of the year has high lead time indicating that vacations to be planned for these months need much earlier hotel bookings. Also, we see that there are many outliers (black dots) for the start and the end of the year showing there are out-of-trend chances of hotel being booked much prior to the arrival on these months.
On comparing both the types of hotels, the City hotel approximately follows a smooth pattern where lead time intially increases in a year and then decreases towards the end of the year. Whereas, for Resort hotel, the pattern is somewhat not proper and the lead time increases and decreases at irregular pattern.
hotel_book$arrival_date_month <- as.factor(hotel_book$arrival_date_month)
hotel_book$arrival_date_month <- factor(hotel_book$arrival_date_month, levels = c("January","February","March", "April","May","June", "July","August","September","October","November","December"))
p <- ggplot(hotel_book, aes(x = arrival_date_month, y = lead_time, fill = arrival_date_month)) +
geom_boxplot() +
scale_x_discrete(limits = c("January","February","March", "April","May","June", "July","August","September","October","November","December")) +
labs(x = "Arrival Month", y = "Lead Days") +
theme_bw() +
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank(), legend.title = element_blank()) +
facet_grid(hotel~arrival_date_year)
ggplotly(p) %>% layout(legend = list(orientation = "h", y = -0.1))
Lead days summary over arrival months of the year
hotel_book$arrival_date_month <- as.character(hotel_book$arrival_date_month)
We plot the faceted bar chart for each month in the year 2017 to show the number of bookings and its variations for each day of the month.
The chart shows some huge bookings for few days in each month. Overall, August has quite a high number of bookings on each day of the month. There are rare instances where Resort hotel outnumbers the City hotel in the number of bookings such as on Jan 16, 2017 and some on Feb, 2017.
port_visit <- hotel_book %>% group_by(hotel,arrival_date_month,date_of_arrival) %>% filter(date_of_arrival > as.Date("2016-12-31")) %>% tally()
ggplot(transform(port_visit,
arrival_date_month = factor(arrival_date_month,levels = c("January","February","March", "April","May","June", "July","August","September","October","November","December")))) +
geom_bar(aes(x = date_of_arrival, y = n, fill = hotel),stat = "identity", width = 0.7, position = "dodge") +
labs(x = "Date") +
theme_bw() +
theme(legend.position = "top", legend.title = element_blank(), axis.title.y = element_blank()) +
facet_wrap(vars(arrival_date_month), scales = "free")
Bookings on each day of month in 2017
We plot the flipped bar plot to show the number of bookings for each week of the year. We use different colors for different years.
We compare each week’s bookings in different years and in the same year.
We observe that there are large number of bookings in almost all weeks of the year 2017 as compared to the same weeks in 2015 or 2016.
week_bookings <- hotel_book %>% group_by(arrival_date_year, arrival_date_week_number) %>%
tally()
ggplot(week_bookings, aes(x = arrival_date_week_number, y = n, fill = as.factor(arrival_date_year))) +
geom_bar(stat = "identity", width = 0.7, position = "dodge") +
labs(x = "Week Number", y = "n") +
theme_bw() +
theme(legend.position = "none") +
facet_wrap(vars(arrival_date_year)) +
coord_flip()
Bookings in each week of the year
We plot the smoothed dot plot to show the comparison between the weekend and weekday bookings over the years.
We see that the weekday booking is much higher than the weekend bookings for all years indicating that more hotels are booked for stays in weekdays than weekends.
week_day <- hotel_book %>% group_by(arrival_date_year) %>%
summarise(weekends = sum(stays_in_weekend_nights), weekday = sum(stays_in_week_nights))
data <- melt(week_day, id.vars = "arrival_date_year")
ggplot(data, aes(arrival_date_year,value, col = variable)) +
geom_point() +
stat_smooth(method = 'loess') +
labs(x = "Year", y = "n") +
theme_bw() +
theme(legend.position = "top", legend.title = element_blank(), panel.grid.minor.x = element_blank())
Weekend Vs Weekday Bookings over the years
We will plot various charts to understand the countries with most booked hotels and its trend.
We plot the lollipop chart to show the top 10 countries with most booked hotels between July 2015 and August 2017.
We see that the Portugal has highest number of hotels being booked which is more than the double of second highest country, UK.The remaining countries does not have such a huge gap of bookings as with Portugal and differ by vey less number. The top 7 countries and the 10th country are all from Europe including Portugal, UK, France, Germany, Spain, Ireland, Italy, and Belgium. The 8th position is from Asia and the 9th poisition is from South America.
top_countries <- hotel_book %>% group_by(countries) %>% filter(arrival_date_year == 2017) %>%
tally() %>% rename(Bookings = n)
top_countries <- top_countries[order(-top_countries$Bookings),]
top_countries$countries <- factor(top_countries$countries, levels = top_countries$countries[order(-top_countries$Bookings)])
top_countries %>%
slice(1:10) %>%
ggplot(., aes(x = countries, y = Bookings)) +
geom_segment( aes(x = countries, xend = countries, y = 0, yend = Bookings), color = "grey") +
geom_point( color = "orange", size = 4) +
geom_text(aes(x = countries,y = Bookings,label = Bookings),vjust = -0.8) + theme_classic() + theme(axis.line = element_blank(), axis.text.y = element_blank(), axis.ticks = element_blank(), axis.text.x = element_text(vjust = 8), axis.title = element_blank())
Top 10 countries with highest number of hotel bookings in 2017
We plot the faceted dot plot to show the top 3 countries with most hotels booked for each month in a year. The months are indicated using numbers meaning 1 indicates January, 7 being July, and 12 being December.
We see that Portugal is the top country in terms of hotel bookings for each month of the year. Also, the other 2 positions are mostly occupied by countries - UK, France, and Spain. There is only one instance where Italy makes it to the top 3 which is on August 2015 at 3rd position. There are also few instances where Germany is able to make till the top 3.
month_country <- hotel_book %>% group_by(arrival_date_year, arrival_date_month, countries) %>%
tally() %>%
top_n(3, n)
mymonths <- c("January","February","March", "April","May","June", "July","August","September","October","November","December")
month_country$date <- match(month_country$arrival_date_month, mymonths)
month_country <- month_country %>% arrange(date)
month_country$countries <- as.character(month_country$countries)
month_country$countries[month_country$countries == "United Kingdom"] <- "UK"
month_country$countries[month_country$countries == "United States"] <- "US"
month_country$countries <- as.factor(month_country$countries)
ggplot(month_country, aes(x = date, y = n)) + geom_point(color = "#ff8000", size = 3) +
geom_text_repel(label = month_country$countries, color = "#808080", size = 3) +
labs(x = "Month Number") +
scale_x_discrete(limits = seq(1,12,1)) + facet_grid(.~arrival_date_year) + theme_bw() + theme(panel.grid.minor = element_blank())
Top 3 countries with bookings in each month of the year
Since Portugal tops all other countries by huge number in terms of hotel bookings, we plot the line chart over time to show the difference in bookings between Portugal and others (consolidated bookings in all other countries).
We see from the chart that there are indeed a lot of instances where the bookings in Portugal is more than the total bookings in the whole world.
w_bookings <- hotel_book %>% group_by(countries, date_of_arrival) %>%
tally()
port <- w_bookings[w_bookings$countries == "Portugal",]
other <- w_bookings[w_bookings$countries != "Portugal",]
other <- other %>% group_by(date_of_arrival) %>%
summarise(n = sum(n))
other['countries'] = 'Others'
other <- other[c('countries','date_of_arrival','n')]
new <- rbind.data.frame(port,other)
portvsother <- as_tsibble(new, key = countries, index = date_of_arrival)
portvsother <- portvsother %>%
fill_gaps(n = 0)
portvsother %>% autoplot(n, alpha = 0.8) + theme_bw() + theme(panel.grid.minor = element_blank(), legend.position = "top", legend.spacing.x = unit(0.3, 'cm'), legend.title = element_blank()) + labs(x = "Year")
Bookings over time - Portugal Vs Others
We create a heat map to show the bookings acroos the globe in the year 2017. We have eliminated the plotting of Portugal since it has huge count of bookings which makes the map in 1 tone for all the other countries and therefore, difficult to study.
We see from the map that the yellow color indicates highest number of bookings which is for France. Next is shown by green color which includes Spain and Germany. The light blue color shows that China and Brazil has good number of bookings more than most parts of the world.
world_bookings <- hotel_book %>% group_by(countries) %>% filter(arrival_date_year == 2017 & countries != 'Portugal') %>%
tally()
all_states <- map_data("world")
colnames(all_states) <- c("long", "lat", "group", "order", "countries", "subregion")
all_states <- all_states %>% group_by(countries)
stateMap <- left_join(all_states, world_bookings, by = "countries")
g <- ggplot(stateMap, aes(x = long, y = lat, group = group, label = countries)) +
geom_polygon(aes(group = group, fill = n)) +
scale_fill_gradientn(colours = "topo.colors"(4),na.value = "gray50") +
coord_map() + theme_classic() + theme(axis.line = element_blank(),axis.title = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank())
ggplotly(g)
Countries with total bookings except Portugal in 2017
We will use fable library to create various models and predict the bookings in the future. We create three different models, namely, SNAIVE, ETS, and ARIMA, each shown with different colors. We have used resort hotel and city hotel datasets for the forecasting. The data used for prediction is from January 2017 to August 2017, and the prediction is only for the month of September, 2017.
We will forecast the total number of bookings to be made for each hotel type in the month of September, 2017. A horizontal reference line passing through the mean of the bookings for each hotel type is drawn for easy readabilty of the plot.
The prediction for the Resort Hotel using all the three models shows that in most of the days in September, the bookings would be more than the average bookings of 2017 (till August).
resort_forecast <- resort_hotel %>% group_by(date_of_arrival) %>%
tally()
resort_tsbl <- as_tsibble(resort_forecast, key = NULL, index = date_of_arrival)
resort_tsbl <- resort_tsbl %>%
fill_gaps(n = 0)
resort_tsbl <- resort_tsbl %>%
filter(
year(date_of_arrival) > 2016
)
resort_model <- resort_tsbl %>%
model(
snaive = SNAIVE(n),
ets = ETS(n),
arima = ARIMA(n)
)
resort_fore <- resort_model %>%
forecast(h = 15)
resort_fore %>%
autoplot(resort_tsbl, level = NULL) +
xlab("Year") +
guides(colour = guide_legend(title = "Forecast")) +
theme_bw() +
theme(legend.position = "top") +
geom_hline(yintercept = mean(resort_tsbl$n),linetype = "dashed", color = '#ff8000')
Forecast of total bookings for Resort Hotel
The prediction for the City Hotel using all the three models shows that in most of the days in September, the bookings would be less than the average bookings of 2017 (till August).
city_forecast <- city_hotel %>% group_by(date_of_arrival) %>%
tally()
city_tsbl <- as_tsibble(city_forecast, key = NULL, index = date_of_arrival)
city_tsbl <- city_tsbl %>%
fill_gaps(n = 0)
city_tsbl <- city_tsbl %>%
filter(
year(date_of_arrival) > 2016
)
city_model <- city_tsbl %>%
model(
snaive = SNAIVE(n),
ets = ETS(n),
arima = ARIMA(n)
)
city_fore <- city_model %>%
forecast(h = 15)
city_fore %>%
autoplot(city_tsbl, level = NULL) +
xlab("Year") +
guides(colour = guide_legend(title = "Forecast")) +
theme_bw() +
theme(legend.position = "top") +
geom_hline(yintercept = mean(city_tsbl$n),linetype = "dashed", color = '#ff8000')
Forecast of total bookings for City Hotel
We will forecast the total number of bookings to be made in the month of September, 2017 for the top 10 countries with most bookings found earlier.
The prediction for the Resort Hotel using all the three models shows that the country of Belgium, Brazil, China, France, Germany, Ireland, and Italy will face less than the average bookings in September, whereas all the remaining three countries namely Portugal, Spain, and UK will have more number of bookings than the average.
resort_countries <- resort_hotel %>% group_by(countries, date_of_arrival) %>%
tally()
resort_countries_tsbl <- as_tsibble(resort_countries, key = countries, index = date_of_arrival)
resort_countries_tsbl <- resort_countries_tsbl %>%
fill_gaps(n = 0)
vec <- top_countries[1:10,1]
resort_countries_tsbl <- resort_countries_tsbl %>%
filter(
countries %in% vec$countries,
year(date_of_arrival) > 2016
)
top_resort <- resort_countries_tsbl %>%
model(
snaive = SNAIVE(n),
ets = ETS(n),
arima = ARIMA(n)
)
fc_resort <- top_resort %>%
forecast(h = 12)
fc_resort %>%
autoplot(resort_countries_tsbl, level = NULL) +
xlab("Year") +
guides(colour = guide_legend(title = "Forecast")) +
theme_bw() +
theme(legend.position = "top") +
geom_hline(yintercept = mean(resort_countries_tsbl$n),linetype = "dashed", color = '#ff8000')
Forecast of top 10 countries for Resort hotel
The prediction for the City Hotel using all the three models shows that the country of Belgium, Brazil, China, Ireland, Italy, and Spain will face less than the average bookings in September, whereas the countries namely Germany, Portugal, and UK will have more number of bookings than the average. Interestingly, for France, the ARIMA model predicts the bookings to be higher than the average whereas the other two models of ETS and SNAIVE predict the bookings to be lower than the average.
city_countries <- city_hotel %>% group_by(countries, date_of_arrival) %>%
tally()
city_countries_tsbl <- as_tsibble(city_countries, key = countries, index = date_of_arrival)
city_countries_tsbl <- city_countries_tsbl %>%
fill_gaps(n = 0)
city_countries_tsbl <- city_countries_tsbl %>%
filter(
countries %in% vec$countries,
year(date_of_arrival) > 2016
)
top_city <- city_countries_tsbl %>%
model(
snaive = SNAIVE(n),
ets = ETS(n),
arima = ARIMA(n)
)
fc_city <- top_city %>%
forecast(h = 12)
fc_city %>%
autoplot(city_countries_tsbl, level = NULL) +
xlab("Year") +
guides(colour = guide_legend(title = "Forecast")) +
theme_bw() +
theme(legend.position = "top") +
geom_hline(yintercept = mean(city_countries_tsbl$n),linetype = "dashed", color = '#ff8000')
Forecast of top 10 countries for City hotel
This analysis is aimed to help the hotel management to manage their customers appropriately and handle the crowd smoothly as they would know in prior the approximate bookings on that day/month. It will also help the people looking out to travel to plan their vacations accordingly and know the best spots to visit at a certain time.
To provide an overview of hotel demand over time across the globe, we utilized the hotel bookings data and performed data cleaning. Further, we plotted various charts to understand the various variables contributing and showing the detailed view of bookings over time. Then we plotted charts to find the top countries where most hotels were booked. Finally, we have forecasted the bookings for next 1 month to show the number of bookings that can be expected in the next month.
One of the implications for this analysis would be that if you are looking for vacation in Portugal or anywhere in the mid of the year, you should book the hotels much before the day of traveling. Hotel management can recruit more of the employees in the mid of the year to handle more bookings and serve their customers in the best possible way.
The data is not full rather contains only half of the months in years 2015 and 2017. Thus, its better to use full dataset so the comparison is more significant. Also, the data is old and there is no new/current data, therefore, forecast on earlier months is useless for the analysis. To overcome these limitations, we can merge this dataset with other datasets available online which contains more information and is latest so that forecasting makes sense.