The hotel industry plays a critical role in the success of the hospitality and tourism sectors. As competition in the market grows, hotel business owners need to understand their hotel’s performance and develop effective marketing strategies to increase visibility.
To achieve this, hotels need to analyze historical trends and customer behavior to gain valuable insights and inform data-driven marketing campaigns. By leveraging this information and implementing well-planned marketing strategies, hotels can boost their performance and revenue, ensuring continued success in the highly competitive hospitality industry.
Data Collection & Preparation.
Data Cleaning & Integrity.
Data Analysis.
Exploratory Data Analysis.
Data Visualization.
Insight and Recommendation.
Analysis summary.
R programming language = for data cleaning & analysis.
R markdown = for executing the final report.
Method = Exploratory Data Analysis.
The data originally comes from an open dataset of hotel booking in Portugal from Antonio, Almeida and Nunes, 2019 and was downloaded from the Kaggle public data set, hotel_bookings.csv, click this link.
The hotel bookings data contains:
119,210 entries.
32 columns.
tidyverse : This package consists of 3 most important for this project:
ggplot2: Used for creating powerful visualizations.
dplyr: Used for data manipulation.
tidyr: Used for data modifications.
Load packages
library(vctrs)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:vctrs':
##
## data_frame
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
hotel_bookings <- read.csv("~/hotel_bookings.csv")
View(hotel_bookings)
str(hotel_bookings) ## to display the structure and content of the data set.
## 'data.frame': 119390 obs. of 32 variables:
## $ hotel : chr "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
## $ is_canceled : int 0 0 0 0 0 0 0 0 1 1 ...
## $ lead_time : int 342 737 7 13 14 14 0 9 85 75 ...
## $ arrival_date_year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ arrival_date_month : chr "July" "July" "July" "July" ...
## $ arrival_date_week_number : int 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : int 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : int 2 2 1 1 2 2 2 2 2 2 ...
## $ children : int 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : int 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : chr "BB" "BB" "BB" "BB" ...
## $ country : chr "PRT" "PRT" "GBR" "GBR" ...
## $ market_segment : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ distribution_channel : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ is_repeated_guest : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: int 0 0 0 0 0 0 0 0 0 0 ...
## $ reserved_room_type : chr "C" "C" "A" "A" ...
## $ assigned_room_type : chr "C" "C" "C" "A" ...
## $ booking_changes : int 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : chr "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
## $ agent : chr "NULL" "NULL" "NULL" "304" ...
## $ company : chr "NULL" "NULL" "NULL" "NULL" ...
## $ days_in_waiting_list : int 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : chr "Transient" "Transient" "Transient" "Transient" ...
## $ adr : num 0 0 75 75 98 ...
## $ required_car_parking_spaces : int 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : int 0 0 0 0 1 1 0 1 1 0 ...
## $ reservation_status : chr "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
## $ reservation_status_date : chr "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...
colnames(hotel_bookings) ## to display the columns name of the data set.
## [1] "hotel" "is_canceled"
## [3] "lead_time" "arrival_date_year"
## [5] "arrival_date_month" "arrival_date_week_number"
## [7] "arrival_date_day_of_month" "stays_in_weekend_nights"
## [9] "stays_in_week_nights" "adults"
## [11] "children" "babies"
## [13] "meal" "country"
## [15] "market_segment" "distribution_channel"
## [17] "is_repeated_guest" "previous_cancellations"
## [19] "previous_bookings_not_canceled" "reserved_room_type"
## [21] "assigned_room_type" "booking_changes"
## [23] "deposit_type" "agent"
## [25] "company" "days_in_waiting_list"
## [27] "customer_type" "adr"
## [29] "required_car_parking_spaces" "total_of_special_requests"
## [31] "reservation_status" "reservation_status_date"
head(hotel_bookings) ## to quickly inspect the first 6 rows of the data set.
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## 1 Resort Hotel 0 342 2015 July
## 2 Resort Hotel 0 737 2015 July
## 3 Resort Hotel 0 7 2015 July
## 4 Resort Hotel 0 13 2015 July
## 5 Resort Hotel 0 14 2015 July
## 6 Resort Hotel 0 14 2015 July
## arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
## 1 27 1 0
## 2 27 1 0
## 3 27 1 0
## 4 27 1 0
## 5 27 1 0
## 6 27 1 0
## stays_in_week_nights adults children babies meal country market_segment
## 1 0 2 0 0 BB PRT Direct
## 2 0 2 0 0 BB PRT Direct
## 3 1 1 0 0 BB GBR Direct
## 4 1 1 0 0 BB GBR Corporate
## 5 2 2 0 0 BB GBR Online TA
## 6 2 2 0 0 BB GBR Online TA
## distribution_channel is_repeated_guest previous_cancellations
## 1 Direct 0 0
## 2 Direct 0 0
## 3 Direct 0 0
## 4 Corporate 0 0
## 5 TA/TO 0 0
## 6 TA/TO 0 0
## previous_bookings_not_canceled reserved_room_type assigned_room_type
## 1 0 C C
## 2 0 C C
## 3 0 A C
## 4 0 A A
## 5 0 A A
## 6 0 A A
## booking_changes deposit_type agent company days_in_waiting_list customer_type
## 1 3 No Deposit NULL NULL 0 Transient
## 2 4 No Deposit NULL NULL 0 Transient
## 3 0 No Deposit NULL NULL 0 Transient
## 4 0 No Deposit 304 NULL 0 Transient
## 5 0 No Deposit 240 NULL 0 Transient
## 6 0 No Deposit 240 NULL 0 Transient
## adr required_car_parking_spaces total_of_special_requests reservation_status
## 1 0 0 0 Check-Out
## 2 0 0 0 Check-Out
## 3 75 0 0 Check-Out
## 4 75 0 0 Check-Out
## 5 98 0 1 Check-Out
## 6 98 0 1 Check-Out
## reservation_status_date
## 1 2015-07-01
## 2 2015-07-01
## 3 2015-07-02
## 4 2015-07-02
## 5 2015-07-03
## 6 2015-07-03
tail(hotel_bookings) ## to quickly inspect the last 6 rows of the data set.
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## 119385 City Hotel 0 21 2017 August
## 119386 City Hotel 0 23 2017 August
## 119387 City Hotel 0 102 2017 August
## 119388 City Hotel 0 34 2017 August
## 119389 City Hotel 0 109 2017 August
## 119390 City Hotel 0 205 2017 August
## arrival_date_week_number arrival_date_day_of_month
## 119385 35 30
## 119386 35 30
## 119387 35 31
## 119388 35 31
## 119389 35 31
## 119390 35 29
## stays_in_weekend_nights stays_in_week_nights adults children babies meal
## 119385 2 5 2 0 0 BB
## 119386 2 5 2 0 0 BB
## 119387 2 5 3 0 0 BB
## 119388 2 5 2 0 0 BB
## 119389 2 5 2 0 0 BB
## 119390 2 7 2 0 0 HB
## country market_segment distribution_channel is_repeated_guest
## 119385 BEL Offline TA/TO TA/TO 0
## 119386 BEL Offline TA/TO TA/TO 0
## 119387 FRA Online TA TA/TO 0
## 119388 DEU Online TA TA/TO 0
## 119389 GBR Online TA TA/TO 0
## 119390 DEU Online TA TA/TO 0
## previous_cancellations previous_bookings_not_canceled reserved_room_type
## 119385 0 0 A
## 119386 0 0 A
## 119387 0 0 E
## 119388 0 0 D
## 119389 0 0 A
## 119390 0 0 A
## assigned_room_type booking_changes deposit_type agent company
## 119385 A 0 No Deposit 394 NULL
## 119386 A 0 No Deposit 394 NULL
## 119387 E 0 No Deposit 9 NULL
## 119388 D 0 No Deposit 9 NULL
## 119389 A 0 No Deposit 89 NULL
## 119390 A 0 No Deposit 9 NULL
## days_in_waiting_list customer_type adr required_car_parking_spaces
## 119385 0 Transient 96.14 0
## 119386 0 Transient 96.14 0
## 119387 0 Transient 225.43 0
## 119388 0 Transient 157.71 0
## 119389 0 Transient 104.40 0
## 119390 0 Transient 151.20 0
## total_of_special_requests reservation_status reservation_status_date
## 119385 2 Check-Out 2017-09-06
## 119386 0 Check-Out 2017-09-06
## 119387 2 Check-Out 2017-09-07
## 119388 4 Check-Out 2017-09-07
## 119389 0 Check-Out 2017-09-07
## 119390 2 Check-Out 2017-09-07
##attach(hotel_bookings)
summary(hotel_bookings) ## to explore and summarize the data set.
## hotel is_canceled lead_time arrival_date_year
## Length:119390 Min. :0.0000 Min. : 0 Min. :2015
## Class :character 1st Qu.:0.0000 1st Qu.: 18 1st Qu.:2016
## Mode :character Median :0.0000 Median : 69 Median :2016
## Mean :0.3704 Mean :104 Mean :2016
## 3rd Qu.:1.0000 3rd Qu.:160 3rd Qu.:2017
## Max. :1.0000 Max. :737 Max. :2017
##
## arrival_date_month arrival_date_week_number arrival_date_day_of_month
## Length:119390 Min. : 1.00 Min. : 1.0
## Class :character 1st Qu.:16.00 1st Qu.: 8.0
## Mode :character Median :28.00 Median :16.0
## Mean :27.17 Mean :15.8
## 3rd Qu.:38.00 3rd Qu.:23.0
## Max. :53.00 Max. :31.0
##
## stays_in_weekend_nights stays_in_week_nights adults
## Min. : 0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 1.0 1st Qu.: 2.000
## Median : 1.0000 Median : 2.0 Median : 2.000
## Mean : 0.9276 Mean : 2.5 Mean : 1.856
## 3rd Qu.: 2.0000 3rd Qu.: 3.0 3rd Qu.: 2.000
## Max. :19.0000 Max. :50.0 Max. :55.000
##
## children babies meal country
## Min. : 0.0000 Min. : 0.000000 Length:119390 Length:119390
## 1st Qu.: 0.0000 1st Qu.: 0.000000 Class :character Class :character
## Median : 0.0000 Median : 0.000000 Mode :character Mode :character
## Mean : 0.1039 Mean : 0.007949
## 3rd Qu.: 0.0000 3rd Qu.: 0.000000
## Max. :10.0000 Max. :10.000000
## NA's :4
## market_segment distribution_channel is_repeated_guest
## Length:119390 Length:119390 Min. :0.00000
## Class :character Class :character 1st Qu.:0.00000
## Mode :character Mode :character Median :0.00000
## Mean :0.03191
## 3rd Qu.:0.00000
## Max. :1.00000
##
## previous_cancellations previous_bookings_not_canceled reserved_room_type
## Min. : 0.00000 Min. : 0.0000 Length:119390
## 1st Qu.: 0.00000 1st Qu.: 0.0000 Class :character
## Median : 0.00000 Median : 0.0000 Mode :character
## Mean : 0.08712 Mean : 0.1371
## 3rd Qu.: 0.00000 3rd Qu.: 0.0000
## Max. :26.00000 Max. :72.0000
##
## assigned_room_type booking_changes deposit_type agent
## Length:119390 Min. : 0.0000 Length:119390 Length:119390
## Class :character 1st Qu.: 0.0000 Class :character Class :character
## Mode :character Median : 0.0000 Mode :character Mode :character
## Mean : 0.2211
## 3rd Qu.: 0.0000
## Max. :21.0000
##
## company days_in_waiting_list customer_type adr
## Length:119390 Min. : 0.000 Length:119390 Min. : -6.38
## Class :character 1st Qu.: 0.000 Class :character 1st Qu.: 69.29
## Mode :character Median : 0.000 Mode :character Median : 94.58
## Mean : 2.321 Mean : 101.83
## 3rd Qu.: 0.000 3rd Qu.: 126.00
## Max. :391.000 Max. :5400.00
##
## required_car_parking_spaces total_of_special_requests reservation_status
## Min. :0.00000 Min. :0.0000 Length:119390
## 1st Qu.:0.00000 1st Qu.:0.0000 Class :character
## Median :0.00000 Median :0.0000 Mode :character
## Mean :0.06252 Mean :0.5714
## 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :8.00000 Max. :5.0000
##
## reservation_status_date
## Length:119390
## Class :character
## Mode :character
##
##
##
##
The summary has identified issues in two variables:
The Average Daily Rate data contains a negative value of -6.38, which is incorrect and needs to be corrected.
The Days in Waiting List data exhibits anomalies, with a maximum value of 391 and an average value of 2.32 which are significantly different. This discrepancy requires further investigation.
Addressing these issues will ensure the accuracy and reliability of the data used in the analysis in the next step.
negatives <- which(hotel_bookings$adr < 0)
hotel_bookings$adr[negatives] <- mean(hotel_bookings$adr, na.rm = TRUE)
sum(is.na(hotel_bookings$adr))
## [1] 0
colSums(is.na(hotel_bookings))
## hotel is_canceled
## 0 0
## lead_time arrival_date_year
## 0 0
## arrival_date_month arrival_date_week_number
## 0 0
## arrival_date_day_of_month stays_in_weekend_nights
## 0 0
## stays_in_week_nights adults
## 0 0
## children babies
## 4 0
## meal country
## 0 0
## market_segment distribution_channel
## 0 0
## is_repeated_guest previous_cancellations
## 0 0
## previous_bookings_not_canceled reserved_room_type
## 0 0
## assigned_room_type booking_changes
## 0 0
## deposit_type agent
## 0 0
## company days_in_waiting_list
## 0 0
## customer_type adr
## 0 0
## required_car_parking_spaces total_of_special_requests
## 0 0
## reservation_status reservation_status_date
## 0 0
The result reveals that the children column has only 4 missing values (NA). Given the small number of missing values compared to the total number of observations, we can either remove them or replace them with mean values.
In this case, we have decided to replace the missing values with the mean value of children.
## Remove NA's
colSums(is.na(hotel_bookings))
## hotel is_canceled
## 0 0
## lead_time arrival_date_year
## 0 0
## arrival_date_month arrival_date_week_number
## 0 0
## arrival_date_day_of_month stays_in_weekend_nights
## 0 0
## stays_in_week_nights adults
## 0 0
## children babies
## 4 0
## meal country
## 0 0
## market_segment distribution_channel
## 0 0
## is_repeated_guest previous_cancellations
## 0 0
## previous_bookings_not_canceled reserved_room_type
## 0 0
## assigned_room_type booking_changes
## 0 0
## deposit_type agent
## 0 0
## company days_in_waiting_list
## 0 0
## customer_type adr
## 0 0
## required_car_parking_spaces total_of_special_requests
## 0 0
## reservation_status reservation_status_date
## 0 0
hotel_bookings$children[is.na(hotel_bookings$children)] = round(mean(hotel_bookings$children, na.rm = TRUE),0)
sum(is.na(hotel_bookings$children))
## [1] 0
table(arrival_date_month), table(meal), table(country), table(market_segment) table(distribution_channel), table(reserved_room_type), table(assigned_room_type), table(agent), table(company), table(customer_type)
The result has identified NULL values in three variables.
Due to the high number of NULL values in Agent and Company variables, removing them would result in a significant loss of important data. Therefore, replacing the NULL values in these variables with 0, as the agent and company values are just IDs and replacing them with 0 won’t create any issues.
For Country variable, given the relatively low number of missing values, we can either delete the corresponding rows or replace the NULL values with the most frequently occurring country.
## Replace the NULL value with 0 and max
hotel_bookings$agent[is.null(hotel_bookings$agent)] <- 0
hotel_bookings$company[is.null(hotel_bookings$company)] <- 0
hotel_bookings$country[is.null(hotel_bookings$country)] <- max(hotel_bookings$country)
We will consolidate the three guest categories (Adults, Children, Babies) and create a new column called ‘total_guests’ by adding the number of adults, children, and babies in each booking. This will simplify the data and make it easier to analyze the total number of guests for each booking.”
hotel_bookings <- mutate(hotel_bookings,total_guests = adults + children + babies)
table(hotel_bookings$total_guests)
##
## 0 1 2 3 4 5 6 10 12 20 26 27 40
## 180 22581 82051 10495 3929 137 1 2 2 2 5 2 1
## 50 55
## 1 1
The result displays that there are 180 rows in the data that contain zero guests. These rows are being removed as booking without guests is deemed meaningless.
## Remove 180 rows with 0 guest
hotel_bookings <- filter(hotel_bookings, total_guests > 0)
table(hotel_bookings$total_guests)
##
## 1 2 3 4 5 6 10 12 20 26 27 40 50
## 22581 82051 10495 3929 137 1 2 2 2 5 2 1 1
## 55
## 1
count(hotel_bookings[duplicated(hotel_bookings),])
## n
## 1 31980
count(unique(hotel_bookings[duplicated(hotel_bookings),]))
## n
## 1 8161
There are total 40.141 duplicate values in the hotel data set. All columns are the same for these duplicates, but the reason for the duplication is not clear. Since there is no unique identifier like a customer name, the duplicate rows cannot be removed from the data. To handle this situation we would need to consult with the client and determine the reason for the duplication before making a decision.
Exploratory Data Analysis (EDA) is a crucial step in data analysis as it helps to identify patterns and relationships in the data. In this analysis, we will perform an EDA to analyze the hotel’s performance across the years 2015, 2016 & 2017.
The objectives of this analysis are to identify the distribution of bookings and cancellations, factors affecting cancellations, top-performing hotels in terms of revenue, seasonality of bookings and cancellations, and countries with the highest number of bookings and cancellations.
Distribution of bookings and cancellations across the years: This analysis will help us to understand the overall trends over the years.
Seasonality of bookings and cancellations: We will analyze the seasonality of bookings and cancellations and identify the months with the highest and lowest revenue. This analysis will help us to understand the seasonality of the hotel’s business and plan marketing and promotional campaigns accordingly.
Factors affecting cancellations: We will analyze the factors affecting cancellations such as lead time, market segment, customer type, room type and deposit type. This analysis will help us to identify the most significant factors contributing to cancellations.
Countries with the highest number of bookings and cancellations: We will identify the countries with the highest number of bookings and cancellations and evaluate their contribution to the hotel’s performance. This analysis will help us to understand the hotel’s customer base and identify opportunities to attract more customers from other countries.
Let’s perform a comprehensive data analysis to gain insights into the patterns and trends present in the dataset.
Analyzing annual metrics in terms of bookings, cancellations, and revenue can provide insights into the overall performance of a hotel or hospitality business over time.
Distribution of the reservation across the years
We analyze the hotel’s performance (both City Hotel & Resort Hotel) based on their reservation status each year to gain a comprehensive understanding of the reservation trends.
Analysis process:
# Filter the data by years 2015, 2016, 2017.
hotel_bookings <- hotel_bookings %>%
filter(arrival_date_year %in% c(2015, 2016, 2017))
# Group the data by year and reservation status.
yearly_bookings <- hotel_bookings %>%
group_by(arrival_date_year, reservation_status) %>%
summarise(count = n()) %>%
ungroup()
## `summarise()` has grouped output by 'arrival_date_year'. You can override using
## the `.groups` argument.
yearly_bookings
## # A tibble: 9 × 3
## arrival_date_year reservation_status count
## <int> <chr> <int>
## 1 2015 Canceled 7950
## 2 2015 Check-Out 13826
## 3 2015 No-Show 191
## 4 2016 Canceled 19657
## 5 2016 Check-Out 36299
## 6 2016 No-Show 667
## 7 2017 Canceled 15386
## 8 2017 Check-Out 24886
## 9 2017 No-Show 348
Visualization:
# Plot the data
library(ggplot2)
ggplot(data = yearly_bookings,
aes(x = arrival_date_year, y = count, group = reservation_status,
color = reservation_status)) +
geom_line(linewidth = 1) +
scale_x_continuous(breaks = c(2015, 2016, 2017), expand = c(0.17, 0)) +
scale_y_continuous(labels = scales::comma) +
scale_color_manual(values = c("Canceled" = "#529993", "Check-Out" = "#B6B6B6", "No-Show" = "#B6B6B6")) +
scale_fill_manual(values = c("Canceled" = "#529993", "Check-Out" = "#B6B6B6", "No-Show" = "#B6B6B6")) +
labs(title = "Reservation number across years",
subtitle = "",
x = "Year",
y = "Count",
caption = "
Data Source: Kaggle",
color = "Reservation Status") +
theme(plot.title = element_text(colour = "#3A3A3A", hjust = -0.15),
plot.caption = element_text(colour = "#3A3A3A", hjust = -0.17),
panel.background = element_rect(fill = "white"),
panel.grid.minor.x = element_line(size = 0),
legend.position = "none",
axis.title.y = element_text(colour = "#3A3A3A", vjust = 1, hjust = 0.78, size = 9),
axis.title.x = element_text(colour = "#3A3A3A", hjust = 0.11, size = 9),
axis.line = element_line(linewidth = 0.4, colour = "#8E8783")) +
geom_text(data = yearly_bookings %>% filter(arrival_date_year %in% c(2015, 2016, 2017))
%>% group_by(reservation_status) %>% slice_tail(n = 1),
aes(x = arrival_date_year, y = count, label = reservation_status),
hjust = 0.01, vjust = 0.01, size = 4, color = c("#529993", "#8E8783", "#8E8783"))
## Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
The analysis and chart provide us with the following insights:
We have three reservation status: Check-Out, Canceled & No-Show. Let’s elaborate each one!
Check-Out status: the highest year = 2016 (36.370) and the lowest year = 2015 (13.854).
The check-out was higher than the canceled and no-show across all three years and from both hotel types. This indicates that most bookings result in successful stays, which is a positive sign for the hotel business.
Canceled status: the highest year = 2016 (19.669) and the lowest year = 2015 (7.951).
The canceled number is concerning for the hotel business, even though it was lower than the number of check-out. This indicates a negative trend for the hotel’s overall business.
No-Show status: the number of no-shows was much lower than the number of canceled bookings, but it still contributes to the business performance.
To obtain a more thorough comprehension of the trends and patterns, it is necessary to conduct further analysis on cancellations, bookings, and revenue generated by hotel types.
Analysis process:
# Filter only canceled bookings
canceled_bookings <- hotel_bookings %>%
filter(is_canceled == 1) %>%
group_by(arrival_date_month, hotel, arrival_date_year) %>%
summarize(total_cancelations = n()) %>%
mutate(month_num = match(arrival_date_month, month.name),
months = factor(substr(arrival_date_month, 1, 3), levels = month.abb)) %>%
arrange(hotel, arrival_date_year, month_num)
## `summarise()` has grouped output by 'arrival_date_month', 'hotel'. You can
## override using the `.groups` argument.
# lowest point
lowest_points <- canceled_bookings %>%
group_by(hotel, arrival_date_year) %>%
slice_min(total_cancelations) %>%
arrange(arrival_date_year) %>%
ungroup()
# highest point
highest_points <- canceled_bookings %>%
group_by(hotel, arrival_date_year) %>%
slice_max(total_cancelations) %>%
ungroup()
# create a new data frame to store the highest and lowest points
highest_lowest_points <- canceled_bookings %>%
group_by(arrival_date_year, hotel) %>%
slice_min(total_cancelations, n = 1) %>% # to get the lowest point
bind_rows(canceled_bookings %>%
group_by(arrival_date_year, hotel) %>%
slice_max(total_cancelations, n = 1)) # to get the highest point
highest_lowest_points
## # A tibble: 12 × 6
## # Groups: arrival_date_year, hotel [6]
## arrival_date_month hotel arrival_date_year total_canc…¹ month…² months
## <chr> <chr> <int> <int> <int> <fct>
## 1 November City Hotel 2015 301 11 Nov
## 2 November Resort Hotel 2015 185 11 Nov
## 3 January City Hotel 2016 438 1 Jan
## 4 January Resort Hotel 2016 119 1 Jan
## 5 February City Hotel 2017 970 2 Feb
## 6 January Resort Hotel 2017 206 1 Jan
## 7 September City Hotel 2015 1542 9 Sep
## 8 September Resort Hotel 2015 551 9 Sep
## 9 October City Hotel 2016 1947 10 Oct
## 10 August Resort Hotel 2016 578 8 Aug
## 11 May City Hotel 2017 2217 5 May
## 12 August Resort Hotel 2017 693 8 Aug
## # … with abbreviated variable names ¹total_cancelations, ²month_num
Visualization:
# plot the data with added points and labels
ggplot(data = canceled_bookings, aes(x = months, y = total_cancelations, group = hotel, colour = hotel)) +
facet_wrap(~arrival_date_year, ncol = 1 ) +
geom_line(linewidth = 1) +
scale_color_manual(values = c("City Hotel" = "#096E65", "Resort Hotel" = "#9DC6C2")) +
scale_y_continuous(labels = scales::comma) +
labs(title = "Cancellations trend across years",
x = "Arrival month",
y = "Count",
caption = "
Data Source: Kaggle",
color = "Hotel Type",
fill = "Hotel Type") +
theme(plot.title = element_text(colour = "#3A3A3A", hjust = -0.24),
plot.caption = element_text(colour = "#3A3A3A", hjust = -0.2),
panel.background = element_rect(fill = "white"),
panel.grid.minor.x = element_line(size = 0),
legend.position = "right",
axis.title.y = element_text(colour = "#3A3A3A", vjust = 2, hjust = 0.97, size = 9),
axis.title.x = element_text(colour = "#3A3A3A", hjust = 0.04, vjust = 0.5, size = 9),
axis.line = element_line(linewidth = 0.4, colour = "#8E8783")) +
geom_point(data = highest_lowest_points, aes(x = months, y = total_cancelations), size = 3)
The analysis and chart provide us with the following insights:
City Hotel:
In 2015, the cancellation rate was highest in September and decreased significantly in November.
In 2016, the cancellation rate was lowest in January and increased gradually during spring months (March, April), then decreased in July, and increased again till October, followed by a sharp decrease until December.
In 2017, the cancellation rate increased sharply from February to May and decreased again till August.
Resort Hotel:
The Resort Hotel obviously shows that there is no fluctuation trend and the pattern is relatively stable during all three years.
However, the chance of an increase in cancellations during the summer months is positive.
Analysis process:
##Filter only non-cancelled bookings
high_low_bookings <- hotel_bookings %>%
filter(is_canceled == 0) %>%
group_by(arrival_date_month, hotel, arrival_date_year) %>%
summarize(total_bookings = n()) %>%
mutate(month_num = match(arrival_date_month, month.name),
months = factor(substr(arrival_date_month, 1, 3),
levels = month.abb)) %>%
arrange(hotel, arrival_date_year, month_num)
## `summarise()` has grouped output by 'arrival_date_month', 'hotel'. You can
## override using the `.groups` argument.
#lowest point
lowest_points <- high_low_bookings %>%
group_by(hotel, arrival_date_year) %>%
slice_min(total_bookings) %>%
arrange(arrival_date_year) %>%
ungroup()
#highest point
highest_points <- high_low_bookings %>%
group_by(hotel, arrival_date_year) %>%
slice_max(total_bookings) %>%
ungroup()
# create a new data frame to store the highest and lowest points
top_bottom_points <- high_low_bookings %>%
group_by(arrival_date_year, hotel) %>%
slice_min(total_bookings, n = 1) %>% # get the lowest point
bind_rows(high_low_bookings %>%
group_by(arrival_date_year, hotel) %>%
slice_max(total_bookings, n = 1)) %>% # get the highest point
arrange(arrival_date_year)
top_bottom_points
## # A tibble: 12 × 6
## # Groups: arrival_date_year, hotel [6]
## arrival_date_month hotel arrival_date_year total_book…¹ month…² months
## <chr> <chr> <int> <int> <int> <fct>
## 1 July City Hotel 2015 457 7 Jul
## 2 November Resort Hotel 2015 919 11 Nov
## 3 October City Hotel 2015 2061 10 Oct
## 4 October Resort Hotel 2015 1158 10 Oct
## 5 January City Hotel 2016 926 1 Jan
## 6 January Resort Hotel 2016 765 1 Jan
## 7 September City Hotel 2016 2301 9 Sep
## 8 October Resort Hotel 2016 1417 10 Oct
## 9 January City Hotel 2017 1323 1 Jan
## 10 June Resort Hotel 2017 1044 6 Jun
## 11 May City Hotel 2017 2331 5 May
## 12 May Resort Hotel 2017 1212 5 May
## # … with abbreviated variable names ¹total_bookings, ²month_num
Visualization:
# Plot the line chart with the lowest and highest points added
ggplot(data = high_low_bookings, aes(x = months, y = total_bookings,
group = hotel, colour = hotel)) +
facet_wrap(~arrival_date_year, ncol = 1 ) +
geom_line(linewidth = 1) +
scale_color_manual(values = c("City Hotel" = "#096E65", "Resort Hotel" = "#9DC6C2")) +
scale_y_continuous(labels = scales::comma) +
labs(title = "Booking trend across years",
subtitle = "",
x = "Arrival month",
y = "Count",
caption = "
Data Source: Kaggle",
color = "Hotel Type",
fill = "Hotel Type") +
theme(plot.title = element_text(colour = "#3A3A3A", hjust = -0.19),
plot.caption = element_text(colour = "#3A3A3A", hjust = -0.22),
panel.background = element_rect(fill = "white"),
panel.grid.minor.x = element_line(size = 0),
legend.position = "right",
axis.title.y = element_text(colour = "#3A3A3A", vjust = 0.5, hjust = 0.96, size = 9),
axis.title.x = element_text(colour = "#3A3A3A", hjust = 0.04,vjust = 0.5, size = 9),
axis.line = element_line(linewidth = 0.4, colour = "#8E8783")) +
geom_point(data = top_bottom_points, aes(x = months, y = total_bookings),
size = 3)
The analysis and chart provide us with the following insights:
City Hotel:
In 2015, bookings reached their lowest point in July and steadily increased until September, then experienced a sharp drop in November.
In 2016 and 2017, bookings consistently increased from spring to summer months, with a gradual decrease during winter months.
Resort Hotel:
Our next analysis is to observe the percentage of rooms that are occupied by guests during the year 2015, 2016 & 2017.
Analysis process:
occupancy_rate_monthly <- hotel_bookings %>%
group_by(hotel, arrival_date_year, arrival_date_month) %>%
summarize(total_bookings = n(),
total_occupied = sum(is_canceled == 0),
occupancy_rate = mean(is_canceled == 0)) %>%
mutate(month_num = match(arrival_date_month, month.name),
months = factor(substr(arrival_date_month, 1, 3),
levels = month.abb)) %>%
arrange(hotel, arrival_date_year, month_num)
## `summarise()` has grouped output by 'hotel', 'arrival_date_year'. You can
## override using the `.groups` argument.
#lowest point
lowest_occ_rate <- occupancy_rate_monthly %>%
group_by(hotel, arrival_date_year) %>%
slice_min(occupancy_rate) %>%
ungroup()
#highest point
highest_occ_rate <- occupancy_rate_monthly %>%
group_by(hotel, arrival_date_year) %>%
slice_max(occupancy_rate) %>%
ungroup()
# create a new data frame to store the highest and lowest points
highest_lowest_occrate <- occupancy_rate_monthly %>%
group_by(arrival_date_year, hotel) %>%
slice_min(occupancy_rate, n = 1) %>% # get the lowest point
bind_rows(occupancy_rate_monthly %>%
group_by(arrival_date_year, hotel) %>%
slice_max(occupancy_rate, n = 1)) %>% # get the highest point
arrange(arrival_date_year)
highest_lowest_occrate
## # A tibble: 12 × 8
## # Groups: arrival_date_year, hotel [6]
## hotel arrival_date_year arriv…¹ total…² total…³ occup…⁴ month…⁵ months
## <chr> <int> <chr> <int> <int> <dbl> <int> <fct>
## 1 City Hotel 2015 July 1396 457 0.327 7 Jul
## 2 Resort Hotel 2015 Septem… 1585 1034 0.652 9 Sep
## 3 City Hotel 2015 Novemb… 1233 932 0.756 11 Nov
## 4 Resort Hotel 2015 Novemb… 1104 919 0.832 11 Nov
## 5 City Hotel 2016 October 4212 2265 0.538 10 Oct
## 6 Resort Hotel 2016 August 1685 1107 0.657 8 Aug
## 7 City Hotel 2016 January 1364 926 0.679 1 Jan
## 8 Resort Hotel 2016 January 884 765 0.865 1 Jan
## 9 City Hotel 2017 April 3911 1990 0.509 4 Apr
## 10 Resort Hotel 2017 August 1800 1107 0.615 8 Aug
## 11 City Hotel 2017 August 3123 2002 0.641 8 Aug
## 12 Resort Hotel 2017 January 1307 1101 0.842 1 Jan
## # … with abbreviated variable names ¹arrival_date_month, ²total_bookings,
## # ³total_occupied, ⁴occupancy_rate, ⁵month_num
Visualization:
# plot the data
ggplot(occupancy_rate_monthly, aes(x = months, y = occupancy_rate, color = hotel, group = hotel)) +
facet_wrap(~arrival_date_year, ncol = 1) +
geom_line(linewidth = 1) +
scale_color_manual(values = c("City Hotel" = "#096E65", "Resort Hotel" = "#9DC6C2")) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(title = "Occupancy rate across years",
subtitle = "",
x = "Arrival month",
y = "Rate",
color = "Hotel Type",
caption = "
Data Source: Kaggle",
fill = "Hotel Type") +
theme(plot.title = element_text(colour = "#3A3A3A", hjust = -0.17),
plot.caption = element_text(colour = "#3A3A3A", hjust = -0.2),
panel.background = element_rect(fill = "white"),
panel.grid.minor.x = element_line(size = 0),
legend.position = "right",
axis.title.y = element_text(colour = "#3A3A3A", vjust = 1, hjust = 0.97, size = 9),
axis.title.x = element_text(colour = "#3A3A3A", hjust = 0.04,vjust = 0.5, size = 9),
axis.line = element_line(linewidth = 0.4, colour = "#8E8783")) +
geom_point(data = highest_lowest_occrate, aes(x = months, y = occupancy_rate),
size = 3)
The analysis and chart provide us with the following insights:
The chart shows that both hotels have similar patterns of occupancy rates across the three years, with peak months during the summer (June to August) and lower occupancy rates during the winter months (November to January). However, the occupancy rate of the Resort Hotel is generally higher than the City Hotel throughout the year.
One interesting observation is that the City Hotel experienced a dip in occupancy rate during the summer months of 2016, while the Resort Hotel did not. In addition, both hotels experienced a dip in occupancy rate during the fall months of 2016 and 2017.
Analysis process:
# filter data of non-canceled booking
revenue_by_month <- hotel_bookings %>%
filter(is_canceled == 0) %>%
group_by(hotel, arrival_date_year, arrival_date_month) %>%
summarize(total_revenue = sum(adr * (stays_in_week_nights + stays_in_weekend_nights))) %>%
mutate(month_num = match(arrival_date_month, month.name),
months = factor(substr(arrival_date_month, 1, 3),
levels = month.abb)) %>%
arrange(hotel, arrival_date_year, month_num)
## `summarise()` has grouped output by 'hotel', 'arrival_date_year'. You can
## override using the `.groups` argument.
#lowest revenue
lowest_revenue <- revenue_by_month %>%
group_by(hotel, arrival_date_year) %>%
slice_min(total_revenue) %>%
ungroup()
#highest revenue
highest_revenue <- revenue_by_month %>%
group_by(hotel, arrival_date_year) %>%
slice_max(total_revenue) %>%
ungroup()
# create a new data frame to store the highest and lowest revenue.
highest_lowest_revenue <- revenue_by_month %>%
group_by(arrival_date_year, hotel) %>%
slice_min(total_revenue, n = 1) %>% # get the lowest point
bind_rows(revenue_by_month %>%
group_by(arrival_date_year, hotel) %>%
slice_max(total_revenue, n = 1)) %>% # get the highest point
arrange(arrival_date_year)
highest_lowest_revenue
## # A tibble: 12 × 6
## # Groups: arrival_date_year, hotel [6]
## hotel arrival_date_year arrival_date_month total_reve…¹ month…² months
## <chr> <int> <chr> <dbl> <int> <fct>
## 1 City Hotel 2015 July 79646. 7 Jul
## 2 Resort Hotel 2015 November 155481. 11 Nov
## 3 City Hotel 2015 September 588545. 9 Sep
## 4 Resort Hotel 2015 August 852455. 8 Aug
## 5 City Hotel 2016 January 175232. 1 Jan
## 6 Resort Hotel 2016 January 89289. 1 Jan
## 7 City Hotel 2016 August 805804. 8 Aug
## 8 Resort Hotel 2016 August 1001850. 8 Aug
## 9 City Hotel 2017 January 352313. 1 Jan
## 10 Resort Hotel 2017 January 159455. 1 Jan
## 11 City Hotel 2017 August 885430. 8 Aug
## 12 Resort Hotel 2017 August 1084751. 8 Aug
## # … with abbreviated variable names ¹total_revenue, ²month_num
#plot the chart.
ggplot(revenue_by_month, aes(x = months, y = total_revenue, color = hotel, group = hotel)) +
facet_wrap(~arrival_date_year, ncol = 1) +
geom_line(linewidth = 1) +
scale_color_manual(values = c("City Hotel" = "#096E65", "Resort Hotel" = "#9DC6C2")) +
labs(title = "Revenue by Hotel Type",
subtitle = "From year 2015-2015
",
x = "Arrival month",
y = "Total Revenue",
color = "Hotel Type",
caption = "
Data Source: Kaggle",
fill = "Hotel Type") +
scale_y_continuous(labels = scales::dollar) +
theme(plot.title = element_text(colour = "#3A3A3A", hjust = -0.26),
plot.subtitle = element_text(colour = "#3A3A3A", hjust = -0.22),
plot.caption = element_text(colour = "#3A3A3A", hjust = -0.27),
panel.background = element_rect(fill = "white"),
panel.grid.minor.x = element_line(size = 0),
legend.position = "right",
axis.title.y = element_text(colour = "#3A3A3A", vjust = 1, hjust = 0.95, size = 9),
axis.title.x = element_text(colour = "#3A3A3A", hjust = 0.04,vjust = 0.5, size = 9),
axis.line = element_line(linewidth = 0.4, colour = "#8E8783")) +
geom_point(data = highest_lowest_revenue, aes(x = months, y = total_revenue),
size = 3)
The analysis and chart provide us with the following insights:
In general, the City Hotel generates more revenue than the Resort Hotel during all three years.
But the chart also shows the Resort Hotel generates more revenue during summer months (July and August) for all three years.
This analysis is important to identify the contributing factors behind the high cancellation rates and take appropriate actions to decrease them, which can enhance the hotel’s overall performance
cancellation_by_lead_time <- hotel_bookings %>%
mutate(lead_time_group = cut(lead_time, breaks = seq(0, 750, by = 50), include.lowest = TRUE)) %>%
group_by(lead_time_group) %>%
summarise(total_cancellations = sum(is_canceled),
total_bookings = n(),
cancellation_rate = total_cancellations / total_bookings)
cancellation_by_lead_time
## # A tibble: 14 × 4
## lead_time_group total_cancellations total_bookings cancellation_rate
## <fct> <int> <int> <dbl>
## 1 [0,50] 11446 50668 0.226
## 2 (50,100] 8542 21212 0.403
## 3 (100,150] 6540 14795 0.442
## 4 (150,200] 5463 11898 0.459
## 5 (200,250] 3385 7060 0.479
## 6 (250,300] 3414 5692 0.600
## 7 (300,350] 2793 4060 0.688
## 8 (350,400] 1109 1712 0.648
## 9 (400,450] 669 1007 0.664
## 10 (450,500] 426 625 0.682
## 11 (500,550] 148 215 0.688
## 12 (550,600] 119 119 1
## 13 (600,650] 145 145 1
## 14 (700,750] 0 2 0
ggplot(cancellation_by_lead_time, aes(x = lead_time_group, y = cancellation_rate, fill = cancellation_rate > 0.5)) +
geom_col(width = 0.7) +
coord_flip() +
scale_fill_manual(values = c("#9DC6C2", "#096E65")) +
scale_x_discrete(limits = levels(cancellation_by_lead_time$lead_time_group)) +
scale_y_continuous(labels = scales::percent_format()) +
labs(title = "Cancellation rate vs Lead time",
subtitle = "Year from 2015 to 2017
",
x = "
Lead Time (days)",
y = "",
caption = "
Data Source: Kaggle",
fill = "Rate > 50% in Green") +
theme(plot.title = element_text(colour = "#3A3A3A", hjust = -0.22),
plot.subtitle = element_text(colour = "#3A3A3A", hjust = -0.17),
plot.caption = element_text(colour = "#3A3A3A", hjust = -0.22),
panel.background = element_rect(fill = "white"),
legend.position = "none",
axis.title.x = element_text(colour = "#3A3A3A", hjust = -0.05,vjust = 0.5, size = 9),
axis.title.y = element_text(colour = "#3A3A3A", hjust = 0.96,vjust = 0.7, size = 9),
axis.line = element_line(linewidth = 0.3, colour = "#8E8783"),
axis.text.x = element_text(vjust = 0.5, hjust=1))
The analyisis and chart provide us with the following insights:
There is a clear relationship between lead time and cancellation rate, with longer lead times generally associated with higher cancellation rates.
The cancellation rate for bookings made with a lead time of 50-250 days is particularly high, reaching almost 50%.
The cancellation rate for bookings made with a lead time of 250-550 days is generally almost reaching 75%, indicating that this is a high-risk period for cancellations.
The cancellation rate for bookings made with a lead time of 600-650 days is generally 100% cancellation rate.
The chart highlights the importance of booking policies and communication strategies, particularly for bookings made with shorter lead times. It may be useful for stakeholders to consider implementing policies and strategies that encourage guests to make bookings with longer lead times, such as offering incentives or discounts.
# calculate cancellation by market segment
cancellation_by_segment_type <- hotel_bookings %>%
filter(is_canceled == 1) %>%
group_by(market_segment, hotel) %>%
summarise(cancellation_num = n())
## `summarise()` has grouped output by 'market_segment'. You can override using
## the `.groups` argument.
# plot the chart
ggplot(cancellation_by_segment_type, aes(x = market_segment, y = cancellation_num, fill = market_segment)) +
geom_col(position = "dodge", width = 0.7) +
geom_text(aes(label = ifelse(cancellation_num == max(cancellation_num), cancellation_num, "")),
position = position_dodge(width = 0.7), vjust = -1.7,
hjust = 1, size = 3) +
coord_flip() +
scale_y_continuous(labels = scales::comma) +
labs(title = "Cancellation vs Market segment",
subtitle = "Year from 2015 to 2017",
x = "Market Segment",
y = "
Count",
caption = "
Data Source: Kaggle",
fill = "") +
scale_fill_manual(values = c("Aviation" = "#B6B6B6", "Complementary"= "#B6B6B6", "Corporate" = "#B6B6B6", "Direct" = "#B6B6B6", "Groups" = "#B6B6B6", "Offline TA/TO" = "#B6B6B6", "Online TA" = "#529993", "Undefined" = "#B6B6B6")) +
theme(plot.title = element_text(colour = "#3A3A3A", hjust = -0.33),
plot.subtitle = element_text(colour = "#3A3A3A", hjust = -0.25),
plot.caption = element_text(colour = "#3A3A3A", hjust = -0.27),
panel.background = element_rect(fill = "white"),
legend.position = "none",
axis.line = element_line(linewidth = 0.3, colour = "#8E8783"),
axis.text = element_text(hjust = 0.7),
axis.title.y = element_text(colour = "#3A3A3A", vjust = 1, hjust = 0.91, size = 9),
axis.title.x = element_text(colour = "#3A3A3A", hjust = -0.01,vjust = 0.5, size = 9))
The analysis and chart provide us with the following insights:
From the chart, we can observe that the online travel agency (Online TA) market segment has the highest number of cancellations compared to other segments, followed by offline travel agencies/tour operators (Offline TA/TO) and Direct bookings.
Meanwhile, the Aviation, Complementary, Corporate, Groups, and Undefined market segments have the lowest number of cancellations.
cancellation_by_customer_type <- hotel_bookings %>%
filter(is_canceled == 1) %>%
group_by(customer_type, hotel) %>%
summarise(cancellation_count = n())
## `summarise()` has grouped output by 'customer_type'. You can override using the
## `.groups` argument.
# plot the chart
ggplot(cancellation_by_customer_type, aes(x = customer_type, y = cancellation_count, fill = customer_type)) +
geom_col(position = "dodge", width = 0.7) +
geom_text(aes(label = ifelse(cancellation_count == max(cancellation_count),
cancellation_count, "")), position = position_dodge(width = 0.5), vjust = -3, hjust = 1, size = 3) +
coord_flip() +
scale_y_continuous(labels = scales::comma) +
labs(title = "Cancellation vs Customer type",
subtitle = "Year from 2015 to 2017",
x = "Customer Type",
y = "Count",
caption = "
Data Source: Kaggle",
fill = "") +
scale_fill_manual(values = c("Contract" = "#B6B6B6", "Group"= "#B6B6B6", "Transient" = "#529993", "Transient-Party" = "#B6B6B6")) +
theme(plot.title = element_text(colour = "#3A3A3A", hjust = -0.31),
plot.subtitle = element_text(colour = "#3A3A3A", hjust = -0.24),
plot.caption = element_text(colour = "#3A3A3A", hjust = -0.26),
panel.background = element_rect(fill = "white"),
legend.position = "none",
axis.line = element_line(linewidth = 0.3, colour = "#8E8783"),
axis.text.x = element_text(vjust = 0.5, hjust=0.8),
axis.title.y = element_text(colour = "#3A3A3A", vjust = 2, hjust = 0.82, size = 9),
axis.title.x = element_text(colour = "#3A3A3A", hjust = 0.04,vjust = 0.5, size = 9))
The analysis and chartprovide us the following insights:
We can see that the Transient customer type has the highest number of cancellations, followed by Transient-Party, and then Contract and Group with relatively lower cancellation numbers.
This could indicate that the cancellation policies and procedures for Transient and Transient-Party customers need to be revisited or improved to reduce cancellations in the future.
# Compute the total cancellations by room type
cancellation_by_roomtype <- hotel_bookings %>%
filter(is_canceled == 1) %>%
group_by(reserved_room_type) %>%
summarize(total_cancellations = sum(is_canceled)) %>%
arrange(desc(reserved_room_type))
cancellation_by_roomtype
## # A tibble: 9 × 2
## reserved_room_type total_cancellations
## <chr> <int>
## 1 L 2
## 2 H 245
## 3 G 763
## 4 F 880
## 5 E 1908
## 6 D 6101
## 7 C 308
## 8 B 367
## 9 A 33625
cancellation_by_roomtype$reserved_room_type <- factor(cancellation_by_roomtype$reserved_room_type, levels = c("P", "O", "D", "E", "F", "G", "H", "C", "B", "A"))
# Plot the data as a bar chart with a color gradient
ggplot(cancellation_by_roomtype, aes(x = reserved_room_type, y = total_cancellations, fill = total_cancellations)) +
geom_col(position = "dodge") +
geom_text(aes(label = ifelse(total_cancellations == max(total_cancellations), total_cancellations, "")), position = position_dodge(width = 0.5), vjust = -2,
hjust = 1, size = 3) +
coord_flip() +
scale_y_continuous(labels = scales::comma) +
scale_fill_gradient(low = "#B6B6B6", high = "#529993", name = "Total Cancellations", guide = "legend", labels = scales::comma_format()) +
labs(title = "Cancellations vs Room type",
subtitle = "Year from 2015 to 2017",
x = "Room Type",
y = "Count",
caption = "
Data Source: Kaggle") +
theme(plot.title = element_text(colour = "#3A3A3A", hjust = -0.09),
plot.subtitle = element_text(colour = "#3A3A3A", hjust = -0.08),
plot.caption = element_text(colour = "#3A3A3A", hjust = -0.11),
panel.background = element_rect(fill = "white"),
legend.position = "none",
axis.line = element_line(linewidth = 0.3, colour = "#8E8783"),
axis.text.x = element_text(vjust = 0.5, hjust=0.8),
axis.title.y = element_text(colour = "#3A3A3A", vjust = 2, hjust = 0.94, size = 9),
axis.title.x = element_text(colour = "#3A3A3A", hjust = 0.04,vjust = 0.5, size = 9))
The analysis and chart provide us the following insights:
From this chart, we can see that the most cancellations occurred for the A and D room types, followed by E and F.
The least cancellations occurred for the G, B, C and H room types.
cancellation_by_deposit <- hotel_bookings %>%
group_by(deposit_type) %>%
summarise(total_bookings = n(),
total_cancellations = sum(is_canceled))
cancellation_by_deposit
## # A tibble: 3 × 3
## deposit_type total_bookings total_cancellations
## <chr> <int> <int>
## 1 No Deposit 104461 29669
## 2 Non Refund 14587 14494
## 3 Refundable 162 36
cancellation_by_deposit$deposit_type <- factor(cancellation_by_deposit$deposit_type, levels = c("Refundable", "Non Refund", "No Deposit"))
#plot the chart
ggplot(cancellation_by_deposit, aes(x = deposit_type, y = total_cancellations, fill = deposit_type)) +
geom_col(width = 0.5) +
coord_flip() +
geom_text(aes(label = total_cancellations), position = position_dodge(width = 0.5),
vjust = -2.9, hjust = 1, size = 3) +
scale_y_continuous(labels = scales::comma) +
scale_fill_manual(values = c("No Deposit" = "#529993", "Non Refund" = "#B6B6B6", "Refundable" = "#B6B6B6")) +
labs(title = "Cancellations vs Deposit type",
subtitle = "Year from 2015 to 2017",
x = "Deposit type",
y = "
Count",
fill = "Deposit Type",
caption = "
Data Source: Kaggle") +
theme(plot.title = element_text(colour = "#3A3A3A", hjust = -0.25),
plot.subtitle = element_text(colour = "#3A3A3A", hjust = -0.2),
plot.caption = element_text(colour = "#3A3A3A", hjust = -0.22),
panel.background = element_rect(fill = "white"),
legend.position = "none",
axis.line = element_line(linewidth = 0.3, colour = "#8E8783"),
axis.text.x = element_text(vjust = 0.5, hjust=0.8),
axis.title.y = element_text(colour = "#3A3A3A", vjust = 2, hjust = 0.78, size = 9),
axis.title.x = element_text(colour = "#3A3A3A", hjust = -0.01,vjust = 2, size = 9))
The analysis and chart provide us the following insights:
The majority of bookings made were without any deposit, and this group also had the highest number of cancellations.
On the other hand, bookings made with refundable deposit had the lowest number of cancellations.
top_10_country <- hotel_bookings %>%
filter(is_canceled == 0) %>%
group_by(country) %>%
summarise(booking_count = n()) %>%
arrange(desc(booking_count)) %>%
head(10)
top_10_country
## # A tibble: 10 × 2
## country booking_count
## <chr> <int>
## 1 PRT 20977
## 2 GBR 9668
## 3 FRA 8468
## 4 ESP 6383
## 5 DEU 6067
## 6 IRL 2542
## 7 ITA 2428
## 8 BEL 1868
## 9 NLD 1716
## 10 USA 1592
ggplot(top_10_country, aes(x = reorder(country, booking_count), y = booking_count)) +
geom_bar(stat = "identity", width = 0.7, fill = "#529993") +
coord_flip() +
scale_y_continuous(labels = scales::comma) +
labs(title = "Top 10 countries with the highest number of bookings",
subtitle = "From Year 2015 - 2017",
x = "Country",
y = "Count",
caption = "
Data Source: Kaggle",
fill = "Customer Type") +
theme(plot.title = element_text(colour = "#3A3A3A", hjust = -0.27),
plot.subtitle = element_text(colour = "#3A3A3A", hjust = -0.11),
plot.caption = element_text(colour = "#3A3A3A", hjust = -0.13),
panel.background = element_rect(fill = "white"),
axis.line = element_line(linewidth = 0.3, colour = "#8E8783"),
axis.text.x = element_text(vjust = 0.5, hjust = 0.8),
axis.title.y = element_text(colour = "#3A3A3A", vjust = 2, hjust = 0.94, size = 9),
axis.title.x = element_text(colour = "#3A3A3A", hjust = 0.04, vjust = 0.5, size = 9)) +
geom_text(aes(label = booking_count), position = position_dodge(width = 0.7), vjust = -0.3, hjust = -0.04, size = 2.7, colour = "#3A3A3A")
top_10_country <- hotel_bookings %>%
filter(is_canceled == 1) %>%
group_by(country) %>%
summarise(booking_count = n()) %>%
arrange(desc(booking_count)) %>%
head(10)
ggplot(top_10_country, aes(x = reorder(country, booking_count), y = booking_count)) +
geom_bar(stat = "identity", width = 0.7, fill = "#529993") +
coord_flip() +
scale_y_continuous (labels = scales::comma) +
labs(title = "Top 10 countries with the highest number of cancellations",
subtitle = "From Year 2015 - 2017
",
x = "Country",
y = "Count",
caption = "
Data Source: Kaggle",
fill = "Customer Type") +
theme(plot.title = element_text(colour = "#3A3A3A", hjust = -0.3),
plot.subtitle = element_text(colour = "#3A3A3A", hjust = -0.1),
plot.caption = element_text(colour = "#3A3A3A", hjust = -0.13),
panel.background = element_rect(fill = "white"),
axis.line = element_line(linewidth = 0.3, colour = "#8E8783"),
axis.text.x = element_text(vjust = 0.5, hjust=0.8),
axis.title.y = element_text(colour = "#3A3A3A", vjust = 2, hjust = 0.94, size = 9),
axis.title.x = element_text(colour = "#3A3A3A", hjust = 0.04, vjust = 0.5, size = 9)) +
geom_text(aes(label = booking_count), position = position_dodge(width = 0.7), vjust = -0.1, hjust = -0.05 , size = 2.6, colour = "#3A3A3A")
The analysis and charts provide us the following insights:
Portugal has the highest number of bookings, followed by the UK and France.
Portugal has the highest number of cancellations, followed by the UK and Spain.
We can also see that most of the cancellations are made by customers of type “Transient”.
Overall, this visualization gives us a quick understanding of the countries that are most popular for bookings and cancellations the type of customers that make most bookings.
The number of successful stays (Check-Out status) is high, and the number of canceled bookings (Canceled status) is also significant, indicating a negative performance for the hotel’s overall business.
The factors that contributes the cancellations as follows;
a. Lead time, booking with 550-650 days is 100% of cancellations.
b. Market Segment = Online TA
c. Customer Type = Transient
d. Room Type = A
e. Deposit Type = No Deposit
f. Country = Portugal
Trends and patterns
a. Seasonality pattern is clearly pronounced for City Hotel where the peak cancellations is during summer months and trough cancellations is during the winter months.
b. For Resort Hotel is relatively stable, there is no fluctuation number either cancellation or booking.
Revenue and Occupancy Rate
a. City Hotel generated more revenue than Resort Hotel in all three years. Yet, an interesting situation during summer months is Resort Hotel generated more revenue than City Hotel.
b. Resort Hotel has higher occupancy rate in all three years than City Hotel.
Offer flexible cancellations policies for market segment “Online TA”, allowing customers to cancel their bookings within a certain time-frame without penalty. We can use (lead time with the lowest cancellation rate).
Implementing a dynamic pricing strategy for room type A: develop pricing rules based on factors such as demand, seasonality & occupancy rates.
Invest in technology: Use pricing software to automate price updates in real-time based, and monitor the performance of the pricing strategy and make adjustments as needed.
Targeted marketing campaign:
Focus on promoting City hotel bookings during the winter months, the campaign could target demographics that are most likely to travel during the winter.
Emphasize the benefits of booking early such as offering early book discounts, to minimize the high cancellations rate associated with longer lead times. Encourage guest to book Room type D, which have a lower cancellation rate than room type A. Attract more bookings from the least canceled countries such as Ireland, Brazil, USA & Belgium. Highlight a specific promotions and deals that are exclusive to those countries traveler.
Encouraging stable bookings for Resort hotel, could promote the unique features and amenities that differentiate it from other accommodation options. For example all-inclusive package, couple package, family holidays, retreat package, etc.
The data set was not provided the customer feedback, that we can use to identify areas for improvement such as check-out procedures, room amenities, and customer satisfactions.
For further analysis, we can approach some areas
Demographics which countries likely booking in summer months and winter months.
Key measurement metrics such as: ADR (average daily rate), RevPar (revenue per available room), and ROI (return of investment).