The data describes demand data of two different types of hotels. One of the hotels is a resort hotel and the other is a city hotel. Both of them share the same structure, with 31 variables describing the 40,060 observations and 79,330 observations. Each observation represents a hotel booking.Both data sets comprehend bookings due to arrive between the 1st of July of 2015 and the 31st of August 2017, including bookings that effectively arrived and bookings that were cancelled.
Through our analysis, we were able to answer key business questions which are stated below:
library(tidyverse)
library(ggplot2)
The data originally comes from an open hotel booking demand dataset on the website Antonio, Almeida and Nunes, 2019. It was published on Twitter as a part of weekly project in R named TidyTuesday.
The data was acquired from hotels’ Property Management System (PMS) SQL databases using TSQL queries. This data describes demand data of two different types of hotels. One of the hotels is a resort hotel and the other is a city hotel. Both of them share the same structure, with 32 variables describing the 40,060 observations and 79,330 observations respectively. Each observation represents a hotel booking.Both data sets comprehend bookings due to arrive between the 1st of July of 2015 and the 31st of August 2017, including bookings that effectively arrived and bookings that were cancelled.
data<- read.csv("C:/Users/srhrs/OneDrive/Documents/Wrangling/Week 3/hotels.csv",stringsAsFactors = TRUE)
str(data)
## 'data.frame': 119390 obs. of 32 variables:
## $ hotel : Factor w/ 2 levels "City Hotel","Resort Hotel": 2 2 2 2 2 2 2 2 2 2 ...
## $ is_canceled : int 0 0 0 0 0 0 0 0 1 1 ...
## $ lead_time : int 342 737 7 13 14 14 0 9 85 75 ...
## $ arrival_date_year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ arrival_date_month : Factor w/ 12 levels "April","August",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ arrival_date_week_number : int 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : int 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : int 2 2 1 1 2 2 2 2 2 2 ...
## $ children : int 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : int 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : Factor w/ 5 levels "BB","FB","HB",..: 1 1 1 1 1 1 1 2 1 3 ...
## $ country : Factor w/ 178 levels "ABW","AGO","AIA",..: 137 137 60 60 60 60 137 137 137 137 ...
## $ market_segment : Factor w/ 8 levels "Aviation","Complementary",..: 4 4 4 3 7 7 4 4 7 6 ...
## $ distribution_channel : Factor w/ 5 levels "Corporate","Direct",..: 2 2 2 1 4 4 2 2 4 4 ...
## $ is_repeated_guest : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: int 0 0 0 0 0 0 0 0 0 0 ...
## $ reserved_room_type : Factor w/ 10 levels "A","B","C","D",..: 3 3 1 1 1 1 3 3 1 4 ...
## $ assigned_room_type : Factor w/ 12 levels "A","B","C","D",..: 3 3 3 1 1 1 3 3 1 4 ...
## $ booking_changes : int 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : Factor w/ 3 levels "No Deposit","Non Refund",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ agent : Factor w/ 334 levels "1","10","103",..: 334 334 334 157 103 103 334 156 103 40 ...
## $ company : Factor w/ 353 levels "10","100","101",..: 353 353 353 353 353 353 353 353 353 353 ...
## $ days_in_waiting_list : int 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : Factor w/ 4 levels "Contract","Group",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ adr : num 0 0 75 75 98 ...
## $ required_car_parking_spaces : int 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : int 0 0 0 0 1 1 0 1 1 0 ...
## $ reservation_status : Factor w/ 3 levels "Canceled","Check-Out",..: 2 2 2 2 2 2 2 2 1 1 ...
## $ reservation_status_date : Factor w/ 926 levels "2014-10-17","2014-11-18",..: 122 122 123 123 124 124 124 124 73 62 ...
The dataset contains 32 variables and 119390 records.
# Generating a table for the first 10 rows in the dataset.
knitr::kable(head(data,n = 10), "pandoc")
| hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | children | babies | meal | country | market_segment | distribution_channel | is_repeated_guest | previous_cancellations | previous_bookings_not_canceled | reserved_room_type | assigned_room_type | booking_changes | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0 | 0 | BB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 3 | No Deposit | NULL | NULL | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
| Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0 | 0 | BB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 4 | No Deposit | NULL | NULL | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
| Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0 | 0 | BB | GBR | Direct | Direct | 0 | 0 | 0 | A | C | 0 | No Deposit | NULL | NULL | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
| Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0 | 0 | BB | GBR | Corporate | Corporate | 0 | 0 | 0 | A | A | 0 | No Deposit | 304 | NULL | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
| Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0 | 0 | BB | GBR | Online TA | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 240 | NULL | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 2015-07-03 |
| Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0 | 0 | BB | GBR | Online TA | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 240 | NULL | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 2015-07-03 |
| Resort Hotel | 0 | 0 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0 | 0 | BB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 0 | No Deposit | NULL | NULL | 0 | Transient | 107.0 | 0 | 0 | Check-Out | 2015-07-03 |
| Resort Hotel | 0 | 9 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0 | 0 | FB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 0 | No Deposit | 303 | NULL | 0 | Transient | 103.0 | 0 | 1 | Check-Out | 2015-07-03 |
| Resort Hotel | 1 | 85 | 2015 | July | 27 | 1 | 0 | 3 | 2 | 0 | 0 | BB | PRT | Online TA | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 240 | NULL | 0 | Transient | 82.0 | 0 | 1 | Canceled | 2015-05-06 |
| Resort Hotel | 1 | 75 | 2015 | July | 27 | 1 | 0 | 3 | 2 | 0 | 0 | HB | PRT | Offline TA/TO | TA/TO | 0 | 0 | 0 | D | D | 0 | No Deposit | 15 | NULL | 0 | Transient | 105.5 | 0 | 0 | Canceled | 2015-04-22 |
summary(data)
## hotel is_canceled lead_time arrival_date_year
## City Hotel :79330 Min. :0.0000 Min. : 0 Min. :2015
## Resort Hotel:40060 1st Qu.:0.0000 1st Qu.: 18 1st Qu.:2016
## Median :0.0000 Median : 69 Median :2016
## Mean :0.3704 Mean :104 Mean :2016
## 3rd Qu.:1.0000 3rd Qu.:160 3rd Qu.:2017
## Max. :1.0000 Max. :737 Max. :2017
##
## arrival_date_month arrival_date_week_number arrival_date_day_of_month
## August :13877 Min. : 1.00 Min. : 1.0
## July :12661 1st Qu.:16.00 1st Qu.: 8.0
## May :11791 Median :28.00 Median :16.0
## October:11160 Mean :27.17 Mean :15.8
## April :11089 3rd Qu.:38.00 3rd Qu.:23.0
## June :10939 Max. :53.00 Max. :31.0
## (Other):47873
## stays_in_weekend_nights stays_in_week_nights adults
## Min. : 0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 1.0 1st Qu.: 2.000
## Median : 1.0000 Median : 2.0 Median : 2.000
## Mean : 0.9276 Mean : 2.5 Mean : 1.856
## 3rd Qu.: 2.0000 3rd Qu.: 3.0 3rd Qu.: 2.000
## Max. :19.0000 Max. :50.0 Max. :55.000
##
## children babies meal country
## Min. : 0.0000 Min. : 0.000000 BB :92310 PRT :48590
## 1st Qu.: 0.0000 1st Qu.: 0.000000 FB : 798 GBR :12129
## Median : 0.0000 Median : 0.000000 HB :14463 FRA :10415
## Mean : 0.1039 Mean : 0.007949 SC :10650 ESP : 8568
## 3rd Qu.: 0.0000 3rd Qu.: 0.000000 Undefined: 1169 DEU : 7287
## Max. :10.0000 Max. :10.000000 ITA : 3766
## NA's :4 (Other):28635
## market_segment distribution_channel is_repeated_guest
## Online TA :56477 Corporate: 6677 Min. :0.00000
## Offline TA/TO:24219 Direct :14645 1st Qu.:0.00000
## Groups :19811 GDS : 193 Median :0.00000
## Direct :12606 TA/TO :97870 Mean :0.03191
## Corporate : 5295 Undefined: 5 3rd Qu.:0.00000
## Complementary: 743 Max. :1.00000
## (Other) : 239
## previous_cancellations previous_bookings_not_canceled reserved_room_type
## Min. : 0.00000 Min. : 0.0000 A :85994
## 1st Qu.: 0.00000 1st Qu.: 0.0000 D :19201
## Median : 0.00000 Median : 0.0000 E : 6535
## Mean : 0.08712 Mean : 0.1371 F : 2897
## 3rd Qu.: 0.00000 3rd Qu.: 0.0000 G : 2094
## Max. :26.00000 Max. :72.0000 B : 1118
## (Other): 1551
## assigned_room_type booking_changes deposit_type agent
## A :74053 Min. : 0.0000 No Deposit:104641 9 :31961
## D :25322 1st Qu.: 0.0000 Non Refund: 14587 NULL :16340
## E : 7806 Median : 0.0000 Refundable: 162 240 :13922
## F : 3751 Mean : 0.2211 1 : 7191
## G : 2553 3rd Qu.: 0.0000 14 : 3640
## C : 2375 Max. :21.0000 7 : 3539
## (Other): 3530 (Other):42797
## company days_in_waiting_list customer_type
## NULL :112593 Min. : 0.000 Contract : 4076
## 40 : 927 1st Qu.: 0.000 Group : 577
## 223 : 784 Median : 0.000 Transient :89613
## 67 : 267 Mean : 2.321 Transient-Party:25124
## 45 : 250 3rd Qu.: 0.000
## 153 : 215 Max. :391.000
## (Other): 4354
## adr required_car_parking_spaces total_of_special_requests
## Min. : -6.38 Min. :0.00000 Min. :0.0000
## 1st Qu.: 69.29 1st Qu.:0.00000 1st Qu.:0.0000
## Median : 94.58 Median :0.00000 Median :0.0000
## Mean : 101.83 Mean :0.06252 Mean :0.5714
## 3rd Qu.: 126.00 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :5400.00 Max. :8.00000 Max. :5.0000
##
## reservation_status reservation_status_date
## Canceled :43017 2015-10-21: 1461
## Check-Out:75166 2015-07-06: 805
## No-Show : 1207 2016-11-25: 790
## 2015-01-01: 763
## 2016-01-18: 625
## 2015-07-02: 469
## (Other) :114477
Summary for variables of concern
adults column has a minimum value of 0. We must inspect those records.distribution_channel has 5 Undefined values.agent, country and company columns have records stored under the category NULL.adr column has a negative value which needs to be removed.# Calculating null value count for all columns
colSums(is.na(data))
## hotel is_canceled
## 0 0
## lead_time arrival_date_year
## 0 0
## arrival_date_month arrival_date_week_number
## 0 0
## arrival_date_day_of_month stays_in_weekend_nights
## 0 0
## stays_in_week_nights adults
## 0 0
## children babies
## 4 0
## meal country
## 0 0
## market_segment distribution_channel
## 0 0
## is_repeated_guest previous_cancellations
## 0 0
## previous_bookings_not_canceled reserved_room_type
## 0 0
## assigned_room_type booking_changes
## 0 0
## deposit_type agent
## 0 0
## company days_in_waiting_list
## 0 0
## customer_type adr
## 0 0
## required_car_parking_spaces total_of_special_requests
## 0 0
## reservation_status reservation_status_date
## 0 0
children with 4 missing values.children column.# Finding summary statistics for 'children'
summary(data$children)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.0000 0.0000 0.1039 0.0000 10.0000 4
# Imputing missing values in 'children' column
data$children[is.na(data$children)] <- 0
After imputing, the number of missing values in the dataset is 0.
sum(is.na(data))
## [1] 0
Identifying observations where the total number of guests (adults+children+babies) is 0 and removing them as they are not required for further analysis.
attach(data)
length(which(adults+children+babies==0))
## [1] 180
data <- data[-which(adults+children+babies==0), ]
dim(data)
## [1] 119210 32
We found 180 observations where the total number of guests are 0 and hence removed them.
From the summary statistics it was observed that there were values marked “Undefined” for the variable distribution channel. Hence removing those observations as there are only 5 such values.
data <- subset(data, distribution_channel!="Undefined")
summary(data$distribution_channel)
## Corporate Direct GDS TA/TO Undefined
## 6651 14611 193 97750 0
dim(data)
## [1] 119205 32
Below are the plots of various numeric variables:
par(mfrow = c(3,3), oma = c(1,1,0,0) + 0.1, mar = c(3,3,1,1) + 0.1)
boxplot(data$is_canceled, main = "is_canceled")
boxplot(data$lead_time, main = "lead_time")
boxplot(data$arrival_date_year, main = "arrival_date_year")
boxplot(data$arrival_date_week_number, main = "arrival_date_week_number")
boxplot(data$arrival_date_day_of_month, main = "arrival_date_day_of_month")
boxplot(data$stays_in_weekend_nights, main = "stays_in_weekend_nights")
boxplot(data$stays_in_week_nights, main = "stays_in_week_nights")
boxplot(data$children, main = "children")
boxplot(data$adults, main = "adults")
boxplot(data$babies, main = "babies")
boxplot(data$is_repeated_guest, main = "is_repeated_gues")
boxplot(data$previous_cancellations, main = "previous_cancellation")
boxplot(data$previous_bookings_not_canceled, main ="previous_bookings_not_canceled")
boxplot(data$booking_changes, main = "booking_changes")
boxplot(data$days_in_waiting_list, main = "days_in_waiting_list")
boxplot(data$adr, main = "adr")
boxplot(data$required_car_parking_spaces, main = "required_car_parking_spaces")
boxplot(data$total_of_special_requests, main = "total_of_special_requests")
Observations from boxplots
Lead time has values higher than 650 which can be possible outliers. However, we did not remove them as we dont know the reason for such high values. It can be removed only after checking the reason from the hotel company.days_in_waiting_list has values as high as 400. This implies that waiting time is greater than 1 year and measures must be taken to reduce this value.adr column, there is one value larger than 5000. It can be an outlier. We should remove this value as it significantly affects the summary statistics.data <- data[!(data$adr < 0 | data$adr == 5400),]
boxplot(data$adr, main = "Boxplot : Avg. Daily Price")
We removed observations which were negative and also removed an outlier with value 5400 for our further analysis.
adr column.summary(data$adr)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 69.50 94.95 101.93 126.00 510.00
adrhist(data$adr, prob = T, col ="grey", breaks = 100, xlim = c(min(data$adr) - 1 ,max(data$adr) + 1), main = "Histogram of Avg. Daily Price", xlab = "Avg. Daily Price")
knitr::kable(head(data,n=10), "pandoc")
| hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | children | babies | meal | country | market_segment | distribution_channel | is_repeated_guest | previous_cancellations | previous_bookings_not_canceled | reserved_room_type | assigned_room_type | booking_changes | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0 | 0 | BB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 3 | No Deposit | NULL | NULL | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
| Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0 | 0 | BB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 4 | No Deposit | NULL | NULL | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
| Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0 | 0 | BB | GBR | Direct | Direct | 0 | 0 | 0 | A | C | 0 | No Deposit | NULL | NULL | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
| Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0 | 0 | BB | GBR | Corporate | Corporate | 0 | 0 | 0 | A | A | 0 | No Deposit | 304 | NULL | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
| Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0 | 0 | BB | GBR | Online TA | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 240 | NULL | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 2015-07-03 |
| Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0 | 0 | BB | GBR | Online TA | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 240 | NULL | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 2015-07-03 |
| Resort Hotel | 0 | 0 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0 | 0 | BB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 0 | No Deposit | NULL | NULL | 0 | Transient | 107.0 | 0 | 0 | Check-Out | 2015-07-03 |
| Resort Hotel | 0 | 9 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0 | 0 | FB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 0 | No Deposit | 303 | NULL | 0 | Transient | 103.0 | 0 | 1 | Check-Out | 2015-07-03 |
| Resort Hotel | 1 | 85 | 2015 | July | 27 | 1 | 0 | 3 | 2 | 0 | 0 | BB | PRT | Online TA | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 240 | NULL | 0 | Transient | 82.0 | 0 | 1 | Canceled | 2015-05-06 |
| Resort Hotel | 1 | 75 | 2015 | July | 27 | 1 | 0 | 3 | 2 | 0 | 0 | HB | PRT | Offline TA/TO | TA/TO | 0 | 0 | 0 | D | D | 0 | No Deposit | 15 | NULL | 0 | Transient | 105.5 | 0 | 0 | Canceled | 2015-04-22 |
hist(data$arrival_date_year, prob = T, main = "Histogram Yearly Bookings", xlab = "Yearly Bookings")
Majority bookings can be seen in 2016
Let us analyze the number of bookings by the type of hotel for each available year.
ggplot(data = data,aes(arrival_date_year,fill = (hotel))) +
geom_bar(position = 'dodge') +
ylab("Number Of Bookings") +
xlab("Years") +
ggtitle("Total bookings across years") +
labs(fill = 'Hotel Type')
Total number of bookings in 2016 seems to be higher compared to 2015 & 2017. We can use the arrival_date_month field to view the data across every month for the availabe years.
data_2015 <- data[data$arrival_date_year == 2015,]
data_2016 <- data[data$arrival_date_year == 2016,]
data_2017 <- data[data$arrival_date_year == 2017,]
ggplot(data = data_2015,aes(arrival_date_month,fill = (hotel))) +
geom_bar(position = 'dodge') +
ylab("Number Of Bookings") +
xlab("Years") +
ggtitle("Total bookings for year 2015") +
labs(fill = 'Hotel Type')
ggplot(data = data_2016,aes(arrival_date_month,fill = (hotel))) + theme(axis.text.x = element_text(
angle = 90,
vjust = 0.5,
hjust = 1
)) +
geom_bar(position = 'dodge') +
ylab("Number Of Bookings") +
xlab("Years") +
ggtitle("Total bookings for year 2016") +
labs(fill = 'Hotel Type')
ggplot(data = data_2017,aes(arrival_date_month,fill = (hotel))) +
geom_bar(position = 'dodge') +
ylab("Number Of Bookings") +
xlab("Years") +
ggtitle("Total bookings for years 2017") +
labs(fill = 'Hotel Type')
Looking at the above plots we can see that complete data is available only for 2 quarters for years 2015 and 2017 and the data is missing for remaining quarters whereas, for 2016 the data is available for the entire year. Due to this the number of bookings for 2016 is showing a much higher value than the rest of the years.
data.weeks <- (data %>%
group_by(arrival_date_year) %>%
summarize(min = min(arrival_date_week_number),
max = max(arrival_date_week_number),
weeks = max(arrival_date_week_number) - min(arrival_date_week_number) + 1))
knitr::kable(data.weeks)
| arrival_date_year | min | max | weeks |
|---|---|---|---|
| 2015 | 27 | 53 | 27 |
| 2016 | 1 | 53 | 53 |
| 2017 | 1 | 35 | 35 |
From the above table we can see the data for number of weeks is not same in each available year. So we find the average number of bookings and average revenue generated per week in the 2 types of hotel segments and compare them.
data.summary <- data %>%
filter(is_canceled == 0) %>%
group_by(hotel,arrival_date_year) %>%
summarize(hotel.count = n(),
hotel.rates = sum(adr)) %>%
transmute(arrival_date_year,
avg_weekly_bookings = case_when(arrival_date_year == '2015' ~ (hotel.count/27),
arrival_date_year == '2017' ~ (hotel.count/35),
TRUE ~ (hotel.count/53)),
avg_weekly_earnings = case_when(arrival_date_year == '2015' ~ (hotel.rates/27),
arrival_date_year == '2017' ~ (hotel.rates/35),
TRUE ~ round(hotel.rates/53)))
knitr::kable(data.summary)
| hotel | arrival_date_year | avg_weekly_bookings | avg_weekly_earnings |
|---|---|---|---|
| City Hotel | 2015 | 283.5185 | 24988.38 |
| City Hotel | 2016 | 427.6038 | 44620.00 |
| City Hotel | 2017 | 450.4571 | 52772.59 |
| Resort Hotel | 2015 | 228.5185 | 20527.46 |
| Resort Hotel | 2016 | 257.2830 | 21589.00 |
| Resort Hotel | 2017 | 260.5429 | 26533.26 |
From the above table we can say the total number of bookings and earnings are increasing year over year for both types of hotels. But, this increase is more in City Hotels compared to Resort Hotels. Also the number of bookings increased by upto 56% from 2015 to 2016 for City Hotels but there is only a 6% increase in bookings from 2016 to 2017.
data.revenue_ctype <- data %>%
filter(is_canceled == 0) %>%
group_by(hotel, customer_type) %>%
summarize(hotel.count = n(),
hotel.rates = sum(adr))
ggplot(data = data.revenue_ctype,aes(customer_type, hotel.rates, fill = customer_type)) +
geom_bar(stat = "identity", width = 0.25) + scale_y_continuous(labels = scales::comma) +
scale_fill_discrete(name = "Customer Type") + theme(axis.text.x = element_text(
angle = 90,
vjust = 0.5,
hjust = 1
)) +
facet_wrap(~hotel) + xlab("Customer Type") + ylab("Revenue")
From the above output, for both hotel types during the operational weeks, customers from Transient and Transient-Party types have contributed to the highest and the 2nd highest revenues with close to 75% and 20% of the bookings respectively for both hotel types.
We will now explore the countries from which the guests belong to.
data_country <- data %>% group_by(country) %>% dplyr::summarise(booking_count = n()) %>% arrange(desc(booking_count))
top_n(data_country,10,booking_count) %>%
ggplot(.,aes(country, booking_count)) +
geom_bar(stat = "identity", width = 0.25, fill ="dark blue")
The above graph shows the top 10 countries from where the guests come from. 40% of the guests are from Portugal followed by Great Britain and France contributing around 10% of the bookings each.
Now we will analyse the distribution channel which customers generally choose for their bookings.
data_DC <- data %>% filter(!grepl("Undefined", distribution_channel)) %>% group_by(distribution_channel) %>% dplyr::summarise(booking_count = n()) %>% arrange(desc(booking_count))
ggplot(data=data_DC,aes(distribution_channel, booking_count)) +
geom_bar(stat = "identity", width = 0.25, fill ="dark blue")
As shown above, majority (i.e. 82%) of the bookings come from travel agents/tour operators, therefore the hotel company can tie up with more number of such agents/operators and retain the existing ones with lucrative offers and commissions.
We will now analyze the bookings on weekdays and weekends for different hotel types.
week_summary <- data %>%
filter(is_canceled == 0) %>%
group_by(hotel,arrival_date_year) %>%
summarize(avg_weekend_stay = round(sum(stays_in_weekend_nights)/2),
avg_weekday_stay = round(sum(stays_in_week_nights)/5))
knitr::kable(week_summary)
| hotel | arrival_date_year | avg_weekend_stay | avg_weekday_stay |
|---|---|---|---|
| City Hotel | 2015 | 2850 | 3078 |
| City Hotel | 2016 | 8968 | 9498 |
| City Hotel | 2017 | 6591 | 6954 |
| Resort Hotel | 2015 | 3675 | 3950 |
| Resort Hotel | 2016 | 7368 | 7846 |
| Resort Hotel | 2017 | 5353 | 5607 |
Here we can see the number of bookings on weekdays is slightly higher compared to weekend bookings across hotel types. This is interesting as we would generally believe the hotel bookings to be higher on weekends compared to weekdays. So the hotel company can further investigate by interacting with the customers and identifying the customers who prefer weekday bookings & those who prefer weekends so that they can target the respective type of customers in a much more streamlined way.
We will now study the room type preferred by guests.
room_summary <- data %>%
filter(is_canceled == 0) %>%
group_by(reserved_room_type) %>%
summarize(room_count = n()) %>%
arrange(-room_count)
knitr::kable(room_summary)
| reserved_room_type | room_count |
|---|---|
| A | 52246 |
| D | 13078 |
| E | 4611 |
| F | 2014 |
| G | 1329 |
| B | 748 |
| C | 623 |
| H | 356 |
| L | 4 |
ggplot(data = data,aes(reserved_room_type,fill = (hotel))) +
geom_bar(position = 'dodge') +
ylab("Number Of Bookings") +
xlab("Room Type") +
ggtitle("Room Type Preferred") +
labs(fill = 'Hotel Type')
From the above table and graph, we see that room type A is in high demand for both city and resort hotels, contributing to 79% of the bookings in city hotel and around 58% in the resort hotel. And therefore the company can think of strategies to improve the booking rate in the remaining room types and ensure that the number of rooms of type A in the hotel is sufficient to meet the demand at all times.
Studying the meal type preferred by the guests.
ggplot(data, aes(meal)) +
geom_bar(width=0.25, fill = "dark blue")
The above plots shows that “Bed & Breakfast” is the most preferred meal type by the guests comprising of 78% of the bookings. And therefore the hotel company should further investigate the reasons for the low number of bookings with other type of meal offerings and devise strategies to make them more attractive.
We will now study the cancellation trend across hotel types.
ggplot(data, aes(x = is_canceled, fill = factor(hotel))) +
geom_histogram(binwidth = 0.2) +
scale_x_continuous(breaks = seq(0, 1, 1))
The above plot shows that there are more cancellations in City Hotels than Resort Hotels.
data$total_stay_nights <- rowSums(cbind(data$stays_in_weekend_nights, data$stays_in_week_nights), na.rm = TRUE)
data <- data %>% mutate(revenue = (total_stay_nights * adr))
data_hotel_revenue <- data %>% filter(is_canceled == 0) %>% group_by(hotel) %>% dplyr::summarise(total_bookings= n(), total_revenue = sum(revenue))
data_lm <- data %>% filter(is_canceled == 0) %>% group_by(arrival_date_week_number,hotel,customer_type, meal, is_repeated_guest) %>% dplyr::summarise(total_bookings = n(), total_guests = sum(adults + children + babies),average_stay_nights = mean(total_stay_nights, na.rm = TRUE), week_adr = mean(adr, na.rm = TRUE) ,total_revenue = sum(revenue))
model <- lm(total_revenue ~ as.factor(hotel) + as.factor(customer_type) + as.factor(meal) + as.factor(is_repeated_guest) + average_stay_nights + week_adr, data = data_lm)
summary(model)
##
## Call:
## lm(formula = total_revenue ~ as.factor(hotel) + as.factor(customer_type) +
## as.factor(meal) + as.factor(is_repeated_guest) + average_stay_nights +
## week_adr, data = data_lm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56659 -20715 -5261 10370 241914
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6596.36 3762.64 1.753 0.0798
## as.factor(hotel)Resort Hotel -1052.04 2249.37 -0.468 0.6401
## as.factor(customer_type)Group -3424.63 3654.12 -0.937 0.3488
## as.factor(customer_type)Transient 41706.45 3055.42 13.650 < 2e-16
## as.factor(customer_type)Transient-Party 14657.22 3154.40 4.647 3.7e-06
## as.factor(meal)FB -60124.91 4596.38 -13.081 < 2e-16
## as.factor(meal)HB -33366.26 2514.50 -13.270 < 2e-16
## as.factor(meal)SC -32748.99 2776.39 -11.796 < 2e-16
## as.factor(meal)Undefined -53858.72 4524.97 -11.903 < 2e-16
## as.factor(is_repeated_guest)1 -34168.92 2278.91 -14.994 < 2e-16
## average_stay_nights 664.09 354.40 1.874 0.0612
## week_adr 214.85 21.16 10.152 < 2e-16
##
## (Intercept) .
## as.factor(hotel)Resort Hotel
## as.factor(customer_type)Group
## as.factor(customer_type)Transient ***
## as.factor(customer_type)Transient-Party ***
## as.factor(meal)FB ***
## as.factor(meal)HB ***
## as.factor(meal)SC ***
## as.factor(meal)Undefined ***
## as.factor(is_repeated_guest)1 ***
## average_stay_nights .
## week_adr ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34720 on 1361 degrees of freedom
## Multiple R-squared: 0.3776, Adjusted R-squared: 0.3725
## F-statistic: 75.05 on 11 and 1361 DF, p-value: < 2.2e-16
The conclusions from the analysis of the Hotel data are :
Around 66% of yearly bookings come from the City Hotel and is much higher than the resort hotel and therefore company can devise strategies accordingly as City hotel is the major revenue generator.
70% of the bookings come from 5 countries namely Portugal, Great Britain, France, Spain and Germany among which 40% comes from Portugal alone. Therefore, the company can design strategies and promotions to attract the target customers during the months when booking are low.
82% of the bookings come from travel agents/tour operators, therefore the hotel company can tie up with more number of such agents/operators and retain the existing ones with lucrative offers and commissions.
The most preferred room type is A. Therefore the company should plan to increase the number of type A rooms and investigate the reasons for such low volume of bookings for the other room types.
The most preferred meal type is Bed & Breakfast which comprise of 78% of the bookings. The company can therefore be prepared to have required amount of resources for breakfast in future and also investigate on the low number of bookings for the other type of meal offerings and devise strategies to make them more attractive.
On an average, the yearly cancellation rate is 42% for city hotel and 28% for resort hotel. Therefore, the company should further deep dive into the reasons for such cancellations and look for ways to reduce the cancellation rate.
The Revenue is positively affected by adr, Total nights stayed by the guest, Transient and Transient Party customers.