Through this analysis, we were able to answer key business questions which are stated below:
This data describes demand data of two different types of hotels. One of the hotels is a resort hotel and the other is a city hotel. Both of them share the same structure, with 31 variables describing the 40,060 observations and 79,330 observations. Each observation represents a hotel booking.Both data sets comprehend bookings due to arrive between the 1st of July of 2015 and the 31st of August 2017, including bookings that effectively arrived and bookings that were cancelled.
library(tidyverse)
library(ggplot2)
The data originally comes from an open hotel booking demand dataset on the website Antonio, Almeida and Nunes, 2019. It was published on Twitter as a part of weekly project in R named TidyTuesday.
The data was acquired from hotels’ Property Management System (PMS) SQL databases using TSQL queries. This data describes demand data of two different types of hotels. One of the hotels is a resort hotel and the other is a city hotel. Both of them share the same structure, with 32 variables describing the 40,060 observations and 79,330 observations respectively. Each observation represents a hotel booking.Both data sets comprehend bookings due to arrive between the 1st of July of 2015 and the 31st of August 2017, including bookings that effectively arrived and bookings that were cancelled.
data<- read.csv("E:/Data_Wrangling/hotels/hotels.csv",stringsAsFactors = TRUE)
str(data)
## 'data.frame': 119390 obs. of 32 variables:
## $ hotel : Factor w/ 2 levels "City Hotel","Resort Hotel": 2 2 2 2 2 2 2 2 2 2 ...
## $ is_canceled : int 0 0 0 0 0 0 0 0 1 1 ...
## $ lead_time : int 342 737 7 13 14 14 0 9 85 75 ...
## $ arrival_date_year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ arrival_date_month : Factor w/ 12 levels "April","August",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ arrival_date_week_number : int 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : int 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : int 2 2 1 1 2 2 2 2 2 2 ...
## $ children : int 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : int 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : Factor w/ 5 levels "BB","FB","HB",..: 1 1 1 1 1 1 1 2 1 3 ...
## $ country : Factor w/ 178 levels "ABW","AGO","AIA",..: 137 137 60 60 60 60 137 137 137 137 ...
## $ market_segment : Factor w/ 8 levels "Aviation","Complementary",..: 4 4 4 3 7 7 4 4 7 6 ...
## $ distribution_channel : Factor w/ 5 levels "Corporate","Direct",..: 2 2 2 1 4 4 2 2 4 4 ...
## $ is_repeated_guest : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: int 0 0 0 0 0 0 0 0 0 0 ...
## $ reserved_room_type : Factor w/ 10 levels "A","B","C","D",..: 3 3 1 1 1 1 3 3 1 4 ...
## $ assigned_room_type : Factor w/ 12 levels "A","B","C","D",..: 3 3 3 1 1 1 3 3 1 4 ...
## $ booking_changes : int 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : Factor w/ 3 levels "No Deposit","Non Refund",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ agent : Factor w/ 334 levels "1","10","103",..: 334 334 334 157 103 103 334 156 103 40 ...
## $ company : Factor w/ 353 levels "10","100","101",..: 353 353 353 353 353 353 353 353 353 353 ...
## $ days_in_waiting_list : int 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : Factor w/ 4 levels "Contract","Group",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ adr : num 0 0 75 75 98 ...
## $ required_car_parking_spaces : int 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : int 0 0 0 0 1 1 0 1 1 0 ...
## $ reservation_status : Factor w/ 3 levels "Canceled","Check-Out",..: 2 2 2 2 2 2 2 2 1 1 ...
## $ reservation_status_date : Factor w/ 926 levels "2014-10-17","2014-11-18",..: 122 122 123 123 124 124 124 124 73 62 ...
The dataset contains 32 variables and 119390 records.
# Generating a table for the first 10 rows in the dataset.
knitr::kable(head(data,n = 10), "pandoc")
| hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | children | babies | meal | country | market_segment | distribution_channel | is_repeated_guest | previous_cancellations | previous_bookings_not_canceled | reserved_room_type | assigned_room_type | booking_changes | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0 | 0 | BB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 3 | No Deposit | NULL | NULL | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
| Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0 | 0 | BB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 4 | No Deposit | NULL | NULL | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
| Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0 | 0 | BB | GBR | Direct | Direct | 0 | 0 | 0 | A | C | 0 | No Deposit | NULL | NULL | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
| Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0 | 0 | BB | GBR | Corporate | Corporate | 0 | 0 | 0 | A | A | 0 | No Deposit | 304 | NULL | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
| Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0 | 0 | BB | GBR | Online TA | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 240 | NULL | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 2015-07-03 |
| Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0 | 0 | BB | GBR | Online TA | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 240 | NULL | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 2015-07-03 |
| Resort Hotel | 0 | 0 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0 | 0 | BB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 0 | No Deposit | NULL | NULL | 0 | Transient | 107.0 | 0 | 0 | Check-Out | 2015-07-03 |
| Resort Hotel | 0 | 9 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0 | 0 | FB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 0 | No Deposit | 303 | NULL | 0 | Transient | 103.0 | 0 | 1 | Check-Out | 2015-07-03 |
| Resort Hotel | 1 | 85 | 2015 | July | 27 | 1 | 0 | 3 | 2 | 0 | 0 | BB | PRT | Online TA | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 240 | NULL | 0 | Transient | 82.0 | 0 | 1 | Canceled | 2015-05-06 |
| Resort Hotel | 1 | 75 | 2015 | July | 27 | 1 | 0 | 3 | 2 | 0 | 0 | HB | PRT | Offline TA/TO | TA/TO | 0 | 0 | 0 | D | D | 0 | No Deposit | 15 | NULL | 0 | Transient | 105.5 | 0 | 0 | Canceled | 2015-04-22 |
# Calculating null value count for all columns
colSums(is.na(data))
## hotel is_canceled
## 0 0
## lead_time arrival_date_year
## 0 0
## arrival_date_month arrival_date_week_number
## 0 0
## arrival_date_day_of_month stays_in_weekend_nights
## 0 0
## stays_in_week_nights adults
## 0 0
## children babies
## 4 0
## meal country
## 0 0
## market_segment distribution_channel
## 0 0
## is_repeated_guest previous_cancellations
## 0 0
## previous_bookings_not_canceled reserved_room_type
## 0 0
## assigned_room_type booking_changes
## 0 0
## deposit_type agent
## 0 0
## company days_in_waiting_list
## 0 0
## customer_type adr
## 0 0
## required_car_parking_spaces total_of_special_requests
## 0 0
## reservation_status reservation_status_date
## 0 0
children with 4 missing values.children column.# Finding summary statistics for 'children'
summary(data$children)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.0000 0.0000 0.1039 0.0000 10.0000 4
# Imputing missing values in 'children' column
data$children[is.na(data$children)] <- 0
After imputing, the number of missing values in the dataset is 0.
sum(is.na(data))
## [1] 0
summary(data)
## hotel is_canceled lead_time arrival_date_year
## City Hotel :79330 Min. :0.0000 Min. : 0 Min. :2015
## Resort Hotel:40060 1st Qu.:0.0000 1st Qu.: 18 1st Qu.:2016
## Median :0.0000 Median : 69 Median :2016
## Mean :0.3704 Mean :104 Mean :2016
## 3rd Qu.:1.0000 3rd Qu.:160 3rd Qu.:2017
## Max. :1.0000 Max. :737 Max. :2017
##
## arrival_date_month arrival_date_week_number arrival_date_day_of_month
## August :13877 Min. : 1.00 Min. : 1.0
## July :12661 1st Qu.:16.00 1st Qu.: 8.0
## May :11791 Median :28.00 Median :16.0
## October:11160 Mean :27.17 Mean :15.8
## April :11089 3rd Qu.:38.00 3rd Qu.:23.0
## June :10939 Max. :53.00 Max. :31.0
## (Other):47873
## stays_in_weekend_nights stays_in_week_nights adults
## Min. : 0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 1.0 1st Qu.: 2.000
## Median : 1.0000 Median : 2.0 Median : 2.000
## Mean : 0.9276 Mean : 2.5 Mean : 1.856
## 3rd Qu.: 2.0000 3rd Qu.: 3.0 3rd Qu.: 2.000
## Max. :19.0000 Max. :50.0 Max. :55.000
##
## children babies meal country
## Min. : 0.0000 Min. : 0.000000 BB :92310 PRT :48590
## 1st Qu.: 0.0000 1st Qu.: 0.000000 FB : 798 GBR :12129
## Median : 0.0000 Median : 0.000000 HB :14463 FRA :10415
## Mean : 0.1039 Mean : 0.007949 SC :10650 ESP : 8568
## 3rd Qu.: 0.0000 3rd Qu.: 0.000000 Undefined: 1169 DEU : 7287
## Max. :10.0000 Max. :10.000000 ITA : 3766
## (Other):28635
## market_segment distribution_channel is_repeated_guest
## Online TA :56477 Corporate: 6677 Min. :0.00000
## Offline TA/TO:24219 Direct :14645 1st Qu.:0.00000
## Groups :19811 GDS : 193 Median :0.00000
## Direct :12606 TA/TO :97870 Mean :0.03191
## Corporate : 5295 Undefined: 5 3rd Qu.:0.00000
## Complementary: 743 Max. :1.00000
## (Other) : 239
## previous_cancellations previous_bookings_not_canceled reserved_room_type
## Min. : 0.00000 Min. : 0.0000 A :85994
## 1st Qu.: 0.00000 1st Qu.: 0.0000 D :19201
## Median : 0.00000 Median : 0.0000 E : 6535
## Mean : 0.08712 Mean : 0.1371 F : 2897
## 3rd Qu.: 0.00000 3rd Qu.: 0.0000 G : 2094
## Max. :26.00000 Max. :72.0000 B : 1118
## (Other): 1551
## assigned_room_type booking_changes deposit_type agent
## A :74053 Min. : 0.0000 No Deposit:104641 9 :31961
## D :25322 1st Qu.: 0.0000 Non Refund: 14587 NULL :16340
## E : 7806 Median : 0.0000 Refundable: 162 240 :13922
## F : 3751 Mean : 0.2211 1 : 7191
## G : 2553 3rd Qu.: 0.0000 14 : 3640
## C : 2375 Max. :21.0000 7 : 3539
## (Other): 3530 (Other):42797
## company days_in_waiting_list customer_type
## NULL :112593 Min. : 0.000 Contract : 4076
## 40 : 927 1st Qu.: 0.000 Group : 577
## 223 : 784 Median : 0.000 Transient :89613
## 67 : 267 Mean : 2.321 Transient-Party:25124
## 45 : 250 3rd Qu.: 0.000
## 153 : 215 Max. :391.000
## (Other): 4354
## adr required_car_parking_spaces total_of_special_requests
## Min. : -6.38 Min. :0.00000 Min. :0.0000
## 1st Qu.: 69.29 1st Qu.:0.00000 1st Qu.:0.0000
## Median : 94.58 Median :0.00000 Median :0.0000
## Mean : 101.83 Mean :0.06252 Mean :0.5714
## 3rd Qu.: 126.00 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :5400.00 Max. :8.00000 Max. :5.0000
##
## reservation_status reservation_status_date
## Canceled :43017 2015-10-21: 1461
## Check-Out:75166 2015-07-06: 805
## No-Show : 1207 2016-11-25: 790
## 2015-01-01: 763
## 2016-01-18: 625
## 2015-07-02: 469
## (Other) :114477
Summary for variables of concern
Resort Hotel with 40060 records and City Hotel with 79330 records.August month has recorded maximum number of bookings with 13877 records. This can be seen from the summary statistics of arrival_date_month column.arrival_date_week_number is 28. This shows that there are more number of bookings in the 2nd half of the year.stays_in_weekend_nights) observe longer stays than those in weekends(stays_in_week_nights) i.e., on Saturday, Sunday.adults column has a minimum value of 0. We must inspect those records.distribution_channel has 5 Undefined values.A rooms have 85994 bookings registered against them. Also, there are 74053 bookings assigned for type A rooms.agent and company columns have records stored under the category NULL.adr column has negative values. These must be investigated.hist(data$arrival_date_year, prob = T, main = "Histogram Yearly Bookings", xlab = "Yearly Bookings")
Majority bookings can be seen in 2016
hist(data$adr, prob = T, breaks = 50,xlim = c(min(data$adr) - 1,max(data$adr) + 1), main = "Histogram of Avg. Daily Price", xlab = "Avg. Daily Price")
There seem some outliers in the adr column
Below are the plots of various numeric variables:
par(mfrow = c(3,3), oma = c(1,1,0,0) + 0.1, mar = c(3,3,1,1) + 0.1)
boxplot(data$is_canceled, main = "is_canceled")
boxplot(data$lead_time, main = "lead_time")
boxplot(data$arrival_date_year, main = "arrival_date_year")
boxplot(data$arrival_date_week_number, main = "arrival_date_week_number")
boxplot(data$arrival_date_day_of_month, main = "arrival_date_day_of_month")
boxplot(data$stays_in_weekend_nights, main = "stays_in_weekend_nights")
boxplot(data$stays_in_week_nights, main = "stays_in_week_nights")
boxplot(data$children, main = "children")
boxplot(data$adults, main = "adults")
boxplot(data$babies, main = "babies")
boxplot(data$is_repeated_guest, main = "is_repeated_gues")
boxplot(data$previous_cancellations, main = "previous_cancellation")
boxplot(data$previous_bookings_not_canceled, main ="previous_bookings_not_canceled")
boxplot(data$booking_changes, main = "booking_changes")
boxplot(data$days_in_waiting_list, main = "days_in_waiting_list")
boxplot(data$adr, main = "adr")
boxplot(data$required_car_parking_spaces, main = "required_car_parking_spaces")
boxplot(data$total_of_special_requests, main = "total_of_special_requests")
Observations from boxplots
children column, we observe that there is one booking with 10 children. Most of the other bookings have value less than 4.babies column, we observe that there are two bookings with value greater than 4.days_in_waiting_list has values as high as 400. This implies that waiting time is greater than 1 year and measures must be taken to reduce this value.adr column, there is one value larger than 5000. It can be an outlier. We should remove this value as it significantly affects the summary statistics. The observation with negative adr can also be removed.data <- data[!(data$adr < 0 | data$adr == 5400),]
boxplot(data$adr, main = "Boxplot : Avg. Daily Price")
adr column.summary(data$adr)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 69.29 94.58 101.79 126.00 510.00
adrhist(data$adr, prob = T, col ="grey", breaks = 100, xlim = c(min(data$adr) - 1 ,max(data$adr) + 1), main = "Histogram of Avg. Daily Price", xlab = "Avg. Daily Price")
* Let us check the final dataset.
knitr::kable(head(data,n=10), "pandoc")
| hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | children | babies | meal | country | market_segment | distribution_channel | is_repeated_guest | previous_cancellations | previous_bookings_not_canceled | reserved_room_type | assigned_room_type | booking_changes | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0 | 0 | BB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 3 | No Deposit | NULL | NULL | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
| Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0 | 0 | BB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 4 | No Deposit | NULL | NULL | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
| Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0 | 0 | BB | GBR | Direct | Direct | 0 | 0 | 0 | A | C | 0 | No Deposit | NULL | NULL | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
| Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0 | 0 | BB | GBR | Corporate | Corporate | 0 | 0 | 0 | A | A | 0 | No Deposit | 304 | NULL | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
| Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0 | 0 | BB | GBR | Online TA | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 240 | NULL | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 2015-07-03 |
| Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0 | 0 | BB | GBR | Online TA | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 240 | NULL | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 2015-07-03 |
| Resort Hotel | 0 | 0 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0 | 0 | BB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 0 | No Deposit | NULL | NULL | 0 | Transient | 107.0 | 0 | 0 | Check-Out | 2015-07-03 |
| Resort Hotel | 0 | 9 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0 | 0 | FB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 0 | No Deposit | 303 | NULL | 0 | Transient | 103.0 | 0 | 1 | Check-Out | 2015-07-03 |
| Resort Hotel | 1 | 85 | 2015 | July | 27 | 1 | 0 | 3 | 2 | 0 | 0 | BB | PRT | Online TA | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 240 | NULL | 0 | Transient | 82.0 | 0 | 1 | Canceled | 2015-05-06 |
| Resort Hotel | 1 | 75 | 2015 | July | 27 | 1 | 0 | 3 | 2 | 0 | 0 | HB | PRT | Offline TA/TO | TA/TO | 0 | 0 | 0 | D | D | 0 | No Deposit | 15 | NULL | 0 | Transient | 105.5 | 0 | 0 | Canceled | 2015-04-22 |
Let us analyze the number of bookings by the type of hotel for each available year.
ggplot(data = data,aes(arrival_date_year,fill = (hotel))) +
geom_bar(position = 'dodge') +
ylab("Number Of Bookings") +
xlab("Years") +
ggtitle("Total bookings across years") +
labs(fill = 'Hotel Type')
Total number of bookings in 2016 seems to be higher compared to 2015 & 2017. We can use the arrival_date_month field to view the data across every month for the availabe years.
data_2015 <- data[data$arrival_date_year == 2015,]
data_2016 <- data[data$arrival_date_year == 2016,]
data_2017 <- data[data$arrival_date_year == 2017,]
ggplot(data = data_2015,aes(arrival_date_month,fill = (hotel))) +
geom_bar(position = 'dodge') +
ylab("Number Of Bookings") +
xlab("Years") +
ggtitle("Total bookings for year 2015") +
labs(fill = 'Hotel Type')
ggplot(data = data_2016,aes(arrival_date_month,fill = (hotel))) +
geom_bar(position = 'dodge') +
ylab("Number Of Bookings") +
xlab("Years") +
ggtitle("Total bookings for year 2016") +
labs(fill = 'Hotel Type')
ggplot(data = data_2017,aes(arrival_date_month,fill = (hotel))) +
geom_bar(position = 'dodge') +
ylab("Number Of Bookings") +
xlab("Years") +
ggtitle("Total bookings for years 2017") +
labs(fill = 'Hotel Type')
Looking at the above plots we can see that the data available for 2015 is only for 3rd Quarter of the year and for 2017 the data is available only for 1st Quatert of the year. Whereas, for 2016 the data is available for the entire year. Due to this we are seeing more number of bookings in 2016.
data.weeks <- (data %>%
group_by(arrival_date_year) %>%
summarize(min = min(arrival_date_week_number),
max = max(arrival_date_week_number),
weeks = max(arrival_date_week_number) - min(arrival_date_week_number) + 1))
knitr::kable(data.weeks)
| arrival_date_year | min | max | weeks |
|---|---|---|---|
| 2015 | 27 | 53 | 27 |
| 2016 | 1 | 53 | 53 |
| 2017 | 1 | 35 | 35 |
From the above table we can see the data for number of weeks is not same in each available year. So we find the average number of bookings and average revenue generated per week in the 2 types of hotel segments and compare them.
data.summary <- data %>%
filter(is_canceled == 0) %>%
group_by(hotel,arrival_date_year) %>%
summarize(hotel.count = n(),
hotel.rates = sum(adr)) %>%
transmute(arrival_date_year,
avg_weekly_bookings = case_when(arrival_date_year == '2015' ~ (hotel.count/27),
arrival_date_year == '2017' ~ (hotel.count/35),
TRUE ~ (hotel.count/53)),
avg_weekly_earnings = case_when(arrival_date_year == '2015' ~ (hotel.rates/27),
arrival_date_year == '2017' ~ (hotel.rates/35),
TRUE ~ round(hotel.rates/53)))
knitr::kable(data.summary)
| hotel | arrival_date_year | avg_weekly_bookings | avg_weekly_earnings |
|---|---|---|---|
| City Hotel | 2015 | 284.3704 | 24988.60 |
| City Hotel | 2016 | 428.9245 | 44642.00 |
| City Hotel | 2017 | 451.9143 | 52791.06 |
| Resort Hotel | 2015 | 228.7407 | 20531.64 |
| Resort Hotel | 2016 | 257.3019 | 21589.00 |
| Resort Hotel | 2017 | 260.6857 | 26533.26 |
From the above table we can say the total number of bookings and earnings are increasing year over year for both types of hotels. But, this increase is more in City Hotels compared to Resort Hotels. Also the number of bookings increased by upto 56% from 2015 to 2016 for City Hotels but there is only a 6% increase in bookings from 2016 to 2017.
We can also visualize the cancellation trend across hotel types.
ggplot(data, aes(x = is_canceled, fill = factor(hotel))) +
geom_histogram(binwidth = 0.2) +
scale_x_continuous(breaks = seq(0, 1, 1))
The above plot shows that there are more cancellations in City Hotels.
We can analyze the room preferred.
room_summary <- data %>%
filter(is_canceled == 0) %>%
group_by(reserved_room_type) %>%
summarize(room_count = n()) %>%
arrange(-room_count)
knitr::kable(room_summary)
| reserved_room_type | room_count |
|---|---|
| A | 52363 |
| D | 13099 |
| E | 4621 |
| F | 2017 |
| G | 1331 |
| B | 750 |
| C | 624 |
| H | 356 |
| L | 4 |
From the above table we can say that room type A is in high demand. And we can think of strategies to improve the booking rate in room type L.
We can analyze the bookings on weekdays and weekends for different hotel types.
week_summary <- data %>%
filter(is_canceled == 0) %>%
group_by(hotel,arrival_date_year) %>%
summarize(avg_weekend_stay = round(sum(stays_in_weekend_nights)/2),
avg_weekday_stay = round(sum(stays_in_week_nights)/5))
knitr::kable(week_summary)
| hotel | arrival_date_year | avg_weekend_stay | avg_weekday_stay |
|---|---|---|---|
| City Hotel | 2015 | 2862 | 3091 |
| City Hotel | 2016 | 9024 | 9555 |
| City Hotel | 2017 | 6620 | 6982 |
| Resort Hotel | 2015 | 3678 | 3954 |
| Resort Hotel | 2016 | 7369 | 7848 |
| Resort Hotel | 2017 | 5358 | 5612 |
Here we can see the number of bookings on weekdays is slightly higer compared to weekend bookings across hotel types. This finding is interesting as we would generally believe the hotel bookings to be more on weekends compared to weekdays. So we can investigate to find the customers who prefer weekday bookings.