Through this analysis, we were able to answer key business questions which are stated below:
This data describes demand data of two different types of hotels. One of the hotels is a resort hotel and the other is a city hotel. Both of them share the same structure, with 31 variables describing the 40,060 observations and 79,330 observations. Each observation represents a hotel booking.Both data sets comprehend bookings due to arrive between the 1st of July of 2015 and the 31st of August 2017, including bookings that effectively arrived and bookings that were cancelled.
library(tidyverse)
library(ggplot2)
The data originally comes from an open hotel booking demand dataset on the website Antonio, Almeida and Nunes, 2019. It was published on Twitter as a part of weekly project in R named TidyTuesday.
The data was acquired from hotels’ Property Management System (PMS) SQL databases using TSQL queries. This data describes demand data of two different types of hotels. One of the hotels is a resort hotel and the other is a city hotel. Both of them share the same structure, with 32 variables describing the 40,060 observations and 79,330 observations respectively. Each observation represents a hotel booking.Both data sets comprehend bookings due to arrive between the 1st of July of 2015 and the 31st of August 2017, including bookings that effectively arrived and bookings that were cancelled.
data<- read.csv("C:/Users/srhrs/OneDrive/Documents/Wrangling/Week 3/hotels.csv",stringsAsFactors = TRUE)
str(data)
## 'data.frame': 119390 obs. of 32 variables:
## $ hotel : Factor w/ 2 levels "City Hotel","Resort Hotel": 2 2 2 2 2 2 2 2 2 2 ...
## $ is_canceled : int 0 0 0 0 0 0 0 0 1 1 ...
## $ lead_time : int 342 737 7 13 14 14 0 9 85 75 ...
## $ arrival_date_year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ arrival_date_month : Factor w/ 12 levels "April","August",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ arrival_date_week_number : int 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : int 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : int 2 2 1 1 2 2 2 2 2 2 ...
## $ children : int 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : int 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : Factor w/ 5 levels "BB","FB","HB",..: 1 1 1 1 1 1 1 2 1 3 ...
## $ country : Factor w/ 178 levels "ABW","AGO","AIA",..: 137 137 60 60 60 60 137 137 137 137 ...
## $ market_segment : Factor w/ 8 levels "Aviation","Complementary",..: 4 4 4 3 7 7 4 4 7 6 ...
## $ distribution_channel : Factor w/ 5 levels "Corporate","Direct",..: 2 2 2 1 4 4 2 2 4 4 ...
## $ is_repeated_guest : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: int 0 0 0 0 0 0 0 0 0 0 ...
## $ reserved_room_type : Factor w/ 10 levels "A","B","C","D",..: 3 3 1 1 1 1 3 3 1 4 ...
## $ assigned_room_type : Factor w/ 12 levels "A","B","C","D",..: 3 3 3 1 1 1 3 3 1 4 ...
## $ booking_changes : int 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : Factor w/ 3 levels "No Deposit","Non Refund",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ agent : Factor w/ 334 levels "1","10","103",..: 334 334 334 157 103 103 334 156 103 40 ...
## $ company : Factor w/ 353 levels "10","100","101",..: 353 353 353 353 353 353 353 353 353 353 ...
## $ days_in_waiting_list : int 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : Factor w/ 4 levels "Contract","Group",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ adr : num 0 0 75 75 98 ...
## $ required_car_parking_spaces : int 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : int 0 0 0 0 1 1 0 1 1 0 ...
## $ reservation_status : Factor w/ 3 levels "Canceled","Check-Out",..: 2 2 2 2 2 2 2 2 1 1 ...
## $ reservation_status_date : Factor w/ 926 levels "2014-10-17","2014-11-18",..: 122 122 123 123 124 124 124 124 73 62 ...
The dataset contains 32 variables and 119390 records.
# Generating a table for the first 10 rows in the dataset.
knitr::kable(head(data,n = 10), "pandoc")
| hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | children | babies | meal | country | market_segment | distribution_channel | is_repeated_guest | previous_cancellations | previous_bookings_not_canceled | reserved_room_type | assigned_room_type | booking_changes | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0 | 0 | BB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 3 | No Deposit | NULL | NULL | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
| Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0 | 0 | BB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 4 | No Deposit | NULL | NULL | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
| Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0 | 0 | BB | GBR | Direct | Direct | 0 | 0 | 0 | A | C | 0 | No Deposit | NULL | NULL | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
| Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0 | 0 | BB | GBR | Corporate | Corporate | 0 | 0 | 0 | A | A | 0 | No Deposit | 304 | NULL | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
| Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0 | 0 | BB | GBR | Online TA | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 240 | NULL | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 2015-07-03 |
| Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0 | 0 | BB | GBR | Online TA | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 240 | NULL | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 2015-07-03 |
| Resort Hotel | 0 | 0 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0 | 0 | BB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 0 | No Deposit | NULL | NULL | 0 | Transient | 107.0 | 0 | 0 | Check-Out | 2015-07-03 |
| Resort Hotel | 0 | 9 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0 | 0 | FB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 0 | No Deposit | 303 | NULL | 0 | Transient | 103.0 | 0 | 1 | Check-Out | 2015-07-03 |
| Resort Hotel | 1 | 85 | 2015 | July | 27 | 1 | 0 | 3 | 2 | 0 | 0 | BB | PRT | Online TA | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 240 | NULL | 0 | Transient | 82.0 | 0 | 1 | Canceled | 2015-05-06 |
| Resort Hotel | 1 | 75 | 2015 | July | 27 | 1 | 0 | 3 | 2 | 0 | 0 | HB | PRT | Offline TA/TO | TA/TO | 0 | 0 | 0 | D | D | 0 | No Deposit | 15 | NULL | 0 | Transient | 105.5 | 0 | 0 | Canceled | 2015-04-22 |
# Calculating null value count for all columns
colSums(is.na(data))
## hotel is_canceled
## 0 0
## lead_time arrival_date_year
## 0 0
## arrival_date_month arrival_date_week_number
## 0 0
## arrival_date_day_of_month stays_in_weekend_nights
## 0 0
## stays_in_week_nights adults
## 0 0
## children babies
## 4 0
## meal country
## 0 0
## market_segment distribution_channel
## 0 0
## is_repeated_guest previous_cancellations
## 0 0
## previous_bookings_not_canceled reserved_room_type
## 0 0
## assigned_room_type booking_changes
## 0 0
## deposit_type agent
## 0 0
## company days_in_waiting_list
## 0 0
## customer_type adr
## 0 0
## required_car_parking_spaces total_of_special_requests
## 0 0
## reservation_status reservation_status_date
## 0 0
children with 4 missing values.children column.# Finding summary statistics for 'children'
summary(data$children)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.0000 0.0000 0.1039 0.0000 10.0000 4
# Imputing missing values in 'children' column
data$children[is.na(data$children)] <- 0
After imputing, the number of missing values in the dataset is 0.
sum(is.na(data))
## [1] 0
summary(data)
## hotel is_canceled lead_time arrival_date_year
## City Hotel :79330 Min. :0.0000 Min. : 0 Min. :2015
## Resort Hotel:40060 1st Qu.:0.0000 1st Qu.: 18 1st Qu.:2016
## Median :0.0000 Median : 69 Median :2016
## Mean :0.3704 Mean :104 Mean :2016
## 3rd Qu.:1.0000 3rd Qu.:160 3rd Qu.:2017
## Max. :1.0000 Max. :737 Max. :2017
##
## arrival_date_month arrival_date_week_number arrival_date_day_of_month
## August :13877 Min. : 1.00 Min. : 1.0
## July :12661 1st Qu.:16.00 1st Qu.: 8.0
## May :11791 Median :28.00 Median :16.0
## October:11160 Mean :27.17 Mean :15.8
## April :11089 3rd Qu.:38.00 3rd Qu.:23.0
## June :10939 Max. :53.00 Max. :31.0
## (Other):47873
## stays_in_weekend_nights stays_in_week_nights adults
## Min. : 0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 1.0 1st Qu.: 2.000
## Median : 1.0000 Median : 2.0 Median : 2.000
## Mean : 0.9276 Mean : 2.5 Mean : 1.856
## 3rd Qu.: 2.0000 3rd Qu.: 3.0 3rd Qu.: 2.000
## Max. :19.0000 Max. :50.0 Max. :55.000
##
## children babies meal country
## Min. : 0.0000 Min. : 0.000000 BB :92310 PRT :48590
## 1st Qu.: 0.0000 1st Qu.: 0.000000 FB : 798 GBR :12129
## Median : 0.0000 Median : 0.000000 HB :14463 FRA :10415
## Mean : 0.1039 Mean : 0.007949 SC :10650 ESP : 8568
## 3rd Qu.: 0.0000 3rd Qu.: 0.000000 Undefined: 1169 DEU : 7287
## Max. :10.0000 Max. :10.000000 ITA : 3766
## (Other):28635
## market_segment distribution_channel is_repeated_guest
## Online TA :56477 Corporate: 6677 Min. :0.00000
## Offline TA/TO:24219 Direct :14645 1st Qu.:0.00000
## Groups :19811 GDS : 193 Median :0.00000
## Direct :12606 TA/TO :97870 Mean :0.03191
## Corporate : 5295 Undefined: 5 3rd Qu.:0.00000
## Complementary: 743 Max. :1.00000
## (Other) : 239
## previous_cancellations previous_bookings_not_canceled reserved_room_type
## Min. : 0.00000 Min. : 0.0000 A :85994
## 1st Qu.: 0.00000 1st Qu.: 0.0000 D :19201
## Median : 0.00000 Median : 0.0000 E : 6535
## Mean : 0.08712 Mean : 0.1371 F : 2897
## 3rd Qu.: 0.00000 3rd Qu.: 0.0000 G : 2094
## Max. :26.00000 Max. :72.0000 B : 1118
## (Other): 1551
## assigned_room_type booking_changes deposit_type agent
## A :74053 Min. : 0.0000 No Deposit:104641 9 :31961
## D :25322 1st Qu.: 0.0000 Non Refund: 14587 NULL :16340
## E : 7806 Median : 0.0000 Refundable: 162 240 :13922
## F : 3751 Mean : 0.2211 1 : 7191
## G : 2553 3rd Qu.: 0.0000 14 : 3640
## C : 2375 Max. :21.0000 7 : 3539
## (Other): 3530 (Other):42797
## company days_in_waiting_list customer_type
## NULL :112593 Min. : 0.000 Contract : 4076
## 40 : 927 1st Qu.: 0.000 Group : 577
## 223 : 784 Median : 0.000 Transient :89613
## 67 : 267 Mean : 2.321 Transient-Party:25124
## 45 : 250 3rd Qu.: 0.000
## 153 : 215 Max. :391.000
## (Other): 4354
## adr required_car_parking_spaces total_of_special_requests
## Min. : -6.38 Min. :0.00000 Min. :0.0000
## 1st Qu.: 69.29 1st Qu.:0.00000 1st Qu.:0.0000
## Median : 94.58 Median :0.00000 Median :0.0000
## Mean : 101.83 Mean :0.06252 Mean :0.5714
## 3rd Qu.: 126.00 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :5400.00 Max. :8.00000 Max. :5.0000
##
## reservation_status reservation_status_date
## Canceled :43017 2015-10-21: 1461
## Check-Out:75166 2015-07-06: 805
## No-Show : 1207 2016-11-25: 790
## 2015-01-01: 763
## 2016-01-18: 625
## 2015-07-02: 469
## (Other) :114477
Summary for variables of concern
Resort Hotel with 40060 records and City Hotel with 79330 records.arrival_date_week_number is 28. This shows that there are more number of bookings in the 2nd half of the year.stays_in_weekend_nights) observe longer stays than those in weekends(stays_in_week_nights) i.e., on Saturday, Sunday.adults column has a minimum value of 0. We must inspect those records.distribution_channel has 5 Undefined values.A rooms have 85994 bookings registered against them. Also, there are 74053 bookings assigned for type A rooms.agent and company columns have records stored under the category NULL.adr column has negative values. These must be investigated.hist(data$arrival_date_year, prob = T, main = "Histogram Yearly Bookings", xlab = "Yearly Bookings")
Majority bookings can be seen in 2016
hist(data$adr, prob = T, breaks = 50,xlim = c(min(data$adr) - 1,max(data$adr) + 1), main = "Histogram of Avg. Daily Price", xlab = "Avg. Daily Price")
There seem some outliers in the adr column
Below are the plots of various numeric variables:
par(mfrow = c(3,3), oma = c(1,1,0,0) + 0.1, mar = c(3,3,1,1) + 0.1)
boxplot(data$is_canceled, main = "is_canceled")
boxplot(data$lead_time, main = "lead_time")
boxplot(data$arrival_date_year, main = "arrival_date_year")
boxplot(data$arrival_date_week_number, main = "arrival_date_week_number")
boxplot(data$arrival_date_day_of_month, main = "arrival_date_day_of_month")
boxplot(data$stays_in_weekend_nights, main = "stays_in_weekend_nights")
boxplot(data$stays_in_week_nights, main = "stays_in_week_nights")
boxplot(data$children, main = "children")
boxplot(data$adults, main = "adults")
boxplot(data$babies, main = "babies")
boxplot(data$is_repeated_guest, main = "is_repeated_gues")
boxplot(data$previous_cancellations, main = "previous_cancellation")
boxplot(data$previous_bookings_not_canceled, main ="previous_bookings_not_canceled")
boxplot(data$booking_changes, main = "booking_changes")
boxplot(data$days_in_waiting_list, main = "days_in_waiting_list")
boxplot(data$adr, main = "adr")
boxplot(data$required_car_parking_spaces, main = "required_car_parking_spaces")
boxplot(data$total_of_special_requests, main = "total_of_special_requests")
Observations from boxplots
children column, we observe that there is one booking with 10 children. Most of the other bookings have value less than 4.babies column, we observe that there are two bookings with value greater than 4.days_in_waiting_list has values as high as 400. This implies that waiting time is greater than 1 year and measures must be taken to reduce this value.adr column, there is one value larger than 5000. It can be an outlier. We should remove this value as it significantly affects the summary statistics. The observation with negative adr can also be removed.data <- data[!(data$adr < 0 | data$adr == 5400),]
boxplot(data$adr, main = "Boxplot : Avg. Daily Price")
adr column.summary(data$adr)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 69.29 94.58 101.79 126.00 510.00
adrhist(data$adr, prob = T, col ="grey", breaks = 100, xlim = c(min(data$adr) - 1 ,max(data$adr) + 1), main = "Histogram of Avg. Daily Price", xlab = "Avg. Daily Price")
* Let us check the final dataset.
knitr::kable(head(data,n=10), "pandoc")
| hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | children | babies | meal | country | market_segment | distribution_channel | is_repeated_guest | previous_cancellations | previous_bookings_not_canceled | reserved_room_type | assigned_room_type | booking_changes | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0 | 0 | BB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 3 | No Deposit | NULL | NULL | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
| Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0 | 0 | BB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 4 | No Deposit | NULL | NULL | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
| Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0 | 0 | BB | GBR | Direct | Direct | 0 | 0 | 0 | A | C | 0 | No Deposit | NULL | NULL | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
| Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0 | 0 | BB | GBR | Corporate | Corporate | 0 | 0 | 0 | A | A | 0 | No Deposit | 304 | NULL | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
| Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0 | 0 | BB | GBR | Online TA | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 240 | NULL | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 2015-07-03 |
| Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0 | 0 | BB | GBR | Online TA | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 240 | NULL | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 2015-07-03 |
| Resort Hotel | 0 | 0 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0 | 0 | BB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 0 | No Deposit | NULL | NULL | 0 | Transient | 107.0 | 0 | 0 | Check-Out | 2015-07-03 |
| Resort Hotel | 0 | 9 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0 | 0 | FB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 0 | No Deposit | 303 | NULL | 0 | Transient | 103.0 | 0 | 1 | Check-Out | 2015-07-03 |
| Resort Hotel | 1 | 85 | 2015 | July | 27 | 1 | 0 | 3 | 2 | 0 | 0 | BB | PRT | Online TA | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 240 | NULL | 0 | Transient | 82.0 | 0 | 1 | Canceled | 2015-05-06 |
| Resort Hotel | 1 | 75 | 2015 | July | 27 | 1 | 0 | 3 | 2 | 0 | 0 | HB | PRT | Offline TA/TO | TA/TO | 0 | 0 | 0 | D | D | 0 | No Deposit | 15 | NULL | 0 | Transient | 105.5 | 0 | 0 | Canceled | 2015-04-22 |
Let us analyze the number of bookings by the type of hotel for each available year.
ggplot(data = data,aes(arrival_date_year,fill = (hotel))) +
geom_bar(position = 'dodge') +
ylab("Number Of Bookings") +
xlab("Years") +
ggtitle("Total bookings across years") +
labs(fill = 'Hotel Type')
Total number of bookings in 2016 seems to be higher compared to 2015 & 2017. We can use the arrival_date_month field to view the data across every month for the availabe years.
data_2015 <- data[data$arrival_date_year == 2015,]
data_2016 <- data[data$arrival_date_year == 2016,]
data_2017 <- data[data$arrival_date_year == 2017,]
ggplot(data = data_2015,aes(arrival_date_month,fill = (hotel))) +
geom_bar(position = 'dodge') +
ylab("Number Of Bookings") +
xlab("Years") +
ggtitle("Total bookings for year 2015") +
labs(fill = 'Hotel Type')
ggplot(data = data_2016,aes(arrival_date_month,fill = (hotel))) +
geom_bar(position = 'dodge') +
ylab("Number Of Bookings") +
xlab("Years") +
ggtitle("Total bookings for year 2016") +
labs(fill = 'Hotel Type')
ggplot(data = data_2017,aes(arrival_date_month,fill = (hotel))) +
geom_bar(position = 'dodge') +
ylab("Number Of Bookings") +
xlab("Years") +
ggtitle("Total bookings for years 2017") +
labs(fill = 'Hotel Type')
Looking at the above plots we can see that complete data is available only for 2 quarters for years 2015 and 2017 and the data is missing for remaining quarters whereas, for 2016 the data is available for the entire year. Due to this the number of bookings for 2016 is showing a much higher value than the rest of the years.
data.weeks <- (data %>%
group_by(arrival_date_year) %>%
summarize(min = min(arrival_date_week_number),
max = max(arrival_date_week_number),
weeks = max(arrival_date_week_number) - min(arrival_date_week_number) + 1))
knitr::kable(data.weeks)
| arrival_date_year | min | max | weeks |
|---|---|---|---|
| 2015 | 27 | 53 | 27 |
| 2016 | 1 | 53 | 53 |
| 2017 | 1 | 35 | 35 |
From the above table we can see the data for number of weeks is not same in each available year. So we find the average number of bookings and average revenue generated per week in the 2 types of hotel segments and compare them.
data.summary <- data %>%
filter(is_canceled == 0) %>%
group_by(hotel,arrival_date_year) %>%
summarize(hotel.count = n(),
hotel.rates = sum(adr)) %>%
transmute(arrival_date_year,
avg_weekly_bookings = case_when(arrival_date_year == '2015' ~ (hotel.count/27),
arrival_date_year == '2017' ~ (hotel.count/35),
TRUE ~ (hotel.count/53)),
avg_weekly_earnings = case_when(arrival_date_year == '2015' ~ (hotel.rates/27),
arrival_date_year == '2017' ~ (hotel.rates/35),
TRUE ~ round(hotel.rates/53)))
knitr::kable(data.summary)
| hotel | arrival_date_year | avg_weekly_bookings | avg_weekly_earnings |
|---|---|---|---|
| City Hotel | 2015 | 284.3704 | 24988.60 |
| City Hotel | 2016 | 428.9245 | 44642.00 |
| City Hotel | 2017 | 451.9143 | 52791.06 |
| Resort Hotel | 2015 | 228.7407 | 20531.64 |
| Resort Hotel | 2016 | 257.3019 | 21589.00 |
| Resort Hotel | 2017 | 260.6857 | 26533.26 |
From the above table we can say the total number of bookings and earnings are increasing year over year for both types of hotels. But, this increase is more in City Hotels compared to Resort Hotels. Also the number of bookings increased by upto 56% from 2015 to 2016 for City Hotels but there is only a 6% increase in bookings from 2016 to 2017.
We will now explore the countries from which the guests belong to.
data_country <- data %>% group_by(country) %>% dplyr::summarise(booking_count = n()) %>% arrange(desc(booking_count))
top_n(data_country,10,booking_count) %>%
ggplot(.,aes(country, booking_count)) +
geom_bar(stat = "identity", width = 0.25, fill ="dark blue")
The above graph shows the top 10 countries from where the guests come from. Majority of the guests are from Portugal followed by Great Britain and France.
Now we will analyse the distribution channel which customers generally choose for their bookings.
data_DC <- data %>% filter(!grepl("Undefined", distribution_channel)) %>% group_by(distribution_channel) %>% dplyr::summarise(booking_count = n()) %>% arrange(desc(booking_count))
ggplot(data=data_DC,aes(distribution_channel, booking_count)) +
geom_bar(stat = "identity", width = 0.25, fill ="dark blue")
As shown above, majority of the bookings come from travel agents/tour operators, therefore the hotel company can tie up with more number of such agents/operators and retain the existing ones with lucrative offers and commissions.
We will now analyze the bookings on weekdays and weekends for different hotel types.
week_summary <- data %>%
filter(is_canceled == 0) %>%
group_by(hotel,arrival_date_year) %>%
summarize(avg_weekend_stay = round(sum(stays_in_weekend_nights)/2),
avg_weekday_stay = round(sum(stays_in_week_nights)/5))
knitr::kable(week_summary)
| hotel | arrival_date_year | avg_weekend_stay | avg_weekday_stay |
|---|---|---|---|
| City Hotel | 2015 | 2862 | 3091 |
| City Hotel | 2016 | 9024 | 9555 |
| City Hotel | 2017 | 6620 | 6982 |
| Resort Hotel | 2015 | 3678 | 3954 |
| Resort Hotel | 2016 | 7369 | 7848 |
| Resort Hotel | 2017 | 5358 | 5612 |
Here we can see the number of bookings on weekdays is slightly higher compared to weekend bookings across hotel types. This is interesting as we would generally believe the hotel bookings to be higher on weekends compared to weekdays. So the hotel company can further investigate by interacting with the customers and identifying the customers who prefer weekday bookings & those who prefer weekends so that they can target the respective type of customers by a much more streamlined process.
We will now study the room type preferred by guests.
room_summary <- data %>%
filter(is_canceled == 0) %>%
group_by(reserved_room_type) %>%
summarize(room_count = n()) %>%
arrange(-room_count)
knitr::kable(room_summary)
| reserved_room_type | room_count |
|---|---|
| A | 52363 |
| D | 13099 |
| E | 4621 |
| F | 2017 |
| G | 1331 |
| B | 750 |
| C | 624 |
| H | 356 |
| L | 4 |
ggplot(data = data,aes(reserved_room_type,fill = (hotel))) +
geom_bar(position = 'dodge') +
ylab("Number Of Bookings") +
xlab("Room Type") +
ggtitle("Room Type Preferred") +
labs(fill = 'Hotel Type')
From the above table and graph, we can say that room type A is in high demand for both city and resort hotels. And therefore the company can think of strategies to improve the booking rate in the remaining room types and ensure that the number of rooms of type A in the hotel is much higher than the remaining room types.
Studying the meal type preferred by the guests.
ggplot(data, aes(meal)) +
geom_bar(width=0.25, fill = "dark blue")
The above plots shows that “Bed & Breakfast” is the most preferred meal type by the guests. And therefore the hotel company should further investigate the reasons for the low number of bookings with other type of meal offerings and devise strategies to make them more attractive.
We will now study the cancellation trend across hotel types.
ggplot(data, aes(x = is_canceled, fill = factor(hotel))) +
geom_histogram(binwidth = 0.2) +
scale_x_continuous(breaks = seq(0, 1, 1))
The above plot shows that there are more cancellations in City Hotels than Resort Hotels.
data$total_stay_nights <- rowSums(cbind(data$stays_in_weekend_nights, data$stays_in_week_nights), na.rm = TRUE)
data <- data %>% mutate(revenue = (total_stay_nights * adr))
data_hotel_revenue <- data %>% filter(is_canceled == 0) %>% group_by(hotel) %>% dplyr::summarise(total_bookings= n(), total_revenue = sum(revenue))
data_lm <- data %>% filter(is_canceled == 0) %>% group_by(arrival_date_week_number,hotel,customer_type, meal, is_repeated_guest) %>% dplyr::summarise(total_bookings = n(), total_guests = sum(adults + children + babies),average_stay_nights = mean(total_stay_nights, na.rm = TRUE), week_adr = mean(adr, na.rm = TRUE) ,total_revenue = sum(revenue))
## `summarise()` has grouped output by 'arrival_date_week_number', 'hotel', 'customer_type', 'meal'. You can override using the `.groups` argument.
model <- lm(total_revenue ~ as.factor(hotel) + as.factor(customer_type) + as.factor(meal) + as.factor(is_repeated_guest) + average_stay_nights + week_adr, data = data_lm)
summary(model)
##
## Call:
## lm(formula = total_revenue ~ as.factor(hotel) + as.factor(customer_type) +
## as.factor(meal) + as.factor(is_repeated_guest) + average_stay_nights +
## week_adr, data = data_lm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55215 -20804 -5367 10432 242978
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7434.26 3706.77 2.006 0.0451
## as.factor(hotel)Resort Hotel -1644.98 2212.95 -0.743 0.4574
## as.factor(customer_type)Group -3607.78 3597.32 -1.003 0.3161
## as.factor(customer_type)Transient 41462.05 3017.20 13.742 < 2e-16
## as.factor(customer_type)Transient-Party 14504.96 3108.08 4.667 3.36e-06
## as.factor(meal)FB -59550.92 4570.39 -13.030 < 2e-16
## as.factor(meal)HB -32974.17 2495.85 -13.212 < 2e-16
## as.factor(meal)SC -32095.51 2730.19 -11.756 < 2e-16
## as.factor(meal)Undefined -53234.85 4497.99 -11.835 < 2e-16
## as.factor(is_repeated_guest)1 -33540.02 2258.14 -14.853 < 2e-16
## average_stay_nights 656.51 345.97 1.898 0.0580
## week_adr 208.88 20.75 10.068 < 2e-16
##
## (Intercept) *
## as.factor(hotel)Resort Hotel
## as.factor(customer_type)Group
## as.factor(customer_type)Transient ***
## as.factor(customer_type)Transient-Party ***
## as.factor(meal)FB ***
## as.factor(meal)HB ***
## as.factor(meal)SC ***
## as.factor(meal)Undefined ***
## as.factor(is_repeated_guest)1 ***
## average_stay_nights .
## week_adr ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34580 on 1378 degrees of freedom
## Multiple R-squared: 0.3765, Adjusted R-squared: 0.3716
## F-statistic: 75.66 on 11 and 1378 DF, p-value: < 2.2e-16
The conclusions from the analysis of the Hotel data are :
Around 66% of yearly bookings come from the City Hotel and is much higher than the resort hotel and therefore company can devise strategies accordingly as City hotel is the major revenue generator.
70% of the bookings come from 5 countries namely Portugal, Great Britain, France, Spain and Germany among which 40% comes from Portugal alone. Therefore, the company can design strategies and promotions to attract the target customers during the months when booking are low.
82% of the bookings come from travel agents/tour operators, therefore the hotel company can tie up with more number of such agents/operators and retain the existing ones with lucrative offers and commissions.
The most preferred room type is A. Therefore the company should plan to increase the number of type A rooms and investigate the reasons for such low volume of bookings for the other room types.
The most preferred meal type is Bed & Breakfast which comprise of 78% of the bookings. The company can therefore be prepared to have required amount of resources for breakfast in future and also investigate on the low number of bookings for the other type of meal offerings and devise strategies to make them more attractive.
On an average, the yearly cancellation rate is 42% for city hotel and 28% for resort hotel. Therefore, the company should further deep dive into the reasons for such cancellations and look for ways to reduce the cancellation rate.
The Revenue is positively affected by adr, Total nights stayed by the guest, Transient and Transient Party customers.