Introduction
This data contains booking information of hotel. Data includes information such as when the booking was made, lead time, country, the number of adults, children, and/or babies, etc. Source of data from Kaggle.
Set Up Library
First, we must set up the library that we needed.
library(lubridate)
library(dplyr)
library(GGally)
library(ggplot2)
library(plotly)
library(glue)
library(scales)
Data Preparation
Input Data
Input our data and put it into ‘hotel_bookings’ object. We use
stringsAsFactors = True
, so all columns with character data
type will be converted into factors.
hotel_bookings <- read.csv("hotel_bookings.csv", stringsAsFactors = T)
Overview our data:
head(hotel_bookings)
tail(hotel_bookings)
Data Structure
Check the number of columns and rows.
dim(hotel_bookings)
## [1] 119390 32
Data contains 119,390 rows and 32 columns.
View all columns and the data types.
glimpse(hotel_bookings)
## Rows: 119,390
## Columns: 32
## $ hotel <fct> Resort Hotel, Resort Hotel, Resort Hote~
## $ is_canceled <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, ~
## $ lead_time <int> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75, ~
## $ arrival_date_year <int> 2015, 2015, 2015, 2015, 2015, 2015, 201~
## $ arrival_date_month <fct> July, July, July, July, July, July, Jul~
## $ arrival_date_week_number <int> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,~
## $ arrival_date_day_of_month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ stays_in_weekend_nights <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ stays_in_week_nights <int> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, ~
## $ adults <int> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, ~
## $ children <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ babies <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ meal <fct> BB, BB, BB, BB, BB, BB, BB, FB, BB, HB,~
## $ country <fct> PRT, PRT, GBR, GBR, GBR, GBR, PRT, PRT,~
## $ market_segment <fct> Direct, Direct, Direct, Corporate, Onli~
## $ distribution_channel <fct> Direct, Direct, Direct, Corporate, TA/T~
## $ is_repeated_guest <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ previous_cancellations <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ previous_bookings_not_canceled <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ reserved_room_type <fct> C, C, A, A, A, A, C, C, A, D, E, D, D, ~
## $ assigned_room_type <fct> C, C, C, A, A, A, C, C, A, D, E, D, E, ~
## $ booking_changes <int> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ deposit_type <fct> No Deposit, No Deposit, No Deposit, No ~
## $ agent <fct> NULL, NULL, NULL, 304, 240, 240, NULL, ~
## $ company <fct> NULL, NULL, NULL, NULL, NULL, NULL, NUL~
## $ days_in_waiting_list <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ customer_type <fct> Transient, Transient, Transient, Transi~
## $ adr <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00,~
## $ required_car_parking_spaces <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ total_of_special_requests <int> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3, ~
## $ reservation_status <fct> Check-Out, Check-Out, Check-Out, Check-~
## $ reservation_status_date <fct> 2015-07-01, 2015-07-01, 2015-07-02, 201~
Some of data types are not in the correct type.
Pre-processing Data
Convert the data type!
The column is_canceled, arrival_date_year, and is_repeated_guest should be Factor, reservation_status_date should be Date, and we have to sort month’s name from arrival_date_month. Then we can delete columns that are not used in this analysis.
hotel_bookings <- hotel_bookings %>%
select(-c(previous_cancellations, previous_bookings_not_canceled, agent, company)) %>%
mutate(is_canceled = factor(is_canceled, levels = c(0, 1), labels = c("No", "Yes")),
arrival_date_year = as.factor(arrival_date_year),
is_repeated_guest = factor(is_repeated_guest, levels = c(0, 1), labels = c("No", "Yes")),
reservation_status_date = ymd(reservation_status_date),
arrival_date_month = factor(arrival_date_month, levels = c(
"January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"),ordered = TRUE))
Now the data type of all columns are correct. Next, checking the missing value.
colSums(is.na(hotel_bookings))
## hotel is_canceled
## 0 0
## lead_time arrival_date_year
## 0 0
## arrival_date_month arrival_date_week_number
## 0 0
## arrival_date_day_of_month stays_in_weekend_nights
## 0 0
## stays_in_week_nights adults
## 0 0
## children babies
## 4 0
## meal country
## 0 0
## market_segment distribution_channel
## 0 0
## is_repeated_guest reserved_room_type
## 0 0
## assigned_room_type booking_changes
## 0 0
## deposit_type days_in_waiting_list
## 0 0
## customer_type adr
## 0 0
## required_car_parking_spaces total_of_special_requests
## 0 0
## reservation_status reservation_status_date
## 0 0
We found 4 missing values in Children column. Convert the missing values to 0.
hotel_bookings <- hotel_bookings %>%
mutate(children = replace(children, is.na(children), 0))
Let’s take a look data again!
head(hotel_bookings)
Data Summary
Summary of all columns.
summary(hotel_bookings)
## hotel is_canceled lead_time arrival_date_year
## City Hotel :79330 No :75166 Min. : 0 2015:21996
## Resort Hotel:40060 Yes:44224 1st Qu.: 18 2016:56707
## Median : 69 2017:40687
## Mean :104
## 3rd Qu.:160
## Max. :737
##
## arrival_date_month arrival_date_week_number arrival_date_day_of_month
## August :13877 Min. : 1.00 Min. : 1.0
## July :12661 1st Qu.:16.00 1st Qu.: 8.0
## May :11791 Median :28.00 Median :16.0
## October:11160 Mean :27.17 Mean :15.8
## April :11089 3rd Qu.:38.00 3rd Qu.:23.0
## June :10939 Max. :53.00 Max. :31.0
## (Other):47873
## stays_in_weekend_nights stays_in_week_nights adults
## Min. : 0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 1.0 1st Qu.: 2.000
## Median : 1.0000 Median : 2.0 Median : 2.000
## Mean : 0.9276 Mean : 2.5 Mean : 1.856
## 3rd Qu.: 2.0000 3rd Qu.: 3.0 3rd Qu.: 2.000
## Max. :19.0000 Max. :50.0 Max. :55.000
##
## children babies meal country
## Min. : 0.0000 Min. : 0.000000 BB :92310 PRT :48590
## 1st Qu.: 0.0000 1st Qu.: 0.000000 FB : 798 GBR :12129
## Median : 0.0000 Median : 0.000000 HB :14463 FRA :10415
## Mean : 0.1039 Mean : 0.007949 SC :10650 ESP : 8568
## 3rd Qu.: 0.0000 3rd Qu.: 0.000000 Undefined: 1169 DEU : 7287
## Max. :10.0000 Max. :10.000000 ITA : 3766
## (Other):28635
## market_segment distribution_channel is_repeated_guest
## Online TA :56477 Corporate: 6677 No :115580
## Offline TA/TO:24219 Direct :14645 Yes: 3810
## Groups :19811 GDS : 193
## Direct :12606 TA/TO :97870
## Corporate : 5295 Undefined: 5
## Complementary: 743
## (Other) : 239
## reserved_room_type assigned_room_type booking_changes deposit_type
## A :85994 A :74053 Min. : 0.0000 No Deposit:104641
## D :19201 D :25322 1st Qu.: 0.0000 Non Refund: 14587
## E : 6535 E : 7806 Median : 0.0000 Refundable: 162
## F : 2897 F : 3751 Mean : 0.2211
## G : 2094 G : 2553 3rd Qu.: 0.0000
## B : 1118 C : 2375 Max. :21.0000
## (Other): 1551 (Other): 3530
## days_in_waiting_list customer_type adr
## Min. : 0.000 Contract : 4076 Min. : -6.38
## 1st Qu.: 0.000 Group : 577 1st Qu.: 69.29
## Median : 0.000 Transient :89613 Median : 94.58
## Mean : 2.321 Transient-Party:25124 Mean : 101.83
## 3rd Qu.: 0.000 3rd Qu.: 126.00
## Max. :391.000 Max. :5400.00
##
## required_car_parking_spaces total_of_special_requests reservation_status
## Min. :0.00000 Min. :0.0000 Canceled :43017
## 1st Qu.:0.00000 1st Qu.:0.0000 Check-Out:75166
## Median :0.00000 Median :0.0000 No-Show : 1207
## Mean :0.06252 Mean :0.5714
## 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :8.00000 Max. :5.0000
##
## reservation_status_date
## Min. :2014-10-17
## 1st Qu.:2016-02-01
## Median :2016-08-07
## Mean :2016-07-30
## 3rd Qu.:2017-02-08
## Max. :2017-09-14
##
Question and Visualization
- Which is the most busy year?
# Subset data
bookings <- hotel_bookings %>%
select(arrival_date_year,arrival_date_month) %>%
mutate(count = 1) %>%
group_by(arrival_date_year, arrival_date_month) %>%
summarise(count = sum(count))
colnames(bookings) <- c("Year","Month", "Total")
bookings <- bookings %>%
mutate(label=glue("Year: {Year}
Month: {Month}
Total Bookings: {comma(Total)}"))
#Plotting
bookings_plot <- ggplot(data = bookings, aes(x = Total,
y = Year, text = label)) +
geom_col(aes(fill = Month)) +
labs(title = "Number of Booking Per Year",
x = "Total Bookings",
y = "Year",
fill = "Month") +
scale_x_continuous(labels = comma) +
coord_flip() +
theme_minimal()
ggplotly(bookings_plot, tooltip = "text")
Based on the visualization above, the most busy year occured in 2016. But several months in 2015 and 2017 were not available in this dataset. Data in 2015 are available from July-December while 2017 are available from January-August.
- How much do guests pay for a room per night?
# Subset data
room_price <- hotel_bookings %>%
select(hotel, assigned_room_type, adr) %>%
mutate(adr = replace(adr, 48516, 0)) %>% #replace outlier with 0 in row 48516
group_by(hotel, assigned_room_type)
colnames(room_price) <- c("Hotel","Room_Type", "Price")
#Plotting
room_price_plot <- ggplot(room_price, aes(x = Room_Type, y = Price, color = Hotel)) +
geom_boxplot() +
facet_grid(.~Hotel) +
labs(title = "Price In Each Hotel and Room Type",
x = "Room Type",
y = "Price",
fill = "Hotel")
ggplotly(room_price_plot)
Price for each hotel and room type are quite variety.
- How about the reservation status in each hotel?
# Subset Data
reservation <- hotel_bookings %>%
select(hotel, reservation_status) %>%
mutate(count = 1) %>%
group_by(hotel, reservation_status) %>%
summarise(count = sum(count))
colnames(reservation) <- c("Hotel","Reservation_Status", "Total")
reservation <- reservation %>%
mutate(label=glue("Hotel: {Hotel}
Reservation Status: {Reservation_Status}
Total: {comma(Total)}"))
#Plotting
reservation_plot <- ggplot(data = reservation, aes(x = Total,
y = reorder(Reservation_Status,-Total), fill = Hotel, text = label)) +
geom_col(aes(fill = Hotel)) +
facet_grid(.~Hotel)+
coord_flip() +
labs(title = "Reservation Status In Each Hotel",
x = "Total",
y = "Reservation Status",
fill = "Hotel") +
scale_x_continuous(labels = comma) +
theme_minimal()
ggplotly(reservation_plot, tooltip = "text")
Most bookings were not canceled or the guest stayed at a hotel. However, there are quite a lot of guest who canceled.
- How many guests (adult+children+babies) stayed at hotel in year?
# Subset Data
guests_stayed <- hotel_bookings %>%
select(hotel, arrival_date_year, reservation_status, reservation_status_date, adults, children, babies) %>%
filter(reservation_status == "Check-Out") %>%
mutate(guests = adults + children + babies) %>%
group_by(hotel, arrival_date_year) %>%
summarise(guests = sum(guests))
colnames(guests_stayed) <- c("Hotel","Year", "Guests")
guests_stayed <- guests_stayed %>%
mutate(label=glue("Hotel: {Hotel}
Year: {Year}
Total Guests: {comma(Guests)}"))
# Plotting
guests_stayed_plot <- ggplot(data = guests_stayed, aes(x = Guests,
y = Year, fill = Hotel, text = label)) +
geom_col(aes(fill = Hotel)) +
facet_grid(.~Hotel) +
labs(title = "Total of Guests Stayed In Hotel",
x = "Total Guest",
y = "Year",
fill = "Hotel") +
scale_x_continuous(labels = comma) +
theme_minimal()
ggplotly(guests_stayed_plot, tooltip = "text")
In 2016, total guests reached 44,433 person in city hotel and 26,432 in resort hotel.
- From where the most guests are stayed in hotel?
# Subset Data
country_guest <- hotel_bookings %>%
select(country, reservation_status) %>%
filter(reservation_status == "Check-Out") %>%
mutate(count = 1) %>%
group_by(country) %>%
summarise(count = sum(count)) %>%
arrange(-count) %>%
top_n(10)
colnames(country_guest) <- c("Country","Total")
country_guest <- country_guest %>%
mutate(label=glue("Country: {Country}
Total: {comma(Total)}"))
# Plotting
country_guest_plot <- ggplot(data = country_guest, aes(x = Total,
y = reorder(Country, -Total), text = label)) +
geom_col(aes(fill = Country)) +
coord_flip() +
labs(title = "Top 10 Highest Booking Country",
x = "Total Bookings",
y = "Country") +
scale_x_continuous(labels = comma) +
theme_minimal()
ggplotly(country_guest_plot, tooltip = "text")
- How many guests visited hotel more than once?
# Subset Data
repeated <- as.data.frame(table(hotel_bookings$is_repeated_guest))
colnames(repeated) <- c("Repeated","Total")
# Plotting
repeated_plot <- plot_ly(type='pie', labels=repeated$Repeated, values=repeated$Total,
textinfo='label+percent',
insidetextorientation='radial')
repeated_plot %>% layout(title = 'Repeated Guest')
Most guest never booked at the same place. The percentage of repeated guest is very low.
- How many requirement of parking spaces based on customer type?
# Subset Data
parking_spaces <- hotel_bookings %>%
select(required_car_parking_spaces, customer_type) %>%
group_by(customer_type) %>%
summarise(required_car_parking_spaces = sum(required_car_parking_spaces))
colnames(parking_spaces) <- c("Customer_Type","Parking_Space")
parking_spaces <- parking_spaces %>%
mutate(label=glue("Customer Type: {Customer_Type}
Total Parking Spaces: {comma(Parking_Space)}"))
# Plotting
parking_spaces_plot <- ggplot(parking_spaces, aes(x = reorder(Customer_Type, Parking_Space),
y = Parking_Space, text = label)) +
geom_segment(aes(x = reorder(Customer_Type, Parking_Space), xend = reorder(Customer_Type, Parking_Space), y = 0, yend = Parking_Space), color="black") +
geom_point(color="orange") +
coord_flip() +
labs(title = "Parking Spaces Based on Customer Type",
x = "Customer Type",
y = "Parking Spaces") +
scale_y_continuous(labels = comma) +
theme_minimal()
ggplotly(parking_spaces_plot, tooltip = "text")
Most guests from Transient and Transient-Party’s type need a lot of parking spaces.
- Which hotel have maximum number of cancellation of bookings? Are they give the deposit?
# Subset Data
cancel_deposit <- hotel_bookings %>%
select(hotel, is_canceled, deposit_type) %>%
filter(is_canceled == "Yes") %>%
mutate(count = 1) %>%
group_by(hotel, deposit_type) %>%
summarise(count = sum(count))
colnames(cancel_deposit) <- c("Hotel","Deposit_Type","Total")
cancel_deposit <- cancel_deposit %>%
mutate(label=glue("Hotel: {Hotel}
Deposit Type: {Deposit_Type}
Total: {comma(Total)}"))
# Plotting
cancel_deposit_plot <- ggplot(data = cancel_deposit, aes(x = Total,
y = Hotel, fill = Hotel, text = label)) +
geom_col(aes(fill = Hotel)) +
facet_grid(.~Deposit_Type)+
coord_flip() +
labs(title = "Guest Canceled in Each Hotel",
x = "Total",
y = "Reservation Status",
fill = "Hotel") +
scale_x_continuous(labels = comma)
ggplotly(cancel_deposit_plot, tooltip = "text")
Mostly they did not give deposit. But the number of people canceled a Non-Refund booking was quite high.
- Which type of customers are more likely to cancel the booking?
#Subset Data
customers_cancel <- hotel_bookings %>%
select(is_canceled, customer_type) %>%
filter(is_canceled == "Yes") %>%
mutate(count = 1) %>%
group_by(customer_type) %>%
summarise(count = sum(count))
# Plotting
customers_cancel_plot <- plot_ly(type='pie', hole = 0.5, labels=customers_cancel$customer_type, values=customers_cancel$count,
textinfo='label+percent', insidetextorientation='radial')
customers_cancel_plot %>% layout(title = 'Canceled Based on Customer Type')
Type of customer are more likely to cancel the booking was Transient and Transient-Party.
- Does lead time affect guest deciding to cancel?
#Subset Data
lead_time <- hotel_bookings %>%
select(arrival_date_year, lead_time, is_canceled)
colnames(lead_time) <- c("Year","Lead_Time","Canceled")
# Plotting
lead_time_plot <- ggplot(lead_time, aes(x=Year, y=Lead_Time, fill=Canceled)) +
geom_violin() +
geom_hline(yintercept=0, alpha=0.5) +
labs(title = "Canceled Lead Time",
x = "Lead Time",
y = "Year",
fill = "Canceled") +
theme_minimal()
ggplotly(lead_time_plot)
The tendency for longer lead times will cause most guest to cancel.
- Which Distribution Channel is mostly preferred for bookings?
distribution <- as.data.frame(table(hotel_bookings$distribution_channel))
colnames(distribution) <- c("Distribution","Total")
distribution <- distribution %>%
mutate(label=glue("Distribution Channel: {Distribution}
Total Bookings: {comma(Total)}"))
distribution_plot <- ggplot(data = distribution, aes(x = Total,
y = reorder(Distribution, Total),
text = label)) +
geom_col(aes(fill = Total)) +
scale_fill_gradient(low="grey", high="black") +
labs(title = "Booking Distribution Channel",
x = "Total Bookings",
y = "Distribution Channel") +
scale_x_continuous(labels = comma) +
theme_minimal()
ggplotly(distribution_plot, tooltip = "text")
The number of bookings from the TA/TO distribution channel is the highest.
Conclusion
- City Hotel have more bookings than Resort Hotel. Resort Hotel may have to increase their advertising.
- There are a lot of guests that canceled the booking. Hotel can send them a survey when they canceled the booking, so we can know the reason why they canceled.
- The percentage of repeated guest is just 3.19%. Hotel may give discounts for guests who have stayed before.
- Hotel need to prepare parking spaces for Transient and Transient-Party, but we know that Transient and Transient-Party are the type of customer that more likely to cancel their booking even though they booked a non-refund booking.
- Hotel must consider the lead time for customer because the longer lead time the more likely to cancel.