Hotel Booking Analysis

Safira Widya Putri

2022-05-22

Introduction

This data contains booking information of hotel. Data includes information such as when the booking was made, lead time, country, the number of adults, children, and/or babies, etc. Source of data from Kaggle.

Set Up Library

First, we must set up the library that we needed.

library(lubridate)
library(dplyr)
library(GGally)
library(ggplot2)
library(plotly)
library(glue)
library(scales)

Data Preparation

Input Data

Input our data and put it into ‘hotel_bookings’ object. We use stringsAsFactors = True, so all columns with character data type will be converted into factors.

hotel_bookings <- read.csv("hotel_bookings.csv", stringsAsFactors = T)

Overview our data:

head(hotel_bookings)
tail(hotel_bookings)

Data Structure

Check the number of columns and rows.

dim(hotel_bookings)
## [1] 119390     32

Data contains 119,390 rows and 32 columns.

View all columns and the data types.

glimpse(hotel_bookings)
## Rows: 119,390
## Columns: 32
## $ hotel                          <fct> Resort Hotel, Resort Hotel, Resort Hote~
## $ is_canceled                    <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, ~
## $ lead_time                      <int> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75, ~
## $ arrival_date_year              <int> 2015, 2015, 2015, 2015, 2015, 2015, 201~
## $ arrival_date_month             <fct> July, July, July, July, July, July, Jul~
## $ arrival_date_week_number       <int> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,~
## $ arrival_date_day_of_month      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ stays_in_weekend_nights        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ stays_in_week_nights           <int> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, ~
## $ adults                         <int> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, ~
## $ children                       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ babies                         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ meal                           <fct> BB, BB, BB, BB, BB, BB, BB, FB, BB, HB,~
## $ country                        <fct> PRT, PRT, GBR, GBR, GBR, GBR, PRT, PRT,~
## $ market_segment                 <fct> Direct, Direct, Direct, Corporate, Onli~
## $ distribution_channel           <fct> Direct, Direct, Direct, Corporate, TA/T~
## $ is_repeated_guest              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ previous_cancellations         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ previous_bookings_not_canceled <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ reserved_room_type             <fct> C, C, A, A, A, A, C, C, A, D, E, D, D, ~
## $ assigned_room_type             <fct> C, C, C, A, A, A, C, C, A, D, E, D, E, ~
## $ booking_changes                <int> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ deposit_type                   <fct> No Deposit, No Deposit, No Deposit, No ~
## $ agent                          <fct> NULL, NULL, NULL, 304, 240, 240, NULL, ~
## $ company                        <fct> NULL, NULL, NULL, NULL, NULL, NULL, NUL~
## $ days_in_waiting_list           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ customer_type                  <fct> Transient, Transient, Transient, Transi~
## $ adr                            <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00,~
## $ required_car_parking_spaces    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ total_of_special_requests      <int> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3, ~
## $ reservation_status             <fct> Check-Out, Check-Out, Check-Out, Check-~
## $ reservation_status_date        <fct> 2015-07-01, 2015-07-01, 2015-07-02, 201~

Some of data types are not in the correct type.

Pre-processing Data

Convert the data type!

The column is_canceled, arrival_date_year, and is_repeated_guest should be Factor, reservation_status_date should be Date, and we have to sort month’s name from arrival_date_month. Then we can delete columns that are not used in this analysis.

hotel_bookings <- hotel_bookings %>%
  select(-c(previous_cancellations, previous_bookings_not_canceled, agent, company)) %>% 
  mutate(is_canceled = factor(is_canceled, levels = c(0, 1), labels = c("No", "Yes")),
         arrival_date_year = as.factor(arrival_date_year),
         is_repeated_guest = factor(is_repeated_guest, levels = c(0, 1), labels = c("No", "Yes")),
         reservation_status_date = ymd(reservation_status_date),
         arrival_date_month = factor(arrival_date_month, levels = c(
      "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"),ordered = TRUE))

Now the data type of all columns are correct. Next, checking the missing value.

colSums(is.na(hotel_bookings))
##                       hotel                 is_canceled 
##                           0                           0 
##                   lead_time           arrival_date_year 
##                           0                           0 
##          arrival_date_month    arrival_date_week_number 
##                           0                           0 
##   arrival_date_day_of_month     stays_in_weekend_nights 
##                           0                           0 
##        stays_in_week_nights                      adults 
##                           0                           0 
##                    children                      babies 
##                           4                           0 
##                        meal                     country 
##                           0                           0 
##              market_segment        distribution_channel 
##                           0                           0 
##           is_repeated_guest          reserved_room_type 
##                           0                           0 
##          assigned_room_type             booking_changes 
##                           0                           0 
##                deposit_type        days_in_waiting_list 
##                           0                           0 
##               customer_type                         adr 
##                           0                           0 
## required_car_parking_spaces   total_of_special_requests 
##                           0                           0 
##          reservation_status     reservation_status_date 
##                           0                           0

We found 4 missing values in Children column. Convert the missing values to 0.

hotel_bookings <- hotel_bookings %>% 
  mutate(children = replace(children, is.na(children), 0))

Let’s take a look data again!

head(hotel_bookings)

Data Summary

Summary of all columns.

summary(hotel_bookings)
##           hotel       is_canceled   lead_time   arrival_date_year
##  City Hotel  :79330   No :75166   Min.   :  0   2015:21996       
##  Resort Hotel:40060   Yes:44224   1st Qu.: 18   2016:56707       
##                                   Median : 69   2017:40687       
##                                   Mean   :104                    
##                                   3rd Qu.:160                    
##                                   Max.   :737                    
##                                                                  
##  arrival_date_month arrival_date_week_number arrival_date_day_of_month
##  August :13877      Min.   : 1.00            Min.   : 1.0             
##  July   :12661      1st Qu.:16.00            1st Qu.: 8.0             
##  May    :11791      Median :28.00            Median :16.0             
##  October:11160      Mean   :27.17            Mean   :15.8             
##  April  :11089      3rd Qu.:38.00            3rd Qu.:23.0             
##  June   :10939      Max.   :53.00            Max.   :31.0             
##  (Other):47873                                                        
##  stays_in_weekend_nights stays_in_week_nights     adults      
##  Min.   : 0.0000         Min.   : 0.0         Min.   : 0.000  
##  1st Qu.: 0.0000         1st Qu.: 1.0         1st Qu.: 2.000  
##  Median : 1.0000         Median : 2.0         Median : 2.000  
##  Mean   : 0.9276         Mean   : 2.5         Mean   : 1.856  
##  3rd Qu.: 2.0000         3rd Qu.: 3.0         3rd Qu.: 2.000  
##  Max.   :19.0000         Max.   :50.0         Max.   :55.000  
##                                                               
##     children           babies                 meal          country     
##  Min.   : 0.0000   Min.   : 0.000000   BB       :92310   PRT    :48590  
##  1st Qu.: 0.0000   1st Qu.: 0.000000   FB       :  798   GBR    :12129  
##  Median : 0.0000   Median : 0.000000   HB       :14463   FRA    :10415  
##  Mean   : 0.1039   Mean   : 0.007949   SC       :10650   ESP    : 8568  
##  3rd Qu.: 0.0000   3rd Qu.: 0.000000   Undefined: 1169   DEU    : 7287  
##  Max.   :10.0000   Max.   :10.000000                     ITA    : 3766  
##                                                          (Other):28635  
##        market_segment  distribution_channel is_repeated_guest
##  Online TA    :56477   Corporate: 6677      No :115580       
##  Offline TA/TO:24219   Direct   :14645      Yes:  3810       
##  Groups       :19811   GDS      :  193                       
##  Direct       :12606   TA/TO    :97870                       
##  Corporate    : 5295   Undefined:    5                       
##  Complementary:  743                                         
##  (Other)      :  239                                         
##  reserved_room_type assigned_room_type booking_changes       deposit_type   
##  A      :85994      A      :74053      Min.   : 0.0000   No Deposit:104641  
##  D      :19201      D      :25322      1st Qu.: 0.0000   Non Refund: 14587  
##  E      : 6535      E      : 7806      Median : 0.0000   Refundable:   162  
##  F      : 2897      F      : 3751      Mean   : 0.2211                      
##  G      : 2094      G      : 2553      3rd Qu.: 0.0000                      
##  B      : 1118      C      : 2375      Max.   :21.0000                      
##  (Other): 1551      (Other): 3530                                           
##  days_in_waiting_list         customer_type        adr         
##  Min.   :  0.000      Contract       : 4076   Min.   :  -6.38  
##  1st Qu.:  0.000      Group          :  577   1st Qu.:  69.29  
##  Median :  0.000      Transient      :89613   Median :  94.58  
##  Mean   :  2.321      Transient-Party:25124   Mean   : 101.83  
##  3rd Qu.:  0.000                              3rd Qu.: 126.00  
##  Max.   :391.000                              Max.   :5400.00  
##                                                                
##  required_car_parking_spaces total_of_special_requests reservation_status
##  Min.   :0.00000             Min.   :0.0000            Canceled :43017   
##  1st Qu.:0.00000             1st Qu.:0.0000            Check-Out:75166   
##  Median :0.00000             Median :0.0000            No-Show  : 1207   
##  Mean   :0.06252             Mean   :0.5714                              
##  3rd Qu.:0.00000             3rd Qu.:1.0000                              
##  Max.   :8.00000             Max.   :5.0000                              
##                                                                          
##  reservation_status_date
##  Min.   :2014-10-17     
##  1st Qu.:2016-02-01     
##  Median :2016-08-07     
##  Mean   :2016-07-30     
##  3rd Qu.:2017-02-08     
##  Max.   :2017-09-14     
## 

Question and Visualization

  1. Which is the most busy year?
# Subset data
bookings <- hotel_bookings %>% 
  select(arrival_date_year,arrival_date_month) %>% 
  mutate(count = 1) %>% 
  group_by(arrival_date_year, arrival_date_month) %>% 
  summarise(count = sum(count))
colnames(bookings) <- c("Year","Month", "Total")
bookings <- bookings %>% 
  mutate(label=glue("Year: {Year}
                    Month: {Month}
                    Total Bookings: {comma(Total)}"))

#Plotting
bookings_plot <- ggplot(data = bookings, aes(x = Total, 
                              y = Year, text = label)) +
  geom_col(aes(fill = Month)) +
  labs(title = "Number of Booking Per Year",
       x = "Total Bookings",
       y = "Year",
       fill = "Month") +
  scale_x_continuous(labels = comma) +
  coord_flip() +
  theme_minimal()


ggplotly(bookings_plot, tooltip = "text")

Based on the visualization above, the most busy year occured in 2016. But several months in 2015 and 2017 were not available in this dataset. Data in 2015 are available from July-December while 2017 are available from January-August.

  1. How much do guests pay for a room per night?
# Subset data
room_price <- hotel_bookings %>% 
  select(hotel, assigned_room_type, adr) %>%
  mutate(adr = replace(adr, 48516, 0)) %>% #replace outlier with 0 in row 48516
  group_by(hotel, assigned_room_type)
colnames(room_price) <- c("Hotel","Room_Type", "Price")

#Plotting
room_price_plot <- ggplot(room_price, aes(x = Room_Type, y = Price, color = Hotel)) +
  geom_boxplot() +
  facet_grid(.~Hotel) +
  labs(title = "Price In Each Hotel and Room Type",
       x = "Room Type",
       y = "Price",
       fill = "Hotel")
ggplotly(room_price_plot)

Price for each hotel and room type are quite variety.

  1. How about the reservation status in each hotel?
# Subset Data
reservation <- hotel_bookings %>% 
  select(hotel, reservation_status) %>%
  mutate(count = 1) %>%
  group_by(hotel, reservation_status) %>% 
  summarise(count = sum(count))
colnames(reservation) <- c("Hotel","Reservation_Status", "Total")
reservation <- reservation %>% 
  mutate(label=glue("Hotel: {Hotel}
                    Reservation Status: {Reservation_Status}
                    Total: {comma(Total)}"))


#Plotting
reservation_plot <- ggplot(data = reservation, aes(x = Total, 
                              y = reorder(Reservation_Status,-Total), fill = Hotel, text = label)) +
  geom_col(aes(fill = Hotel)) +
  facet_grid(.~Hotel)+
  coord_flip() +
  labs(title = "Reservation Status In Each Hotel",
       x = "Total",
       y = "Reservation Status",
       fill = "Hotel") +
  scale_x_continuous(labels = comma) +
  theme_minimal() 
ggplotly(reservation_plot, tooltip = "text")

Most bookings were not canceled or the guest stayed at a hotel. However, there are quite a lot of guest who canceled.

  1. How many guests (adult+children+babies) stayed at hotel in year?
# Subset Data

guests_stayed <- hotel_bookings %>% 
  select(hotel, arrival_date_year, reservation_status, reservation_status_date, adults, children, babies) %>% 
  filter(reservation_status == "Check-Out") %>% 
  mutate(guests = adults + children + babies) %>% 
  group_by(hotel, arrival_date_year) %>% 
  summarise(guests = sum(guests))
colnames(guests_stayed) <- c("Hotel","Year", "Guests")
guests_stayed <- guests_stayed %>% 
  mutate(label=glue("Hotel: {Hotel}
                    Year: {Year}
                    Total Guests: {comma(Guests)}"))

# Plotting
guests_stayed_plot <- ggplot(data = guests_stayed, aes(x = Guests, 
                              y = Year, fill = Hotel, text = label)) +
  geom_col(aes(fill = Hotel)) +
  facet_grid(.~Hotel) +
  labs(title = "Total of Guests Stayed In Hotel",
       x = "Total Guest",
       y = "Year",
       fill = "Hotel") +
  scale_x_continuous(labels = comma) +
  theme_minimal() 
ggplotly(guests_stayed_plot, tooltip = "text")

In 2016, total guests reached 44,433 person in city hotel and 26,432 in resort hotel.

  1. From where the most guests are stayed in hotel?
# Subset Data
country_guest <- hotel_bookings %>% 
  select(country, reservation_status) %>% 
  filter(reservation_status == "Check-Out") %>% 
  mutate(count = 1) %>% 
  group_by(country) %>% 
  summarise(count = sum(count)) %>% 
  arrange(-count) %>%
  top_n(10)
colnames(country_guest) <- c("Country","Total")
country_guest <- country_guest %>% 
  mutate(label=glue("Country: {Country}
                    Total: {comma(Total)}"))

# Plotting
country_guest_plot <- ggplot(data = country_guest, aes(x = Total, 
                              y = reorder(Country, -Total), text = label)) +
  geom_col(aes(fill = Country)) +
  coord_flip() +
  labs(title = "Top 10 Highest Booking Country",
       x = "Total Bookings",
       y = "Country") +
  scale_x_continuous(labels = comma) +
  theme_minimal()
ggplotly(country_guest_plot, tooltip = "text")
  1. How many guests visited hotel more than once?
# Subset Data
repeated <- as.data.frame(table(hotel_bookings$is_repeated_guest))
colnames(repeated) <- c("Repeated","Total")

# Plotting
repeated_plot <-  plot_ly(type='pie', labels=repeated$Repeated, values=repeated$Total, 
               textinfo='label+percent',
               insidetextorientation='radial')
repeated_plot %>% layout(title = 'Repeated Guest')

Most guest never booked at the same place. The percentage of repeated guest is very low.

  1. How many requirement of parking spaces based on customer type?
# Subset Data
parking_spaces <- hotel_bookings %>% 
  select(required_car_parking_spaces, customer_type) %>% 
  group_by(customer_type) %>% 
  summarise(required_car_parking_spaces = sum(required_car_parking_spaces))
colnames(parking_spaces) <- c("Customer_Type","Parking_Space")
parking_spaces <- parking_spaces %>% 
  mutate(label=glue("Customer Type: {Customer_Type}
                    Total Parking Spaces: {comma(Parking_Space)}"))

# Plotting
parking_spaces_plot <- ggplot(parking_spaces, aes(x = reorder(Customer_Type, Parking_Space), 
                    y = Parking_Space, text = label)) +
  geom_segment(aes(x = reorder(Customer_Type, Parking_Space), xend = reorder(Customer_Type, Parking_Space), y = 0, yend = Parking_Space), color="black") +
  geom_point(color="orange") +
  coord_flip() +
  labs(title = "Parking Spaces Based on Customer Type",
       x = "Customer Type",
       y = "Parking Spaces") +
  scale_y_continuous(labels = comma) +
  theme_minimal()
ggplotly(parking_spaces_plot, tooltip = "text")

Most guests from Transient and Transient-Party’s type need a lot of parking spaces.

  1. Which hotel have maximum number of cancellation of bookings? Are they give the deposit?
# Subset Data
cancel_deposit <- hotel_bookings %>% 
  select(hotel, is_canceled, deposit_type) %>%
  filter(is_canceled == "Yes") %>% 
  mutate(count = 1) %>%
  group_by(hotel, deposit_type) %>%
  summarise(count = sum(count))
colnames(cancel_deposit) <- c("Hotel","Deposit_Type","Total")
cancel_deposit <- cancel_deposit %>% 
  mutate(label=glue("Hotel: {Hotel}
                    Deposit Type: {Deposit_Type}
                    Total: {comma(Total)}"))

# Plotting
cancel_deposit_plot <- ggplot(data = cancel_deposit, aes(x = Total, 
                              y = Hotel, fill = Hotel, text = label)) +
  geom_col(aes(fill = Hotel)) +
  facet_grid(.~Deposit_Type)+
  coord_flip() +
  labs(title = "Guest Canceled in Each Hotel",
       x = "Total",
       y = "Reservation Status",
       fill = "Hotel") +
  scale_x_continuous(labels = comma)
ggplotly(cancel_deposit_plot, tooltip = "text")

Mostly they did not give deposit. But the number of people canceled a Non-Refund booking was quite high.

  1. Which type of customers are more likely to cancel the booking?
#Subset Data
customers_cancel <- hotel_bookings %>% 
  select(is_canceled, customer_type) %>% 
  filter(is_canceled == "Yes") %>% 
  mutate(count = 1) %>% 
  group_by(customer_type) %>%
  summarise(count = sum(count))

# Plotting
customers_cancel_plot <- plot_ly(type='pie', hole = 0.5, labels=customers_cancel$customer_type, values=customers_cancel$count, 
               textinfo='label+percent', insidetextorientation='radial')
customers_cancel_plot %>% layout(title = 'Canceled Based on Customer Type')

Type of customer are more likely to cancel the booking was Transient and Transient-Party.

  1. Does lead time affect guest deciding to cancel?
#Subset Data
lead_time <- hotel_bookings %>% 
  select(arrival_date_year, lead_time, is_canceled)
colnames(lead_time) <- c("Year","Lead_Time","Canceled")

# Plotting
lead_time_plot <- ggplot(lead_time, aes(x=Year, y=Lead_Time, fill=Canceled)) + 
  geom_violin() +
  geom_hline(yintercept=0, alpha=0.5) +
  labs(title = "Canceled Lead Time",
       x = "Lead Time",
       y = "Year",
       fill = "Canceled") +
  theme_minimal()
ggplotly(lead_time_plot)

The tendency for longer lead times will cause most guest to cancel.

  1. Which Distribution Channel is mostly preferred for bookings?
distribution <- as.data.frame(table(hotel_bookings$distribution_channel))
colnames(distribution) <- c("Distribution","Total")
distribution <- distribution %>% 
  mutate(label=glue("Distribution Channel: {Distribution}
                    Total Bookings: {comma(Total)}"))

distribution_plot <- ggplot(data = distribution, aes(x = Total, 
                              y = reorder(Distribution, Total),
                              text = label)) +
  geom_col(aes(fill = Total)) +
  scale_fill_gradient(low="grey", high="black") +
  labs(title = "Booking Distribution Channel",
       x = "Total Bookings",
       y = "Distribution Channel") +
  scale_x_continuous(labels = comma) +
  theme_minimal()

ggplotly(distribution_plot, tooltip = "text")

The number of bookings from the TA/TO distribution channel is the highest.

Conclusion

  1. City Hotel have more bookings than Resort Hotel. Resort Hotel may have to increase their advertising.
  2. There are a lot of guests that canceled the booking. Hotel can send them a survey when they canceled the booking, so we can know the reason why they canceled.
  3. The percentage of repeated guest is just 3.19%. Hotel may give discounts for guests who have stayed before.
  4. Hotel need to prepare parking spaces for Transient and Transient-Party, but we know that Transient and Transient-Party are the type of customer that more likely to cancel their booking even though they booked a non-refund booking.
  5. Hotel must consider the lead time for customer because the longer lead time the more likely to cancel.