Introduction

The hotel sector is a data-rich industry that collects massive amounts of various forms of data. Analytics can be used to segment passengers based on booking patterns, behavior, and other criteria, revealing their tedency to respond to discounts and new travel trends. Hence, using our Hotel data we plan to find out insights which can be useful in future business planning and to optimize revenue. Below are some of the patterns we going to look into:

  • To analyse the booking trends over year to compare bookings
  • To understand customer behavior
  • To get insights about promotional activity

Some specific questions, we will be answering

  • How advance bookings are made by international customers in comparison to Domestic?
  • Which is the most preferred meal type?
  • Relationship between Lead time and cancellations?

Libraries Used

We have used below mentioned packages for our analysis:

 library(readxl) #### To import the excel files into R
 library(DT) #### To display the data attractively
 library(ggplot2) #### To visualize
 library(tidyverse) #### To tidy/clean the data
 library(SmartEDA) #### To visualize
 library(dplyr) #### To tidy/clean the data
 library(skimr) #### To describe the data
 library(lubridate) ### To deal with dates

Data Preparation

Importing the data

hotels <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/hotels.csv')
## Rows: 119390 Columns: 32
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (13): hotel, arrival_date_month, meal, country, market_segment, distrib...
## dbl  (18): is_canceled, lead_time, arrival_date_year, arrival_date_week_numb...
## date  (1): reservation_status_date
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data Dictionary

data_dictionary <- read_excel("D:/Course/Data Wrangling/Mid Term Project/Data_Dictionary.xlsx")
attach(data_dictionary)


datatable(
  head(data_dictionary,40),
  extensions = 'FixedColumns',
  options = list(
    scrollY = "500px",
    scrollX = TRUE,
    fixedColumns = TRUE
  )
)

Understanding the data

Evaluating data dimension

dim(hotels)
## [1] 119390     32

Data ‘hotels’ has 119390 rows and 32 observations

Top 10 rows of data

head(hotels, 10)
## # A tibble: 10 x 32
##    hotel        is_canceled lead_time arrival_date_year arrival_date_month
##    <chr>              <dbl>     <dbl>             <dbl> <chr>             
##  1 Resort Hotel           0       342              2015 July              
##  2 Resort Hotel           0       737              2015 July              
##  3 Resort Hotel           0         7              2015 July              
##  4 Resort Hotel           0        13              2015 July              
##  5 Resort Hotel           0        14              2015 July              
##  6 Resort Hotel           0        14              2015 July              
##  7 Resort Hotel           0         0              2015 July              
##  8 Resort Hotel           0         9              2015 July              
##  9 Resort Hotel           1        85              2015 July              
## 10 Resort Hotel           1        75              2015 July              
## # ... with 27 more variables: arrival_date_week_number <dbl>,
## #   arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
## #   stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
## #   meal <chr>, country <chr>, market_segment <chr>,
## #   distribution_channel <chr>, is_repeated_guest <dbl>,
## #   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## #   reserved_room_type <chr>, assigned_room_type <chr>, ...

Summarizing the data

summary(hotels)
##     hotel            is_canceled       lead_time   arrival_date_year
##  Length:119390      Min.   :0.0000   Min.   :  0   Min.   :2015     
##  Class :character   1st Qu.:0.0000   1st Qu.: 18   1st Qu.:2016     
##  Mode  :character   Median :0.0000   Median : 69   Median :2016     
##                     Mean   :0.3704   Mean   :104   Mean   :2016     
##                     3rd Qu.:1.0000   3rd Qu.:160   3rd Qu.:2017     
##                     Max.   :1.0000   Max.   :737   Max.   :2017     
##                                                                     
##  arrival_date_month arrival_date_week_number arrival_date_day_of_month
##  Length:119390      Min.   : 1.00            Min.   : 1.0             
##  Class :character   1st Qu.:16.00            1st Qu.: 8.0             
##  Mode  :character   Median :28.00            Median :16.0             
##                     Mean   :27.17            Mean   :15.8             
##                     3rd Qu.:38.00            3rd Qu.:23.0             
##                     Max.   :53.00            Max.   :31.0             
##                                                                       
##  stays_in_weekend_nights stays_in_week_nights     adults      
##  Min.   : 0.0000         Min.   : 0.0         Min.   : 0.000  
##  1st Qu.: 0.0000         1st Qu.: 1.0         1st Qu.: 2.000  
##  Median : 1.0000         Median : 2.0         Median : 2.000  
##  Mean   : 0.9276         Mean   : 2.5         Mean   : 1.856  
##  3rd Qu.: 2.0000         3rd Qu.: 3.0         3rd Qu.: 2.000  
##  Max.   :19.0000         Max.   :50.0         Max.   :55.000  
##                                                               
##     children           babies              meal             country         
##  Min.   : 0.0000   Min.   : 0.000000   Length:119390      Length:119390     
##  1st Qu.: 0.0000   1st Qu.: 0.000000   Class :character   Class :character  
##  Median : 0.0000   Median : 0.000000   Mode  :character   Mode  :character  
##  Mean   : 0.1039   Mean   : 0.007949                                        
##  3rd Qu.: 0.0000   3rd Qu.: 0.000000                                        
##  Max.   :10.0000   Max.   :10.000000                                        
##  NA's   :4                                                                  
##  market_segment     distribution_channel is_repeated_guest
##  Length:119390      Length:119390        Min.   :0.00000  
##  Class :character   Class :character     1st Qu.:0.00000  
##  Mode  :character   Mode  :character     Median :0.00000  
##                                          Mean   :0.03191  
##                                          3rd Qu.:0.00000  
##                                          Max.   :1.00000  
##                                                           
##  previous_cancellations previous_bookings_not_canceled reserved_room_type
##  Min.   : 0.00000       Min.   : 0.0000                Length:119390     
##  1st Qu.: 0.00000       1st Qu.: 0.0000                Class :character  
##  Median : 0.00000       Median : 0.0000                Mode  :character  
##  Mean   : 0.08712       Mean   : 0.1371                                  
##  3rd Qu.: 0.00000       3rd Qu.: 0.0000                                  
##  Max.   :26.00000       Max.   :72.0000                                  
##                                                                          
##  assigned_room_type booking_changes   deposit_type          agent          
##  Length:119390      Min.   : 0.0000   Length:119390      Length:119390     
##  Class :character   1st Qu.: 0.0000   Class :character   Class :character  
##  Mode  :character   Median : 0.0000   Mode  :character   Mode  :character  
##                     Mean   : 0.2211                                        
##                     3rd Qu.: 0.0000                                        
##                     Max.   :21.0000                                        
##                                                                            
##    company          days_in_waiting_list customer_type           adr         
##  Length:119390      Min.   :  0.000      Length:119390      Min.   :  -6.38  
##  Class :character   1st Qu.:  0.000      Class :character   1st Qu.:  69.29  
##  Mode  :character   Median :  0.000      Mode  :character   Median :  94.58  
##                     Mean   :  2.321                         Mean   : 101.83  
##                     3rd Qu.:  0.000                         3rd Qu.: 126.00  
##                     Max.   :391.000                         Max.   :5400.00  
##                                                                              
##  required_car_parking_spaces total_of_special_requests reservation_status
##  Min.   :0.00000             Min.   :0.0000            Length:119390     
##  1st Qu.:0.00000             1st Qu.:0.0000            Class :character  
##  Median :0.00000             Median :0.0000            Mode  :character  
##  Mean   :0.06252             Mean   :0.5714                              
##  3rd Qu.:0.00000             3rd Qu.:1.0000                              
##  Max.   :8.00000             Max.   :5.0000                              
##                                                                          
##  reservation_status_date
##  Min.   :2014-10-17     
##  1st Qu.:2016-02-01     
##  Median :2016-08-07     
##  Mean   :2016-07-30     
##  3rd Qu.:2017-02-08     
##  Max.   :2017-09-14     
## 

Structure of the data to know more about the type of data

str(hotels)
## spec_tbl_df [119,390 x 32] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ hotel                         : chr [1:119390] "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
##  $ is_canceled                   : num [1:119390] 0 0 0 0 0 0 0 0 1 1 ...
##  $ lead_time                     : num [1:119390] 342 737 7 13 14 14 0 9 85 75 ...
##  $ arrival_date_year             : num [1:119390] 2015 2015 2015 2015 2015 ...
##  $ arrival_date_month            : chr [1:119390] "July" "July" "July" "July" ...
##  $ arrival_date_week_number      : num [1:119390] 27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : num [1:119390] 1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : num [1:119390] 0 0 1 1 2 2 2 2 3 3 ...
##  $ adults                        : num [1:119390] 2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ meal                          : chr [1:119390] "BB" "BB" "BB" "BB" ...
##  $ country                       : chr [1:119390] "PRT" "PRT" "GBR" "GBR" ...
##  $ market_segment                : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
##  $ distribution_channel          : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
##  $ is_repeated_guest             : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ reserved_room_type            : chr [1:119390] "C" "C" "A" "A" ...
##  $ assigned_room_type            : chr [1:119390] "C" "C" "C" "A" ...
##  $ booking_changes               : num [1:119390] 3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : chr [1:119390] "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
##  $ agent                         : chr [1:119390] "NULL" "NULL" "NULL" "304" ...
##  $ company                       : chr [1:119390] "NULL" "NULL" "NULL" "NULL" ...
##  $ days_in_waiting_list          : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : chr [1:119390] "Transient" "Transient" "Transient" "Transient" ...
##  $ adr                           : num [1:119390] 0 0 75 75 98 ...
##  $ required_car_parking_spaces   : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : num [1:119390] 0 0 0 0 1 1 0 1 1 0 ...
##  $ reservation_status            : chr [1:119390] "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
##  $ reservation_status_date       : Date[1:119390], format: "2015-07-01" "2015-07-01" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   hotel = col_character(),
##   ..   is_canceled = col_double(),
##   ..   lead_time = col_double(),
##   ..   arrival_date_year = col_double(),
##   ..   arrival_date_month = col_character(),
##   ..   arrival_date_week_number = col_double(),
##   ..   arrival_date_day_of_month = col_double(),
##   ..   stays_in_weekend_nights = col_double(),
##   ..   stays_in_week_nights = col_double(),
##   ..   adults = col_double(),
##   ..   children = col_double(),
##   ..   babies = col_double(),
##   ..   meal = col_character(),
##   ..   country = col_character(),
##   ..   market_segment = col_character(),
##   ..   distribution_channel = col_character(),
##   ..   is_repeated_guest = col_double(),
##   ..   previous_cancellations = col_double(),
##   ..   previous_bookings_not_canceled = col_double(),
##   ..   reserved_room_type = col_character(),
##   ..   assigned_room_type = col_character(),
##   ..   booking_changes = col_double(),
##   ..   deposit_type = col_character(),
##   ..   agent = col_character(),
##   ..   company = col_character(),
##   ..   days_in_waiting_list = col_double(),
##   ..   customer_type = col_character(),
##   ..   adr = col_double(),
##   ..   required_car_parking_spaces = col_double(),
##   ..   total_of_special_requests = col_double(),
##   ..   reservation_status = col_character(),
##   ..   reservation_status_date = col_date(format = "")
##   .. )
##  - attr(*, "problems")=<externalptr>

We can observe there is a need to change the data type of two variables: repeated__guest, is__canceled from integar to float

hotels$is_canceled<-as.factor(hotels$is_canceled)
hotels$is_repeated_guest<-as.factor(hotels$is_repeated_guest)

Checking if the data set has missing values

overall
sum(is.na(hotels))
## [1] 4
Columnwise
colSums(is.na(hotels))
##                          hotel                    is_canceled 
##                              0                              0 
##                      lead_time              arrival_date_year 
##                              0                              0 
##             arrival_date_month       arrival_date_week_number 
##                              0                              0 
##      arrival_date_day_of_month        stays_in_weekend_nights 
##                              0                              0 
##           stays_in_week_nights                         adults 
##                              0                              0 
##                       children                         babies 
##                              4                              0 
##                           meal                        country 
##                              0                              0 
##                 market_segment           distribution_channel 
##                              0                              0 
##              is_repeated_guest         previous_cancellations 
##                              0                              0 
## previous_bookings_not_canceled             reserved_room_type 
##                              0                              0 
##             assigned_room_type                booking_changes 
##                              0                              0 
##                   deposit_type                          agent 
##                              0                              0 
##                        company           days_in_waiting_list 
##                              0                              0 
##                  customer_type                            adr 
##                              0                              0 
##    required_car_parking_spaces      total_of_special_requests 
##                              0                              0 
##             reservation_status        reservation_status_date 
##                              0                              0

There are 4 missing value in the column ‘children’. Hence, we will replace the missing values with mode.

Creating the Mode function

getmode = function(v) {
uniqv = unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]}

Replacing NA’s with Mode__

getmode(hotels$children)
## [1] 0
hotels$children[is.na(hotels$children)] <- getmode(hotels$children)

Now the data is clean to work on for insights. Let’s summarize the data

head(hotels, 10)
## # A tibble: 10 x 32
##    hotel        is_canceled lead_time arrival_date_year arrival_date_month
##    <chr>        <fct>           <dbl>             <dbl> <chr>             
##  1 Resort Hotel 0                 342              2015 July              
##  2 Resort Hotel 0                 737              2015 July              
##  3 Resort Hotel 0                   7              2015 July              
##  4 Resort Hotel 0                  13              2015 July              
##  5 Resort Hotel 0                  14              2015 July              
##  6 Resort Hotel 0                  14              2015 July              
##  7 Resort Hotel 0                   0              2015 July              
##  8 Resort Hotel 0                   9              2015 July              
##  9 Resort Hotel 1                  85              2015 July              
## 10 Resort Hotel 1                  75              2015 July              
## # ... with 27 more variables: arrival_date_week_number <dbl>,
## #   arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
## #   stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
## #   meal <chr>, country <chr>, market_segment <chr>,
## #   distribution_channel <chr>, is_repeated_guest <fct>,
## #   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## #   reserved_room_type <chr>, assigned_room_type <chr>, ...
colSums(is.na(hotels))
##                          hotel                    is_canceled 
##                              0                              0 
##                      lead_time              arrival_date_year 
##                              0                              0 
##             arrival_date_month       arrival_date_week_number 
##                              0                              0 
##      arrival_date_day_of_month        stays_in_weekend_nights 
##                              0                              0 
##           stays_in_week_nights                         adults 
##                              0                              0 
##                       children                         babies 
##                              0                              0 
##                           meal                        country 
##                              0                              0 
##                 market_segment           distribution_channel 
##                              0                              0 
##              is_repeated_guest         previous_cancellations 
##                              0                              0 
## previous_bookings_not_canceled             reserved_room_type 
##                              0                              0 
##             assigned_room_type                booking_changes 
##                              0                              0 
##                   deposit_type                          agent 
##                              0                              0 
##                        company           days_in_waiting_list 
##                              0                              0 
##                  customer_type                            adr 
##                              0                              0 
##    required_car_parking_spaces      total_of_special_requests 
##                              0                              0 
##             reservation_status        reservation_status_date 
##                              0                              0

There are no missing value now.

Frequency for some categorical variables

table(hotels$arrival_date_month)
## 
##     April    August  December  February   January      July      June     March 
##     11089     13877      6780      8068      5929     12661     10939      9794 
##       May  November   October September 
##     11791      6794     11160     10508
table(hotels$market_segment)
## 
##      Aviation Complementary     Corporate        Direct        Groups 
##           237           743          5295         12606         19811 
## Offline TA/TO     Online TA     Undefined 
##         24219         56477             2

Studying some Numerical Variables

boxplot(hotels$lead_time, main = "lead_time")

boxplot(hotels$arrival_date_year, main = "arrival_date_year")

boxplot(hotels$arrival_date_week_number, main = "arrival_date_week_number")

Significant outliers in lead_time but not for other two. Lead time is an important part of data hence can not be replaced or removed.

The processed data has 119390 rows and 32 columns with no missing values. Also, the data type for the column is aligned as per the requirement.

Exploratory Data Analysis

  • The data is obtained from github website and link to the site is Data source

  • The data consist the information about the hotel bookings. We have total 1,19,390 records for the hotel bookings which are collected between the arrival date of 1st July 2015 and 31st August 2017. It consist of 32 variables like Hotel type, date and time of booking, repeated customer or not, country, number of people etc.

  • To begin with our analysis, the first thing we are going to look for is the booking curve over the time. What is the time period when most number of bookings are received.

Booking Pattern

hotel1 <- hotels %>% 
    mutate(arrival_date = glue::glue("{arrival_date_year}-{arrival_date_month}-{arrival_date_day_of_month}"),arrival_date = parse_date(arrival_date, format = "%Y-%B-%d"))
    
p <- ggplot(hotel1, aes(x=hotel1$lead_time)) + geom_area(stat="bin",color="cornflowerblue",fill="lightblue")
p + labs(x="Number of Days", y="Count", title="Bookings: number of days out") + theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can observe from the graph, maximum number of bookings are received within 0-30 and 31-60 window.

Customer Behavior

Once, we have the idea how the booking curve works, we should understand more about our customers. Their preferences while making a booking.

Below graph indicates the popular choice among customers for City Hotel or Resort Hotel.

hotels_by_type <- hotels %>% select(arrival_date_year, hotel) %>%  group_by(arrival_date_year, hotel)%>% summarize(count = n()) %>% mutate(pct = count/sum(count)*100)
   
ggplot(hotels_by_type, aes(x = arrival_date_year, y = pct, fill = factor(hotel))) + 
  geom_bar(stat = "identity" , width=0.25) +
  scale_x_continuous(breaks = seq(2015,2017,1)) +ggtitle("Preferred Type")

It’s clearly visible from the above graph that customers prefer City Hotel. The number of city hotels booked in 2015 is 62.5% of total number of bookings and the share increased by approximately 6% in 2016.

hotels %>% 
group_by(is_repeated_guest) %>% 
summarise(Count=n())%>% mutate(Percent =  round(Count/ sum(Count)*100,2))
## # A tibble: 2 x 3
##   is_repeated_guest  Count Percent
##   <fct>              <int>   <dbl>
## 1 0                 115580   96.8 
## 2 1                   3810    3.19

Here, 0 represents the not repeated guest and 1 repeated guest. 96.81% of customers are not repeated, indicating customers pereference to book a new place every time they visit outside.

Further, we want to dig in to know the type of repeated and unrepeated guest

Customers_vs_repeatedguests <- hotels %>% group_by(customer_type, is_repeated_guest)%>% summarize(count = n()) %>% mutate(pct = count/sum(count))
## `summarise()` has grouped output by 'customer_type'. You can override using the `.groups` argument.
ggplot(Customers_vs_repeatedguests, aes(x = customer_type, y = pct, fill = factor(is_repeated_guest))) +   geom_bar(stat = "identity" , width=0.5, position = "dodge")+ ggtitle("Customer Type")

People prefer visiting in groups while contracts has the least.

hotels %>% select (deposit_type, is_canceled) %>% group_by(deposit_type) %>% 
summarise(Total_Cancellations=sum(is_canceled==1))  %>% mutate(PercentCancellation =  Total_Cancellations/ sum(Total_Cancellations)*100)
## # A tibble: 3 x 3
##   deposit_type Total_Cancellations PercentCancellation
##   <chr>                      <int>               <dbl>
## 1 No Deposit                 29694             67.1   
## 2 Non Refund                 14494             32.8   
## 3 Refundable                    36              0.0814

Looking at the cancellation trend, we observe customer is more tend to cancel the bookings when it is ‘No deposit’ that is 67%

hotels %>% select (meal) %>% group_by(meal) %>% 
summarise(Count=n())  %>% ggplot(aes(x=meal,y=Count))+ geom_bar(stat = 'identity',fill = 'Steelblue')+
geom_text(aes(label = meal),vjust = 1.5)+ggtitle("Preferred Room/Meal")

The most preferred room type among customers is BB i.e. ‘Bed and Breakfast’

One interesting thing to note is, not getting desired room is not the reasons for people cancelling the room.

hotels %>% filter(is_canceled==1, reserved_room_type != assigned_room_type) %>% summarise(n())
## # A tibble: 1 x 1
##   `n()`
##   <int>
## 1   802

Only 0.6% cases of cancellation where customer does not get the desired room.

Promotional Activity

hotels %>% filter(arrival_date_year == '2016') %>% select (arrival_date_month, distribution_channel) %>% group_by(arrival_date_month,distribution_channel) %>% 
summarise(Bookings_distribution_channel=n())  %>% mutate(PercentBooking =  Bookings_distribution_channel/sum(Bookings_distribution_channel)*100) %>% arrange(desc(PercentBooking))
## `summarise()` has grouped output by 'arrival_date_month'. You can override using the `.groups` argument.
## # A tibble: 47 x 4
## # Groups:   arrival_date_month [12]
##    arrival_date_month distribution_channel Bookings_distribution~ PercentBooking
##    <chr>              <chr>                                 <int>          <dbl>
##  1 August             TA/TO                                  4334           85.6
##  2 September          TA/TO                                  4572           84.8
##  3 June               TA/TO                                  4471           84.5
##  4 July               TA/TO                                  3858           84.4
##  5 May                TA/TO                                  4614           84.2
##  6 April              TA/TO                                  4561           84.0
##  7 December           TA/TO                                  3127           81.0
##  8 November           TA/TO                                  3590           80.6
##  9 March              TA/TO                                  3854           79.9
## 10 October            TA/TO                                  4948           79.8
## # ... with 37 more rows

Above table shows us the monthwise distribution channel bookings which can be used for promotional activities. In general, booking share is like below:

bookings_by_distribution_channel <- hotels %>% group_by(distribution_channel, hotel)%>% summarize(count = n()) %>% mutate(pct = count/sum(count))
## `summarise()` has grouped output by 'distribution_channel'. You can override using the `.groups` argument.
ggplot(bookings_by_distribution_channel, aes(x = distribution_channel, y = pct, fill = factor(hotel))) + 
  geom_bar(stat = "identity" , width=0.5, position = "dodge")+ggtitle("Booking through Distribution channel")

From this plot we can notice that distribution channel GDS does bookings only at City hotels and TA/TO has higher bookings at City hotels while Corporate and Direct have equal weights.

But we should also look for Percentage of cancellations by distribution Channel and Type

cancellation_by_distribution_channel <- hotels %>% group_by(distribution_channel, is_canceled)%>% summarize(count = n()) %>% mutate(pct = count/sum(count))
## `summarise()` has grouped output by 'distribution_channel'. You can override using the `.groups` argument.
ggplot(cancellation_by_distribution_channel, aes(x = distribution_channel, y = pct, fill = factor(is_canceled))) + 
  geom_bar(stat = "identity" , width=0.5, position = "dodge")+ggtitle("Cancellation through Distribution channel")

From this plot it can be observed that distribution channel TA/TO cancelled around 40% of the bookings while others cancelled around 20%.

Some specific questions we look to answer

  • Top 10 Countries by number of bookings
data_country <- hotels %>% group_by(country) %>% summarise(booking_count = n()) %>% arrange(desc(booking_count))
top_n(data_country,10,booking_count) %>% 
  ggplot(.,aes(country, booking_count)) +
  geom_bar(stat = "identity", fill = "Blue")+ggtitle("Top 10 countries")

Portugal has highest number of bookings followed by Great Britain and France.

  • Percentage of cancellations by hotel type each year
Cancellations_by_type <- hotels %>% filter(is_canceled == 1) %>% group_by(arrival_date_year, hotel)%>% summarize(count = n()) %>% mutate(pct = count/sum(count))
## `summarise()` has grouped output by 'arrival_date_year'. You can override using the `.groups` argument.
ggplot(Cancellations_by_type, aes(x = arrival_date_year, y = pct, fill = factor(hotel))) + 
  geom_bar(stat = "identity" , width=0.25) +
  scale_x_continuous(breaks = seq(2015,2017,1))+ ggtitle("Cancellation %")

Of the total number of cancellations city hotels has higher share than resort hotels which is expected as the number of bookings at city hotel is higher, but bookings percentage of 62.5 vs cancellation percentage of 75 are not exactly proportional.

  • We know bookings are made closer to stay but what about are the cancellations also more for bookings amde closer to stay
df1=(hotels %>% group_by(lead_time) %>% 
summarise(Bookings= n()))

df2= hotels %>% filter(is_canceled==1) %>%  group_by(lead_time) %>% 
summarise(cancellations= n())

df = merge(x=df1,y=df2,by="lead_time")

df3 = df %>%  mutate(cancellationBooking =  cancellations/Bookings*100)

p <- ggplot(df3, aes(lead_time,cancellationBooking)) + geom_area(color="cornflowerblue",fill="cornflowerblue")
p + labs(x="Lead Time", y="Cancellation to Booking %", title="Cancellation to Bookings") + theme_bw()

It is evident from the plot, 25% of the bookings are only cancelled closer to stay but bookings made one month out have higher chances of getting cancelled.

Summary/Insights

  • More than 50% bookings are coming within 0-30 day of stay. Hence, industry should be more focused on this period.
  • Customers are more inclined to cancellations when booking is done with ‘No Deposit’. During the high selling months this type of payment should not be given.
  • 60% Customers are more inclined to City Hotels and only 4% go to same hotel again. Therefore, marketing strategy should be more targeted to attract new customers.
  • There are 3 times more cancellation in city hotel as compared to resort hotels.
  • Portugal with highest number of bookings can be the right place to market or provide different offers.
  • Bookings made outside 30 days have more then 45% chances to get cancel , hence, while aligning the prices one should be aware that more vacant rooms will be there available at the end.