Introduction

Background & Objectives

The hotel industry plays a critical role in the success of the hospitality and tourism sectors. As competition in the market grows, hotel business owners need to understand their hotel’s performance and develop effective marketing strategies to increase visibility.

To achieve this, hotels need to analyze historical trends and customer behavior to gain valuable insights and inform data-driven marketing campaigns. By leveraging this information and implementing well-planned marketing strategies, hotels can boost their performance and revenue, ensuring continued success in the highly competitive hospitality industry.

Data Analytics Phases

  1. Data Collection & Preparation.

  2. Data Cleaning & Integrity.

  3. Data Analysis.

    • Exploratory Data Analysis.

    • Data Visualization.

    • Insight and Recommendation.

  4. Analysis summary.

Tools & Method

R programming language = for data cleaning & analysis.

R markdown = for executing the final report.

Method = Exploratory Data Analysis.

Data Preparation

Data Collection

The data originally comes from an open dataset of hotel booking in Portugal from Antonio, Almeida and Nunes, 2019 and was downloaded from the Kaggle public data set, hotel_bookings.csv, click this link.

The hotel bookings data contains:

  • 119,210 entries.

  • 32 columns.

Packages Required

tidyverse : This package consists of 3 most important for this project:

  • ggplot2: Used for creating powerful visualizations.

  • dplyr: Used for data manipulation.

  • tidyr: Used for data modifications.

Load packages

library(vctrs)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:vctrs':
## 
##     data_frame
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

Load and Examine the data

hotel_bookings <- read.csv("~/hotel_bookings.csv")
View(hotel_bookings)
str(hotel_bookings) ## to display the structure and content of the data set.
## 'data.frame':    119390 obs. of  32 variables:
##  $ hotel                         : chr  "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
##  $ is_canceled                   : int  0 0 0 0 0 0 0 0 1 1 ...
##  $ lead_time                     : int  342 737 7 13 14 14 0 9 85 75 ...
##  $ arrival_date_year             : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ arrival_date_month            : chr  "July" "July" "July" "July" ...
##  $ arrival_date_week_number      : int  27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : int  0 0 1 1 2 2 2 2 3 3 ...
##  $ adults                        : int  2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ meal                          : chr  "BB" "BB" "BB" "BB" ...
##  $ country                       : chr  "PRT" "PRT" "GBR" "GBR" ...
##  $ market_segment                : chr  "Direct" "Direct" "Direct" "Corporate" ...
##  $ distribution_channel          : chr  "Direct" "Direct" "Direct" "Corporate" ...
##  $ is_repeated_guest             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reserved_room_type            : chr  "C" "C" "A" "A" ...
##  $ assigned_room_type            : chr  "C" "C" "C" "A" ...
##  $ booking_changes               : int  3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : chr  "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
##  $ agent                         : chr  "NULL" "NULL" "NULL" "304" ...
##  $ company                       : chr  "NULL" "NULL" "NULL" "NULL" ...
##  $ days_in_waiting_list          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : chr  "Transient" "Transient" "Transient" "Transient" ...
##  $ adr                           : num  0 0 75 75 98 ...
##  $ required_car_parking_spaces   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : int  0 0 0 0 1 1 0 1 1 0 ...
##  $ reservation_status            : chr  "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
##  $ reservation_status_date       : chr  "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...
colnames(hotel_bookings) ## to display the columns name of the data set.
##  [1] "hotel"                          "is_canceled"                   
##  [3] "lead_time"                      "arrival_date_year"             
##  [5] "arrival_date_month"             "arrival_date_week_number"      
##  [7] "arrival_date_day_of_month"      "stays_in_weekend_nights"       
##  [9] "stays_in_week_nights"           "adults"                        
## [11] "children"                       "babies"                        
## [13] "meal"                           "country"                       
## [15] "market_segment"                 "distribution_channel"          
## [17] "is_repeated_guest"              "previous_cancellations"        
## [19] "previous_bookings_not_canceled" "reserved_room_type"            
## [21] "assigned_room_type"             "booking_changes"               
## [23] "deposit_type"                   "agent"                         
## [25] "company"                        "days_in_waiting_list"          
## [27] "customer_type"                  "adr"                           
## [29] "required_car_parking_spaces"    "total_of_special_requests"     
## [31] "reservation_status"             "reservation_status_date"
head(hotel_bookings) ## to quickly inspect the first 6 rows of the data set.
##          hotel is_canceled lead_time arrival_date_year arrival_date_month
## 1 Resort Hotel           0       342              2015               July
## 2 Resort Hotel           0       737              2015               July
## 3 Resort Hotel           0         7              2015               July
## 4 Resort Hotel           0        13              2015               July
## 5 Resort Hotel           0        14              2015               July
## 6 Resort Hotel           0        14              2015               July
##   arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
## 1                       27                         1                       0
## 2                       27                         1                       0
## 3                       27                         1                       0
## 4                       27                         1                       0
## 5                       27                         1                       0
## 6                       27                         1                       0
##   stays_in_week_nights adults children babies meal country market_segment
## 1                    0      2        0      0   BB     PRT         Direct
## 2                    0      2        0      0   BB     PRT         Direct
## 3                    1      1        0      0   BB     GBR         Direct
## 4                    1      1        0      0   BB     GBR      Corporate
## 5                    2      2        0      0   BB     GBR      Online TA
## 6                    2      2        0      0   BB     GBR      Online TA
##   distribution_channel is_repeated_guest previous_cancellations
## 1               Direct                 0                      0
## 2               Direct                 0                      0
## 3               Direct                 0                      0
## 4            Corporate                 0                      0
## 5                TA/TO                 0                      0
## 6                TA/TO                 0                      0
##   previous_bookings_not_canceled reserved_room_type assigned_room_type
## 1                              0                  C                  C
## 2                              0                  C                  C
## 3                              0                  A                  C
## 4                              0                  A                  A
## 5                              0                  A                  A
## 6                              0                  A                  A
##   booking_changes deposit_type agent company days_in_waiting_list customer_type
## 1               3   No Deposit  NULL    NULL                    0     Transient
## 2               4   No Deposit  NULL    NULL                    0     Transient
## 3               0   No Deposit  NULL    NULL                    0     Transient
## 4               0   No Deposit   304    NULL                    0     Transient
## 5               0   No Deposit   240    NULL                    0     Transient
## 6               0   No Deposit   240    NULL                    0     Transient
##   adr required_car_parking_spaces total_of_special_requests reservation_status
## 1   0                           0                         0          Check-Out
## 2   0                           0                         0          Check-Out
## 3  75                           0                         0          Check-Out
## 4  75                           0                         0          Check-Out
## 5  98                           0                         1          Check-Out
## 6  98                           0                         1          Check-Out
##   reservation_status_date
## 1              2015-07-01
## 2              2015-07-01
## 3              2015-07-02
## 4              2015-07-02
## 5              2015-07-03
## 6              2015-07-03
tail(hotel_bookings) ## to quickly inspect the last 6 rows of the data set.
##             hotel is_canceled lead_time arrival_date_year arrival_date_month
## 119385 City Hotel           0        21              2017             August
## 119386 City Hotel           0        23              2017             August
## 119387 City Hotel           0       102              2017             August
## 119388 City Hotel           0        34              2017             August
## 119389 City Hotel           0       109              2017             August
## 119390 City Hotel           0       205              2017             August
##        arrival_date_week_number arrival_date_day_of_month
## 119385                       35                        30
## 119386                       35                        30
## 119387                       35                        31
## 119388                       35                        31
## 119389                       35                        31
## 119390                       35                        29
##        stays_in_weekend_nights stays_in_week_nights adults children babies meal
## 119385                       2                    5      2        0      0   BB
## 119386                       2                    5      2        0      0   BB
## 119387                       2                    5      3        0      0   BB
## 119388                       2                    5      2        0      0   BB
## 119389                       2                    5      2        0      0   BB
## 119390                       2                    7      2        0      0   HB
##        country market_segment distribution_channel is_repeated_guest
## 119385     BEL  Offline TA/TO                TA/TO                 0
## 119386     BEL  Offline TA/TO                TA/TO                 0
## 119387     FRA      Online TA                TA/TO                 0
## 119388     DEU      Online TA                TA/TO                 0
## 119389     GBR      Online TA                TA/TO                 0
## 119390     DEU      Online TA                TA/TO                 0
##        previous_cancellations previous_bookings_not_canceled reserved_room_type
## 119385                      0                              0                  A
## 119386                      0                              0                  A
## 119387                      0                              0                  E
## 119388                      0                              0                  D
## 119389                      0                              0                  A
## 119390                      0                              0                  A
##        assigned_room_type booking_changes deposit_type agent company
## 119385                  A               0   No Deposit   394    NULL
## 119386                  A               0   No Deposit   394    NULL
## 119387                  E               0   No Deposit     9    NULL
## 119388                  D               0   No Deposit     9    NULL
## 119389                  A               0   No Deposit    89    NULL
## 119390                  A               0   No Deposit     9    NULL
##        days_in_waiting_list customer_type    adr required_car_parking_spaces
## 119385                    0     Transient  96.14                           0
## 119386                    0     Transient  96.14                           0
## 119387                    0     Transient 225.43                           0
## 119388                    0     Transient 157.71                           0
## 119389                    0     Transient 104.40                           0
## 119390                    0     Transient 151.20                           0
##        total_of_special_requests reservation_status reservation_status_date
## 119385                         2          Check-Out              2017-09-06
## 119386                         0          Check-Out              2017-09-06
## 119387                         2          Check-Out              2017-09-07
## 119388                         4          Check-Out              2017-09-07
## 119389                         0          Check-Out              2017-09-07
## 119390                         2          Check-Out              2017-09-07
##attach(hotel_bookings)
summary(hotel_bookings) ## to explore and summarize the data set.
##     hotel            is_canceled       lead_time   arrival_date_year
##  Length:119390      Min.   :0.0000   Min.   :  0   Min.   :2015     
##  Class :character   1st Qu.:0.0000   1st Qu.: 18   1st Qu.:2016     
##  Mode  :character   Median :0.0000   Median : 69   Median :2016     
##                     Mean   :0.3704   Mean   :104   Mean   :2016     
##                     3rd Qu.:1.0000   3rd Qu.:160   3rd Qu.:2017     
##                     Max.   :1.0000   Max.   :737   Max.   :2017     
##                                                                     
##  arrival_date_month arrival_date_week_number arrival_date_day_of_month
##  Length:119390      Min.   : 1.00            Min.   : 1.0             
##  Class :character   1st Qu.:16.00            1st Qu.: 8.0             
##  Mode  :character   Median :28.00            Median :16.0             
##                     Mean   :27.17            Mean   :15.8             
##                     3rd Qu.:38.00            3rd Qu.:23.0             
##                     Max.   :53.00            Max.   :31.0             
##                                                                       
##  stays_in_weekend_nights stays_in_week_nights     adults      
##  Min.   : 0.0000         Min.   : 0.0         Min.   : 0.000  
##  1st Qu.: 0.0000         1st Qu.: 1.0         1st Qu.: 2.000  
##  Median : 1.0000         Median : 2.0         Median : 2.000  
##  Mean   : 0.9276         Mean   : 2.5         Mean   : 1.856  
##  3rd Qu.: 2.0000         3rd Qu.: 3.0         3rd Qu.: 2.000  
##  Max.   :19.0000         Max.   :50.0         Max.   :55.000  
##                                                               
##     children           babies              meal             country         
##  Min.   : 0.0000   Min.   : 0.000000   Length:119390      Length:119390     
##  1st Qu.: 0.0000   1st Qu.: 0.000000   Class :character   Class :character  
##  Median : 0.0000   Median : 0.000000   Mode  :character   Mode  :character  
##  Mean   : 0.1039   Mean   : 0.007949                                        
##  3rd Qu.: 0.0000   3rd Qu.: 0.000000                                        
##  Max.   :10.0000   Max.   :10.000000                                        
##  NA's   :4                                                                  
##  market_segment     distribution_channel is_repeated_guest
##  Length:119390      Length:119390        Min.   :0.00000  
##  Class :character   Class :character     1st Qu.:0.00000  
##  Mode  :character   Mode  :character     Median :0.00000  
##                                          Mean   :0.03191  
##                                          3rd Qu.:0.00000  
##                                          Max.   :1.00000  
##                                                           
##  previous_cancellations previous_bookings_not_canceled reserved_room_type
##  Min.   : 0.00000       Min.   : 0.0000                Length:119390     
##  1st Qu.: 0.00000       1st Qu.: 0.0000                Class :character  
##  Median : 0.00000       Median : 0.0000                Mode  :character  
##  Mean   : 0.08712       Mean   : 0.1371                                  
##  3rd Qu.: 0.00000       3rd Qu.: 0.0000                                  
##  Max.   :26.00000       Max.   :72.0000                                  
##                                                                          
##  assigned_room_type booking_changes   deposit_type          agent          
##  Length:119390      Min.   : 0.0000   Length:119390      Length:119390     
##  Class :character   1st Qu.: 0.0000   Class :character   Class :character  
##  Mode  :character   Median : 0.0000   Mode  :character   Mode  :character  
##                     Mean   : 0.2211                                        
##                     3rd Qu.: 0.0000                                        
##                     Max.   :21.0000                                        
##                                                                            
##    company          days_in_waiting_list customer_type           adr         
##  Length:119390      Min.   :  0.000      Length:119390      Min.   :  -6.38  
##  Class :character   1st Qu.:  0.000      Class :character   1st Qu.:  69.29  
##  Mode  :character   Median :  0.000      Mode  :character   Median :  94.58  
##                     Mean   :  2.321                         Mean   : 101.83  
##                     3rd Qu.:  0.000                         3rd Qu.: 126.00  
##                     Max.   :391.000                         Max.   :5400.00  
##                                                                              
##  required_car_parking_spaces total_of_special_requests reservation_status
##  Min.   :0.00000             Min.   :0.0000            Length:119390     
##  1st Qu.:0.00000             1st Qu.:0.0000            Class :character  
##  Median :0.00000             Median :0.0000            Mode  :character  
##  Mean   :0.06252             Mean   :0.5714                              
##  3rd Qu.:0.00000             3rd Qu.:1.0000                              
##  Max.   :8.00000             Max.   :5.0000                              
##                                                                          
##  reservation_status_date
##  Length:119390          
##  Class :character       
##  Mode  :character       
##                         
##                         
##                         
## 

The summary has identified issues in two variables:

  1. The Average Daily Rate data contains a negative value of -6.38, which is incorrect and needs to be corrected.

  2. The Days in Waiting List data exhibits anomalies, with a maximum value of 391 and an average value of 2.32 which are significantly different. This discrepancy requires further investigation.

Addressing these issues will ensure the accuracy and reliability of the data used in the analysis in the next step.

Data Cleaning & Integrity

Step 1. Investigate the negative value and anomalies.

  • First, Identify negative value in Average Daily Rate (adr) variable, then replace negative values with a suitable value.
negatives <- which(hotel_bookings$adr < 0)
hotel_bookings$adr[negatives] <- mean(hotel_bookings$adr, na.rm = TRUE)

sum(is.na(hotel_bookings$adr))
## [1] 0
The data now displays a value of (0), whereas previously it was (-6.38).
  • Last, Identify the anomalies in Days in Waiting List variable. The maximum value 391 and provides context for the decision to leave it as is for now, with a recognition that removing outliers may be necessary for future calculations.

Step 2. Investigate NA’s & NULL values

  • Identify the NA’s value
colSums(is.na(hotel_bookings))
##                          hotel                    is_canceled 
##                              0                              0 
##                      lead_time              arrival_date_year 
##                              0                              0 
##             arrival_date_month       arrival_date_week_number 
##                              0                              0 
##      arrival_date_day_of_month        stays_in_weekend_nights 
##                              0                              0 
##           stays_in_week_nights                         adults 
##                              0                              0 
##                       children                         babies 
##                              4                              0 
##                           meal                        country 
##                              0                              0 
##                 market_segment           distribution_channel 
##                              0                              0 
##              is_repeated_guest         previous_cancellations 
##                              0                              0 
## previous_bookings_not_canceled             reserved_room_type 
##                              0                              0 
##             assigned_room_type                booking_changes 
##                              0                              0 
##                   deposit_type                          agent 
##                              0                              0 
##                        company           days_in_waiting_list 
##                              0                              0 
##                  customer_type                            adr 
##                              0                              0 
##    required_car_parking_spaces      total_of_special_requests 
##                              0                              0 
##             reservation_status        reservation_status_date 
##                              0                              0

The result reveals that the children column has only 4 missing values (NA). Given the small number of missing values compared to the total number of observations, we can either remove them or replace them with mean values.

In this case, we have decided to replace the missing values with the mean value of children.

## Remove NA's
colSums(is.na(hotel_bookings))
##                          hotel                    is_canceled 
##                              0                              0 
##                      lead_time              arrival_date_year 
##                              0                              0 
##             arrival_date_month       arrival_date_week_number 
##                              0                              0 
##      arrival_date_day_of_month        stays_in_weekend_nights 
##                              0                              0 
##           stays_in_week_nights                         adults 
##                              0                              0 
##                       children                         babies 
##                              4                              0 
##                           meal                        country 
##                              0                              0 
##                 market_segment           distribution_channel 
##                              0                              0 
##              is_repeated_guest         previous_cancellations 
##                              0                              0 
## previous_bookings_not_canceled             reserved_room_type 
##                              0                              0 
##             assigned_room_type                booking_changes 
##                              0                              0 
##                   deposit_type                          agent 
##                              0                              0 
##                        company           days_in_waiting_list 
##                              0                              0 
##                  customer_type                            adr 
##                              0                              0 
##    required_car_parking_spaces      total_of_special_requests 
##                              0                              0 
##             reservation_status        reservation_status_date 
##                              0                              0
hotel_bookings$children[is.na(hotel_bookings$children)] = round(mean(hotel_bookings$children, na.rm = TRUE),0)
sum(is.na(hotel_bookings$children))
## [1] 0
The data now displays a value of (0), whereas previously it was (4).
  • Identify NULL Value.

table(arrival_date_month), table(meal), table(country), table(market_segment) table(distribution_channel), table(reserved_room_type), table(assigned_room_type), table(agent), table(company), table(customer_type)

The result has identified NULL values in three variables.

  1. Variable Agent has 16340 NULL values.
  2. Variable Company has 112593 NULL values.
  3. Variable Country has 488 NULL values.

Due to the high number of NULL values in Agent and Company variables, removing them would result in a significant loss of important data. Therefore, replacing the NULL values in these variables with 0, as the agent and company values are just IDs and replacing them with 0 won’t create any issues.

For Country variable, given the relatively low number of missing values, we can either delete the corresponding rows or replace the NULL values with the most frequently occurring country.

## Replace the NULL value with 0 and max
hotel_bookings$agent[is.null(hotel_bookings$agent)] <- 0
hotel_bookings$company[is.null(hotel_bookings$company)] <- 0
hotel_bookings$country[is.null(hotel_bookings$country)] <- max(hotel_bookings$country)

Step 3. Data Integrity Check

We will consolidate the three guest categories (Adults, Children, Babies) and create a new column called ‘total_guests’ by adding the number of adults, children, and babies in each booking. This will simplify the data and make it easier to analyze the total number of guests for each booking.”

hotel_bookings <- mutate(hotel_bookings,total_guests = adults + children + babies)
table(hotel_bookings$total_guests)
## 
##     0     1     2     3     4     5     6    10    12    20    26    27    40 
##   180 22581 82051 10495  3929   137     1     2     2     2     5     2     1 
##    50    55 
##     1     1

The result displays that there are 180 rows in the data that contain zero guests. These rows are being removed as booking without guests is deemed meaningless.

## Remove 180 rows with 0 guest
hotel_bookings <- filter(hotel_bookings, total_guests > 0)
table(hotel_bookings$total_guests)
## 
##     1     2     3     4     5     6    10    12    20    26    27    40    50 
## 22581 82051 10495  3929   137     1     2     2     2     5     2     1     1 
##    55 
##     1
The data now does not displays a value of (0), whereas previously it was (180).
  • Identify the duplicate values
count(hotel_bookings[duplicated(hotel_bookings),])
##       n
## 1 31980
count(unique(hotel_bookings[duplicated(hotel_bookings),]))
##      n
## 1 8161

There are total 40.141 duplicate values in the hotel data set. All columns are the same for these duplicates, but the reason for the duplication is not clear. Since there is no unique identifier like a customer name, the duplicate rows cannot be removed from the data. To handle this situation we would need to consult with the client and determine the reason for the duplication before making a decision.

Exploratory Data Analysis

Exploratory Data Analysis (EDA) is a crucial step in data analysis as it helps to identify patterns and relationships in the data. In this analysis, we will perform an EDA to analyze the hotel’s performance across the years 2015, 2016 & 2017.

The objectives of this analysis are to identify the distribution of bookings and cancellations, factors affecting cancellations, top-performing hotels in terms of revenue, seasonality of bookings and cancellations, and countries with the highest number of bookings and cancellations.

Analysis Objectives:

  1. Distribution of bookings and cancellations across the years: This analysis will help us to understand the overall trends over the years.

  2. Seasonality of bookings and cancellations: We will analyze the seasonality of bookings and cancellations and identify the months with the highest and lowest revenue. This analysis will help us to understand the seasonality of the hotel’s business and plan marketing and promotional campaigns accordingly.

  3. Factors affecting cancellations: We will analyze the factors affecting cancellations such as lead time, market segment, customer type, room type and deposit type. This analysis will help us to identify the most significant factors contributing to cancellations.

  4. Countries with the highest number of bookings and cancellations: We will identify the countries with the highest number of bookings and cancellations and evaluate their contribution to the hotel’s performance. This analysis will help us to understand the hotel’s customer base and identify opportunities to attract more customers from other countries.

Let’s perform a comprehensive data analysis to gain insights into the patterns and trends present in the dataset.

Data Analysis Process

Annual Metrics

Analyzing annual metrics in terms of bookings, cancellations, and revenue can provide insights into the overall performance of a hotel or hospitality business over time.

Distribution of the reservation across the years

We analyze the hotel’s performance (both City Hotel & Resort Hotel) based on their reservation status each year to gain a comprehensive understanding of the reservation trends.

Analysis process:

# Filter the data by years 2015, 2016, 2017.
hotel_bookings <- hotel_bookings %>%
  filter(arrival_date_year %in% c(2015, 2016, 2017))

# Group the data by year and reservation status.
yearly_bookings <- hotel_bookings %>%
  group_by(arrival_date_year, reservation_status) %>%
  summarise(count = n()) %>%
  ungroup()
## `summarise()` has grouped output by 'arrival_date_year'. You can override using
## the `.groups` argument.
yearly_bookings
## # A tibble: 9 × 3
##   arrival_date_year reservation_status count
##               <int> <chr>              <int>
## 1              2015 Canceled            7950
## 2              2015 Check-Out          13826
## 3              2015 No-Show              191
## 4              2016 Canceled           19657
## 5              2016 Check-Out          36299
## 6              2016 No-Show              667
## 7              2017 Canceled           15386
## 8              2017 Check-Out          24886
## 9              2017 No-Show              348

Visualization:

# Plot the data
library(ggplot2)

ggplot(data = yearly_bookings, 
       aes(x = arrival_date_year, y = count, group = reservation_status, 
           color = reservation_status)) +
    geom_line(linewidth = 1) +
    scale_x_continuous(breaks = c(2015, 2016, 2017), expand = c(0.17, 0)) +
    scale_y_continuous(labels = scales::comma) +
    scale_color_manual(values = c("Canceled" = "#529993", "Check-Out" = "#B6B6B6", "No-Show" = "#B6B6B6")) +
    scale_fill_manual(values = c("Canceled" = "#529993", "Check-Out" = "#B6B6B6", "No-Show" = "#B6B6B6")) +
    labs(title = "Reservation number across years",
         subtitle = "",
         x = "Year",
         y = "Count",
         caption = "
         Data Source: Kaggle",
         color = "Reservation Status") +
     theme(plot.title = element_text(colour = "#3A3A3A", hjust = -0.15),
          plot.caption = element_text(colour = "#3A3A3A", hjust = -0.17),
       panel.background = element_rect(fill = "white"),
          panel.grid.minor.x = element_line(size = 0),
          legend.position = "none",
          axis.title.y = element_text(colour = "#3A3A3A", vjust = 1, hjust = 0.78, size = 9),
          axis.title.x = element_text(colour = "#3A3A3A", hjust = 0.11, size = 9),
          axis.line = element_line(linewidth = 0.4, colour = "#8E8783")) +
    geom_text(data = yearly_bookings %>% filter(arrival_date_year %in% c(2015, 2016, 2017)) 
              %>% group_by(reservation_status) %>% slice_tail(n = 1),
              aes(x = arrival_date_year, y = count, label = reservation_status),
              hjust = 0.01, vjust = 0.01, size = 4, color = c("#529993", "#8E8783", "#8E8783"))
## Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.

The analysis and chart provide us with the following insights:

We have three reservation status: Check-Out, Canceled & No-Show. Let’s elaborate each one!

  1. Check-Out status: the highest year = 2016 (36.370) and the lowest year = 2015 (13.854).

    The check-out was higher than the canceled and no-show across all three years and from both hotel types. This indicates that most bookings result in successful stays, which is a positive sign for the hotel business.

  2. Canceled status: the highest year = 2016 (19.669) and the lowest year = 2015 (7.951).

    The canceled number is concerning for the hotel business, even though it was lower than the number of check-out. This indicates a negative trend for the hotel’s overall business.

  3. No-Show status: the number of no-shows was much lower than the number of canceled bookings, but it still contributes to the business performance.

Monthly Metrics

To obtain a more thorough comprehension of the trends and patterns, it is necessary to conduct further analysis on cancellations, bookings, and revenue generated by hotel types.

A. Cancellation trend and pattern

Analysis process:

# Filter only canceled bookings
canceled_bookings <- hotel_bookings %>% 
  filter(is_canceled == 1) %>%
  group_by(arrival_date_month, hotel, arrival_date_year) %>%
  summarize(total_cancelations = n()) %>%
  mutate(month_num = match(arrival_date_month, month.name),
         months = factor(substr(arrival_date_month, 1, 3), levels = month.abb)) %>%
  arrange(hotel, arrival_date_year, month_num)
## `summarise()` has grouped output by 'arrival_date_month', 'hotel'. You can
## override using the `.groups` argument.
# lowest point
lowest_points <- canceled_bookings %>%
  group_by(hotel, arrival_date_year) %>%
  slice_min(total_cancelations) %>%
  arrange(arrival_date_year) %>%
  ungroup()

# highest point
highest_points <- canceled_bookings %>%
  group_by(hotel, arrival_date_year) %>%
  slice_max(total_cancelations) %>%
  ungroup()

# create a new data frame to store the highest and lowest points
highest_lowest_points <- canceled_bookings %>%
  group_by(arrival_date_year, hotel) %>%
  slice_min(total_cancelations, n = 1) %>% # to get the lowest point
  bind_rows(canceled_bookings %>% 
              group_by(arrival_date_year, hotel) %>%
              slice_max(total_cancelations, n = 1)) # to get the highest point

highest_lowest_points
## # A tibble: 12 × 6
## # Groups:   arrival_date_year, hotel [6]
##    arrival_date_month hotel        arrival_date_year total_canc…¹ month…² months
##    <chr>              <chr>                    <int>        <int>   <int> <fct> 
##  1 November           City Hotel                2015          301      11 Nov   
##  2 November           Resort Hotel              2015          185      11 Nov   
##  3 January            City Hotel                2016          438       1 Jan   
##  4 January            Resort Hotel              2016          119       1 Jan   
##  5 February           City Hotel                2017          970       2 Feb   
##  6 January            Resort Hotel              2017          206       1 Jan   
##  7 September          City Hotel                2015         1542       9 Sep   
##  8 September          Resort Hotel              2015          551       9 Sep   
##  9 October            City Hotel                2016         1947      10 Oct   
## 10 August             Resort Hotel              2016          578       8 Aug   
## 11 May                City Hotel                2017         2217       5 May   
## 12 August             Resort Hotel              2017          693       8 Aug   
## # … with abbreviated variable names ¹​total_cancelations, ²​month_num

Visualization:

# plot the data with added points and labels
ggplot(data = canceled_bookings, aes(x = months, y = total_cancelations, group = hotel, colour = hotel)) +
  facet_wrap(~arrival_date_year, ncol = 1 ) +
  geom_line(linewidth = 1) +
  scale_color_manual(values = c("City Hotel" = "#096E65", "Resort Hotel" = "#9DC6C2")) +
  scale_y_continuous(labels = scales::comma) +    
  labs(title = "Cancellations trend across years",
       x = "Arrival month",
       y = "Count",
       caption = "
       Data Source: Kaggle",
       color = "Hotel Type",
       fill = "Hotel Type") +
   theme(plot.title = element_text(colour = "#3A3A3A", hjust = -0.24),
         plot.caption = element_text(colour = "#3A3A3A", hjust = -0.2),
     panel.background = element_rect(fill = "white"),
          panel.grid.minor.x = element_line(size = 0),
          legend.position = "right",
          axis.title.y = element_text(colour = "#3A3A3A", vjust = 2, hjust = 0.97, size = 9),
          axis.title.x = element_text(colour = "#3A3A3A", hjust = 0.04, vjust = 0.5, size = 9),
          axis.line = element_line(linewidth = 0.4, colour = "#8E8783")) +
  geom_point(data = highest_lowest_points, aes(x = months, y = total_cancelations), size = 3)

We have now discovered that the hotel_bookings dataset only includes a full month of data for 2016, while 2015 has data for only 5 months and 2017 has data for 8 months. This helps to explain why 2016 was the busiest year in our analysis.

The analysis and chart provide us with the following insights:

City Hotel:

  • In 2015, the cancellation rate was highest in September and decreased significantly in November.

  • In 2016, the cancellation rate was lowest in January and increased gradually during spring months (March, April), then decreased in July, and increased again till October, followed by a sharp decrease until December.

  • In 2017, the cancellation rate increased sharply from February to May and decreased again till August.

Resort Hotel:

  • The Resort Hotel obviously shows that there is no fluctuation trend and the pattern is relatively stable during all three years.

  • However, the chance of an increase in cancellations during the summer months is positive.

B. Bookings trend and pattern

Analysis process:

##Filter only non-cancelled bookings
  high_low_bookings <- hotel_bookings %>% 
    filter(is_canceled == 0) %>%
    group_by(arrival_date_month, hotel, arrival_date_year) %>%
    summarize(total_bookings = n()) %>%
    mutate(month_num = match(arrival_date_month, month.name),
           months = factor(substr(arrival_date_month, 1, 3), 
                           levels = month.abb)) %>%
    arrange(hotel, arrival_date_year, month_num)
## `summarise()` has grouped output by 'arrival_date_month', 'hotel'. You can
## override using the `.groups` argument.
  #lowest point
  lowest_points <- high_low_bookings %>%
    group_by(hotel, arrival_date_year) %>%
    slice_min(total_bookings) %>%
    arrange(arrival_date_year) %>%
    ungroup()

  #highest point
  highest_points <- high_low_bookings %>%
    group_by(hotel, arrival_date_year) %>%
    slice_max(total_bookings) %>%
    ungroup()

  
  # create a new data frame to store the highest and lowest points
  top_bottom_points <- high_low_bookings %>%
    group_by(arrival_date_year, hotel) %>%
    slice_min(total_bookings, n = 1) %>% # get the lowest point
    bind_rows(high_low_bookings %>% 
                group_by(arrival_date_year, hotel) %>%
                slice_max(total_bookings, n = 1)) %>% # get the highest point
    arrange(arrival_date_year)
  
  top_bottom_points
## # A tibble: 12 × 6
## # Groups:   arrival_date_year, hotel [6]
##    arrival_date_month hotel        arrival_date_year total_book…¹ month…² months
##    <chr>              <chr>                    <int>        <int>   <int> <fct> 
##  1 July               City Hotel                2015          457       7 Jul   
##  2 November           Resort Hotel              2015          919      11 Nov   
##  3 October            City Hotel                2015         2061      10 Oct   
##  4 October            Resort Hotel              2015         1158      10 Oct   
##  5 January            City Hotel                2016          926       1 Jan   
##  6 January            Resort Hotel              2016          765       1 Jan   
##  7 September          City Hotel                2016         2301       9 Sep   
##  8 October            Resort Hotel              2016         1417      10 Oct   
##  9 January            City Hotel                2017         1323       1 Jan   
## 10 June               Resort Hotel              2017         1044       6 Jun   
## 11 May                City Hotel                2017         2331       5 May   
## 12 May                Resort Hotel              2017         1212       5 May   
## # … with abbreviated variable names ¹​total_bookings, ²​month_num

Visualization:

# Plot the line chart with the lowest and highest points added
ggplot(data = high_low_bookings, aes(x = months, y = total_bookings, 
                                       group = hotel, colour = hotel)) +
    facet_wrap(~arrival_date_year, ncol = 1 ) +
    geom_line(linewidth = 1) +
    scale_color_manual(values = c("City Hotel" = "#096E65", "Resort Hotel" = "#9DC6C2")) +
    scale_y_continuous(labels = scales::comma) +    
    labs(title = "Booking trend across years",
         subtitle = "",
         x = "Arrival month",
         y = "Count",
         caption = "
         Data Source: Kaggle",
         color = "Hotel Type",
         fill = "Hotel Type") +
    theme(plot.title = element_text(colour = "#3A3A3A", hjust = -0.19),
         plot.caption = element_text(colour = "#3A3A3A", hjust = -0.22),
     panel.background = element_rect(fill = "white"),
          panel.grid.minor.x = element_line(size = 0),
          legend.position = "right",
          axis.title.y = element_text(colour = "#3A3A3A", vjust = 0.5, hjust = 0.96, size = 9),
          axis.title.x = element_text(colour = "#3A3A3A", hjust = 0.04,vjust = 0.5, size = 9),
          axis.line = element_line(linewidth = 0.4, colour = "#8E8783")) +
    geom_point(data = top_bottom_points, aes(x = months, y = total_bookings), 
               size = 3)

The analysis and chart provide us with the following insights:

City Hotel:

  • In 2015, bookings reached their lowest point in July and steadily increased until September, then experienced a sharp drop in November.

  • In 2016 and 2017, bookings consistently increased from spring to summer months, with a gradual decrease during winter months.

Resort Hotel:

  • Across all three years, bookings for the Resort Hotel are relatively stable, with slight increases and decreases observed.

C. Occupancy (%) pattern

Our next analysis is to observe the percentage of rooms that are occupied by guests during the year 2015, 2016 & 2017.

Analysis process:

occupancy_rate_monthly <- hotel_bookings %>%
  group_by(hotel, arrival_date_year, arrival_date_month) %>%
  summarize(total_bookings = n(),
            total_occupied = sum(is_canceled == 0),
            occupancy_rate = mean(is_canceled == 0)) %>%
  mutate(month_num = match(arrival_date_month, month.name),
         months = factor(substr(arrival_date_month, 1, 3), 
                         levels = month.abb)) %>%
  arrange(hotel, arrival_date_year, month_num)
## `summarise()` has grouped output by 'hotel', 'arrival_date_year'. You can
## override using the `.groups` argument.
#lowest point
lowest_occ_rate <- occupancy_rate_monthly %>%
  group_by(hotel, arrival_date_year) %>%
  slice_min(occupancy_rate) %>%
  ungroup()

#highest point
highest_occ_rate <- occupancy_rate_monthly %>%
  group_by(hotel, arrival_date_year) %>%
  slice_max(occupancy_rate) %>%
  ungroup()

# create a new data frame to store the highest and lowest points
highest_lowest_occrate <- occupancy_rate_monthly %>%
  group_by(arrival_date_year, hotel) %>%
  slice_min(occupancy_rate, n = 1) %>% # get the lowest point
  bind_rows(occupancy_rate_monthly %>% 
              group_by(arrival_date_year, hotel) %>%
              slice_max(occupancy_rate, n = 1)) %>% # get the highest point
  arrange(arrival_date_year) 

highest_lowest_occrate
## # A tibble: 12 × 8
## # Groups:   arrival_date_year, hotel [6]
##    hotel        arrival_date_year arriv…¹ total…² total…³ occup…⁴ month…⁵ months
##    <chr>                    <int> <chr>     <int>   <int>   <dbl>   <int> <fct> 
##  1 City Hotel                2015 July       1396     457   0.327       7 Jul   
##  2 Resort Hotel              2015 Septem…    1585    1034   0.652       9 Sep   
##  3 City Hotel                2015 Novemb…    1233     932   0.756      11 Nov   
##  4 Resort Hotel              2015 Novemb…    1104     919   0.832      11 Nov   
##  5 City Hotel                2016 October    4212    2265   0.538      10 Oct   
##  6 Resort Hotel              2016 August     1685    1107   0.657       8 Aug   
##  7 City Hotel                2016 January    1364     926   0.679       1 Jan   
##  8 Resort Hotel              2016 January     884     765   0.865       1 Jan   
##  9 City Hotel                2017 April      3911    1990   0.509       4 Apr   
## 10 Resort Hotel              2017 August     1800    1107   0.615       8 Aug   
## 11 City Hotel                2017 August     3123    2002   0.641       8 Aug   
## 12 Resort Hotel              2017 January    1307    1101   0.842       1 Jan   
## # … with abbreviated variable names ¹​arrival_date_month, ²​total_bookings,
## #   ³​total_occupied, ⁴​occupancy_rate, ⁵​month_num

Visualization:

# plot the data
ggplot(occupancy_rate_monthly, aes(x = months, y = occupancy_rate, color = hotel, group = hotel)) +
  facet_wrap(~arrival_date_year, ncol = 1) +
  geom_line(linewidth = 1) +
  scale_color_manual(values = c("City Hotel" = "#096E65", "Resort Hotel" = "#9DC6C2")) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  labs(title = "Occupancy rate across years",
       subtitle = "",
       x = "Arrival month",
       y = "Rate",
       color = "Hotel Type",
       caption = "
         Data Source: Kaggle",
       fill = "Hotel Type") +
    theme(plot.title = element_text(colour = "#3A3A3A", hjust = -0.17),
         plot.caption = element_text(colour = "#3A3A3A", hjust = -0.2),
     panel.background = element_rect(fill = "white"),
          panel.grid.minor.x = element_line(size = 0),
          legend.position = "right",
          axis.title.y = element_text(colour = "#3A3A3A", vjust = 1, hjust = 0.97, size = 9),
          axis.title.x = element_text(colour = "#3A3A3A", hjust = 0.04,vjust = 0.5, size = 9),
          axis.line = element_line(linewidth = 0.4, colour = "#8E8783")) +
  geom_point(data = highest_lowest_occrate, aes(x = months, y = occupancy_rate), 
             size = 3)

The analysis and chart provide us with the following insights:

  • The chart shows that both hotels have similar patterns of occupancy rates across the three years, with peak months during the summer (June to August) and lower occupancy rates during the winter months (November to January). However, the occupancy rate of the Resort Hotel is generally higher than the City Hotel throughout the year.

  • One interesting observation is that the City Hotel experienced a dip in occupancy rate during the summer months of 2016, while the Resort Hotel did not. In addition, both hotels experienced a dip in occupancy rate during the fall months of 2016 and 2017.

D. Hotel performance in term of revenue

Analysis process:

# filter data of non-canceled booking
revenue_by_month <- hotel_bookings %>%
  filter(is_canceled == 0) %>%
  group_by(hotel, arrival_date_year, arrival_date_month) %>%
  summarize(total_revenue = sum(adr * (stays_in_week_nights + stays_in_weekend_nights))) %>%
  mutate(month_num = match(arrival_date_month, month.name),
         months = factor(substr(arrival_date_month, 1, 3), 
                         levels = month.abb)) %>%
  arrange(hotel, arrival_date_year, month_num)
## `summarise()` has grouped output by 'hotel', 'arrival_date_year'. You can
## override using the `.groups` argument.
#lowest revenue
lowest_revenue <- revenue_by_month %>%
  group_by(hotel, arrival_date_year) %>%
  slice_min(total_revenue) %>%
  ungroup()

#highest revenue
highest_revenue <- revenue_by_month %>%
  group_by(hotel, arrival_date_year) %>%
  slice_max(total_revenue) %>%
  ungroup()

# create a new data frame to store the highest and lowest revenue.
highest_lowest_revenue <- revenue_by_month %>%
  group_by(arrival_date_year, hotel) %>%
  slice_min(total_revenue, n = 1) %>% # get the lowest point
  bind_rows(revenue_by_month %>% 
              group_by(arrival_date_year, hotel) %>%
              slice_max(total_revenue, n = 1)) %>% # get the highest point
  arrange(arrival_date_year) 

highest_lowest_revenue
## # A tibble: 12 × 6
## # Groups:   arrival_date_year, hotel [6]
##    hotel        arrival_date_year arrival_date_month total_reve…¹ month…² months
##    <chr>                    <int> <chr>                     <dbl>   <int> <fct> 
##  1 City Hotel                2015 July                     79646.       7 Jul   
##  2 Resort Hotel              2015 November                155481.      11 Nov   
##  3 City Hotel                2015 September               588545.       9 Sep   
##  4 Resort Hotel              2015 August                  852455.       8 Aug   
##  5 City Hotel                2016 January                 175232.       1 Jan   
##  6 Resort Hotel              2016 January                  89289.       1 Jan   
##  7 City Hotel                2016 August                  805804.       8 Aug   
##  8 Resort Hotel              2016 August                 1001850.       8 Aug   
##  9 City Hotel                2017 January                 352313.       1 Jan   
## 10 Resort Hotel              2017 January                 159455.       1 Jan   
## 11 City Hotel                2017 August                  885430.       8 Aug   
## 12 Resort Hotel              2017 August                 1084751.       8 Aug   
## # … with abbreviated variable names ¹​total_revenue, ²​month_num
#plot the chart.
ggplot(revenue_by_month, aes(x = months, y = total_revenue, color = hotel, group = hotel)) +
    facet_wrap(~arrival_date_year, ncol = 1) +
  geom_line(linewidth = 1) +
  scale_color_manual(values = c("City Hotel" = "#096E65", "Resort Hotel" = "#9DC6C2")) +
  labs(title = "Revenue by Hotel Type",
       subtitle = "From year 2015-2015
       ",
       x = "Arrival month",
       y = "Total Revenue",
       color = "Hotel Type",
       caption = "
       Data Source: Kaggle",
       fill = "Hotel Type") +
  scale_y_continuous(labels = scales::dollar) +
  theme(plot.title = element_text(colour = "#3A3A3A", hjust = -0.26),
        plot.subtitle = element_text(colour = "#3A3A3A", hjust = -0.22),
         plot.caption = element_text(colour = "#3A3A3A", hjust = -0.27),
     panel.background = element_rect(fill = "white"),
          panel.grid.minor.x = element_line(size = 0),
          legend.position = "right",
          axis.title.y = element_text(colour = "#3A3A3A", vjust = 1, hjust = 0.95, size = 9),
          axis.title.x = element_text(colour = "#3A3A3A", hjust = 0.04,vjust = 0.5, size = 9),
          axis.line = element_line(linewidth = 0.4, colour = "#8E8783")) +
  geom_point(data = highest_lowest_revenue, aes(x = months, y = total_revenue), 
             size = 3)

The analysis and chart provide us with the following insights:

  • In general, the City Hotel generates more revenue than the Resort Hotel during all three years.

  • But the chart also shows the Resort Hotel generates more revenue during summer months (July and August) for all three years.

Factors Contribution

This analysis is important to identify the contributing factors behind the high cancellation rates and take appropriate actions to decrease them, which can enhance the hotel’s overall performance

A. Cancellation (%) vs Lead Time

cancellation_by_lead_time <- hotel_bookings %>%
  mutate(lead_time_group = cut(lead_time, breaks = seq(0, 750, by = 50), include.lowest = TRUE)) %>%
  group_by(lead_time_group) %>%
  summarise(total_cancellations = sum(is_canceled),
            total_bookings = n(),
            cancellation_rate = total_cancellations / total_bookings)

cancellation_by_lead_time
## # A tibble: 14 × 4
##    lead_time_group total_cancellations total_bookings cancellation_rate
##    <fct>                         <int>          <int>             <dbl>
##  1 [0,50]                        11446          50668             0.226
##  2 (50,100]                       8542          21212             0.403
##  3 (100,150]                      6540          14795             0.442
##  4 (150,200]                      5463          11898             0.459
##  5 (200,250]                      3385           7060             0.479
##  6 (250,300]                      3414           5692             0.600
##  7 (300,350]                      2793           4060             0.688
##  8 (350,400]                      1109           1712             0.648
##  9 (400,450]                       669           1007             0.664
## 10 (450,500]                       426            625             0.682
## 11 (500,550]                       148            215             0.688
## 12 (550,600]                       119            119             1    
## 13 (600,650]                       145            145             1    
## 14 (700,750]                         0              2             0
ggplot(cancellation_by_lead_time, aes(x = lead_time_group, y = cancellation_rate, fill = cancellation_rate > 0.5)) +
  geom_col(width = 0.7) +
  coord_flip() +
  scale_fill_manual(values = c("#9DC6C2", "#096E65")) +
  scale_x_discrete(limits = levels(cancellation_by_lead_time$lead_time_group)) +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(title = "Cancellation rate vs Lead time",
       subtitle = "Year from 2015 to 2017
       ",
       x = "
       
       Lead Time (days)",
       y = "", 
       caption = "
       
       Data Source: Kaggle",
       fill = "Rate > 50% in Green") +
   theme(plot.title = element_text(colour = "#3A3A3A", hjust = -0.22),
         plot.subtitle = element_text(colour = "#3A3A3A", hjust = -0.17),
         plot.caption = element_text(colour = "#3A3A3A", hjust = -0.22),
        panel.background = element_rect(fill = "white"),
        legend.position = "none",
        axis.title.x = element_text(colour = "#3A3A3A", hjust = -0.05,vjust = 0.5, size = 9),
        axis.title.y = element_text(colour = "#3A3A3A", hjust = 0.96,vjust = 0.7, size = 9),
        axis.line = element_line(linewidth = 0.3, colour = "#8E8783"),
          axis.text.x = element_text(vjust = 0.5, hjust=1))

The analyisis and chart provide us with the following insights:

  • There is a clear relationship between lead time and cancellation rate, with longer lead times generally associated with higher cancellation rates.

  • The cancellation rate for bookings made with a lead time of 50-250 days is particularly high, reaching almost 50%.

  • The cancellation rate for bookings made with a lead time of 250-550 days is generally almost reaching 75%, indicating that this is a high-risk period for cancellations.

  • The cancellation rate for bookings made with a lead time of 600-650 days is generally 100% cancellation rate.

  • The chart highlights the importance of booking policies and communication strategies, particularly for bookings made with shorter lead times. It may be useful for stakeholders to consider implementing policies and strategies that encourage guests to make bookings with longer lead times, such as offering incentives or discounts.

B. Cancellation vs Market Segment

# calculate cancellation  by market segment 
cancellation_by_segment_type <- hotel_bookings %>%
  filter(is_canceled == 1) %>%
  group_by(market_segment, hotel) %>%
  summarise(cancellation_num = n()) 
## `summarise()` has grouped output by 'market_segment'. You can override using
## the `.groups` argument.
# plot the chart
ggplot(cancellation_by_segment_type, aes(x = market_segment, y = cancellation_num, fill = market_segment)) +
  geom_col(position = "dodge", width = 0.7) +
  geom_text(aes(label = ifelse(cancellation_num == max(cancellation_num), cancellation_num, "")),
            position = position_dodge(width = 0.7), vjust = -1.7, 
            hjust = 1, size = 3) +
  coord_flip() +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Cancellation vs Market segment",
       subtitle = "Year from 2015 to 2017",
       x = "Market Segment",
       y = "
       Count",
       caption = "
       Data Source: Kaggle",
       fill = "") +
  scale_fill_manual(values = c("Aviation" = "#B6B6B6", "Complementary"= "#B6B6B6", "Corporate" = "#B6B6B6", "Direct" = "#B6B6B6", "Groups" = "#B6B6B6", "Offline TA/TO" = "#B6B6B6", "Online TA" = "#529993", "Undefined" = "#B6B6B6")) +
  theme(plot.title = element_text(colour = "#3A3A3A", hjust = -0.33),
         plot.subtitle = element_text(colour = "#3A3A3A", hjust = -0.25),
         plot.caption = element_text(colour = "#3A3A3A", hjust = -0.27),
        panel.background = element_rect(fill = "white"),
        legend.position = "none",
        axis.line = element_line(linewidth = 0.3, colour = "#8E8783"),
        axis.text = element_text(hjust = 0.7),
        axis.title.y = element_text(colour = "#3A3A3A", vjust = 1, hjust = 0.91, size = 9),
          axis.title.x = element_text(colour = "#3A3A3A", hjust = -0.01,vjust = 0.5, size = 9))

The analysis and chart provide us with the following insights:

  • From the chart, we can observe that the online travel agency (Online TA) market segment has the highest number of cancellations compared to other segments, followed by offline travel agencies/tour operators (Offline TA/TO) and Direct bookings.

  • Meanwhile, the Aviation, Complementary, Corporate, Groups, and Undefined market segments have the lowest number of cancellations.

C. Cancellation vs Customer Type

cancellation_by_customer_type <- hotel_bookings %>%
  filter(is_canceled == 1) %>%
  group_by(customer_type, hotel) %>%
  summarise(cancellation_count = n()) 
## `summarise()` has grouped output by 'customer_type'. You can override using the
## `.groups` argument.
# plot the chart
ggplot(cancellation_by_customer_type, aes(x = customer_type, y = cancellation_count, fill = customer_type)) +
  geom_col(position = "dodge", width = 0.7) +
  geom_text(aes(label = ifelse(cancellation_count == max(cancellation_count), 
                               cancellation_count, "")), position = position_dodge(width = 0.5), vjust = -3, hjust = 1, size = 3) +
  coord_flip() +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Cancellation vs Customer type",
       subtitle = "Year from 2015 to 2017",
       x = "Customer Type",
       y = "Count",
       caption = "
       Data Source: Kaggle",
       fill = "") +
  scale_fill_manual(values = c("Contract" = "#B6B6B6", "Group"= "#B6B6B6", "Transient" = "#529993", "Transient-Party" = "#B6B6B6")) +
   theme(plot.title = element_text(colour = "#3A3A3A", hjust = -0.31),
         plot.subtitle = element_text(colour = "#3A3A3A", hjust = -0.24),
         plot.caption = element_text(colour = "#3A3A3A", hjust = -0.26),
        panel.background = element_rect(fill = "white"),
        legend.position = "none",
        axis.line = element_line(linewidth = 0.3, colour = "#8E8783"),
        axis.text.x = element_text(vjust = 0.5, hjust=0.8),
        axis.title.y = element_text(colour = "#3A3A3A", vjust = 2, hjust = 0.82, size = 9),
        axis.title.x = element_text(colour = "#3A3A3A", hjust = 0.04,vjust = 0.5, size = 9))

The analysis and chartprovide us the following insights:

  • We can see that the Transient customer type has the highest number of cancellations, followed by Transient-Party, and then Contract and Group with relatively lower cancellation numbers.

  • This could indicate that the cancellation policies and procedures for Transient and Transient-Party customers need to be revisited or improved to reduce cancellations in the future.

D. Cancellation vs Room Type

# Compute the total cancellations by room type
cancellation_by_roomtype <- hotel_bookings %>%
  filter(is_canceled == 1) %>%
  group_by(reserved_room_type) %>%
  summarize(total_cancellations = sum(is_canceled)) %>%
  arrange(desc(reserved_room_type))

cancellation_by_roomtype
## # A tibble: 9 × 2
##   reserved_room_type total_cancellations
##   <chr>                            <int>
## 1 L                                    2
## 2 H                                  245
## 3 G                                  763
## 4 F                                  880
## 5 E                                 1908
## 6 D                                 6101
## 7 C                                  308
## 8 B                                  367
## 9 A                                33625
cancellation_by_roomtype$reserved_room_type <- factor(cancellation_by_roomtype$reserved_room_type, levels = c("P", "O", "D", "E", "F", "G", "H", "C", "B", "A"))


# Plot the data as a bar chart with a color gradient
ggplot(cancellation_by_roomtype, aes(x = reserved_room_type, y = total_cancellations, fill = total_cancellations)) +
  geom_col(position = "dodge") +
  geom_text(aes(label = ifelse(total_cancellations == max(total_cancellations), total_cancellations, "")), position = position_dodge(width = 0.5), vjust = -2, 
            hjust = 1, size = 3) + 
  coord_flip() +
  scale_y_continuous(labels = scales::comma) +
  scale_fill_gradient(low = "#B6B6B6", high = "#529993", name = "Total Cancellations", guide = "legend", labels = scales::comma_format()) +
  labs(title = "Cancellations vs Room type",
       subtitle = "Year from 2015 to 2017",
       x = "Room Type",
       y = "Count",
       caption = "
       Data Source: Kaggle") +
  theme(plot.title = element_text(colour = "#3A3A3A", hjust = -0.09),
         plot.subtitle = element_text(colour = "#3A3A3A", hjust = -0.08),
         plot.caption = element_text(colour = "#3A3A3A", hjust = -0.11),
        panel.background = element_rect(fill = "white"),
        legend.position = "none",
        axis.line = element_line(linewidth = 0.3, colour = "#8E8783"),
        axis.text.x = element_text(vjust = 0.5, hjust=0.8),
        axis.title.y = element_text(colour = "#3A3A3A", vjust = 2, hjust = 0.94, size = 9),
        axis.title.x = element_text(colour = "#3A3A3A", hjust = 0.04,vjust = 0.5, size = 9))

The analysis and chart provide us the following insights:

  • From this chart, we can see that the most cancellations occurred for the A and D room types, followed by E and F.

  • The least cancellations occurred for the G, B, C and H room types.

E. Cancellation vs Deposit Type

cancellation_by_deposit <- hotel_bookings %>%
  group_by(deposit_type) %>%
  summarise(total_bookings = n(),
            total_cancellations = sum(is_canceled))
  
cancellation_by_deposit
## # A tibble: 3 × 3
##   deposit_type total_bookings total_cancellations
##   <chr>                 <int>               <int>
## 1 No Deposit           104461               29669
## 2 Non Refund            14587               14494
## 3 Refundable              162                  36
cancellation_by_deposit$deposit_type <- factor(cancellation_by_deposit$deposit_type, levels = c("Refundable", "Non Refund", "No Deposit"))

#plot the chart
ggplot(cancellation_by_deposit, aes(x = deposit_type, y = total_cancellations, fill = deposit_type)) +
  geom_col(width = 0.5) +
  coord_flip() +
  geom_text(aes(label = total_cancellations), position = position_dodge(width = 0.5), 
            vjust = -2.9, hjust = 1, size = 3) +   
  scale_y_continuous(labels = scales::comma) +
  scale_fill_manual(values = c("No Deposit" = "#529993", "Non Refund" = "#B6B6B6", "Refundable" = "#B6B6B6")) +
  labs(title = "Cancellations vs Deposit type",
       subtitle = "Year from 2015 to 2017",
       x = "Deposit type",
       y = "
       Count",
       fill = "Deposit Type",
       caption = "
       Data Source: Kaggle") +
  theme(plot.title = element_text(colour = "#3A3A3A", hjust = -0.25),
         plot.subtitle = element_text(colour = "#3A3A3A", hjust = -0.2),
         plot.caption = element_text(colour = "#3A3A3A", hjust = -0.22),
        panel.background = element_rect(fill = "white"),
        legend.position = "none",
        axis.line = element_line(linewidth = 0.3, colour = "#8E8783"),
        axis.text.x = element_text(vjust = 0.5, hjust=0.8),
        axis.title.y = element_text(colour = "#3A3A3A", vjust = 2, hjust = 0.78, size = 9),
        axis.title.x = element_text(colour = "#3A3A3A", hjust = -0.01,vjust = 2, size = 9))

The analysis and chart provide us the following insights:

  • The majority of bookings made were without any deposit, and this group also had the highest number of cancellations.

  • On the other hand, bookings made with refundable deposit had the lowest number of cancellations.

Countries

A. Top 10 booking countries

top_10_country <- hotel_bookings %>% 
  filter(is_canceled == 0) %>%
  group_by(country) %>%  
  summarise(booking_count = n()) %>%
  arrange(desc(booking_count)) %>%
  head(10)

top_10_country
## # A tibble: 10 × 2
##    country booking_count
##    <chr>           <int>
##  1 PRT             20977
##  2 GBR              9668
##  3 FRA              8468
##  4 ESP              6383
##  5 DEU              6067
##  6 IRL              2542
##  7 ITA              2428
##  8 BEL              1868
##  9 NLD              1716
## 10 USA              1592
ggplot(top_10_country, aes(x = reorder(country, booking_count), y = booking_count)) +
  geom_bar(stat = "identity", width = 0.7, fill = "#529993") +
  coord_flip() +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Top 10 countries with the highest number of bookings",
       subtitle = "From Year 2015 - 2017",
       x = "Country",
       y = "Count",
       caption = "
       Data Source: Kaggle",
       fill = "Customer Type") +
  theme(plot.title = element_text(colour = "#3A3A3A", hjust = -0.27),
        plot.subtitle = element_text(colour = "#3A3A3A", hjust = -0.11),
        plot.caption = element_text(colour = "#3A3A3A", hjust = -0.13),
        panel.background = element_rect(fill = "white"),
        axis.line = element_line(linewidth = 0.3, colour = "#8E8783"),
        axis.text.x = element_text(vjust = 0.5, hjust = 0.8),
        axis.title.y = element_text(colour = "#3A3A3A", vjust = 2, hjust = 0.94, size = 9),
        axis.title.x = element_text(colour = "#3A3A3A", hjust = 0.04, vjust = 0.5, size = 9)) +
  geom_text(aes(label = booking_count), position = position_dodge(width = 0.7), vjust = -0.3, hjust = -0.04, size = 2.7, colour = "#3A3A3A")

B. Top 10 cancel countries

top_10_country <- hotel_bookings %>% 
  filter(is_canceled == 1) %>%
  group_by(country) %>%  
  summarise(booking_count = n()) %>%
  arrange(desc(booking_count)) %>%
  head(10)
  
  ggplot(top_10_country, aes(x = reorder(country, booking_count), y = booking_count)) +
  geom_bar(stat = "identity", width = 0.7, fill = "#529993") +
    coord_flip() +
  scale_y_continuous (labels = scales::comma) +
  labs(title = "Top 10 countries with the highest number of cancellations",
       subtitle = "From Year 2015 - 2017
       ",
       x = "Country",
       y = "Count",
       caption = "
       Data Source: Kaggle",
       fill = "Customer Type") +
 theme(plot.title = element_text(colour = "#3A3A3A", hjust = -0.3),
         plot.subtitle = element_text(colour = "#3A3A3A", hjust = -0.1),
         plot.caption = element_text(colour = "#3A3A3A", hjust = -0.13),
        panel.background = element_rect(fill = "white"),
       axis.line = element_line(linewidth = 0.3, colour = "#8E8783"),
        axis.text.x = element_text(vjust = 0.5, hjust=0.8),
        axis.title.y = element_text(colour = "#3A3A3A", vjust = 2, hjust = 0.94, size = 9),
        axis.title.x = element_text(colour = "#3A3A3A", hjust = 0.04, vjust = 0.5, size = 9)) +
    geom_text(aes(label = booking_count), position = position_dodge(width = 0.7), vjust = -0.1, hjust = -0.05 , size = 2.6, colour = "#3A3A3A")

The analysis and charts provide us the following insights:

  • Portugal has the highest number of bookings, followed by the UK and France.

  • Portugal has the highest number of cancellations, followed by the UK and Spain.

  • We can also see that most of the cancellations are made by customers of type “Transient”.

  • Overall, this visualization gives us a quick understanding of the countries that are most popular for bookings and cancellations the type of customers that make most bookings.

Summary

Our goals:

  1. Understand how the hotel’s performance across the years and;
  2. Use this insight to inform the development of strategies that can improve the hotel’s performance and identify potential areas of concern.

Insights:

  • The number of successful stays (Check-Out status) is high, and the number of canceled bookings (Canceled status) is also significant, indicating a negative performance for the hotel’s overall business.

  • The factors that contributes the cancellations as follows;

    a. Lead time, booking with 550-650 days is 100% of cancellations.

    b. Market Segment = Online TA

    c. Customer Type = Transient

    d. Room Type = A

    e. Deposit Type = No Deposit

    f. Country = Portugal

  • Trends and patterns

    a. Seasonality pattern is clearly pronounced for City Hotel where the peak cancellations is during summer months and trough cancellations is during the winter months.

    b. For Resort Hotel is relatively stable, there is no fluctuation number either cancellation or booking.

  • Revenue and Occupancy Rate

    a. City Hotel generated more revenue than Resort Hotel in all three years. Yet, an interesting situation during summer months is Resort Hotel generated more revenue than City Hotel.

    b. Resort Hotel has higher occupancy rate in all three years than City Hotel.

Possible Recommendations:

  1. Offer flexible cancellations policies for market segment “Online TA”, allowing customers to cancel their bookings within a certain time-frame without penalty. We can use (lead time with the lowest cancellation rate).

  2. Implementing a dynamic pricing strategy for room type A: develop pricing rules based on factors such as demand, seasonality & occupancy rates.

  3. Invest in technology: Use pricing software to automate price updates in real-time based, and monitor the performance of the pricing strategy and make adjustments as needed.

  4. Targeted marketing campaign:

    • Focus on promoting City hotel bookings during the winter months, the campaign could target demographics that are most likely to travel during the winter.

    • Emphasize the benefits of booking early such as offering early book discounts, to minimize the high cancellations rate associated with longer lead times. Encourage guest to book Room type D, which have a lower cancellation rate than room type A. Attract more bookings from the least canceled countries such as Ireland, Brazil, USA & Belgium. Highlight a specific promotions and deals that are exclusive to those countries traveler.

    • Encouraging stable bookings for Resort hotel, could promote the unique features and amenities that differentiate it from other accommodation options. For example all-inclusive package, couple package, family holidays, retreat package, etc.

Limitation

  1. The data set was not provided the customer feedback, that we can use to identify areas for improvement such as check-out procedures, room amenities, and customer satisfactions.

  2. For further analysis, we can approach some areas

    • Demographics which countries likely booking in summer months and winter months.

    • Key measurement metrics such as: ADR (average daily rate), RevPar (revenue per available room), and ROI (return of investment).