First, initial setup, Configure the library and Load the data file

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## 
## Loading required package: carData
## 
## 
## Attaching package: 'car'
## 
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## 
## The following object is masked from 'package:purrr':
## 
##     some
## 
## 
## Loading required package: fabletools
## 
## 
## Attaching package: 'maps'
## 
## 
## The following object is masked from 'package:purrr':
## 
##     map
## 
## 
## 
## Attaching package: 'gridExtra'
## 
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## 
## `modelsummary` 2.0.0 now uses `tinytable` as its default table-drawing
##   backend. Learn more at: https://vincentarelbundock.github.io/tinytable/
## 
## Revert to `kableExtra` for one session:
## 
##   options(modelsummary_factory_default = 'kableExtra')
## 
## Change the default backend persistently:
## 
##   config_modelsummary(factory_default = 'gt')
## 
## Silence this message forever:
## 
##   config_modelsummary(startup_message = FALSE)
## Rows: 40060 Columns: 31
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (12): ArrivalDateMonth, Meal, Country, MarketSegment, DistributionChann...
## dbl  (18): IsCanceled, LeadTime, ArrivalDateYear, ArrivalDateWeekNumber, Arr...
## date  (1): ReservationStatusDate
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 79330 Columns: 31
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (12): ArrivalDateMonth, Meal, Country, MarketSegment, DistributionChann...
## dbl  (18): IsCanceled, LeadTime, ArrivalDateYear, ArrivalDateWeekNumber, Arr...
## date  (1): ReservationStatusDate
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Explore the structure of the data set

## tibble [119,390 × 32] (S3: tbl_df/tbl/data.frame)
##  $ hotel                         : chr [1:119390] "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
##  $ is_canceled                   : num [1:119390] 0 0 0 0 0 0 0 0 1 1 ...
##  $ lead_time                     : num [1:119390] 342 737 7 13 14 14 0 9 85 75 ...
##  $ arrival_date_year             : num [1:119390] 2015 2015 2015 2015 2015 ...
##  $ arrival_date_month            : chr [1:119390] "July" "July" "July" "July" ...
##  $ arrival_date_week_number      : num [1:119390] 27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : num [1:119390] 1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : num [1:119390] 0 0 1 1 2 2 2 2 3 3 ...
##  $ adults                        : num [1:119390] 2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ meal                          : chr [1:119390] "BB" "BB" "BB" "BB" ...
##  $ country                       : chr [1:119390] "PRT" "PRT" "GBR" "GBR" ...
##  $ market_segment                : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
##  $ distribution_channel          : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
##  $ is_repeated_guest             : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ reserved_room_type            : chr [1:119390] "C" "C" "A" "A" ...
##  $ assigned_room_type            : chr [1:119390] "C" "C" "C" "A" ...
##  $ booking_changes               : num [1:119390] 3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : chr [1:119390] "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
##  $ agent                         : chr [1:119390] "NULL" "NULL" "NULL" "304" ...
##  $ company                       : chr [1:119390] "NULL" "NULL" "NULL" "NULL" ...
##  $ days_in_waiting_list          : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : chr [1:119390] "Transient" "Transient" "Transient" "Transient" ...
##  $ adr                           : num [1:119390] 0 0 75 75 98 ...
##  $ required_car_parking_spaces   : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : num [1:119390] 0 0 0 0 1 1 0 1 1 0 ...
##  $ reservation_status            : chr [1:119390] "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
##  $ reservation_status_date       : Date[1:119390], format: "2015-07-01" "2015-07-01" ...
##     hotel            is_canceled       lead_time   arrival_date_year
##  Length:119390      Min.   :0.0000   Min.   :  0   Min.   :2015     
##  Class :character   1st Qu.:0.0000   1st Qu.: 18   1st Qu.:2016     
##  Mode  :character   Median :0.0000   Median : 69   Median :2016     
##                     Mean   :0.3704   Mean   :104   Mean   :2016     
##                     3rd Qu.:1.0000   3rd Qu.:160   3rd Qu.:2017     
##                     Max.   :1.0000   Max.   :737   Max.   :2017     
##                                                                     
##  arrival_date_month arrival_date_week_number arrival_date_day_of_month
##  Length:119390      Min.   : 1.00            Min.   : 1.0             
##  Class :character   1st Qu.:16.00            1st Qu.: 8.0             
##  Mode  :character   Median :28.00            Median :16.0             
##                     Mean   :27.17            Mean   :15.8             
##                     3rd Qu.:38.00            3rd Qu.:23.0             
##                     Max.   :53.00            Max.   :31.0             
##                                                                       
##  stays_in_weekend_nights stays_in_week_nights     adults      
##  Min.   : 0.0000         Min.   : 0.0         Min.   : 0.000  
##  1st Qu.: 0.0000         1st Qu.: 1.0         1st Qu.: 2.000  
##  Median : 1.0000         Median : 2.0         Median : 2.000  
##  Mean   : 0.9276         Mean   : 2.5         Mean   : 1.856  
##  3rd Qu.: 2.0000         3rd Qu.: 3.0         3rd Qu.: 2.000  
##  Max.   :19.0000         Max.   :50.0         Max.   :55.000  
##                                                               
##     children           babies              meal             country         
##  Min.   : 0.0000   Min.   : 0.000000   Length:119390      Length:119390     
##  1st Qu.: 0.0000   1st Qu.: 0.000000   Class :character   Class :character  
##  Median : 0.0000   Median : 0.000000   Mode  :character   Mode  :character  
##  Mean   : 0.1039   Mean   : 0.007949                                        
##  3rd Qu.: 0.0000   3rd Qu.: 0.000000                                        
##  Max.   :10.0000   Max.   :10.000000                                        
##  NA's   :4                                                                  
##  market_segment     distribution_channel is_repeated_guest
##  Length:119390      Length:119390        Min.   :0.00000  
##  Class :character   Class :character     1st Qu.:0.00000  
##  Mode  :character   Mode  :character     Median :0.00000  
##                                          Mean   :0.03191  
##                                          3rd Qu.:0.00000  
##                                          Max.   :1.00000  
##                                                           
##  previous_cancellations previous_bookings_not_canceled reserved_room_type
##  Min.   : 0.00000       Min.   : 0.0000                Length:119390     
##  1st Qu.: 0.00000       1st Qu.: 0.0000                Class :character  
##  Median : 0.00000       Median : 0.0000                Mode  :character  
##  Mean   : 0.08712       Mean   : 0.1371                                  
##  3rd Qu.: 0.00000       3rd Qu.: 0.0000                                  
##  Max.   :26.00000       Max.   :72.0000                                  
##                                                                          
##  assigned_room_type booking_changes   deposit_type          agent          
##  Length:119390      Min.   : 0.0000   Length:119390      Length:119390     
##  Class :character   1st Qu.: 0.0000   Class :character   Class :character  
##  Mode  :character   Median : 0.0000   Mode  :character   Mode  :character  
##                     Mean   : 0.2211                                        
##                     3rd Qu.: 0.0000                                        
##                     Max.   :21.0000                                        
##                                                                            
##    company          days_in_waiting_list customer_type           adr         
##  Length:119390      Min.   :  0.000      Length:119390      Min.   :  -6.38  
##  Class :character   1st Qu.:  0.000      Class :character   1st Qu.:  69.29  
##  Mode  :character   Median :  0.000      Mode  :character   Median :  94.58  
##                     Mean   :  2.321                         Mean   : 101.83  
##                     3rd Qu.:  0.000                         3rd Qu.: 126.00  
##                     Max.   :391.000                         Max.   :5400.00  
##                                                                              
##  required_car_parking_spaces total_of_special_requests reservation_status
##  Min.   :0.00000             Min.   :0.0000            Length:119390     
##  1st Qu.:0.00000             1st Qu.:0.0000            Class :character  
##  Median :0.00000             Median :0.0000            Mode  :character  
##  Mean   :0.06252             Mean   :0.5714                              
##  3rd Qu.:0.00000             3rd Qu.:1.0000                              
##  Max.   :8.00000             Max.   :5.0000                              
##                                                                          
##  reservation_status_date
##  Min.   :2014-10-17     
##  1st Qu.:2016-02-01     
##  Median :2016-08-07     
##  Mean   :2016-07-30     
##  3rd Qu.:2017-02-08     
##  Max.   :2017-09-14     
## 

About Dataset

This dataset contains information about hotel bookings with a total of 119390 records for two hotel groups - “City Hotel” and “Resort Hotel” with various attributes of guests and their stays (Abstract is given above). It encompasses diverse variables such as Average daily rate, cancellation, Arrival date, Length of stay, Guest demographics, Booking channels, and Special requests. The data spans multiple years and months.It is providing a comprehensive overview of booking patterns over time. The dataset variables like lead time, cancellation, meal, length of stay, etc. can be used in detailed analysis to find the pattern or message in the dataset which can help formulate the business strategy.

Acknowledgement and Dataset source
This Dataset has been taken from Kaggle, known as TidyTuesday. This has been written by Antonio, Almeida and Nunes.

Data Dictionary (written by tidytuesday in Datasource)
variable description
hotel Hotel
is_canceled Value indicating if the booking was canceled (1) or not (0)
lead_time Number of days that elapsed between the entering date of the booking into the PMS and the arrival date
arrival_date_year Year of arrival date
arrival_date_month Month of arrival date
arrival_date_week_number Week number of year for arrival date
arrival_date_day_of_month Day of arrival date
stays_in_weekend_nights Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
stays_in_week_nights Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
adults Number of adults
children Number of children
babies Number of babies
meal Type of meal booked. Categories are presented in standard hospitality meal packages: Undefined/SC – no meal package; BB – Bed & Breakfast; HB – Half board (breakfast and one other meal – usually dinner); FB – Full board (breakfast, lunch and dinner)
country Country of origin. Categories are represented in the ISO 3155–3:2013 format
market_segment Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”
distribution_channel Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”
is_repeated_guest Value indicating if the booking name was from a repeated guest (1) or not (0)
previous_cancellations Number of previous bookings that were cancelled by the customer prior to the current booking
previous_bookings_not_canceled Number of previous bookings not cancelled by the customer prior to the current booking
reserved_room_type Code of room type reserved. Code is presented instead of designation for anonymity reasons
assigned_room_type Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons
booking_changes Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation
deposit_type Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories: No Deposit – no deposit was made; Non Refund – a deposit was made in the value of the total stay cost; Refundable – a deposit was made with a value under the total cost of stay.
agent ID of the travel agency that made the booking
company ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons
customer_type Type of booking, assuming one of four categories: Contract - when the booking has an allotment or other type of contract associated to it; Group – when the booking is associated to a group; Transient – when the booking is not part of a group or contract, and is not associated to other transient booking; Transient-party – when the booking is transient, but is associated to at least other transient booking
adr Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights
required_car_parking_spaces Number of car parking spaces required by the customer
total_of_special_requests Number of special requests made by the customer (e.g. twin bed or high floor)
reservation_status Reservation last status, assuming one of three categories: Canceled – booking was canceled by the customer; Check-Out – customer has checked in but already departed; No-Show – customer did not check-in and did inform the hotel of the reason why
reservation_status_date Date at which the last status was set. This variable can be used in conjunction with the ReservationStatus to understand when was the booking canceled or when did the customer checked-out of the hotel

Dataset summary and analysis

# Calculate the proportions of each hotel type
hotel_counts <- table(hotel_row$hotel)
hotel_percentages <- prop.table(hotel_counts) * 100

# Create a data frame with hotel types and percentages
hotel_percentages_df <- data.frame(
  hotel = names(hotel_percentages),
  percentage = hotel_percentages
)
options(repr.plot.width = 20, repr.plot.height = 15)

# Create the pie chart using ggplot
pie_chart <- ggplot(hotel_percentages_df, aes(x = "", y = hotel_percentages_df$percentage.Freq, fill = hotel)) +
 geom_bar(stat = "identity", width = 1) +
 geom_text(aes(label = paste0(round(hotel_percentages_df$percentage.Freq, 1), "%")), position = position_stack(vjust = 1.0), size = 4) + # Add data points
  coord_polar("y", start = 0) +
  labs(title = "Both Hotels data Contribution in Dataset", fill = "Hotel Type", x = NULL, y = NULL) +
  scale_fill_manual(values = c("#47f5df", "#ea47f5")) + 
  theme_minimal()
# Print the pie chart
print(pie_chart)

Two hotels data taken for data set. 1. City Hotel and 2. Resort Hotel. City Hotel’s data contribution is 66.4% while Resort Hotel’s data contributed 33.6%.
#Data summary
skim(hotel_row)
Data summary
Name hotel_row
Number of rows 119390
Number of columns 32
_______________________
Column type frequency:
character 13
Date 1
numeric 18
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
hotel 0 1 10 12 0 2 0
arrival_date_month 0 1 3 9 0 12 0
meal 0 1 2 9 0 5 0
country 0 1 2 4 0 178 0
market_segment 0 1 6 13 0 8 0
distribution_channel 0 1 3 9 0 5 0
reserved_room_type 0 1 1 1 0 10 0
assigned_room_type 0 1 1 1 0 12 0
deposit_type 0 1 10 10 0 3 0
agent 0 1 1 4 0 334 0
company 0 1 1 4 0 353 0
customer_type 0 1 5 15 0 4 0
reservation_status 0 1 7 9 0 3 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
reservation_status_date 0 1 2014-10-17 2017-09-14 2016-08-07 926

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
is_canceled 0 1 0.37 0.48 0.00 0.00 0.00 1 1 ▇▁▁▁▅
lead_time 0 1 104.01 106.86 0.00 18.00 69.00 160 737 ▇▂▁▁▁
arrival_date_year 0 1 2016.16 0.71 2015.00 2016.00 2016.00 2017 2017 ▃▁▇▁▆
arrival_date_week_number 0 1 27.17 13.61 1.00 16.00 28.00 38 53 ▅▇▇▇▅
arrival_date_day_of_month 0 1 15.80 8.78 1.00 8.00 16.00 23 31 ▇▇▇▇▆
stays_in_weekend_nights 0 1 0.93 1.00 0.00 0.00 1.00 2 19 ▇▁▁▁▁
stays_in_week_nights 0 1 2.50 1.91 0.00 1.00 2.00 3 50 ▇▁▁▁▁
adults 0 1 1.86 0.58 0.00 2.00 2.00 2 55 ▇▁▁▁▁
children 4 1 0.10 0.40 0.00 0.00 0.00 0 10 ▇▁▁▁▁
babies 0 1 0.01 0.10 0.00 0.00 0.00 0 10 ▇▁▁▁▁
is_repeated_guest 0 1 0.03 0.18 0.00 0.00 0.00 0 1 ▇▁▁▁▁
previous_cancellations 0 1 0.09 0.84 0.00 0.00 0.00 0 26 ▇▁▁▁▁
previous_bookings_not_canceled 0 1 0.14 1.50 0.00 0.00 0.00 0 72 ▇▁▁▁▁
booking_changes 0 1 0.22 0.65 0.00 0.00 0.00 0 21 ▇▁▁▁▁
days_in_waiting_list 0 1 2.32 17.59 0.00 0.00 0.00 0 391 ▇▁▁▁▁
adr 0 1 101.83 50.54 -6.38 69.29 94.58 126 5400 ▇▁▁▁▁
required_car_parking_spaces 0 1 0.06 0.25 0.00 0.00 0.00 0 8 ▇▁▁▁▁
total_of_special_requests 0 1 0.57 0.79 0.00 0.00 0.00 1 5 ▇▁▁▁▁
  • There are 4 places where children values were missing (figured it out by reviewing n_missing field in above table.) which was defined as NA.
  • There are total of 32 variables and 119390 rows of data of both hotels.
Hotel’s number of row :
# Hotel data group by column "hotel" 
hotel_row %>%
  group_by(hotel) %>%
  summarise(count=n())
## # A tibble: 2 × 2
##   hotel        count
##   <chr>        <int>
## 1 City Hotel   79330
## 2 Resort Hotel 40060
Hotel’s Cancellation number : [canceled (1) or not (0)]
# Hotel data group by column "is_canceled:

hotel_row %>%
  group_by(is_canceled) %>%
    summarise(count=n())
## # A tibble: 2 × 2
##   is_canceled count
##         <dbl> <int>
## 1           0 75166
## 2           1 44224
Hotel’s guest arrival yearly number :
hotel_row %>%
  group_by(arrival_date_year) %>%
  summarise(count=n())
## # A tibble: 3 × 2
##   arrival_date_year count
##               <dbl> <int>
## 1              2015 21996
## 2              2016 56707
## 3              2017 40687
Meal types ordered by Guests : (for both hotels)
hotel_row %>%
  group_by(meal) %>%
  summarise(count=n())
## # A tibble: 5 × 2
##   meal      count
##   <chr>     <int>
## 1 BB        92310
## 2 FB          798
## 3 HB        14463
## 4 SC        10650
## 5 Undefined  1169
Hotel’s Country-wise guest number:
hotel_row %>%
  group_by(country) %>%
  summarise(count=n())
## # A tibble: 178 × 2
##    country count
##    <chr>   <int>
##  1 ABW         2
##  2 AGO       362
##  3 AIA         1
##  4 ALB        12
##  5 AND         7
##  6 ARE        51
##  7 ARG       214
##  8 ARM         8
##  9 ASM         1
## 10 ATA         2
## # ℹ 168 more rows
Hotel’s Market Segment-wise number:
hotel_row %>%
  group_by(market_segment) %>%
  summarise(count=n())
## # A tibble: 8 × 2
##   market_segment count
##   <chr>          <int>
## 1 Aviation         237
## 2 Complementary    743
## 3 Corporate       5295
## 4 Direct         12606
## 5 Groups         19811
## 6 Offline TA/TO  24219
## 7 Online TA      56477
## 8 Undefined          2
  • There are two rows where market_segment is undefined.
Hotel’s Distribution channel-wise number:

[Repeated guest (1) or not (0)]

hotel_row %>%
  group_by(distribution_channel) %>%
  summarise(count=n())
## # A tibble: 5 × 2
##   distribution_channel count
##   <chr>                <int>
## 1 Corporate             6677
## 2 Direct               14645
## 3 GDS                    193
## 4 TA/TO                97870
## 5 Undefined                5
  • There are five rows where distribution_channel is undefined.
Hotel’s Repeated business/booking number:
hotel_row %>%
  group_by(is_repeated_guest) %>%
  summarise(count=n())
## # A tibble: 2 × 2
##   is_repeated_guest  count
##               <dbl>  <int>
## 1                 0 115580
## 2                 1   3810
  • There are 3810 repeated booking happned (for both hotels)

Hotel’s reserved room type-wise number:

hotel_row %>%
  group_by(reserved_room_type) %>%
  summarise(count=n())
## # A tibble: 10 × 2
##    reserved_room_type count
##    <chr>              <int>
##  1 A                  85994
##  2 B                   1118
##  3 C                    932
##  4 D                  19201
##  5 E                   6535
##  6 F                   2897
##  7 G                   2094
##  8 H                    601
##  9 L                      6
## 10 P                     12
Hotel’s reservation status and number :(for both the hotels)
hotel_row %>%
  group_by(reservation_status) %>%
  summarise(count=n())
## # A tibble: 3 × 2
##   reservation_status count
##   <chr>              <int>
## 1 Canceled           43017
## 2 Check-Out          75166
## 3 No-Show             1207

Data anomaly & cleaning:

  1. In hotel dataset, there are 4 rows with NA value in “children” column. These rows can be deleted considering no impact.
hotel_data_without_na <- hotel_row %>% 
  na.omit(hotel_data)

cat("Before data cleaning, Number of rows : ",nrow(hotel_row),"\n")
## Before data cleaning, Number of rows :  119390
cat("After data cleaning, Number of rows : ",nrow(hotel_data_without_na),"\n")
## After data cleaning, Number of rows :  119386
  1. In the dataset, there are 294 rows where adults count is zero and reservation_status is check-out. This does not make sense.
hotel_data_filtered <- hotel_data_without_na %>%
  filter((hotel_data_without_na$adults == 0 & hotel_data_without_na$reservation_status == "Check-Out"))
adt_0_chl <-nrow(hotel_data_filtered)

cat("Number record which has adult as zero with checkout status : ",adt_0_chl,"\n")
## Number record which has adult as zero with checkout status :  294
hotel_temp <- subset(hotel_data_without_na, !(adults == 0 & reservation_status == "Check-Out"))
#nrow(hotel_temp)


hotel_temp <- 
  subset(hotel_data_without_na, hotel_data_without_na$adults == 0 & hotel_data_without_na$children == 0 & hotel_data_without_na$babies ==0)

hotels_ultimate_data<- hotel_data_without_na %>% 
  filter(adults +children +babies !=0)

nrow(hotels_ultimate_data)
## [1] 119206
  1. After cleaning, Data set have “119206” rows of both hotels.

Data Exploration and Insights

Guest type analysis:

In hotel dataset, We can divide hotel guests into four major groups i.e. Adult, Adult with Children,Adult with babies and Adult with Children & Babies.
Q. what is the distribution of these groups across the two types of hotels?

guest_groups <- hotels_ultimate_data %>% select(hotel, adults,children,babies) 

guest_classification_group <-guest_groups %>% 
  mutate(group= case_when(adults >  0 & children == 0 & babies == 0 ~ 'Adult',
                          adults >  0 & children >  0 & babies == 0 ~ 'Adult_Children',
                          adults >  0 & children == 0 & babies >  0 ~ 'Adult_Babies',
                          adults >  0 & children >  0 & babies > 0 ~ 'Adult_Children_Babies',
                          adults == 0 & children >  0 & babies == 0 ~ 'Children',
                          adults == 0 & children == 0 & babies >  0 ~ 'Babies',
                          adults == 0 & children >  0 & babies >  0 ~ 'Children_Babies'
                          ))

guest_classification_group %>% 
  group_by(group) %>% 
  summarise(count=n())
## # A tibble: 6 × 2
##   group                  count
##   <chr>                  <int>
## 1 Adult                 109874
## 2 Adult_Babies             742
## 3 Adult_Children          8195
## 4 Adult_Children_Babies    172
## 5 Children                 220
## 6 Children_Babies            3

After classifying the guest into groups, I found that there were 223 rows for which Adult count was zero but children and babies had some value. This was also making no sense in real world that children and babies visited hotel without Adult.

guest_group_hotel <- guest_classification_group %>% select(hotel,group) 


guest_group_hotel <- guest_group_hotel %>% group_by(hotel,group) %>% summarise(count=n())
## `summarise()` has grouped output by 'hotel'. You can override using the
## `.groups` argument.
guest_group_CityHotel <- 
  guest_group_hotel %>% 
  filter(hotel == "City Hotel")

guest_group_ResortHotel <- 
  guest_group_hotel %>% 
  filter(hotel == "Resort Hotel")

skim(guest_group_CityHotel)
Data summary
Name guest_group_CityHotel
Number of rows 6
Number of columns 3
_______________________
Column type frequency:
character 1
numeric 1
________________________
Group variables hotel

Variable type: character

skim_variable hotel n_missing complete_rate min max empty n_unique whitespace
group City Hotel 0 1 5 21 0 6 0

Variable type: numeric

skim_variable hotel n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
count City Hotel 0 1 13193.17 29728.46 3 106.75 258.5 3684.75 73756 ▇▁▁▁▂
skim(guest_group_ResortHotel)
Data summary
Name guest_group_ResortHotel
Number of rows 4
Number of columns 3
_______________________
Column type frequency:
character 1
numeric 1
________________________
Group variables hotel

Variable type: character

skim_variable hotel n_missing complete_rate min max empty n_unique whitespace
group Resort Hotel 0 1 5 21 0 4 0

Variable type: numeric

skim_variable hotel n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
count Resort Hotel 0 1 10011.75 17466.25 103 359.5 1913 11565.25 36118 ▇▁▁▁▂
guest_classification_group %>% 
  mutate(group = factor(group, levels=c("Adult","Adult_Children","Adult_Babies","Children","Adult_Children_Babies", "Children_Babies"))) %>% 
  ggplot(aes(x = group,
             y = stat(count),
         fill = factor(hotel),
         label = stat(count) ))+
  geom_bar(position = position_dodge(),width = .5)+
  geom_text(stat = "count",
            position = position_dodge(1),
            vjust = -0.5,
            size = 3)+
  labs(title = "Guest groups distribution by hotel", 
       x = "Guest group",
       y = "Count")+                               
  theme(plot.title = element_text(hjust=0.5), 
        axis.text.x = element_text(size = 7),   
        axis.text.y = element_text(size = 7))+     
  scale_fill_manual(values = c("#47f5df", "#ea47f5"),
    name= "Hotel"
  ) 


1. There are two main groups -“Adults” and “Adults with Children”which have most of data.
2. There are 220 children of City Hotel with no Adults.
3. There are 3 children_Babies and with no Adults.

print(guest_group_hotel)
## # A tibble: 10 × 3
## # Groups:   hotel [2]
##    hotel        group                 count
##    <chr>        <chr>                 <int>
##  1 City Hotel   Adult                 73756
##  2 City Hotel   Adult_Babies            297
##  3 City Hotel   Adult_Children         4814
##  4 City Hotel   Adult_Children_Babies    69
##  5 City Hotel   Children                220
##  6 City Hotel   Children_Babies           3
##  7 Resort Hotel Adult                 36118
##  8 Resort Hotel Adult_Babies            445
##  9 Resort Hotel Adult_Children         3381
## 10 Resort Hotel Adult_Children_Babies   103
Country vs Guest distribution of the two hotels
country_guest <- hotels_ultimate_data   %>% 
                filter(is_canceled==0)  %>%
                select(hotel, country)  %>%
                group_by(hotel,country) %>%
                summarise(count=n(),.groups = 'drop')
country_guest_R_hotel <- country_guest %>% 
  filter(hotel== "Resort Hotel") %>% 
  arrange(desc(count))

country_guest_R_hotel %>% 
  arrange(desc(count))
## # A tibble: 119 × 3
##    hotel        country count
##    <chr>        <chr>   <int>
##  1 Resort Hotel PRT     10184
##  2 Resort Hotel GBR      5922
##  3 Resort Hotel ESP      3105
##  4 Resort Hotel IRL      1734
##  5 Resort Hotel FRA      1399
##  6 Resort Hotel DEU      1057
##  7 Resort Hotel CN        614
##  8 Resort Hotel NLD       458
##  9 Resort Hotel NULL      419
## 10 Resort Hotel USA       407
## # ℹ 109 more rows
top_15_countries_r <- head(country_guest_R_hotel, 15)



# Plotting
ggplot(top_15_countries_r, aes(x = country, y = count)) +
  geom_bar(stat = "identity", fill = "#ea47f5") +
  labs(title = "Country-wise - Count of Guests in Resort Hotel",
       x = "Country",
       y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 


1. Top five country of Resort hotel on the basis of visited guests are, Portugal (PRT), United Kingdom(GBR), Spain(ESP), Ireland (IRL) and France(FRA).

country_guest_C_hotel <- country_guest %>% 
  filter(hotel== "City Hotel") %>% 
  arrange(desc(count))

country_guest_C_hotel %>% 
  arrange(desc(count))
## # A tibble: 151 × 3
##    hotel      country count
##    <chr>      <chr>   <int>
##  1 City Hotel PRT     10793
##  2 City Hotel FRA      7069
##  3 City Hotel DEU      5010
##  4 City Hotel GBR      3746
##  5 City Hotel ESP      3278
##  6 City Hotel ITA      2049
##  7 City Hotel BEL      1479
##  8 City Hotel NLD      1258
##  9 City Hotel USA      1185
## 10 City Hotel BRA      1063
## # ℹ 141 more rows
top_15_countries_c <- head(country_guest_C_hotel, 15)



# Plotting
ggplot(top_15_countries_c, aes(x = country, y = count)) +
  geom_bar(stat = "identity", fill = "#47f5df") +
  labs(title = "Country-wise - Count of Guests in City Hotel",
       x = "Country",
       y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 


2. Top five country of City hotel on the basis of visited guests are Portugal (PRT),France(FRA), Germany (DEU) , United Kingdom(GBR), Spain(ESP)

Repeated Guest percentage of the two hotels:
repeated_rate <- hotels_ultimate_data %>% 
  filter(is_canceled == 0) %>% 
  select(hotel, is_repeated_guest)

repeated_rate %>% 
  group_by(hotel, is_repeated_guest) %>% 
  summarise(count=n())
## `summarise()` has grouped output by 'hotel'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 3
## # Groups:   hotel [2]
##   hotel        is_repeated_guest count
##   <chr>                    <dbl> <int>
## 1 City Hotel                   0 44546
## 2 City Hotel                   1  1538
## 3 Resort Hotel                 0 27260
## 4 Resort Hotel                 1  1667
City_rep_guest <- (1538/(1538+44546))*100

cat(" In City Hotel, Repeated guest %: " , City_rep_guest ,"\n")
##  In City Hotel, Repeated guest %:  3.337384
Resort_rep_guest <- (1667/(1667+27260))*100

cat(" In Resort Hotel, Repeated guest %: " , Resort_rep_guest ,"\n")
##  In Resort Hotel, Repeated guest %:  5.762782
Great!!! Hotel have their guests geography and repeated rate with above determined percentage so hotel business/stratigest can make their Strategy for their main customer geography and repeated business. Also they can make plan for the Ad-campaign.

Reservation Pattern Analysis

In Hotel Dataset, Column “adr”(Average Daily Rate, is a numeric field) could be one of the key valuable continuous variables of the hotel room. adr - defined by dividing the sum of all lodging transactions by the total number of staying nights or the average paid amount for a room per night.It is key metric for observing the financial performance of a hotel. It is a key data point from the hotel dataset for hotel stakeholders like hotel managers, investigators etc. to evaluate revenue generation and pricing strategies for future business. Therefore, ADR “Average Daily Rate” can be considered a valuable continuous variable in hotel data analysis in my hotel dataset.
Sum of ADR of room type of hotels
#Select adr and associated variable 
hotels_data_adr <- hotels_ultimate_data %>% select (hotel,adr,is_canceled,assigned_room_type,reservation_status_date, arrival_date_year, arrival_date_month)


# Arranging hotel data accoring to adr(Average Daily Rate).
hotels_data_adr <-hotels_data_adr[order(hotels_data_adr$adr,decreasing = TRUE),]
#Display top 10 row for columns hotel and adr.
#head(hotels_data_adr[,c("hotel","assigned_room_type","adr")],10)

# Room type in hotel
hotel_unique_room_type <- unique(hotels_data_adr$assigned_room_type)

cat("Room type in hotel : ",hotel_unique_room_type)
## Room type in hotel :  A G C E H F D I B K L
hotels_data_adr %>% 
  group_by(hotel,assigned_room_type) %>% 
  summarise(sum(adr))
## `summarise()` has grouped output by 'hotel'. You can override using the
## `.groups` argument.
## # A tibble: 18 × 3
## # Groups:   hotel [2]
##    hotel        assigned_room_type `sum(adr)`
##    <chr>        <chr>                   <dbl>
##  1 City Hotel   A                    5509837.
##  2 City Hotel   B                     187999.
##  3 City Hotel   C                      16181.
##  4 City Hotel   D                    1822770.
##  5 City Hotel   E                     312072.
##  6 City Hotel   F                     360519.
##  7 City Hotel   G                     128932.
##  8 City Hotel   K                      13521.
##  9 Resort Hotel A                    1387440.
## 10 Resort Hotel B                      16178.
## 11 Resort Hotel C                     253200.
## 12 Resort Hotel D                     898143.
## 13 Resort Hotel E                     606727.
## 14 Resort Hotel F                     209220.
## 15 Resort Hotel G                     296020.
## 16 Resort Hotel H                     122023.
## 17 Resort Hotel I                      14826.
## 18 Resort Hotel L                          8


Room type: A and D are in trends and generating major revenue.

Number of room type booked of hotels
hotels_data_adr %>% 
  group_by(hotel,assigned_room_type) %>% 
  summarise(count=n())
## `summarise()` has grouped output by 'hotel'. You can override using the
## `.groups` argument.
## # A tibble: 18 × 3
## # Groups:   hotel [2]
##    hotel        assigned_room_type count
##    <chr>        <chr>              <int>
##  1 City Hotel   A                  56977
##  2 City Hotel   B                   1991
##  3 City Hotel   C                    158
##  4 City Hotel   D                  14971
##  5 City Hotel   E                   2160
##  6 City Hotel   F                   2018
##  7 City Hotel   G                    697
##  8 City Hotel   K                    187
##  9 Resort Hotel A                  17043
## 10 Resort Hotel B                    159
## 11 Resort Hotel C                   2212
## 12 Resort Hotel D                  10338
## 13 Resort Hotel E                   5638
## 14 Resort Hotel F                   1733
## 15 Resort Hotel G                   1852
## 16 Resort Hotel H                    712
## 17 Resort Hotel I                    359
## 18 Resort Hotel L                      1
Reservation Status Date Analysis

How does the distribution of reservation statuses vary over time?

# Convert reservation_status_date to Date type
hotels_ultimate_data$reservation_status_date <- as.Date(hotels_ultimate_data$reservation_status_date)

# Plot the distribution of reservation status over time
reservation_status_plot <- ggplot(hotels_ultimate_data, aes(x = reservation_status_date, fill = reservation_status)) +
  geom_bar() +
  labs(title = "Distribution of Reservation Status Over Time",
       x = "Reservation Status Date",
       y = "Count",
       fill = "Reservation Status") +
  theme_minimal()

# Print the plot
print(reservation_status_plot)

Conclusion: Reservation Status Date Analysis provides insights into the distribution and trends of reservation statuses over time.
1. By examining the distribution of reservation statuses over time, It can help business to identify patterns and trends in the booking activity of the hotel.
2. Analysis of reservation statuses over time can reveal insights into guest booking behavior, such as the frequency of cancellations, No-show, and successful Stay.
3. This analysis can also help business in forecasting and planning to anticipate demand fluctuations and implement strategies to optimize revenue and guest satisfaction.

Hotels Analysis
Analysis Aspect :How do different hotels perform in terms of various metrics like “Hotel booking trends over time”,“Guest demographics analysis”,“Revenue analysis” etc. ?
#head(hotel_booking_trends)
# Hotel booking trends over time
hotel_booking_trends <- hotels_ultimate_data %>%
  group_by(hotel, arrival_date_year) %>%
  summarise(total_bookings = n()) %>%
  ggplot(aes(x = factor(arrival_date_year), y = total_bookings, fill = hotel)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Hotel Booking Trends Over Time",
       x = "Year",
       y = "Total Bookings") +
  theme_minimal()
## `summarise()` has grouped output by 'hotel'. You can override using the
## `.groups` argument.
# Guest demographics analysis
guest_demographics <- hotels_ultimate_data %>%
  group_by(hotel, country) %>%
  summarise(total_guests = n_distinct(customer_type)) %>%
  ggplot(aes(x = country, y = total_guests, fill = hotel)) +
  geom_bar(stat = "identity") +
  labs(title = "Guest Demographics Analysis",
       x = "Country",
       y = "Total Guests") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
## `summarise()` has grouped output by 'hotel'. You can override using the
## `.groups` argument.
# Revenue analysis
revenue_analysis <- hotels_ultimate_data %>%
  group_by(hotel) %>%
  summarise(total_revenue = sum(adr * (stays_in_weekend_nights + stays_in_week_nights))) %>%
  ggplot(aes(x = hotel, y = total_revenue, fill = hotel)) +
  geom_bar(stat = "identity") +
  labs(title = "Revenue Analysis by Hotel",
       x = "Hotel",
       y = "Total Revenue") +
  theme_minimal()

grid.arrange(hotel_booking_trends, revenue_analysis, nrow = 1)

Observation :

  1. City Hotel’s revenue is higher than Resort Hotel.
  2. City Hotel’s booking count is higher than Resort Hotel.
  3. The year 2016 was a good year for both the hotels. Management can further analyze the year 2016 for “Why did this year have more booking and revenue(Possible answer could be - if there were any special events around etc.)”
# Filter top 10 countries for each hotel
top_countries <- hotels_ultimate_data %>%
  group_by(hotel, country) %>%
  summarise(total_guests = n_distinct(customer_type)) %>%
  arrange(hotel, desc(total_guests)) %>%
  group_by(hotel) %>%
  top_n(10)
## `summarise()` has grouped output by 'hotel'. You can override using the
## `.groups` argument.
## Selecting by total_guests
# Plot guest demographics analysis for top 10 countries
guest_demographics_top10 <- top_countries %>%
  ggplot(aes(x = country, y = total_guests, fill = hotel)) +
  geom_bar(stat = "identity") +
  labs(title = "Guest Demographics Analysis (Top 10 Countries)",
       x = "Country",
       y = "Total Guests") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Print the plot
print(guest_demographics_top10)

Explanation: by analyzing above graph, here are some outlines.
1. Hotel Booking Trends Over Time of analyzing the trend in booking for both hotels over time. Here, City hotel is leading the trends in all three years. Although in 2017, Trends went little bit downside.
2. Revenue Analysis is a revenue metric which has been drived by average daily rate (ADR), Stay number of night,total revenue for each hotel.City Hotel’s revenue is higher than resort hotel.
3. In Guest Demographics graph, We have drawn the top ten countries of guest of hotels. which help business to make the strategy for their guest welcome,stay , hotel arrangement and future strategy for them attract.

ADR (Average Daily Rate) Analysis

Question: How has the Average Daily Rate (ADR) for hotel bookings evolved over time?

# Calculate ADR (Average Daily Rate)
hotels_ultimate_data$ADR <- hotels_ultimate_data$adr / hotels_ultimate_data$stays_in_weekend_nights + hotels_ultimate_data$stays_in_week_nights

# Aggregate ADR by date
ADR_by_date <- hotels_ultimate_data %>%
  group_by(reservation_status_date) %>%
  summarise(Total_ADR = mean(ADR, na.rm = TRUE))

head(ADR_by_date,10)
## # A tibble: 10 × 2
##    reservation_status_date Total_ADR
##    <date>                      <dbl>
##  1 2014-10-17                 Inf   
##  2 2014-11-18                   3   
##  3 2015-01-01                 Inf   
##  4 2015-01-02                   9.54
##  5 2015-01-18                   2   
##  6 2015-01-20                  41.2 
##  7 2015-01-21                 Inf   
##  8 2015-01-22                 Inf   
##  9 2015-01-28                 127   
## 10 2015-01-29                 Inf
# Plot Total ADR over time
ggplot(ADR_by_date, aes(x = reservation_status_date, y = Total_ADR)) +
  geom_line() +
  labs(title = "Total ADR Over Time", x = "Date", y = "Total ADR") +
  theme_minimal()

Observation : Graph has continous pick for 2016, means there is some special event in 2016.

ADR Analysis by Hotel
# Calculate Total ADR for each hotel
total_adr <- hotels_ultimate_data %>%
  group_by(hotel) %>%
  summarise(total_adr = mean(adr, na.rm = TRUE))

# Plot Total ADR for each hotel
ggplot(total_adr, aes(x = hotel, y = total_adr, fill = hotel)) +
  geom_bar(stat = "identity") +
  labs(title = "Total ADR Analysis by Hotel",
       x = "Hotel",
       y = "Average Daily Rate (ADR)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Explanation:

ADR value of City hotel is higher than Resort Hotel.
Conclusion: Total ADR (Average Daily Rate) over time Analysis and Trends can help business to identify the peak during certain periods, such as holidays or peak travel season or any event like game match. Hotelwise - Total ADR Analysis, Business can see insights into the performance and competitiveness of each hotel within the market landscape.

ADR boxplot

Question : How does the Average Daily Rate (ADR) vary across different reserved room types in the hotels?

adr_boxplot_resort <- ggplot(hotel1, aes(x = reserved_room_type, y = adr, fill = factor(hotel))) +
  geom_boxplot(position = position_dodge(),width = .7,  outlier.shape = NA) +
  labs(title = "ADR Distribution by Reserved \n Room Type (Resort Hotel)",
       x = "Reserved Room Type",
       y = "ADR") +
  scale_fill_discrete( name="Resort Hotel")+
  theme(plot.title = element_text(hjust=0.5))




adr_boxplot_city <- ggplot(hotel2, aes(x = reserved_room_type, y = adr, fill = factor(hotel))) +
  geom_boxplot(position = position_dodge(), width = 0.7, outlier.shape = NA) +
  labs(title = "ADR Distribution by Reserved \n Room Type (City Hotel)",
       x = "Reserved Room Type",
       y = "ADR") +
  scale_fill_manual(name = "City Hotel", values = c("#47f5df")) +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_y_continuous(labels = function(x) paste0(x/100, "00"))


# Print the boxplot
#print(adr_boxplot)

grid.arrange(adr_boxplot_resort,adr_boxplot_city, nrow = 1)

Explanation and Conclusion: The ADR Distribution by Reserved Room Type plot illustrates the distribution of Average Daily Rate (ADR) across various room types that guests have reserved in the hotels.Above graph is providing the insight on variability of ADR among different room types which helps to understand pricing and preferences of room type in guest choice.
This analysis will help hotel management to optimize the dynamic pricing and resource allocation according to the demand of room type.

Lead Time vs. ADR Analysis:
Question: 1. How does the lead time (the number of days between booking and arrival) affect the average daily rate (ADR) of hotel bookings?
# linear regression Model with adr and lead time 
model_lm_adr_ld_tm <- lm(adr ~ lead_time, data = hotels_ultimate_data)


coef(model_lm_adr_ld_tm)
##  (Intercept)    lead_time 
## 105.16823992  -0.03070445
summary(model_lm_adr_ld_tm)
## 
## Call:
## lm(formula = adr ~ lead_time, data = hotels_ultimate_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -105.6  -31.3   -7.3   23.8 5295.9 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 105.168240   0.203492  516.82   <2e-16 ***
## lead_time    -0.030704   0.001364  -22.51   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.33 on 119204 degrees of freedom
## Multiple R-squared:  0.004234,   Adjusted R-squared:  0.004225 
## F-statistic: 506.8 on 1 and 119204 DF,  p-value: < 2.2e-16
#modelsummary(model_lm_adr_ld_tm)


# Plot scatter for lm_lead_adr_mod_1
lm_lead_adr_mod_1_scatter  <- ggplot(hotels_ultimate_data, aes(x = lead_time, y = adr)) +
  geom_point() +  
  geom_smooth(method = "lm", se = FALSE, color = "blue") +  
  labs(title = "Linear Regression:\nADR vs. Lead Time",
       x = "Lead Time",
       y = "ADR")

#Plot Linner for lm_lead_adr_mod_1
lm_lead_adr_mod_1_linner_2 <- ggplot(hotels_ultimate_data, aes(x = lead_time, y = adr)) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Linear Regression: \n ADR vs. Lead Time",
       x = "Lead Time",
       y = "ADR")


#Arrange above two graph with single grid 
grid.arrange(lm_lead_adr_mod_1_scatter,lm_lead_adr_mod_1_linner_2, nrow = 1)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

Explanation:
✔ Interpretation :
The Intercept is 105.168 (estimated ADR) when lead time is zero which is unlikely hood in our hotel bookings case.It might not have practical interpretation since it is unlikely to have a lead time zero.

✔ Significance :The Coefficients for lead time is -0.031. It indicates the estimated change in the average daily rate for a room increase in lead time. In our case,it suggests that for each additional day of lead time, the average daily rate decreases by $0.02983 which is unlike in realworld. Here low p-value: < 2.2e-16 is indicating that model is statistically significant for variable leadtime and adr relationship.

✔ Conclusion : According to above result, there is a significant negative relationship between lead time and ARD. It is indicating that as lead time increase, ADR tends to decrease.

2. How does lead time , the month of arrival , special car parking request collectively influence the average daily rate (ADR) of hotel bookings?


#Build a linear regression model with adr , lead_time and arrival_date_month,required_car_parking_spaces as predictors
lm_lead_adr_mod_2 <- lm(adr ~ lead_time + arrival_date_day_of_month+required_car_parking_spaces, data = hotels_ultimate_data)
# Calculates the predicted average daily rate (ADR) for each observation in the dataset using linner regression model i.e which is lm_lead_adr_mod_2
hotels_ultimate_data$predicted_adr <- predict(lm_lead_adr_mod_2, newdata = hotels_ultimate_data)

print("*************************************")
## [1] "*************************************"
summary(lm_lead_adr_mod_2)
## 
## Call:
## lm(formula = adr ~ lead_time + arrival_date_day_of_month + required_car_parking_spaces, 
##     data = hotels_ultimate_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -144.3  -31.3   -7.1   23.9 5295.1 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 101.537505   0.335151  302.96   <2e-16 ***
## lead_time                    -0.028022   0.001371  -20.44   <2e-16 ***
## arrival_date_day_of_month     0.171996   0.016573   10.38   <2e-16 ***
## required_car_parking_spaces  10.136894   0.597182   16.98   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.24 on 119202 degrees of freedom
## Multiple R-squared:  0.007556,   Adjusted R-squared:  0.007531 
## F-statistic: 302.5 on 3 and 119202 DF,  p-value: < 2.2e-16
print("*************************************")
## [1] "*************************************"
coef(lm_lead_adr_mod_2)
##                 (Intercept)                   lead_time 
##                101.53750477                 -0.02802225 
##   arrival_date_day_of_month required_car_parking_spaces 
##                  0.17199568                 10.13689379
print("*************************************")
## [1] "*************************************"
vif(lm_lead_adr_mod_2)
##                   lead_time   arrival_date_day_of_month 
##                    1.013802                    1.000084 
## required_car_parking_spaces 
##                    1.013871
print("*************************************")
## [1] "*************************************"
ncvTest(lm_lead_adr_mod_2)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 5856.871, Df = 1, p = < 2.22e-16
print("*************************************")
## [1] "*************************************"
# Plot scatter for lm_lead_adr_mod_2
lm_lead_adr_mod_2_scatter <- ggplot(hotels_ultimate_data, aes(x = predicted_adr, y = adr)) +
  geom_point() +  # Plot the observed values
  geom_smooth(method = "lm", se = FALSE, color = "blue") +  # Add linear regression line
  labs(title = "Linear Regression:\n Observed vs. Predicted ADR",
       x = "Predicted ADR",
       y = "Observed ADR")
#Plot Linner for lm_lead_adr_mod_2
lm_lead_adr_mod_2_linner <- ggplot(hotels_ultimate_data, aes(x = predicted_adr, y = adr)) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +  
  labs(title = "Linear Regression:\n Observed vs. Predicted ADR",
       x = "Predicted ADR",
       y = "Observed ADR")

#Arrange above two graph with single grid 
grid.arrange(lm_lead_adr_mod_2_scatter,lm_lead_adr_mod_2_linner, nrow = 1)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

Explanation : + In above model,We have created a linear regression model between lead time along with arrival month ,car parking request on ARD (This type of scenario is significant where car parking is one of the major factor for hotel booking.)
+ Then calculated the predicted values(hotels_ultimate_data$predicted_adr) based on a fitted model which is lm_lead_adr_mod_2 for or each observation of dataset.

✔ Interpretation : The Intercept is 101.537 (estimated ADR) when all predictor variables - lead time, arrival date day of month, and required car parking spaces are zero. (This Intercept value 101.537 is almost same of above model 105.168 )

✔ Significance : Cofficient value for lead time is -0.028022 (almost same of above model -.030).Cofficient for arrival_date_day_of_month is 0.172 menas for each day in the month of arrival, ADR might have impact of .172 and Cofficient for required_car_parking_spaces is 10.137 menas if guest have need of car parking then ADR may have impact of 10.137 unit price.

✔ Conclusion : Lead time along with month of arrival, and special car parking request collectivly have a significant (as p-value: < 2.2e-16 is low) but relatively small influence on the ADR of hotel bookings. Lead time has negative impact and car parking request has positive impact on ADR whereas the month of arrival has a very smaller positive effect on ADR. Overall impact of these factors is very small on ADR.

Recommendations:

  1. According to the above linear regression model outcome, It is recommended for optimizing their hotel revenue that they should implement dynamic pricing to offer better rates for those customers who book the room in advance.
  2. The linear regression model can also be used in forecasting future demand and this information can be used for formulating pricing strategies so that hotel revenue can be optimized.

Lead time vs. Cancellation Analysis

Question: How does the lead time ( “adr - The number of days between booking and arrival”) impact the likelihood of cancellation in hotel bookings?

hotel_city_lead_cancellation <- hotels_ultimate_data %>% 
  filter(hotel== "City Hotel")

hotel_resort_lead_cancellation <- hotels_ultimate_data %>% 
  filter(hotel== "Resort Hotel")

nrow(hotel_city_lead_cancellation)
## [1] 79159
sm_model_hotel_city_lead_cancellation <- lm(lead_time~is_canceled,hotel_city_lead_cancellation)
print("***********************************************************************")
## [1] "***********************************************************************"
summary(sm_model_hotel_city_lead_cancellation)
## 
## Call:
## lm(formula = lead_time ~ is_canceled, data = hotel_city_lead_cancellation)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -150.34  -75.85  -32.34   52.15  478.66 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  80.8547     0.4916  164.46   <2e-16 ***
## is_canceled  69.4894     0.7606   91.36   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 105.5 on 79157 degrees of freedom
## Multiple R-squared:  0.09539,    Adjusted R-squared:  0.09538 
## F-statistic:  8347 on 1 and 79157 DF,  p-value: < 2.2e-16
print("***********************************************************************")
## [1] "***********************************************************************"
sm_model_hotel_resort_lead_cancellation <- lm(lead_time~is_canceled,hotel_resort_lead_cancellation)
print("***********************************************************************")
## [1] "***********************************************************************"
summary(sm_model_hotel_resort_lead_cancellation)
## 
## Call:
## lm(formula = lead_time ~ is_canceled, data = hotel_resort_lead_cancellation)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -128.70  -74.70  -36.70   57.15  658.15 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  78.8510     0.5568  141.62   <2e-16 ***
## is_canceled  49.8526     1.0566   47.18   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 94.7 on 40045 degrees of freedom
## Multiple R-squared:  0.05266,    Adjusted R-squared:  0.05264 
## F-statistic:  2226 on 1 and 40045 DF,  p-value: < 2.2e-16
print("***********************************************************************")
## [1] "***********************************************************************"
lead_time_cancel_plot_city <- ggplot(hotel_city_lead_cancellation, aes(x = lead_time, fill= factor(is_canceled), label = stat(count))) +
  geom_density(alpha = 0.5) +  # Adjust alpha for transparency
  labs(title = "City Hotel:Lead Time vs. Cancellation ",
       x = "Lead Time",
       y = "Density",
       color = "Hotel") +
  scale_fill_discrete(
    name = "Cancelled",
    breaks = c("0", "1"),
    label = c("Not Cancelled", "Cancelled")
  )+
  theme_minimal()


lead_time_cancel_plot_resort <- ggplot(hotel_resort_lead_cancellation, aes(x = lead_time, fill= factor(is_canceled), label = stat(count))) +
  geom_density(alpha = 0.5) +  # Adjust alpha for transparency
  labs(title = "Resort Hotel: Lead Time vs. Cancellation",
       x = "Lead Time",
       y = "Density",
       color = "Hotel") +
  scale_fill_discrete(
    name = "Cancelled",
    breaks = c("0", "1"),
    label = c("Not Cancelled", "Cancelled")
  )+
  theme_minimal()

# Print the scatter plot
#print(lead_time_cancel_plot)

grid.arrange(lead_time_cancel_plot_city, lead_time_cancel_plot_resort, nrow = 2)

Explanation:
1.Lead time and Cancellation analysis determine the relationship between lead time and likely hood of Cancellation in hotel booking.
2. The intercept is 80.854 tells that during this estimated lead time when a booking may not canceled in city hotel where as This value is 78.851 for resort hotel. 3. Significance Both coefficients are highly significant because p-value < 2.2e-16 (small then 0.05).

Conclusion: 1. For City hotel, “Canceled booking” tend to have a lead time approximately 69 days longer than non-canceled bookings
2. For resort hotel, “Canceled booking” tend to have a lead time approximately 49 days longer than non-canceled bookings.
3. By understanding the relationship of cancellation status and duration can help hotel management for defining their cancelation policy , resource arrangement and other strategies (like booking insurance etc.).

Overall, this analysis may help hotel management to make strategic decisions to enhance booking , revenue and optimize the resources.

#Factor 
str(hotels_ultimate_data)
## tibble [119,206 × 34] (S3: tbl_df/tbl/data.frame)
##  $ hotel                         : chr [1:119206] "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
##  $ is_canceled                   : num [1:119206] 0 0 0 0 0 0 0 0 1 1 ...
##  $ lead_time                     : num [1:119206] 342 737 7 13 14 14 0 9 85 75 ...
##  $ arrival_date_year             : num [1:119206] 2015 2015 2015 2015 2015 ...
##  $ arrival_date_month            : chr [1:119206] "July" "July" "July" "July" ...
##  $ arrival_date_week_number      : num [1:119206] 27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : num [1:119206] 1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : num [1:119206] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : num [1:119206] 0 0 1 1 2 2 2 2 3 3 ...
##  $ adults                        : num [1:119206] 2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : num [1:119206] 0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : num [1:119206] 0 0 0 0 0 0 0 0 0 0 ...
##  $ meal                          : chr [1:119206] "BB" "BB" "BB" "BB" ...
##  $ country                       : chr [1:119206] "PRT" "PRT" "GBR" "GBR" ...
##  $ market_segment                : chr [1:119206] "Direct" "Direct" "Direct" "Corporate" ...
##  $ distribution_channel          : chr [1:119206] "Direct" "Direct" "Direct" "Corporate" ...
##  $ is_repeated_guest             : num [1:119206] 0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : num [1:119206] 0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: num [1:119206] 0 0 0 0 0 0 0 0 0 0 ...
##  $ reserved_room_type            : chr [1:119206] "C" "C" "A" "A" ...
##  $ assigned_room_type            : chr [1:119206] "C" "C" "C" "A" ...
##  $ booking_changes               : num [1:119206] 3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : chr [1:119206] "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
##  $ agent                         : chr [1:119206] "NULL" "NULL" "NULL" "304" ...
##  $ company                       : chr [1:119206] "NULL" "NULL" "NULL" "NULL" ...
##  $ days_in_waiting_list          : num [1:119206] 0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : chr [1:119206] "Transient" "Transient" "Transient" "Transient" ...
##  $ adr                           : num [1:119206] 0 0 75 75 98 ...
##  $ required_car_parking_spaces   : num [1:119206] 0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : num [1:119206] 0 0 0 0 1 1 0 1 1 0 ...
##  $ reservation_status            : chr [1:119206] "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
##  $ reservation_status_date       : Date[1:119206], format: "2015-07-01" "2015-07-01" ...
##  $ ADR                           : num [1:119206] NaN NaN Inf Inf Inf ...
##  $ predicted_adr                 : Named num [1:119206] 92.1 81.1 101.5 101.3 101.3 ...
##   ..- attr(*, "names")= chr [1:119206] "1" "2" "3" "4" ...
##  - attr(*, "na.action")= 'omit' Named int [1:4] 40601 40668 40680 41161
##   ..- attr(*, "names")= chr [1:4] "40601" "40668" "40680" "41161"
booking_cancel_percentage <- 44224/(44224+75166)

booking_cancel_percentage <- booking_cancel_percentage * 100


cat(" Percentage of hotel booking : ",booking_cancel_percentage)
##  Percentage of hotel booking :  37.04163
# Perform logistic regression
logit_model <- glm(is_canceled ~ lead_time + previous_cancellations + adults + children + babies + is_repeated_guest, 
                   data = hotels_ultimate_data, 
                   family = binomial)

# Summary of the logistic regression model
summary(logit_model)
## 
## Call:
## glm(formula = is_canceled ~ lead_time + previous_cancellations + 
##     adults + children + babies + is_repeated_guest, family = binomial, 
##     data = hotels_ultimate_data)
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -1.399e+00  2.652e-02 -52.760  < 2e-16 ***
## lead_time               4.984e-03  6.322e-05  78.835  < 2e-16 ***
## previous_cancellations  2.144e+00  3.922e-02  54.673  < 2e-16 ***
## adults                  1.362e-01  1.345e-02  10.133  < 2e-16 ***
## children                1.220e-01  1.543e-02   7.910 2.57e-15 ***
## babies                 -8.429e-01  8.665e-02  -9.728  < 2e-16 ***
## is_repeated_guest      -2.161e+00  6.813e-02 -31.723  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 157197  on 119205  degrees of freedom
## Residual deviance: 141388  on 119199  degrees of freedom
## AIC: 141402
## 
## Number of Fisher Scoring iterations: 6

Explanation: Since in our database, Hotel booking cancellation percentage is 37.04% which is quite high. Here we have tried to see pattern in cancellation, lead time , history of canceling, the number of people in the booking and repeated guest.

Coefficients Each variable has different value of coefficient which mean different impact of these variable on cencellation.Although babies and is_repeated_guest has negative coefficient value -8.42 and -2.16 which means there is less likely hood of booking cencelation.

AIC (Akaike Information Criterion) value of 141402 is a measure of how well the model fits the data while penalizing for complexity. Lower AIC values indicate better fitting models relative to other models being compared.

Conclusion:
1. Lead time and previous cancellations have positive effects on the likelihood of cancellation.
2. The number of adults, children, and babies also influences cancellation likelihood positively.
3. Being a repeated guest reduces the likelihood of cancellation. 4. This analysis give insight of factor which may have affect in booking cencellation which can help hotel management to make strategies to minimize cancellations and optimize revenue.



Thank you!!!