First, initial setup, Configure the library and Load the data file
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
##
## Loading required package: carData
##
##
## Attaching package: 'car'
##
##
## The following object is masked from 'package:dplyr':
##
## recode
##
##
## The following object is masked from 'package:purrr':
##
## some
##
##
## Loading required package: fabletools
##
##
## Attaching package: 'maps'
##
##
## The following object is masked from 'package:purrr':
##
## map
##
##
##
## Attaching package: 'gridExtra'
##
##
## The following object is masked from 'package:dplyr':
##
## combine
##
##
## `modelsummary` 2.0.0 now uses `tinytable` as its default table-drawing
## backend. Learn more at: https://vincentarelbundock.github.io/tinytable/
##
## Revert to `kableExtra` for one session:
##
## options(modelsummary_factory_default = 'kableExtra')
##
## Change the default backend persistently:
##
## config_modelsummary(factory_default = 'gt')
##
## Silence this message forever:
##
## config_modelsummary(startup_message = FALSE)
## Rows: 40060 Columns: 31
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (12): ArrivalDateMonth, Meal, Country, MarketSegment, DistributionChann...
## dbl (18): IsCanceled, LeadTime, ArrivalDateYear, ArrivalDateWeekNumber, Arr...
## date (1): ReservationStatusDate
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 79330 Columns: 31
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (12): ArrivalDateMonth, Meal, Country, MarketSegment, DistributionChann...
## dbl (18): IsCanceled, LeadTime, ArrivalDateYear, ArrivalDateWeekNumber, Arr...
## date (1): ReservationStatusDate
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## tibble [119,390 × 32] (S3: tbl_df/tbl/data.frame)
## $ hotel : chr [1:119390] "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
## $ is_canceled : num [1:119390] 0 0 0 0 0 0 0 0 1 1 ...
## $ lead_time : num [1:119390] 342 737 7 13 14 14 0 9 85 75 ...
## $ arrival_date_year : num [1:119390] 2015 2015 2015 2015 2015 ...
## $ arrival_date_month : chr [1:119390] "July" "July" "July" "July" ...
## $ arrival_date_week_number : num [1:119390] 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : num [1:119390] 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : num [1:119390] 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : num [1:119390] 2 2 1 1 2 2 2 2 2 2 ...
## $ children : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : chr [1:119390] "BB" "BB" "BB" "BB" ...
## $ country : chr [1:119390] "PRT" "PRT" "GBR" "GBR" ...
## $ market_segment : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
## $ distribution_channel : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
## $ is_repeated_guest : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ reserved_room_type : chr [1:119390] "C" "C" "A" "A" ...
## $ assigned_room_type : chr [1:119390] "C" "C" "C" "A" ...
## $ booking_changes : num [1:119390] 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : chr [1:119390] "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
## $ agent : chr [1:119390] "NULL" "NULL" "NULL" "304" ...
## $ company : chr [1:119390] "NULL" "NULL" "NULL" "NULL" ...
## $ days_in_waiting_list : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : chr [1:119390] "Transient" "Transient" "Transient" "Transient" ...
## $ adr : num [1:119390] 0 0 75 75 98 ...
## $ required_car_parking_spaces : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : num [1:119390] 0 0 0 0 1 1 0 1 1 0 ...
## $ reservation_status : chr [1:119390] "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
## $ reservation_status_date : Date[1:119390], format: "2015-07-01" "2015-07-01" ...
## hotel is_canceled lead_time arrival_date_year
## Length:119390 Min. :0.0000 Min. : 0 Min. :2015
## Class :character 1st Qu.:0.0000 1st Qu.: 18 1st Qu.:2016
## Mode :character Median :0.0000 Median : 69 Median :2016
## Mean :0.3704 Mean :104 Mean :2016
## 3rd Qu.:1.0000 3rd Qu.:160 3rd Qu.:2017
## Max. :1.0000 Max. :737 Max. :2017
##
## arrival_date_month arrival_date_week_number arrival_date_day_of_month
## Length:119390 Min. : 1.00 Min. : 1.0
## Class :character 1st Qu.:16.00 1st Qu.: 8.0
## Mode :character Median :28.00 Median :16.0
## Mean :27.17 Mean :15.8
## 3rd Qu.:38.00 3rd Qu.:23.0
## Max. :53.00 Max. :31.0
##
## stays_in_weekend_nights stays_in_week_nights adults
## Min. : 0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 1.0 1st Qu.: 2.000
## Median : 1.0000 Median : 2.0 Median : 2.000
## Mean : 0.9276 Mean : 2.5 Mean : 1.856
## 3rd Qu.: 2.0000 3rd Qu.: 3.0 3rd Qu.: 2.000
## Max. :19.0000 Max. :50.0 Max. :55.000
##
## children babies meal country
## Min. : 0.0000 Min. : 0.000000 Length:119390 Length:119390
## 1st Qu.: 0.0000 1st Qu.: 0.000000 Class :character Class :character
## Median : 0.0000 Median : 0.000000 Mode :character Mode :character
## Mean : 0.1039 Mean : 0.007949
## 3rd Qu.: 0.0000 3rd Qu.: 0.000000
## Max. :10.0000 Max. :10.000000
## NA's :4
## market_segment distribution_channel is_repeated_guest
## Length:119390 Length:119390 Min. :0.00000
## Class :character Class :character 1st Qu.:0.00000
## Mode :character Mode :character Median :0.00000
## Mean :0.03191
## 3rd Qu.:0.00000
## Max. :1.00000
##
## previous_cancellations previous_bookings_not_canceled reserved_room_type
## Min. : 0.00000 Min. : 0.0000 Length:119390
## 1st Qu.: 0.00000 1st Qu.: 0.0000 Class :character
## Median : 0.00000 Median : 0.0000 Mode :character
## Mean : 0.08712 Mean : 0.1371
## 3rd Qu.: 0.00000 3rd Qu.: 0.0000
## Max. :26.00000 Max. :72.0000
##
## assigned_room_type booking_changes deposit_type agent
## Length:119390 Min. : 0.0000 Length:119390 Length:119390
## Class :character 1st Qu.: 0.0000 Class :character Class :character
## Mode :character Median : 0.0000 Mode :character Mode :character
## Mean : 0.2211
## 3rd Qu.: 0.0000
## Max. :21.0000
##
## company days_in_waiting_list customer_type adr
## Length:119390 Min. : 0.000 Length:119390 Min. : -6.38
## Class :character 1st Qu.: 0.000 Class :character 1st Qu.: 69.29
## Mode :character Median : 0.000 Mode :character Median : 94.58
## Mean : 2.321 Mean : 101.83
## 3rd Qu.: 0.000 3rd Qu.: 126.00
## Max. :391.000 Max. :5400.00
##
## required_car_parking_spaces total_of_special_requests reservation_status
## Min. :0.00000 Min. :0.0000 Length:119390
## 1st Qu.:0.00000 1st Qu.:0.0000 Class :character
## Median :0.00000 Median :0.0000 Mode :character
## Mean :0.06252 Mean :0.5714
## 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :8.00000 Max. :5.0000
##
## reservation_status_date
## Min. :2014-10-17
## 1st Qu.:2016-02-01
## Median :2016-08-07
## Mean :2016-07-30
## 3rd Qu.:2017-02-08
## Max. :2017-09-14
##
This dataset contains information about hotel bookings with a total of 119390 records for two hotel groups - “City Hotel” and “Resort Hotel” with various attributes of guests and their stays (Abstract is given above). It encompasses diverse variables such as Average daily rate, cancellation, Arrival date, Length of stay, Guest demographics, Booking channels, and Special requests. The data spans multiple years and months.It is providing a comprehensive overview of booking patterns over time. The dataset variables like lead time, cancellation, meal, length of stay, etc. can be used in detailed analysis to find the pattern or message in the dataset which can help formulate the business strategy.
| variable | description |
|---|---|
| hotel | Hotel |
| is_canceled | Value indicating if the booking was canceled (1) or not (0) |
| lead_time | Number of days that elapsed between the entering date of the booking into the PMS and the arrival date |
| arrival_date_year | Year of arrival date |
| arrival_date_month | Month of arrival date |
| arrival_date_week_number | Week number of year for arrival date |
| arrival_date_day_of_month | Day of arrival date |
| stays_in_weekend_nights | Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel |
| stays_in_week_nights | Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel |
| adults | Number of adults |
| children | Number of children |
| babies | Number of babies |
| meal | Type of meal booked. Categories are presented in standard hospitality meal packages: Undefined/SC – no meal package; BB – Bed & Breakfast; HB – Half board (breakfast and one other meal – usually dinner); FB – Full board (breakfast, lunch and dinner) |
| country | Country of origin. Categories are represented in the ISO 3155–3:2013 format |
| market_segment | Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators” |
| distribution_channel | Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators” |
| is_repeated_guest | Value indicating if the booking name was from a repeated guest (1) or not (0) |
| previous_cancellations | Number of previous bookings that were cancelled by the customer prior to the current booking |
| previous_bookings_not_canceled | Number of previous bookings not cancelled by the customer prior to the current booking |
| reserved_room_type | Code of room type reserved. Code is presented instead of designation for anonymity reasons |
| assigned_room_type | Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons |
| booking_changes | Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation |
| deposit_type | Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories: No Deposit – no deposit was made; Non Refund – a deposit was made in the value of the total stay cost; Refundable – a deposit was made with a value under the total cost of stay. |
| agent | ID of the travel agency that made the booking |
| company | ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons |
| customer_type | Type of booking, assuming one of four categories: Contract - when the booking has an allotment or other type of contract associated to it; Group – when the booking is associated to a group; Transient – when the booking is not part of a group or contract, and is not associated to other transient booking; Transient-party – when the booking is transient, but is associated to at least other transient booking |
| adr | Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights |
| required_car_parking_spaces | Number of car parking spaces required by the customer |
| total_of_special_requests | Number of special requests made by the customer (e.g. twin bed or high floor) |
| reservation_status | Reservation last status, assuming one of three categories: Canceled – booking was canceled by the customer; Check-Out – customer has checked in but already departed; No-Show – customer did not check-in and did inform the hotel of the reason why |
| reservation_status_date | Date at which the last status was set. This variable can be used in conjunction with the ReservationStatus to understand when was the booking canceled or when did the customer checked-out of the hotel |
# Calculate the proportions of each hotel type
hotel_counts <- table(hotel_row$hotel)
hotel_percentages <- prop.table(hotel_counts) * 100
# Create a data frame with hotel types and percentages
hotel_percentages_df <- data.frame(
hotel = names(hotel_percentages),
percentage = hotel_percentages
)
options(repr.plot.width = 20, repr.plot.height = 15)
# Create the pie chart using ggplot
pie_chart <- ggplot(hotel_percentages_df, aes(x = "", y = hotel_percentages_df$percentage.Freq, fill = hotel)) +
geom_bar(stat = "identity", width = 1) +
geom_text(aes(label = paste0(round(hotel_percentages_df$percentage.Freq, 1), "%")), position = position_stack(vjust = 1.0), size = 4) + # Add data points
coord_polar("y", start = 0) +
labs(title = "Both Hotels data Contribution in Dataset", fill = "Hotel Type", x = NULL, y = NULL) +
scale_fill_manual(values = c("#47f5df", "#ea47f5")) +
theme_minimal()
# Print the pie chart
print(pie_chart)
#Data summary
skim(hotel_row)
| Name | hotel_row |
| Number of rows | 119390 |
| Number of columns | 32 |
| _______________________ | |
| Column type frequency: | |
| character | 13 |
| Date | 1 |
| numeric | 18 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| hotel | 0 | 1 | 10 | 12 | 0 | 2 | 0 |
| arrival_date_month | 0 | 1 | 3 | 9 | 0 | 12 | 0 |
| meal | 0 | 1 | 2 | 9 | 0 | 5 | 0 |
| country | 0 | 1 | 2 | 4 | 0 | 178 | 0 |
| market_segment | 0 | 1 | 6 | 13 | 0 | 8 | 0 |
| distribution_channel | 0 | 1 | 3 | 9 | 0 | 5 | 0 |
| reserved_room_type | 0 | 1 | 1 | 1 | 0 | 10 | 0 |
| assigned_room_type | 0 | 1 | 1 | 1 | 0 | 12 | 0 |
| deposit_type | 0 | 1 | 10 | 10 | 0 | 3 | 0 |
| agent | 0 | 1 | 1 | 4 | 0 | 334 | 0 |
| company | 0 | 1 | 1 | 4 | 0 | 353 | 0 |
| customer_type | 0 | 1 | 5 | 15 | 0 | 4 | 0 |
| reservation_status | 0 | 1 | 7 | 9 | 0 | 3 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| reservation_status_date | 0 | 1 | 2014-10-17 | 2017-09-14 | 2016-08-07 | 926 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| is_canceled | 0 | 1 | 0.37 | 0.48 | 0.00 | 0.00 | 0.00 | 1 | 1 | ▇▁▁▁▅ |
| lead_time | 0 | 1 | 104.01 | 106.86 | 0.00 | 18.00 | 69.00 | 160 | 737 | ▇▂▁▁▁ |
| arrival_date_year | 0 | 1 | 2016.16 | 0.71 | 2015.00 | 2016.00 | 2016.00 | 2017 | 2017 | ▃▁▇▁▆ |
| arrival_date_week_number | 0 | 1 | 27.17 | 13.61 | 1.00 | 16.00 | 28.00 | 38 | 53 | ▅▇▇▇▅ |
| arrival_date_day_of_month | 0 | 1 | 15.80 | 8.78 | 1.00 | 8.00 | 16.00 | 23 | 31 | ▇▇▇▇▆ |
| stays_in_weekend_nights | 0 | 1 | 0.93 | 1.00 | 0.00 | 0.00 | 1.00 | 2 | 19 | ▇▁▁▁▁ |
| stays_in_week_nights | 0 | 1 | 2.50 | 1.91 | 0.00 | 1.00 | 2.00 | 3 | 50 | ▇▁▁▁▁ |
| adults | 0 | 1 | 1.86 | 0.58 | 0.00 | 2.00 | 2.00 | 2 | 55 | ▇▁▁▁▁ |
| children | 4 | 1 | 0.10 | 0.40 | 0.00 | 0.00 | 0.00 | 0 | 10 | ▇▁▁▁▁ |
| babies | 0 | 1 | 0.01 | 0.10 | 0.00 | 0.00 | 0.00 | 0 | 10 | ▇▁▁▁▁ |
| is_repeated_guest | 0 | 1 | 0.03 | 0.18 | 0.00 | 0.00 | 0.00 | 0 | 1 | ▇▁▁▁▁ |
| previous_cancellations | 0 | 1 | 0.09 | 0.84 | 0.00 | 0.00 | 0.00 | 0 | 26 | ▇▁▁▁▁ |
| previous_bookings_not_canceled | 0 | 1 | 0.14 | 1.50 | 0.00 | 0.00 | 0.00 | 0 | 72 | ▇▁▁▁▁ |
| booking_changes | 0 | 1 | 0.22 | 0.65 | 0.00 | 0.00 | 0.00 | 0 | 21 | ▇▁▁▁▁ |
| days_in_waiting_list | 0 | 1 | 2.32 | 17.59 | 0.00 | 0.00 | 0.00 | 0 | 391 | ▇▁▁▁▁ |
| adr | 0 | 1 | 101.83 | 50.54 | -6.38 | 69.29 | 94.58 | 126 | 5400 | ▇▁▁▁▁ |
| required_car_parking_spaces | 0 | 1 | 0.06 | 0.25 | 0.00 | 0.00 | 0.00 | 0 | 8 | ▇▁▁▁▁ |
| total_of_special_requests | 0 | 1 | 0.57 | 0.79 | 0.00 | 0.00 | 0.00 | 1 | 5 | ▇▁▁▁▁ |
n_missing field in above table.) which was
defined as NA.# Hotel data group by column "hotel"
hotel_row %>%
group_by(hotel) %>%
summarise(count=n())
## # A tibble: 2 × 2
## hotel count
## <chr> <int>
## 1 City Hotel 79330
## 2 Resort Hotel 40060
# Hotel data group by column "is_canceled:
hotel_row %>%
group_by(is_canceled) %>%
summarise(count=n())
## # A tibble: 2 × 2
## is_canceled count
## <dbl> <int>
## 1 0 75166
## 2 1 44224
hotel_row %>%
group_by(arrival_date_year) %>%
summarise(count=n())
## # A tibble: 3 × 2
## arrival_date_year count
## <dbl> <int>
## 1 2015 21996
## 2 2016 56707
## 3 2017 40687
hotel_row %>%
group_by(meal) %>%
summarise(count=n())
## # A tibble: 5 × 2
## meal count
## <chr> <int>
## 1 BB 92310
## 2 FB 798
## 3 HB 14463
## 4 SC 10650
## 5 Undefined 1169
hotel_row %>%
group_by(country) %>%
summarise(count=n())
## # A tibble: 178 × 2
## country count
## <chr> <int>
## 1 ABW 2
## 2 AGO 362
## 3 AIA 1
## 4 ALB 12
## 5 AND 7
## 6 ARE 51
## 7 ARG 214
## 8 ARM 8
## 9 ASM 1
## 10 ATA 2
## # ℹ 168 more rows
hotel_row %>%
group_by(market_segment) %>%
summarise(count=n())
## # A tibble: 8 × 2
## market_segment count
## <chr> <int>
## 1 Aviation 237
## 2 Complementary 743
## 3 Corporate 5295
## 4 Direct 12606
## 5 Groups 19811
## 6 Offline TA/TO 24219
## 7 Online TA 56477
## 8 Undefined 2
[Repeated guest (1) or not (0)]
hotel_row %>%
group_by(distribution_channel) %>%
summarise(count=n())
## # A tibble: 5 × 2
## distribution_channel count
## <chr> <int>
## 1 Corporate 6677
## 2 Direct 14645
## 3 GDS 193
## 4 TA/TO 97870
## 5 Undefined 5
hotel_row %>%
group_by(is_repeated_guest) %>%
summarise(count=n())
## # A tibble: 2 × 2
## is_repeated_guest count
## <dbl> <int>
## 1 0 115580
## 2 1 3810
hotel_row %>%
group_by(reserved_room_type) %>%
summarise(count=n())
## # A tibble: 10 × 2
## reserved_room_type count
## <chr> <int>
## 1 A 85994
## 2 B 1118
## 3 C 932
## 4 D 19201
## 5 E 6535
## 6 F 2897
## 7 G 2094
## 8 H 601
## 9 L 6
## 10 P 12
hotel_row %>%
group_by(reservation_status) %>%
summarise(count=n())
## # A tibble: 3 × 2
## reservation_status count
## <chr> <int>
## 1 Canceled 43017
## 2 Check-Out 75166
## 3 No-Show 1207
hotel_data_without_na <- hotel_row %>%
na.omit(hotel_data)
cat("Before data cleaning, Number of rows : ",nrow(hotel_row),"\n")
## Before data cleaning, Number of rows : 119390
cat("After data cleaning, Number of rows : ",nrow(hotel_data_without_na),"\n")
## After data cleaning, Number of rows : 119386
hotel_data_filtered <- hotel_data_without_na %>%
filter((hotel_data_without_na$adults == 0 & hotel_data_without_na$reservation_status == "Check-Out"))
adt_0_chl <-nrow(hotel_data_filtered)
cat("Number record which has adult as zero with checkout status : ",adt_0_chl,"\n")
## Number record which has adult as zero with checkout status : 294
hotel_temp <- subset(hotel_data_without_na, !(adults == 0 & reservation_status == "Check-Out"))
#nrow(hotel_temp)
hotel_temp <-
subset(hotel_data_without_na, hotel_data_without_na$adults == 0 & hotel_data_without_na$children == 0 & hotel_data_without_na$babies ==0)
hotels_ultimate_data<- hotel_data_without_na %>%
filter(adults +children +babies !=0)
nrow(hotels_ultimate_data)
## [1] 119206
In hotel dataset, We can divide hotel guests into four major groups
i.e. Adult, Adult with Children,Adult with babies and Adult with
Children & Babies.
Q. what is the distribution of these
groups across the two types of hotels?
guest_groups <- hotels_ultimate_data %>% select(hotel, adults,children,babies)
guest_classification_group <-guest_groups %>%
mutate(group= case_when(adults > 0 & children == 0 & babies == 0 ~ 'Adult',
adults > 0 & children > 0 & babies == 0 ~ 'Adult_Children',
adults > 0 & children == 0 & babies > 0 ~ 'Adult_Babies',
adults > 0 & children > 0 & babies > 0 ~ 'Adult_Children_Babies',
adults == 0 & children > 0 & babies == 0 ~ 'Children',
adults == 0 & children == 0 & babies > 0 ~ 'Babies',
adults == 0 & children > 0 & babies > 0 ~ 'Children_Babies'
))
guest_classification_group %>%
group_by(group) %>%
summarise(count=n())
## # A tibble: 6 × 2
## group count
## <chr> <int>
## 1 Adult 109874
## 2 Adult_Babies 742
## 3 Adult_Children 8195
## 4 Adult_Children_Babies 172
## 5 Children 220
## 6 Children_Babies 3
After classifying the guest into groups, I found that there were 223 rows for which Adult count was zero but children and babies had some value. This was also making no sense in real world that children and babies visited hotel without Adult.
guest_group_hotel <- guest_classification_group %>% select(hotel,group)
guest_group_hotel <- guest_group_hotel %>% group_by(hotel,group) %>% summarise(count=n())
## `summarise()` has grouped output by 'hotel'. You can override using the
## `.groups` argument.
guest_group_CityHotel <-
guest_group_hotel %>%
filter(hotel == "City Hotel")
guest_group_ResortHotel <-
guest_group_hotel %>%
filter(hotel == "Resort Hotel")
skim(guest_group_CityHotel)
| Name | guest_group_CityHotel |
| Number of rows | 6 |
| Number of columns | 3 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 1 |
| ________________________ | |
| Group variables | hotel |
Variable type: character
| skim_variable | hotel | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|---|
| group | City Hotel | 0 | 1 | 5 | 21 | 0 | 6 | 0 |
Variable type: numeric
| skim_variable | hotel | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | City Hotel | 0 | 1 | 13193.17 | 29728.46 | 3 | 106.75 | 258.5 | 3684.75 | 73756 | ▇▁▁▁▂ |
skim(guest_group_ResortHotel)
| Name | guest_group_ResortHotel |
| Number of rows | 4 |
| Number of columns | 3 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 1 |
| ________________________ | |
| Group variables | hotel |
Variable type: character
| skim_variable | hotel | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|---|
| group | Resort Hotel | 0 | 1 | 5 | 21 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | hotel | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | Resort Hotel | 0 | 1 | 10011.75 | 17466.25 | 103 | 359.5 | 1913 | 11565.25 | 36118 | ▇▁▁▁▂ |
guest_classification_group %>%
mutate(group = factor(group, levels=c("Adult","Adult_Children","Adult_Babies","Children","Adult_Children_Babies", "Children_Babies"))) %>%
ggplot(aes(x = group,
y = stat(count),
fill = factor(hotel),
label = stat(count) ))+
geom_bar(position = position_dodge(),width = .5)+
geom_text(stat = "count",
position = position_dodge(1),
vjust = -0.5,
size = 3)+
labs(title = "Guest groups distribution by hotel",
x = "Guest group",
y = "Count")+
theme(plot.title = element_text(hjust=0.5),
axis.text.x = element_text(size = 7),
axis.text.y = element_text(size = 7))+
scale_fill_manual(values = c("#47f5df", "#ea47f5"),
name= "Hotel"
)
1. There are two main groups -“Adults” and “Adults with
Children”which have most of data.
2. There are 220 children of City
Hotel with no Adults.
3. There are 3 children_Babies and with no
Adults.
print(guest_group_hotel)
## # A tibble: 10 × 3
## # Groups: hotel [2]
## hotel group count
## <chr> <chr> <int>
## 1 City Hotel Adult 73756
## 2 City Hotel Adult_Babies 297
## 3 City Hotel Adult_Children 4814
## 4 City Hotel Adult_Children_Babies 69
## 5 City Hotel Children 220
## 6 City Hotel Children_Babies 3
## 7 Resort Hotel Adult 36118
## 8 Resort Hotel Adult_Babies 445
## 9 Resort Hotel Adult_Children 3381
## 10 Resort Hotel Adult_Children_Babies 103
country_guest <- hotels_ultimate_data %>%
filter(is_canceled==0) %>%
select(hotel, country) %>%
group_by(hotel,country) %>%
summarise(count=n(),.groups = 'drop')
country_guest_R_hotel <- country_guest %>%
filter(hotel== "Resort Hotel") %>%
arrange(desc(count))
country_guest_R_hotel %>%
arrange(desc(count))
## # A tibble: 119 × 3
## hotel country count
## <chr> <chr> <int>
## 1 Resort Hotel PRT 10184
## 2 Resort Hotel GBR 5922
## 3 Resort Hotel ESP 3105
## 4 Resort Hotel IRL 1734
## 5 Resort Hotel FRA 1399
## 6 Resort Hotel DEU 1057
## 7 Resort Hotel CN 614
## 8 Resort Hotel NLD 458
## 9 Resort Hotel NULL 419
## 10 Resort Hotel USA 407
## # ℹ 109 more rows
top_15_countries_r <- head(country_guest_R_hotel, 15)
# Plotting
ggplot(top_15_countries_r, aes(x = country, y = count)) +
geom_bar(stat = "identity", fill = "#ea47f5") +
labs(title = "Country-wise - Count of Guests in Resort Hotel",
x = "Country",
y = "Count") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
1. Top five country of Resort hotel on the basis of visited guests
are, Portugal (PRT), United Kingdom(GBR), Spain(ESP), Ireland (IRL) and
France(FRA).
country_guest_C_hotel <- country_guest %>%
filter(hotel== "City Hotel") %>%
arrange(desc(count))
country_guest_C_hotel %>%
arrange(desc(count))
## # A tibble: 151 × 3
## hotel country count
## <chr> <chr> <int>
## 1 City Hotel PRT 10793
## 2 City Hotel FRA 7069
## 3 City Hotel DEU 5010
## 4 City Hotel GBR 3746
## 5 City Hotel ESP 3278
## 6 City Hotel ITA 2049
## 7 City Hotel BEL 1479
## 8 City Hotel NLD 1258
## 9 City Hotel USA 1185
## 10 City Hotel BRA 1063
## # ℹ 141 more rows
top_15_countries_c <- head(country_guest_C_hotel, 15)
# Plotting
ggplot(top_15_countries_c, aes(x = country, y = count)) +
geom_bar(stat = "identity", fill = "#47f5df") +
labs(title = "Country-wise - Count of Guests in City Hotel",
x = "Country",
y = "Count") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
2. Top five country of City hotel on the basis of visited guests
are Portugal (PRT),France(FRA), Germany (DEU) , United Kingdom(GBR),
Spain(ESP)
repeated_rate <- hotels_ultimate_data %>%
filter(is_canceled == 0) %>%
select(hotel, is_repeated_guest)
repeated_rate %>%
group_by(hotel, is_repeated_guest) %>%
summarise(count=n())
## `summarise()` has grouped output by 'hotel'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 3
## # Groups: hotel [2]
## hotel is_repeated_guest count
## <chr> <dbl> <int>
## 1 City Hotel 0 44546
## 2 City Hotel 1 1538
## 3 Resort Hotel 0 27260
## 4 Resort Hotel 1 1667
City_rep_guest <- (1538/(1538+44546))*100
cat(" In City Hotel, Repeated guest %: " , City_rep_guest ,"\n")
## In City Hotel, Repeated guest %: 3.337384
Resort_rep_guest <- (1667/(1667+27260))*100
cat(" In Resort Hotel, Repeated guest %: " , Resort_rep_guest ,"\n")
## In Resort Hotel, Repeated guest %: 5.762782
#Select adr and associated variable
hotels_data_adr <- hotels_ultimate_data %>% select (hotel,adr,is_canceled,assigned_room_type,reservation_status_date, arrival_date_year, arrival_date_month)
# Arranging hotel data accoring to adr(Average Daily Rate).
hotels_data_adr <-hotels_data_adr[order(hotels_data_adr$adr,decreasing = TRUE),]
#Display top 10 row for columns hotel and adr.
#head(hotels_data_adr[,c("hotel","assigned_room_type","adr")],10)
# Room type in hotel
hotel_unique_room_type <- unique(hotels_data_adr$assigned_room_type)
cat("Room type in hotel : ",hotel_unique_room_type)
## Room type in hotel : A G C E H F D I B K L
hotels_data_adr %>%
group_by(hotel,assigned_room_type) %>%
summarise(sum(adr))
## `summarise()` has grouped output by 'hotel'. You can override using the
## `.groups` argument.
## # A tibble: 18 × 3
## # Groups: hotel [2]
## hotel assigned_room_type `sum(adr)`
## <chr> <chr> <dbl>
## 1 City Hotel A 5509837.
## 2 City Hotel B 187999.
## 3 City Hotel C 16181.
## 4 City Hotel D 1822770.
## 5 City Hotel E 312072.
## 6 City Hotel F 360519.
## 7 City Hotel G 128932.
## 8 City Hotel K 13521.
## 9 Resort Hotel A 1387440.
## 10 Resort Hotel B 16178.
## 11 Resort Hotel C 253200.
## 12 Resort Hotel D 898143.
## 13 Resort Hotel E 606727.
## 14 Resort Hotel F 209220.
## 15 Resort Hotel G 296020.
## 16 Resort Hotel H 122023.
## 17 Resort Hotel I 14826.
## 18 Resort Hotel L 8
Room type: A and D are in trends and generating major
revenue.
hotels_data_adr %>%
group_by(hotel,assigned_room_type) %>%
summarise(count=n())
## `summarise()` has grouped output by 'hotel'. You can override using the
## `.groups` argument.
## # A tibble: 18 × 3
## # Groups: hotel [2]
## hotel assigned_room_type count
## <chr> <chr> <int>
## 1 City Hotel A 56977
## 2 City Hotel B 1991
## 3 City Hotel C 158
## 4 City Hotel D 14971
## 5 City Hotel E 2160
## 6 City Hotel F 2018
## 7 City Hotel G 697
## 8 City Hotel K 187
## 9 Resort Hotel A 17043
## 10 Resort Hotel B 159
## 11 Resort Hotel C 2212
## 12 Resort Hotel D 10338
## 13 Resort Hotel E 5638
## 14 Resort Hotel F 1733
## 15 Resort Hotel G 1852
## 16 Resort Hotel H 712
## 17 Resort Hotel I 359
## 18 Resort Hotel L 1
How does the distribution of reservation statuses vary over time?
# Convert reservation_status_date to Date type
hotels_ultimate_data$reservation_status_date <- as.Date(hotels_ultimate_data$reservation_status_date)
# Plot the distribution of reservation status over time
reservation_status_plot <- ggplot(hotels_ultimate_data, aes(x = reservation_status_date, fill = reservation_status)) +
geom_bar() +
labs(title = "Distribution of Reservation Status Over Time",
x = "Reservation Status Date",
y = "Count",
fill = "Reservation Status") +
theme_minimal()
# Print the plot
print(reservation_status_plot)
Conclusion: Reservation Status Date Analysis provides insights
into the distribution and trends of reservation statuses over time.
1. By examining the distribution of reservation statuses over time, It
can help business to identify patterns and trends in the booking
activity of the hotel.
2. Analysis of reservation statuses over time
can reveal insights into guest booking behavior, such as the frequency
of cancellations, No-show, and successful Stay.
3. This analysis can
also help business in forecasting and planning to anticipate
demand fluctuations and implement strategies to optimize revenue and
guest satisfaction.
#head(hotel_booking_trends)
# Hotel booking trends over time
hotel_booking_trends <- hotels_ultimate_data %>%
group_by(hotel, arrival_date_year) %>%
summarise(total_bookings = n()) %>%
ggplot(aes(x = factor(arrival_date_year), y = total_bookings, fill = hotel)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Hotel Booking Trends Over Time",
x = "Year",
y = "Total Bookings") +
theme_minimal()
## `summarise()` has grouped output by 'hotel'. You can override using the
## `.groups` argument.
# Guest demographics analysis
guest_demographics <- hotels_ultimate_data %>%
group_by(hotel, country) %>%
summarise(total_guests = n_distinct(customer_type)) %>%
ggplot(aes(x = country, y = total_guests, fill = hotel)) +
geom_bar(stat = "identity") +
labs(title = "Guest Demographics Analysis",
x = "Country",
y = "Total Guests") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
## `summarise()` has grouped output by 'hotel'. You can override using the
## `.groups` argument.
# Revenue analysis
revenue_analysis <- hotels_ultimate_data %>%
group_by(hotel) %>%
summarise(total_revenue = sum(adr * (stays_in_weekend_nights + stays_in_week_nights))) %>%
ggplot(aes(x = hotel, y = total_revenue, fill = hotel)) +
geom_bar(stat = "identity") +
labs(title = "Revenue Analysis by Hotel",
x = "Hotel",
y = "Total Revenue") +
theme_minimal()
grid.arrange(hotel_booking_trends, revenue_analysis, nrow = 1)
Observation :
# Filter top 10 countries for each hotel
top_countries <- hotels_ultimate_data %>%
group_by(hotel, country) %>%
summarise(total_guests = n_distinct(customer_type)) %>%
arrange(hotel, desc(total_guests)) %>%
group_by(hotel) %>%
top_n(10)
## `summarise()` has grouped output by 'hotel'. You can override using the
## `.groups` argument.
## Selecting by total_guests
# Plot guest demographics analysis for top 10 countries
guest_demographics_top10 <- top_countries %>%
ggplot(aes(x = country, y = total_guests, fill = hotel)) +
geom_bar(stat = "identity") +
labs(title = "Guest Demographics Analysis (Top 10 Countries)",
x = "Country",
y = "Total Guests") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Print the plot
print(guest_demographics_top10)
Explanation: by analyzing above graph, here are some outlines.
1.
Hotel Booking Trends Over Time of analyzing the trend in booking
for both hotels over time. Here, City hotel is leading the trends in all
three years. Although in 2017, Trends went little bit downside.
2.
Revenue Analysis is a revenue metric which has been drived by
average daily rate (ADR), Stay number of night,total revenue for each
hotel.City Hotel’s revenue is higher than resort hotel.
3. In
Guest Demographics graph, We have drawn the top ten countries of
guest of hotels. which help business to make the strategy for their
guest welcome,stay , hotel arrangement and future strategy for them
attract.
Question: How has the Average Daily Rate (ADR) for hotel bookings evolved over time?
# Calculate ADR (Average Daily Rate)
hotels_ultimate_data$ADR <- hotels_ultimate_data$adr / hotels_ultimate_data$stays_in_weekend_nights + hotels_ultimate_data$stays_in_week_nights
# Aggregate ADR by date
ADR_by_date <- hotels_ultimate_data %>%
group_by(reservation_status_date) %>%
summarise(Total_ADR = mean(ADR, na.rm = TRUE))
head(ADR_by_date,10)
## # A tibble: 10 × 2
## reservation_status_date Total_ADR
## <date> <dbl>
## 1 2014-10-17 Inf
## 2 2014-11-18 3
## 3 2015-01-01 Inf
## 4 2015-01-02 9.54
## 5 2015-01-18 2
## 6 2015-01-20 41.2
## 7 2015-01-21 Inf
## 8 2015-01-22 Inf
## 9 2015-01-28 127
## 10 2015-01-29 Inf
# Plot Total ADR over time
ggplot(ADR_by_date, aes(x = reservation_status_date, y = Total_ADR)) +
geom_line() +
labs(title = "Total ADR Over Time", x = "Date", y = "Total ADR") +
theme_minimal()
Observation : Graph has continous pick for 2016, means there is some special event in 2016.
# Calculate Total ADR for each hotel
total_adr <- hotels_ultimate_data %>%
group_by(hotel) %>%
summarise(total_adr = mean(adr, na.rm = TRUE))
# Plot Total ADR for each hotel
ggplot(total_adr, aes(x = hotel, y = total_adr, fill = hotel)) +
geom_bar(stat = "identity") +
labs(title = "Total ADR Analysis by Hotel",
x = "Hotel",
y = "Average Daily Rate (ADR)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Explanation:
Question : How does the Average Daily Rate (ADR) vary across different reserved room types in the hotels?
adr_boxplot_resort <- ggplot(hotel1, aes(x = reserved_room_type, y = adr, fill = factor(hotel))) +
geom_boxplot(position = position_dodge(),width = .7, outlier.shape = NA) +
labs(title = "ADR Distribution by Reserved \n Room Type (Resort Hotel)",
x = "Reserved Room Type",
y = "ADR") +
scale_fill_discrete( name="Resort Hotel")+
theme(plot.title = element_text(hjust=0.5))
adr_boxplot_city <- ggplot(hotel2, aes(x = reserved_room_type, y = adr, fill = factor(hotel))) +
geom_boxplot(position = position_dodge(), width = 0.7, outlier.shape = NA) +
labs(title = "ADR Distribution by Reserved \n Room Type (City Hotel)",
x = "Reserved Room Type",
y = "ADR") +
scale_fill_manual(name = "City Hotel", values = c("#47f5df")) +
theme(plot.title = element_text(hjust = 0.5)) +
scale_y_continuous(labels = function(x) paste0(x/100, "00"))
# Print the boxplot
#print(adr_boxplot)
grid.arrange(adr_boxplot_resort,adr_boxplot_city, nrow = 1)
Explanation and Conclusion: The ADR Distribution by Reserved Room
Type plot illustrates the distribution of Average Daily Rate (ADR)
across various room types that guests have reserved in the hotels.Above
graph is providing the insight on variability of ADR among different
room types which helps to understand pricing and preferences of room
type in guest choice.
This analysis will help hotel management to
optimize the dynamic pricing and resource allocation according to the
demand of room type.
# linear regression Model with adr and lead time
model_lm_adr_ld_tm <- lm(adr ~ lead_time, data = hotels_ultimate_data)
coef(model_lm_adr_ld_tm)
## (Intercept) lead_time
## 105.16823992 -0.03070445
summary(model_lm_adr_ld_tm)
##
## Call:
## lm(formula = adr ~ lead_time, data = hotels_ultimate_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -105.6 -31.3 -7.3 23.8 5295.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 105.168240 0.203492 516.82 <2e-16 ***
## lead_time -0.030704 0.001364 -22.51 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.33 on 119204 degrees of freedom
## Multiple R-squared: 0.004234, Adjusted R-squared: 0.004225
## F-statistic: 506.8 on 1 and 119204 DF, p-value: < 2.2e-16
#modelsummary(model_lm_adr_ld_tm)
# Plot scatter for lm_lead_adr_mod_1
lm_lead_adr_mod_1_scatter <- ggplot(hotels_ultimate_data, aes(x = lead_time, y = adr)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "blue") +
labs(title = "Linear Regression:\nADR vs. Lead Time",
x = "Lead Time",
y = "ADR")
#Plot Linner for lm_lead_adr_mod_1
lm_lead_adr_mod_1_linner_2 <- ggplot(hotels_ultimate_data, aes(x = lead_time, y = adr)) +
geom_smooth(method = "lm", se = FALSE, color = "blue") +
labs(title = "Linear Regression: \n ADR vs. Lead Time",
x = "Lead Time",
y = "ADR")
#Arrange above two graph with single grid
grid.arrange(lm_lead_adr_mod_1_scatter,lm_lead_adr_mod_1_linner_2, nrow = 1)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
Explanation:
✔ Interpretation : The Intercept is 105.168
(estimated ADR) when lead time is zero which is unlikely hood in our
hotel bookings case.It might not have practical interpretation since it
is unlikely to have a lead time zero.
✔ Significance :The Coefficients for lead time is -0.031. It indicates the estimated change in the average daily rate for a room increase in lead time. In our case,it suggests that for each additional day of lead time, the average daily rate decreases by $0.02983 which is unlike in realworld. Here low p-value: < 2.2e-16 is indicating that model is statistically significant for variable leadtime and adr relationship.
✔ Conclusion : According to above result, there is a significant negative relationship between lead time and ARD. It is indicating that as lead time increase, ADR tends to decrease.
#Build a linear regression model with adr , lead_time and arrival_date_month,required_car_parking_spaces as predictors
lm_lead_adr_mod_2 <- lm(adr ~ lead_time + arrival_date_day_of_month+required_car_parking_spaces, data = hotels_ultimate_data)
# Calculates the predicted average daily rate (ADR) for each observation in the dataset using linner regression model i.e which is lm_lead_adr_mod_2
hotels_ultimate_data$predicted_adr <- predict(lm_lead_adr_mod_2, newdata = hotels_ultimate_data)
print("*************************************")
## [1] "*************************************"
summary(lm_lead_adr_mod_2)
##
## Call:
## lm(formula = adr ~ lead_time + arrival_date_day_of_month + required_car_parking_spaces,
## data = hotels_ultimate_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -144.3 -31.3 -7.1 23.9 5295.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 101.537505 0.335151 302.96 <2e-16 ***
## lead_time -0.028022 0.001371 -20.44 <2e-16 ***
## arrival_date_day_of_month 0.171996 0.016573 10.38 <2e-16 ***
## required_car_parking_spaces 10.136894 0.597182 16.98 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.24 on 119202 degrees of freedom
## Multiple R-squared: 0.007556, Adjusted R-squared: 0.007531
## F-statistic: 302.5 on 3 and 119202 DF, p-value: < 2.2e-16
print("*************************************")
## [1] "*************************************"
coef(lm_lead_adr_mod_2)
## (Intercept) lead_time
## 101.53750477 -0.02802225
## arrival_date_day_of_month required_car_parking_spaces
## 0.17199568 10.13689379
print("*************************************")
## [1] "*************************************"
vif(lm_lead_adr_mod_2)
## lead_time arrival_date_day_of_month
## 1.013802 1.000084
## required_car_parking_spaces
## 1.013871
print("*************************************")
## [1] "*************************************"
ncvTest(lm_lead_adr_mod_2)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 5856.871, Df = 1, p = < 2.22e-16
print("*************************************")
## [1] "*************************************"
# Plot scatter for lm_lead_adr_mod_2
lm_lead_adr_mod_2_scatter <- ggplot(hotels_ultimate_data, aes(x = predicted_adr, y = adr)) +
geom_point() + # Plot the observed values
geom_smooth(method = "lm", se = FALSE, color = "blue") + # Add linear regression line
labs(title = "Linear Regression:\n Observed vs. Predicted ADR",
x = "Predicted ADR",
y = "Observed ADR")
#Plot Linner for lm_lead_adr_mod_2
lm_lead_adr_mod_2_linner <- ggplot(hotels_ultimate_data, aes(x = predicted_adr, y = adr)) +
geom_smooth(method = "lm", se = FALSE, color = "blue") +
labs(title = "Linear Regression:\n Observed vs. Predicted ADR",
x = "Predicted ADR",
y = "Observed ADR")
#Arrange above two graph with single grid
grid.arrange(lm_lead_adr_mod_2_scatter,lm_lead_adr_mod_2_linner, nrow = 1)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
Explanation : + In above model,We have created a linear
regression model between lead time along with arrival month ,car parking
request on ARD (This type of scenario is significant where car parking
is one of the major factor for hotel booking.)
+ Then calculated
the predicted values(hotels_ultimate_data$predicted_adr) based on a
fitted model which is lm_lead_adr_mod_2 for or each observation of
dataset.
✔ Interpretation : The Intercept is 101.537 (estimated ADR) when all predictor variables - lead time, arrival date day of month, and required car parking spaces are zero. (This Intercept value 101.537 is almost same of above model 105.168 )
✔ Significance : Cofficient value for lead time is -0.028022 (almost same of above model -.030).Cofficient for arrival_date_day_of_month is 0.172 menas for each day in the month of arrival, ADR might have impact of .172 and Cofficient for required_car_parking_spaces is 10.137 menas if guest have need of car parking then ADR may have impact of 10.137 unit price.
✔ Conclusion : Lead time along with month of arrival, and special car parking request collectivly have a significant (as p-value: < 2.2e-16 is low) but relatively small influence on the ADR of hotel bookings. Lead time has negative impact and car parking request has positive impact on ADR whereas the month of arrival has a very smaller positive effect on ADR. Overall impact of these factors is very small on ADR.
Recommendations:
Question: How does the lead time ( “adr - The number of days between booking and arrival”) impact the likelihood of cancellation in hotel bookings?
hotel_city_lead_cancellation <- hotels_ultimate_data %>%
filter(hotel== "City Hotel")
hotel_resort_lead_cancellation <- hotels_ultimate_data %>%
filter(hotel== "Resort Hotel")
nrow(hotel_city_lead_cancellation)
## [1] 79159
sm_model_hotel_city_lead_cancellation <- lm(lead_time~is_canceled,hotel_city_lead_cancellation)
print("***********************************************************************")
## [1] "***********************************************************************"
summary(sm_model_hotel_city_lead_cancellation)
##
## Call:
## lm(formula = lead_time ~ is_canceled, data = hotel_city_lead_cancellation)
##
## Residuals:
## Min 1Q Median 3Q Max
## -150.34 -75.85 -32.34 52.15 478.66
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 80.8547 0.4916 164.46 <2e-16 ***
## is_canceled 69.4894 0.7606 91.36 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 105.5 on 79157 degrees of freedom
## Multiple R-squared: 0.09539, Adjusted R-squared: 0.09538
## F-statistic: 8347 on 1 and 79157 DF, p-value: < 2.2e-16
print("***********************************************************************")
## [1] "***********************************************************************"
sm_model_hotel_resort_lead_cancellation <- lm(lead_time~is_canceled,hotel_resort_lead_cancellation)
print("***********************************************************************")
## [1] "***********************************************************************"
summary(sm_model_hotel_resort_lead_cancellation)
##
## Call:
## lm(formula = lead_time ~ is_canceled, data = hotel_resort_lead_cancellation)
##
## Residuals:
## Min 1Q Median 3Q Max
## -128.70 -74.70 -36.70 57.15 658.15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 78.8510 0.5568 141.62 <2e-16 ***
## is_canceled 49.8526 1.0566 47.18 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 94.7 on 40045 degrees of freedom
## Multiple R-squared: 0.05266, Adjusted R-squared: 0.05264
## F-statistic: 2226 on 1 and 40045 DF, p-value: < 2.2e-16
print("***********************************************************************")
## [1] "***********************************************************************"
lead_time_cancel_plot_city <- ggplot(hotel_city_lead_cancellation, aes(x = lead_time, fill= factor(is_canceled), label = stat(count))) +
geom_density(alpha = 0.5) + # Adjust alpha for transparency
labs(title = "City Hotel:Lead Time vs. Cancellation ",
x = "Lead Time",
y = "Density",
color = "Hotel") +
scale_fill_discrete(
name = "Cancelled",
breaks = c("0", "1"),
label = c("Not Cancelled", "Cancelled")
)+
theme_minimal()
lead_time_cancel_plot_resort <- ggplot(hotel_resort_lead_cancellation, aes(x = lead_time, fill= factor(is_canceled), label = stat(count))) +
geom_density(alpha = 0.5) + # Adjust alpha for transparency
labs(title = "Resort Hotel: Lead Time vs. Cancellation",
x = "Lead Time",
y = "Density",
color = "Hotel") +
scale_fill_discrete(
name = "Cancelled",
breaks = c("0", "1"),
label = c("Not Cancelled", "Cancelled")
)+
theme_minimal()
# Print the scatter plot
#print(lead_time_cancel_plot)
grid.arrange(lead_time_cancel_plot_city, lead_time_cancel_plot_resort, nrow = 2)
Explanation:
1.Lead time and Cancellation analysis determine
the relationship between lead time and likely hood of Cancellation in
hotel booking.
2. The intercept is 80.854 tells that during
this estimated lead time when a booking may not canceled in city hotel
where as This value is 78.851 for resort hotel. 3. Significance
Both coefficients are highly significant because p-value < 2.2e-16
(small then 0.05).
Conclusion: 1. For City hotel, “Canceled booking” tend to have
a lead time approximately 69 days longer than non-canceled bookings
2. For resort hotel, “Canceled booking” tend to have a lead time
approximately 49 days longer than non-canceled bookings.
3. By
understanding the relationship of cancellation status and duration can
help hotel management for defining their cancelation policy , resource
arrangement and other strategies (like booking insurance etc.).
#Factor
str(hotels_ultimate_data)
## tibble [119,206 × 34] (S3: tbl_df/tbl/data.frame)
## $ hotel : chr [1:119206] "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
## $ is_canceled : num [1:119206] 0 0 0 0 0 0 0 0 1 1 ...
## $ lead_time : num [1:119206] 342 737 7 13 14 14 0 9 85 75 ...
## $ arrival_date_year : num [1:119206] 2015 2015 2015 2015 2015 ...
## $ arrival_date_month : chr [1:119206] "July" "July" "July" "July" ...
## $ arrival_date_week_number : num [1:119206] 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : num [1:119206] 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : num [1:119206] 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : num [1:119206] 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : num [1:119206] 2 2 1 1 2 2 2 2 2 2 ...
## $ children : num [1:119206] 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : num [1:119206] 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : chr [1:119206] "BB" "BB" "BB" "BB" ...
## $ country : chr [1:119206] "PRT" "PRT" "GBR" "GBR" ...
## $ market_segment : chr [1:119206] "Direct" "Direct" "Direct" "Corporate" ...
## $ distribution_channel : chr [1:119206] "Direct" "Direct" "Direct" "Corporate" ...
## $ is_repeated_guest : num [1:119206] 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : num [1:119206] 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: num [1:119206] 0 0 0 0 0 0 0 0 0 0 ...
## $ reserved_room_type : chr [1:119206] "C" "C" "A" "A" ...
## $ assigned_room_type : chr [1:119206] "C" "C" "C" "A" ...
## $ booking_changes : num [1:119206] 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : chr [1:119206] "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
## $ agent : chr [1:119206] "NULL" "NULL" "NULL" "304" ...
## $ company : chr [1:119206] "NULL" "NULL" "NULL" "NULL" ...
## $ days_in_waiting_list : num [1:119206] 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : chr [1:119206] "Transient" "Transient" "Transient" "Transient" ...
## $ adr : num [1:119206] 0 0 75 75 98 ...
## $ required_car_parking_spaces : num [1:119206] 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : num [1:119206] 0 0 0 0 1 1 0 1 1 0 ...
## $ reservation_status : chr [1:119206] "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
## $ reservation_status_date : Date[1:119206], format: "2015-07-01" "2015-07-01" ...
## $ ADR : num [1:119206] NaN NaN Inf Inf Inf ...
## $ predicted_adr : Named num [1:119206] 92.1 81.1 101.5 101.3 101.3 ...
## ..- attr(*, "names")= chr [1:119206] "1" "2" "3" "4" ...
## - attr(*, "na.action")= 'omit' Named int [1:4] 40601 40668 40680 41161
## ..- attr(*, "names")= chr [1:4] "40601" "40668" "40680" "41161"
booking_cancel_percentage <- 44224/(44224+75166)
booking_cancel_percentage <- booking_cancel_percentage * 100
cat(" Percentage of hotel booking : ",booking_cancel_percentage)
## Percentage of hotel booking : 37.04163
# Perform logistic regression
logit_model <- glm(is_canceled ~ lead_time + previous_cancellations + adults + children + babies + is_repeated_guest,
data = hotels_ultimate_data,
family = binomial)
# Summary of the logistic regression model
summary(logit_model)
##
## Call:
## glm(formula = is_canceled ~ lead_time + previous_cancellations +
## adults + children + babies + is_repeated_guest, family = binomial,
## data = hotels_ultimate_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.399e+00 2.652e-02 -52.760 < 2e-16 ***
## lead_time 4.984e-03 6.322e-05 78.835 < 2e-16 ***
## previous_cancellations 2.144e+00 3.922e-02 54.673 < 2e-16 ***
## adults 1.362e-01 1.345e-02 10.133 < 2e-16 ***
## children 1.220e-01 1.543e-02 7.910 2.57e-15 ***
## babies -8.429e-01 8.665e-02 -9.728 < 2e-16 ***
## is_repeated_guest -2.161e+00 6.813e-02 -31.723 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 157197 on 119205 degrees of freedom
## Residual deviance: 141388 on 119199 degrees of freedom
## AIC: 141402
##
## Number of Fisher Scoring iterations: 6
Explanation: Since in our database, Hotel booking cancellation percentage is 37.04% which is quite high. Here we have tried to see pattern in cancellation, lead time , history of canceling, the number of people in the booking and repeated guest.
Coefficients Each variable has different value of coefficient which mean different impact of these variable on cencellation.Although babies and is_repeated_guest has negative coefficient value -8.42 and -2.16 which means there is less likely hood of booking cencelation.
AIC (Akaike Information Criterion) value of 141402 is a measure of how well the model fits the data while penalizing for complexity. Lower AIC values indicate better fitting models relative to other models being compared.
Conclusion:
1. Lead time and previous cancellations have
positive effects on the likelihood of cancellation.
2. The number of
adults, children, and babies also influences cancellation likelihood
positively.
3. Being a repeated guest reduces the likelihood of
cancellation. 4. This analysis give insight of factor which may have
affect in booking cencellation which can help hotel management to make
strategies to minimize cancellations and optimize revenue.