DATA ANALYST: KAPPO STEPHEN
The hotel booking data set provides detailed information about hotel reservations from a variety of guests, spanning across two hotel types: City Hotels and Resort Hotels. This data set contains rich insights into guest behavior, booking patterns, and operational metrics. Key variables include:
Booking Lead Time (the number of days between booking and arrival),
Cancellation Rates,
Length of Stay,
Special Requests made by guests,
Guest Demographics (e.g., country of origin, group size),
Booking Channels, and
Seasonality (booking trends by time of year).
The purpose of this analysis is to uncover valuable trends and patterns, such as how booking behavior varies between different hotel types, what factors influence cancellations, and how external factors like seasonality and special requests impact overall hotel performance. Understanding these dynamics is essential for enhancing customer satisfaction, optimizing revenue, and improving hotel management strategies.
What time of year experiences the highest and lowest demand for bookings?
(Objective: Analyze seasonal trends in hotel bookings.)
Which type of hotel (city vs resort) tends to have a higher occupancy rate during peak seasons?
(Objective: Compare demand between city hotels and resort hotels.)
Does the length of stay impact the average daily rate (ADR) for bookings?
(Objective: Investigate how the length of stay affects pricing.)
What is the average lead time for bookings, and how does it vary by hotel type?
(Objective: Analyze how far in advance people book different types of hotels.)
How do cancellation rates vary by season and hotel type?
(Objective: Identify any trends in booking cancellations and potential causes.)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# LOAD THE DATASET
hotel_bookings <- read.csv("hotel_bookings.csv")
#STRUCTURE OF THE DATASET
str(hotel_bookings)
## 'data.frame': 119390 obs. of 32 variables:
## $ hotel : chr "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
## $ is_canceled : int 0 0 0 0 0 0 0 0 1 1 ...
## $ lead_time : int 342 737 7 13 14 14 0 9 85 75 ...
## $ arrival_date_year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ arrival_date_month : chr "July" "July" "July" "July" ...
## $ arrival_date_week_number : int 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : int 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : int 2 2 1 1 2 2 2 2 2 2 ...
## $ children : int 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : int 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : chr "BB" "BB" "BB" "BB" ...
## $ country : chr "PRT" "PRT" "GBR" "GBR" ...
## $ market_segment : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ distribution_channel : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ is_repeated_guest : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: int 0 0 0 0 0 0 0 0 0 0 ...
## $ reserved_room_type : chr "C" "C" "A" "A" ...
## $ assigned_room_type : chr "C" "C" "C" "A" ...
## $ booking_changes : int 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : chr "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
## $ agent : chr "NULL" "NULL" "NULL" "304" ...
## $ company : chr "NULL" "NULL" "NULL" "NULL" ...
## $ days_in_waiting_list : int 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : chr "Transient" "Transient" "Transient" "Transient" ...
## $ adr : num 0 0 75 75 98 ...
## $ required_car_parking_spaces : int 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : int 0 0 0 0 1 1 0 1 1 0 ...
## $ reservation_status : chr "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
## $ reservation_status_date : chr "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...
#SUMMARY OF THE DATASET
summary(hotel_bookings)
## hotel is_canceled lead_time arrival_date_year
## Length:119390 Min. :0.0000 Min. : 0 Min. :2015
## Class :character 1st Qu.:0.0000 1st Qu.: 18 1st Qu.:2016
## Mode :character Median :0.0000 Median : 69 Median :2016
## Mean :0.3704 Mean :104 Mean :2016
## 3rd Qu.:1.0000 3rd Qu.:160 3rd Qu.:2017
## Max. :1.0000 Max. :737 Max. :2017
##
## arrival_date_month arrival_date_week_number arrival_date_day_of_month
## Length:119390 Min. : 1.00 Min. : 1.0
## Class :character 1st Qu.:16.00 1st Qu.: 8.0
## Mode :character Median :28.00 Median :16.0
## Mean :27.17 Mean :15.8
## 3rd Qu.:38.00 3rd Qu.:23.0
## Max. :53.00 Max. :31.0
##
## stays_in_weekend_nights stays_in_week_nights adults
## Min. : 0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 1.0 1st Qu.: 2.000
## Median : 1.0000 Median : 2.0 Median : 2.000
## Mean : 0.9276 Mean : 2.5 Mean : 1.856
## 3rd Qu.: 2.0000 3rd Qu.: 3.0 3rd Qu.: 2.000
## Max. :19.0000 Max. :50.0 Max. :55.000
##
## children babies meal country
## Min. : 0.0000 Min. : 0.000000 Length:119390 Length:119390
## 1st Qu.: 0.0000 1st Qu.: 0.000000 Class :character Class :character
## Median : 0.0000 Median : 0.000000 Mode :character Mode :character
## Mean : 0.1039 Mean : 0.007949
## 3rd Qu.: 0.0000 3rd Qu.: 0.000000
## Max. :10.0000 Max. :10.000000
## NA's :4
## market_segment distribution_channel is_repeated_guest
## Length:119390 Length:119390 Min. :0.00000
## Class :character Class :character 1st Qu.:0.00000
## Mode :character Mode :character Median :0.00000
## Mean :0.03191
## 3rd Qu.:0.00000
## Max. :1.00000
##
## previous_cancellations previous_bookings_not_canceled reserved_room_type
## Min. : 0.00000 Min. : 0.0000 Length:119390
## 1st Qu.: 0.00000 1st Qu.: 0.0000 Class :character
## Median : 0.00000 Median : 0.0000 Mode :character
## Mean : 0.08712 Mean : 0.1371
## 3rd Qu.: 0.00000 3rd Qu.: 0.0000
## Max. :26.00000 Max. :72.0000
##
## assigned_room_type booking_changes deposit_type agent
## Length:119390 Min. : 0.0000 Length:119390 Length:119390
## Class :character 1st Qu.: 0.0000 Class :character Class :character
## Mode :character Median : 0.0000 Mode :character Mode :character
## Mean : 0.2211
## 3rd Qu.: 0.0000
## Max. :21.0000
##
## company days_in_waiting_list customer_type adr
## Length:119390 Min. : 0.000 Length:119390 Min. : -6.38
## Class :character 1st Qu.: 0.000 Class :character 1st Qu.: 69.29
## Mode :character Median : 0.000 Mode :character Median : 94.58
## Mean : 2.321 Mean : 101.83
## 3rd Qu.: 0.000 3rd Qu.: 126.00
## Max. :391.000 Max. :5400.00
##
## required_car_parking_spaces total_of_special_requests reservation_status
## Min. :0.00000 Min. :0.0000 Length:119390
## 1st Qu.:0.00000 1st Qu.:0.0000 Class :character
## Median :0.00000 Median :0.0000 Mode :character
## Mean :0.06252 Mean :0.5714
## 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :8.00000 Max. :5.0000
##
## reservation_status_date
## Length:119390
## Class :character
## Mode :character
##
##
##
##
head(hotel_bookings)
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## 1 Resort Hotel 0 342 2015 July
## 2 Resort Hotel 0 737 2015 July
## 3 Resort Hotel 0 7 2015 July
## 4 Resort Hotel 0 13 2015 July
## 5 Resort Hotel 0 14 2015 July
## 6 Resort Hotel 0 14 2015 July
## arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
## 1 27 1 0
## 2 27 1 0
## 3 27 1 0
## 4 27 1 0
## 5 27 1 0
## 6 27 1 0
## stays_in_week_nights adults children babies meal country market_segment
## 1 0 2 0 0 BB PRT Direct
## 2 0 2 0 0 BB PRT Direct
## 3 1 1 0 0 BB GBR Direct
## 4 1 1 0 0 BB GBR Corporate
## 5 2 2 0 0 BB GBR Online TA
## 6 2 2 0 0 BB GBR Online TA
## distribution_channel is_repeated_guest previous_cancellations
## 1 Direct 0 0
## 2 Direct 0 0
## 3 Direct 0 0
## 4 Corporate 0 0
## 5 TA/TO 0 0
## 6 TA/TO 0 0
## previous_bookings_not_canceled reserved_room_type assigned_room_type
## 1 0 C C
## 2 0 C C
## 3 0 A C
## 4 0 A A
## 5 0 A A
## 6 0 A A
## booking_changes deposit_type agent company days_in_waiting_list customer_type
## 1 3 No Deposit NULL NULL 0 Transient
## 2 4 No Deposit NULL NULL 0 Transient
## 3 0 No Deposit NULL NULL 0 Transient
## 4 0 No Deposit 304 NULL 0 Transient
## 5 0 No Deposit 240 NULL 0 Transient
## 6 0 No Deposit 240 NULL 0 Transient
## adr required_car_parking_spaces total_of_special_requests reservation_status
## 1 0 0 0 Check-Out
## 2 0 0 0 Check-Out
## 3 75 0 0 Check-Out
## 4 75 0 0 Check-Out
## 5 98 0 1 Check-Out
## 6 98 0 1 Check-Out
## reservation_status_date
## 1 2015-07-01
## 2 2015-07-01
## 3 2015-07-02
## 4 2015-07-02
## 5 2015-07-03
## 6 2015-07-03
Checking for the total number of null values in each of the columns
colSums(is.na(hotel_bookings))
## hotel is_canceled
## 0 0
## lead_time arrival_date_year
## 0 0
## arrival_date_month arrival_date_week_number
## 0 0
## arrival_date_day_of_month stays_in_weekend_nights
## 0 0
## stays_in_week_nights adults
## 0 0
## children babies
## 4 0
## meal country
## 0 0
## market_segment distribution_channel
## 0 0
## is_repeated_guest previous_cancellations
## 0 0
## previous_bookings_not_canceled reserved_room_type
## 0 0
## assigned_room_type booking_changes
## 0 0
## deposit_type agent
## 0 0
## company days_in_waiting_list
## 0 0
## customer_type adr
## 0 0
## required_car_parking_spaces total_of_special_requests
## 0 0
## reservation_status reservation_status_date
## 0 0
hotel_bookings$children[is.na(hotel_bookings$children)] = round(mean(hotel_bookings$children, na.rm = TRUE), 0)
sum(is.na(hotel_bookings$children)) # TO CHECK OR COUNT THE NUMBER OF MISSING VALUES IN CHILDREN COLUMN JUST TO BE SURE OUR CODE ABOVE WORKED.
## [1] 0
NB : the agent and company values are just IDs replacing with 0 won’t create any problem.
hotel_bookings$agent[hotel_bookings$agent == "NULL"] <- 0
hotel_bookings$company[hotel_bookings$company == "NULL"] <- 0
table(hotel_bookings$country) # From the outcome of this code, it shows that country with the code "PRT" i.e PORTUGAL appears the most with a maximum number of 48,590
##
## ABW AGO AIA ALB AND ARE ARG ARM ASM ATA ATF AUS AUT
## 2 362 1 12 7 51 214 8 1 2 1 426 1263
## AZE BDI BEL BEN BFA BGD BGR BHR BHS BIH BLR BOL BRA
## 17 1 2342 3 1 12 75 5 1 13 26 10 2224
## BRB BWA CAF CHE CHL CHN CIV CMR CN COL COM CPV CRI
## 4 1 5 1730 65 999 6 10 1279 71 2 24 19
## CUB CYM CYP CZE DEU DJI DMA DNK DOM DZA ECU EGY ESP
## 8 1 51 171 7287 1 1 435 14 103 27 32 8568
## EST ETH FIN FJI FRA FRO GAB GBR GEO GGY GHA GIB GLP
## 83 3 447 1 10415 5 4 12129 22 3 4 18 2
## GNB GRC GTM GUY HKG HND HRV HUN IDN IMN IND IRL IRN
## 9 128 4 1 29 1 100 230 35 2 152 3375 83
## IRQ ISL ISR ITA JAM JEY JOR JPN KAZ KEN KHM KIR KNA
## 14 57 669 3766 6 8 21 197 19 6 2 1 2
## KOR KWT LAO LBN LBY LCA LIE LKA LTU LUX LVA MAC MAR
## 133 16 2 31 8 1 3 7 81 287 55 16 259
## MCO MDG MDV MEX MKD MLI MLT MMR MNE MOZ MRT MUS MWI
## 4 1 12 85 10 1 18 1 5 67 1 7 2
## MYS MYT NAM NCL NGA NIC NLD NOR NPL NULL NZL OMN PAK
## 28 2 1 1 34 1 2104 607 1 488 74 18 14
## PAN PER PHL PLW POL PRI PRT PRY PYF QAT ROU RUS RWA
## 9 29 40 1 919 12 48590 4 1 15 500 632 2
## SAU SDN SEN SGP SLE SLV SMR SRB STP SUR SVK SVN SWE
## 48 1 11 39 1 2 1 101 2 5 65 57 1024
## SYC SYR TGO THA TJK TMP TUN TUR TWN TZA UGA UKR UMI
## 2 3 2 59 9 3 39 248 51 5 2 68 1
## URY USA UZB VEN VGB VNM ZAF ZMB ZWE
## 32 2097 4 26 1 8 80 2 4
hotel_bookings$country[hotel_bookings$country == "NULL"] <- "PRT"
hotel_bookings <- hotel_bookings[hotel_bookings$adr >= 0, ]
# This method works by subsetting the dataset to exclude rows where the condition (adr >= 0) is false.
hotel_bookings <- mutate(hotel_bookings,
Occupancy = adults + children + babies,
length_of_stay = stays_in_week_nights + stays_in_weekend_nights)
str(hotel_bookings)
## 'data.frame': 119389 obs. of 34 variables:
## $ hotel : chr "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
## $ is_canceled : int 0 0 0 0 0 0 0 0 1 1 ...
## $ lead_time : int 342 737 7 13 14 14 0 9 85 75 ...
## $ arrival_date_year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ arrival_date_month : chr "July" "July" "July" "July" ...
## $ arrival_date_week_number : int 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : int 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : int 2 2 1 1 2 2 2 2 2 2 ...
## $ children : num 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : int 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : chr "BB" "BB" "BB" "BB" ...
## $ country : chr "PRT" "PRT" "GBR" "GBR" ...
## $ market_segment : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ distribution_channel : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ is_repeated_guest : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: int 0 0 0 0 0 0 0 0 0 0 ...
## $ reserved_room_type : chr "C" "C" "A" "A" ...
## $ assigned_room_type : chr "C" "C" "C" "A" ...
## $ booking_changes : int 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : chr "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
## $ agent : chr "0" "0" "0" "304" ...
## $ company : chr "0" "0" "0" "0" ...
## $ days_in_waiting_list : int 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : chr "Transient" "Transient" "Transient" "Transient" ...
## $ adr : num 0 0 75 75 98 ...
## $ required_car_parking_spaces : int 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : int 0 0 0 0 1 1 0 1 1 0 ...
## $ reservation_status : chr "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
## $ reservation_status_date : chr "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...
## $ Occupancy : num 2 2 1 1 2 2 2 2 2 2 ...
## $ length_of_stay : int 0 0 1 1 2 2 2 2 3 3 ...
# The mutate function in R (from the dplyr package) adds a new variable to a dataset. In this case, you're creating a new variable total_guests, which is the sum of the adults, children, and babies columns. Once the mutation is applied, str(hotel_data) will display the structure of the updated dataset.
table(hotel_bookings$Occupancy)
##
## 0 1 2 3 4 5 6 10 12 20 26 27 40
## 180 22581 82050 10495 3929 137 1 2 2 2 5 2 1
## 50 55
## 1 1
hotel_bookings <- filter(hotel_bookings, is_canceled > 0, Occupancy > 0)
#For each booking, there should be at-least 1 guest, here guests are of three categories: Adults, Children, Babies We have combined all three and added a new column called “total_guests”.
#But in data we have around 180 rows which have 0 guests. So we are removing those rows directly, as booking without guests doesn’t make sense.
INSIGHT: The data shows that the highest demand for bookings occurs in July and August, while the lowest demand is in January and December.
month_levels <- c('January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December')
hotel_bookings$arrival_date_month <- factor(hotel_bookings$arrival_date_month, levels = month_levels, ordered = T )
months_count <- table(hotel_bookings$arrival_date_month)
barplot(months_count,
main = "Seasonal Trend in Hotel Booking",
xlab = "Months",
ylab = "Record Count",
col = "skyblue")
ggplot(data = hotel_bookings, mapping = aes(x = arrival_date_month)) +
geom_bar(fill = "brown") +
labs(title = "Seasonal Trend in Hotel Booking", x = "Months", y = "Record Count of Months") +
theme_minimal()
INSIGHT: City Hotels tend to have a higher occupancy rate during peak seasons, with 69.01% compared to Resort Hotels at 30.73%.
peak_season_data <- filter(hotel_bookings, arrival_date_month %in% c("July", "August"))
occupancy_rates <- peak_season_data %>%
group_by(hotel) %>%
summarise(total_bookings = n(),
occupancy_rate = (total_bookings / 10000) * 100)
occupancy_rates
## # A tibble: 2 × 3
## hotel total_bookings occupancy_rate
## <chr> <int> <dbl>
## 1 City Hotel 6901 69.0
## 2 Resort Hotel 3073 30.7
barplot(height = occupancy_rates$occupancy_rate,
names.arg = occupancy_rates$hotel,
main = "Occupancy Rates by Hotel Type during Peak Season",
xlab = "Hotel Type",
ylab = "Occupancy Rate (%)",
col = c("maroon", "darkblue"),
ylim = c(0, 100))
INSIGHT: The data shows that ADR tends to fluctuate with the length of stay, with some interesting patterns:
Short stays (1-2 nights) have lower ADRs
There’s a peak in ADR for 5-6 night stays
Longer stays (7+ nights) show a slight decrease in ADR
This suggests that hotels might be pricing their rooms to encourage certain lengths of stay.
cor(hotel_bookings$adr, hotel_bookings$length_of_stay)
## [1] 0.09188117
# Calculate Average Daily Rate for each stay
adr_by_stay <- hotel_bookings %>%
group_by(length_of_stay) %>%
summarise(avg_adr = mean(adr, na.rm = T))
plot(x = adr_by_stay$length_of_stay, y = adr_by_stay$avg_adr,
main = "Average Daily Rate vs Length of Stay",
xlab = "Length of Stay",
ylab = "Average Daily Rate",
xlim = c(0, 10),
lty = 1, lwd = 2, pch = 20, col = 'darkblue')
abline(h = mean(adr_by_stay$avg_adr), col = "maroon")
lines(adr_by_stay$length_of_stay, adr_by_stay$avg_adr, col = "black")
INSIGHT: The chart below indicates that guests tend to book City Hotels further in advance than Resort Hotels.
lead_time_by_hotel <- hotel_bookings %>%
group_by(hotel) %>%
summarise(avg_lead_time_by_hotel = mean(lead_time, na.rm = T))
barplot(height = lead_time_by_hotel$avg_lead_time_by_hotel,
names.arg = lead_time_by_hotel$hotel,
main = "Average Lead Time by Hotel Type",
xlab = "Hotel Type",
ylab = "Average Lead Time",
ylim = c(0, 200),
col = c("darkred", "dodgerblue"))
INSIGHT: With respect to the chart below
City Hotels consistently have higher cancellation rates compared to Resort Hotels across all months.
There are seasonal variations in cancellation rates for both hotel types.
The highest cancellation rates for both hotel types seem to occur in the summer months (July and August).
Potential causes for these trends:
City Hotels might have more business travelers who are more likely to have changes in their plans.
The higher cancellation rates in summer could be due to changes in vacation plans or overbooking during peak season.
Resort Hotels might have lower cancellation rates due to stricter cancellation policies or because they are often booked for leisure travel, which might be less prone to last-minute changes
cancellation_rates <- hotel_bookings %>%
group_by(arrival_date_month, hotel) %>%
summarise(cancellation_rate = (sum(is_canceled) / 10000) * 100)
## `summarise()` has grouped output by 'arrival_date_month'. You can override
## using the `.groups` argument.
barplot(height = cancellation_rates$cancellation_rate,
names.arg = cancellation_rates$arrival_date_month,
main = "Cancellation Rate by Season and Hotel Type",
xlab = "Month",
ylab = "Cancellation Rate (%)",
ylim = c(0, 50),
col = c("yellow4", "tomato"),
legend.text = c("City Hotel", "Resort Hotel")
)