The hotel industry is constantly evolving, and it is important for hotel managers to understand factors that influence booking demand in order to optimize pricing, marketing, and other business strategies. In this analysis, we will explore the Hotel Booking Demand dataset from Kaggle, which contains booking information for a city hotel and a resort hotel, and includes information such as when the booking was made, length of stay, the number of adults, children, and babies, and the amount paid. Our goal is to identify trends and patterns in the data that can provide actionable insights for hotel management.
First, we will import the necessary libraries and read in the dataset.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.0
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggplot2)
bookings <- read.csv("hotel_bookings.csv")
# Check the unique values of the year variable
unique(bookings$year)
## NULL
# Select only the 2015 and 2016 levels
bookings <- bookings %>%
filter(arrival_date_year %in% c("2015", "2016"))
# Check the unique values of the year variable
unique(bookings$year)
## NULL
Next, we will explore the data to get a better understanding of the variables and their values.
str(bookings)
## 'data.frame': 78703 obs. of 32 variables:
## $ hotel : chr "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
## $ is_canceled : int 0 0 0 0 0 0 0 0 1 1 ...
## $ lead_time : int 342 737 7 13 14 14 0 9 85 75 ...
## $ arrival_date_year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ arrival_date_month : chr "July" "July" "July" "July" ...
## $ arrival_date_week_number : int 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : int 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : int 2 2 1 1 2 2 2 2 2 2 ...
## $ children : int 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : int 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : chr "BB" "BB" "BB" "BB" ...
## $ country : chr "PRT" "PRT" "GBR" "GBR" ...
## $ market_segment : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ distribution_channel : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ is_repeated_guest : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: int 0 0 0 0 0 0 0 0 0 0 ...
## $ reserved_room_type : chr "C" "C" "A" "A" ...
## $ assigned_room_type : chr "C" "C" "C" "A" ...
## $ booking_changes : int 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : chr "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
## $ agent : chr "NULL" "NULL" "NULL" "304" ...
## $ company : chr "NULL" "NULL" "NULL" "NULL" ...
## $ days_in_waiting_list : int 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : chr "Transient" "Transient" "Transient" "Transient" ...
## $ adr : num 0 0 75 75 98 ...
## $ required_car_parking_spaces : int 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : int 0 0 0 0 1 1 0 1 1 0 ...
## $ reservation_status : chr "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
## $ reservation_status_date : chr "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...
head(bookings)
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## 1 Resort Hotel 0 342 2015 July
## 2 Resort Hotel 0 737 2015 July
## 3 Resort Hotel 0 7 2015 July
## 4 Resort Hotel 0 13 2015 July
## 5 Resort Hotel 0 14 2015 July
## 6 Resort Hotel 0 14 2015 July
## arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
## 1 27 1 0
## 2 27 1 0
## 3 27 1 0
## 4 27 1 0
## 5 27 1 0
## 6 27 1 0
## stays_in_week_nights adults children babies meal country market_segment
## 1 0 2 0 0 BB PRT Direct
## 2 0 2 0 0 BB PRT Direct
## 3 1 1 0 0 BB GBR Direct
## 4 1 1 0 0 BB GBR Corporate
## 5 2 2 0 0 BB GBR Online TA
## 6 2 2 0 0 BB GBR Online TA
## distribution_channel is_repeated_guest previous_cancellations
## 1 Direct 0 0
## 2 Direct 0 0
## 3 Direct 0 0
## 4 Corporate 0 0
## 5 TA/TO 0 0
## 6 TA/TO 0 0
## previous_bookings_not_canceled reserved_room_type assigned_room_type
## 1 0 C C
## 2 0 C C
## 3 0 A C
## 4 0 A A
## 5 0 A A
## 6 0 A A
## booking_changes deposit_type agent company days_in_waiting_list customer_type
## 1 3 No Deposit NULL NULL 0 Transient
## 2 4 No Deposit NULL NULL 0 Transient
## 3 0 No Deposit NULL NULL 0 Transient
## 4 0 No Deposit 304 NULL 0 Transient
## 5 0 No Deposit 240 NULL 0 Transient
## 6 0 No Deposit 240 NULL 0 Transient
## adr required_car_parking_spaces total_of_special_requests reservation_status
## 1 0 0 0 Check-Out
## 2 0 0 0 Check-Out
## 3 75 0 0 Check-Out
## 4 75 0 0 Check-Out
## 5 98 0 1 Check-Out
## 6 98 0 1 Check-Out
## reservation_status_date
## 1 2015-07-01
## 2 2015-07-01
## 3 2015-07-02
## 4 2015-07-02
## 5 2015-07-03
## 6 2015-07-03
summary(bookings)
## hotel is_canceled lead_time arrival_date_year
## Length:78703 Min. :0.0000 Min. : 0.0 Min. :2015
## Class :character 1st Qu.:0.0000 1st Qu.: 16.0 1st Qu.:2015
## Mode :character Median :0.0000 Median : 65.0 Median :2016
## Mean :0.3619 Mean :101.3 Mean :2016
## 3rd Qu.:1.0000 3rd Qu.:158.0 3rd Qu.:2016
## Max. :1.0000 Max. :737.0 Max. :2016
##
## arrival_date_month arrival_date_week_number arrival_date_day_of_month
## Length:78703 Min. : 1.00 Min. : 1.00
## Class :character 1st Qu.:21.00 1st Qu.: 8.00
## Mode :character Median :34.00 Median :16.00
## Mean :31.58 Mean :15.84
## 3rd Qu.:42.00 3rd Qu.:23.00
## Max. :53.00 Max. :31.00
##
## stays_in_weekend_nights stays_in_week_nights adults
## Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 1.000 1st Qu.: 2.000
## Median : 1.000 Median : 2.000 Median : 2.000
## Mean : 0.904 Mean : 2.448 Mean : 1.845
## 3rd Qu.: 2.000 3rd Qu.: 3.000 3rd Qu.: 2.000
## Max. :19.000 Max. :50.000 Max. :55.000
##
## children babies meal country
## Min. : 0.00000 Min. : 0.000000 Length:78703 Length:78703
## 1st Qu.: 0.00000 1st Qu.: 0.000000 Class :character Class :character
## Median : 0.00000 Median : 0.000000 Mode :character Mode :character
## Mean : 0.09231 Mean : 0.008716
## 3rd Qu.: 0.00000 3rd Qu.: 0.000000
## Max. :10.00000 Max. :10.000000
## NA's :4
## market_segment distribution_channel is_repeated_guest
## Length:78703 Length:78703 Min. :0.00000
## Class :character Class :character 1st Qu.:0.00000
## Mode :character Mode :character Median :0.00000
## Mean :0.03074
## 3rd Qu.:0.00000
## Max. :1.00000
##
## previous_cancellations previous_bookings_not_canceled reserved_room_type
## Min. : 0.000 Min. : 0.0000 Length:78703
## 1st Qu.: 0.000 1st Qu.: 0.0000 Class :character
## Median : 0.000 Median : 0.0000 Mode :character
## Mean : 0.126 Mean : 0.1195
## 3rd Qu.: 0.000 3rd Qu.: 0.0000
## Max. :26.000 Max. :58.0000
##
## assigned_room_type booking_changes deposit_type agent
## Length:78703 Min. : 0.0000 Length:78703 Length:78703
## Class :character 1st Qu.: 0.0000 Class :character Class :character
## Mode :character Median : 0.0000 Mode :character Mode :character
## Mean : 0.2117
## 3rd Qu.: 0.0000
## Max. :21.0000
##
## company days_in_waiting_list customer_type adr
## Length:78703 Min. : 0.0 Length:78703 Min. : 0.00
## Class :character 1st Qu.: 0.0 Class :character 1st Qu.: 65.00
## Mode :character Median : 0.0 Mode :character Median : 89.00
## Mean : 3.4 Mean : 95.21
## 3rd Qu.: 0.0 3rd Qu.: 117.30
## Max. :391.0 Max. :5400.00
##
## required_car_parking_spaces total_of_special_requests reservation_status
## Min. :0.00000 Min. :0.000 Length:78703
## 1st Qu.:0.00000 1st Qu.:0.000 Class :character
## Median :0.00000 Median :0.000 Mode :character
## Mean :0.06641 Mean :0.517
## 3rd Qu.:0.00000 3rd Qu.:1.000
## Max. :3.00000 Max. :5.000
##
## reservation_status_date
## Length:78703
## Class :character
## Mode :character
##
##
##
##
# Remove missing values and outliers
bookings <- bookings %>%
drop_na(is_canceled, arrival_date_year, arrival_date_month) %>%
filter(adults > 0)
Now that the data is cleaned and preprocessed, we can begin to perform exploratory data analysis (EDA) to identify trends and patterns in the data.
We will start by visualizing the cancellation rate, which is the proportion of bookings that were ultimately canceled. We can create a bar plot using the ‘ggplot’ function and the ‘geom_bar’ layer.
ggplot(bookings, aes(x = is_canceled)) +
geom_bar() +
labs(title = "Cancellation Rate")
From the plot, we can see that the majority of bookings were not canceled. However, it would be interesting to see if the cancellation rate has changed over time. We can create a line plot to visualize the cancellation rate by year.
ggplot(bookings, aes(x = arrival_date_year, y = is_canceled)) +
geom_line() +
labs(title = "Cancellation Rate Over Time")
We can also visualize the cancellation rate by month to see if there are any seasonal trends.
ggplot(bookings, aes(x = arrival_date_month, y = is_canceled)) +
geom_line() +
labs(title = "Cancellation Rate by Month")
#The box plot suggests that the cancellation rate varies significantly by month.
We have identified some trends and patterns in the data, but we need to use statistical tests to determine whether these trends are statistically significant.
We will start by testing the hypothesis that the cancellation rate has decreased over time. We can use a t-test to compare the means of the is_canceled variable for the two years in the dataset.
# Create a new variable indicating the year
bookings$year <- as.factor(bookings$arrival_date_year)
# Perform the t-test
t.test(is_canceled ~ year, data = bookings)
##
## Welch Two Sample t-test
##
## data: is_canceled by year
## t = 3.0306, df = 39741, p-value = 0.002443
## alternative hypothesis: true difference in means between group 2015 and group 2016 is not equal to 0
## 95 percent confidence interval:
## 0.004104647 0.019134903
## sample estimates:
## mean in group 2015 mean in group 2016
## 0.3706268 0.3590070
The test results show that the difference in means is statistically significant, indicating that the cancellation rate has indeed decreased over time.
Next, we will test the hypothesis that the cancellation rate varies significantly by month. We can use a ANOVA test to compare the means of the ‘is_canceled’ variable for each month.
aov.test <- aov(is_canceled ~ arrival_date_month, data = bookings)
summary(aov.test)
## Df Sum Sq Mean Sq F value Pr(>F)
## arrival_date_month 11 90 8.209 35.7 <2e-16 ***
## Residuals 78418 18029 0.230
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The test results show that the difference in means is statistically significant, indicating that the cancellation rate does vary significantly by month.
In this analysis, we used the Hotel Booking Demand dataset to identify trends and patterns in booking demand. We found that the cancellation rate has decreased over time and varies significantly by month. These insights can be valuable for hotel management, as they can inform pricing, marketing, and other business strategies. For example, the hotel may want to target marketing efforts towards months with higher cancellation rates in order to encourage bookings and increase revenue.
Overall, this analysis demonstrates the importance of understanding factors that influence booking demand in the hotel industry, and the value of using data analysis tools like R to uncover actionable insights.