Introduction

The hotel industry is constantly evolving, and it is important for hotel managers to understand factors that influence booking demand in order to optimize pricing, marketing, and other business strategies. In this analysis, we will explore the Hotel Booking Demand dataset from Kaggle, which contains booking information for a city hotel and a resort hotel, and includes information such as when the booking was made, length of stay, the number of adults, children, and babies, and the amount paid. Our goal is to identify trends and patterns in the data that can provide actionable insights for hotel management.

Data Exploration and Preprocessing

First, we will import the necessary libraries and read in the dataset.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.0 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(ggplot2)

bookings <- read.csv("hotel_bookings.csv")

# Check the unique values of the year variable
unique(bookings$year)
## NULL
# Select only the 2015 and 2016 levels
bookings <- bookings %>%
  filter(arrival_date_year %in% c("2015", "2016"))

# Check the unique values of the year variable
unique(bookings$year)
## NULL

Next, we will explore the data to get a better understanding of the variables and their values.

str(bookings)
## 'data.frame':    78703 obs. of  32 variables:
##  $ hotel                         : chr  "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
##  $ is_canceled                   : int  0 0 0 0 0 0 0 0 1 1 ...
##  $ lead_time                     : int  342 737 7 13 14 14 0 9 85 75 ...
##  $ arrival_date_year             : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ arrival_date_month            : chr  "July" "July" "July" "July" ...
##  $ arrival_date_week_number      : int  27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : int  0 0 1 1 2 2 2 2 3 3 ...
##  $ adults                        : int  2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ meal                          : chr  "BB" "BB" "BB" "BB" ...
##  $ country                       : chr  "PRT" "PRT" "GBR" "GBR" ...
##  $ market_segment                : chr  "Direct" "Direct" "Direct" "Corporate" ...
##  $ distribution_channel          : chr  "Direct" "Direct" "Direct" "Corporate" ...
##  $ is_repeated_guest             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reserved_room_type            : chr  "C" "C" "A" "A" ...
##  $ assigned_room_type            : chr  "C" "C" "C" "A" ...
##  $ booking_changes               : int  3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : chr  "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
##  $ agent                         : chr  "NULL" "NULL" "NULL" "304" ...
##  $ company                       : chr  "NULL" "NULL" "NULL" "NULL" ...
##  $ days_in_waiting_list          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : chr  "Transient" "Transient" "Transient" "Transient" ...
##  $ adr                           : num  0 0 75 75 98 ...
##  $ required_car_parking_spaces   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : int  0 0 0 0 1 1 0 1 1 0 ...
##  $ reservation_status            : chr  "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
##  $ reservation_status_date       : chr  "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...
head(bookings)
##          hotel is_canceled lead_time arrival_date_year arrival_date_month
## 1 Resort Hotel           0       342              2015               July
## 2 Resort Hotel           0       737              2015               July
## 3 Resort Hotel           0         7              2015               July
## 4 Resort Hotel           0        13              2015               July
## 5 Resort Hotel           0        14              2015               July
## 6 Resort Hotel           0        14              2015               July
##   arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
## 1                       27                         1                       0
## 2                       27                         1                       0
## 3                       27                         1                       0
## 4                       27                         1                       0
## 5                       27                         1                       0
## 6                       27                         1                       0
##   stays_in_week_nights adults children babies meal country market_segment
## 1                    0      2        0      0   BB     PRT         Direct
## 2                    0      2        0      0   BB     PRT         Direct
## 3                    1      1        0      0   BB     GBR         Direct
## 4                    1      1        0      0   BB     GBR      Corporate
## 5                    2      2        0      0   BB     GBR      Online TA
## 6                    2      2        0      0   BB     GBR      Online TA
##   distribution_channel is_repeated_guest previous_cancellations
## 1               Direct                 0                      0
## 2               Direct                 0                      0
## 3               Direct                 0                      0
## 4            Corporate                 0                      0
## 5                TA/TO                 0                      0
## 6                TA/TO                 0                      0
##   previous_bookings_not_canceled reserved_room_type assigned_room_type
## 1                              0                  C                  C
## 2                              0                  C                  C
## 3                              0                  A                  C
## 4                              0                  A                  A
## 5                              0                  A                  A
## 6                              0                  A                  A
##   booking_changes deposit_type agent company days_in_waiting_list customer_type
## 1               3   No Deposit  NULL    NULL                    0     Transient
## 2               4   No Deposit  NULL    NULL                    0     Transient
## 3               0   No Deposit  NULL    NULL                    0     Transient
## 4               0   No Deposit   304    NULL                    0     Transient
## 5               0   No Deposit   240    NULL                    0     Transient
## 6               0   No Deposit   240    NULL                    0     Transient
##   adr required_car_parking_spaces total_of_special_requests reservation_status
## 1   0                           0                         0          Check-Out
## 2   0                           0                         0          Check-Out
## 3  75                           0                         0          Check-Out
## 4  75                           0                         0          Check-Out
## 5  98                           0                         1          Check-Out
## 6  98                           0                         1          Check-Out
##   reservation_status_date
## 1              2015-07-01
## 2              2015-07-01
## 3              2015-07-02
## 4              2015-07-02
## 5              2015-07-03
## 6              2015-07-03
summary(bookings)
##     hotel            is_canceled       lead_time     arrival_date_year
##  Length:78703       Min.   :0.0000   Min.   :  0.0   Min.   :2015     
##  Class :character   1st Qu.:0.0000   1st Qu.: 16.0   1st Qu.:2015     
##  Mode  :character   Median :0.0000   Median : 65.0   Median :2016     
##                     Mean   :0.3619   Mean   :101.3   Mean   :2016     
##                     3rd Qu.:1.0000   3rd Qu.:158.0   3rd Qu.:2016     
##                     Max.   :1.0000   Max.   :737.0   Max.   :2016     
##                                                                       
##  arrival_date_month arrival_date_week_number arrival_date_day_of_month
##  Length:78703       Min.   : 1.00            Min.   : 1.00            
##  Class :character   1st Qu.:21.00            1st Qu.: 8.00            
##  Mode  :character   Median :34.00            Median :16.00            
##                     Mean   :31.58            Mean   :15.84            
##                     3rd Qu.:42.00            3rd Qu.:23.00            
##                     Max.   :53.00            Max.   :31.00            
##                                                                       
##  stays_in_weekend_nights stays_in_week_nights     adults      
##  Min.   : 0.000          Min.   : 0.000       Min.   : 0.000  
##  1st Qu.: 0.000          1st Qu.: 1.000       1st Qu.: 2.000  
##  Median : 1.000          Median : 2.000       Median : 2.000  
##  Mean   : 0.904          Mean   : 2.448       Mean   : 1.845  
##  3rd Qu.: 2.000          3rd Qu.: 3.000       3rd Qu.: 2.000  
##  Max.   :19.000          Max.   :50.000       Max.   :55.000  
##                                                               
##     children            babies              meal             country         
##  Min.   : 0.00000   Min.   : 0.000000   Length:78703       Length:78703      
##  1st Qu.: 0.00000   1st Qu.: 0.000000   Class :character   Class :character  
##  Median : 0.00000   Median : 0.000000   Mode  :character   Mode  :character  
##  Mean   : 0.09231   Mean   : 0.008716                                        
##  3rd Qu.: 0.00000   3rd Qu.: 0.000000                                        
##  Max.   :10.00000   Max.   :10.000000                                        
##  NA's   :4                                                                   
##  market_segment     distribution_channel is_repeated_guest
##  Length:78703       Length:78703         Min.   :0.00000  
##  Class :character   Class :character     1st Qu.:0.00000  
##  Mode  :character   Mode  :character     Median :0.00000  
##                                          Mean   :0.03074  
##                                          3rd Qu.:0.00000  
##                                          Max.   :1.00000  
##                                                           
##  previous_cancellations previous_bookings_not_canceled reserved_room_type
##  Min.   : 0.000         Min.   : 0.0000                Length:78703      
##  1st Qu.: 0.000         1st Qu.: 0.0000                Class :character  
##  Median : 0.000         Median : 0.0000                Mode  :character  
##  Mean   : 0.126         Mean   : 0.1195                                  
##  3rd Qu.: 0.000         3rd Qu.: 0.0000                                  
##  Max.   :26.000         Max.   :58.0000                                  
##                                                                          
##  assigned_room_type booking_changes   deposit_type          agent          
##  Length:78703       Min.   : 0.0000   Length:78703       Length:78703      
##  Class :character   1st Qu.: 0.0000   Class :character   Class :character  
##  Mode  :character   Median : 0.0000   Mode  :character   Mode  :character  
##                     Mean   : 0.2117                                        
##                     3rd Qu.: 0.0000                                        
##                     Max.   :21.0000                                        
##                                                                            
##    company          days_in_waiting_list customer_type           adr         
##  Length:78703       Min.   :  0.0        Length:78703       Min.   :   0.00  
##  Class :character   1st Qu.:  0.0        Class :character   1st Qu.:  65.00  
##  Mode  :character   Median :  0.0        Mode  :character   Median :  89.00  
##                     Mean   :  3.4                           Mean   :  95.21  
##                     3rd Qu.:  0.0                           3rd Qu.: 117.30  
##                     Max.   :391.0                           Max.   :5400.00  
##                                                                              
##  required_car_parking_spaces total_of_special_requests reservation_status
##  Min.   :0.00000             Min.   :0.000             Length:78703      
##  1st Qu.:0.00000             1st Qu.:0.000             Class :character  
##  Median :0.00000             Median :0.000             Mode  :character  
##  Mean   :0.06641             Mean   :0.517                               
##  3rd Qu.:0.00000             3rd Qu.:1.000                               
##  Max.   :3.00000             Max.   :5.000                               
##                                                                          
##  reservation_status_date
##  Length:78703           
##  Class :character       
##  Mode  :character       
##                         
##                         
##                         
## 
# Remove missing values and outliers
bookings <- bookings %>%
  drop_na(is_canceled, arrival_date_year, arrival_date_month) %>%
  filter(adults > 0)

Exploratory Data Analysis

Now that the data is cleaned and preprocessed, we can begin to perform exploratory data analysis (EDA) to identify trends and patterns in the data.

We will start by visualizing the cancellation rate, which is the proportion of bookings that were ultimately canceled. We can create a bar plot using the ‘ggplot’ function and the ‘geom_bar’ layer.

ggplot(bookings, aes(x = is_canceled)) +
  geom_bar() +
  labs(title = "Cancellation Rate")

From the plot, we can see that the majority of bookings were not canceled. However, it would be interesting to see if the cancellation rate has changed over time. We can create a line plot to visualize the cancellation rate by year.

ggplot(bookings, aes(x = arrival_date_year, y = is_canceled)) +
  geom_line() +
  labs(title = "Cancellation Rate Over Time")

We can also visualize the cancellation rate by month to see if there are any seasonal trends.

ggplot(bookings, aes(x = arrival_date_month, y = is_canceled)) +
  geom_line() +
  labs(title = "Cancellation Rate by Month")

#The box plot suggests that the cancellation rate varies significantly by month.

Hypothesis Testing

We have identified some trends and patterns in the data, but we need to use statistical tests to determine whether these trends are statistically significant.

We will start by testing the hypothesis that the cancellation rate has decreased over time. We can use a t-test to compare the means of the is_canceled variable for the two years in the dataset.

# Create a new variable indicating the year
bookings$year <- as.factor(bookings$arrival_date_year)

# Perform the t-test
t.test(is_canceled ~ year, data = bookings)
## 
##  Welch Two Sample t-test
## 
## data:  is_canceled by year
## t = 3.0306, df = 39741, p-value = 0.002443
## alternative hypothesis: true difference in means between group 2015 and group 2016 is not equal to 0
## 95 percent confidence interval:
##  0.004104647 0.019134903
## sample estimates:
## mean in group 2015 mean in group 2016 
##          0.3706268          0.3590070

The test results show that the difference in means is statistically significant, indicating that the cancellation rate has indeed decreased over time.

Next, we will test the hypothesis that the cancellation rate varies significantly by month. We can use a ANOVA test to compare the means of the ‘is_canceled’ variable for each month.

aov.test <- aov(is_canceled ~ arrival_date_month, data = bookings)
summary(aov.test)
##                       Df Sum Sq Mean Sq F value Pr(>F)    
## arrival_date_month    11     90   8.209    35.7 <2e-16 ***
## Residuals          78418  18029   0.230                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The test results show that the difference in means is statistically significant, indicating that the cancellation rate does vary significantly by month.

Conclusion

In this analysis, we used the Hotel Booking Demand dataset to identify trends and patterns in booking demand. We found that the cancellation rate has decreased over time and varies significantly by month. These insights can be valuable for hotel management, as they can inform pricing, marketing, and other business strategies. For example, the hotel may want to target marketing efforts towards months with higher cancellation rates in order to encourage bookings and increase revenue.

Overall, this analysis demonstrates the importance of understanding factors that influence booking demand in the hotel industry, and the value of using data analysis tools like R to uncover actionable insights.