Introduction

1.1 Why is it interesting

The hotel industry faces challenges in optimizing bookings during peak seasons. Cancellations can significantly impact potential revenue. Which markets should hotels focus on to ensure stable bookings?

1.2 Approach

I plan to analyze the “hotels.csv” data to explore the relationships between various hotel attributes.

1.3 Benefits

This analysis aims to provide hoteliers with insights into which markets might offer more reliable bookings during peak seasons, leading to optimized revenue.

Packages Required

2.1 Packages loaded

install.packages("rmarkdown")
install.packages("tidyverse")
install.packages("readr")
install.packages("knitr")

2.2 Messages and warnings

library(tidyverse) 
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)    
library(knitr)   

2.3 Reason of using

  • tidyverse: For data manipulation and visualization.
  • readr: To import the “hotels.csv” data.
  • knitr: To display tables and data frames nicely.

Data Preparation

3.1 Initial Data Description

setwd('~/Desktop/B BUS 301 A/Mid-term/hotels')
hotels_data <-readr:: read_csv("hotels.csv")
## Rows: 119390 Columns: 32
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (13): hotel, arrival_date_month, meal, country, market_segment, distrib...
## dbl  (18): is_canceled, lead_time, arrival_date_year, arrival_date_week_numb...
## date  (1): reservation_status_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
##str(hotels_data)
##summary(hotels_data)
ncol(hotels_data)
## [1] 32
sum(is.na(hotels_data))
## [1] 4

The dataset we are using is named hotels.csv. Upon loading and inspecting the data:

  • The dataset contains 32 variables.
  • There are 4 missing values across these variables.

3.2 Data Importing and Cleaning

colnames(hotels_data)[apply(hotels_data, 2, function(x) any(is.na(x)))]
## [1] "children"
# Median imputation for 'children' column
hotels_data$children[is.na(hotels_data$children)] <- median(hotels_data$children, na.rm = TRUE)
# Convert 'arrival_date_year' to a factor
hotels_data$arrival_date_year <- as.factor(hotels_data$arrival_date_year)
# Handle "NULL" values in 'agent' and 'company' columns
hotels_data$agent[hotels_data$agent == "NULL"] <- NA
hotels_data$company[hotels_data$company == "NULL"] <- NA

3.3 Final Dataset

head(hotels_data, 10)
## # A tibble: 10 × 32
##    hotel        is_canceled lead_time arrival_date_year arrival_date_month
##    <chr>              <dbl>     <dbl> <fct>             <chr>             
##  1 Resort Hotel           0       342 2015              July              
##  2 Resort Hotel           0       737 2015              July              
##  3 Resort Hotel           0         7 2015              July              
##  4 Resort Hotel           0        13 2015              July              
##  5 Resort Hotel           0        14 2015              July              
##  6 Resort Hotel           0        14 2015              July              
##  7 Resort Hotel           0         0 2015              July              
##  8 Resort Hotel           0         9 2015              July              
##  9 Resort Hotel           1        85 2015              July              
## 10 Resort Hotel           1        75 2015              July              
## # ℹ 27 more variables: arrival_date_week_number <dbl>,
## #   arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
## #   stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
## #   meal <chr>, country <chr>, market_segment <chr>,
## #   distribution_channel <chr>, is_repeated_guest <dbl>,
## #   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## #   reserved_room_type <chr>, assigned_room_type <chr>, …

3.4 Summary Information:

summary(hotels_data)
##     hotel            is_canceled       lead_time   arrival_date_year
##  Length:119390      Min.   :0.0000   Min.   :  0   2015:21996       
##  Class :character   1st Qu.:0.0000   1st Qu.: 18   2016:56707       
##  Mode  :character   Median :0.0000   Median : 69   2017:40687       
##                     Mean   :0.3704   Mean   :104                    
##                     3rd Qu.:1.0000   3rd Qu.:160                    
##                     Max.   :1.0000   Max.   :737                    
##  arrival_date_month arrival_date_week_number arrival_date_day_of_month
##  Length:119390      Min.   : 1.00            Min.   : 1.0             
##  Class :character   1st Qu.:16.00            1st Qu.: 8.0             
##  Mode  :character   Median :28.00            Median :16.0             
##                     Mean   :27.17            Mean   :15.8             
##                     3rd Qu.:38.00            3rd Qu.:23.0             
##                     Max.   :53.00            Max.   :31.0             
##  stays_in_weekend_nights stays_in_week_nights     adults      
##  Min.   : 0.0000         Min.   : 0.0         Min.   : 0.000  
##  1st Qu.: 0.0000         1st Qu.: 1.0         1st Qu.: 2.000  
##  Median : 1.0000         Median : 2.0         Median : 2.000  
##  Mean   : 0.9276         Mean   : 2.5         Mean   : 1.856  
##  3rd Qu.: 2.0000         3rd Qu.: 3.0         3rd Qu.: 2.000  
##  Max.   :19.0000         Max.   :50.0         Max.   :55.000  
##     children           babies              meal             country         
##  Min.   : 0.0000   Min.   : 0.000000   Length:119390      Length:119390     
##  1st Qu.: 0.0000   1st Qu.: 0.000000   Class :character   Class :character  
##  Median : 0.0000   Median : 0.000000   Mode  :character   Mode  :character  
##  Mean   : 0.1039   Mean   : 0.007949                                        
##  3rd Qu.: 0.0000   3rd Qu.: 0.000000                                        
##  Max.   :10.0000   Max.   :10.000000                                        
##  market_segment     distribution_channel is_repeated_guest
##  Length:119390      Length:119390        Min.   :0.00000  
##  Class :character   Class :character     1st Qu.:0.00000  
##  Mode  :character   Mode  :character     Median :0.00000  
##                                          Mean   :0.03191  
##                                          3rd Qu.:0.00000  
##                                          Max.   :1.00000  
##  previous_cancellations previous_bookings_not_canceled reserved_room_type
##  Min.   : 0.00000       Min.   : 0.0000                Length:119390     
##  1st Qu.: 0.00000       1st Qu.: 0.0000                Class :character  
##  Median : 0.00000       Median : 0.0000                Mode  :character  
##  Mean   : 0.08712       Mean   : 0.1371                                  
##  3rd Qu.: 0.00000       3rd Qu.: 0.0000                                  
##  Max.   :26.00000       Max.   :72.0000                                  
##  assigned_room_type booking_changes   deposit_type          agent          
##  Length:119390      Min.   : 0.0000   Length:119390      Length:119390     
##  Class :character   1st Qu.: 0.0000   Class :character   Class :character  
##  Mode  :character   Median : 0.0000   Mode  :character   Mode  :character  
##                     Mean   : 0.2211                                        
##                     3rd Qu.: 0.0000                                        
##                     Max.   :21.0000                                        
##    company          days_in_waiting_list customer_type           adr         
##  Length:119390      Min.   :  0.000      Length:119390      Min.   :  -6.38  
##  Class :character   1st Qu.:  0.000      Class :character   1st Qu.:  69.29  
##  Mode  :character   Median :  0.000      Mode  :character   Median :  94.58  
##                     Mean   :  2.321                         Mean   : 101.83  
##                     3rd Qu.:  0.000                         3rd Qu.: 126.00  
##                     Max.   :391.000                         Max.   :5400.00  
##  required_car_parking_spaces total_of_special_requests reservation_status
##  Min.   :0.00000             Min.   :0.0000            Length:119390     
##  1st Qu.:0.00000             1st Qu.:0.0000            Class :character  
##  Median :0.00000             Median :0.0000            Mode  :character  
##  Mean   :0.06252             Mean   :0.5714                              
##  3rd Qu.:0.00000             3rd Qu.:1.0000                              
##  Max.   :8.00000             Max.   :5.0000                              
##  reservation_status_date
##  Min.   :2014-10-17     
##  1st Qu.:2016-02-01     
##  Median :2016-08-07     
##  Mean   :2016-07-30     
##  3rd Qu.:2017-02-08     
##  Max.   :2017-09-14

hotel: This character variable indicates the type of hotel, with common categories being “Resort Hotel” and others. For analysis, it will be essential to examine if there are significant differences between different hotel types in relation to other variables.

is_canceled: A numerical variable that indicates whether a reservation was canceled (1) or not (0). This will be a critical variable if one is investigating factors leading to cancellations.

lead_time: Numerically represents the number of days that elapsed between the booking date and the actual arrival date. It will be interesting to investigate if longer lead times lead to higher cancellations or if they affect the Average Daily Rate (ADR).

arrival_date_year, arrival_date_month, arrival_date_week_number, arrival_date_day_of_month: These variables collectively provide information about the arrival date. This data can be used to analyze trends and seasonality in bookings.

adults, children, babies: These numerical variables show the count of adults, children, and babies for a reservation, respectively. It’ll be pivotal in understanding the typical composition of guests at these hotels.

meal: Character variable representing the type of meal booked. Categories like “BB” represent bed & breakfast, and so on. Examining this can reveal guest preferences and its potential link to other variables such as cancellations.

country: Denotes the country of origin of the guests. This can help in identifying the primary source countries and understand international vs. local guest dynamics.

agent and company: These two character variables show the ID of the travel agency that made the booking and the company ID if it was a corporate booking. The cleaned dataset now has the “NULL” placeholders replaced with NA for these variables.

adr: Stands for Average Daily Rate. It’s a numerical representation of the average booking cost per day. It can be analyzed against various factors like lead time, type of guest, month of booking, etc., to gain pricing insights.

reservation_status: This character variable provides the latest status of the reservation, with categories like “Check-Out” indicating successful stays.

Proposed Exploratory Data Analysis:

4.1 Different Ways To Look At This Data

  • I organized the data by hotel type to understand if there are distinct trends between City and Resort hotels.
  • I further break down the Number of Arrivals by month to identify any seasonal trends.
  • Then I try to find the main customers’ nations
# Load necessary libraries
library(ggplot2)
library(dplyr)

ggplot(hotels_data, aes(x = hotel)) +
  geom_bar(fill = "skyblue", color = "black") +
  theme_minimal() +
  labs(title = "Distribution of Hotel Types", x = "Hotel Type", y = "Count")

#Arrivals by Month
ggplot(hotels_data, aes(x = arrival_date_month)) +
  geom_bar(fill = "coral", color = "black") +
  theme_minimal() +
  labs(title = "Number of Arrivals by Month", x = "Month", y = "Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#Top 10 Countries of Guests:
top_countries <- head(sort(table(hotels_data$country), decreasing = TRUE), 10)
ggplot(as.data.frame(top_countries), aes(x = Var1, y = Freq)) +
  geom_bar(stat = "identity", fill = "lightgreen", color = "black") +
  theme_minimal() +
  labs(title = "Top 10 Countries of Guests", x = "Country", y = "Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

  • Based on the graph, we can know the peak season is July & August
  • After that I will find the Cancellation Status (Cancellation Rate) to future analyze the data
  • After we know the Cancellation Status during peak season, we can further analyze which county have lowest cancellation rate inorder to better resolve problem.
city_hotels_peak_season <- hotels_data %>%
  filter(hotel == "City Hotel", arrival_date_month %in% c("July", "August"))

ggplot(city_hotels_peak_season, aes(x = arrival_date_month, fill = factor(is_canceled))) +
  geom_bar(position = "fill", color = "black") +
  scale_fill_manual(values = c("forestgreen", "firebrick"),
                    name = "Cancellation Status",
                    breaks = c(0, 1),
                    labels = c("Not Canceled", "Canceled")) +
  theme_minimal() +
  labs(title = "Cancellation Status for City Hotels during Peak Season", 
       x = "Arrival Month", 
       y = "Proportion of Bookings") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Calculate cancellation rate for each country
cancellation_rate <- city_hotels_peak_season %>%
  group_by(country) %>%
  summarize(total_bookings = n(),
            cancellations = sum(is_canceled),
            cancellation_rate = cancellations / total_bookings) %>%
# Filter countries with at least 50 bookings for better representation
  filter(total_bookings >= 50) 

# Top 10 countries with the lowest cancellation rate
top_countries_lowest_cancellation <- cancellation_rate %>%
  arrange(cancellation_rate) %>%
  head(10)

# Plotting
ggplot(top_countries_lowest_cancellation, aes(x = reorder(country, cancellation_rate), y = cancellation_rate)) +
  geom_bar(stat = "identity", fill = "lightgreen", color = "black") +
  theme_minimal() +
  labs(title = "Top 10 Countries with Lowest Cancellation Rate during Peak Season",
       x = "Country", 
       y = "Cancellation Rate") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Data Analysis Summary

  • Peak Season Insights: The data clearly indicates that July and August are the peak months for hotel bookings. This finding aligns with typical summer vacation trends in many parts of the world, underscoring the importance of these months for the hotel industry.

  • Hotel Preference: A significant majority of the bookings are for City hotels. This suggests that travelers, perhaps both leisure and business, have a preference for staying within city limits, valuing the convenience and accessibility of urban accommodations.

  • Cancellation Trends: A deeper dive into the data during the peak season revealed a concerning trend: over 25% of the bookings for City hotels in July and August are canceled. Such a high cancellation rate during the most lucrative months can have significant financial implications for hoteliers.

  • Guest Nationalities and Cancellation Patterns: By juxtaposing the data on the top 10 countries with the highest number of guests against their respective cancellation rates, a fascinating pattern emerged. Among the top guest nationalities, guests from Germany exhibited a notably lower cancellation rate. Furthermore, the volume of guests from Germany is reasonably high, making them a particularly valuable demographic for the hotel industry.

Recommendations:

  • Targeted Marketing in Germany: Given the loyalty and volume of German guests, hotels should consider devising special promotions, packages, or loyalty programs tailored for this demographic. By catering to their preferences and ensuring they feel valued, there’s a potential to further boost bookings and reduce cancellations from this segment.

  • Cancellation Policies: With a cancellation rate surpassing 25% during peak months, hotels might need to revisit their cancellation policies. Implementing stricter policies or offering incentives for non-cancellation might help in reducing this number.

4.2 Plots and Tables for Illustrative Findings

Bar Graphs: Ideal for showing the distribution of categorical variables, such as hotel types, monthly arrivals, and top guest nationalities.

4.3 Further Learning

I need to learn more ways about plots and tables in order to better demonstrate the data. And also more complex way to analyze data.