In this project, we address a significant problem in the hospitality industry: understanding and optimizing hotel performance. The interest in this problem stems from the industry’s competitive nature, where efficient management of resources, customer satisfaction, and profitability are paramount.

Problem Statement

The objective is to analyze hotel performance, focusing on key metrics like booking patterns, customer segments, seasonal trends, and cancellation rates. This analysis is vital for hotel management and stakeholders who strive to enhance operational efficiency, customer experience, and profitability.

Data and Methodology

The analysis will serve as a comprehensive tool for enhancing various aspects of hotel management, from operational efficiency and booking patterns to cancellation rates and strategic planning. This holistic approach is crucial for staying competitive and successful in the dynamic hospitality industry.

Packages Required

library(tidyverse) # attaches the core tidyverse packages (dplyr, ggplot2, lubridate, and others)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr) # data manipulation (already attached by tidyverse)
library(ggplot2) # data visualization (already attached by tidyverse)
library(lubridate) # date-time handling (already attached by tidyverse)
library(DT) # interactive HTML output tables

Data Preparation

setwd("/Users/lianganni/Desktop/final")

# Load the original data
hotels_data <- read.csv("hotels.csv")

# Initial inspection of the data
# The data set contains 32 variables in total.
str(hotels_data) # structure of the data set: number of variables (columns) and their data types

summary(hotels_data) # per-column summary, which can reveal how missing values are recorded

# Handling missing values
hotels_data$children[is.na(hotels_data$children)] <- median(hotels_data$children, na.rm = TRUE)
# Missing values in children are replaced with the median because only 4 values are missing and the variable is not heavily skewed.

hotels_data$country[is.na(hotels_data$country)] <- 'Unknown'
# Missing country values are filled with 'Unknown', which keeps those rows in the data set and avoids the bias that dropping them could introduce.

# Dropping columns with a high percentage of missing values
hotels_data <- subset(hotels_data, select = -c(agent, company))
# Dropping columns with a high proportion of missing values helps improve data quality and the reliability of the analysis.

# Converting the hotel column to a factor
hotels_data$hotel <- as.factor(hotels_data$hotel)
# Categorical variables are stored as factors so that they are handled correctly in grouping and plotting.

# Show the first 500 observations of the cleaned data (datatable displays 10 rows per page by default)
cleaned_data <- head(hotels_data, 500)
datatable(cleaned_data)
summary(hotels_data)

Once the data cleaning process is complete, we gain a thorough understanding of the dataset. This clarity comes from both a detailed overview of the cleaned data and an in-depth summary of each variable. This preparatory step ensures that the subsequent exploration and analysis of the data are based on accurate, reliable, and relevant information.
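The cleaning steps above rely on is.na() to find missing values. As a hedged sketch, the snippet below shows one way to check how much of each column is missing when some gaps might be stored as the literal text "NULL" rather than NA. Whether hotels.csv encodes missing values this way is an assumption, not something confirmed here, and the toy data frame and its values are made up for illustration.

```r
# Toy stand-in for hotels_data; all values are invented for illustration
toy <- data.frame(
  children = c(0, NA, 2, 0),
  country  = c("PRT", "NULL", "GBR", NA),
  agent    = c("NULL", "NULL", "9", "NULL"),
  stringsAsFactors = FALSE
)

# Share of entries per column that are NA or the literal string "NULL"
# (assumption: "NULL" text may be used as a missing-value marker)
missing_share <- sapply(toy, function(col) mean(is.na(col) | col == "NULL"))
print(missing_share)

# Recode both kinds of missing country values to "Unknown"
toy$country[is.na(toy$country) | toy$country == "NULL"] <- "Unknown"
print(toy$country)
```

If such text markers do appear in the real file, read.csv() also accepts an na.strings argument so that they are read as NA from the start.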

Exploratory Data Analysis

To analyze hotel performance, the sections below examine booking patterns and seasonal trends, customer segments, and cancellation rates in turn.

I do not plan to incorporate linear regression or more advanced models; the analysis is descriptive.

Booking Patterns Analysis

# Convert the month name to a numeric value
hotels_data$arrival_date_month <- match(hotels_data$arrival_date_month, month.name)

# Create the date column from the year, the numeric month, and the day of the month
hotels_data$date <- make_date(hotels_data$arrival_date_year, hotels_data$arrival_date_month, hotels_data$arrival_date_day_of_month)

# Aggregate data by month
monthly_bookings <- hotels_data %>%
  group_by(date = floor_date(date, "month")) %>%
  summarise(bookings = n())

# Time series plot
ggplot(monthly_bookings, aes(x = date, y = bookings)) +
  geom_line() +
  labs(title = "Monthly Booking Trends", x = "Month", y = "Number of Bookings") 

This chart suggests that January is the slowest month for hotel bookings.
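A line chart over consecutive months can be hard to read for seasonality if the data span partial years. As a rough sketch of one way to cross-check the reading above, the snippet below builds dates the same way as the analysis (month name to number, then make_date()) and averages bookings per calendar month across years. The toy data frame and its values are invented for illustration; only the column names follow the analysis.

```r
library(dplyr)
library(lubridate)

# Toy stand-in for hotels_data; all values are invented for illustration
toy <- data.frame(
  arrival_date_year = c(2016, 2016, 2016, 2017, 2017, 2017, 2017, 2017),
  arrival_date_month = c("January", "July", "July", "January",
                         "July", "July", "July", "July"),
  arrival_date_day_of_month = c(5, 10, 20, 3, 1, 2, 15, 28)
)

# Month name -> month number -> Date, mirroring the steps used above
toy$date <- make_date(
  toy$arrival_date_year,
  match(toy$arrival_date_month, month.name),
  toy$arrival_date_day_of_month
)

# Count bookings per year and calendar month, then average across years
seasonal <- toy %>%
  count(year = year(date), month = month(date), name = "bookings") %>%
  group_by(month) %>%
  summarise(avg_bookings = mean(bookings), years_observed = n())

print(seasonal)
```

The years_observed column is kept so that months covered by fewer years can be interpreted with more caution.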

Customer Segments Analysis

ggplot(hotels_data, aes(x = customer_type, y = lead_time, fill = customer_type)) +
  geom_boxplot() +
  labs(title = "Lead Time Distribution by Customer Type", x = "Customer Type", y = "Lead Time") 

hotels_data <- hotels_data %>%
  mutate(total_stay_length = stays_in_weekend_nights + stays_in_week_nights)

# Box plot for stay length by customer type
ggplot(hotels_data, aes(x = customer_type, y = total_stay_length, fill = customer_type)) +
  geom_boxplot() +
  labs(title = "Stay Length Distribution by Customer Type", x = "Customer Type", y = "Stay Length")

These box plots show that contract and transient-party customers tend to book further in advance (longer lead times) than other segments, and that contract customers stay longer than other customer types.
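Box plots are read by eye, so a small numeric summary can help confirm which segments book earliest and stay longest. The following is a minimal sketch under assumed data: the toy data frame is invented, and only the column names match those used in the analysis.

```r
library(dplyr)

# Toy stand-in for hotels_data; all values are invented for illustration
toy <- data.frame(
  customer_type = c("Contract", "Contract", "Transient", "Transient", "Group"),
  lead_time = c(120, 200, 10, 30, 5),
  stays_in_weekend_nights = c(2, 4, 0, 1, 1),
  stays_in_week_nights = c(5, 6, 1, 2, 1)
)

# Median lead time and median total stay per customer type,
# sorted so the segment that books furthest ahead comes first
segment_summary <- toy %>%
  mutate(total_stay_length = stays_in_weekend_nights + stays_in_week_nights) %>%
  group_by(customer_type) %>%
  summarise(
    median_lead_time = median(lead_time),
    median_stay = median(total_stay_length),
    bookings = n()
  ) %>%
  arrange(desc(median_lead_time))

print(segment_summary)
```

Medians are used here because they correspond to the center line of each box and are less sensitive to outliers than means.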

# Total number of adults in bookings with and without children or babies
adults_with_or_without_children <- hotels_data %>%
  mutate(with_children_or_babies = ifelse(children > 0 | babies > 0,
                                          "With Children/Babies",
                                          "Without Children/Babies")) %>%
  group_by(with_children_or_babies) %>%
  summarise(total_adults = sum(adults)) %>%
  ungroup()

ggplot(adults_with_or_without_children, aes(x = with_children_or_babies, y = total_adults, fill = with_children_or_babies)) +
  geom_bar(stat = "identity") +
  labs(title = "Total Number of Adults with and without Children or Babies", 
       x = "Category", 
       y = "Total Number of Adults") 
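Total adult counts depend heavily on how many bookings fall into each group. As a hedged alternative view, the sketch below computes the share of bookings with and without children or babies. The toy data frame is invented for illustration; the grouping rule is the same ifelse() condition used above.

```r
library(dplyr)

# Toy stand-in for hotels_data; all values are invented for illustration
toy <- data.frame(
  adults   = c(2, 2, 1, 2, 2),
  children = c(0, 1, 0, 0, 0),
  babies   = c(0, 0, 0, 1, 0)
)

# Number and share of bookings in each group
family_share <- toy %>%
  mutate(with_children_or_babies = ifelse(children > 0 | babies > 0,
                                          "With Children/Babies",
                                          "Without Children/Babies")) %>%
  count(with_children_or_babies, name = "bookings") %>%
  mutate(share = bookings / sum(bookings))

print(family_share)
```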

Cancellation Rates Analysis

# Cancellation rate analysis
cancellation_rates <- hotels_data %>%
  group_by(hotel) %>%
  summarise(cancellation_rate = mean(is_canceled)) 

ggplot(cancellation_rates, aes(x = hotel, y = cancellation_rate, fill = hotel)) +
  geom_bar(stat = "identity") +
  labs(title = "Cancellation Rates by Hotel Type", x = "Hotel Type", y = "Cancellation Rate") 
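Because is_canceled is coded as 0 or 1, its mean equals the number of cancellations divided by the number of bookings. The sketch below makes that arithmetic explicit on a toy data frame (values invented for illustration) and keeps the underlying counts next to each rate.

```r
library(dplyr)

# Toy stand-in for hotels_data; all values are invented for illustration
toy <- data.frame(
  hotel = c(rep("City Hotel", 4), rep("Resort Hotel", 4)),
  is_canceled = c(1, 1, 0, 0, 0, 0, 0, 1)
)

# Cancellation rate per hotel, with the counts behind each rate
rates <- toy %>%
  group_by(hotel) %>%
  summarise(
    bookings = n(),
    cancellations = sum(is_canceled),
    cancellation_rate = mean(is_canceled)  # same as cancellations / bookings
  )

print(rates)
```

Reporting the counts alongside the rate makes it easier to judge how much weight each rate can bear.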

Summary

The analysis focused on exploring hotel performance data from the ‘hotels.csv’ dataset. The primary objectives were to analyze key metrics such as booking patterns, customer segments, seasonal trends, and cancellation rates. This analysis aimed to provide valuable insights for hotel management and stakeholders to enhance operational efficiency, improve customer experience, and boost profitability.

Addressing the Problem Statement:

Interesting Insights from the Analysis: The analysis identified key booking periods and seasonal trends, highlighting times of peak demand. It also showed how average stay duration and lead time differ across customer types.

Implications for the Consumer of the Analysis: For hotel managers and stakeholders, the findings have the following implications.

Strategic Planning: The insights can inform strategic decisions such as pricing, marketing, and customer relationship management.

Targeted Marketing: Understanding customer segments can help tailor marketing efforts and offers.

Operational Adjustments: Insights into peak periods can aid in staffing and resource allocation.

Policy Development: Cancellation trends can inform policies to mitigate losses.

Limitations of the Analysis: The analysis is limited to the data available in the ‘hotels.csv’ dataset. It is descriptive and exploratory, and no predictive modeling was used. The findings are specific to this dataset and may not generalize to all hotels or to broader market trends.