The hotel industry faces challenges in optimizing bookings during peak seasons. Cancellations can significantly impact potential revenue. Which markets should hotels focus on to ensure stable bookings?
I plan to analyze the “hotels.csv” data to explore the relationships between various hotel attributes.
This analysis aims to provide hoteliers with insights into which markets might offer more reliable bookings during peak seasons, leading to optimized revenue.
install.packages("rmarkdown")
install.packages("tidyverse")
install.packages("readr")
install.packages("knitr")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
library(knitr)
tidyverse: For data manipulation and
visualization.readr: To import the “hotels.csv” data.knitr: To display tables and data frames nicely.setwd('~/Desktop/B BUS 301 A/Mid-term/hotels')
hotels_data <-readr:: read_csv("hotels.csv")
## Rows: 119390 Columns: 32
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (13): hotel, arrival_date_month, meal, country, market_segment, distrib...
## dbl (18): is_canceled, lead_time, arrival_date_year, arrival_date_week_numb...
## date (1): reservation_status_date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
##str(hotels_data)
##summary(hotels_data)
ncol(hotels_data)
## [1] 32
sum(is.na(hotels_data))
## [1] 4
The dataset we are using is named hotels.csv. Upon
loading and inspecting the data:
colnames(hotels_data)[apply(hotels_data, 2, function(x) any(is.na(x)))]
## [1] "children"
# Median imputation for 'children' column
hotels_data$children[is.na(hotels_data$children)] <- median(hotels_data$children, na.rm = TRUE)
# Convert 'arrival_date_year' to a factor
hotels_data$arrival_date_year <- as.factor(hotels_data$arrival_date_year)
# Handle "NULL" values in 'agent' and 'company' columns
hotels_data$agent[hotels_data$agent == "NULL"] <- NA
hotels_data$company[hotels_data$company == "NULL"] <- NA
head(hotels_data, 10)
## # A tibble: 10 × 32
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## <chr> <dbl> <dbl> <fct> <chr>
## 1 Resort Hotel 0 342 2015 July
## 2 Resort Hotel 0 737 2015 July
## 3 Resort Hotel 0 7 2015 July
## 4 Resort Hotel 0 13 2015 July
## 5 Resort Hotel 0 14 2015 July
## 6 Resort Hotel 0 14 2015 July
## 7 Resort Hotel 0 0 2015 July
## 8 Resort Hotel 0 9 2015 July
## 9 Resort Hotel 1 85 2015 July
## 10 Resort Hotel 1 75 2015 July
## # ℹ 27 more variables: arrival_date_week_number <dbl>,
## # arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
## # stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
## # meal <chr>, country <chr>, market_segment <chr>,
## # distribution_channel <chr>, is_repeated_guest <dbl>,
## # previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## # reserved_room_type <chr>, assigned_room_type <chr>, …
summary(hotels_data)
## hotel is_canceled lead_time arrival_date_year
## Length:119390 Min. :0.0000 Min. : 0 2015:21996
## Class :character 1st Qu.:0.0000 1st Qu.: 18 2016:56707
## Mode :character Median :0.0000 Median : 69 2017:40687
## Mean :0.3704 Mean :104
## 3rd Qu.:1.0000 3rd Qu.:160
## Max. :1.0000 Max. :737
## arrival_date_month arrival_date_week_number arrival_date_day_of_month
## Length:119390 Min. : 1.00 Min. : 1.0
## Class :character 1st Qu.:16.00 1st Qu.: 8.0
## Mode :character Median :28.00 Median :16.0
## Mean :27.17 Mean :15.8
## 3rd Qu.:38.00 3rd Qu.:23.0
## Max. :53.00 Max. :31.0
## stays_in_weekend_nights stays_in_week_nights adults
## Min. : 0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 1.0 1st Qu.: 2.000
## Median : 1.0000 Median : 2.0 Median : 2.000
## Mean : 0.9276 Mean : 2.5 Mean : 1.856
## 3rd Qu.: 2.0000 3rd Qu.: 3.0 3rd Qu.: 2.000
## Max. :19.0000 Max. :50.0 Max. :55.000
## children babies meal country
## Min. : 0.0000 Min. : 0.000000 Length:119390 Length:119390
## 1st Qu.: 0.0000 1st Qu.: 0.000000 Class :character Class :character
## Median : 0.0000 Median : 0.000000 Mode :character Mode :character
## Mean : 0.1039 Mean : 0.007949
## 3rd Qu.: 0.0000 3rd Qu.: 0.000000
## Max. :10.0000 Max. :10.000000
## market_segment distribution_channel is_repeated_guest
## Length:119390 Length:119390 Min. :0.00000
## Class :character Class :character 1st Qu.:0.00000
## Mode :character Mode :character Median :0.00000
## Mean :0.03191
## 3rd Qu.:0.00000
## Max. :1.00000
## previous_cancellations previous_bookings_not_canceled reserved_room_type
## Min. : 0.00000 Min. : 0.0000 Length:119390
## 1st Qu.: 0.00000 1st Qu.: 0.0000 Class :character
## Median : 0.00000 Median : 0.0000 Mode :character
## Mean : 0.08712 Mean : 0.1371
## 3rd Qu.: 0.00000 3rd Qu.: 0.0000
## Max. :26.00000 Max. :72.0000
## assigned_room_type booking_changes deposit_type agent
## Length:119390 Min. : 0.0000 Length:119390 Length:119390
## Class :character 1st Qu.: 0.0000 Class :character Class :character
## Mode :character Median : 0.0000 Mode :character Mode :character
## Mean : 0.2211
## 3rd Qu.: 0.0000
## Max. :21.0000
## company days_in_waiting_list customer_type adr
## Length:119390 Min. : 0.000 Length:119390 Min. : -6.38
## Class :character 1st Qu.: 0.000 Class :character 1st Qu.: 69.29
## Mode :character Median : 0.000 Mode :character Median : 94.58
## Mean : 2.321 Mean : 101.83
## 3rd Qu.: 0.000 3rd Qu.: 126.00
## Max. :391.000 Max. :5400.00
## required_car_parking_spaces total_of_special_requests reservation_status
## Min. :0.00000 Min. :0.0000 Length:119390
## 1st Qu.:0.00000 1st Qu.:0.0000 Class :character
## Median :0.00000 Median :0.0000 Mode :character
## Mean :0.06252 Mean :0.5714
## 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :8.00000 Max. :5.0000
## reservation_status_date
## Min. :2014-10-17
## 1st Qu.:2016-02-01
## Median :2016-08-07
## Mean :2016-07-30
## 3rd Qu.:2017-02-08
## Max. :2017-09-14
hotel: This character variable indicates the type of hotel, with common categories being “Resort Hotel” and others. For analysis, it will be essential to examine if there are significant differences between different hotel types in relation to other variables.
is_canceled: A numerical variable that indicates whether a
reservation was canceled (1) or not (0). This
will be a critical variable if one is investigating factors leading to
cancellations.
lead_time: Numerically represents the number of days that elapsed between the booking date and the actual arrival date. It will be interesting to investigate if longer lead times lead to higher cancellations or if they affect the Average Daily Rate (ADR).
arrival_date_year, arrival_date_month, arrival_date_week_number, arrival_date_day_of_month: These variables collectively provide information about the arrival date. This data can be used to analyze trends and seasonality in bookings.
adults, children, babies: These numerical variables show the count of adults, children, and babies for a reservation, respectively. It’ll be pivotal in understanding the typical composition of guests at these hotels.
meal: Character variable representing the type of meal booked. Categories like “BB” represent bed & breakfast, and so on. Examining this can reveal guest preferences and its potential link to other variables such as cancellations.
country: Denotes the country of origin of the guests. This can help in identifying the primary source countries and understand international vs. local guest dynamics.
agent and company: These two character variables show the ID of the travel agency that made the booking and the company ID if it was a corporate booking. The cleaned dataset now has the “NULL” placeholders replaced with NA for these variables.
adr: Stands for Average Daily Rate. It’s a numerical representation of the average booking cost per day. It can be analyzed against various factors like lead time, type of guest, month of booking, etc., to gain pricing insights.
reservation_status: This character variable provides the latest status of the reservation, with categories like “Check-Out” indicating successful stays.
# Load necessary libraries
library(ggplot2)
library(dplyr)
ggplot(hotels_data, aes(x = hotel)) +
geom_bar(fill = "skyblue", color = "black") +
theme_minimal() +
labs(title = "Distribution of Hotel Types", x = "Hotel Type", y = "Count")
#Arrivals by Month
ggplot(hotels_data, aes(x = arrival_date_month)) +
geom_bar(fill = "coral", color = "black") +
theme_minimal() +
labs(title = "Number of Arrivals by Month", x = "Month", y = "Count") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
#Top 10 Countries of Guests:
top_countries <- head(sort(table(hotels_data$country), decreasing = TRUE), 10)
ggplot(as.data.frame(top_countries), aes(x = Var1, y = Freq)) +
geom_bar(stat = "identity", fill = "lightgreen", color = "black") +
theme_minimal() +
labs(title = "Top 10 Countries of Guests", x = "Country", y = "Count") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
city_hotels_peak_season <- hotels_data %>%
filter(hotel == "City Hotel", arrival_date_month %in% c("July", "August"))
ggplot(city_hotels_peak_season, aes(x = arrival_date_month, fill = factor(is_canceled))) +
geom_bar(position = "fill", color = "black") +
scale_fill_manual(values = c("forestgreen", "firebrick"),
name = "Cancellation Status",
breaks = c(0, 1),
labels = c("Not Canceled", "Canceled")) +
theme_minimal() +
labs(title = "Cancellation Status for City Hotels during Peak Season",
x = "Arrival Month",
y = "Proportion of Bookings") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Calculate cancellation rate for each country
cancellation_rate <- city_hotels_peak_season %>%
group_by(country) %>%
summarize(total_bookings = n(),
cancellations = sum(is_canceled),
cancellation_rate = cancellations / total_bookings) %>%
# Filter countries with at least 50 bookings for better representation
filter(total_bookings >= 50)
# Top 10 countries with the lowest cancellation rate
top_countries_lowest_cancellation <- cancellation_rate %>%
arrange(cancellation_rate) %>%
head(10)
# Plotting
ggplot(top_countries_lowest_cancellation, aes(x = reorder(country, cancellation_rate), y = cancellation_rate)) +
geom_bar(stat = "identity", fill = "lightgreen", color = "black") +
theme_minimal() +
labs(title = "Top 10 Countries with Lowest Cancellation Rate during Peak Season",
x = "Country",
y = "Cancellation Rate") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Data Analysis Summary
Peak Season Insights: The data clearly indicates that July and August are the peak months for hotel bookings. This finding aligns with typical summer vacation trends in many parts of the world, underscoring the importance of these months for the hotel industry.
Hotel Preference: A significant majority of the bookings are for City hotels. This suggests that travelers, perhaps both leisure and business, have a preference for staying within city limits, valuing the convenience and accessibility of urban accommodations.
Cancellation Trends: A deeper dive into the data during the peak season revealed a concerning trend: over 25% of the bookings for City hotels in July and August are canceled. Such a high cancellation rate during the most lucrative months can have significant financial implications for hoteliers.
Guest Nationalities and Cancellation Patterns: By juxtaposing the data on the top 10 countries with the highest number of guests against their respective cancellation rates, a fascinating pattern emerged. Among the top guest nationalities, guests from Germany exhibited a notably lower cancellation rate. Furthermore, the volume of guests from Germany is reasonably high, making them a particularly valuable demographic for the hotel industry.
Recommendations:
Targeted Marketing in Germany: Given the loyalty and volume of German guests, hotels should consider devising special promotions, packages, or loyalty programs tailored for this demographic. By catering to their preferences and ensuring they feel valued, there’s a potential to further boost bookings and reduce cancellations from this segment.
Cancellation Policies: With a cancellation rate surpassing 25% during peak months, hotels might need to revisit their cancellation policies. Implementing stricter policies or offering incentives for non-cancellation might help in reducing this number.
Bar Graphs: Ideal for showing the distribution of categorical variables, such as hotel types, monthly arrivals, and top guest nationalities.
I need to learn more ways about plots and tables in order to better demonstrate the data. And also more complex way to analyze data.