Hotels often struggle with unpredictable cancellations. The level of occupancy in a hotel impacts the level of staff needed on hand, the pricing of rooms, and overall operational profitability. The dataset allows us to analyze the needs of two separate hotel types: resort and city.
This project looks at:
We’ll also explore how other variables influence these outcomes, such as:
The goal is to identify trends and determine what kind of customers hotels should be targeting.
We are also asking: What should each hotel type’s cancellation policy be? Understanding these patterns can help hotels manage operations more effectively, reduce last-minute cancellations, and make smarter decisions around staffing and pricing.
We will conduct an exploratory data analysis using the
TidyTuesday dataset from February 11, 2020, available
from the rfordatascience
GitHub repository.
The steps in the analysis will include:
Our main approach is to uncover trends in:
This includes identifying which variables are most associated with cancellations, exploring how these trends differ between resort and city hotels, and outlining potential patterns in customer behavior.
Depending on what we find, we may explore basic predictive models like logistic regression or decision trees. However, the primary focus will remain on clear exploratory analysis.
This analysis helps highlight consumer behaviors that influence hotel bookings. These insights can help hotels:
This also impacts consumers. Booking behaviors affect room availability, dynamic pricing, and cancellation policies. If hotels can reduce unpredictable cancellations, it can improve the experience for both the business and the guest.
##2.1 All packages used are loaded upfront so the reader knows which are required to replicate the analysis.
getwd()
## [1] "C:/Users/merkf/OneDrive/Documents/MSBA 660"
setwd("~/MSBA 660")
hotels <- read.csv("hotels.csv")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.2
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(stringr)
library(tidyverse)
Tidyverse holds a large variety of packages including ggplot2 which is intended to visualize data in a variety of graphs and charts.
library(lubridate)
Lubridate helps format and organize date and time columns. Since our data is centered around booking dates this will ensure that we are working with uniform numbers rather than character strings.
The data comes from a Data article titled, “Hotel booking demand datasets,” by Antonio, Almeida and Nunes in 2019. This became available in GitHub under Hotels collected in 2020.
The data set is compiled from two different hotels, wherein one of them is a resort hotel (named H1) located in the region of Algarve, Portugal and a city hotel (H2) is in Lisbon, Portugal. Both data were combined to make the entirety of it starting from July 1, 2015 to August 31, 2017. They both share the same structure with 31 variables. However, the observations differ between the hotels. H1 has 40,060 observations and H2 has 79,330 observations.
As the raw data was initially retrieved from a data article, then made available on GitHub, this dataset is mostly cleaned and prepared. The original authors have decided on variables that can easily match between two hotels, which makes it easy to form connections. However, the dataset is not perfect. From the group’s initial prep, we have found significant issues in it. For one, there were several records of guests who had no days stayed. Or that there were no adults, children, or babies recorded staying at the hotel. Another issue to note is that under ADR (Average Daily Rate) has guests paying negative dollars.
colSums(is.na(hotels))
## hotel is_canceled
## 0 0
## lead_time arrival_date_year
## 0 0
## arrival_date_month arrival_date_week_number
## 0 0
## arrival_date_day_of_month stays_in_weekend_nights
## 0 0
## stays_in_week_nights adults
## 0 0
## children babies
## 4 0
## meal country
## 0 0
## market_segment distribution_channel
## 0 0
## is_repeated_guest previous_cancellations
## 0 0
## previous_bookings_not_canceled reserved_room_type
## 0 0
## assigned_room_type booking_changes
## 0 0
## deposit_type agent
## 0 0
## company days_in_waiting_list
## 0 0
## customer_type adr
## 0 0
## required_car_parking_spaces total_of_special_requests
## 0 0
## reservation_status reservation_status_date
## 0 0
hotels$children[is.na(hotels$children)] <- median(hotels$children, na.rm = TRUE)
We impute df$children with the median (0) to account for the missing values
hotels %>%
mutate(total_guests = coalesce(adults,0) + coalesce(children,0) + coalesce(babies,0)) %>%
filter(total_guests == 0)
hotels <- hotels %>%
mutate(total_guests = coalesce(adults,0) + coalesce(children,0) + coalesce(babies,0)) %>%
filter(total_guests > 0)
min(hotels$adr, na.rm = TRUE)
## [1] -6.38
max(hotels$adr, na.rm = TRUE)
## [1] 5400
-Average daily rate is likely never below 0, and is likely never 5400 dollars. We will cap outliers in this output.
hotels <- hotels %>%
mutate(
adr = as.numeric(adr),
adr = if_else(adr < 0, NA_real_, adr),
adr = if_else(adr > 600, 600, adr)
)
ncol(hotels)
## [1] 33
nrow(hotels)
## [1] 119210
hotels <- hotels %>%
mutate(
company = na_if(company, "NULL"),
company = na_if(company, ""),
company = as.integer(company),
company = replace_na(company, 0L),
has_company = if_else(company == 0, 0L, 1L))
hotels <- hotels %>%
mutate(
agent = as.integer(agent),
agent = replace_na(agent, 0L),
has_agent = if_else(agent == 0, 0L, 1L)
)
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `agent = as.integer(agent)`.
## Caused by warning:
## ! NAs introduced by coercion
hotels$agent
-Some data in hotels$stays_in_weekend_nights/stays_in_week_nights amounts to 0. We should likely just remove this data.
hotels %>%
mutate(total_nights = stays_in_weekend_nights + stays_in_week_nights) %>%
filter(total_nights == 0 | total_nights > 60)
hotels <- hotels %>%
mutate(total_nights = stays_in_weekend_nights + stays_in_week_nights) %>%
filter(total_nights > 0)
head(hotels, n = 10)
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## 1 Resort Hotel 0 7 2015 July
## 2 Resort Hotel 0 13 2015 July
## 3 Resort Hotel 0 14 2015 July
## 4 Resort Hotel 0 14 2015 July
## 5 Resort Hotel 0 0 2015 July
## 6 Resort Hotel 0 9 2015 July
## 7 Resort Hotel 1 85 2015 July
## 8 Resort Hotel 1 75 2015 July
## 9 Resort Hotel 1 23 2015 July
## 10 Resort Hotel 0 35 2015 July
## arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
## 1 27 1 0
## 2 27 1 0
## 3 27 1 0
## 4 27 1 0
## 5 27 1 0
## 6 27 1 0
## 7 27 1 0
## 8 27 1 0
## 9 27 1 0
## 10 27 1 0
## stays_in_week_nights adults children babies meal country market_segment
## 1 1 1 0 0 BB GBR Direct
## 2 1 1 0 0 BB GBR Corporate
## 3 2 2 0 0 BB GBR Online TA
## 4 2 2 0 0 BB GBR Online TA
## 5 2 2 0 0 BB PRT Direct
## 6 2 2 0 0 FB PRT Direct
## 7 3 2 0 0 BB PRT Online TA
## 8 3 2 0 0 HB PRT Offline TA/TO
## 9 4 2 0 0 BB PRT Online TA
## 10 4 2 0 0 HB PRT Online TA
## distribution_channel is_repeated_guest previous_cancellations
## 1 Direct 0 0
## 2 Corporate 0 0
## 3 TA/TO 0 0
## 4 TA/TO 0 0
## 5 Direct 0 0
## 6 Direct 0 0
## 7 TA/TO 0 0
## 8 TA/TO 0 0
## 9 TA/TO 0 0
## 10 TA/TO 0 0
## previous_bookings_not_canceled reserved_room_type assigned_room_type
## 1 0 A C
## 2 0 A A
## 3 0 A A
## 4 0 A A
## 5 0 C C
## 6 0 C C
## 7 0 A A
## 8 0 D D
## 9 0 E E
## 10 0 D D
## booking_changes deposit_type agent company days_in_waiting_list
## 1 0 No Deposit 0 0 0
## 2 0 No Deposit 304 0 0
## 3 0 No Deposit 240 0 0
## 4 0 No Deposit 240 0 0
## 5 0 No Deposit 0 0 0
## 6 0 No Deposit 303 0 0
## 7 0 No Deposit 240 0 0
## 8 0 No Deposit 15 0 0
## 9 0 No Deposit 240 0 0
## 10 0 No Deposit 240 0 0
## customer_type adr required_car_parking_spaces total_of_special_requests
## 1 Transient 75.0 0 0
## 2 Transient 75.0 0 0
## 3 Transient 98.0 0 1
## 4 Transient 98.0 0 1
## 5 Transient 107.0 0 0
## 6 Transient 103.0 0 1
## 7 Transient 82.0 0 1
## 8 Transient 105.5 0 0
## 9 Transient 123.0 0 0
## 10 Transient 145.0 0 0
## reservation_status reservation_status_date total_guests has_company
## 1 Check-Out 2015-07-02 1 0
## 2 Check-Out 2015-07-02 1 0
## 3 Check-Out 2015-07-03 2 0
## 4 Check-Out 2015-07-03 2 0
## 5 Check-Out 2015-07-03 2 0
## 6 Check-Out 2015-07-03 2 0
## 7 Canceled 2015-05-06 2 0
## 8 Canceled 2015-04-22 2 0
## 9 Canceled 2015-06-23 2 0
## 10 Check-Out 2015-07-05 2 0
## has_agent total_nights
## 1 0 1
## 2 1 1
## 3 1 2
## 4 1 2
## 5 0 2
## 6 1 2
## 7 1 3
## 8 1 3
## 9 1 4
## 10 1 4
cat("Rows:", nrow(hotels), " Columns:", ncol(hotels), "\n\n")
## Rows: 118565 Columns: 36
hotels %>%
select(where(is.numeric)) %>%
summary()
## is_canceled lead_time arrival_date_year arrival_date_week_number
## Min. :0.0000 Min. : 0.0 Min. :2015 Min. : 1.00
## 1st Qu.:0.0000 1st Qu.: 18.0 1st Qu.:2016 1st Qu.:16.00
## Median :0.0000 Median : 70.0 Median :2016 Median :28.00
## Mean :0.3726 Mean :104.5 Mean :2016 Mean :27.16
## 3rd Qu.:1.0000 3rd Qu.:161.0 3rd Qu.:2017 3rd Qu.:38.00
## Max. :1.0000 Max. :709.0 Max. :2017 Max. :53.00
##
## arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights
## Min. : 1.0 Min. : 0.0000 Min. : 0.000
## 1st Qu.: 8.0 1st Qu.: 0.0000 1st Qu.: 1.000
## Median :16.0 Median : 1.0000 Median : 2.000
## Mean :15.8 Mean : 0.9321 Mean : 2.513
## 3rd Qu.:23.0 3rd Qu.: 2.0000 3rd Qu.: 3.000
## Max. :31.0 Max. :19.0000 Max. :50.000
##
## adults children babies is_repeated_guest
## Min. : 0.00 Min. : 0.0000 Min. : 0.000000 Min. :0.00000
## 1st Qu.: 2.00 1st Qu.: 0.0000 1st Qu.: 0.000000 1st Qu.:0.00000
## Median : 2.00 Median : 0.0000 Median : 0.000000 Median :0.00000
## Mean : 1.86 Mean : 0.1042 Mean : 0.007962 Mean :0.02951
## 3rd Qu.: 2.00 3rd Qu.: 0.0000 3rd Qu.: 0.000000 3rd Qu.:0.00000
## Max. :55.00 Max. :10.0000 Max. :10.000000 Max. :1.00000
##
## previous_cancellations previous_bookings_not_canceled booking_changes
## Min. : 0.00000 Min. : 0.0000 Min. : 0.0000
## 1st Qu.: 0.00000 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median : 0.00000 Median : 0.0000 Median : 0.0000
## Mean : 0.08763 Mean : 0.1372 Mean : 0.2184
## 3rd Qu.: 0.00000 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :26.00000 Max. :72.0000 Max. :18.0000
##
## agent company days_in_waiting_list adr
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. : 0.0
## 1st Qu.: 7.00 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 70.0
## Median : 9.00 Median : 0.00 Median : 0.000 Median : 95.0
## Mean : 74.78 Mean : 10.68 Mean : 2.333 Mean :102.5
## 3rd Qu.:152.00 3rd Qu.: 0.00 3rd Qu.: 0.000 3rd Qu.:126.0
## Max. :535.00 Max. :543.00 Max. :391.000 Max. :600.0
## NA's :1
## required_car_parking_spaces total_of_special_requests total_guests
## Min. :0.00000 Min. :0.0000 Min. : 1.000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.: 2.000
## Median :0.00000 Median :0.0000 Median : 2.000
## Mean :0.06276 Mean :0.5719 Mean : 1.972
## 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.: 2.000
## Max. :8.00000 Max. :5.0000 Max. :55.000
##
## has_company has_agent total_nights
## Min. :0.00000 Min. :0.0000 Min. : 1.000
## 1st Qu.:0.00000 1st Qu.:1.0000 1st Qu.: 2.000
## Median :0.00000 Median :1.0000 Median : 3.000
## Mean :0.05647 Mean :0.8644 Mean : 3.445
## 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.: 4.000
## Max. :1.00000 Max. :1.0000 Max. :69.000
##
We plan to uncover new information that is not self-evident, such as:
We will calculate these by combining existing variables. For example,
the total number of days stayed will be the sum of
stays_in_weekend_nights and
stays_in_week_nights. This total stay length will then
inform the total booking cost by multiplying it by the
average_daily_rate.
These uncovered variables will help us assess the monetary impact of cancellations, giving hotels a clearer picture of lost revenue and helping to guide better staffing and pricing decisions.
We also plan to slice the data based on whether a reservation was cancelled or not. This will help us analyze:
This filtering allows us to better identify and target specific guest profiles, raise profits, and support more informed marketing decisions. On the consumer side, this information can provide insight into room availability, price fluctuations, and booking policies.
To summarize the data and answer key questions, we will use a combination of tables and plots that reveal hidden patterns and trends.
We will use the following types of visualizations:
Cancellation Rate by Hotel Type (City vs
Resort)
Type: Stacked Bar Plot
Purpose: To see if city hotels experience more cancellations than resort
hotels.
Monthly Trends in Bookings vs Cancellations
Type: Line Plot
Purpose: To identify seasonal trends. Helps inform when hotels need to
be fully staffed or when guests are most likely to cancel.
Average Daily Rate by Hotel Type
Type: Boxplot
Purpose: To compare how pricing differs between resort and city
hotels.
Lead Time vs Cancellation
Type: Boxplot
Purpose: To determine if cancellations are more likely with longer lead
times.
Special Requests vs Cancellation
Type: Scatter Plot or Boxplot
Purpose: To see if guests who are more engaged (e.g. making special
requests) are less likely to cancel.
Total Number of Guests vs Cancellation
Type: Histogram
Purpose: To explore whether group size affects cancellation
behavior.
Guest Profile vs Cancellation
Type: Histogram
Purpose: To analyze if certain types of guests cancel more often than
others.
Total Nights Stayed vs Cancellation
Type: Histogram
Purpose: To assess if the length of stay is related to
cancellations.
We will include tables such as:
Cancellation by Deposit Type
Purpose: To understand whether requiring a deposit affects the
likelihood of cancellation.
Summary of Guest Profiles by Hotel Type
Purpose: To identify the current demographics of guests staying at city
vs resort hotels.
Right now, we do not know how to do predictive modeling or forecasting. We are able to describe and visualize cancellations, but we cannot predict them or forecast future cancellations and demand.
In addition, other examples show the submitted HTML format with multiple tabs. This is something that will keep our final submission tidy and organized into each section. We will need to learn to do this in order to execute it in the final submission.
Finally, our current output is static. If we were to take this beyond the classroom, it would be useful to create an interactive dashboard using tools like Shiny or flexdashboard. This would allow users to filter and explore the data based on their specific needs.
As mentioned earlier, we are not currently familiar with machine learning techniques like linear regression, discriminant analysis, or cluster analysis. If we wanted to incorporate these approaches to answer deeper questions or build predictive models, we would need to take the time to learn those methods.