Introduction

1.1 Provide an introduction that explains the problem statement you are addressing. Why should I be interested in this? 

Hotels often struggle with unpredictable cancellations. The level of occupancy in a hotel impacts the level of staff needed on hand, the pricing of rooms, and overall operational profitability. The dataset allows us to analyze the needs of two separate hotel types: resort and city.

This project looks at:

  • Reservations
  • Cancellations
  • Actual stays

We’ll also explore how other variables influence these outcomes, such as:

  • Presence of children
  • Total guests
  • Length of stay
  • Time of booking and arrival
  • Country of origin

The goal is to identify trends and determine what kind of customers hotels should be targeting.

We are also asking: What should each hotel type’s cancellation policy be? Understanding these patterns can help hotels manage operations more effectively, reduce last-minute cancellations, and make smarter decisions around staffing and pricing.

1.2 Provide a short explanation of how you plan to address this problem statement (the data used and the methodology employed)

We will conduct an exploratory data analysis using the TidyTuesday dataset from February 11, 2020, available from the rfordatascience GitHub repository.

The steps in the analysis will include:

  • Cleaning the data
  • Creating meaningful features
  • Visualizing key patterns in reservations, cancellations, and stays

1.3 Discuss your current proposed approach/analytic technique you think will address (fully or partially) this problem. 

Our main approach is to uncover trends in:

  • Cancellations
  • Guest profiles
  • Revenue-related measures

This includes identifying which variables are most associated with cancellations, exploring how these trends differ between resort and city hotels, and outlining potential patterns in customer behavior.

Depending on what we find, we may explore basic predictive models like logistic regression or decision trees. However, the primary focus will remain on clear exploratory analysis.

1.4 Explain how your analysis will help the consumer of your analysis. 

This analysis helps highlight consumer behaviors that influence hotel bookings. These insights can help hotels:

  • Anticipate cancellations
  • Make better pricing decisions
  • Plan staffing levels more efficiently
  • Build stronger relationships with the right customers

This also impacts consumers. Booking behaviors affect room availability, dynamic pricing, and cancellation policies. If hotels can reduce unpredictable cancellations, it can improve the experience for both the business and the guest.

##2.1 All packages used are loaded upfront so the reader knows which are required to replicate the analysis.

getwd()
## [1] "C:/Users/merkf/OneDrive/Documents/MSBA 660"
setwd("~/MSBA 660")

hotels <- read.csv("hotels.csv")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(stringr)

2.2 Messages and warnings resulting from loading the package are suppressed.

2.3 Explanation is provided regarding the purpose of each package (there are over 10,000 packages, don’t assume that I know why you loaded each package).

library(tidyverse)

Tidyverse holds a large variety of packages including ggplot2 which is intended to visualize data in a variety of graphs and charts.

library(lubridate)

Lubridate helps format and organize date and time columns. Since our data is centered around booking dates this will ensure that we are working with uniform numbers rather than character strings.

3.1 Original source where the data was obtained is cited and, if possible, hyperlinked.

The data comes from a Data article titled, “Hotel booking demand datasets,” by Antonio, Almeida and Nunes in 2019. This became available in GitHub under Hotels collected in 2020.

3.2 Source data is thoroughly explained

The data set is compiled from two different hotels, wherein one of them is a resort hotel (named H1) located in the region of Algarve, Portugal and a city hotel (H2) is in Lisbon, Portugal. Both data were combined to make the entirety of it starting from July 1, 2015 to August 31, 2017. They both share the same structure with 31 variables. However, the observations differ between the hotels. H1 has 40,060 observations and H2 has 79,330 observations.

As the raw data was initially retrieved from a data article, then made available on GitHub, this dataset is mostly cleaned and prepared. The original authors have decided on variables that can easily match between two hotels, which makes it easy to form connections. However, the dataset is not perfect. From the group’s initial prep, we have found significant issues in it. For one, there were several records of guests who had no days stayed. Or that there were no adults, children, or babies recorded staying at the hotel. Another issue to note is that under ADR (Average Daily Rate) has guests paying negative dollars.

3.3 Data importing and cleaning steps are explained in the text and follow a logical process.

colSums(is.na(hotels))
##                          hotel                    is_canceled 
##                              0                              0 
##                      lead_time              arrival_date_year 
##                              0                              0 
##             arrival_date_month       arrival_date_week_number 
##                              0                              0 
##      arrival_date_day_of_month        stays_in_weekend_nights 
##                              0                              0 
##           stays_in_week_nights                         adults 
##                              0                              0 
##                       children                         babies 
##                              4                              0 
##                           meal                        country 
##                              0                              0 
##                 market_segment           distribution_channel 
##                              0                              0 
##              is_repeated_guest         previous_cancellations 
##                              0                              0 
## previous_bookings_not_canceled             reserved_room_type 
##                              0                              0 
##             assigned_room_type                booking_changes 
##                              0                              0 
##                   deposit_type                          agent 
##                              0                              0 
##                        company           days_in_waiting_list 
##                              0                              0 
##                  customer_type                            adr 
##                              0                              0 
##    required_car_parking_spaces      total_of_special_requests 
##                              0                              0 
##             reservation_status        reservation_status_date 
##                              0                              0
  • There are 4 missing values in the data set, which are all in the children variable
hotels$children[is.na(hotels$children)] <- median(hotels$children, na.rm = TRUE) 

We impute df$children with the median (0) to account for the missing values

  • There are 119390 records and 32 variables in this data set.
hotels %>%
  mutate(total_guests = coalesce(adults,0) + coalesce(children,0) + coalesce(babies,0)) %>%
  filter(total_guests == 0)
  • There are 180 listings where the total number of babies, children and adults = 0. 180/119k rows is a very small amount of the data, it seems best to drop them.
hotels <- hotels %>%
  mutate(total_guests = coalesce(adults,0) + coalesce(children,0) + coalesce(babies,0)) %>%
  filter(total_guests > 0)
min(hotels$adr, na.rm = TRUE)
## [1] -6.38
max(hotels$adr, na.rm = TRUE)
## [1] 5400

-Average daily rate is likely never below 0, and is likely never 5400 dollars. We will cap outliers in this output.

hotels <- hotels %>%
  mutate(
    adr = as.numeric(adr),
    adr = if_else(adr < 0, NA_real_, adr),
    adr = if_else(adr > 600, 600, adr)      
  )
ncol(hotels)
## [1] 33
nrow(hotels)
## [1] 119210
hotels <- hotels %>%
  mutate(
    company = na_if(company, "NULL"),
    company = na_if(company, ""),           
    company = as.integer(company),          
    company = replace_na(company, 0L),      
    has_company = if_else(company == 0, 0L, 1L))
  • hotels$company has many nulls in the data, it is likely better just to replace the NULLs with a numeric 0.
hotels <- hotels %>%
  mutate(
    agent = as.integer(agent),                
    agent = replace_na(agent, 0L),                   
    has_agent = if_else(agent == 0, 0L, 1L)          
  )
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `agent = as.integer(agent)`.
## Caused by warning:
## ! NAs introduced by coercion
  • The same thing with agent
hotels$agent

-Some data in hotels$stays_in_weekend_nights/stays_in_week_nights amounts to 0. We should likely just remove this data.

hotels %>%
  mutate(total_nights = stays_in_weekend_nights + stays_in_week_nights) %>%
  filter(total_nights == 0 | total_nights > 60)
hotels <- hotels %>%
  mutate(total_nights = stays_in_weekend_nights + stays_in_week_nights) %>%
  filter(total_nights > 0)

3.4 Once your data is clean, show what the final data set looks like. However, do not print off a data frame with 200+ rows; show me the data in the most condensed form possible.

head(hotels, n = 10)
##           hotel is_canceled lead_time arrival_date_year arrival_date_month
## 1  Resort Hotel           0         7              2015               July
## 2  Resort Hotel           0        13              2015               July
## 3  Resort Hotel           0        14              2015               July
## 4  Resort Hotel           0        14              2015               July
## 5  Resort Hotel           0         0              2015               July
## 6  Resort Hotel           0         9              2015               July
## 7  Resort Hotel           1        85              2015               July
## 8  Resort Hotel           1        75              2015               July
## 9  Resort Hotel           1        23              2015               July
## 10 Resort Hotel           0        35              2015               July
##    arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
## 1                        27                         1                       0
## 2                        27                         1                       0
## 3                        27                         1                       0
## 4                        27                         1                       0
## 5                        27                         1                       0
## 6                        27                         1                       0
## 7                        27                         1                       0
## 8                        27                         1                       0
## 9                        27                         1                       0
## 10                       27                         1                       0
##    stays_in_week_nights adults children babies meal country market_segment
## 1                     1      1        0      0   BB     GBR         Direct
## 2                     1      1        0      0   BB     GBR      Corporate
## 3                     2      2        0      0   BB     GBR      Online TA
## 4                     2      2        0      0   BB     GBR      Online TA
## 5                     2      2        0      0   BB     PRT         Direct
## 6                     2      2        0      0   FB     PRT         Direct
## 7                     3      2        0      0   BB     PRT      Online TA
## 8                     3      2        0      0   HB     PRT  Offline TA/TO
## 9                     4      2        0      0   BB     PRT      Online TA
## 10                    4      2        0      0   HB     PRT      Online TA
##    distribution_channel is_repeated_guest previous_cancellations
## 1                Direct                 0                      0
## 2             Corporate                 0                      0
## 3                 TA/TO                 0                      0
## 4                 TA/TO                 0                      0
## 5                Direct                 0                      0
## 6                Direct                 0                      0
## 7                 TA/TO                 0                      0
## 8                 TA/TO                 0                      0
## 9                 TA/TO                 0                      0
## 10                TA/TO                 0                      0
##    previous_bookings_not_canceled reserved_room_type assigned_room_type
## 1                               0                  A                  C
## 2                               0                  A                  A
## 3                               0                  A                  A
## 4                               0                  A                  A
## 5                               0                  C                  C
## 6                               0                  C                  C
## 7                               0                  A                  A
## 8                               0                  D                  D
## 9                               0                  E                  E
## 10                              0                  D                  D
##    booking_changes deposit_type agent company days_in_waiting_list
## 1                0   No Deposit     0       0                    0
## 2                0   No Deposit   304       0                    0
## 3                0   No Deposit   240       0                    0
## 4                0   No Deposit   240       0                    0
## 5                0   No Deposit     0       0                    0
## 6                0   No Deposit   303       0                    0
## 7                0   No Deposit   240       0                    0
## 8                0   No Deposit    15       0                    0
## 9                0   No Deposit   240       0                    0
## 10               0   No Deposit   240       0                    0
##    customer_type   adr required_car_parking_spaces total_of_special_requests
## 1      Transient  75.0                           0                         0
## 2      Transient  75.0                           0                         0
## 3      Transient  98.0                           0                         1
## 4      Transient  98.0                           0                         1
## 5      Transient 107.0                           0                         0
## 6      Transient 103.0                           0                         1
## 7      Transient  82.0                           0                         1
## 8      Transient 105.5                           0                         0
## 9      Transient 123.0                           0                         0
## 10     Transient 145.0                           0                         0
##    reservation_status reservation_status_date total_guests has_company
## 1           Check-Out              2015-07-02            1           0
## 2           Check-Out              2015-07-02            1           0
## 3           Check-Out              2015-07-03            2           0
## 4           Check-Out              2015-07-03            2           0
## 5           Check-Out              2015-07-03            2           0
## 6           Check-Out              2015-07-03            2           0
## 7            Canceled              2015-05-06            2           0
## 8            Canceled              2015-04-22            2           0
## 9            Canceled              2015-06-23            2           0
## 10          Check-Out              2015-07-05            2           0
##    has_agent total_nights
## 1          0            1
## 2          1            1
## 3          1            2
## 4          1            2
## 5          0            2
## 6          1            2
## 7          1            3
## 8          1            3
## 9          1            4
## 10         1            4
cat("Rows:", nrow(hotels), "  Columns:", ncol(hotels), "\n\n")
## Rows: 118565   Columns: 36

3.5 Provide summary information about the variables of concern in your cleaned data set. Do not just print off a bunch of code chunks with str(), summary(), etc. Rather, provide me with a consolidated explanation, either with a table that provides summary info for each variable or a nicely written summary paragraph with inline code.

hotels %>%
  select(where(is.numeric)) %>%
  summary()
##   is_canceled       lead_time     arrival_date_year arrival_date_week_number
##  Min.   :0.0000   Min.   :  0.0   Min.   :2015      Min.   : 1.00           
##  1st Qu.:0.0000   1st Qu.: 18.0   1st Qu.:2016      1st Qu.:16.00           
##  Median :0.0000   Median : 70.0   Median :2016      Median :28.00           
##  Mean   :0.3726   Mean   :104.5   Mean   :2016      Mean   :27.16           
##  3rd Qu.:1.0000   3rd Qu.:161.0   3rd Qu.:2017      3rd Qu.:38.00           
##  Max.   :1.0000   Max.   :709.0   Max.   :2017      Max.   :53.00           
##                                                                             
##  arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights
##  Min.   : 1.0              Min.   : 0.0000         Min.   : 0.000      
##  1st Qu.: 8.0              1st Qu.: 0.0000         1st Qu.: 1.000      
##  Median :16.0              Median : 1.0000         Median : 2.000      
##  Mean   :15.8              Mean   : 0.9321         Mean   : 2.513      
##  3rd Qu.:23.0              3rd Qu.: 2.0000         3rd Qu.: 3.000      
##  Max.   :31.0              Max.   :19.0000         Max.   :50.000      
##                                                                        
##      adults         children           babies          is_repeated_guest
##  Min.   : 0.00   Min.   : 0.0000   Min.   : 0.000000   Min.   :0.00000  
##  1st Qu.: 2.00   1st Qu.: 0.0000   1st Qu.: 0.000000   1st Qu.:0.00000  
##  Median : 2.00   Median : 0.0000   Median : 0.000000   Median :0.00000  
##  Mean   : 1.86   Mean   : 0.1042   Mean   : 0.007962   Mean   :0.02951  
##  3rd Qu.: 2.00   3rd Qu.: 0.0000   3rd Qu.: 0.000000   3rd Qu.:0.00000  
##  Max.   :55.00   Max.   :10.0000   Max.   :10.000000   Max.   :1.00000  
##                                                                         
##  previous_cancellations previous_bookings_not_canceled booking_changes  
##  Min.   : 0.00000       Min.   : 0.0000                Min.   : 0.0000  
##  1st Qu.: 0.00000       1st Qu.: 0.0000                1st Qu.: 0.0000  
##  Median : 0.00000       Median : 0.0000                Median : 0.0000  
##  Mean   : 0.08763       Mean   : 0.1372                Mean   : 0.2184  
##  3rd Qu.: 0.00000       3rd Qu.: 0.0000                3rd Qu.: 0.0000  
##  Max.   :26.00000       Max.   :72.0000                Max.   :18.0000  
##                                                                         
##      agent           company       days_in_waiting_list      adr       
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.000      Min.   :  0.0  
##  1st Qu.:  7.00   1st Qu.:  0.00   1st Qu.:  0.000      1st Qu.: 70.0  
##  Median :  9.00   Median :  0.00   Median :  0.000      Median : 95.0  
##  Mean   : 74.78   Mean   : 10.68   Mean   :  2.333      Mean   :102.5  
##  3rd Qu.:152.00   3rd Qu.:  0.00   3rd Qu.:  0.000      3rd Qu.:126.0  
##  Max.   :535.00   Max.   :543.00   Max.   :391.000      Max.   :600.0  
##                                                         NA's   :1      
##  required_car_parking_spaces total_of_special_requests  total_guests   
##  Min.   :0.00000             Min.   :0.0000            Min.   : 1.000  
##  1st Qu.:0.00000             1st Qu.:0.0000            1st Qu.: 2.000  
##  Median :0.00000             Median :0.0000            Median : 2.000  
##  Mean   :0.06276             Mean   :0.5719            Mean   : 1.972  
##  3rd Qu.:0.00000             3rd Qu.:1.0000            3rd Qu.: 2.000  
##  Max.   :8.00000             Max.   :5.0000            Max.   :55.000  
##                                                                        
##   has_company        has_agent       total_nights   
##  Min.   :0.00000   Min.   :0.0000   Min.   : 1.000  
##  1st Qu.:0.00000   1st Qu.:1.0000   1st Qu.: 2.000  
##  Median :0.00000   Median :1.0000   Median : 3.000  
##  Mean   :0.05647   Mean   :0.8644   Mean   : 3.445  
##  3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.: 4.000  
##  Max.   :1.00000   Max.   :1.0000   Max.   :69.000  
## 

4.1 Discuss how you plan to uncover new information in the data that is not self-evident. What are different ways you could look at this data to answer the questions you want to answer? Do you plan to slice and dice the data in different ways, create new variables, or join separate data frames to create new summary information? How could you summarize your data to answer key questions?

We plan to uncover new information that is not self-evident, such as:

  • Total booking cost
  • Total number of guests
  • Total number of days stayed

We will calculate these by combining existing variables. For example, the total number of days stayed will be the sum of stays_in_weekend_nights and stays_in_week_nights. This total stay length will then inform the total booking cost by multiplying it by the average_daily_rate.

These uncovered variables will help us assess the monetary impact of cancellations, giving hotels a clearer picture of lost revenue and helping to guide better staffing and pricing decisions.

We also plan to slice the data based on whether a reservation was cancelled or not. This will help us analyze:

  • What types of guests tend to cancel
  • Whether one hotel type is more affected by cancellations
  • Who the ideal customers might be for each hotel

This filtering allows us to better identify and target specific guest profiles, raise profits, and support more informed marketing decisions. On the consumer side, this information can provide insight into room availability, price fluctuations, and booking policies.

To summarize the data and answer key questions, we will use a combination of tables and plots that reveal hidden patterns and trends.

4.2 What types of plots and tables will help you to illustrate the findings to your questions?

Plots

We will use the following types of visualizations:

Cancellation Rate by Hotel Type (City vs Resort)
Type: Stacked Bar Plot
Purpose: To see if city hotels experience more cancellations than resort hotels.

Monthly Trends in Bookings vs Cancellations
Type: Line Plot
Purpose: To identify seasonal trends. Helps inform when hotels need to be fully staffed or when guests are most likely to cancel.

Average Daily Rate by Hotel Type
Type: Boxplot
Purpose: To compare how pricing differs between resort and city hotels.

Lead Time vs Cancellation
Type: Boxplot
Purpose: To determine if cancellations are more likely with longer lead times.

Special Requests vs Cancellation
Type: Scatter Plot or Boxplot
Purpose: To see if guests who are more engaged (e.g. making special requests) are less likely to cancel.

Total Number of Guests vs Cancellation
Type: Histogram
Purpose: To explore whether group size affects cancellation behavior.

Guest Profile vs Cancellation
Type: Histogram
Purpose: To analyze if certain types of guests cancel more often than others.

Total Nights Stayed vs Cancellation
Type: Histogram
Purpose: To assess if the length of stay is related to cancellations.

Tables

We will include tables such as:

Cancellation by Deposit Type
Purpose: To understand whether requiring a deposit affects the likelihood of cancellation.

Summary of Guest Profiles by Hotel Type
Purpose: To identify the current demographics of guests staying at city vs resort hotels.

4.3 What do you not know how to do right now that you need to learn to answer your questions?

Right now, we do not know how to do predictive modeling or forecasting. We are able to describe and visualize cancellations, but we cannot predict them or forecast future cancellations and demand.

In addition, other examples show the submitted HTML format with multiple tabs. This is something that will keep our final submission tidy and organized into each section. We will need to learn to do this in order to execute it in the final submission.

Finally, our current output is static. If we were to take this beyond the classroom, it would be useful to create an interactive dashboard using tools like Shiny or flexdashboard. This would allow users to filter and explore the data based on their specific needs.

4.4 Do you plan on incorporating any machine learning techniques (i.e. linear regression, discriminant analysis, cluster analysis) to answer your questions?

As mentioned earlier, we are not currently familiar with machine learning techniques like linear regression, discriminant analysis, or cluster analysis. If we wanted to incorporate these approaches to answer deeper questions or build predictive models, we would need to take the time to learn those methods.