1. Introduction and Objectives

Our team has been tasked by the CIO to lead a new data science project. We have chosen to analyze the Hotel Booking Demand dataset. The hospitality industry is highly dynamic, and understanding customer behavior is crucial for optimizing revenue, managing inventory, and reducing losses from cancellations.

Project Objectives:

  • To systematically clean and preprocess a real-world, high-dimensional booking dataset.
  • (Clustering) To explore patterns in customer cancellations and room pricing through Exploratory Data Analysis (EDA).
  • (Regression) To build predictive models to address key business questions using both classification and regression techniques.

2. Details of the Dataset

Before performing any transformations, we first explore the raw data to understand its context, dimension, and underlying structure.

2.1 Dataset Context

  • Dataset Title: Hotel Booking Demand Dataset (sourced from Kaggle).
  • Year of Dataset: Contains hotel booking records collected between 2015 and 2017.
  • Purpose: To analyze booking demand, cancellation trends, room pricing (Average Daily Rate), and customer preferences. It allows hotels to study patterns that lead to successful stays versus cancellations.
  • Content: The data includes both Resort Hotel and City Hotel booking information. Key features include lead time, arrival dates, number of guests, meal packages, assigned room types, and special requests.

2.2 Data Loading and Dimension

The requirement states that the dataset dimension must exceed 100,000 data points. Let’s load the data and verify this.

# Load the raw dataset
# Note: Ensure 'hotel_bookings.csv' is in the same directory as this Rmd file
df_raw <- read.csv("hotel_bookings.csv")
# Check the dimensions (Rows and Columns)
dim(df_raw)
## [1] 119390     32
# Calculate total dimension (Rows x Columns)
total_dimension <- nrow(df_raw) * ncol(df_raw)
cat("Total Dimension (Rows x Columns):", format(total_dimension, big.mark=","), "\n")
## Total Dimension (Rows x Columns): 3,820,480
library(DT)

# Show the first 100 rows interactively
datatable(head(df_raw, 100), 
          rownames = FALSE,
          extensions = 'Buttons',
          options = list(
            pageLength = 5,          # Show 5 rows at a time
            scrollX = TRUE,          # Add a horizontal scrollbar so it fits on screen
            dom = 'Bfrtip',          # Required to show the buttons
            buttons = c('copy', 'csv', 'excel') # Add download buttons
          ),
          caption = "Interactive Preview: Raw Hotel Booking Data")

Observation: The dataset contains 119,390 rows and 32 columns. The total dimension is 3,820,480, which easily satisfies the project requirement of >100,000.

2.3 Data Structure and Summary

Next, we inspect the variable types and look for initial data quality issues such as missing values.

# Display the structure of the dataset
str(df_raw)
## 'data.frame':    119390 obs. of  32 variables:
##  $ hotel                         : chr  "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
##  $ is_canceled                   : int  0 0 0 0 0 0 0 0 1 1 ...
##  $ lead_time                     : int  342 737 7 13 14 14 0 9 85 75 ...
##  $ arrival_date_year             : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ arrival_date_month            : chr  "July" "July" "July" "July" ...
##  $ arrival_date_week_number      : int  27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : int  0 0 1 1 2 2 2 2 3 3 ...
##  $ adults                        : int  2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ meal                          : chr  "BB" "BB" "BB" "BB" ...
##  $ country                       : chr  "PRT" "PRT" "GBR" "GBR" ...
##  $ market_segment                : chr  "Direct" "Direct" "Direct" "Corporate" ...
##  $ distribution_channel          : chr  "Direct" "Direct" "Direct" "Corporate" ...
##  $ is_repeated_guest             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reserved_room_type            : chr  "C" "C" "A" "A" ...
##  $ assigned_room_type            : chr  "C" "C" "C" "A" ...
##  $ booking_changes               : int  3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : chr  "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
##  $ agent                         : chr  "NULL" "NULL" "NULL" "304" ...
##  $ company                       : chr  "NULL" "NULL" "NULL" "NULL" ...
##  $ days_in_waiting_list          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : chr  "Transient" "Transient" "Transient" "Transient" ...
##  $ adr                           : num  0 0 75 75 98 ...
##  $ required_car_parking_spaces   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : int  0 0 0 0 1 1 0 1 1 0 ...
##  $ reservation_status            : chr  "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
##  $ reservation_status_date       : chr  "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...
# Count total missing values (NAs) in each column
colSums(is.na(df_raw))
##                          hotel                    is_canceled 
##                              0                              0 
##                      lead_time              arrival_date_year 
##                              0                              0 
##             arrival_date_month       arrival_date_week_number 
##                              0                              0 
##      arrival_date_day_of_month        stays_in_weekend_nights 
##                              0                              0 
##           stays_in_week_nights                         adults 
##                              0                              0 
##                       children                         babies 
##                              4                              0 
##                           meal                        country 
##                              0                              0 
##                 market_segment           distribution_channel 
##                              0                              0 
##              is_repeated_guest         previous_cancellations 
##                              0                              0 
## previous_bookings_not_canceled             reserved_room_type 
##                              0                              0 
##             assigned_room_type                booking_changes 
##                              0                              0 
##                   deposit_type                          agent 
##                              0                              0 
##                        company           days_in_waiting_list 
##                              0                              0 
##                  customer_type                            adr 
##                              0                              0 
##    required_car_parking_spaces      total_of_special_requests 
##                              0                              0 
##             reservation_status        reservation_status_date 
##                              0                              0
# Provide a comprehensive summary of the dataset
skim(df_raw)
Data summary
Name df_raw
Number of rows 119390
Number of columns 32
_______________________
Column type frequency:
character 14
numeric 18
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
hotel 0 1 10 12 0 2 0
arrival_date_month 0 1 3 9 0 12 0
meal 0 1 2 9 0 5 0
country 0 1 2 4 0 178 0
market_segment 0 1 6 13 0 8 0
distribution_channel 0 1 3 9 0 5 0
reserved_room_type 0 1 1 1 0 10 0
assigned_room_type 0 1 1 1 0 12 0
deposit_type 0 1 10 10 0 3 0
agent 0 1 1 4 0 334 0
company 0 1 1 4 0 353 0
customer_type 0 1 5 15 0 4 0
reservation_status 0 1 7 9 0 3 0
reservation_status_date 0 1 10 10 0 926 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
is_canceled 0 1 0.37 0.48 0.00 0.00 0.00 1 1 ▇▁▁▁▅
lead_time 0 1 104.01 106.86 0.00 18.00 69.00 160 737 ▇▂▁▁▁
arrival_date_year 0 1 2016.16 0.71 2015.00 2016.00 2016.00 2017 2017 ▃▁▇▁▆
arrival_date_week_number 0 1 27.17 13.61 1.00 16.00 28.00 38 53 ▅▇▇▇▅
arrival_date_day_of_month 0 1 15.80 8.78 1.00 8.00 16.00 23 31 ▇▇▇▇▆
stays_in_weekend_nights 0 1 0.93 1.00 0.00 0.00 1.00 2 19 ▇▁▁▁▁
stays_in_week_nights 0 1 2.50 1.91 0.00 1.00 2.00 3 50 ▇▁▁▁▁
adults 0 1 1.86 0.58 0.00 2.00 2.00 2 55 ▇▁▁▁▁
children 4 1 0.10 0.40 0.00 0.00 0.00 0 10 ▇▁▁▁▁
babies 0 1 0.01 0.10 0.00 0.00 0.00 0 10 ▇▁▁▁▁
is_repeated_guest 0 1 0.03 0.18 0.00 0.00 0.00 0 1 ▇▁▁▁▁
previous_cancellations 0 1 0.09 0.84 0.00 0.00 0.00 0 26 ▇▁▁▁▁
previous_bookings_not_canceled 0 1 0.14 1.50 0.00 0.00 0.00 0 72 ▇▁▁▁▁
booking_changes 0 1 0.22 0.65 0.00 0.00 0.00 0 21 ▇▁▁▁▁
days_in_waiting_list 0 1 2.32 17.59 0.00 0.00 0.00 0 391 ▇▁▁▁▁
adr 0 1 101.83 50.54 -6.38 69.29 94.58 126 5400 ▇▁▁▁▁
required_car_parking_spaces 0 1 0.06 0.25 0.00 0.00 0.00 0 8 ▇▁▁▁▁
total_of_special_requests 0 1 0.57 0.79 0.00 0.00 0.00 1 5 ▇▁▁▁▁
# Check for literal "NULL" string values (MNAR)
null_counts <- df_raw %>%
  select(agent, company) %>%
  summarise(
    agent_null_count = sum(agent == "NULL", na.rm = TRUE),
    company_null_count = sum(company == "NULL", na.rm = TRUE)
  )

# Display as an interactive table
datatable(null_counts, 
          rownames = FALSE,
          options = list(dom = 't'), # 't' hides the search bar since it's a small table
          caption = "Count of Meaningfully Missing (MNAR) 'NULL' Strings")
# Print the result clearly
null_counts
##   agent_null_count company_null_count
## 1            16340             112593

Summary of Exploration: - Variable Types: The dataset consists of numerical variables (e.g., lead_time, adr), categorical variables (e.g., hotel, customer_type), and date-related variables. - Missing At Random (MAR): The children column contains 4 explicitly missing (NA) values. These are likely due to random data entry omissions and will be imputed. - Missing Not At Random (MNAR): The agent and company columns contain 16,340 and 112,593 “NULL” string values, respectively. These are MNAR; they are intentionally blank to represent Direct Bookings where no third-party travel agent or corporate company was involved. This is highly valuable business data, so rather than deleting these rows, we will transform these “NULL” strings into “0” to preserve the “Direct Booking” source for our predictive models.

3. Data Cleaning & Preprocessing

This part is to clean the dataset.

3.1 Cleaning Strategy

  1. Standardizing Column Names: We use the janitor package to ensure all column names are consistently formatted (snake_case), preventing coding errors later.
  2. Handling Missing Values: Instead of dropping 112,000 missing agent/company records, we identified these as Missing Not At Random (MNAR) and encoded them as ‘Direct Bookings’ to preserve vital customer behavior signals.
    • The children column has 4 missing values (NA), which we will impute with the mode (0).
  3. Outlier Removal: There is an extreme outlier in the Average Daily Rate (adr) where a room was priced at over $5,400. This will heavily skew our regression models, so we remove it.
  4. Data Transformation: Because the adr distribution is highly right-skewed, we will apply a log transformation (adr_log) to normalize it for our regression analysis.
  5. Data Types: We convert the target variable is_canceled into a categorical factor.

3.2 Executing the Cleaning Process

# 1. Standardize column names
df_clean <- df_raw %>% clean_names()

# 2. Handle missing values ('NULL' strings to "0" and NAs to 0)
df_clean$agent <- ifelse(df_clean$agent == 'NULL', "0", df_clean$agent)
df_clean$company <- ifelse(df_clean$company == 'NULL', "0", df_clean$company)
df_clean$children[is.na(df_clean$children)] <- 0

# 3. Remove extreme ADR outlier
df_clean <- df_clean %>% filter(adr < 5400)

# 4. Create a log-transformed ADR for regression modeling
# Using log1p (log(x + 1)) to safely handle any adr values of 0
df_clean$adr_log <- log1p(ifelse(df_clean$adr < 0, 0, df_clean$adr)) 

# 5. Ensure target variables are appropriately typed
df_clean$is_canceled <- as.factor(df_clean$is_canceled)

# Verify cleaning is complete
cat("Total rows after outlier removal:", nrow(df_clean), "\n")
## Total rows after outlier removal: 119389
colSums(is.na(df_clean))[c("children", "agent", "company")]
## children    agent  company 
##        0        0        0

4. Exploratory Data Analysis (EDA)

With a clean dataset, we now explore visual patterns to understand customer behavior, pricing dynamics, and factors influencing cancellations.

4.1 Univariate Analysis (Distributions)

First, we look at the distributions of our individual target variables: ADR (for regression) and Cancellation Status (for classification).

library(e1071) 
par(mfrow = c(1, 2))

# Original ADR
hist(df_clean$adr, breaks = 80, col = "#AFA9EC",
     main = "ADR — Original", xlab = "ADR (€)")

# Log Transformed ADR
hist(df_clean$adr_log, breaks = 80, col = "#9FE1CB",
     main = "ADR — Log Transformed", xlab = "log(ADR + 1)")

par(mfrow = c(1, 1))

Insight: The original ADR is heavily right-skewed. By applying the log1p transformation, the distribution becomes much more normally distributed, making it highly suitable for linear regression modeling.

# Cancellation class balance
df_clean %>%
  count(is_canceled) %>%
  mutate(label = ifelse(is_canceled == 1, "Cancelled", "Not Cancelled"),
         pct = n / sum(n)) %>%
  ggplot(aes(x = label, y = n, fill = label)) +
  geom_col(width = 0.5) +
  geom_text(aes(label = percent(pct, accuracy = 0.1)), vjust = -0.5, size = 4) +
  scale_fill_manual(values = c("Cancelled" = "#D85A30", "Not Cancelled" = "#1D9E75")) +
  labs(title = "Cancellation Class Balance",
       subtitle = "Moderate imbalance — approximately 37% of bookings are cancelled",
       x = NULL, y = "Count") +
  theme_minimal() + theme(legend.position = "none")

## 4.2 Bivariate Analysis (Relationships) Next, we analyze how two variables interact, specifically focusing on what drives room pricing and cancellations.

# Seasonality — ADR by month
month_order <- c("January","February","March","April","May","June",
                 "July","August","September","October","November","December")

df_adr_plot %>%
  mutate(arrival_date_month = factor(arrival_date_month, levels = month_order)) %>%
  group_by(arrival_date_month) %>%
  summarise(median_adr = median(adr, na.rm = TRUE)) %>%
  ggplot(aes(x = arrival_date_month, y = median_adr, group = 1)) +
  geom_line(color = "#534AB7", linewidth = 1.2) +
  geom_point(color = "#534AB7", size = 3) +
  labs(title = "Median ADR by Arrival Month",
       subtitle = "Median used — robust to outliers. Peak pricing occurs in August.",
       x = NULL, y = "Median ADR (€)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 35, hjust = 1))

# Lead time vs cancellation
ggplot(df_clean, aes(x = factor(is_canceled), y = lead_time, fill = factor(is_canceled))) +
  geom_boxplot(outlier.alpha = 0.1) +
  scale_fill_manual(values = c("0" = "#1D9E75", "1" = "#D85A30"),
                    labels = c("Not Cancelled", "Cancelled")) +
  labs(title = "Lead Time by Cancellation Status",
       subtitle = "Longer lead time is distinctly linked to higher cancellation likelihood",
       x = "Status", y = "Lead Time (days)", fill = NULL) +
  theme_minimal() + theme(legend.position = "none")

## 4.3 Multivariate Analysis (Complex Interactions) Finally, we examine complex relationships across multiple variables simultaneously to inform our machine learning feature selection.

# Cancellation rate by market segment × deposit type
df_clean %>%
  group_by(market_segment, deposit_type) %>%
  summarise(cancel_rate = mean(as.numeric(as.character(is_canceled)) - 1), n = n(), .groups = "drop") %>%
  filter(n > 50) %>%
  ggplot(aes(x = deposit_type, y = market_segment, fill = cancel_rate)) +
  geom_tile(color = "white", linewidth = 0.4) +
  scale_fill_gradient(low = "#E1F5EE", high = "#D85A30",
                      labels = percent_format(), name = "Cancel rate") +
  geom_text(aes(label = percent(cancel_rate, accuracy = 1)),
            size = 3, color = "black") +
  labs(title = "Cancellation Rate: Market Segment × Deposit Type",
       subtitle = "Non-Refundable deposits in Groups show near 100% cancellation anomalies",
       x = "Deposit Type", y = "Market Segment") +
  theme_minimal()

Insight: Our cancellation risk matrix revealed an anomaly: Non-Refundable deposits within the ‘Groups’ segment exhibit a near-100% cancellation rate, defying standard payment logic.

library(corrplot)

numeric_vars <- df_clean %>%
  select(adr_log, lead_time, stays_in_weekend_nights, stays_in_week_nights,
         adults, children, babies, previous_cancellations,
         previous_bookings_not_canceled, booking_changes,
         days_in_waiting_list, total_of_special_requests) %>%
  mutate(children = as.numeric(children))

# Calculate correlation matrix
cor_matrix <- cor(numeric_vars, use = "complete.obs")

# Plot
corrplot(cor_matrix,
         method = "color",
         type = "upper",
         tl.cex = 0.75,
         tl.col = "black",
         col = colorRampPalette(c("#D85A30", "white", "#1D9E75"))(200),
         title = "Correlation Heatmap (Numeric Features)",
         mar = c(0, 0, 2, 0))

Insight: The correlation matrix reveals expected linear relationships without showing signs of severe multicollinearity among independent variables. # 5. Feature Engineering and Data Preprocessing Before building our predictive models, we must prepare a standardized dataset. This involves creating new derived features that capture business logic and splitting the data for evaluation.

5.1 Feature Engineering

Raw data was transformed into actionable business metrics. For example, week and weekend stays were aggregated into total_nights, and discrepancy flags were created for room reassignments.

# 1. Create derived features and format variables
df_model <- df_clean %>%
  mutate(
    # Target: factor with readable labels for classification
    is_canceled = factor(is_canceled, levels = c(0, 1),
                         labels = c("Not_Cancelled", "Cancelled")),

    # Derived features capturing customer behavior
    total_nights          = stays_in_weekend_nights + stays_in_week_nights,
    total_guests          = adults + as.numeric(children) + babies,
    has_prev_cancellation = as.integer(previous_cancellations > 0),
    has_prev_booking      = as.integer(previous_bookings_not_canceled > 0),
    room_changed          = as.integer(as.character(reserved_room_type) !=
                                       as.character(assigned_room_type)),
    has_agent             = as.integer(agent != "0"),

    # Convert categorical variables to factors for machine learning models
    hotel          = factor(hotel),
    meal           = factor(meal),
    market_segment = factor(market_segment),
    deposit_type   = factor(deposit_type),
    customer_type  = factor(customer_type),
    reserved_room_type = factor(reserved_room_type),
    arrival_date_month = factor(arrival_date_month,
      levels = c("January","February","March","April","May","June",
                 "July","August","September","October","November","December")),
    children = as.numeric(children)
  )

# 2. Select the final features needed for modeling
features <- c(
  "hotel", "lead_time", "arrival_date_month",
  "stays_in_weekend_nights", "stays_in_week_nights",
  "adults", "children", "meal",
  "market_segment", "deposit_type", "customer_type",
  "reserved_room_type", "booking_changes",
  "days_in_waiting_list", "adr",
  "total_of_special_requests", "previous_cancellations",
  "previous_bookings_not_canceled", "required_car_parking_spaces",
  "total_nights", "total_guests",
  "has_prev_cancellation", "has_prev_booking",
  "room_changed", "has_agent"
)

# 3. Create final standardized classification dataset
df_cls <- df_model %>%
  select(all_of(features), is_canceled) %>%
  drop_na()

cat("Modeling Dataset Ready:", nrow(df_cls), "rows, ", ncol(df_cls), "columns.\n")
## Modeling Dataset Ready: 119388 rows,  26 columns.

5.2 Data Splitting: Classification Task

To ensure our models generalize well, we split the data 70/30. We use the caret package to partition the data, preserving the cancellation class imbalance equally across both sets.

library(caret)
set.seed(42) # For exact reproducibility

# Classification Split
split_idx_cls  <- createDataPartition(df_cls$is_canceled, p = 0.70, list = FALSE)
train_cls <- df_cls[split_idx_cls, ]
test_cls  <- df_cls[-split_idx_cls, ]

cat("Classification Training Set:", nrow(train_cls), "rows\n")
## Classification Training Set: 83573 rows
cat("Classification Testing Set :", nrow(test_cls), "rows\n")
## Classification Testing Set : 35815 rows

5.3 Data Splitting: Regression Task

For our pricing regression model, we use the same engineered features but filter out complimentary rooms (where adr is 0 or negative) to prevent them from skewing the baseline price estimation.

# 1. Prepare Regression Dataset
df_reg <- df_model %>%
  filter(adr > 0) %>% 
  select(all_of(features), adr_log) %>%
  drop_na()

# 2. Regression Split (70/30)
set.seed(42)
split_idx_reg <- createDataPartition(df_reg$adr_log, p = 0.70, list = FALSE)
train_reg <- df_reg[split_idx_reg, ]
test_reg  <- df_reg[-split_idx_reg, ]

cat("Regression Training Set:", nrow(train_reg), "rows\n")
## Regression Training Set: 82202 rows
cat("Regression Testing Set :", nrow(test_reg), "rows\n")
## Regression Testing Set : 35227 rows

6. Predictive Modeling and Evaluation

We evaluate three distinct machine learning approaches to predict booking cancellations: Logistic Regression (Baseline), Random Forest (Ensemble), and XGBoost (Advanced Gradient Boosting).

6.1 Model 1: Logistic Regression (Baseline)

log_model <- glm(is_canceled ~ ., data = train_cls, family = "binomial")
log_probs <- predict(log_model, newdata = test_cls, type = "response")
log_preds <- factor(ifelse(log_probs > 0.5, "Cancelled", "Not_Cancelled"), 
                    levels = c("Not_Cancelled", "Cancelled"))

log_cm <- confusionMatrix(log_preds, test_cls$is_canceled, positive = "Cancelled")
cat("Logistic Regression Accuracy:", round(log_cm$overall["Accuracy"], 4), "\n")
## Logistic Regression Accuracy: 0.8144

6.2 Model 2: Random Forest (Ensemble)

library(randomForest)
rf_model <- randomForest(is_canceled ~ ., data = train_cls, ntree = 100, importance = TRUE)
rf_probs <- predict(rf_model, newdata = test_cls, type = "prob")[,2]
rf_preds <- predict(rf_model, newdata = test_cls)

rf_cm <- confusionMatrix(rf_preds, test_cls$is_canceled, positive = "Cancelled")
cat("Random Forest Accuracy:", round(rf_cm$overall["Accuracy"], 4), "\n")
## Random Forest Accuracy: 0.8607

6.3 Model 3: XGBoost (Advanced)

library(xgboost)

# 1. Format labels for XGBoost (0 and 1)
train_label <- as.numeric(train_cls$is_canceled == "Cancelled")
test_label  <- as.numeric(test_cls$is_canceled == "Cancelled")

# 2. Create Matrix structure
train_matrix <- model.matrix(is_canceled ~ . - 1, data = train_cls)
test_matrix  <- model.matrix(is_canceled ~ . - 1, data = test_cls)

dtrain <- xgb.DMatrix(data = train_matrix, label = train_label)
dtest  <- xgb.DMatrix(data = test_matrix, label = test_label)

# 3. Train
xgb_model <- xgb.train(data = dtrain, nrounds = 50, objective = "binary:logistic", verbose = 0)

# 4. Predict & Evaluate
xgb_probs <- predict(xgb_model, dtest)
xgb_preds <- factor(ifelse(xgb_probs > 0.5, "Cancelled", "Not_Cancelled"), 
                    levels = c("Not_Cancelled", "Cancelled"))

xgb_cm <- confusionMatrix(xgb_preds, test_cls$is_canceled, positive = "Cancelled")
cat("XGBoost Accuracy:", round(xgb_cm$overall["Accuracy"], 4), "\n")
## XGBoost Accuracy: 0.8356

6.4 Model Comparison (ROC Analysis)

library(pROC)

roc_log <- roc(as.numeric(test_cls$is_canceled == "Cancelled"), log_probs, quiet=TRUE)
roc_rf  <- roc(as.numeric(test_cls$is_canceled == "Cancelled"), rf_probs, quiet=TRUE)
roc_xgb <- roc(as.numeric(test_cls$is_canceled == "Cancelled"), xgb_probs, quiet=TRUE)

plot(roc_log, col="blue", main="ROC Curve Comparison")
plot(roc_rf, col="red", add=TRUE)
plot(roc_xgb, col="darkgreen", add=TRUE)
legend("bottomright", legend=c("LogReg", "RandomForest", "XGBoost"), 
       col=c("blue", "red", "darkgreen"), lty=1)

cat("AUC Scores:\n")
## AUC Scores:
cat("LogReg:", round(auc(roc_log), 4), "\n")
## LogReg: 0.8653
cat("Random Forest:", round(auc(roc_rf), 4), "\n")
## Random Forest: 0.9243
cat("XGBoost:", round(auc(roc_xgb), 4), "\n")
## XGBoost: 0.9051

Discussion: We tested a linear baseline (Logistic Regression) against non-linear ensembles (Random Forest, XGBoost). The XGBoost and Random Forest ensemble models consistently outperform the Logistic Regression baseline. The Random Forest classifier achieved the highest AUC-ROC (0.9243), outperforming both Logistic Regression and XGBoost, proving it is the most robust algorithm for capturing complex, overlapping cancellation patterns.”

7. Predictive Modeling (Regression)

Our second objective is a regression task: Predicting the log-transformed Average Daily Rate (adr_log) based on booking parameters. This allows the hotel to estimate the standard price point for incoming guests based on party size, booking window, and date.

7.1 Model 1: Multiple Linear Regression (Baseline)

# Train Model
lm_model <- lm(adr_log ~ ., data = train_reg)

# Predict
lm_preds <- predict(lm_model, newdata = test_reg)

# Evaluate
lm_rmse <- RMSE(lm_preds, test_reg$adr_log)
lm_r2   <- R2(lm_preds, test_reg$adr_log)

cat("Linear Regression - RMSE:", round(lm_rmse, 4), "| R-Squared:", round(lm_r2, 4), "\n")
## Linear Regression - RMSE: 0.1419 | R-Squared: 0.9046

7.2 Model 2: Random Forest Regressor (Ensemble)

Note: We limit ntree to 50 here to balance predictive power with computation time.

# Train Model
rf_reg_model <- randomForest(adr_log ~ ., data = train_reg, ntree = 50, importance = TRUE)

# Predict
rf_reg_preds <- predict(rf_reg_model, newdata = test_reg)

# Evaluate
rf_rmse <- RMSE(rf_reg_preds, test_reg$adr_log)
rf_r2   <- R2(rf_reg_preds, test_reg$adr_log)

cat("Random Forest - RMSE:", round(rf_rmse, 4), "| R-Squared:", round(rf_r2, 4), "\n")
## Random Forest - RMSE: 0.0282 | R-Squared: 0.9965

7.3 Model 3: XGBoost Regressor (Advanced)

# 1. Format for XGBoost Regression
train_mat_reg <- model.matrix(adr_log ~ . - 1, data = train_reg)
test_mat_reg  <- model.matrix(adr_log ~ . - 1, data = test_reg)

dtrain_reg <- xgb.DMatrix(data = train_mat_reg, label = train_reg$adr_log)
dtest_reg  <- xgb.DMatrix(data = test_mat_reg, label = test_reg$adr_log)

# 2. Train Model (Notice objective is 'reg:squarederror' instead of logistic)
xgb_reg_model <- xgb.train(data = dtrain_reg, nrounds = 100, 
                           objective = "reg:squarederror", verbose = 0)

# 3. Predict & Evaluate
xgb_reg_preds <- predict(xgb_reg_model, dtest_reg)

xgb_rmse <- RMSE(xgb_reg_preds, test_reg$adr_log)
xgb_r2   <- R2(xgb_reg_preds, test_reg$adr_log)

cat("XGBoost - RMSE:", round(xgb_rmse, 4), "| R-Squared:", round(xgb_r2, 4), "\n")
## XGBoost - RMSE: 0.0322 | R-Squared: 0.9951

7.5 Regression Comparison

# Create a summary table of regression results
reg_results <- data.frame(
  Model = c("Linear Regression", "Random Forest", "XGBoost"),
  RMSE = c(lm_rmse, rf_rmse, xgb_rmse),
  R_Squared = c(lm_r2, rf_r2, xgb_r2)
)

# Display as a beautiful styled table
datatable(reg_results, 
          rownames = FALSE,
          options = list(
            dom = 't',               # Hide search bar
            columnDefs = list(list(className = 'dt-center', targets = "_all")) # Center all text
          ),
          caption = "Final Regression Model Performance Comparison") %>%
  # Add background color formatting to highlight the best R-Squared scores!
  formatStyle('R_Squared', 
              background = styleColorBar(c(0, 1), '#9FE1CB'),
              backgroundSize = '98% 88%',
              backgroundRepeat = 'no-repeat',
              backgroundPosition = 'center')

Discussion: To optimize revenue generation, we developed regression models to estimate the Average Daily Rate (ADR).The Random Forest and XGBoost models once again vastly outperform the baseline Linear Regression. The tree-based ensembles successfully captured non-linear pricing dynamics—such as the interaction between seasonality, room type, and lead time—vastly outperforming linear baselines.A higher R-Squared indicates that the ensemble models are successfully capturing the complex, non-linear relationships between variables (like how seasonality interacts with hotel type and party size to dictate price). The RMSE (Root Mean Squared Error) shows how far off our predictions are on average, in the log scale.

8. Conclusion

This project successfully applied the full data science lifecycle to the Hotel Booking Demand dataset (119,390 records), satisfying the core objectives set by the CIO.

  1. Exploratory Data Analysis: We identified that deposit_type, market_segment, and lead_time are the strongest behavioral indicators of a potential cancellation. Furthermore, plotting ADR by month clearly revealed peak pricing seasonality.

  2. Feature Engineering: We derived new features, such as has_agent (to handle MNAR data) and total_nights, and normalized our target pricing data using log transformations to prepare it for machine learning.

  3. Classification (Cancellation Risk): By testing a baseline linear model against granular ensembles (Random Forest and XGBoost), we proved that tree-based models handle hotel booking behavior exceptionally well, achieving an AUC of ~0.93.

  4. Regression (Revenue Estimation): Using XGBoost and Random Forest regressors, we were able to predict the Average Daily Rate with a high R-Squared, proving that pricing is highly deterministic based on the booking configuration.

Business Recommendation: Implementing these predictive insights will allow hotel management to strategically overbook to offset anticipated cancellations, adjust non-refundable policies dynamically, and ultimately build an algorithmic pricing engine to optimize their revenue pipelines.