Our team has been tasked by the CIO to lead a new data science project. We have chosen to analyze the Hotel Booking Demand dataset. The hospitality industry is highly dynamic, and understanding customer behavior is crucial for optimizing revenue, managing inventory, and reducing losses from cancellations.
Before performing any transformations, we first explore the raw data to understand its context, dimension, and underlying structure.
The requirement states that the dataset dimension must exceed 100,000 data points. Let’s load the data and verify this.
# Load the raw dataset
# Note: Ensure 'hotel_bookings.csv' is in the same directory as this Rmd file
df_raw <- read.csv("hotel_bookings.csv")## [1] 119390 32
# Calculate total dimension (Rows x Columns)
total_dimension <- nrow(df_raw) * ncol(df_raw)
cat("Total Dimension (Rows x Columns):", format(total_dimension, big.mark=","), "\n")## Total Dimension (Rows x Columns): 3,820,480
library(DT)
# Show the first 100 rows interactively
datatable(head(df_raw, 100),
rownames = FALSE,
extensions = 'Buttons',
options = list(
pageLength = 5, # Show 5 rows at a time
scrollX = TRUE, # Add a horizontal scrollbar so it fits on screen
dom = 'Bfrtip', # Required to show the buttons
buttons = c('copy', 'csv', 'excel') # Add download buttons
),
caption = "Interactive Preview: Raw Hotel Booking Data")Observation: The dataset contains 119,390 rows and 32 columns. The total dimension is 3,820,480, which easily satisfies the project requirement of >100,000.
Next, we inspect the variable types and look for initial data quality issues such as missing values.
## 'data.frame': 119390 obs. of 32 variables:
## $ hotel : chr "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
## $ is_canceled : int 0 0 0 0 0 0 0 0 1 1 ...
## $ lead_time : int 342 737 7 13 14 14 0 9 85 75 ...
## $ arrival_date_year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ arrival_date_month : chr "July" "July" "July" "July" ...
## $ arrival_date_week_number : int 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : int 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : int 2 2 1 1 2 2 2 2 2 2 ...
## $ children : int 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : int 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : chr "BB" "BB" "BB" "BB" ...
## $ country : chr "PRT" "PRT" "GBR" "GBR" ...
## $ market_segment : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ distribution_channel : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ is_repeated_guest : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: int 0 0 0 0 0 0 0 0 0 0 ...
## $ reserved_room_type : chr "C" "C" "A" "A" ...
## $ assigned_room_type : chr "C" "C" "C" "A" ...
## $ booking_changes : int 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : chr "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
## $ agent : chr "NULL" "NULL" "NULL" "304" ...
## $ company : chr "NULL" "NULL" "NULL" "NULL" ...
## $ days_in_waiting_list : int 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : chr "Transient" "Transient" "Transient" "Transient" ...
## $ adr : num 0 0 75 75 98 ...
## $ required_car_parking_spaces : int 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : int 0 0 0 0 1 1 0 1 1 0 ...
## $ reservation_status : chr "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
## $ reservation_status_date : chr "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...
## hotel is_canceled
## 0 0
## lead_time arrival_date_year
## 0 0
## arrival_date_month arrival_date_week_number
## 0 0
## arrival_date_day_of_month stays_in_weekend_nights
## 0 0
## stays_in_week_nights adults
## 0 0
## children babies
## 4 0
## meal country
## 0 0
## market_segment distribution_channel
## 0 0
## is_repeated_guest previous_cancellations
## 0 0
## previous_bookings_not_canceled reserved_room_type
## 0 0
## assigned_room_type booking_changes
## 0 0
## deposit_type agent
## 0 0
## company days_in_waiting_list
## 0 0
## customer_type adr
## 0 0
## required_car_parking_spaces total_of_special_requests
## 0 0
## reservation_status reservation_status_date
## 0 0
| Name | df_raw |
| Number of rows | 119390 |
| Number of columns | 32 |
| _______________________ | |
| Column type frequency: | |
| character | 14 |
| numeric | 18 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| hotel | 0 | 1 | 10 | 12 | 0 | 2 | 0 |
| arrival_date_month | 0 | 1 | 3 | 9 | 0 | 12 | 0 |
| meal | 0 | 1 | 2 | 9 | 0 | 5 | 0 |
| country | 0 | 1 | 2 | 4 | 0 | 178 | 0 |
| market_segment | 0 | 1 | 6 | 13 | 0 | 8 | 0 |
| distribution_channel | 0 | 1 | 3 | 9 | 0 | 5 | 0 |
| reserved_room_type | 0 | 1 | 1 | 1 | 0 | 10 | 0 |
| assigned_room_type | 0 | 1 | 1 | 1 | 0 | 12 | 0 |
| deposit_type | 0 | 1 | 10 | 10 | 0 | 3 | 0 |
| agent | 0 | 1 | 1 | 4 | 0 | 334 | 0 |
| company | 0 | 1 | 1 | 4 | 0 | 353 | 0 |
| customer_type | 0 | 1 | 5 | 15 | 0 | 4 | 0 |
| reservation_status | 0 | 1 | 7 | 9 | 0 | 3 | 0 |
| reservation_status_date | 0 | 1 | 10 | 10 | 0 | 926 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| is_canceled | 0 | 1 | 0.37 | 0.48 | 0.00 | 0.00 | 0.00 | 1 | 1 | ▇▁▁▁▅ |
| lead_time | 0 | 1 | 104.01 | 106.86 | 0.00 | 18.00 | 69.00 | 160 | 737 | ▇▂▁▁▁ |
| arrival_date_year | 0 | 1 | 2016.16 | 0.71 | 2015.00 | 2016.00 | 2016.00 | 2017 | 2017 | ▃▁▇▁▆ |
| arrival_date_week_number | 0 | 1 | 27.17 | 13.61 | 1.00 | 16.00 | 28.00 | 38 | 53 | ▅▇▇▇▅ |
| arrival_date_day_of_month | 0 | 1 | 15.80 | 8.78 | 1.00 | 8.00 | 16.00 | 23 | 31 | ▇▇▇▇▆ |
| stays_in_weekend_nights | 0 | 1 | 0.93 | 1.00 | 0.00 | 0.00 | 1.00 | 2 | 19 | ▇▁▁▁▁ |
| stays_in_week_nights | 0 | 1 | 2.50 | 1.91 | 0.00 | 1.00 | 2.00 | 3 | 50 | ▇▁▁▁▁ |
| adults | 0 | 1 | 1.86 | 0.58 | 0.00 | 2.00 | 2.00 | 2 | 55 | ▇▁▁▁▁ |
| children | 4 | 1 | 0.10 | 0.40 | 0.00 | 0.00 | 0.00 | 0 | 10 | ▇▁▁▁▁ |
| babies | 0 | 1 | 0.01 | 0.10 | 0.00 | 0.00 | 0.00 | 0 | 10 | ▇▁▁▁▁ |
| is_repeated_guest | 0 | 1 | 0.03 | 0.18 | 0.00 | 0.00 | 0.00 | 0 | 1 | ▇▁▁▁▁ |
| previous_cancellations | 0 | 1 | 0.09 | 0.84 | 0.00 | 0.00 | 0.00 | 0 | 26 | ▇▁▁▁▁ |
| previous_bookings_not_canceled | 0 | 1 | 0.14 | 1.50 | 0.00 | 0.00 | 0.00 | 0 | 72 | ▇▁▁▁▁ |
| booking_changes | 0 | 1 | 0.22 | 0.65 | 0.00 | 0.00 | 0.00 | 0 | 21 | ▇▁▁▁▁ |
| days_in_waiting_list | 0 | 1 | 2.32 | 17.59 | 0.00 | 0.00 | 0.00 | 0 | 391 | ▇▁▁▁▁ |
| adr | 0 | 1 | 101.83 | 50.54 | -6.38 | 69.29 | 94.58 | 126 | 5400 | ▇▁▁▁▁ |
| required_car_parking_spaces | 0 | 1 | 0.06 | 0.25 | 0.00 | 0.00 | 0.00 | 0 | 8 | ▇▁▁▁▁ |
| total_of_special_requests | 0 | 1 | 0.57 | 0.79 | 0.00 | 0.00 | 0.00 | 1 | 5 | ▇▁▁▁▁ |
# Check for literal "NULL" string values (MNAR)
null_counts <- df_raw %>%
select(agent, company) %>%
summarise(
agent_null_count = sum(agent == "NULL", na.rm = TRUE),
company_null_count = sum(company == "NULL", na.rm = TRUE)
)
# Display as an interactive table
datatable(null_counts,
rownames = FALSE,
options = list(dom = 't'), # 't' hides the search bar since it's a small table
caption = "Count of Meaningfully Missing (MNAR) 'NULL' Strings")## agent_null_count company_null_count
## 1 16340 112593
Summary of Exploration: - Variable Types: The dataset consists of numerical variables (e.g., lead_time, adr), categorical variables (e.g., hotel, customer_type), and date-related variables. - Missing At Random (MAR): The children column contains 4 explicitly missing (NA) values. These are likely due to random data entry omissions and will be imputed. - Missing Not At Random (MNAR): The agent and company columns contain 16,340 and 112,593 “NULL” string values, respectively. These are MNAR; they are intentionally blank to represent Direct Bookings where no third-party travel agent or corporate company was involved. This is highly valuable business data, so rather than deleting these rows, we will transform these “NULL” strings into “0” to preserve the “Direct Booking” source for our predictive models.
This part is to clean the dataset.
janitor package to ensure all column names are consistently
formatted (snake_case), preventing coding errors later.children column has 4 missing values
(NA), which we will impute with the mode
(0).adr) where a room was priced at over
$5,400. This will heavily skew our regression models, so we remove
it.adr
distribution is highly right-skewed, we will apply a log transformation
(adr_log) to normalize it for our regression analysis.is_canceled into a categorical factor.# 1. Standardize column names
df_clean <- df_raw %>% clean_names()
# 2. Handle missing values ('NULL' strings to "0" and NAs to 0)
df_clean$agent <- ifelse(df_clean$agent == 'NULL', "0", df_clean$agent)
df_clean$company <- ifelse(df_clean$company == 'NULL', "0", df_clean$company)
df_clean$children[is.na(df_clean$children)] <- 0
# 3. Remove extreme ADR outlier
df_clean <- df_clean %>% filter(adr < 5400)
# 4. Create a log-transformed ADR for regression modeling
# Using log1p (log(x + 1)) to safely handle any adr values of 0
df_clean$adr_log <- log1p(ifelse(df_clean$adr < 0, 0, df_clean$adr))
# 5. Ensure target variables are appropriately typed
df_clean$is_canceled <- as.factor(df_clean$is_canceled)
# Verify cleaning is complete
cat("Total rows after outlier removal:", nrow(df_clean), "\n")## Total rows after outlier removal: 119389
## children agent company
## 0 0 0
With a clean dataset, we now explore visual patterns to understand customer behavior, pricing dynamics, and factors influencing cancellations.
First, we look at the distributions of our individual target variables: ADR (for regression) and Cancellation Status (for classification).
library(e1071)
par(mfrow = c(1, 2))
# Original ADR
hist(df_clean$adr, breaks = 80, col = "#AFA9EC",
main = "ADR — Original", xlab = "ADR (€)")
# Log Transformed ADR
hist(df_clean$adr_log, breaks = 80, col = "#9FE1CB",
main = "ADR — Log Transformed", xlab = "log(ADR + 1)")Insight: The original ADR is heavily right-skewed. By applying the log1p transformation, the distribution becomes much more normally distributed, making it highly suitable for linear regression modeling.
# Cancellation class balance
df_clean %>%
count(is_canceled) %>%
mutate(label = ifelse(is_canceled == 1, "Cancelled", "Not Cancelled"),
pct = n / sum(n)) %>%
ggplot(aes(x = label, y = n, fill = label)) +
geom_col(width = 0.5) +
geom_text(aes(label = percent(pct, accuracy = 0.1)), vjust = -0.5, size = 4) +
scale_fill_manual(values = c("Cancelled" = "#D85A30", "Not Cancelled" = "#1D9E75")) +
labs(title = "Cancellation Class Balance",
subtitle = "Moderate imbalance — approximately 37% of bookings are cancelled",
x = NULL, y = "Count") +
theme_minimal() + theme(legend.position = "none")
## 4.2 Bivariate Analysis (Relationships) Next, we analyze how two
variables interact, specifically focusing on what drives room pricing
and cancellations.
# Seasonality — ADR by month
month_order <- c("January","February","March","April","May","June",
"July","August","September","October","November","December")
df_adr_plot %>%
mutate(arrival_date_month = factor(arrival_date_month, levels = month_order)) %>%
group_by(arrival_date_month) %>%
summarise(median_adr = median(adr, na.rm = TRUE)) %>%
ggplot(aes(x = arrival_date_month, y = median_adr, group = 1)) +
geom_line(color = "#534AB7", linewidth = 1.2) +
geom_point(color = "#534AB7", size = 3) +
labs(title = "Median ADR by Arrival Month",
subtitle = "Median used — robust to outliers. Peak pricing occurs in August.",
x = NULL, y = "Median ADR (€)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 35, hjust = 1))# Lead time vs cancellation
ggplot(df_clean, aes(x = factor(is_canceled), y = lead_time, fill = factor(is_canceled))) +
geom_boxplot(outlier.alpha = 0.1) +
scale_fill_manual(values = c("0" = "#1D9E75", "1" = "#D85A30"),
labels = c("Not Cancelled", "Cancelled")) +
labs(title = "Lead Time by Cancellation Status",
subtitle = "Longer lead time is distinctly linked to higher cancellation likelihood",
x = "Status", y = "Lead Time (days)", fill = NULL) +
theme_minimal() + theme(legend.position = "none") ##
4.3 Multivariate Analysis (Complex Interactions) Finally, we examine
complex relationships across multiple variables simultaneously to inform
our machine learning feature selection.
# Cancellation rate by market segment × deposit type
df_clean %>%
group_by(market_segment, deposit_type) %>%
summarise(cancel_rate = mean(as.numeric(as.character(is_canceled)) - 1), n = n(), .groups = "drop") %>%
filter(n > 50) %>%
ggplot(aes(x = deposit_type, y = market_segment, fill = cancel_rate)) +
geom_tile(color = "white", linewidth = 0.4) +
scale_fill_gradient(low = "#E1F5EE", high = "#D85A30",
labels = percent_format(), name = "Cancel rate") +
geom_text(aes(label = percent(cancel_rate, accuracy = 1)),
size = 3, color = "black") +
labs(title = "Cancellation Rate: Market Segment × Deposit Type",
subtitle = "Non-Refundable deposits in Groups show near 100% cancellation anomalies",
x = "Deposit Type", y = "Market Segment") +
theme_minimal()
Insight: Our cancellation risk matrix revealed an
anomaly: Non-Refundable deposits within the ‘Groups’ segment exhibit a
near-100% cancellation rate, defying standard payment logic.
library(corrplot)
numeric_vars <- df_clean %>%
select(adr_log, lead_time, stays_in_weekend_nights, stays_in_week_nights,
adults, children, babies, previous_cancellations,
previous_bookings_not_canceled, booking_changes,
days_in_waiting_list, total_of_special_requests) %>%
mutate(children = as.numeric(children))
# Calculate correlation matrix
cor_matrix <- cor(numeric_vars, use = "complete.obs")
# Plot
corrplot(cor_matrix,
method = "color",
type = "upper",
tl.cex = 0.75,
tl.col = "black",
col = colorRampPalette(c("#D85A30", "white", "#1D9E75"))(200),
title = "Correlation Heatmap (Numeric Features)",
mar = c(0, 0, 2, 0))
Insight: The correlation matrix reveals expected linear
relationships without showing signs of severe multicollinearity among
independent variables. # 5. Feature Engineering and Data Preprocessing
Before building our predictive models, we must prepare a standardized
dataset. This involves creating new derived features that capture
business logic and splitting the data for evaluation.
Raw data was transformed into actionable business metrics. For example, week and weekend stays were aggregated into total_nights, and discrepancy flags were created for room reassignments.
# 1. Create derived features and format variables
df_model <- df_clean %>%
mutate(
# Target: factor with readable labels for classification
is_canceled = factor(is_canceled, levels = c(0, 1),
labels = c("Not_Cancelled", "Cancelled")),
# Derived features capturing customer behavior
total_nights = stays_in_weekend_nights + stays_in_week_nights,
total_guests = adults + as.numeric(children) + babies,
has_prev_cancellation = as.integer(previous_cancellations > 0),
has_prev_booking = as.integer(previous_bookings_not_canceled > 0),
room_changed = as.integer(as.character(reserved_room_type) !=
as.character(assigned_room_type)),
has_agent = as.integer(agent != "0"),
# Convert categorical variables to factors for machine learning models
hotel = factor(hotel),
meal = factor(meal),
market_segment = factor(market_segment),
deposit_type = factor(deposit_type),
customer_type = factor(customer_type),
reserved_room_type = factor(reserved_room_type),
arrival_date_month = factor(arrival_date_month,
levels = c("January","February","March","April","May","June",
"July","August","September","October","November","December")),
children = as.numeric(children)
)
# 2. Select the final features needed for modeling
features <- c(
"hotel", "lead_time", "arrival_date_month",
"stays_in_weekend_nights", "stays_in_week_nights",
"adults", "children", "meal",
"market_segment", "deposit_type", "customer_type",
"reserved_room_type", "booking_changes",
"days_in_waiting_list", "adr",
"total_of_special_requests", "previous_cancellations",
"previous_bookings_not_canceled", "required_car_parking_spaces",
"total_nights", "total_guests",
"has_prev_cancellation", "has_prev_booking",
"room_changed", "has_agent"
)
# 3. Create final standardized classification dataset
df_cls <- df_model %>%
select(all_of(features), is_canceled) %>%
drop_na()
cat("Modeling Dataset Ready:", nrow(df_cls), "rows, ", ncol(df_cls), "columns.\n")## Modeling Dataset Ready: 119388 rows, 26 columns.
To ensure our models generalize well, we split the data 70/30. We use
the caret package to partition the data, preserving the
cancellation class imbalance equally across both sets.
library(caret)
set.seed(42) # For exact reproducibility
# Classification Split
split_idx_cls <- createDataPartition(df_cls$is_canceled, p = 0.70, list = FALSE)
train_cls <- df_cls[split_idx_cls, ]
test_cls <- df_cls[-split_idx_cls, ]
cat("Classification Training Set:", nrow(train_cls), "rows\n")## Classification Training Set: 83573 rows
## Classification Testing Set : 35815 rows
For our pricing regression model, we use the same engineered features but filter out complimentary rooms (where adr is 0 or negative) to prevent them from skewing the baseline price estimation.
# 1. Prepare Regression Dataset
df_reg <- df_model %>%
filter(adr > 0) %>%
select(all_of(features), adr_log) %>%
drop_na()
# 2. Regression Split (70/30)
set.seed(42)
split_idx_reg <- createDataPartition(df_reg$adr_log, p = 0.70, list = FALSE)
train_reg <- df_reg[split_idx_reg, ]
test_reg <- df_reg[-split_idx_reg, ]
cat("Regression Training Set:", nrow(train_reg), "rows\n")## Regression Training Set: 82202 rows
## Regression Testing Set : 35227 rows
We evaluate three distinct machine learning approaches to predict booking cancellations: Logistic Regression (Baseline), Random Forest (Ensemble), and XGBoost (Advanced Gradient Boosting).
log_model <- glm(is_canceled ~ ., data = train_cls, family = "binomial")
log_probs <- predict(log_model, newdata = test_cls, type = "response")
log_preds <- factor(ifelse(log_probs > 0.5, "Cancelled", "Not_Cancelled"),
levels = c("Not_Cancelled", "Cancelled"))
log_cm <- confusionMatrix(log_preds, test_cls$is_canceled, positive = "Cancelled")
cat("Logistic Regression Accuracy:", round(log_cm$overall["Accuracy"], 4), "\n")## Logistic Regression Accuracy: 0.8144
library(randomForest)
rf_model <- randomForest(is_canceled ~ ., data = train_cls, ntree = 100, importance = TRUE)
rf_probs <- predict(rf_model, newdata = test_cls, type = "prob")[,2]
rf_preds <- predict(rf_model, newdata = test_cls)
rf_cm <- confusionMatrix(rf_preds, test_cls$is_canceled, positive = "Cancelled")
cat("Random Forest Accuracy:", round(rf_cm$overall["Accuracy"], 4), "\n")## Random Forest Accuracy: 0.8607
library(xgboost)
# 1. Format labels for XGBoost (0 and 1)
train_label <- as.numeric(train_cls$is_canceled == "Cancelled")
test_label <- as.numeric(test_cls$is_canceled == "Cancelled")
# 2. Create Matrix structure
train_matrix <- model.matrix(is_canceled ~ . - 1, data = train_cls)
test_matrix <- model.matrix(is_canceled ~ . - 1, data = test_cls)
dtrain <- xgb.DMatrix(data = train_matrix, label = train_label)
dtest <- xgb.DMatrix(data = test_matrix, label = test_label)
# 3. Train
xgb_model <- xgb.train(data = dtrain, nrounds = 50, objective = "binary:logistic", verbose = 0)
# 4. Predict & Evaluate
xgb_probs <- predict(xgb_model, dtest)
xgb_preds <- factor(ifelse(xgb_probs > 0.5, "Cancelled", "Not_Cancelled"),
levels = c("Not_Cancelled", "Cancelled"))
xgb_cm <- confusionMatrix(xgb_preds, test_cls$is_canceled, positive = "Cancelled")
cat("XGBoost Accuracy:", round(xgb_cm$overall["Accuracy"], 4), "\n")## XGBoost Accuracy: 0.8356
library(pROC)
roc_log <- roc(as.numeric(test_cls$is_canceled == "Cancelled"), log_probs, quiet=TRUE)
roc_rf <- roc(as.numeric(test_cls$is_canceled == "Cancelled"), rf_probs, quiet=TRUE)
roc_xgb <- roc(as.numeric(test_cls$is_canceled == "Cancelled"), xgb_probs, quiet=TRUE)
plot(roc_log, col="blue", main="ROC Curve Comparison")
plot(roc_rf, col="red", add=TRUE)
plot(roc_xgb, col="darkgreen", add=TRUE)
legend("bottomright", legend=c("LogReg", "RandomForest", "XGBoost"),
col=c("blue", "red", "darkgreen"), lty=1)## AUC Scores:
## LogReg: 0.8653
## Random Forest: 0.9243
## XGBoost: 0.9051
Discussion: We tested a linear baseline (Logistic Regression) against non-linear ensembles (Random Forest, XGBoost). The XGBoost and Random Forest ensemble models consistently outperform the Logistic Regression baseline. The Random Forest classifier achieved the highest AUC-ROC (0.9243), outperforming both Logistic Regression and XGBoost, proving it is the most robust algorithm for capturing complex, overlapping cancellation patterns.”
Our second objective is a regression task: Predicting the
log-transformed Average Daily Rate (adr_log) based on
booking parameters. This allows the hotel to estimate the standard price
point for incoming guests based on party size, booking window, and
date.
# Train Model
lm_model <- lm(adr_log ~ ., data = train_reg)
# Predict
lm_preds <- predict(lm_model, newdata = test_reg)
# Evaluate
lm_rmse <- RMSE(lm_preds, test_reg$adr_log)
lm_r2 <- R2(lm_preds, test_reg$adr_log)
cat("Linear Regression - RMSE:", round(lm_rmse, 4), "| R-Squared:", round(lm_r2, 4), "\n")## Linear Regression - RMSE: 0.1419 | R-Squared: 0.9046
Note: We limit ntree to 50 here to balance predictive power with computation time.
# Train Model
rf_reg_model <- randomForest(adr_log ~ ., data = train_reg, ntree = 50, importance = TRUE)
# Predict
rf_reg_preds <- predict(rf_reg_model, newdata = test_reg)
# Evaluate
rf_rmse <- RMSE(rf_reg_preds, test_reg$adr_log)
rf_r2 <- R2(rf_reg_preds, test_reg$adr_log)
cat("Random Forest - RMSE:", round(rf_rmse, 4), "| R-Squared:", round(rf_r2, 4), "\n")## Random Forest - RMSE: 0.0282 | R-Squared: 0.9965
# 1. Format for XGBoost Regression
train_mat_reg <- model.matrix(adr_log ~ . - 1, data = train_reg)
test_mat_reg <- model.matrix(adr_log ~ . - 1, data = test_reg)
dtrain_reg <- xgb.DMatrix(data = train_mat_reg, label = train_reg$adr_log)
dtest_reg <- xgb.DMatrix(data = test_mat_reg, label = test_reg$adr_log)
# 2. Train Model (Notice objective is 'reg:squarederror' instead of logistic)
xgb_reg_model <- xgb.train(data = dtrain_reg, nrounds = 100,
objective = "reg:squarederror", verbose = 0)
# 3. Predict & Evaluate
xgb_reg_preds <- predict(xgb_reg_model, dtest_reg)
xgb_rmse <- RMSE(xgb_reg_preds, test_reg$adr_log)
xgb_r2 <- R2(xgb_reg_preds, test_reg$adr_log)
cat("XGBoost - RMSE:", round(xgb_rmse, 4), "| R-Squared:", round(xgb_r2, 4), "\n")## XGBoost - RMSE: 0.0322 | R-Squared: 0.9951
# Create a summary table of regression results
reg_results <- data.frame(
Model = c("Linear Regression", "Random Forest", "XGBoost"),
RMSE = c(lm_rmse, rf_rmse, xgb_rmse),
R_Squared = c(lm_r2, rf_r2, xgb_r2)
)
# Display as a beautiful styled table
datatable(reg_results,
rownames = FALSE,
options = list(
dom = 't', # Hide search bar
columnDefs = list(list(className = 'dt-center', targets = "_all")) # Center all text
),
caption = "Final Regression Model Performance Comparison") %>%
# Add background color formatting to highlight the best R-Squared scores!
formatStyle('R_Squared',
background = styleColorBar(c(0, 1), '#9FE1CB'),
backgroundSize = '98% 88%',
backgroundRepeat = 'no-repeat',
backgroundPosition = 'center')Discussion: To optimize revenue generation, we developed regression models to estimate the Average Daily Rate (ADR).The Random Forest and XGBoost models once again vastly outperform the baseline Linear Regression. The tree-based ensembles successfully captured non-linear pricing dynamics—such as the interaction between seasonality, room type, and lead time—vastly outperforming linear baselines.A higher R-Squared indicates that the ensemble models are successfully capturing the complex, non-linear relationships between variables (like how seasonality interacts with hotel type and party size to dictate price). The RMSE (Root Mean Squared Error) shows how far off our predictions are on average, in the log scale.
This project successfully applied the full data science lifecycle to the Hotel Booking Demand dataset (119,390 records), satisfying the core objectives set by the CIO.
Exploratory Data Analysis: We identified that deposit_type, market_segment, and lead_time are the strongest behavioral indicators of a potential cancellation. Furthermore, plotting ADR by month clearly revealed peak pricing seasonality.
Feature Engineering: We derived new features, such as has_agent (to handle MNAR data) and total_nights, and normalized our target pricing data using log transformations to prepare it for machine learning.
Classification (Cancellation Risk): By testing a baseline linear model against granular ensembles (Random Forest and XGBoost), we proved that tree-based models handle hotel booking behavior exceptionally well, achieving an AUC of ~0.93.
Regression (Revenue Estimation): Using XGBoost and Random Forest regressors, we were able to predict the Average Daily Rate with a high R-Squared, proving that pricing is highly deterministic based on the booking configuration.
Business Recommendation: Implementing these predictive insights will allow hotel management to strategically overbook to offset anticipated cancellations, adjust non-refundable policies dynamically, and ultimately build an algorithmic pricing engine to optimize their revenue pipelines.