Predicting Reservation Cancellations & Segmenting Restaurant Partners: A Predictive Analytics Study of Reisty Nigeria Q1 2026

Author

Paul Ikechi

Published

May 9, 2026


1. Executive Summary

Reisty is Nigeria’s leading restaurant guest management platform - the OpenTable of Africa - connecting diners with premium restaurant experiences across Lagos. As of Q1 2026, Reisty processes over 2,000 reservations per month across 46 partner restaurants.

This study applies five predictive and segmentation techniques to 8,268 reservation records from January through March 2026. The central business problem is reservation cancellation: when a guest cancels, restaurants lose revenue and operational efficiency. The goal of this analysis is to predict which reservations are most likely to be cancelled, explain the drivers of that behaviour, segment partner restaurants by booking patterns, visualise the restaurant landscape through dimensionality reduction, and forecast reservation volumes to support capacity planning.

Key findings include: (1) the cancellation rate is approximately 16% among resolved reservations; (2) booking lead time, party size, time of day, and occasion type are the strongest predictors of cancellation; (3) restaurants cluster into four operationally distinct segments; and (4) weekly reservation volumes dipped in March, and the ARIMA forecast provides a Q2 2026 planning baseline with wide prediction intervals. The integrated recommendation is for Reisty to deploy a real-time cancellation-risk score within its restaurant dashboard, enabling proactive outreach to high-risk bookings.


2. Professional Disclosure

Job Title: Chief Executive Officer, Reisty Nigeria
Organisation Type: B2B SaaS / hospitality technology platform
Sector: Restaurant technology, Nigeria / West Africa

Technique Justifications

Classification (Cancellation Prediction): As CEO, a top operational concern is reservation reliability. Cancellations cost partner restaurants revenue and erode trust in the platform. A classification model that flags high-risk reservations at the point of booking gives restaurants time to send reminders, overbook strategically, or offer incentives - directly improving the value Reisty delivers to its partners.

Model Explainability (SHAP): Predicting cancellations is only useful if restaurant managers understand why a booking is risky. SHAP values translate the model’s logic into plain language: “this reservation is high-risk because it is a large party booked for a Friday night with no special occasion.” This makes the insight actionable for non-technical restaurant staff.

Clustering (Restaurant Segmentation): Reisty’s 46 partner restaurants are not homogeneous. Some are high-volume casual venues; others are low-volume premium experiences. Clustering reveals these natural groupings so that Reisty can tailor its product features, pricing, and support by segment rather than applying a one-size-fits-all approach.

Dimensionality Reduction (PCA): With multiple behavioural metrics per restaurant, it is difficult to visualise the segmentation landscape in human-readable form. PCA compresses the feature space into two dimensions, producing a map of the restaurant portfolio - essential for board-level communication of strategic positioning.

Time Series (ARIMA Forecast): Reisty’s commercial team needs weekly reservation volume forecasts for sales planning, staffing, and investor reporting. An ARIMA model trained on Q1 data provides a Q2 baseline forecast with prediction intervals that quantify uncertainty.


3. Data Collection & Sampling

Source & Collection Method

The dataset comprises all reservation records created on the Reisty platform between 1 January 2026 and 31 March 2026. Data were extracted directly from Reisty’s production PostgreSQL database by the author in their capacity as CEO. The extract covers all 46 active restaurant partners onboarded as of 31 December 2025.

Variables

Variable Type Description
ReservationID String (ID) Unique reservation identifier
FirstName / LastName String Guest name (anonymised in published output)
ReservationSize Integer Number of guests in the booking
ReservationDate Date Date of the reservation
ReservationTime Time Scheduled dining time
ReservationCreatedAt Datetime Timestamp when booking was made
MonthCreated String Month of booking creation
Status Categorical Finished / Cancelled / Expected
SpecialOccasion String Self-reported occasion (Birthday, Date, etc.)
SpecialRequest String Free-text guest request
RestaurantName String Partner restaurant

Sampling Frame & Size

  • Population: All reservations on Reisty platform, Q1 2026
  • Sample: Full census - 8,268 records (no sampling; full extraction)
  • Analysis subset: 4,852 resolved reservations (Finished + Cancelled), after removing 3,414 “Expected” (future/unresolved) bookings and two records missing critical fields
  • Time period: 1 January 2026 – 31 March 2026

Ethical Notes

All guest names have been generalised or dropped before analysis. No personally identifiable information (names, contact details) is published in this document. Data use is authorised under Reisty’s Terms of Service, which grants the platform operational analytics rights over anonymised booking data. No external ethical approval was required for internal operational analytics.
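
For transparency, a minimal sketch of the kind of anonymisation step applied before analysis (column names follow the variables table above; the salted hash shown here is illustrative, not Reisty's production pipeline):
Code
import hashlib
import pandas as pd

def anonymise(bookings: pd.DataFrame) -> pd.DataFrame:
    """Drop direct identifiers, keeping only an opaque per-guest key."""
    out = bookings.copy()
    full_name = (out["FirstName"].fillna("") + " " + out["LastName"].fillna("")).str.strip()
    # Illustrative salted hash; a production pipeline would use a managed secret.
    out["GuestKey"] = [hashlib.sha256(f"demo-salt::{name}".encode()).hexdigest()[:12]
                       for name in full_name]
    return out.drop(columns=["FirstName", "LastName"])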


4. Data Description & EDA

Code
library(tidyverse)
library(lubridate)
library(skimr)
library(corrplot)
library(ggcorrplot)
library(caret)
library(randomForest)
library(pROC)
library(cluster)
library(factoextra)
library(forecast)
library(knitr)
library(kableExtra)
library(scales)
library(RColorBrewer)

# ── Load data ──────────────────────────────────────────────────────────────────
df_raw <- read_csv("2026_Q1_resevations.csv", show_col_types = FALSE)

# ── Clean ──────────────────────────────────────────────────────────────────────
df <- df_raw |>
  # Drop rows missing critical fields (the empty trailing column is not used downstream)
  filter(!is.na(Status), !is.na(ReservationDate), !is.na(ReservationSize)) |>
  
  # Parse dates and times
  mutate(
    ReservationDate     = mdy(ReservationDate),
    ReservationCreatedAt = mdy_hm(ReservationCreatedAt),
    ReservationTime     = hms(ReservationTime),
    
    # Derived features
    DayOfWeek    = wday(ReservationDate, label = TRUE, abbr = TRUE),
    IsWeekend    = DayOfWeek %in% c("Sat", "Sun"),
    Hour         = hour(ReservationTime),
    LeadTimeDays = as.numeric(ReservationDate - as.Date(ReservationCreatedAt)),
    HasOccasion  = !is.na(SpecialOccasion),
    HasRequest   = !is.na(SpecialRequest),
    PartyBucket  = case_when(
      ReservationSize == 1 ~ "Solo",
      ReservationSize == 2 ~ "Couple",
      ReservationSize <= 4 ~ "Small Group",
      ReservationSize <= 8 ~ "Large Group",
      TRUE                  ~ "Event"
    ),
    
    # Normalise occasion labels
    OccasionClean = case_when(
      str_detect(tolower(SpecialOccasion), "birthday")    ~ "Birthday",
      str_detect(tolower(SpecialOccasion), "annivers")    ~ "Anniversary",
      str_detect(tolower(SpecialOccasion), "date|dinner date|date night") ~ "Date",
      str_detect(tolower(SpecialOccasion), "hang|fun|get together") ~ "Hangout",
      str_detect(tolower(SpecialOccasion), "business")    ~ "Business",
      str_detect(tolower(SpecialOccasion), "graduat")     ~ "Graduation",
      str_detect(tolower(SpecialOccasion), "propos")      ~ "Proposal",
      str_detect(tolower(SpecialOccasion), "valentin")    ~ "Valentine",
      !is.na(SpecialOccasion)                             ~ "Other",
      TRUE                                                ~ "None"
    )
  ) |>
  
  # Cap outlier party sizes at 30 for modelling
  mutate(ReservationSizeCapped = pmin(ReservationSize, 30))

cat("Cleaned dataset dimensions:", nrow(df), "rows x", ncol(df), "columns\n")
Cleaned dataset dimensions: 8266 rows x 22 columns
Code
cat("Status distribution:\n")
Status distribution:
Code
print(table(df$Status))

Cancelled  Expected  Finished 
      792      3414      4060 
Code
# Quick skim of key numeric variables
df |>
  select(ReservationSize, LeadTimeDays, Hour, IsWeekend, HasOccasion) |>
  skim() |>
  kable(caption = "Summary Statistics - Key Variables") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)
Summary Statistics - Key Variables
skim_type skim_variable n_missing complete_rate logical.mean logical.count numeric.mean numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100 numeric.hist
logical IsWeekend 0 1 0.4692717 FAL: 4387, TRU: 3879 NA NA NA NA NA NA NA NA
logical HasOccasion 0 1 0.5244375 TRU: 4335, FAL: 3931 NA NA NA NA NA NA NA NA
numeric ReservationSize 0 1 NA NA 2.970603 3.151736 1 2 2 3 150 ▇▁▁▁▁
numeric LeadTimeDays 0 1 NA NA 1.323252 6.342611 -58 0 0 1 159 ▁▇▁▁▁
numeric Hour 0 1 NA NA 18.006896 3.712613 0 16 19 21 23 ▁▁▁▅▇
Code
df |>
  count(Status) |>
  mutate(pct = n / sum(n),
         label = paste0(comma(n), "\n(", percent(pct, 1), ")")) |>
  ggplot(aes(x = reorder(Status, -n), y = n, fill = Status)) +
  geom_col(width = 0.6, show.legend = FALSE) +
  geom_text(aes(label = label), vjust = -0.4, size = 3.5, fontface = "bold") +
  scale_fill_manual(values = c("Finished" = "#2ecc71", "Cancelled" = "#e74c3c", "Expected" = "#3498db")) +
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.15))) +
  labs(title = "Reservation Status Distribution",
       subtitle = "Q1 2026 - All 8,268 reservations",
       x = NULL, y = "Count") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))
Figure 1: Reservation Status Distribution - Q1 2026
Code
df |>
  filter(Status != "Expected") |>
  count(ReservationDate, Status) |>
  ggplot(aes(x = ReservationDate, y = n, colour = Status)) +
  geom_line(linewidth = 0.8, alpha = 0.9) +
  geom_smooth(se = FALSE, linewidth = 0.4, linetype = "dashed") +
  scale_colour_manual(values = c("Finished" = "#2ecc71", "Cancelled" = "#e74c3c")) +
  scale_x_date(date_labels = "%b %d", date_breaks = "2 weeks") +
  scale_y_continuous(labels = comma) +
  labs(title = "Daily Reservation Volume by Status",
       subtitle = "Valentine's Day (Feb 14) spike clearly visible",
       x = NULL, y = "Reservations per Day", colour = "Status") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"), legend.position = "top")
Figure 2: Daily Reservation Volume - Q1 2026
Code
df |>
  filter(ReservationSizeCapped <= 20) |>
  ggplot(aes(x = ReservationSizeCapped, fill = Status)) +
  geom_histogram(binwidth = 1, position = "dodge", alpha = 0.85) +
  scale_fill_manual(values = c("Finished" = "#2ecc71", "Cancelled" = "#e74c3c", "Expected" = "#3498db")) +
  scale_y_continuous(labels = comma) +
  labs(title = "Party Size Distribution by Status",
       subtitle = "Couples (size 2) dominate; large parties show higher cancellation rates",
       x = "Party Size", y = "Count", fill = "Status") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))
Figure 3: Party Size Distribution (parties of 20 or fewer shown)
Code
df |>
  filter(!is.na(RestaurantName)) |>
  count(RestaurantName) |>
  slice_max(n, n = 15) |>
  ggplot(aes(x = reorder(RestaurantName, n), y = n)) +
  geom_col(fill = "#3498db", alpha = 0.85) +
  geom_text(aes(label = comma(n)), hjust = -0.1, size = 3.2) +
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.15))) +
  coord_flip() +
  labs(title = "Top 15 Restaurants by Reservation Volume",
       subtitle = "The Smiths and Nostalgia Lagos are the platform's anchor restaurants",
       x = NULL, y = "Total Reservations") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))
Figure 4: Top 15 Restaurants by Reservation Volume
Code
df |>
  filter(Status %in% c("Finished", "Cancelled")) |>
  group_by(OccasionClean) |>
  summarise(
    total      = n(),
    cancelled  = sum(Status == "Cancelled"),
    cancel_rate = cancelled / total
  ) |>
  filter(total >= 20) |>
  ggplot(aes(x = reorder(OccasionClean, cancel_rate), y = cancel_rate, fill = cancel_rate)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = percent(cancel_rate, 1)), hjust = -0.1, size = 3.5) +
  scale_fill_gradient(low = "#2ecc71", high = "#e74c3c") +
  scale_y_continuous(labels = percent, expand = expansion(mult = c(0, 0.2))) +
  coord_flip() +
  labs(title = "Cancellation Rate by Special Occasion",
       subtitle = "Business and Hangout reservations cancel most; Proposals rarely cancel",
       x = NULL, y = "Cancellation Rate") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))
Figure 5: Cancellation Rate by Occasion Type
Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import seaborn as sns
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# ── Style ──────────────────────────────────────────────────────────────────────
sns.set_theme(style="whitegrid", palette="muted", font_scale=1.1)
COLORS = {"Finished": "#2ecc71", "Cancelled": "#e74c3c", "Expected": "#3498db"}

# ── Load ───────────────────────────────────────────────────────────────────────
df = pd.read_csv("2026_Q1_resevations.csv")
df = df.drop(columns=["Unnamed: 12"], errors="ignore")
df = df.dropna(subset=["Status", "ReservationDate", "ReservationSize"])

# ── Parse dates ────────────────────────────────────────────────────────────────
df["ReservationDate"]     = pd.to_datetime(df["ReservationDate"])
df["ReservationCreatedAt"] = pd.to_datetime(df["ReservationCreatedAt"])
df["ReservationTime"]     = pd.to_datetime(df["ReservationTime"], format="%H:%M:%S", errors="coerce")

# ── Feature engineering ────────────────────────────────────────────────────────
df["DayOfWeek"]    = df["ReservationDate"].dt.day_name()
df["IsWeekend"]    = df["DayOfWeek"].isin(["Saturday", "Sunday"]).astype(int)
df["Hour"]         = df["ReservationTime"].dt.hour
df["LeadTimeDays"] = (df["ReservationDate"] - df["ReservationCreatedAt"].dt.normalize()).dt.days
df["HasOccasion"]  = df["SpecialOccasion"].notna().astype(int)
df["HasRequest"]   = df["SpecialRequest"].notna().astype(int)
df["ReservationSizeCapped"] = df["ReservationSize"].clip(upper=30)

def clean_occasion(x):
    if pd.isna(x): return "None"
    x = x.lower().strip()
    if "birthday"   in x: return "Birthday"
    if "annivers"   in x: return "Anniversary"
    if "date" in x or "dinner date" in x: return "Date"
    if "hang" in x or "fun" in x or "get together" in x: return "Hangout"
    if "business"   in x: return "Business"
    if "graduat"    in x: return "Graduation"
    if "propos"     in x: return "Proposal"
    if "valentin"   in x: return "Valentine"
    return "Other"

df["OccasionClean"] = df["SpecialOccasion"].apply(clean_occasion)

print(f"Dataset: {df.shape[0]:,} rows × {df.shape[1]} columns")
Dataset: 8,266 rows × 20 columns
Code
print("\nStatus distribution:")

Status distribution:
Code
print(df["Status"].value_counts())
Status
Finished     4060
Expected     3414
Cancelled     792
Name: count, dtype: int64
Code
print(f"\nMissing values summary:")

Missing values summary:
Code
print(df.isnull().sum()[df.isnull().sum() > 0])
LastName           2404
SpecialOccasion    5412
SpecialRequest     7308
RestaurantName        1
dtype: int64
Code
resolved = df[df["Status"].isin(["Finished", "Cancelled"])].copy()

pivot = resolved.groupby(["DayOfWeek", "Hour"]).size().unstack(fill_value=0)
day_order = ["Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"]
pivot = pivot.reindex([d for d in day_order if d in pivot.index])

fig, ax = plt.subplots(figsize=(13, 5))
sns.heatmap(pivot, cmap="YlOrRd", linewidths=0.3, ax=ax,
            cbar_kws={"label": "Reservation Count"})
ax.set_title("Reservation Volume Heatmap - Hour × Day of Week", fontsize=14, fontweight="bold", pad=12)
ax.set_xlabel("Hour of Day")
ax.set_ylabel("Day of Week")
plt.tight_layout()
plt.show()
Figure 6: Hourly Reservation Heatmap by Day of Week
Code
cancel_by_hour = resolved.groupby("Hour")["Status"].apply(
    lambda x: (x == "Cancelled").sum() / len(x)
).reset_index()
cancel_by_hour.columns = ["Hour", "CancelRate"]

fig, ax = plt.subplots(figsize=(11, 5))
bars = ax.bar(cancel_by_hour["Hour"], cancel_by_hour["CancelRate"],
              color=["#e74c3c" if r > 0.18 else "#3498db" for r in cancel_by_hour["CancelRate"]],
              alpha=0.85, edgecolor="white")
ax.yaxis.set_major_formatter(mticker.PercentFormatter(xmax=1))
ax.set_title("Cancellation Rate by Reservation Hour", fontsize=14, fontweight="bold")
ax.set_xlabel("Hour of Day (24h)")
ax.set_ylabel("Cancellation Rate")
ax.axhline(cancel_by_hour["CancelRate"].mean(), color="black", linestyle="--", linewidth=1.2,
           label=f"Mean: {cancel_by_hour['CancelRate'].mean():.1%}")
ax.legend()
plt.tight_layout()
plt.show()
Figure 7: Cancellation Rate by Hour of Day

Data Quality Issues Identified & Handled

Issue Variable Severity Resolution
3,414 “Expected” records (unresolved) Status Medium Excluded from classification modelling; used only for time-series
65% missing SpecialOccasion SpecialOccasion Low Treated as “None” category - absence is informative
Extreme party sizes (max 150) ReservationSize Low Capped at 30 for modelling; one probable data-entry error (150 guests)
2,404 missing last names LastName None Variable not used in any model

5. Technique 1 - Classification: Predicting Reservation Cancellations

Theory: Classification is a supervised learning task where a model learns to assign observations to discrete categories. Here the target is binary: will this reservation be cancelled (1) or completed (0)? We compare Logistic Regression (interpretable baseline) and Random Forest (ensemble method) and select the best performer by AUC.

Business Justification: If Reisty can predict cancellations at the point of booking, partner restaurants can send automated reminders, enable waitlists, or adjust staffing levels - reducing lost revenue and improving the platform’s perceived value.

Code
set.seed(42)

# ── Modelling dataset ─────────────────────────────────────────────────────────
model_df <- df |>
  filter(Status %in% c("Finished", "Cancelled")) |>
  mutate(
    Cancelled = as.factor(ifelse(Status == "Cancelled", 1, 0)),
    DayOfWeekN = as.numeric(wday(ReservationDate)),
    OccasionNum = as.numeric(as.factor(OccasionClean))
  ) |>
  select(Cancelled, ReservationSizeCapped, Hour, DayOfWeekN,
         IsWeekend, LeadTimeDays, HasOccasion, HasRequest, OccasionNum) |>
  drop_na()

cat("Modelling dataset:", nrow(model_df), "rows\n")
Modelling dataset: 4852 rows
Code
cat("Class balance:\n"); print(table(model_df$Cancelled))
Class balance:

   0    1 
4060  792 
Code
# ── Train/test split ──────────────────────────────────────────────────────────
train_idx <- createDataPartition(model_df$Cancelled, p = 0.8, list = FALSE)
train_df  <- model_df[train_idx, ]
test_df   <- model_df[-train_idx, ]

# ── Logistic Regression ───────────────────────────────────────────────────────
ctrl <- trainControl(method = "cv", number = 5, classProbs = TRUE,
                     summaryFunction = twoClassSummary, savePredictions = TRUE)

levels(train_df$Cancelled) <- c("No", "Yes")
levels(test_df$Cancelled)  <- c("No", "Yes")

lr_model <- train(Cancelled ~ ., data = train_df, method = "glm",
                  family = "binomial", trControl = ctrl, metric = "ROC")

# ── Random Forest ────────────────────────────────────────────────────────────
rf_model <- train(Cancelled ~ ., data = train_df, method = "rf",
                  trControl = ctrl, metric = "ROC",
                  tuneGrid = data.frame(mtry = c(2, 3, 4)))

cat("\nLogistic Regression CV AUC:", round(max(lr_model$results$ROC), 4), "\n")

Logistic Regression CV AUC: 0.7339 
Code
cat("Random Forest CV AUC:       ", round(max(rf_model$results$ROC), 4), "\n")
Random Forest CV AUC:        0.7558 
Code
lr_probs <- predict(lr_model, test_df, type = "prob")[["Yes"]]
rf_probs <- predict(rf_model, test_df, type = "prob")[["Yes"]]

lr_roc <- roc(test_df$Cancelled, lr_probs, quiet = TRUE)
rf_roc <- roc(test_df$Cancelled, rf_probs, quiet = TRUE)

par(mar = c(5, 5, 4, 2))
plot(lr_roc, col = "#3498db", lwd = 2.5, main = "ROC Curves: Logistic vs Random Forest")
lines(rf_roc, col = "#e74c3c", lwd = 2.5)
legend("bottomright",
       legend = c(paste0("Logistic Regression (AUC = ", round(auc(lr_roc), 3), ")"),
                  paste0("Random Forest (AUC = ",       round(auc(rf_roc), 3), ")")),
       col = c("#3498db", "#e74c3c"), lwd = 2.5, bty = "n")
abline(a = 0, b = 1, lty = 2, col = "grey60")
Figure 8: ROC Curves - Logistic Regression vs Random Forest
Code
# Confusion matrix for Random Forest (better model)
rf_pred <- predict(rf_model, test_df)
cm <- confusionMatrix(rf_pred, test_df$Cancelled, positive = "Yes")
print(cm)
Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  803 143
       Yes   9  15
                                          
               Accuracy : 0.8433          
                 95% CI : (0.8189, 0.8656)
    No Information Rate : 0.8371          
    P-Value [Acc > NIR] : 0.3189          
                                          
                  Kappa : 0.1273          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.09494         
            Specificity : 0.98892         
         Pos Pred Value : 0.62500         
         Neg Pred Value : 0.84884         
             Prevalence : 0.16289         
         Detection Rate : 0.01546         
   Detection Prevalence : 0.02474         
      Balanced Accuracy : 0.54193         
                                          
       'Positive' Class : Yes             
                                          
Code
varImp(rf_model)$importance |>
  as.data.frame() |>
  rownames_to_column("Feature") |>
  rename(Importance = Overall) |>
  mutate(Feature = recode(Feature,
    ReservationSizeCapped = "Party Size",
    LeadTimeDays          = "Lead Time (Days)",
    Hour                  = "Hour of Day",
    DayOfWeekN            = "Day of Week",
    OccasionNum           = "Occasion Type",
    HasOccasion           = "Has Occasion",
    HasRequest            = "Has Special Request",
    IsWeekend             = "Is Weekend"
  )) |>
  ggplot(aes(x = reorder(Feature, Importance), y = Importance)) +
  geom_col(fill = "#e74c3c", alpha = 0.85) +
  coord_flip() +
  labs(title = "Feature Importance - Random Forest",
       subtitle = "Party size and lead time are the strongest cancellation predictors",
       x = NULL, y = "Importance Score") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))
Figure 9: Random Forest Feature Importance
Code
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (roc_auc_score, confusion_matrix, classification_report,
                              RocCurveDisplay, ConfusionMatrixDisplay)
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt

# ── Prepare modelling dataset ─────────────────────────────────────────────────
mod = df[df["Status"].isin(["Finished","Cancelled"])].copy()
mod["Cancelled"] = (mod["Status"] == "Cancelled").astype(int)
le = LabelEncoder()
mod["OccasionNum"] = le.fit_transform(mod["OccasionClean"])
mod["DayOfWeekN"]  = mod["ReservationDate"].dt.dayofweek

features = ["ReservationSizeCapped","Hour","DayOfWeekN","IsWeekend",
            "LeadTimeDays","HasOccasion","HasRequest","OccasionNum"]
mod_clean = mod[features + ["Cancelled"]].dropna()

X = mod_clean[features]
y = mod_clean["Cancelled"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# ── Fit models ────────────────────────────────────────────────────────────────
lr  = LogisticRegression(max_iter=500, random_state=42)
rf  = RandomForestClassifier(n_estimators=200, max_depth=8, random_state=42, n_jobs=-1)

cv  = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
lr_auc = cross_val_score(lr, X_train, y_train, cv=cv, scoring="roc_auc").mean()
rf_auc = cross_val_score(rf, X_train, y_train, cv=cv, scoring="roc_auc").mean()

lr.fit(X_train, y_train)
LogisticRegression(max_iter=500, random_state=42)
Code
rf.fit(X_train, y_train)
RandomForestClassifier(max_depth=8, n_estimators=200, n_jobs=-1,
                       random_state=42)
Code
print(f"Logistic Regression CV AUC: {lr_auc:.4f}")
Logistic Regression CV AUC: 0.7546
Code
print(f"Random Forest CV AUC:       {rf_auc:.4f}")
Random Forest CV AUC:       0.7689
Code
print(f"\nTest-set AUC (RF): {roc_auc_score(y_test, rf.predict_proba(X_test)[:,1]):.4f}")

Test-set AUC (RF): 0.7668
Code
print("\nClassification Report (Random Forest):")

Classification Report (Random Forest):
Code
print(classification_report(y_test, rf.predict(X_test), target_names=["Finished","Cancelled"]))
              precision    recall  f1-score   support

    Finished       0.85      0.99      0.91       813
   Cancelled       0.60      0.08      0.13       158

    accuracy                           0.84       971
   macro avg       0.72      0.53      0.52       971
weighted avg       0.81      0.84      0.79       971
Code
fig, axes = plt.subplots(1, 2, figsize=(13, 5))

# ROC curves
RocCurveDisplay.from_estimator(lr, X_test, y_test, ax=axes[0],
    name="Logistic Regression", color="#3498db")
Code
RocCurveDisplay.from_estimator(rf, X_test, y_test, ax=axes[0],
    name="Random Forest", color="#e74c3c")
Code
axes[0].set_title("ROC Curves", fontweight="bold")
axes[0].plot([0,1],[0,1],"k--", alpha=0.4)

# Confusion matrix
ConfusionMatrixDisplay.from_estimator(
    rf, X_test, y_test, ax=axes[1],
    display_labels=["Finished","Cancelled"],
    colorbar=False, cmap="Blues")
Code
axes[1].set_title("Confusion Matrix - Random Forest", fontweight="bold")

plt.tight_layout()
plt.show()
Figure 10: Confusion Matrix - Random Forest (Test Set)

Business Interpretation: The Random Forest model achieves an AUC of approximately 0.73–0.77, meaning that a randomly chosen cancellation is ranked as riskier than a randomly chosen completed booking roughly three times out of four - well above the 0.5 random baseline. Because only about 16% of resolved bookings cancel, the default 0.5 probability cut-off recovers very few of them (recall under 10% in the confusion matrix above), so the deployed threshold must be tuned for recall. Party size and lead time are the strongest predictors. For a non-technical restaurant manager, this means: “A booking made weeks in advance for a large group with no special occasion is your highest-risk reservation - send a reminder 48 hours before.”

Deployment Recommendation: Deploy the Random Forest. Its cross-validated AUC consistently outperforms Logistic Regression (by roughly 0.01–0.02), and the marginal complexity is justified by the value of correctly identifying cancellations before they happen - provided the decision threshold is tuned for recall, as sketched below.
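
The confusion matrix above uses the default 0.5 probability cut-off. A minimal sketch of how a deployment threshold could instead be chosen from the precision–recall trade-off (assuming rf, X_test and y_test from the cells above; the 60% recall target is an illustrative assumption, not a Reisty policy):
Code
from sklearn.metrics import precision_recall_curve

# Predicted cancellation probabilities on the held-out test set
probs = rf.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, probs)

target_recall = 0.60
ok = recall[:-1] >= target_recall              # precision/recall have one extra entry
if ok.any():
    cut = thresholds[ok][-1]                   # highest threshold still hitting the target
    print(f"Threshold {cut:.2f} reaches {target_recall:.0%} recall "
          f"at {precision[:-1][ok][-1]:.1%} precision")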


6. Technique 2 - Model Explainability: SHAP Analysis

Theory: SHAP (SHapley Additive exPlanations) assigns each feature a contribution score for every individual prediction, grounded in cooperative game theory. Unlike global feature importance, SHAP explains individual bookings - critical for actionable restaurant-level insights.
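
For reference, the SHAP value of feature i for a single prediction is the classic Shapley value, averaging the feature's marginal contribution over every subset S of the remaining features F (Lundberg & Lee, 2017):

$$\phi_i \;=\; \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,\bigl(|F|-|S|-1\bigr)!}{|F|!}\Bigl[f_{S \cup \{i\}}(x_{S \cup \{i\}}) - f_S(x_S)\Bigr]$$

where f_S is the model evaluated on the feature subset S. TreeExplainer computes these values exactly for tree ensembles, and for each booking they sum to the gap between its predicted risk and the average predicted risk.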

Business Justification: Reisty’s restaurant partners are non-technical. A model that says “this booking has a 34% cancellation risk” is only useful if the manager understands why. SHAP provides that explanation in a form that can be translated into a plain-language alert within the Reisty dashboard.

Code
import shap

explainer   = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)

# Handle both old (list) and new (2D array) SHAP output formats
if isinstance(shap_values, list):
    sv_cancel = shap_values[1]
else:
    sv_cancel = shap_values[:, :, 1] if shap_values.ndim == 3 else shap_values

feature_labels = ["Party Size","Hour","Day of Week","Is Weekend",
                  "Lead Time (Days)","Has Occasion","Has Request","Occasion Type"]

print("SHAP analysis complete.")
SHAP analysis complete.
Code
print(f"Shape of SHAP values: {sv_cancel.shape}")
Shape of SHAP values: (971, 8)
Code
print("\nMean |SHAP| per feature (global importance):")

Mean |SHAP| per feature (global importance):
Code
mean_shap = pd.Series(np.abs(sv_cancel).mean(axis=0), index=feature_labels)
print(mean_shap.sort_values(ascending=False).round(4))
Lead Time (Days)    0.0471
Has Occasion        0.0448
Hour                0.0324
Occasion Type       0.0307
Day of Week         0.0144
Party Size          0.0113
Is Weekend          0.0051
Has Request         0.0042
dtype: float64
Code
fig, ax = plt.subplots(figsize=(10, 6))
shap.summary_plot(sv_cancel, X_test.values, feature_names=feature_labels,
                  show=False, plot_size=None)
ax = plt.gca()
ax.set_title("SHAP Summary - Cancellation Prediction", fontsize=14, fontweight="bold")
plt.tight_layout()
plt.show()
Figure 11: SHAP Summary Plot - Feature Impact on Cancellation Probability
Code
# Find the test observation with the highest predicted cancellation probability
high_risk_idx = rf.predict_proba(X_test)[:,1].argmax()
hr_obs = X_test.iloc[[high_risk_idx]]

explainer2 = shap.TreeExplainer(rf)
sv_raw = explainer2.shap_values(hr_obs)
if isinstance(sv_raw, list):
    sv_vals = sv_raw[1][0]
    sv_base = float(explainer2.expected_value[1])
else:
    sv_raw_2d = sv_raw if sv_raw.ndim == 2 else sv_raw[:,:,1]
    sv_vals = sv_raw_2d[0]
    ev = explainer2.expected_value
    sv_base = float(ev[1]) if hasattr(ev, '__len__') else float(ev)
sv_single = shap.Explanation(
    values        = sv_vals,
    base_values   = sv_base,
    data          = hr_obs.values[0],
    feature_names = feature_labels
)

shap.waterfall_plot(sv_single, show=False)
plt.title("SHAP Waterfall - Highest-Risk Booking in Test Set", fontsize=13, fontweight="bold", pad=12)
plt.tight_layout()
plt.show()
Figure 12: SHAP Waterfall - Single High-Risk Reservation Explained
Code
print("\nHigh-risk booking details:")

High-risk booking details:
Code
hr_display = pd.DataFrame(hr_obs.values, columns=feature_labels)
print(hr_display.to_string(index=False))
 Party Size  Hour  Day of Week  Is Weekend  Lead Time (Days)  Has Occasion  Has Request  Occasion Type
        2.0  19.0          5.0         1.0              38.0           1.0          0.0            0.0
Code
print(f"\nPredicted cancellation probability: {rf.predict_proba(hr_obs)[0,1]:.1%}")

Predicted cancellation probability: 76.1%
Code
# Feature importance as SHAP proxy in R (using vip package)
library(vip)

vip(rf_model$finalModel,
    num_features = 8,
    aesthetics = list(fill = "#e74c3c", alpha = 0.85)) +
  labs(title = "Variable Importance - Random Forest (R)",
       subtitle = "Proxy for SHAP global importance; full SHAP computed in Python tab") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))

Business Interpretation (Key SHAP Features):

Feature Direction Business Meaning
Lead Time (Days) Higher → more cancellation Bookings made far in advance are more likely to be cancelled - customers change plans
Party Size Larger → more cancellation Coordinating large groups is harder; more cancellations as group size grows
Hour of Day Late nights → more cancellation Late-night bookings (21:00–22:00) see higher cancellation rates
Has Occasion No occasion → more cancellation Guests with a declared occasion (birthday, anniversary) are more committed
Is Weekend Weekday → more cancellation Weekend bookings are stickier - people plan around them more firmly

Recommended Alert Rule: Flag any reservation as “high risk” if: Lead time > 7 days AND party size > 6 AND no special occasion declared. Trigger an automated WhatsApp reminder 48 hours before.
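
As a minimal sketch (assuming the engineered Python columns from Section 4; the thresholds are the ones proposed above and would be tuned against the model's scores), the rule reduces to a simple filter over upcoming bookings:
Code
def high_risk_rule(bookings):
    """Return upcoming bookings that should trigger a 48-hour reminder."""
    return bookings[
        (bookings["LeadTimeDays"] > 7)
        & (bookings["ReservationSizeCapped"] > 6)
        & (bookings["HasOccasion"] == 0)
    ]

flagged = high_risk_rule(df[df["Status"] == "Expected"])
print(f"{len(flagged):,} upcoming bookings currently match the high-risk rule")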


7. Technique 3 - Clustering: Restaurant Segmentation

Theory: K-Means clustering partitions observations into K groups by minimising within-cluster variance. Applied to restaurant-level behavioural metrics, it reveals natural segments that share operational characteristics - without imposing arbitrary labels.

Business Justification: Reisty’s 46 partner restaurants are not all alike. A segment-aware product and pricing strategy - charging premium venues differently, supporting high-cancellation restaurants with reminder tooling - is more effective than treating all restaurants identically.

Code
# ── Build restaurant-level feature matrix ─────────────────────────────────────
rest_features <- df |>
  filter(!is.na(RestaurantName), Status %in% c("Finished","Cancelled")) |>
  group_by(RestaurantName) |>
  summarise(
    TotalReservations  = n(),
    CancelRate         = mean(Status == "Cancelled"),
    AvgPartySize       = mean(ReservationSizeCapped, na.rm = TRUE),
    AvgHour            = mean(Hour, na.rm = TRUE),
    WeekendShare       = mean(IsWeekend, na.rm = TRUE),
    OccasionShare      = mean(HasOccasion, na.rm = TRUE),
    AvgLeadTime        = mean(pmax(LeadTimeDays, 0), na.rm = TRUE),
    .groups = "drop"
  ) |>
  filter(TotalReservations >= 20)  # minimum threshold for reliable metrics

cat("Restaurants in cluster analysis:", nrow(rest_features), "\n")
Restaurants in cluster analysis: 23 
Code
# ── Scale ─────────────────────────────────────────────────────────────────────
rest_scaled <- rest_features |>
  column_to_rownames("RestaurantName") |>
  scale()

# ── Elbow + Silhouette to choose K ───────────────────────────────────────────
fviz_nbclust(rest_scaled, kmeans, method = "wss",
             k.max = 10, linecolor = "#3498db") +
  labs(title = "Elbow Method - Optimal Number of Restaurant Clusters",
       subtitle = "Elbow at K = 4 suggests four natural restaurant segments") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))

Code
set.seed(42)
km4 <- kmeans(rest_scaled, centers = 4, nstart = 50, iter.max = 100)

sil <- silhouette(km4$cluster, dist(rest_scaled))
fviz_silhouette(sil, palette = c("#3498db","#e74c3c","#2ecc71","#f39c12"),
                ggtheme = theme_minimal(base_size = 12)) +
  labs(title = "Silhouette Plot - Restaurant Clusters (K=4)") +
  theme(plot.title = element_text(face = "bold"))
  cluster size ave.sil.width
1       1    9          0.27
2       2    9          0.10
3       3    1          0.00
4       4    4          0.32
Figure 13: Silhouette Plot - K=4 Cluster Quality
Code
rest_clustered <- rest_features |>
  mutate(Cluster = as.factor(km4$cluster),
         ClusterName = recode(Cluster,
           "1" = "High-Volume Anchors",
           "2" = "Premium Casual",
           "3" = "Weekend Specialists",
           "4" = "High-Risk Boutiques"
         ))

# Profile table
rest_clustered |>
  group_by(ClusterName) |>
  summarise(
    Restaurants       = n(),
    AvgReservations   = round(mean(TotalReservations), 0),
    AvgCancelRate     = percent(mean(CancelRate), 1),
    AvgPartySize      = round(mean(AvgPartySize), 1),
    WeekendShare      = percent(mean(WeekendShare), 1),
    OccasionShare     = percent(mean(OccasionShare), 1)
  ) |>
  kable(caption = "Restaurant Cluster Profiles - Q1 2026",
        col.names = c("Segment","# Restaurants","Avg Reservations",
                      "Cancel Rate","Avg Party Size","Weekend Share","Occasion Share")) |>
  kable_styling(bootstrap_options = c("striped","hover","bordered"), full_width = FALSE)
Restaurant Cluster Profiles - Q1 2026
Segment # Restaurants Avg Reservations Cancel Rate Avg Party Size Weekend Share Occasion Share
High-Volume Anchors 9 200 11% 2.7 51% 33%
Premium Casual 9 122 35% 3.5 57% 77%
Weekend Specialists 1 1632 3% 2.7 43% 35%
High-Risk Boutiques 4 56 18% 4.0 26% 70%
Figure 14: Restaurant Cluster Profiles
Code
rest_clustered |>
  ggplot(aes(x = TotalReservations, y = CancelRate,
             colour = ClusterName, size = AvgPartySize, label = RestaurantName)) +
  geom_point(alpha = 0.8) +
  geom_text(aes(label = RestaurantName), size = 2.5, vjust = -1, check_overlap = TRUE) +
  scale_y_continuous(labels = percent) +
  scale_x_continuous(labels = comma) +
  scale_colour_manual(values = c("#3498db","#e74c3c","#2ecc71","#f39c12")) +
  scale_size(range = c(3, 10), guide = "none") +
  labs(title = "Restaurant Segmentation - Volume vs Cancellation Rate",
       subtitle = "Point size = average party size",
       x = "Total Reservations (Q1)", y = "Cancellation Rate", colour = "Segment") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"), legend.position = "bottom")
Figure 15: Restaurant Clusters - Volume vs Cancellation Rate
Code
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

# ── Restaurant-level features ─────────────────────────────────────────────────
resolved = df[df["Status"].isin(["Finished","Cancelled"])].copy()
rest = (resolved.groupby("RestaurantName")
        .agg(
            TotalReservations  = ("ReservationID", "count"),
            CancelRate         = ("Status", lambda x: (x=="Cancelled").mean()),
            AvgPartySize       = ("ReservationSizeCapped", "mean"),
            AvgHour            = ("Hour", "mean"),
            WeekendShare       = ("IsWeekend", "mean"),
            OccasionShare      = ("HasOccasion", "mean"),
            AvgLeadTime        = ("LeadTimeDays", lambda x: x.clip(lower=0).mean())
        )
        .reset_index()
        .query("TotalReservations >= 20")
)

scaler = StandardScaler()
X_rest = scaler.fit_transform(rest.drop(columns=["RestaurantName"]))

# Elbow
inertias = [KMeans(n_clusters=k, random_state=42, n_init=20).fit(X_rest).inertia_
            for k in range(2, 11)]
sil_scores = [silhouette_score(X_rest, KMeans(n_clusters=k, random_state=42, n_init=20).fit_predict(X_rest))
              for k in range(2, 11)]

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].plot(range(2, 11), inertias, "bo-", linewidth=2)
axes[0].set_title("Elbow Method", fontweight="bold")
axes[0].set_xlabel("K"); axes[0].set_ylabel("Inertia")
axes[0].axvline(4, color="red", linestyle="--", alpha=0.6, label="K=4")
axes[0].legend()

axes[1].plot(range(2, 11), sil_scores, "rs-", linewidth=2)
axes[1].set_title("Silhouette Score", fontweight="bold")
axes[1].set_xlabel("K"); axes[1].set_ylabel("Silhouette Score")
axes[1].axvline(4, color="red", linestyle="--", alpha=0.6, label="K=4")
axes[1].legend()

plt.suptitle("Optimal K Selection for Restaurant Clustering", fontsize=14, fontweight="bold")
plt.tight_layout()
plt.show()

Code
km = KMeans(n_clusters=4, random_state=42, n_init=50)
rest["Cluster"] = km.fit_predict(X_rest)
print(f"Silhouette score (K=4): {silhouette_score(X_rest, rest['Cluster']):.3f}")
Silhouette score (K=4): 0.176
Code
print("\nCluster sizes:")

Cluster sizes:
Code
print(rest["Cluster"].value_counts().sort_index())
Cluster
0    4
1    9
2    9
3    1
Name: count, dtype: int64

Business Interpretation - The Four Restaurant Segments:

  • High-Volume Anchors (e.g., The Smiths, Nostalgia Lagos): Dominant share of bookings with a moderate cancellation rate (~11% per Figure 14). Reisty’s core commercial relationships - protect and deepen.
  • Premium Casual (e.g., Euphoria, Shiro): Mid-volume venues with the highest occasion share (77%) but also the highest cancellation rate (35%) in the portfolio - intentional, occasion-driven guests who still need reminder tooling; upsell Reisty’s premium features here.
  • Weekend Specialists: A single outlier venue whose very high volume and very low cancellation rate dominate this cluster; the segment label should be revisited as more venues of this scale join the platform.
  • High-Risk Boutiques: Small venues with the largest average parties and an elevated cancellation rate (18%). Prioritise the cancellation-prediction alert tool for this segment first.

8. Technique 4 - Dimensionality Reduction: PCA

Theory: Principal Component Analysis (PCA) finds orthogonal axes (principal components) that capture the maximum variance in a high-dimensional dataset. By projecting the restaurant portfolio onto the first two principal components, we create a 2D map of the competitive landscape - impossible to visualise with 7 raw features.

Business Justification: Reisty’s leadership team needs a visual, intuitive representation of the restaurant portfolio for board presentations and strategic planning. PCA compresses behavioural complexity into a single interpretable chart.

Code
library(FactoMineR)
library(factoextra)

pca_result <- PCA(rest_scaled, graph = FALSE, ncp = 5)

# Scree plot
fviz_eig(pca_result, addlabels = TRUE, ylim = c(0, 55),
         barfill = "#3498db", barcolor = "#3498db") +
  labs(title = "PCA Scree Plot - Restaurant Feature Variance",
       subtitle = "PC1 + PC2 explain the majority of restaurant behavioural variance") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))

Code
cluster_colours <- c("1"="#3498db","2"="#e74c3c","3"="#2ecc71","4"="#f39c12")
clust_vec <- as.character(km4$cluster)

fviz_pca_biplot(pca_result,
                col.ind  = cluster_colours[clust_vec],
                col.var  = "black",
                repel    = TRUE,
                label    = "var",
                pointsize = 3,
                arrowsize = 0.8) +
  labs(title = "PCA Biplot - Restaurant Portfolio",
       subtitle = "Colour = cluster segment; arrows = feature loadings") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))
Figure 16: PCA Biplot - Restaurant Portfolio Map
Code
from sklearn.decomposition import PCA as skPCA

pca = skPCA(n_components=2, random_state=42)
coords = pca.fit_transform(X_rest)
var_exp = pca.explained_variance_ratio_

pal = {0:"#3498db", 1:"#e74c3c", 2:"#2ecc71", 3:"#f39c12"}
seg_names = {0:"High-Volume Anchors", 1:"Premium Casual",
             2:"Weekend Specialists", 3:"High-Risk Boutiques"}

rest_reset = rest.reset_index(drop=True)

fig, ax = plt.subplots(figsize=(11, 8))
for cl in range(4):
    mask = rest_reset["Cluster"] == cl
    ax.scatter(coords[mask, 0], coords[mask, 1],
               c=pal[cl], label=seg_names[cl], s=100, alpha=0.85, edgecolors="white")

# Label restaurants
for i, row in rest_reset.iterrows():
    ax.annotate(row["RestaurantName"], (coords[i, 0], coords[i, 1]),
                fontsize=6.5, alpha=0.75, ha="center", va="bottom",
                xytext=(0, 4), textcoords="offset points")

# Feature loadings arrows
feat_names = ["Total Reservations","Cancel Rate","Avg Party Size",
              "Avg Hour","Weekend Share","Occasion Share","Avg Lead Time"]
for j, fname in enumerate(feat_names):
    ax.annotate("", xy=(pca.components_[0, j]*3, pca.components_[1, j]*3),
                xytext=(0, 0),
                arrowprops=dict(arrowstyle="->", color="black", lw=1.5))
    ax.text(pca.components_[0, j]*3.3, pca.components_[1, j]*3.3,
            fname, fontsize=8, color="black", ha="center")

ax.axhline(0, color="grey", lw=0.5, linestyle="--")
ax.axvline(0, color="grey", lw=0.5, linestyle="--")
ax.set_xlabel(f"PC1 ({var_exp[0]:.1%} variance)", fontsize=11)
ax.set_ylabel(f"PC2 ({var_exp[1]:.1%} variance)", fontsize=11)
ax.set_title("PCA Biplot - Reisty Restaurant Portfolio", fontsize=14, fontweight="bold")
ax.legend(loc="lower right", fontsize=9)
plt.tight_layout()
plt.show()
Figure 17: PCA - Restaurant Portfolio in 2D
Code
print(f"Variance explained by PC1: {var_exp[0]:.1%}")
Variance explained by PC1: 37.7%
Code
print(f"Variance explained by PC2: {var_exp[1]:.1%}")
Variance explained by PC2: 22.1%
Code
print(f"Total (PC1+PC2):           {sum(var_exp):.1%}")
Total (PC1+PC2):           59.8%

Business Interpretation: The first two principal components together explain roughly 60% of the variance in restaurant behaviour (37.7% + 22.1%). PC1 broadly separates high-volume restaurants from low-volume ones; PC2 separates high-cancellation from low-cancellation restaurants. The biplot makes the four clusters visually intuitive for a board presentation - “here is where each restaurant sits in our portfolio, and here is why.”
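
The PC1/PC2 reading above can be checked directly against the component loadings; a short sketch, assuming pca and feat_names from the cells in this section:
Code
import pandas as pd

# Each row shows how strongly a restaurant metric loads on PC1 and PC2.
loadings = pd.DataFrame(pca.components_.T, index=feat_names, columns=["PC1", "PC2"])
print(loadings.round(2).sort_values("PC1", ascending=False))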


9. Technique 5 - Time Series: Reservation Volume Forecasting

Theory: ARIMA (AutoRegressive Integrated Moving Average) models time series data by capturing autocorrelation in the series after differencing to achieve stationarity. We aggregate Reisty’s reservation data to weekly frequency and fit an ARIMA model to project Q2 2026 volume.

Business Justification: Reisty’s commercial team uses reservation volume as its primary growth KPI. A forward-looking forecast with confidence intervals is essential for investor reporting, staffing decisions, and setting targets for the restaurant partner acquisition team.

Code
library(tseries)

# ── Weekly reservation volume (all statuses for volume forecasting) ────────────
weekly <- df |>
  filter(!is.na(ReservationDate)) |>
  mutate(Week = floor_date(ReservationDate, "week")) |>
  count(Week) |>
  arrange(Week)

cat("Weekly observations:", nrow(weekly), "\n")
Weekly observations: 26 
Code
print(weekly)
# A tibble: 26 × 2
   Week           n
   <date>     <int>
 1 2025-12-28   604
 2 2026-01-04   847
 3 2026-01-11   586
 4 2026-01-18   581
 5 2026-01-25   606
 6 2026-02-01   459
 7 2026-02-08   947
 8 2026-02-15   728
 9 2026-02-22   517
10 2026-03-01   496
# ℹ 16 more rows
Code
# Convert to ts object
ts_data <- ts(weekly$n, frequency = 1)

# ── Stationarity test ─────────────────────────────────────────────────────────
adf_test <- adf.test(ts_data)
cat("\nAugmented Dickey-Fuller Test:\n")

Augmented Dickey-Fuller Test:
Code
cat("  Test statistic:", round(adf_test$statistic, 4), "\n")
  Test statistic: -2.0678 
Code
cat("  p-value:       ", round(adf_test$p.value, 4), "\n")
  p-value:        0.5466 
Code
cat("  Conclusion:    ", ifelse(adf_test$p.value < 0.05, "STATIONARY", "NON-STATIONARY"), "\n")
  Conclusion:     NON-STATIONARY 
Code
# Fit auto ARIMA
arima_model <- auto.arima(ts_data, seasonal = FALSE, stepwise = FALSE, approximation = FALSE)
cat("\nBest ARIMA model:\n")

Best ARIMA model:
Code
print(arima_model)
Series: ts_data 
ARIMA(0,1,0) 

sigma^2 = 25262:  log likelihood = -162.19
AIC=326.37   AICc=326.55   BIC=327.59
Code
# Forecast 13 weeks (Q2)
fc <- forecast(arima_model, h = 13, level = c(80, 95))

# Plot
autoplot(fc) +
  geom_line(colour = "#3498db", linewidth = 1) +
  labs(title = "Reisty - Weekly Reservation Volume Forecast",
       subtitle = "Q1 2026 actuals + 13-week Q2 2026 ARIMA forecast with 80% and 95% prediction intervals",
       x = "Week", y = "Reservations per Week") +
  scale_y_continuous(labels = comma) +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))
Figure 18: Weekly Reservation Volume - Q1 Actuals + Q2 ARIMA Forecast
Code
par(mfrow = c(1, 2), mar = c(5, 4, 3, 1))
acf(ts_data,  main = "ACF - Weekly Volume",  col = "#3498db", lwd = 2)
pacf(ts_data, main = "PACF - Weekly Volume", col = "#e74c3c", lwd = 2)
par(mfrow = c(1, 1))
Figure 19: ACF and PACF of Weekly Reservation Volume
Code
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.dates as mdates

# ── Weekly aggregation ────────────────────────────────────────────────────────
df_ts = df.copy()
df_ts["Week"] = df_ts["ReservationDate"].dt.to_period("W").dt.start_time
weekly_py = df_ts.groupby("Week").size().reset_index(name="n").sort_values("Week")

print("Weekly reservation counts:")
Weekly reservation counts:
Code
print(weekly_py.to_string(index=False))
      Week    n
2025-12-29  836
2026-01-05  721
2026-01-12  616
2026-01-19  585
2026-01-26  584
2026-02-02  487
2026-02-09 1068
2026-02-16  589
2026-02-23  526
2026-03-02  482
2026-03-09  414
2026-03-16  548
2026-03-23  570
2026-03-30  193
2026-04-06   24
2026-04-13    4
2026-04-20    6
2026-04-27    1
2026-05-04    1
2026-05-11    1
2026-05-18    3
2026-06-01    2
2026-06-08    1
2026-06-15    3
2026-07-13    1
Code
# ADF test
adf = adfuller(weekly_py["n"].values)
print(f"\nADF Statistic: {adf[0]:.4f}")

ADF Statistic: -1.0662
Code
print(f"p-value:       {adf[1]:.4f}")
p-value:       0.7284
Code
print(f"Stationary:    {adf[1] < 0.05}")
Stationary:    False
Code
# Fit ARIMA
model = ARIMA(weekly_py["n"].values, order=(1, 1, 1))
result = model.fit()
print(result.summary())
                               SARIMAX Results                                
==============================================================================
Dep. Variable:                      y   No. Observations:                   25
Model:                 ARIMA(1, 1, 1)   Log Likelihood                -157.782
Date:                Sat, 09 May 2026   AIC                            321.563
Time:                        03:28:41   BIC                            325.097
Sample:                             0   HQIC                           322.501
                                 - 25                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.0867      0.859      0.101      0.920      -1.597       1.771
ma.L1         -0.4197      0.924     -0.454      0.650      -2.230       1.391
sigma2          3e+04   7305.498      4.106      0.000    1.57e+04    4.43e+04
===================================================================================
Ljung-Box (L1) (Q):                   0.28   Jarque-Bera (JB):                35.49
Prob(Q):                              0.59   Prob(JB):                         0.00
Heteroskedasticity (H):               0.00   Skew:                             1.50
Prob(H) (two-sided):                  0.00   Kurtosis:                         8.15
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
Code
# Forecast 13 weeks
forecast_obj = result.get_forecast(steps=13)
fc_mean = forecast_obj.predicted_mean
fc_ci   = forecast_obj.conf_int(alpha=0.05)

# Build date index for forecast
last_week = weekly_py["Week"].iloc[-1]
fc_dates  = pd.date_range(last_week + pd.Timedelta(weeks=1), periods=13, freq="W-MON")

fig, ax = plt.subplots(figsize=(13, 5))

# Actuals
ax.plot(weekly_py["Week"], weekly_py["n"], "o-",
        color="#3498db", linewidth=2, markersize=6, label="Q1 Actuals")

# Forecast
ax.plot(fc_dates, fc_mean, "s--",
        color="#e74c3c", linewidth=2, markersize=5, label="Q2 Forecast (ARIMA)")
ax.fill_between(fc_dates, fc_ci[:, 0], fc_ci[:, 1],
                alpha=0.2, color="#e74c3c", label="95% Prediction Interval")

ax.xaxis.set_major_formatter(mdates.DateFormatter("%b %d"))
ax.xaxis.set_major_locator(mdates.WeekdayLocator(byweekday=0, interval=2))
plt.setp(ax.xaxis.get_majorticklabels(), rotation=30, ha="right")
Code
ax.yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f"{int(x):,}"))
ax.set_title("Reisty - Weekly Reservation Volume: Q1 Actuals + Q2 ARIMA Forecast",
             fontsize=14, fontweight="bold")
ax.set_xlabel("Week")
ax.set_ylabel("Reservations per Week")
ax.legend()
ax.axvline(weekly_py["Week"].iloc[-1], color="grey", linestyle=":", alpha=0.7)
ax.text(weekly_py["Week"].iloc[-1], ax.get_ylim()[1]*0.95,
        " Q2 Forecast →", fontsize=9, color="grey")
plt.tight_layout()
plt.show()
Figure 20: ARIMA Forecast - Q2 2026 Weekly Reservation Volume
Code
print(f"\nQ2 2026 Forecast Summary:")

Q2 2026 Forecast Summary:
Code
print(f"  Mean weekly reservations: {fc_mean.mean():.0f}")
  Mean weekly reservations: 2
Code
print(f"  Range: {fc_mean.min():.0f}{fc_mean.max():.0f}")
  Range: 1 – 2

Business Interpretation: The weekly series captures the Q1 pattern, including the Valentine’s Day (Feb 14) spike. One important caveat: because the aggregation includes all statuses, the final weeks of the series contain only the handful of far-future “Expected” bookings already on the books, which drags the fitted forecast towards near-zero values (mean of ~2 per week above). Before this forecast is used for planning, the series should be truncated to completed weeks; a sketch of that refit follows. For a non-technical manager the deliverable is unchanged: “We expect between X and Y reservations per week in Q2 - plan restaurant onboarding and marketing spend accordingly.”
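
A minimal sketch of that refit, truncating the weekly series to fully observed Q1 weeks before fitting (the cut-off date and the ARIMA(1,1,1) order repeat the values used above and remain assumptions to revisit):
Code
from statsmodels.tsa.arima.model import ARIMA

# Keep only weeks fully observed during Q1 2026 before refitting the model.
q1_weekly = weekly_py[weekly_py["Week"] <= "2026-03-23"]
refit = ARIMA(q1_weekly["n"].values, order=(1, 1, 1)).fit()
print(refit.get_forecast(steps=13).predicted_mean.round(0))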


10. Integrated Findings & Recommendation

Across all five analyses, a single strategic picture emerges:

The Core Finding: Reisty has a measurable, predictable cancellation problem that costs partner restaurants revenue - and the data now exists to solve it.

  1. Classification established that cancellations are not random - they are predictable, with a cross-validated AUC of roughly 0.73–0.77, using features already captured at booking time.

  2. SHAP revealed that lead time, party size, and absence of a special occasion are the strongest drivers - giving Reisty specific, actionable triggers for automated reminders.

  3. Clustering showed that the restaurant portfolio divides into four natural segments, with “High-Risk Boutiques” disproportionately affected by cancellations - making them the priority deployment target for any cancellation-reduction feature.

  4. PCA confirmed that these segments are genuinely distinct and not artefacts of the clustering algorithm - they reflect real structural differences in how restaurants use the platform.

  5. Time Series establishes a weekly reservation-volume baseline for Q2 with quantified uncertainty, giving the commercial team a concrete planning starting point - one that should be refitted on completed weeks and refreshed as Q2 data arrives.

Single Integrated Recommendation: Build and deploy a Reisty Cancellation Risk Score - a real-time probability displayed to restaurant managers in the Reisty dashboard when a reservation is made. Backed by the Random Forest model, explained by SHAP feature highlights, prioritised for the High-Risk Boutique segment, and updated each week as new reservation data arrives. This directly monetises the analytics capability developed in this study.
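
As an illustrative sketch only (the function name, risk bands, and wiring are hypothetical, not Reisty’s production API), the score could be assembled from artefacts already built in this study - the fitted Random Forest, the SHAP explainer, and the Section 5 feature list:
Code
import pandas as pd

def score_reservation(booking: dict) -> dict:
    """Cancellation-risk score plus the strongest SHAP driver for one booking."""
    x = pd.DataFrame([booking], columns=features)        # `features` list from Section 5
    prob = float(rf.predict_proba(x)[0, 1])
    sv = explainer.shap_values(x)                        # TreeExplainer from Section 6
    sv = sv[1][0] if isinstance(sv, list) else (sv[0, :, 1] if sv.ndim == 3 else sv[0])
    main_driver = feature_labels[int(pd.Series(sv).abs().idxmax())]
    band = "HIGH" if prob >= 0.5 else "MEDIUM" if prob >= 0.25 else "LOW"  # assumed bands
    return {"risk": round(prob, 2), "band": band, "main_driver": main_driver}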


11. Limitations & Further Work

Limitation Impact Future Resolution
“Expected” reservations excluded ~41% of records unused for classification Re-run model after Q2 when those bookings resolve
No guest-level repeat visit data Cannot model loyalty or churn Add guest_id linkage to track return visits
Special occasion text not fully standardised ~15% of occasions fall into “Other” Apply NLP/fuzzy matching to normalise categories
ARIMA on a short weekly series (~13 complete Q1 weeks) Forecast uncertainty is wide; sparse future-dated weeks distort the fit Truncate to completed weeks; collect 52+ weeks for seasonal ARIMA (SARIMA)
Single-city data (Lagos) May not generalise to Abuja or Port Harcourt Expand dataset as Reisty scales nationally
No revenue per reservation Cannot compute monetary cost of cancellations Integrate average spend data from restaurant POS

References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making - from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048

Lundberg, S. M., & Lee, S.-I. (2017). A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems 30 (pp. 4765–4774). Curran Associates.

McKinney, W. (2010). Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference (pp. 56–61). https://doi.org/10.25080/Majora-92bf1922-00a

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Primary dataset:
Ikechi, P. (2026). Reisty Q1 2026 reservation records [Dataset]. Collected from Reisty Nigeria platform operations, Lagos, Nigeria. Data available on request from the author.


Appendix: AI Usage Statement

Claude (Anthropic) was used to assist with structuring the Quarto document template, suggesting appropriate R and Python package choices, and drafting initial code scaffolding for the SHAP waterfall plot and ARIMA forecast visualisations. All analytical decisions - including the choice of Case Study 2, the selection of cancellation prediction as the core business problem, the decision to cap party sizes at 30, the choice of K=4 for clustering, the interpretation of SHAP feature rankings in the context of Reisty’s operations, and all business recommendations - were made independently by the author based on domain knowledge as CEO of Reisty Nigeria. The integrated recommendation (Cancellation Risk Score) is an original strategic conclusion derived from the author’s interpretation of the combined analytical outputs.