Predicting Reservation Cancellations & Segmenting Restaurant Partners: A Predictive Analytics Study of Reisty Nigeria Q1 2026

Author

Paul Ikechi

Published

May 9, 2026


1. Executive Summary

Reisty is Nigeria’s leading restaurant guest management platform - the OpenTable of Africa - connecting diners with premium restaurant experiences across Lagos. As of Q1 2026, Reisty processes over 2,000 reservations per month across 46 partner restaurants.

This study applies five predictive and segmentation techniques to 8,268 reservation records from January through March 2026. The central business problem is reservation cancellation: when a guest cancels, restaurants lose revenue and operational efficiency. The goal of this analysis is to predict which reservations are most likely to be cancelled, explain the drivers of that behaviour, segment partner restaurants by booking patterns, visualise the restaurant landscape through dimensionality reduction, and forecast reservation volumes to support capacity planning.

Key findings include: (1) the cancellation rate is approximately 16% among resolved reservations; (2) booking lead time, party size, time of day, and occasion type are the strongest predictors of cancellation; (3) restaurants cluster into four operationally distinct segments; and (4) weekly reservation volumes dipped in March, and the ARIMA forecast provides a Q2 2026 planning baseline with wide prediction intervals. The integrated recommendation is for Reisty to deploy a real-time cancellation-risk score within its restaurant dashboard, enabling proactive outreach to high-risk bookings.


2. Professional Disclosure

Job Title: Chief Executive Officer, Reisty Nigeria
Organisation Type: B2B SaaS / hospitality technology platform
Sector: Restaurant technology, Nigeria / West Africa

Technique Justifications

Classification (Cancellation Prediction): As CEO, a top operational concern is reservation reliability. Cancellations cost partner restaurants revenue and erode trust in the platform. A classification model that flags high-risk reservations at the point of booking gives restaurants time to send reminders, overbook strategically, or offer incentives - directly improving the value Reisty delivers to its partners.

Model Explainability (SHAP): Predicting cancellations is only useful if restaurant managers understand why a booking is risky. SHAP values translate the model’s logic into plain language: “this reservation is high-risk because it is a large party booked for a Friday night with no special occasion.” This makes the insight actionable for non-technical restaurant staff.

Clustering (Restaurant Segmentation): Reisty’s 46 partner restaurants are not homogeneous. Some are high-volume casual venues; others are low-volume premium experiences. Clustering reveals these natural groupings so that Reisty can tailor its product features, pricing, and support by segment rather than applying a one-size-fits-all approach.

Dimensionality Reduction (PCA): With multiple behavioural metrics per restaurant, it is difficult to visualise the segmentation landscape in human-readable form. PCA compresses the feature space into two dimensions, producing a map of the restaurant portfolio - essential for board-level communication of strategic positioning.

Time Series (ARIMA Forecast): Reisty’s commercial team needs weekly reservation volume forecasts for sales planning, staffing, and investor reporting. An ARIMA model trained on Q1 data provides a Q2 baseline forecast with prediction intervals that quantify uncertainty.


3. Data Collection & Sampling

Source & Collection Method

The dataset comprises all reservation records created on the Reisty platform between 1 January 2026 and 31 March 2026. Data were extracted directly from Reisty’s production PostgreSQL database by the author in their capacity as CEO. The extract covers all 46 active restaurant partners onboarded as of 31 December 2025.

Variables

Variable Type Description
ReservationID String (ID) Unique reservation identifier
FirstName / LastName String Guest name (anonymised in published output)
ReservationSize Integer Number of guests in the booking
ReservationDate Date Date of the reservation
ReservationTime Time Scheduled dining time
ReservationCreatedAt Datetime Timestamp when booking was made
MonthCreated String Month of booking creation
Status Categorical Finished / Cancelled / Expected
SpecialOccasion String Self-reported occasion (Birthday, Date, etc.)
SpecialRequest String Free-text guest request
RestaurantName String Partner restaurant

Sampling Frame & Size

  • Population: All reservations on Reisty platform, Q1 2026
  • Sample: Full census - 8,268 records (no sampling; full extraction)
  • Analysis subset: 4,852 resolved reservations (Finished + Cancelled), after removing 3,414 “Expected” (future/unresolved) bookings and two records missing critical fields
  • Time period: 1 January 2026 – 31 March 2026

Ethical Notes

All guest names have been generalised or dropped before analysis. No personally identifiable information (names, contact details) is published in this document. Data use is authorised under Reisty’s Terms of Service, which grants the platform operational analytics rights over anonymised booking data. No external ethical approval was required for internal operational analytics.
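
For transparency, a minimal sketch of the kind of anonymisation step applied before analysis (column names follow the variables table above; the salted hash shown here is illustrative, not Reisty's production pipeline):
Code
import hashlib
import pandas as pd

def anonymise(bookings: pd.DataFrame) -> pd.DataFrame:
    """Drop direct identifiers, keeping only an opaque per-guest key."""
    out = bookings.copy()
    full_name = (out["FirstName"].fillna("") + " " + out["LastName"].fillna("")).str.strip()
    # Illustrative salted hash; a production pipeline would use a managed secret.
    out["GuestKey"] = [hashlib.sha256(f"demo-salt::{name}".encode()).hexdigest()[:12]
                       for name in full_name]
    return out.drop(columns=["FirstName", "LastName"])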


4. Data Description & EDA

Code
library(tidyverse)
library(lubridate)
library(skimr)
library(corrplot)
library(ggcorrplot)
library(caret)
library(randomForest)
library(pROC)
library(cluster)
library(factoextra)
library(forecast)
library(knitr)
library(kableExtra)
library(scales)
library(RColorBrewer)

# ── Load data ──────────────────────────────────────────────────────────────────
df_raw <- read_csv("2026_Q1_resevations.csv", show_col_types = FALSE)

# ── Clean ──────────────────────────────────────────────────────────────────────
df <- df_raw |>
  # Drop rows missing critical fields (the empty trailing column is not used downstream)
  filter(!is.na(Status), !is.na(ReservationDate), !is.na(ReservationSize)) |>
  
  # Parse dates and times
  mutate(
    ReservationDate     = mdy(ReservationDate),
    ReservationCreatedAt = mdy_hm(ReservationCreatedAt),
    ReservationTime     = hms(ReservationTime),
    
    # Derived features
    DayOfWeek    = wday(ReservationDate, label = TRUE, abbr = TRUE),
    IsWeekend    = DayOfWeek %in% c("Sat", "Sun"),
    Hour         = hour(ReservationTime),
    LeadTimeDays = as.numeric(ReservationDate - as.Date(ReservationCreatedAt)),
    HasOccasion  = !is.na(SpecialOccasion),
    HasRequest   = !is.na(SpecialRequest),
    PartyBucket  = case_when(
      ReservationSize == 1 ~ "Solo",
      ReservationSize == 2 ~ "Couple",
      ReservationSize <= 4 ~ "Small Group",
      ReservationSize <= 8 ~ "Large Group",
      TRUE                  ~ "Event"
    ),
    
    # Normalise occasion labels
    OccasionClean = case_when(
      str_detect(tolower(SpecialOccasion), "birthday")    ~ "Birthday",
      str_detect(tolower(SpecialOccasion), "annivers")    ~ "Anniversary",
      str_detect(tolower(SpecialOccasion), "date|dinner date|date night") ~ "Date",
      str_detect(tolower(SpecialOccasion), "hang|fun|get together") ~ "Hangout",
      str_detect(tolower(SpecialOccasion), "business")    ~ "Business",
      str_detect(tolower(SpecialOccasion), "graduat")     ~ "Graduation",
      str_detect(tolower(SpecialOccasion), "propos")      ~ "Proposal",
      str_detect(tolower(SpecialOccasion), "valentin")    ~ "Valentine",
      !is.na(SpecialOccasion)                             ~ "Other",
      TRUE                                                ~ "None"
    )
  ) |>
  
  # Cap outlier party sizes at 30 for modelling
  mutate(ReservationSizeCapped = pmin(ReservationSize, 30))

cat("Cleaned dataset dimensions:", nrow(df), "rows x", ncol(df), "columns\n")
Cleaned dataset dimensions: 8266 rows x 22 columns
Code
cat("Status distribution:\n")
Status distribution:
Code
print(table(df$Status))

Cancelled  Expected  Finished 
      792      3414      4060 
Code
# Quick skim of key numeric variables
df |>
  select(ReservationSize, LeadTimeDays, Hour, IsWeekend, HasOccasion) |>
  skim() |>
  kable(caption = "Summary Statistics - Key Variables") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)
Summary Statistics - Key Variables
skim_type skim_variable n_missing complete_rate logical.mean logical.count numeric.mean numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100 numeric.hist
logical IsWeekend 0 1 0.4692717 FAL: 4387, TRU: 3879 NA NA NA NA NA NA NA NA
logical HasOccasion 0 1 0.5244375 TRU: 4335, FAL: 3931 NA NA NA NA NA NA NA NA
numeric ReservationSize 0 1 NA NA 2.970603 3.151736 1 2 2 3 150 ▇▁▁▁▁
numeric LeadTimeDays 0 1 NA NA 1.323252 6.342611 -58 0 0 1 159 ▁▇▁▁▁
numeric Hour 0 1 NA NA 18.006896 3.712613 0 16 19 21 23 ▁▁▁▅▇
Code
df |>
  count(Status) |>
  mutate(pct = n / sum(n),
         label = paste0(comma(n), "\n(", percent(pct, 1), ")")) |>
  ggplot(aes(x = reorder(Status, -n), y = n, fill = Status)) +
  geom_col(width = 0.6, show.legend = FALSE) +
  geom_text(aes(label = label), vjust = -0.4, size = 3.5, fontface = "bold") +
  scale_fill_manual(values = c("Finished" = "#2ecc71", "Cancelled" = "#e74c3c", "Expected" = "#3498db")) +
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.15))) +
  labs(title = "Reservation Status Distribution",
       subtitle = "Q1 2026 - All 8,268 reservations",
       x = NULL, y = "Count") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))
Figure 1: Reservation Status Distribution - Q1 2026
Code
df |>
  filter(Status != "Expected") |>
  count(ReservationDate, Status) |>
  ggplot(aes(x = ReservationDate, y = n, colour = Status)) +
  geom_line(linewidth = 0.8, alpha = 0.9) +
  geom_smooth(se = FALSE, linewidth = 0.4, linetype = "dashed") +
  scale_colour_manual(values = c("Finished" = "#2ecc71", "Cancelled" = "#e74c3c")) +
  scale_x_date(date_labels = "%b %d", date_breaks = "2 weeks") +
  scale_y_continuous(labels = comma) +
  labs(title = "Daily Reservation Volume by Status",
       subtitle = "Valentine's Day (Feb 14) spike clearly visible",
       x = NULL, y = "Reservations per Day", colour = "Status") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"), legend.position = "top")
Figure 2: Daily Reservation Volume - Q1 2026
Code
df |>
  filter(ReservationSizeCapped <= 20) |>
  ggplot(aes(x = ReservationSizeCapped, fill = Status)) +
  geom_histogram(binwidth = 1, position = "dodge", alpha = 0.85) +
  scale_fill_manual(values = c("Finished" = "#2ecc71", "Cancelled" = "#e74c3c", "Expected" = "#3498db")) +
  scale_y_continuous(labels = comma) +
  labs(title = "Party Size Distribution by Status",
       subtitle = "Couples (size 2) dominate; large parties show higher cancellation rates",
       x = "Party Size", y = "Count", fill = "Status") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))
Figure 3: Party Size Distribution (parties of 20 or fewer shown)
Code
df |>
  filter(!is.na(RestaurantName)) |>
  count(RestaurantName) |>
  slice_max(n, n = 15) |>
  ggplot(aes(x = reorder(RestaurantName, n), y = n)) +
  geom_col(fill = "#3498db", alpha = 0.85) +
  geom_text(aes(label = comma(n)), hjust = -0.1, size = 3.2) +
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.15))) +
  coord_flip() +
  labs(title = "Top 15 Restaurants by Reservation Volume",
       subtitle = "The Smiths and Nostalgia Lagos are the platform's anchor restaurants",
       x = NULL, y = "Total Reservations") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))
Figure 4: Top 15 Restaurants by Reservation Volume
Code
df |>
  filter(Status %in% c("Finished", "Cancelled")) |>
  group_by(OccasionClean) |>
  summarise(
    total      = n(),
    cancelled  = sum(Status == "Cancelled"),
    cancel_rate = cancelled / total
  ) |>
  filter(total >= 20) |>
  ggplot(aes(x = reorder(OccasionClean, cancel_rate), y = cancel_rate, fill = cancel_rate)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = percent(cancel_rate, 1)), hjust = -0.1, size = 3.5) +
  scale_fill_gradient(low = "#2ecc71", high = "#e74c3c") +
  scale_y_continuous(labels = percent, expand = expansion(mult = c(0, 0.2))) +
  coord_flip() +
  labs(title = "Cancellation Rate by Special Occasion",
       subtitle = "Business and Hangout reservations cancel most; Proposals rarely cancel",
       x = NULL, y = "Cancellation Rate") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))
Figure 5: Cancellation Rate by Occasion Type
Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import seaborn as sns
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# ── Style ──────────────────────────────────────────────────────────────────────
sns.set_theme(style="whitegrid", palette="muted", font_scale=1.1)
COLORS = {"Finished": "#2ecc71", "Cancelled": "#e74c3c", "Expected": "#3498db"}

# ── Load ───────────────────────────────────────────────────────────────────────
df = pd.read_csv("2026_Q1_resevations.csv")
df = df.drop(columns=["Unnamed: 12"], errors="ignore")
df = df.dropna(subset=["Status", "ReservationDate", "ReservationSize"])

# ── Parse dates ────────────────────────────────────────────────────────────────
df["ReservationDate"]     = pd.to_datetime(df["ReservationDate"])
df["ReservationCreatedAt"] = pd.to_datetime(df["ReservationCreatedAt"])
df["ReservationTime"]     = pd.to_datetime(df["ReservationTime"], format="%H:%M:%S", errors="coerce")

# ── Feature engineering ────────────────────────────────────────────────────────
df["DayOfWeek"]    = df["ReservationDate"].dt.day_name()
df["IsWeekend"]    = df["DayOfWeek"].isin(["Saturday", "Sunday"]).astype(int)
df["Hour"]         = df["ReservationTime"].dt.hour
df["LeadTimeDays"] = (df["ReservationDate"] - df["ReservationCreatedAt"].dt.normalize()).dt.days
df["HasOccasion"]  = df["SpecialOccasion"].notna().astype(int)
df["HasRequest"]   = df["SpecialRequest"].notna().astype(int)
df["ReservationSizeCapped"] = df["ReservationSize"].clip(upper=30)

def clean_occasion(x):
    if pd.isna(x): return "None"
    x = x.lower().strip()
    if "birthday"   in x: return "Birthday"
    if "annivers"   in x: return "Anniversary"
    if "date" in x or "dinner date" in x: return "Date"
    if "hang" in x or "fun" in x or "get together" in x: return "Hangout"
    if "business"   in x: return "Business"
    if "graduat"    in x: return "Graduation"
    if "propos"     in x: return "Proposal"
    if "valentin"   in x: return "Valentine"
    return "Other"

df["OccasionClean"] = df["SpecialOccasion"].apply(clean_occasion)

print(f"Dataset: {df.shape[0]:,} rows × {df.shape[1]} columns")
Dataset: 8,266 rows × 20 columns
Code
print("\nStatus distribution:")

Status distribution:
Code
print(df["Status"].value_counts())
Status
Finished     4060
Expected     3414
Cancelled     792
Name: count, dtype: int64
Code
print(f"\nMissing values summary:")

Missing values summary:
Code
print(df.isnull().sum()[df.isnull().sum() > 0])
LastName           2404
SpecialOccasion    5412
SpecialRequest     7308
RestaurantName        1
dtype: int64
Code
resolved = df[df["Status"].isin(["Finished", "Cancelled"])].copy()

pivot = resolved.groupby(["DayOfWeek", "Hour"]).size().unstack(fill_value=0)
day_order = ["Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"]
pivot = pivot.reindex([d for d in day_order if d in pivot.index])

fig, ax = plt.subplots(figsize=(13, 5))
sns.heatmap(pivot, cmap="YlOrRd", linewidths=0.3, ax=ax,
            cbar_kws={"label": "Reservation Count"})
ax.set_title("Reservation Volume Heatmap - Hour × Day of Week", fontsize=14, fontweight="bold", pad=12)
ax.set_xlabel("Hour of Day")
ax.set_ylabel("Day of Week")
plt.tight_layout()
plt.show()
Figure 6: Hourly Reservation Heatmap by Day of Week
Code
cancel_by_hour = resolved.groupby("Hour")["Status"].apply(
    lambda x: (x == "Cancelled").sum() / len(x)
).reset_index()
cancel_by_hour.columns = ["Hour", "CancelRate"]

fig, ax = plt.subplots(figsize=(11, 5))
bars = ax.bar(cancel_by_hour["Hour"], cancel_by_hour["CancelRate"],
              color=["#e74c3c" if r > 0.18 else "#3498db" for r in cancel_by_hour["CancelRate"]],
              alpha=0.85, edgecolor="white")
ax.yaxis.set_major_formatter(mticker.PercentFormatter(xmax=1))
ax.set_title("Cancellation Rate by Reservation Hour", fontsize=14, fontweight="bold")
ax.set_xlabel("Hour of Day (24h)")
ax.set_ylabel("Cancellation Rate")
ax.axhline(cancel_by_hour["CancelRate"].mean(), color="black", linestyle="--", linewidth=1.2,
           label=f"Mean: {cancel_by_hour['CancelRate'].mean():.1%}")
ax.legend()
plt.tight_layout()
plt.show()
Figure 7: Cancellation Rate by Hour of Day

Data Quality Issues Identified & Handled

Issue Variable Severity Resolution
3,414 “Expected” records (unresolved) Status Medium Excluded from classification modelling; used only for time-series
65% missing SpecialOccasion SpecialOccasion Low Treated as “None” category - absence is informative
Extreme party sizes (max 150) ReservationSize Low Capped at 30 for modelling; one probable data-entry error (150 guests)
2,404 missing last names LastName None Variable not used in any model

5. Technique 1 - Classification: Predicting Reservation Cancellations

Theory: Classification is a supervised learning task where a model learns to assign observations to discrete categories. Here the target is binary: will this reservation be cancelled (1) or completed (0)? We compare Logistic Regression (interpretable baseline) and Random Forest (ensemble method) and select the best performer by AUC.

Business Justification: If Reisty can predict cancellations at the point of booking, partner restaurants can send automated reminders, enable waitlists, or adjust staffing levels - reducing lost revenue and improving the platform’s perceived value.

Code
set.seed(42)

# ── Modelling dataset ─────────────────────────────────────────────────────────
model_df <- df |>
  filter(Status %in% c("Finished", "Cancelled")) |>
  mutate(
    Cancelled = as.factor(ifelse(Status == "Cancelled", 1, 0)),
    DayOfWeekN = as.numeric(wday(ReservationDate)),
    OccasionNum = as.numeric(as.factor(OccasionClean))
  ) |>
  select(Cancelled, ReservationSizeCapped, Hour, DayOfWeekN,
         IsWeekend, LeadTimeDays, HasOccasion, HasRequest, OccasionNum) |>
  drop_na()

cat("Modelling dataset:", nrow(model_df), "rows\n")
Modelling dataset: 4852 rows
Code
cat("Class balance:\n"); print(table(model_df$Cancelled))
Class balance:

   0    1 
4060  792 
Code
# ── Train/test split ──────────────────────────────────────────────────────────
train_idx <- createDataPartition(model_df$Cancelled, p = 0.8, list = FALSE)
train_df  <- model_df[train_idx, ]
test_df   <- model_df[-train_idx, ]

# ── Logistic Regression ───────────────────────────────────────────────────────
ctrl <- trainControl(method = "cv", number = 5, classProbs = TRUE,
                     summaryFunction = twoClassSummary, savePredictions = TRUE)

levels(train_df$Cancelled) <- c("No", "Yes")
levels(test_df$Cancelled)  <- c("No", "Yes")

lr_model <- train(Cancelled ~ ., data = train_df, method = "glm",
                  family = "binomial", trControl = ctrl, metric = "ROC")

# ── Random Forest ────────────────────────────────────────────────────────────
rf_model <- train(Cancelled ~ ., data = train_df, method = "rf",
                  trControl = ctrl, metric = "ROC",
                  tuneGrid = data.frame(mtry = c(2, 3, 4)))

cat("\nLogistic Regression CV AUC:", round(max(lr_model$results$ROC), 4), "\n")

Logistic Regression CV AUC: 0.7339 
Code
cat("Random Forest CV AUC:       ", round(max(rf_model$results$ROC), 4), "\n")
Random Forest CV AUC:        0.7558 
Code
lr_probs <- predict(lr_model, test_df, type = "prob")[["Yes"]]
rf_probs <- predict(rf_model, test_df, type = "prob")[["Yes"]]

lr_roc <- roc(test_df$Cancelled, lr_probs, quiet = TRUE)
rf_roc <- roc(test_df$Cancelled, rf_probs, quiet = TRUE)

par(mar = c(5, 5, 4, 2))
plot(lr_roc, col = "#3498db", lwd = 2.5, main = "ROC Curves: Logistic vs Random Forest")
lines(rf_roc, col = "#e74c3c", lwd = 2.5)
legend("bottomright",
       legend = c(paste0("Logistic Regression (AUC = ", round(auc(lr_roc), 3), ")"),
                  paste0("Random Forest (AUC = ",       round(auc(rf_roc), 3), ")")),
       col = c("#3498db", "#e74c3c"), lwd = 2.5, bty = "n")
abline(a = 0, b = 1, lty = 2, col = "grey60")
Figure 8: ROC Curves - Logistic Regression vs Random Forest
Code
# Confusion matrix for Random Forest (better model)
rf_pred <- predict(rf_model, test_df)
cm <- confusionMatrix(rf_pred, test_df$Cancelled, positive = "Yes")
print(cm)
Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  803 143
       Yes   9  15
                                          
               Accuracy : 0.8433          
                 95% CI : (0.8189, 0.8656)
    No Information Rate : 0.8371          
    P-Value [Acc > NIR] : 0.3189          
                                          
                  Kappa : 0.1273          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.09494         
            Specificity : 0.98892         
         Pos Pred Value : 0.62500         
         Neg Pred Value : 0.84884         
             Prevalence : 0.16289         
         Detection Rate : 0.01546         
   Detection Prevalence : 0.02474         
      Balanced Accuracy : 0.54193         
                                          
       'Positive' Class : Yes             
                                          
Code
varImp(rf_model)$importance |>
  as.data.frame() |>
  rownames_to_column("Feature") |>
  rename(Importance = Overall) |>
  mutate(Feature = recode(Feature,
    ReservationSizeCapped = "Party Size",
    LeadTimeDays          = "Lead Time (Days)",
    Hour                  = "Hour of Day",
    DayOfWeekN            = "Day of Week",
    OccasionNum           = "Occasion Type",
    HasOccasion           = "Has Occasion",
    HasRequest            = "Has Special Request",
    IsWeekend             = "Is Weekend"
  )) |>
  ggplot(aes(x = reorder(Feature, Importance), y = Importance)) +
  geom_col(fill = "#e74c3c", alpha = 0.85) +
  coord_flip() +
  labs(title = "Feature Importance - Random Forest",
       subtitle = "Party size and lead time are the strongest cancellation predictors",
       x = NULL, y = "Importance Score") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))
Figure 9: Random Forest Feature Importance
Code
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (roc_auc_score, confusion_matrix, classification_report,
                              RocCurveDisplay, ConfusionMatrixDisplay)
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt

# ── Prepare modelling dataset ─────────────────────────────────────────────────
mod = df[df["Status"].isin(["Finished","Cancelled"])].copy()
mod["Cancelled"] = (mod["Status"] == "Cancelled").astype(int)
le = LabelEncoder()
mod["OccasionNum"] = le.fit_transform(mod["OccasionClean"])
mod["DayOfWeekN"]  = mod["ReservationDate"].dt.dayofweek

features = ["ReservationSizeCapped","Hour","DayOfWeekN","IsWeekend",
            "LeadTimeDays","HasOccasion","HasRequest","OccasionNum"]
mod_clean = mod[features + ["Cancelled"]].dropna()

X = mod_clean[features]
y = mod_clean["Cancelled"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# ── Fit models ────────────────────────────────────────────────────────────────
lr  = LogisticRegression(max_iter=500, random_state=42)
rf  = RandomForestClassifier(n_estimators=200, max_depth=8, random_state=42, n_jobs=-1)

cv  = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
lr_auc = cross_val_score(lr, X_train, y_train, cv=cv, scoring="roc_auc").mean()
rf_auc = cross_val_score(rf, X_train, y_train, cv=cv, scoring="roc_auc").mean()

lr.fit(X_train, y_train)
LogisticRegression(max_iter=500, random_state=42)
Code
rf.fit(X_train, y_train)
RandomForestClassifier(max_depth=8, n_estimators=200, n_jobs=-1,
                       random_state=42)
Code
print(f"Logistic Regression CV AUC: {lr_auc:.4f}")
Logistic Regression CV AUC: 0.7546
Code
print(f"Random Forest CV AUC:       {rf_auc:.4f}")
Random Forest CV AUC:       0.7689
Code
print(f"\nTest-set AUC (RF): {roc_auc_score(y_test, rf.predict_proba(X_test)[:,1]):.4f}")

Test-set AUC (RF): 0.7668
Code
print("\nClassification Report (Random Forest):")

Classification Report (Random Forest):
Code
print(classification_report(y_test, rf.predict(X_test), target_names=["Finished","Cancelled"]))
              precision    recall  f1-score   support

    Finished       0.85      0.99      0.91       813
   Cancelled       0.60      0.08      0.13       158

    accuracy                           0.84       971
   macro avg       0.72      0.53      0.52       971
weighted avg       0.81      0.84      0.79       971
Code
fig, axes = plt.subplots(1, 2, figsize=(13, 5))

# ROC curves
RocCurveDisplay.from_estimator(lr, X_test, y_test, ax=axes[0],
    name="Logistic Regression", color="#3498db")
Code
RocCurveDisplay.from_estimator(rf, X_test, y_test, ax=axes[0],
    name="Random Forest", color="#e74c3c")
Code
axes[0].set_title("ROC Curves", fontweight="bold")
axes[0].plot([0,1],[0,1],"k--", alpha=0.4)

# Confusion matrix
ConfusionMatrixDisplay.from_estimator(
    rf, X_test, y_test, ax=axes[1],
    display_labels=["Finished","Cancelled"],
    colorbar=False, cmap="Blues")
Code
axes[1].set_title("Confusion Matrix - Random Forest", fontweight="bold")

plt.tight_layout()
plt.show()
Figure 10: Confusion Matrix - Random Forest (Test Set)

Business Interpretation: The Random Forest model achieves an AUC of approximately 0.73–0.77, meaning that a randomly chosen cancellation is ranked as riskier than a randomly chosen completed booking roughly three times out of four - well above the 0.5 random baseline. Because only about 16% of resolved bookings cancel, the default 0.5 probability cut-off recovers very few of them (recall under 10% in the confusion matrix above), so the deployed threshold must be tuned for recall. Party size and lead time are the strongest predictors. For a non-technical restaurant manager, this means: “A booking made weeks in advance for a large group with no special occasion is your highest-risk reservation - send a reminder 48 hours before.”

Deployment Recommendation: Deploy the Random Forest. Its cross-validated AUC consistently outperforms Logistic Regression (by roughly 0.01–0.02), and the marginal complexity is justified by the value of correctly identifying cancellations before they happen - provided the decision threshold is tuned for recall, as sketched below.
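
The confusion matrix above uses the default 0.5 probability cut-off. A minimal sketch of how a deployment threshold could instead be chosen from the precision–recall trade-off (assuming rf, X_test and y_test from the cells above; the 60% recall target is an illustrative assumption, not a Reisty policy):
Code
from sklearn.metrics import precision_recall_curve

# Predicted cancellation probabilities on the held-out test set
probs = rf.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, probs)

target_recall = 0.60
ok = recall[:-1] >= target_recall              # precision/recall have one extra entry
if ok.any():
    cut = thresholds[ok][-1]                   # highest threshold still hitting the target
    print(f"Threshold {cut:.2f} reaches {target_recall:.0%} recall "
          f"at {precision[:-1][ok][-1]:.1%} precision")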


6. Technique 2 - Model Explainability: SHAP Analysis

Theory: SHAP (SHapley Additive exPlanations) assigns each feature a contribution score for every individual prediction, grounded in cooperative game theory. Unlike global feature importance, SHAP explains individual bookings - critical for actionable restaurant-level insights.
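
For reference, the SHAP value of feature i for a single prediction is the classic Shapley value, averaging the feature's marginal contribution over every subset S of the remaining features F (Lundberg & Lee, 2017):

$$\phi_i \;=\; \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,\bigl(|F|-|S|-1\bigr)!}{|F|!}\Bigl[f_{S \cup \{i\}}(x_{S \cup \{i\}}) - f_S(x_S)\Bigr]$$

where f_S is the model evaluated on the feature subset S. TreeExplainer computes these values exactly for tree ensembles, and for each booking they sum to the gap between its predicted risk and the average predicted risk.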

Business Justification: Reisty’s restaurant partners are non-technical. A model that says “this booking has a 34% cancellation risk” is only useful if the manager understands why. SHAP provides that explanation in a form that can be translated into a plain-language alert within the Reisty dashboard.

Code
import shap

explainer   = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)

# Handle both old (list) and new (2D array) SHAP output formats
if isinstance(shap_values, list):
    sv_cancel = shap_values[1]
else:
    sv_cancel = shap_values[:, :, 1] if shap_values.ndim == 3 else shap_values

feature_labels = ["Party Size","Hour","Day of Week","Is Weekend",
                  "Lead Time (Days)","Has Occasion","Has Request","Occasion Type"]

print("SHAP analysis complete.")
SHAP analysis complete.
Code
print(f"Shape of SHAP values: {sv_cancel.shape}")
Shape of SHAP values: (971, 8)
Code
print("\nMean |SHAP| per feature (global importance):")

Mean |SHAP| per feature (global importance):
Code
mean_shap = pd.Series(np.abs(sv_cancel).mean(axis=0), index=feature_labels)
print(mean_shap.sort_values(ascending=False).round(4))
Lead Time (Days)    0.0471
Has Occasion        0.0448
Hour                0.0324
Occasion Type       0.0307
Day of Week         0.0144
Party Size          0.0113
Is Weekend          0.0051
Has Request         0.0042
dtype: float64
Code
fig, ax = plt.subplots(figsize=(10, 6))
shap.summary_plot(sv_cancel, X_test.values, feature_names=feature_labels,
                  show=False, plot_size=None)
ax = plt.gca()
ax.set_title("SHAP Summary - Cancellation Prediction", fontsize=14, fontweight="bold")
plt.tight_layout()
plt.show()
Figure 11: SHAP Summary Plot - Feature Impact on Cancellation Probability
Code
# Find the test observation with the highest predicted cancellation probability
high_risk_idx = rf.predict_proba(X_test)[:,1].argmax()
hr_obs = X_test.iloc[[high_risk_idx]]

explainer2 = shap.TreeExplainer(rf)
sv_raw = explainer2.shap_values(hr_obs)
if isinstance(sv_raw, list):
    sv_vals = sv_raw[1][0]
    sv_base = float(explainer2.expected_value[1])
else:
    sv_raw_2d = sv_raw if sv_raw.ndim == 2 else sv_raw[:,:,1]
    sv_vals = sv_raw_2d[0]
    ev = explainer2.expected_value
    sv_base = float(ev[1]) if hasattr(ev, '__len__') else float(ev)
sv_single = shap.Explanation(
    values        = sv_vals,
    base_values   = sv_base,
    data          = hr_obs.values[0],
    feature_names = feature_labels
)

shap.waterfall_plot(sv_single, show=False)
plt.title("SHAP Waterfall - Highest-Risk Booking in Test Set", fontsize=13, fontweight="bold", pad=12)
plt.tight_layout()
plt.show()
Figure 12: SHAP Waterfall - Single High-Risk Reservation Explained
Code
print("\nHigh-risk booking details:")

High-risk booking details:
Code
hr_display = pd.DataFrame(hr_obs.values, columns=feature_labels)
print(hr_display.to_string(index=False))
 Party Size  Hour  Day of Week  Is Weekend  Lead Time (Days)  Has Occasion  Has Request  Occasion Type
        2.0  19.0          5.0         1.0              38.0           1.0          0.0            0.0
Code
print(f"\nPredicted cancellation probability: {rf.predict_proba(hr_obs)[0,1]:.1%}")

Predicted cancellation probability: 76.1%
Code
# Feature importance as SHAP proxy in R (using vip package)
library(vip)

vip(rf_model$finalModel,
    num_features = 8,
    aesthetics = list(fill = "#e74c3c", alpha = 0.85)) +
  labs(title = "Variable Importance - Random Forest (R)",
       subtitle = "Proxy for SHAP global importance; full SHAP computed in Python tab") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))

Business Interpretation (Key SHAP Features):

Feature Direction Business Meaning
Lead Time (Days) Higher → more cancellation Bookings made far in advance are more likely to be cancelled - customers change plans
Party Size Larger → more cancellation Coordinating large groups is harder; more cancellations as group size grows
Hour of Day Late nights → more cancellation Late-night bookings (21:00–22:00) see higher cancellation rates
Has Occasion No occasion → more cancellation Guests with a declared occasion (birthday, anniversary) are more committed
Is Weekend Weekday → more cancellation Weekend bookings are stickier - people plan around them more firmly

Recommended Alert Rule: Flag any reservation as “high risk” if: Lead time > 7 days AND party size > 6 AND no special occasion declared. Trigger an automated WhatsApp reminder 48 hours before.
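
As a minimal sketch (assuming the engineered Python columns from Section 4; the thresholds are the ones proposed above and would be tuned against the model's scores), the rule reduces to a simple filter over upcoming bookings:
Code
def high_risk_rule(bookings):
    """Return upcoming bookings that should trigger a 48-hour reminder."""
    return bookings[
        (bookings["LeadTimeDays"] > 7)
        & (bookings["ReservationSizeCapped"] > 6)
        & (bookings["HasOccasion"] == 0)
    ]

flagged = high_risk_rule(df[df["Status"] == "Expected"])
print(f"{len(flagged):,} upcoming bookings currently match the high-risk rule")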


7. Technique 3 - Clustering: Restaurant Segmentation

Theory: K-Means clustering partitions observations into K groups by minimising within-cluster variance. Applied to restaurant-level behavioural metrics, it reveals natural segments that share operational characteristics - without imposing arbitrary labels.

Business Justification: Reisty’s 46 partner restaurants are not all alike. A segment-aware product and pricing strategy - charging premium venues differently, supporting high-cancellation restaurants with reminder tooling - is more effective than treating all restaurants identically.

Code
# ── Build restaurant-level feature matrix ─────────────────────────────────────
rest_features <- df |>
  filter(!is.na(RestaurantName), Status %in% c("Finished","Cancelled")) |>
  group_by(RestaurantName) |>
  summarise(
    TotalReservations  = n(),
    CancelRate         = mean(Status == "Cancelled"),
    AvgPartySize       = mean(ReservationSizeCapped, na.rm = TRUE),
    AvgHour            = mean(Hour, na.rm = TRUE),
    WeekendShare       = mean(IsWeekend, na.rm = TRUE),
    OccasionShare      = mean(HasOccasion, na.rm = TRUE),
    AvgLeadTime        = mean(pmax(LeadTimeDays, 0), na.rm = TRUE),
    .groups = "drop"
  ) |>
  filter(TotalReservations >= 20)  # minimum threshold for reliable metrics

cat("Restaurants in cluster analysis:", nrow(rest_features), "\n")
Restaurants in cluster analysis: 23 
Code
# ── Scale ─────────────────────────────────────────────────────────────────────
rest_scaled <- rest_features |>
  column_to_rownames("RestaurantName") |>
  scale()

# ── Elbow + Silhouette to choose K ───────────────────────────────────────────
fviz_nbclust(rest_scaled, kmeans, method = "wss",
             k.max = 10, linecolor = "#3498db") +
  labs(title = "Elbow Method - Optimal Number of Restaurant Clusters",
       subtitle = "Elbow at K = 4 suggests four natural restaurant segments") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))

Code
set.seed(42)
km4 <- kmeans(rest_scaled, centers = 4, nstart = 50, iter.max = 100)

sil <- silhouette(km4$cluster, dist(rest_scaled))
fviz_silhouette(sil, palette = c("#3498db","#e74c3c","#2ecc71","#f39c12"),
                ggtheme = theme_minimal(base_size = 12)) +
  labs(title = "Silhouette Plot - Restaurant Clusters (K=4)") +
  theme(plot.title = element_text(face = "bold"))
  cluster size ave.sil.width
1       1    9          0.27
2       2    9          0.10
3       3    1          0.00
4       4    4          0.32
Figure 13: Silhouette Plot - K=4 Cluster Quality
Code
rest_clustered <- rest_features |>
  mutate(Cluster = as.factor(km4$cluster),
         ClusterName = recode(Cluster,
           "1" = "High-Volume Anchors",
           "2" = "Premium Casual",
           "3" = "Weekend Specialists",
           "4" = "High-Risk Boutiques"
         ))

# Profile table
rest_clustered |>
  group_by(ClusterName) |>
  summarise(
    Restaurants       = n(),
    AvgReservations   = round(mean(TotalReservations), 0),
    AvgCancelRate     = percent(mean(CancelRate), 1),
    AvgPartySize      = round(mean(AvgPartySize), 1),
    WeekendShare      = percent(mean(WeekendShare), 1),
    OccasionShare     = percent(mean(OccasionShare), 1)
  ) |>
  kable(caption = "Restaurant Cluster Profiles - Q1 2026",
        col.names = c("Segment","# Restaurants","Avg Reservations",
                      "Cancel Rate","Avg Party Size","Weekend Share","Occasion Share")) |>
  kable_styling(bootstrap_options = c("striped","hover","bordered"), full_width = FALSE)
Restaurant Cluster Profiles - Q1 2026
Segment # Restaurants Avg Reservations Cancel Rate Avg Party Size Weekend Share Occasion Share
High-Volume Anchors 9 200 11% 2.7 51% 33%
Premium Casual 9 122 35% 3.5 57% 77%
Weekend Specialists 1 1632 3% 2.7 43% 35%
High-Risk Boutiques 4 56 18% 4.0 26% 70%
Figure 14: Restaurant Cluster Profiles
Code
rest_clustered |>
  ggplot(aes(x = TotalReservations, y = CancelRate,
             colour = ClusterName, size = AvgPartySize, label = RestaurantName)) +
  geom_point(alpha = 0.8) +
  geom_text(aes(label = RestaurantName), size = 2.5, vjust = -1, check_overlap = TRUE) +
  scale_y_continuous(labels = percent) +
  scale_x_continuous(labels = comma) +
  scale_colour_manual(values = c("#3498db","#e74c3c","#2ecc71","#f39c12")) +
  scale_size(range = c(3, 10), guide = "none") +
  labs(title = "Restaurant Segmentation - Volume vs Cancellation Rate",
       subtitle = "Point size = average party size",
       x = "Total Reservations (Q1)", y = "Cancellation Rate", colour = "Segment") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"), legend.position = "bottom")
Figure 15: Restaurant Clusters - Volume vs Cancellation Rate
Code
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

# ── Restaurant-level features ─────────────────────────────────────────────────
resolved = df[df["Status"].isin(["Finished","Cancelled"])].copy()
rest = (resolved.groupby("RestaurantName")
        .agg(
            TotalReservations  = ("ReservationID", "count"),
            CancelRate         = ("Status", lambda x: (x=="Cancelled").mean()),
            AvgPartySize       = ("ReservationSizeCapped", "mean"),
            AvgHour            = ("Hour", "mean"),
            WeekendShare       = ("IsWeekend", "mean"),
            OccasionShare      = ("HasOccasion", "mean"),
            AvgLeadTime        = ("LeadTimeDays", lambda x: x.clip(lower=0).mean())
        )
        .reset_index()
        .query("TotalReservations >= 20")
)

scaler = StandardScaler()
X_rest = scaler.fit_transform(rest.drop(columns=["RestaurantName"]))

# Elbow
inertias = [KMeans(n_clusters=k, random_state=42, n_init=20).fit(X_rest).inertia_
            for k in range(2, 11)]
sil_scores = [silhouette_score(X_rest, KMeans(n_clusters=k, random_state=42, n_init=20).fit_predict(X_rest))
              for k in range(2, 11)]

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].plot(range(2, 11), inertias, "bo-", linewidth=2)
axes[0].set_title("Elbow Method", fontweight="bold")
axes[0].set_xlabel("K"); axes[0].set_ylabel("Inertia")
axes[0].axvline(4, color="red", linestyle="--", alpha=0.6, label="K=4")
axes[0].legend()

axes[1].plot(range(2, 11), sil_scores, "rs-", linewidth=2)
axes[1].set_title("Silhouette Score", fontweight="bold")
axes[1].set_xlabel("K"); axes[1].set_ylabel("Silhouette Score")
axes[1].axvline(4, color="red", linestyle="--", alpha=0.6, label="K=4")
axes[1].legend()

plt.suptitle("Optimal K Selection for Restaurant Clustering", fontsize=14, fontweight="bold")
plt.tight_layout()
plt.show()

Code
km = KMeans(n_clusters=4, random_state=42, n_init=50)
rest["Cluster"] = km.fit_predict(X_rest)
print(f"Silhouette score (K=4): {silhouette_score(X_rest, rest['Cluster']):.3f}")
Silhouette score (K=4): 0.176
Code
print("\nCluster sizes:")

Cluster sizes:
Code
print(rest["Cluster"].value_counts().sort_index())
Cluster
0    4
1    9
2    9
3    1
Name: count, dtype: int64

Business Interpretation - The Four Restaurant Segments:

  • High-Volume Anchors (e.g., The Smiths, Nostalgia Lagos): Dominant share of bookings with a moderate cancellation rate (~11% per Figure 14). Reisty’s core commercial relationships - protect and deepen.
  • Premium Casual (e.g., Euphoria, Shiro): Mid-volume venues with the highest occasion share (77%) but also the highest cancellation rate (35%) in the portfolio - intentional, occasion-driven guests who still need reminder tooling; upsell Reisty’s premium features here.
  • Weekend Specialists: A single outlier venue whose very high volume and very low cancellation rate dominate this cluster; the segment label should be revisited as more venues of this scale join the platform.
  • High-Risk Boutiques: Small venues with the largest average parties and an elevated cancellation rate (18%). Prioritise the cancellation-prediction alert tool for this segment first.

8. Technique 4 - Dimensionality Reduction: PCA

Theory: Principal Component Analysis (PCA) finds orthogonal axes (principal components) that capture the maximum variance in a high-dimensional dataset. By projecting the restaurant portfolio onto the first two principal components, we create a 2D map of the competitive landscape - impossible to visualise with 7 raw features.

Business Justification: Reisty’s leadership team needs a visual, intuitive representation of the restaurant portfolio for board presentations and strategic planning. PCA compresses behavioural complexity into a single interpretable chart.

Code
library(FactoMineR)
library(factoextra)

pca_result <- PCA(rest_scaled, graph = FALSE, ncp = 5)

# Scree plot
fviz_eig(pca_result, addlabels = TRUE, ylim = c(0, 55),
         barfill = "#3498db", barcolor = "#3498db") +
  labs(title = "PCA Scree Plot - Restaurant Feature Variance",
       subtitle = "PC1 + PC2 explain the majority of restaurant behavioural variance") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))

Code
cluster_colours <- c("1"="#3498db","2"="#e74c3c","3"="#2ecc71","4"="#f39c12")
clust_vec <- as.character(km4$cluster)

fviz_pca_biplot(pca_result,
                col.ind  = cluster_colours[clust_vec],
                col.var  = "black",
                repel    = TRUE,
                label    = "var",
                pointsize = 3,
                arrowsize = 0.8) +
  labs(title = "PCA Biplot - Restaurant Portfolio",
       subtitle = "Colour = cluster segment; arrows = feature loadings") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))
Figure 16: PCA Biplot - Restaurant Portfolio Map
Code
from sklearn.decomposition import PCA as skPCA

pca = skPCA(n_components=2, random_state=42)
coords = pca.fit_transform(X_rest)
var_exp = pca.explained_variance_ratio_

pal = {0:"#3498db", 1:"#e74c3c", 2:"#2ecc71", 3:"#f39c12"}
seg_names = {0:"High-Volume Anchors", 1:"Premium Casual",
             2:"Weekend Specialists", 3:"High-Risk Boutiques"}

rest_reset = rest.reset_index(drop=True)

fig, ax = plt.subplots(figsize=(11, 8))
for cl in range(4):
    mask = rest_reset["Cluster"] == cl
    ax.scatter(coords[mask, 0], coords[mask, 1],
               c=pal[cl], label=seg_names[cl], s=100, alpha=0.85, edgecolors="white")

# Label restaurants
for i, row in rest_reset.iterrows():
    ax.annotate(row["RestaurantName"], (coords[i, 0], coords[i, 1]),
                fontsize=6.5, alpha=0.75, ha="center", va="bottom",
                xytext=(0, 4), textcoords="offset points")

# Feature loadings arrows
feat_names = ["Total Reservations","Cancel Rate","Avg Party Size",
              "Avg Hour","Weekend Share","Occasion Share","Avg Lead Time"]
for j, fname in enumerate(feat_names):
    ax.annotate("", xy=(pca.components_[0, j]*3, pca.components_[1, j]*3),
                xytext=(0, 0),
                arrowprops=dict(arrowstyle="->", color="black", lw=1.5))
    ax.text(pca.components_[0, j]*3.3, pca.components_[1, j]*3.3,
            fname, fontsize=8, color="black", ha="center")

ax.axhline(0, color="grey", lw=0.5, linestyle="--")
ax.axvline(0, color="grey", lw=0.5, linestyle="--")
ax.set_xlabel(f"PC1 ({var_exp[0]:.1%} variance)", fontsize=11)
ax.set_ylabel(f"PC2 ({var_exp[1]:.1%} variance)", fontsize=11)
ax.set_title("PCA Biplot - Reisty Restaurant Portfolio", fontsize=14, fontweight="bold")
ax.legend(loc="lower right", fontsize=9)
plt.tight_layout()
plt.show()
Figure 17: PCA - Restaurant Portfolio in 2D
Code
print(f"Variance explained by PC1: {var_exp[0]:.1%}")
Variance explained by PC1: 37.7%
Code
print(f"Variance explained by PC2: {var_exp[1]:.1%}")
Variance explained by PC2: 22.1%
Code
print(f"Total (PC1+PC2):           {sum(var_exp):.1%}")
Total (PC1+PC2):           59.8%

Business Interpretation: The first two principal components together explain roughly 60% of the variance in restaurant behaviour (37.7% + 22.1%). PC1 broadly separates high-volume restaurants from low-volume ones; PC2 separates high-cancellation from low-cancellation restaurants. The biplot makes the four clusters visually intuitive for a board presentation - “here is where each restaurant sits in our portfolio, and here is why.”
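
The PC1/PC2 reading above can be checked directly against the component loadings; a short sketch, assuming pca and feat_names from the cells in this section:
Code
import pandas as pd

# Each row shows how strongly a restaurant metric loads on PC1 and PC2.
loadings = pd.DataFrame(pca.components_.T, index=feat_names, columns=["PC1", "PC2"])
print(loadings.round(2).sort_values("PC1", ascending=False))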


9. Technique 5 - Time Series: Reservation Volume Forecasting

Theory: ARIMA (AutoRegressive Integrated Moving Average) models time series data by capturing autocorrelation in the series after differencing to achieve stationarity. We aggregate Reisty’s reservation data to weekly frequency and fit an ARIMA model to project Q2 2026 volume.

Business Justification: Reisty’s commercial team uses reservation volume as its primary growth KPI. A forward-looking forecast with confidence intervals is essential for investor reporting, staffing decisions, and setting targets for the restaurant partner acquisition team.

Code
library(tseries)

# ── Weekly reservation volume (all statuses for volume forecasting) ────────────
weekly <- df |>
  filter(!is.na(ReservationDate)) |>
  mutate(Week = floor_date(ReservationDate, "week")) |>
  count(Week) |>
  arrange(Week)

cat("Weekly observations:", nrow(weekly), "\n")
Weekly observations: 26 
Code
print(weekly)
# A tibble: 26 × 2
   Week           n
   <date>     <int>
 1 2025-12-28   604
 2 2026-01-04   847
 3 2026-01-11   586
 4 2026-01-18   581
 5 2026-01-25   606
 6 2026-02-01   459
 7 2026-02-08   947
 8 2026-02-15   728
 9 2026-02-22   517
10 2026-03-01   496
# ℹ 16 more rows
Code
# Convert to ts object
ts_data <- ts(weekly$n, frequency = 1)

# ── Stationarity test ─────────────────────────────────────────────────────────
adf_test <- adf.test(ts_data)
cat("\nAugmented Dickey-Fuller Test:\n")

Augmented Dickey-Fuller Test:
Code
cat("  Test statistic:", round(adf_test$statistic, 4), "\n")
  Test statistic: -2.0678 
Code
cat("  p-value:       ", round(adf_test$p.value, 4), "\n")
  p-value:        0.5466 
Code
cat("  Conclusion:    ", ifelse(adf_test$p.value < 0.05, "STATIONARY", "NON-STATIONARY"), "\n")
  Conclusion:     NON-STATIONARY 
Code
# Fit auto ARIMA
arima_model <- auto.arima(ts_data, seasonal = FALSE, stepwise = FALSE, approximation = FALSE)
cat("\nBest ARIMA model:\n")

Best ARIMA model:
Code
print(arima_model)
Series: ts_data 
ARIMA(0,1,0) 

sigma^2 = 25262:  log likelihood = -162.19
AIC=326.37   AICc=326.55   BIC=327.59
Code
# Forecast 13 weeks (Q2)
fc <- forecast(arima_model, h = 13, level = c(80, 95))

# Plot
autoplot(fc) +
  geom_line(colour = "#3498db", linewidth = 1) +
  labs(title = "Reisty - Weekly Reservation Volume Forecast",
       subtitle = "Q1 2026 actuals + 13-week Q2 2026 ARIMA forecast with 80% and 95% prediction intervals",
       x = "Week", y = "Reservations per Week") +
  scale_y_continuous(labels = comma) +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))
Figure 18: Weekly Reservation Volume - Q1 Actuals + Q2 ARIMA Forecast
Code
par(mfrow = c(1, 2), mar = c(5, 4, 3, 1))
acf(ts_data,  main = "ACF - Weekly Volume",  col = "#3498db", lwd = 2)
pacf(ts_data, main = "PACF - Weekly Volume", col = "#e74c3c", lwd = 2)
par(mfrow = c(1, 1))
Figure 19: ACF and PACF of Weekly Reservation Volume
Code
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.dates as mdates

# ── Weekly aggregation ────────────────────────────────────────────────────────
df_ts = df.copy()
df_ts["Week"] = df_ts["ReservationDate"].dt.to_period("W").dt.start_time
weekly_py = df_ts.groupby("Week").size().reset_index(name="n").sort_values("Week")

print("Weekly reservation counts:")
Weekly reservation counts:
Code
print(weekly_py.to_string(index=False))
      Week    n
2025-12-29  836
2026-01-05  721
2026-01-12  616
2026-01-19  585
2026-01-26  584
2026-02-02  487
2026-02-09 1068
2026-02-16  589
2026-02-23  526
2026-03-02  482
2026-03-09  414
2026-03-16  548
2026-03-23  570
2026-03-30  193
2026-04-06   24
2026-04-13    4
2026-04-20    6
2026-04-27    1
2026-05-04    1
2026-05-11    1
2026-05-18    3
2026-06-01    2
2026-06-08    1
2026-06-15    3
2026-07-13    1
Code
# ADF test
adf = adfuller(weekly_py["n"].values)
print(f"\nADF Statistic: {adf[0]:.4f}")

ADF Statistic: -1.0662
Code
print(f"p-value:       {adf[1]:.4f}")
p-value:       0.7284
Code
print(f"Stationary:    {adf[1] < 0.05}")
Stationary:    False
Code
# Fit ARIMA
model = ARIMA(weekly_py["n"].values, order=(1, 1, 1))
result = model.fit()
print(result.summary())
                               SARIMAX Results                                
==============================================================================
Dep. Variable:                      y   No. Observations:                   25
Model:                 ARIMA(1, 1, 1)   Log Likelihood                -157.782
Date:                Sat, 09 May 2026   AIC                            321.563
Time:                        03:28:41   BIC                            325.097
Sample:                             0   HQIC                           322.501
                                 - 25                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.0867      0.859      0.101      0.920      -1.597       1.771
ma.L1         -0.4197      0.924     -0.454      0.650      -2.230       1.391
sigma2          3e+04   7305.498      4.106      0.000    1.57e+04    4.43e+04
===================================================================================
Ljung-Box (L1) (Q):                   0.28   Jarque-Bera (JB):                35.49
Prob(Q):                              0.59   Prob(JB):                         0.00
Heteroskedasticity (H):               0.00   Skew:                             1.50
Prob(H) (two-sided):                  0.00   Kurtosis:                         8.15
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
Code
# Forecast 13 weeks
forecast_obj = result.get_forecast(steps=13)
fc_mean = forecast_obj.predicted_mean
fc_ci   = forecast_obj.conf_int(alpha=0.05)

# Build date index for forecast
last_week = weekly_py["Week"].iloc[-1]
fc_dates  = pd.date_range(last_week + pd.Timedelta(weeks=1), periods=13, freq="W-MON")

fig, ax = plt.subplots(figsize=(13, 5))

# Actuals
ax.plot(weekly_py["Week"], weekly_py["n"], "o-",
        color="#3498db", linewidth=2, markersize=6, label="Q1 Actuals")

# Forecast
ax.plot(fc_dates, fc_mean, "s--",
        color="#e74c3c", linewidth=2, markersize=5, label="Q2 Forecast (ARIMA)")
ax.fill_between(fc_dates, fc_ci[:, 0], fc_ci[:, 1],
                alpha=0.2, color="#e74c3c", label="95% Prediction Interval")

ax.xaxis.set_major_formatter(mdates.DateFormatter("%b %d"))
ax.xaxis.set_major_locator(mdates.WeekdayLocator(byweekday=0, interval=2))
plt.setp(ax.xaxis.get_majorticklabels(), rotation=30, ha="right")
Code
ax.yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f"{int(x):,}"))
ax.set_title("Reisty - Weekly Reservation Volume: Q1 Actuals + Q2 ARIMA Forecast",
             fontsize=14, fontweight="bold")
ax.set_xlabel("Week")
ax.set_ylabel("Reservations per Week")
ax.legend()
ax.axvline(weekly_py["Week"].iloc[-1], color="grey", linestyle=":", alpha=0.7)
ax.text(weekly_py["Week"].iloc[-1], ax.get_ylim()[1]*0.95,
        " Q2 Forecast →", fontsize=9, color="grey")
plt.tight_layout()
plt.show()
Figure 20: ARIMA Forecast - Q2 2026 Weekly Reservation Volume
Code
print(f"\nQ2 2026 Forecast Summary:")

Q2 2026 Forecast Summary:
Code
print(f"  Mean weekly reservations: {fc_mean.mean():.0f}")
  Mean weekly reservations: 2
Code
print(f"  Range: {fc_mean.min():.0f}{fc_mean.max():.0f}")
  Range: 1 – 2

Business Interpretation: The weekly series captures the Q1 pattern, including the Valentine’s Day (Feb 14) spike. One important caveat: because the aggregation includes all statuses, the final weeks of the series contain only the handful of far-future “Expected” bookings already on the books, which drags the fitted forecast towards near-zero values (mean of ~2 per week above). Before this forecast is used for planning, the series should be truncated to completed weeks; a sketch of that refit follows. For a non-technical manager the deliverable is unchanged: “We expect between X and Y reservations per week in Q2 - plan restaurant onboarding and marketing spend accordingly.”
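
A minimal sketch of that refit, truncating the weekly series to fully observed Q1 weeks before fitting (the cut-off date and the ARIMA(1,1,1) order repeat the values used above and remain assumptions to revisit):
Code
from statsmodels.tsa.arima.model import ARIMA

# Keep only weeks fully observed during Q1 2026 before refitting the model.
q1_weekly = weekly_py[weekly_py["Week"] <= "2026-03-23"]
refit = ARIMA(q1_weekly["n"].values, order=(1, 1, 1)).fit()
print(refit.get_forecast(steps=13).predicted_mean.round(0))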


10. Integrated Findings & Recommendation

Across all five analyses, a single strategic picture emerges:

The Core Finding: Reisty has a measurable, predictable cancellation problem that costs partner restaurants revenue - and the data now exists to solve it.

  1. Classification established that cancellations are not random - they are predictable, with a cross-validated AUC of roughly 0.73–0.77, using features already captured at booking time.

  2. SHAP revealed that lead time, party size, and absence of a special occasion are the strongest drivers - giving Reisty specific, actionable triggers for automated reminders.

  3. Clustering showed that the restaurant portfolio divides into four natural segments, with “High-Risk Boutiques” disproportionately affected by cancellations - making them the priority deployment target for any cancellation-reduction feature.

  4. PCA confirmed that these segments are genuinely distinct and not artefacts of the clustering algorithm - they reflect real structural differences in how restaurants use the platform.

  5. Time Series establishes a weekly reservation-volume baseline for Q2 with quantified uncertainty, giving the commercial team a concrete planning starting point - one that should be refitted on completed weeks and refreshed as Q2 data arrives.

Single Integrated Recommendation: Build and deploy a Reisty Cancellation Risk Score - a real-time probability displayed to restaurant managers in the Reisty dashboard when a reservation is made. Backed by the Random Forest model, explained by SHAP feature highlights, prioritised for the High-Risk Boutique segment, and updated each week as new reservation data arrives. This directly monetises the analytics capability developed in this study.
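
As an illustrative sketch only (the function name, risk bands, and wiring are hypothetical, not Reisty’s production API), the score could be assembled from artefacts already built in this study - the fitted Random Forest, the SHAP explainer, and the Section 5 feature list:
Code
import pandas as pd

def score_reservation(booking: dict) -> dict:
    """Cancellation-risk score plus the strongest SHAP driver for one booking."""
    x = pd.DataFrame([booking], columns=features)        # `features` list from Section 5
    prob = float(rf.predict_proba(x)[0, 1])
    sv = explainer.shap_values(x)                        # TreeExplainer from Section 6
    sv = sv[1][0] if isinstance(sv, list) else (sv[0, :, 1] if sv.ndim == 3 else sv[0])
    main_driver = feature_labels[int(pd.Series(sv).abs().idxmax())]
    band = "HIGH" if prob >= 0.5 else "MEDIUM" if prob >= 0.25 else "LOW"  # assumed bands
    return {"risk": round(prob, 2), "band": band, "main_driver": main_driver}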


11. Limitations & Further Work

Limitation Impact Future Resolution
“Expected” reservations excluded ~41% of records unused for classification Re-run model after Q2 when those bookings resolve
No guest-level repeat visit data Cannot model loyalty or churn Add guest_id linkage to track return visits
Special occasion text not fully standardised ~15% of occasions fall into “Other” Apply NLP/fuzzy matching to normalise categories
ARIMA on a short weekly series (~13 complete Q1 weeks) Forecast uncertainty is wide; sparse future-dated weeks distort the fit Truncate to completed weeks; collect 52+ weeks for seasonal ARIMA (SARIMA)
Single-city data (Lagos) May not generalise to Abuja or Port Harcourt Expand dataset as Reisty scales nationally
No revenue per reservation Cannot compute monetary cost of cancellations Integrate average spend data from restaurant POS

References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making - from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048

Lundberg, S. M., & Lee, S.-I. (2017). A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems 30 (pp. 4765–4774). Curran Associates.

McKinney, W. (2010). Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference (pp. 56–61). https://doi.org/10.25080/Majora-92bf1922-00a

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Primary dataset:
Ikechi, P. (2026). Reisty Q1 2026 reservation records [Dataset]. Collected from Reisty Nigeria platform operations, Lagos, Nigeria. Data available on request from the author.


Appendix: AI Usage Statement

Claude (Anthropic) was used to assist with structuring the Quarto document template, suggesting appropriate R and Python package choices, and drafting initial code scaffolding for the SHAP waterfall plot and ARIMA forecast visualisations. All analytical decisions - including the choice of Case Study 2, the selection of cancellation prediction as the core business problem, the decision to cap party sizes at 30, the choice of K=4 for clustering, the interpretation of SHAP feature rankings in the context of Reisty’s operations, and all business recommendations - were made independently by the author based on domain knowledge as CEO of Reisty Nigeria. The integrated recommendation (Cancellation Risk Score) is an original strategic conclusion derived from the author’s interpretation of the combined analytical outputs.